* [PATCH mm-new v3 0/3] refactor and merge PTE scanning logic
From: Lance Yang @ 2025-10-08 4:37 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes
Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
baohua, ioworker0, richard.weiyang, linux-kernel, linux-mm
Hi all,
This series cleans up the almost-duplicated PTE scanning logic in the
collapse path.
The first one is a preparatory step that refactors both loops to use
a single if-else-if-else-if chain for checking disjoint PTEs.
The second one replaces VM_BUG_ON_FOLIO() with a more graceful
VM_WARN_ON_FOLIO() for handling non-anonymous folios.
The last one then extracts the common logic into a shared helper.
Thanks,
Lance
---
This series applies on top of patch[1], which is based on mm-new.
[1] https://lore.kernel.org/linux-mm/20251008032657.72406-1-lance.yang@linux.dev
v2 -> v3:
- #02 Collect Reviewed-by from Wei and Dev - thanks!
- #03 Use vm_normal_folio() and drop struct page altogether (per Dev)
- #03 Move the cc parameter to the end (per Dev)
- https://lore.kernel.org/linux-mm/20251006144338.96519-1-lance.yang@linux.dev
v1 -> v2:
- #01 Update the changelog (per Dev)
- #01 Collect Reviewed-by from Wei, Dev and Zi - thanks!
- #03 Make more of the scanning logic common between scan_pmd() and
_isolate() (per Dev)
- https://lore.kernel.org/linux-mm/20251002073255.14867-1-lance.yang@linux.dev
Lance Yang (3):
mm/khugepaged: optimize PTE scanning with if-else-if-else-if chain
mm/khugepaged: use VM_WARN_ON_FOLIO instead of VM_BUG_ON_FOLIO for
non-anon folios
mm/khugepaged: merge PTE scanning logic into a new helper
mm/khugepaged.c | 242 ++++++++++++++++++++++++++----------------------
1 file changed, 130 insertions(+), 112 deletions(-)
--
2.49.0
* [PATCH mm-new v3 1/3] mm/khugepaged: optimize PTE scanning with if-else-if-else-if chain
From: Lance Yang @ 2025-10-08 4:37 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes
Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
baohua, ioworker0, richard.weiyang, linux-kernel, linux-mm,
Lance Yang
From: Lance Yang <lance.yang@linux.dev>
As pointed out by Dev, the PTE checks for disjoint conditions in the
scanning loops can be optimized. is_swap_pte, (pte_none && is_zero_pfn),
and pte_uffd_wp are mutually exclusive.
This patch refactors the loops in both __collapse_huge_page_isolate() and
hpage_collapse_scan_pmd() to use a continuous if-else-if-else-if chain
instead of separate if blocks. While at it, the redundant pte_present()
check before is_zero_pfn() is also removed.
Also, this is a preparatory step to make it easier to merge the
almost-duplicated scanning logic in these two functions, as suggested
by David.
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Suggested-by: Dev Jain <dev.jain@arm.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
mm/khugepaged.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index bec3e268dc76..e3e27223137a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -548,8 +548,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
_pte++, addr += PAGE_SIZE) {
pte_t pteval = ptep_get(_pte);
- if (pte_none(pteval) || (pte_present(pteval) &&
- is_zero_pfn(pte_pfn(pteval)))) {
+ if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
++none_or_zero;
if (!userfaultfd_armed(vma) &&
(!cc->is_khugepaged ||
@@ -560,12 +559,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
goto out;
}
- }
- if (!pte_present(pteval)) {
+ } else if (!pte_present(pteval)) {
result = SCAN_PTE_NON_PRESENT;
goto out;
- }
- if (pte_uffd_wp(pteval)) {
+ } else if (pte_uffd_wp(pteval)) {
result = SCAN_PTE_UFFD_WP;
goto out;
}
@@ -1321,8 +1318,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
goto out_unmap;
}
- }
- if (pte_uffd_wp(pteval)) {
+ } else if (pte_uffd_wp(pteval)) {
/*
* Don't collapse the page if any of the small
* PTEs are armed with uffd write protection.
--
2.49.0
* [PATCH mm-new v3 2/3] mm/khugepaged: use VM_WARN_ON_FOLIO instead of VM_BUG_ON_FOLIO for non-anon folios
From: Lance Yang @ 2025-10-08 4:37 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes
Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
baohua, ioworker0, richard.weiyang, linux-kernel, linux-mm,
Lance Yang
From: Lance Yang <lance.yang@linux.dev>
As Zi pointed out, we should avoid crashing the kernel for conditions
that can be handled gracefully. Encountering a non-anonymous folio in an
anonymous VMA is a bug, but a warning is sufficient.
This patch changes the VM_BUG_ON_FOLIO(!folio_test_anon(folio)) to a
VM_WARN_ON_FOLIO() in both __collapse_huge_page_isolate() and
hpage_collapse_scan_pmd(), and then aborts the scan with SCAN_PAGE_ANON.
This also makes more of the scanning logic common between
hpage_collapse_scan_pmd() and __collapse_huge_page_isolate(), as suggested
by Dev.
Suggested-by: Dev Jain <dev.jain@arm.com>
Suggested-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
mm/khugepaged.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e3e27223137a..b5c0295c3414 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -573,7 +573,11 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
}
folio = page_folio(page);
- VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
+ if (!folio_test_anon(folio)) {
+ VM_WARN_ON_FOLIO(true, folio);
+ result = SCAN_PAGE_ANON;
+ goto out;
+ }
/* See hpage_collapse_scan_pmd(). */
if (folio_maybe_mapped_shared(folio)) {
@@ -1340,6 +1344,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
folio = page_folio(page);
if (!folio_test_anon(folio)) {
+ VM_WARN_ON_FOLIO(true, folio);
result = SCAN_PAGE_ANON;
goto out_unmap;
}
--
2.49.0
* [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper
From: Lance Yang @ 2025-10-08 4:37 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes
Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
baohua, ioworker0, richard.weiyang, linux-kernel, linux-mm,
Lance Yang
From: Lance Yang <lance.yang@linux.dev>
As David suggested, the PTE scanning logic in hpage_collapse_scan_pmd()
and __collapse_huge_page_isolate() was almost duplicated.
This patch cleans things up by moving all the common PTE checking logic
into a new shared helper, thp_collapse_check_pte(). While at it, we use
vm_normal_folio() instead of vm_normal_page().
Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
mm/khugepaged.c | 243 ++++++++++++++++++++++++++----------------------
1 file changed, 130 insertions(+), 113 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b5c0295c3414..7116caae1fa4 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -61,6 +61,12 @@ enum scan_result {
SCAN_PAGE_FILLED,
};
+enum pte_check_result {
+ PTE_CHECK_SUCCEED,
+ PTE_CHECK_CONTINUE,
+ PTE_CHECK_FAIL,
+};
+
#define CREATE_TRACE_POINTS
#include <trace/events/huge_memory.h>
@@ -533,62 +539,139 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
}
}
+/*
+ * thp_collapse_check_pte - Check if a PTE is suitable for THP collapse
+ * @pte: The PTE to check
+ * @vma: The VMA the PTE belongs to
+ * @addr: The virtual address corresponding to this PTE
+ * @foliop: On success, used to return a pointer to the folio
+ * Must be non-NULL
+ * @none_or_zero: Counter for none/zero PTEs. Must be non-NULL
+ * @unmapped: Counter for swap PTEs. Can be NULL if not scanning swaps
+ * @shared: Counter for shared pages. Must be non-NULL
+ * @scan_result: Used to return the failure reason (SCAN_*) on a
+ * PTE_CHECK_FAIL return. Must be non-NULL
+ * @cc: Collapse control settings
+ *
+ * Returns:
+ * PTE_CHECK_SUCCEED - PTE is suitable, proceed with further checks
+ * PTE_CHECK_CONTINUE - Skip this PTE and continue scanning
+ * PTE_CHECK_FAIL - Abort collapse scan
+ */
+static inline int thp_collapse_check_pte(pte_t pte, struct vm_area_struct *vma,
+ unsigned long addr, struct folio **foliop, int *none_or_zero,
+ int *unmapped, int *shared, int *scan_result,
+ struct collapse_control *cc)
+{
+ struct folio *folio = NULL;
+
+ if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {
+ (*none_or_zero)++;
+ if (!userfaultfd_armed(vma) &&
+ (!cc->is_khugepaged ||
+ *none_or_zero <= khugepaged_max_ptes_none)) {
+ return PTE_CHECK_CONTINUE;
+ } else {
+ *scan_result = SCAN_EXCEED_NONE_PTE;
+ count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+ return PTE_CHECK_FAIL;
+ }
+ } else if (!pte_present(pte)) {
+ if (!unmapped) {
+ *scan_result = SCAN_PTE_NON_PRESENT;
+ return PTE_CHECK_FAIL;
+ }
+
+ if (non_swap_entry(pte_to_swp_entry(pte))) {
+ *scan_result = SCAN_PTE_NON_PRESENT;
+ return PTE_CHECK_FAIL;
+ }
+
+ (*unmapped)++;
+ if (!cc->is_khugepaged ||
+ *unmapped <= khugepaged_max_ptes_swap) {
+ /*
+ * Always be strict with uffd-wp enabled swap
+ * entries. Please see comment below for
+ * pte_uffd_wp().
+ */
+ if (pte_swp_uffd_wp(pte)) {
+ *scan_result = SCAN_PTE_UFFD_WP;
+ return PTE_CHECK_FAIL;
+ }
+ return PTE_CHECK_CONTINUE;
+ } else {
+ *scan_result = SCAN_EXCEED_SWAP_PTE;
+ count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
+ return PTE_CHECK_FAIL;
+ }
+ } else if (pte_uffd_wp(pte)) {
+ /*
+ * Don't collapse the page if any of the small PTEs are
+ * armed with uffd write protection. Here we can also mark
+ * the new huge pmd as write protected if any of the small
+ * ones is marked but that could bring unknown userfault
+ * messages that falls outside of the registered range.
+ * So, just be simple.
+ */
+ *scan_result = SCAN_PTE_UFFD_WP;
+ return PTE_CHECK_FAIL;
+ }
+
+ folio = vm_normal_folio(vma, addr, pte);
+ if (unlikely(!folio) || unlikely(folio_is_zone_device(folio))) {
+ *scan_result = SCAN_PAGE_NULL;
+ return PTE_CHECK_FAIL;
+ }
+
+ if (!folio_test_anon(folio)) {
+ VM_WARN_ON_FOLIO(true, folio);
+ *scan_result = SCAN_PAGE_ANON;
+ return PTE_CHECK_FAIL;
+ }
+
+ /*
+ * We treat a single page as shared if any part of the THP
+ * is shared.
+ */
+ if (folio_maybe_mapped_shared(folio)) {
+ (*shared)++;
+ if (cc->is_khugepaged && *shared > khugepaged_max_ptes_shared) {
+ *scan_result = SCAN_EXCEED_SHARED_PTE;
+ count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
+ return PTE_CHECK_FAIL;
+ }
+ }
+
+ *foliop = folio;
+
+ return PTE_CHECK_SUCCEED;
+}
+
static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
unsigned long start_addr,
pte_t *pte,
struct collapse_control *cc,
struct list_head *compound_pagelist)
{
- struct page *page = NULL;
struct folio *folio = NULL;
unsigned long addr = start_addr;
pte_t *_pte;
int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
+ int pte_check_res;
for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
_pte++, addr += PAGE_SIZE) {
pte_t pteval = ptep_get(_pte);
- if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
- ++none_or_zero;
- if (!userfaultfd_armed(vma) &&
- (!cc->is_khugepaged ||
- none_or_zero <= khugepaged_max_ptes_none)) {
- continue;
- } else {
- result = SCAN_EXCEED_NONE_PTE;
- count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
- goto out;
- }
- } else if (!pte_present(pteval)) {
- result = SCAN_PTE_NON_PRESENT;
- goto out;
- } else if (pte_uffd_wp(pteval)) {
- result = SCAN_PTE_UFFD_WP;
- goto out;
- }
- page = vm_normal_page(vma, addr, pteval);
- if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
- result = SCAN_PAGE_NULL;
- goto out;
- }
- folio = page_folio(page);
- if (!folio_test_anon(folio)) {
- VM_WARN_ON_FOLIO(true, folio);
- result = SCAN_PAGE_ANON;
- goto out;
- }
+ pte_check_res = thp_collapse_check_pte(pteval, vma, addr,
+ &folio, &none_or_zero, NULL, &shared,
+ &result, cc);
- /* See hpage_collapse_scan_pmd(). */
- if (folio_maybe_mapped_shared(folio)) {
- ++shared;
- if (cc->is_khugepaged &&
- shared > khugepaged_max_ptes_shared) {
- result = SCAN_EXCEED_SHARED_PTE;
- count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
- goto out;
- }
- }
+ if (pte_check_res == PTE_CHECK_CONTINUE)
+ continue;
+ else if (pte_check_res == PTE_CHECK_FAIL)
+ goto out;
if (folio_test_large(folio)) {
struct folio *f;
@@ -1264,11 +1347,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
pte_t *pte, *_pte;
int result = SCAN_FAIL, referenced = 0;
int none_or_zero = 0, shared = 0;
- struct page *page = NULL;
struct folio *folio = NULL;
unsigned long addr;
spinlock_t *ptl;
int node = NUMA_NO_NODE, unmapped = 0;
+ int pte_check_res;
VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
@@ -1287,81 +1370,15 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
_pte++, addr += PAGE_SIZE) {
pte_t pteval = ptep_get(_pte);
- if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
- ++none_or_zero;
- if (!userfaultfd_armed(vma) &&
- (!cc->is_khugepaged ||
- none_or_zero <= khugepaged_max_ptes_none)) {
- continue;
- } else {
- result = SCAN_EXCEED_NONE_PTE;
- count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
- goto out_unmap;
- }
- } else if (!pte_present(pteval)) {
- if (non_swap_entry(pte_to_swp_entry(pteval))) {
- result = SCAN_PTE_NON_PRESENT;
- goto out_unmap;
- }
- ++unmapped;
- if (!cc->is_khugepaged ||
- unmapped <= khugepaged_max_ptes_swap) {
- /*
- * Always be strict with uffd-wp
- * enabled swap entries. Please see
- * comment below for pte_uffd_wp().
- */
- if (pte_swp_uffd_wp(pteval)) {
- result = SCAN_PTE_UFFD_WP;
- goto out_unmap;
- }
- continue;
- } else {
- result = SCAN_EXCEED_SWAP_PTE;
- count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
- goto out_unmap;
- }
- } else if (pte_uffd_wp(pteval)) {
- /*
- * Don't collapse the page if any of the small
- * PTEs are armed with uffd write protection.
- * Here we can also mark the new huge pmd as
- * write protected if any of the small ones is
- * marked but that could bring unknown
- * userfault messages that falls outside of
- * the registered range. So, just be simple.
- */
- result = SCAN_PTE_UFFD_WP;
- goto out_unmap;
- }
+ pte_check_res = thp_collapse_check_pte(pteval, vma, addr,
+ &folio, &none_or_zero, &unmapped,
+ &shared, &result, cc);
- page = vm_normal_page(vma, addr, pteval);
- if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
- result = SCAN_PAGE_NULL;
- goto out_unmap;
- }
- folio = page_folio(page);
-
- if (!folio_test_anon(folio)) {
- VM_WARN_ON_FOLIO(true, folio);
- result = SCAN_PAGE_ANON;
+ if (pte_check_res == PTE_CHECK_CONTINUE)
+ continue;
+ else if (pte_check_res == PTE_CHECK_FAIL)
goto out_unmap;
- }
-
- /*
- * We treat a single page as shared if any part of the THP
- * is shared.
- */
- if (folio_maybe_mapped_shared(folio)) {
- ++shared;
- if (cc->is_khugepaged &&
- shared > khugepaged_max_ptes_shared) {
- result = SCAN_EXCEED_SHARED_PTE;
- count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
- goto out_unmap;
- }
- }
/*
* Record which node the original page is from and save this
--
2.49.0
* Re: [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper
From: Andrew Morton @ 2025-10-09 1:07 UTC (permalink / raw)
To: Lance Yang
Cc: david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, ioworker0, richard.weiyang,
linux-kernel, linux-mm
On Wed, 8 Oct 2025 12:37:48 +0800 Lance Yang <lance.yang@linux.dev> wrote:
> + if (!cc->is_khugepaged ||
> + *unmapped <= khugepaged_max_ptes_swap) {
> + /*
> + * Always be strict with uffd-wp enabled swap
> + * entries. Please see comment below for
> + * pte_uffd_wp().
> + */
> + if (pte_swp_uffd_wp(pte)) {
> + *scan_result = SCAN_PTE_UFFD_WP;
> + return PTE_CHECK_FAIL;
> + }
> + return PTE_CHECK_CONTINUE;
> + } else {
> + *scan_result = SCAN_EXCEED_SWAP_PTE;
> + count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> + return PTE_CHECK_FAIL;
> + }
I'm inclined to agree with checkpatch here.
WARNING: else is not generally useful after a break or return
#81: FILE: mm/khugepaged.c:574:
+ return PTE_CHECK_CONTINUE;
+ } else {
did you see this and disagree or did you forget to run checkpatch?
--- a/mm/khugepaged.c~mm-khugepaged-merge-pte-scanning-logic-into-a-new-helper-checkpatch-fixes
+++ a/mm/khugepaged.c
@@ -571,11 +571,10 @@ static inline int thp_collapse_check_pte
(!cc->is_khugepaged ||
*none_or_zero <= khugepaged_max_ptes_none)) {
return PTE_CHECK_CONTINUE;
- } else {
- *scan_result = SCAN_EXCEED_NONE_PTE;
- count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
- return PTE_CHECK_FAIL;
}
+ *scan_result = SCAN_EXCEED_NONE_PTE;
+ count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+ return PTE_CHECK_FAIL;
} else if (!pte_present(pte)) {
if (!unmapped) {
*scan_result = SCAN_PTE_NON_PRESENT;
@@ -600,11 +599,10 @@ static inline int thp_collapse_check_pte
return PTE_CHECK_FAIL;
}
return PTE_CHECK_CONTINUE;
- } else {
- *scan_result = SCAN_EXCEED_SWAP_PTE;
- count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
- return PTE_CHECK_FAIL;
}
+ *scan_result = SCAN_EXCEED_SWAP_PTE;
+ count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
+ return PTE_CHECK_FAIL;
} else if (pte_uffd_wp(pte)) {
/*
* Don't collapse the page if any of the small PTEs are
_
* Re: [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper
From: Lance Yang @ 2025-10-09 1:49 UTC (permalink / raw)
To: Andrew Morton
Cc: david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, ioworker0, richard.weiyang,
linux-kernel, linux-mm
On 2025/10/9 09:07, Andrew Morton wrote:
> On Wed, 8 Oct 2025 12:37:48 +0800 Lance Yang <lance.yang@linux.dev> wrote:
>
>> + if (!cc->is_khugepaged ||
>> + *unmapped <= khugepaged_max_ptes_swap) {
>> + /*
>> + * Always be strict with uffd-wp enabled swap
>> + * entries. Please see comment below for
>> + * pte_uffd_wp().
>> + */
>> + if (pte_swp_uffd_wp(pte)) {
>> + *scan_result = SCAN_PTE_UFFD_WP;
>> + return PTE_CHECK_FAIL;
>> + }
>> + return PTE_CHECK_CONTINUE;
>> + } else {
>> + *scan_result = SCAN_EXCEED_SWAP_PTE;
>> + count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
>> + return PTE_CHECK_FAIL;
>> + }
>
> I'm inclined to agree with checkpatch here.
Thanks!
>
> WARNING: else is not generally useful after a break or return
> #81: FILE: mm/khugepaged.c:574:
> + return PTE_CHECK_CONTINUE;
> + } else {
>
> did you see this and disagree or did you forget to run checkpatch?
Yes, I saw the warning. I kept the original style because this is just a
code move ...
>
> --- a/mm/khugepaged.c~mm-khugepaged-merge-pte-scanning-logic-into-a-new-helper-checkpatch-fixes
> +++ a/mm/khugepaged.c
> @@ -571,11 +571,10 @@ static inline int thp_collapse_check_pte
> (!cc->is_khugepaged ||
> *none_or_zero <= khugepaged_max_ptes_none)) {
> return PTE_CHECK_CONTINUE;
> - } else {
> - *scan_result = SCAN_EXCEED_NONE_PTE;
> - count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> - return PTE_CHECK_FAIL;
> }
> + *scan_result = SCAN_EXCEED_NONE_PTE;
> + count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> + return PTE_CHECK_FAIL;
> } else if (!pte_present(pte)) {
> if (!unmapped) {
> *scan_result = SCAN_PTE_NON_PRESENT;
> @@ -600,11 +599,10 @@ static inline int thp_collapse_check_pte
> return PTE_CHECK_FAIL;
> }
> return PTE_CHECK_CONTINUE;
> - } else {
> - *scan_result = SCAN_EXCEED_SWAP_PTE;
> - count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> - return PTE_CHECK_FAIL;
> }
> + *scan_result = SCAN_EXCEED_SWAP_PTE;
> + count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> + return PTE_CHECK_FAIL;
> } else if (pte_uffd_wp(pte)) {
> /*
> * Don't collapse the page if any of the small PTEs are
> _
>
* Re: [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper
From: Dev Jain @ 2025-10-10 9:10 UTC (permalink / raw)
To: Lance Yang, akpm, david, lorenzo.stoakes
Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, baohua,
ioworker0, richard.weiyang, linux-kernel, linux-mm
On 08/10/25 10:07 am, Lance Yang wrote:
> }
> }
>
> +/*
> + * thp_collapse_check_pte - Check if a PTE is suitable for THP collapse
> + * @pte: The PTE to check
> + * @vma: The VMA the PTE belongs to
> + * @addr: The virtual address corresponding to this PTE
> + * @foliop: On success, used to return a pointer to the folio
> + * Must be non-NULL
> + * @none_or_zero: Counter for none/zero PTEs. Must be non-NULL
> + * @unmapped: Counter for swap PTEs. Can be NULL if not scanning swaps
> + * @shared: Counter for shared pages. Must be non-NULL
> + * @scan_result: Used to return the failure reason (SCAN_*) on a
> + * PTE_CHECK_FAIL return. Must be non-NULL
> + * @cc: Collapse control settings
> + *
> + * Returns:
> + * PTE_CHECK_SUCCEED - PTE is suitable, proceed with further checks
> + * PTE_CHECK_CONTINUE - Skip this PTE and continue scanning
> + * PTE_CHECK_FAIL - Abort collapse scan
> + */
> +static inline int thp_collapse_check_pte(pte_t pte, struct vm_area_struct *vma,
> + unsigned long addr, struct folio **foliop, int *none_or_zero,
> + int *unmapped, int *shared, int *scan_result,
> + struct collapse_control *cc)
> +{
> + struct folio *folio = NULL;
I think initialization is not needed here? Otherwise, LGTM.

Reviewed-by: Dev Jain <dev.jain@arm.com>
* Re: [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper
From: Lance Yang @ 2025-10-10 10:42 UTC (permalink / raw)
To: Dev Jain
Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, baohua,
ioworker0, david, richard.weiyang, lorenzo.stoakes, linux-kernel,
akpm, linux-mm
On 2025/10/10 17:10, Dev Jain wrote:
>
> On 08/10/25 10:07 am, Lance Yang wrote:
>> }
>> }
>>
>> +/*
>> + * thp_collapse_check_pte - Check if a PTE is suitable for THP collapse
>> + * @pte: The PTE to check
>> + * @vma: The VMA the PTE belongs to
>> + * @addr: The virtual address corresponding to this PTE
>> + * @foliop: On success, used to return a pointer to the folio
>> + * Must be non-NULL
>> + * @none_or_zero: Counter for none/zero PTEs. Must be non-NULL
>> + * @unmapped: Counter for swap PTEs. Can be NULL if not scanning swaps
>> + * @shared: Counter for shared pages. Must be non-NULL
>> + * @scan_result: Used to return the failure reason (SCAN_*) on a
>> + * PTE_CHECK_FAIL return. Must be non-NULL
>> + * @cc: Collapse control settings
>> + *
>> + * Returns:
>> + * PTE_CHECK_SUCCEED - PTE is suitable, proceed with further checks
>> + * PTE_CHECK_CONTINUE - Skip this PTE and continue scanning
>> + * PTE_CHECK_FAIL - Abort collapse scan
>> + */
>> +static inline int thp_collapse_check_pte(pte_t pte, struct vm_area_struct *vma,
>> + unsigned long addr, struct folio **foliop, int *none_or_zero,
>> + int *unmapped, int *shared, int *scan_result,
>> + struct collapse_control *cc)
>> +{
>> + struct folio *folio = NULL;
>
> I think initialization is not needed here? Otherwise, LGTM.
>
> Reviewed-by: Dev Jain <dev.jain@arm.com>
Yep. It's a minor thing, so I'll fold that in if a new version is
needed. Thanks!
* Re: [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper
From: Wei Yang @ 2025-10-10 13:29 UTC (permalink / raw)
To: Lance Yang
Cc: akpm, david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, ioworker0,
richard.weiyang, linux-kernel, linux-mm
On Wed, Oct 08, 2025 at 12:37:48PM +0800, Lance Yang wrote:
>From: Lance Yang <lance.yang@linux.dev>
>
>As David suggested, the PTE scanning logic in hpage_collapse_scan_pmd()
>and __collapse_huge_page_isolate() was almost duplicated.
>
>This patch cleans things up by moving all the common PTE checking logic
>into a new shared helper, thp_collapse_check_pte(). While at it, we use
>vm_normal_folio() instead of vm_normal_page().
>
>Suggested-by: David Hildenbrand <david@redhat.com>
>Suggested-by: Dev Jain <dev.jain@arm.com>
>Signed-off-by: Lance Yang <lance.yang@linux.dev>
>---
> mm/khugepaged.c | 243 ++++++++++++++++++++++++++----------------------
> 1 file changed, 130 insertions(+), 113 deletions(-)
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index b5c0295c3414..7116caae1fa4 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -61,6 +61,12 @@ enum scan_result {
> SCAN_PAGE_FILLED,
> };
>
>+enum pte_check_result {
>+ PTE_CHECK_SUCCEED,
>+ PTE_CHECK_CONTINUE,
>+ PTE_CHECK_FAIL,
>+};
>+
> #define CREATE_TRACE_POINTS
> #include <trace/events/huge_memory.h>
>
>@@ -533,62 +539,139 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
> }
> }
>
>+/*
>+ * thp_collapse_check_pte - Check if a PTE is suitable for THP collapse
>+ * @pte: The PTE to check
>+ * @vma: The VMA the PTE belongs to
>+ * @addr: The virtual address corresponding to this PTE
>+ * @foliop: On success, used to return a pointer to the folio
>+ * Must be non-NULL
>+ * @none_or_zero: Counter for none/zero PTEs. Must be non-NULL
>+ * @unmapped: Counter for swap PTEs. Can be NULL if not scanning swaps
>+ * @shared: Counter for shared pages. Must be non-NULL
>+ * @scan_result: Used to return the failure reason (SCAN_*) on a
>+ * PTE_CHECK_FAIL return. Must be non-NULL
>+ * @cc: Collapse control settings
>+ *
>+ * Returns:
>+ * PTE_CHECK_SUCCEED - PTE is suitable, proceed with further checks
>+ * PTE_CHECK_CONTINUE - Skip this PTE and continue scanning
>+ * PTE_CHECK_FAIL - Abort collapse scan
>+ */
>+static inline int thp_collapse_check_pte(pte_t pte, struct vm_area_struct *vma,
>+ unsigned long addr, struct folio **foliop, int *none_or_zero,
>+ int *unmapped, int *shared, int *scan_result,
>+ struct collapse_control *cc)
>+{
>+ struct folio *folio = NULL;
>+
>+ if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {
>+ (*none_or_zero)++;
>+ if (!userfaultfd_armed(vma) &&
>+ (!cc->is_khugepaged ||
>+ *none_or_zero <= khugepaged_max_ptes_none)) {
>+ return PTE_CHECK_CONTINUE;
>+ } else {
>+ *scan_result = SCAN_EXCEED_NONE_PTE;
>+ count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>+ return PTE_CHECK_FAIL;
>+ }
>+ } else if (!pte_present(pte)) {
>+ if (!unmapped) {
>+ *scan_result = SCAN_PTE_NON_PRESENT;
>+ return PTE_CHECK_FAIL;
>+ }
>+
>+ if (non_swap_entry(pte_to_swp_entry(pte))) {
>+ *scan_result = SCAN_PTE_NON_PRESENT;
>+ return PTE_CHECK_FAIL;
>+ }
>+
>+ (*unmapped)++;
>+ if (!cc->is_khugepaged ||
>+ *unmapped <= khugepaged_max_ptes_swap) {
>+ /*
>+ * Always be strict with uffd-wp enabled swap
>+ * entries. Please see comment below for
>+ * pte_uffd_wp().
>+ */
>+ if (pte_swp_uffd_wp(pte)) {
>+ *scan_result = SCAN_PTE_UFFD_WP;
>+ return PTE_CHECK_FAIL;
>+ }
>+ return PTE_CHECK_CONTINUE;
>+ } else {
>+ *scan_result = SCAN_EXCEED_SWAP_PTE;
>+ count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
>+ return PTE_CHECK_FAIL;
>+ }
>+ } else if (pte_uffd_wp(pte)) {
>+ /*
>+ * Don't collapse the page if any of the small PTEs are
>+ * armed with uffd write protection. Here we can also mark
>+ * the new huge pmd as write protected if any of the small
>+ * ones is marked but that could bring unknown userfault
>+ * messages that falls outside of the registered range.
>+ * So, just be simple.
>+ */
>+ *scan_result = SCAN_PTE_UFFD_WP;
>+ return PTE_CHECK_FAIL;
>+ }
>+
>+ folio = vm_normal_folio(vma, addr, pte);
>+ if (unlikely(!folio) || unlikely(folio_is_zone_device(folio))) {
>+ *scan_result = SCAN_PAGE_NULL;
>+ return PTE_CHECK_FAIL;
>+ }
>+
>+ if (!folio_test_anon(folio)) {
>+ VM_WARN_ON_FOLIO(true, folio);
>+ *scan_result = SCAN_PAGE_ANON;
>+ return PTE_CHECK_FAIL;
>+ }
>+
>+ /*
>+ * We treat a single page as shared if any part of the THP
>+ * is shared.
>+ */
>+ if (folio_maybe_mapped_shared(folio)) {
>+ (*shared)++;
>+ if (cc->is_khugepaged && *shared > khugepaged_max_ptes_shared) {
>+ *scan_result = SCAN_EXCEED_SHARED_PTE;
>+ count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>+ return PTE_CHECK_FAIL;
>+ }
>+ }
>+
>+ *foliop = folio;
>+
>+ return PTE_CHECK_SUCCEED;
>+}
>+
This one looks much better.

That said, my personal feeling is that this is not yet a complete merge of
the scanning logic. We still have the folio_expected_ref_count() and
pte_young() checks present in both __collapse_huge_page_isolate() and
hpage_collapse_scan_pmd().
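Roughly, both loops still carry something like the below (just a sketch;
the exact conditions are in mm/khugepaged.c):

	if (pte_young(pteval) || folio_test_young(folio) ||
	    folio_test_referenced(folio) ||
	    mmu_notifier_test_young(vma->vm_mm, addr))
		referenced++;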
--
Wei Yang
Help you, Help me
* Re: [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper
From: Lance Yang @ 2025-10-10 13:55 UTC (permalink / raw)
To: Wei Yang
Cc: akpm, david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, ioworker0, linux-kernel,
linux-mm
On 2025/10/10 21:29, Wei Yang wrote:
> On Wed, Oct 08, 2025 at 12:37:48PM +0800, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> As David suggested, the PTE scanning logic in hpage_collapse_scan_pmd()
>> and __collapse_huge_page_isolate() was almost duplicated.
>>
>> This patch cleans things up by moving all the common PTE checking logic
>> into a new shared helper, thp_collapse_check_pte(). While at it, we use
>> vm_normal_folio() instead of vm_normal_page().
>>
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Suggested-by: Dev Jain <dev.jain@arm.com>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>> mm/khugepaged.c | 243 ++++++++++++++++++++++++++----------------------
>> 1 file changed, 130 insertions(+), 113 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index b5c0295c3414..7116caae1fa4 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -61,6 +61,12 @@ enum scan_result {
>> SCAN_PAGE_FILLED,
>> };
>>
>> +enum pte_check_result {
>> + PTE_CHECK_SUCCEED,
>> + PTE_CHECK_CONTINUE,
>> + PTE_CHECK_FAIL,
>> +};
>> +
>> #define CREATE_TRACE_POINTS
>> #include <trace/events/huge_memory.h>
>>
>> @@ -533,62 +539,139 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
>> }
>> }
>>
>> +/*
>> + * thp_collapse_check_pte - Check if a PTE is suitable for THP collapse
>> + * @pte: The PTE to check
>> + * @vma: The VMA the PTE belongs to
>> + * @addr: The virtual address corresponding to this PTE
>> + * @foliop: On success, used to return a pointer to the folio
>> + * Must be non-NULL
>> + * @none_or_zero: Counter for none/zero PTEs. Must be non-NULL
>> + * @unmapped: Counter for swap PTEs. Can be NULL if not scanning swaps
>> + * @shared: Counter for shared pages. Must be non-NULL
>> + * @scan_result: Used to return the failure reason (SCAN_*) on a
>> + * PTE_CHECK_FAIL return. Must be non-NULL
>> + * @cc: Collapse control settings
>> + *
>> + * Returns:
>> + * PTE_CHECK_SUCCEED - PTE is suitable, proceed with further checks
>> + * PTE_CHECK_CONTINUE - Skip this PTE and continue scanning
>> + * PTE_CHECK_FAIL - Abort collapse scan
>> + */
>> +static inline int thp_collapse_check_pte(pte_t pte, struct vm_area_struct *vma,
>> + unsigned long addr, struct folio **foliop, int *none_or_zero,
>> + int *unmapped, int *shared, int *scan_result,
>> + struct collapse_control *cc)
>> +{
>> + struct folio *folio = NULL;
>> +
>> + if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {
>> + (*none_or_zero)++;
>> + if (!userfaultfd_armed(vma) &&
>> + (!cc->is_khugepaged ||
>> + *none_or_zero <= khugepaged_max_ptes_none)) {
>> + return PTE_CHECK_CONTINUE;
>> + } else {
>> + *scan_result = SCAN_EXCEED_NONE_PTE;
>> + count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>> + return PTE_CHECK_FAIL;
>> + }
>> + } else if (!pte_present(pte)) {
>> + if (!unmapped) {
>> + *scan_result = SCAN_PTE_NON_PRESENT;
>> + return PTE_CHECK_FAIL;
>> + }
>> +
>> + if (non_swap_entry(pte_to_swp_entry(pte))) {
>> + *scan_result = SCAN_PTE_NON_PRESENT;
>> + return PTE_CHECK_FAIL;
>> + }
>> +
>> + (*unmapped)++;
>> + if (!cc->is_khugepaged ||
>> + *unmapped <= khugepaged_max_ptes_swap) {
>> + /*
>> + * Always be strict with uffd-wp enabled swap
>> + * entries. Please see comment below for
>> + * pte_uffd_wp().
>> + */
>> + if (pte_swp_uffd_wp(pte)) {
>> + *scan_result = SCAN_PTE_UFFD_WP;
>> + return PTE_CHECK_FAIL;
>> + }
>> + return PTE_CHECK_CONTINUE;
>> + } else {
>> + *scan_result = SCAN_EXCEED_SWAP_PTE;
>> + count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
>> + return PTE_CHECK_FAIL;
>> + }
>> + } else if (pte_uffd_wp(pte)) {
>> + /*
>> + * Don't collapse the page if any of the small PTEs are
>> + * armed with uffd write protection. Here we can also mark
>> + * the new huge pmd as write protected if any of the small
>> + * ones is marked but that could bring unknown userfault
>> + * messages that falls outside of the registered range.
>> + * So, just be simple.
>> + */
>> + *scan_result = SCAN_PTE_UFFD_WP;
>> + return PTE_CHECK_FAIL;
>> + }
>> +
>> + folio = vm_normal_folio(vma, addr, pte);
>> + if (unlikely(!folio) || unlikely(folio_is_zone_device(folio))) {
>> + *scan_result = SCAN_PAGE_NULL;
>> + return PTE_CHECK_FAIL;
>> + }
>> +
>> + if (!folio_test_anon(folio)) {
>> + VM_WARN_ON_FOLIO(true, folio);
>> + *scan_result = SCAN_PAGE_ANON;
>> + return PTE_CHECK_FAIL;
>> + }
>> +
>> + /*
>> + * We treat a single page as shared if any part of the THP
>> + * is shared.
>> + */
>> + if (folio_maybe_mapped_shared(folio)) {
>> + (*shared)++;
>> + if (cc->is_khugepaged && *shared > khugepaged_max_ptes_shared) {
>> + *scan_result = SCAN_EXCEED_SHARED_PTE;
>> + count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>> + return PTE_CHECK_FAIL;
>> + }
>> + }
>> +
>> + *foliop = folio;
>> +
>> + return PTE_CHECK_SUCCEED;
>> +}
>> +
>
> This one looks much better.
>
> That said, my personal feeling is that this is not yet a complete merge of
> the scanning logic. We still have the folio_expected_ref_count() and
> pte_young() checks present in both __collapse_huge_page_isolate() and
> hpage_collapse_scan_pmd().
Yep, good catch!
There's definitely more that can be done. For now, let's keep this patch
as-is to avoid making it too complex. Let's get this one in first, and
then we can work on the next step :)
* Re: [PATCH mm-new v3 1/3] mm/khugepaged: optimize PTE scanning with if-else-if-else-if chain
From: Lorenzo Stoakes @ 2025-10-14 12:17 UTC (permalink / raw)
To: Lance Yang
Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, ioworker0, richard.weiyang,
linux-kernel, linux-mm
On Wed, Oct 08, 2025 at 12:37:46PM +0800, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> As pointed out by Dev, the PTE checks for disjoint conditions in the
> scanning loops can be optimized. is_swap_pte, (pte_none && is_zero_pfn),
> and pte_uffd_wp are mutually exclusive.
But you're not using is_swap_pte anywhere :) This comes back to my review
question on the series this is dependent upon.
>
> This patch refactors the loops in both __collapse_huge_page_isolate() and
> hpage_collapse_scan_pmd() to use a continuous if-else-if-else-if chain
> instead of separate if blocks. While at it, the redundant pte_present()
> check before is_zero_pfn() is also removed.
I mean, see review below - I don't see why you're doing this, and I am
unconvinced that the check is actually redundant.

Also, this just feels like it should be part of the series where you change
these? I'm not sure why this is separate?
>
> Also, this is a preparatory step to make it easier to merge the
> almost-duplicated scanning logic in these two functions, as suggested
> by David.
>
> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
> Reviewed-by: Dev Jain <dev.jain@arm.com>
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Suggested-by: Dev Jain <dev.jain@arm.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
> mm/khugepaged.c | 12 ++++--------
> 1 file changed, 4 insertions(+), 8 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index bec3e268dc76..e3e27223137a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -548,8 +548,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> _pte++, addr += PAGE_SIZE) {
> pte_t pteval = ptep_get(_pte);
> - if (pte_none(pteval) || (pte_present(pteval) &&
> - is_zero_pfn(pte_pfn(pteval)))) {
> + if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
You can have non-pte_none() non-present entries, no? Isn't pte_present() a
prerequisite for pte_pfn() to be valid? If it's a swap entry, couldn't you
end up accidentally (unlikely, but still) hitting this?

Seems like this check is required, isn't it? I may be missing something here...
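To sketch the hazard I mean - for a swap entry the PFN bits are repurposed
for the swap type/offset, so in principle at least:

	pte_t pteval = ptep_get(_pte);	/* suppose this is a swap PTE */

	/*
	 * pte_pfn() then decodes swap type/offset bits rather than a real
	 * PFN, which could coincidentally match the zero page...
	 */
	if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval)))
		;	/* ...and be spuriously counted as none_or_zero. */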
> ++none_or_zero;
> if (!userfaultfd_armed(vma) &&
> (!cc->is_khugepaged ||
> @@ -560,12 +559,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> goto out;
> }
> - }
> - if (!pte_present(pteval)) {
> + } else if (!pte_present(pteval)) {
This seems pointless, since either the above logic will continue or goto
out right?
> result = SCAN_PTE_NON_PRESENT;
> goto out;
> - }
> - if (pte_uffd_wp(pteval)) {
> + } else if (pte_uffd_wp(pteval)) {
Again, what is the point of an else when the if() branch unconditionally
->out?
> result = SCAN_PTE_UFFD_WP;
> goto out;
> }
> @@ -1321,8 +1318,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> goto out_unmap;
> }
> - }
> - if (pte_uffd_wp(pteval)) {
> + } else if (pte_uffd_wp(pteval)) {
Same comment as above, I'm really confused about the purpose of this logic?
> /*
> * Don't collapse the page if any of the small
> * PTEs are armed with uffd write protection.
> --
> 2.49.0
>
* Re: [PATCH mm-new v3 2/3] mm/khugepaged: use VM_WARN_ON_FOLIO instead of VM_BUG_ON_FOLIO for non-anon folios
From: Lorenzo Stoakes @ 2025-10-14 12:25 UTC (permalink / raw)
To: Lance Yang
Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, ioworker0, richard.weiyang,
linux-kernel, linux-mm
On Wed, Oct 08, 2025 at 12:37:47PM +0800, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> As Zi pointed out, we should avoid crashing the kernel for conditions
> that can be handled gracefully. Encountering a non-anonymous folio in an
> anonymous VMA is a bug, but a warning is sufficient.
>
> This patch changes the VM_BUG_ON_FOLIO(!folio_test_anon(folio)) to a
> VM_WARN_ON_FOLIO() in both __collapse_huge_page_isolate() and
> hpage_collapse_scan_pmd(), and then aborts the scan with SCAN_PAGE_ANON.
Well no, in hpage_collapse_scan_pmd() there is no warning at all.
>
> This also makes more of the scanning logic common between
> hpage_collapse_scan_pmd() and __collapse_huge_page_isolate(), as suggested
> by Dev.
I mean I guess it's fine but I'm not sure it's really necessary to give a
blow-by-blow of who suggested what in the actual commit message :) This
isn't really useful information for somebody looking at this code in the
future.
>
> Suggested-by: Dev Jain <dev.jain@arm.com>
> Suggested-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
> Reviewed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
> mm/khugepaged.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index e3e27223137a..b5c0295c3414 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -573,7 +573,11 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> }
>
> folio = page_folio(page);
> - VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
> + if (!folio_test_anon(folio)) {
> + VM_WARN_ON_FOLIO(true, folio);
> + result = SCAN_PAGE_ANON;
> + goto out;
Hmm, this is iffy - I'm not sure I agree with Zi here. The purpose of
VM_WARN_ON() etc. is for things that programmatically _should not
happen_.
Now every single time we run this code we're doing this check.
AND it implies that it is an actual possibility, at run time, for this to be
the case.
I really don't like this.
Also if it's a runtime check this should be a WARN_ON_ONCE() not a
VM_WARN_ON_ONCE(). Of course you lose the folio output then. So this code
is very confused.
In general I don't think we should be doing this at all, rather we should
just convert the VM_BUG_ON_FOLIO() to a VM_WARN_ON_FOLIO().
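i.e. just (sketch):

	folio = page_folio(page);
	VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);

without open-coding the bail-out branch.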
> + }
>
> /* See hpage_collapse_scan_pmd(). */
> if (folio_maybe_mapped_shared(folio)) {
> @@ -1340,6 +1344,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> folio = page_folio(page);
>
> if (!folio_test_anon(folio)) {
> + VM_WARN_ON_FOLIO(true, folio);
Err, what? This is a condition that should never, ever happen to the point
that we warn on it?
This surely is a condition that we expect to happen sometimes otherwise we
wouldn't do this no?
Either way the comments above still apply. Also VM_WARN_ON_FOLIO(true, ...)
is kinda gross... if this is an actual pattern that exists, VM_WARN_FOLIO()
would be preferable.
> result = SCAN_PAGE_ANON;
> goto out_unmap;
> }
> --
> 2.49.0
>
* Re: [PATCH mm-new v3 1/3] mm/khugepaged: optimize PTE scanning with if-else-if-else-if chain
From: David Hildenbrand @ 2025-10-14 12:27 UTC (permalink / raw)
To: Lorenzo Stoakes, Lance Yang
Cc: akpm, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
dev.jain, baohua, ioworker0, richard.weiyang, linux-kernel,
linux-mm
On 14.10.25 14:17, Lorenzo Stoakes wrote:
> On Wed, Oct 08, 2025 at 12:37:46PM +0800, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> As pointed out by Dev, the PTE checks for disjoint conditions in the
>> scanning loops can be optimized. is_swap_pte, (pte_none && is_zero_pfn),
>> and pte_uffd_wp are mutually exclusive.
>
> But you're not using is_swap_pte anywhere :) This comes back to my review
> quesiotn on the series this is dependent upon.
>
>>
>> This patch refactors the loops in both __collapse_huge_page_isolate() and
>> hpage_collapse_scan_pmd() to use a continuous if-else-if-else-if chain
>> instead of separate if blocks. While at it, the redundant pte_present()
>> check before is_zero_pfn() is also removed.
>
> I mean see review below, I don't see why you're doing this and I am
> unconvinced by how redundant that check is.
>
> Also this just feels like it should be part of the series where you change
> these? I'm not sure why this is separate?
I think Lance is trying to unify both scanning functions to look alike,
such that when he refactors them out in patch #3 it looks more
straightforward.

The missing pte_present() check in hpage_collapse_scan_pmd() is interesting.
Likely there is one such check missing there?
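(Presumably the none/zero branch there would want the same guard
__collapse_huge_page_isolate() had, something like:

	if (pte_none(pteval) || (pte_present(pteval) &&
				 is_zero_pfn(pte_pfn(pteval)))) {

but I haven't double-checked.)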
--
Cheers
David / dhildenb
* Re: [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper
From: Lorenzo Stoakes @ 2025-10-14 12:36 UTC (permalink / raw)
To: Lance Yang
Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, ioworker0, richard.weiyang,
linux-kernel, linux-mm
On Wed, Oct 08, 2025 at 12:37:48PM +0800, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> As David suggested, the PTE scanning logic in hpage_collapse_scan_pmd()
> and __collapse_huge_page_isolate() was almost duplicated.
>
> This patch cleans things up by moving all the common PTE checking logic
> into a new shared helper, thp_collapse_check_pte(). While at it, we use
> vm_normal_folio() instead of vm_normal_page().
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
In general I like the idea; the implementation needs work, though.

Will check in more detail on respin.
> ---
> mm/khugepaged.c | 243 ++++++++++++++++++++++++++----------------------
> 1 file changed, 130 insertions(+), 113 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b5c0295c3414..7116caae1fa4 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -61,6 +61,12 @@ enum scan_result {
> SCAN_PAGE_FILLED,
> };
>
> +enum pte_check_result {
> + PTE_CHECK_SUCCEED,
> + PTE_CHECK_CONTINUE,
Don't love this logic - this feels like we're essentially abstracting
control flow. I mean, what does 'continue' mean here, other than 'you're in
a loop, please continue'? That relies a little too much on what the caller
does.

If we retain this logic, something like PTE_CHECK_SKIP would make more sense.
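i.e. something like:

	enum pte_check_result {
		PTE_CHECK_SUCCEED,
		PTE_CHECK_SKIP,		/* skip this PTE, keep scanning */
		PTE_CHECK_FAIL,
	};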
> + PTE_CHECK_FAIL,
> +};
> +
> #define CREATE_TRACE_POINTS
> #include <trace/events/huge_memory.h>
>
> @@ -533,62 +539,139 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
> }
> }
>
> +/*
> + * thp_collapse_check_pte - Check if a PTE is suitable for THP collapse
> + * @pte: The PTE to check
> + * @vma: The VMA the PTE belongs to
> + * @addr: The virtual address corresponding to this PTE
> + * @foliop: On success, used to return a pointer to the folio
> + * Must be non-NULL
> + * @none_or_zero: Counter for none/zero PTEs. Must be non-NULL
> + * @unmapped: Counter for swap PTEs. Can be NULL if not scanning swaps
> + * @shared: Counter for shared pages. Must be non-NULL
> + * @scan_result: Used to return the failure reason (SCAN_*) on a
> + * PTE_CHECK_FAIL return. Must be non-NULL
> + * @cc: Collapse control settings
Do we really need this many parameters? This is hard to follow.
Of course it being me, I'd immediately prefer a helper struct :)
> + *
> + * Returns:
> + * PTE_CHECK_SUCCEED - PTE is suitable, proceed with further checks
> + * PTE_CHECK_CONTINUE - Skip this PTE and continue scanning
> + * PTE_CHECK_FAIL - Abort collapse scan
> + */
> +static inline int thp_collapse_check_pte(pte_t pte, struct vm_area_struct *vma,
There's no need for inline in an internal static function in a file.
> + unsigned long addr, struct folio **foliop, int *none_or_zero,
> + int *unmapped, int *shared, int *scan_result,
> + struct collapse_control *cc)
> +{
> + struct folio *folio = NULL;
> +
> + if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {
> + (*none_or_zero)++;
> + if (!userfaultfd_armed(vma) &&
> + (!cc->is_khugepaged ||
> + *none_or_zero <= khugepaged_max_ptes_none)) {
> + return PTE_CHECK_CONTINUE;
> + } else {
> + *scan_result = SCAN_EXCEED_NONE_PTE;
> + count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> + return PTE_CHECK_FAIL;
> + }
> + } else if (!pte_present(pte)) {
You're now making the if/else issues from the previous patches worse by
returning here - to the point that even checkpatch warns about it.
if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {
(of course note that I am not convinced you can drop the pte_present()
check here)
(*none_or_zero)++;
if (!userfaultfd_armed(vma) &&
(!cc->is_khugepaged ||
*none_or_zero <= khugepaged_max_ptes_none))
return PTE_CHECK_CONTINUE;
*scan_result = SCAN_EXCEED_NONE_PTE;
count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
return PTE_CHECK_FAIL;
}
if (!pte_present(pte)) {
...
}
But even that really needs more serious refactoring - that condition
above is horrible, for instance. So, e.g.:
if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {
(*none_or_zero)++;
/* Cases where this is acceptable. */
if (!userfaultfd_armed(vma))
return PTE_CHECK_SKIP;
if (!cc->is_khugepaged)
return PTE_CHECK_SKIP;
if (*none_or_zero <= khugepaged_max_ptes_none)
return PTE_CHECK_SKIP;
/* Otherwise, we must abort the scan. */
*scan_result = SCAN_EXCEED_NONE_PTE;
count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
return PTE_CHECK_FAIL;
}
if (!pte_present(pte)) {
...
}
Improves things a lot.
I do however _hate_ this (*blah)++; thing. A helper struct would help :)
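Something like (names entirely illustrative):

	struct pte_scan_state {
		int none_or_zero;
		int unmapped;
		int shared;
		int scan_result;
	};

so the helper takes a single pointer and the (*blah)++ dereferencing noise
goes away.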
> + if (!unmapped) {
> + *scan_result = SCAN_PTE_NON_PRESENT;
> + return PTE_CHECK_FAIL;
Can't we have a helper that sets the result, e.g.:
return pte_check_fail(scan_result, SCAN_PTE_NON_PRESENT);
static enum pte_check_result pte_check_fail(int *scan_result,
enum pte_check_result result)
{
*scan_result = result;
return PTE_CHECK_FAIL;
}
And same for success/skip
Then all of these horrible if (blah) { *foo = bar; return baz; } can be
made into
if (blah)
return pte_check_xxx(..., SCAN_PTE_...);
> + }
> +
> + if (non_swap_entry(pte_to_swp_entry(pte))) {
> + *scan_result = SCAN_PTE_NON_PRESENT;
> + return PTE_CHECK_FAIL;
> + }
> +
> + (*unmapped)++;
> + if (!cc->is_khugepaged ||
> + *unmapped <= khugepaged_max_ptes_swap) {
> + /*
> + * Always be strict with uffd-wp enabled swap
> + * entries. Please see comment below for
> + * pte_uffd_wp().
> + */
> + if (pte_swp_uffd_wp(pte)) {
> + *scan_result = SCAN_PTE_UFFD_WP;
> + return PTE_CHECK_FAIL;
> + }
> + return PTE_CHECK_CONTINUE;
> + } else {
> + *scan_result = SCAN_EXCEED_SWAP_PTE;
> + count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> + return PTE_CHECK_FAIL;
> + }
> + } else if (pte_uffd_wp(pte)) {
> + /*
> + * Don't collapse the page if any of the small PTEs are
> + * armed with uffd write protection. Here we can also mark
> + * the new huge pmd as write protected if any of the small
> + * ones is marked but that could bring unknown userfault
> + * messages that falls outside of the registered range.
> + * So, just be simple.
> + */
> + *scan_result = SCAN_PTE_UFFD_WP;
> + return PTE_CHECK_FAIL;
> + }
> +
Again, all of the comments for the previous series still apply here, so
obviously I don't like this as-is :)
And as Andrew has pointed out, now checkpatch is finding the 'pointless
else' issue (as mentioned above also).
[snip]
* Re: [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper
2025-10-08 4:37 ` [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper Lance Yang
` (3 preceding siblings ...)
2025-10-14 12:36 ` Lorenzo Stoakes
@ 2025-10-14 17:41 ` Lorenzo Stoakes
2025-10-15 1:48 ` Lance Yang
4 siblings, 1 reply; 19+ messages in thread
From: Lorenzo Stoakes @ 2025-10-14 17:41 UTC (permalink / raw)
To: Lance Yang
Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, ioworker0, richard.weiyang,
linux-kernel, linux-mm
OK so this patch _is_ based on [0] already being applied then?
I don't see any is_swap_pte() here, so presumably so, right?
Can we just respin these series and put them together? Because now I've
reviewed a bunch of stuff in [0], which this series depends upon, and
you're saying we should now review this instead of that, and it's a bit
of a mess.
Once review here is dealt with, can you combine [0] into this series please?
For [0] I would like you to reinstate the is_swap_pte() conditional as
(copiously :) discussed over there.
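To be concrete -- just a sketch reusing the loop's existing names, not a
final form:

	} else if (is_swap_pte(pteval)) {
		/* swap entry: unmapped accounting + swap uffd-wp check */
	} else if (pte_uffd_wp(pteval)) {
		/* present PTE armed with uffd-wp */
	}

i.e. key the swap branch off is_swap_pte() rather than a bare
!pte_present().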
[0]: https://lore.kernel.org/all/20251008032657.72406-1-lance.yang@linux.dev/
Thanks, Lorenzo
On Wed, Oct 08, 2025 at 12:37:48PM +0800, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> As David suggested, the PTE scanning logic in hpage_collapse_scan_pmd()
> and __collapse_huge_page_isolate() was almost duplicated.
>
> This patch cleans things up by moving all the common PTE checking logic
> into a new shared helper, thp_collapse_check_pte(). While at it, we use
> vm_normal_folio() instead of vm_normal_page().
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
> mm/khugepaged.c | 243 ++++++++++++++++++++++++++----------------------
> 1 file changed, 130 insertions(+), 113 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b5c0295c3414..7116caae1fa4 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -61,6 +61,12 @@ enum scan_result {
> SCAN_PAGE_FILLED,
> };
>
> +enum pte_check_result {
> + PTE_CHECK_SUCCEED,
> + PTE_CHECK_CONTINUE,
> + PTE_CHECK_FAIL,
> +};
> +
> #define CREATE_TRACE_POINTS
> #include <trace/events/huge_memory.h>
>
> @@ -533,62 +539,139 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
> }
> }
>
> +/*
> + * thp_collapse_check_pte - Check if a PTE is suitable for THP collapse
> + * @pte: The PTE to check
> + * @vma: The VMA the PTE belongs to
> + * @addr: The virtual address corresponding to this PTE
> + * @foliop: On success, used to return a pointer to the folio
> + * Must be non-NULL
> + * @none_or_zero: Counter for none/zero PTEs. Must be non-NULL
> + * @unmapped: Counter for swap PTEs. Can be NULL if not scanning swaps
> + * @shared: Counter for shared pages. Must be non-NULL
> + * @scan_result: Used to return the failure reason (SCAN_*) on a
> + * PTE_CHECK_FAIL return. Must be non-NULL
> + * @cc: Collapse control settings
> + *
> + * Returns:
> + * PTE_CHECK_SUCCEED - PTE is suitable, proceed with further checks
> + * PTE_CHECK_CONTINUE - Skip this PTE and continue scanning
> + * PTE_CHECK_FAIL - Abort collapse scan
> + */
> +static inline int thp_collapse_check_pte(pte_t pte, struct vm_area_struct *vma,
> + unsigned long addr, struct folio **foliop, int *none_or_zero,
> + int *unmapped, int *shared, int *scan_result,
> + struct collapse_control *cc)
> +{
> + struct folio *folio = NULL;
> +
> + if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {
> + (*none_or_zero)++;
> + if (!userfaultfd_armed(vma) &&
> + (!cc->is_khugepaged ||
> + *none_or_zero <= khugepaged_max_ptes_none)) {
> + return PTE_CHECK_CONTINUE;
> + } else {
> + *scan_result = SCAN_EXCEED_NONE_PTE;
> + count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> + return PTE_CHECK_FAIL;
> + }
> + } else if (!pte_present(pte)) {
> + if (!unmapped) {
> + *scan_result = SCAN_PTE_NON_PRESENT;
> + return PTE_CHECK_FAIL;
> + }
> +
> + if (non_swap_entry(pte_to_swp_entry(pte))) {
> + *scan_result = SCAN_PTE_NON_PRESENT;
> + return PTE_CHECK_FAIL;
> + }
> +
> + (*unmapped)++;
> + if (!cc->is_khugepaged ||
> + *unmapped <= khugepaged_max_ptes_swap) {
> + /*
> + * Always be strict with uffd-wp enabled swap
> + * entries. Please see comment below for
> + * pte_uffd_wp().
> + */
> + if (pte_swp_uffd_wp(pte)) {
> + *scan_result = SCAN_PTE_UFFD_WP;
> + return PTE_CHECK_FAIL;
> + }
> + return PTE_CHECK_CONTINUE;
> + } else {
> + *scan_result = SCAN_EXCEED_SWAP_PTE;
> + count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> + return PTE_CHECK_FAIL;
> + }
> + } else if (pte_uffd_wp(pte)) {
> + /*
> + * Don't collapse the page if any of the small PTEs are
> + * armed with uffd write protection. Here we can also mark
> + * the new huge pmd as write protected if any of the small
> + * ones is marked but that could bring unknown userfault
> + * messages that falls outside of the registered range.
> + * So, just be simple.
> + */
> + *scan_result = SCAN_PTE_UFFD_WP;
> + return PTE_CHECK_FAIL;
> + }
> +
> + folio = vm_normal_folio(vma, addr, pte);
> + if (unlikely(!folio) || unlikely(folio_is_zone_device(folio))) {
> + *scan_result = SCAN_PAGE_NULL;
> + return PTE_CHECK_FAIL;
> + }
> +
> + if (!folio_test_anon(folio)) {
> + VM_WARN_ON_FOLIO(true, folio);
> + *scan_result = SCAN_PAGE_ANON;
> + return PTE_CHECK_FAIL;
> + }
> +
> + /*
> + * We treat a single page as shared if any part of the THP
> + * is shared.
> + */
> + if (folio_maybe_mapped_shared(folio)) {
> + (*shared)++;
> + if (cc->is_khugepaged && *shared > khugepaged_max_ptes_shared) {
> + *scan_result = SCAN_EXCEED_SHARED_PTE;
> + count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> + return PTE_CHECK_FAIL;
> + }
> + }
> +
> + *foliop = folio;
> +
> + return PTE_CHECK_SUCCEED;
> +}
> +
> static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> unsigned long start_addr,
> pte_t *pte,
> struct collapse_control *cc,
> struct list_head *compound_pagelist)
> {
> - struct page *page = NULL;
> struct folio *folio = NULL;
> unsigned long addr = start_addr;
> pte_t *_pte;
> int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
> + int pte_check_res;
>
> for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> _pte++, addr += PAGE_SIZE) {
> pte_t pteval = ptep_get(_pte);
> - if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> - ++none_or_zero;
> - if (!userfaultfd_armed(vma) &&
> - (!cc->is_khugepaged ||
> - none_or_zero <= khugepaged_max_ptes_none)) {
> - continue;
> - } else {
> - result = SCAN_EXCEED_NONE_PTE;
> - count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> - goto out;
> - }
> - } else if (!pte_present(pteval)) {
> - result = SCAN_PTE_NON_PRESENT;
> - goto out;
> - } else if (pte_uffd_wp(pteval)) {
> - result = SCAN_PTE_UFFD_WP;
> - goto out;
> - }
> - page = vm_normal_page(vma, addr, pteval);
> - if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
> - result = SCAN_PAGE_NULL;
> - goto out;
> - }
>
> - folio = page_folio(page);
> - if (!folio_test_anon(folio)) {
> - VM_WARN_ON_FOLIO(true, folio);
> - result = SCAN_PAGE_ANON;
> - goto out;
> - }
> + pte_check_res = thp_collapse_check_pte(pteval, vma, addr,
> + &folio, &none_or_zero, NULL, &shared,
> + &result, cc);
>
> - /* See hpage_collapse_scan_pmd(). */
> - if (folio_maybe_mapped_shared(folio)) {
> - ++shared;
> - if (cc->is_khugepaged &&
> - shared > khugepaged_max_ptes_shared) {
> - result = SCAN_EXCEED_SHARED_PTE;
> - count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> - goto out;
> - }
> - }
> + if (pte_check_res == PTE_CHECK_CONTINUE)
> + continue;
> + else if (pte_check_res == PTE_CHECK_FAIL)
> + goto out;
>
> if (folio_test_large(folio)) {
> struct folio *f;
> @@ -1264,11 +1347,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> pte_t *pte, *_pte;
> int result = SCAN_FAIL, referenced = 0;
> int none_or_zero = 0, shared = 0;
> - struct page *page = NULL;
> struct folio *folio = NULL;
> unsigned long addr;
> spinlock_t *ptl;
> int node = NUMA_NO_NODE, unmapped = 0;
> + int pte_check_res;
>
> VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
>
> @@ -1287,81 +1370,15 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> _pte++, addr += PAGE_SIZE) {
> pte_t pteval = ptep_get(_pte);
> - if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> - ++none_or_zero;
> - if (!userfaultfd_armed(vma) &&
> - (!cc->is_khugepaged ||
> - none_or_zero <= khugepaged_max_ptes_none)) {
> - continue;
> - } else {
> - result = SCAN_EXCEED_NONE_PTE;
> - count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> - goto out_unmap;
> - }
> - } else if (!pte_present(pteval)) {
> - if (non_swap_entry(pte_to_swp_entry(pteval))) {
> - result = SCAN_PTE_NON_PRESENT;
> - goto out_unmap;
> - }
>
> - ++unmapped;
> - if (!cc->is_khugepaged ||
> - unmapped <= khugepaged_max_ptes_swap) {
> - /*
> - * Always be strict with uffd-wp
> - * enabled swap entries. Please see
> - * comment below for pte_uffd_wp().
> - */
> - if (pte_swp_uffd_wp(pteval)) {
> - result = SCAN_PTE_UFFD_WP;
> - goto out_unmap;
> - }
> - continue;
> - } else {
> - result = SCAN_EXCEED_SWAP_PTE;
> - count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> - goto out_unmap;
> - }
> - } else if (pte_uffd_wp(pteval)) {
> - /*
> - * Don't collapse the page if any of the small
> - * PTEs are armed with uffd write protection.
> - * Here we can also mark the new huge pmd as
> - * write protected if any of the small ones is
> - * marked but that could bring unknown
> - * userfault messages that falls outside of
> - * the registered range. So, just be simple.
> - */
> - result = SCAN_PTE_UFFD_WP;
> - goto out_unmap;
> - }
> + pte_check_res = thp_collapse_check_pte(pteval, vma, addr,
> + &folio, &none_or_zero, &unmapped,
> + &shared, &result, cc);
>
> - page = vm_normal_page(vma, addr, pteval);
> - if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
> - result = SCAN_PAGE_NULL;
> - goto out_unmap;
> - }
> - folio = page_folio(page);
> -
> - if (!folio_test_anon(folio)) {
> - VM_WARN_ON_FOLIO(true, folio);
> - result = SCAN_PAGE_ANON;
> + if (pte_check_res == PTE_CHECK_CONTINUE)
> + continue;
> + else if (pte_check_res == PTE_CHECK_FAIL)
> goto out_unmap;
> - }
> -
> - /*
> - * We treat a single page as shared if any part of the THP
> - * is shared.
> - */
> - if (folio_maybe_mapped_shared(folio)) {
> - ++shared;
> - if (cc->is_khugepaged &&
> - shared > khugepaged_max_ptes_shared) {
> - result = SCAN_EXCEED_SHARED_PTE;
> - count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> - goto out_unmap;
> - }
> - }
>
> /*
> * Record which node the original page is from and save this
> --
> 2.49.0
>
* Re: [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper
2025-10-14 17:41 ` Lorenzo Stoakes
@ 2025-10-15 1:48 ` Lance Yang
0 siblings, 0 replies; 19+ messages in thread
From: Lance Yang @ 2025-10-15 1:48 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, ioworker0, richard.weiyang,
linux-kernel, linux-mm
On 2025/10/15 01:41, Lorenzo Stoakes wrote:
> OK so this patch _is_ based on [0] already being applied then?
>
> I don't see any is_swap_pte() here, so presumably so, right?
>
> Can we just respin these series and put them together? Because now I've
> reviewed a bunch of stuff in [0], which this series depends upon, and
> you're saying we should now review this instead of that, and it's a bit
> of a mess.
>
> Once review here is dealt with, can you combine [0] into this series please?
Got it, thanks!
>
> For [0] I would like you to reinstate the is_swap_pte() conditional as
> (copiously :) discussed over there.
>
> [0]: https://lore.kernel.org/all/20251008032657.72406-1-lance.yang@linux.dev/
>
> Thanks, Lorenzo
>
[snip]
* Re: [PATCH mm-new v3 1/3] mm/khugepaged: optimize PTE scanning with if-else-if-else-if chain
2025-10-14 12:27 ` David Hildenbrand
@ 2025-10-15 4:49 ` Lance Yang
2025-10-15 9:16 ` Lorenzo Stoakes
0 siblings, 1 reply; 19+ messages in thread
From: Lance Yang @ 2025-10-15 4:49 UTC (permalink / raw)
To: David Hildenbrand, Lorenzo Stoakes
Cc: akpm, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
dev.jain, baohua, ioworker0, richard.weiyang, linux-kernel,
linux-mm
On 2025/10/14 20:27, David Hildenbrand wrote:
> On 14.10.25 14:17, Lorenzo Stoakes wrote:
>> On Wed, Oct 08, 2025 at 12:37:46PM +0800, Lance Yang wrote:
>>> From: Lance Yang <lance.yang@linux.dev>
>>>
>>> As pointed out by Dev, the PTE checks for disjoint conditions in the
>>> scanning loops can be optimized. is_swap_pte, (pte_none && is_zero_pfn),
>>> and pte_uffd_wp are mutually exclusive.
>>
>> But you're not using is_swap_pte anywhere :) This comes back to my review
>> question on the series this is dependent upon.
>>
>>>
>>> This patch refactors the loops in both __collapse_huge_page_isolate()
>>> and
>>> hpage_collapse_scan_pmd() to use a continuous if-else-if-else-if chain
>>> instead of separate if blocks. While at it, the redundant pte_present()
>>> check before is_zero_pfn() is also removed.
>>
>> I mean see review below, I don't see why you're doing this and I am
>> unconvinced by how redundant that check is.
Ah, good catch! Lorenzo, thanks!!!
>>
>> Also this just feels like it should be part of the series where you
>> change
>> these? I'm not sure why this is separate?
>
> I think Lance is trying to unify both scanning functions to look alike,
> such that when he refactors them out in patch #3 it looks more
> straightforward.
>
> The missing pte_present() check in hpage_collapse_scan_pmd() is interesting
Yep, indeed ;)
>
> Likely there is one such check missing there?
I think the risk is exactly how pte_pfn() would handle a swap PTE ...
A swap PTE contains completely different data (swap type and offset).
pte_pfn() doesn't know this, so if we feed a swap entry to it, it will
spit out a junk PFN :)
What if that junk PFN happens to match the zeropage's PFN by sheer
chance? IMHO, it's really unlikely, but it would be really bad if it did.
Clearly, pte_present() prevents this ;)
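Just to illustrate -- a minimal sketch with a made-up helper name, built
only from the existing pte_* primitives:

	/* Only trust pte_pfn() once we know the PTE is present. */
	static inline bool pte_is_none_or_zero(pte_t pte)
	{
		if (pte_none(pte))
			return true;
		/*
		 * A swap entry encodes type/offset in the same bits, so
		 * pte_pfn() on a non-present PTE yields a junk PFN.
		 */
		return pte_present(pte) && is_zero_pfn(pte_pfn(pte));
	}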
By the way, I noticed there are other places in khugepaged.c that
call pte_pfn() without being under a pte_present() check.
Perhaps those should all be fixed as well in a separate patch?
* Re: [PATCH mm-new v3 1/3] mm/khugepaged: optimize PTE scanning with if-else-if-else-if chain
2025-10-15 4:49 ` Lance Yang
@ 2025-10-15 9:16 ` Lorenzo Stoakes
2025-10-15 9:31 ` Lance Yang
0 siblings, 1 reply; 19+ messages in thread
From: Lorenzo Stoakes @ 2025-10-15 9:16 UTC (permalink / raw)
To: Lance Yang
Cc: David Hildenbrand, akpm, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, ioworker0, richard.weiyang,
linux-kernel, linux-mm
On Wed, Oct 15, 2025 at 12:49:26PM +0800, Lance Yang wrote:
>
>
> On 2025/10/14 20:27, David Hildenbrand wrote:
> > On 14.10.25 14:17, Lorenzo Stoakes wrote:
> > > On Wed, Oct 08, 2025 at 12:37:46PM +0800, Lance Yang wrote:
> > > > From: Lance Yang <lance.yang@linux.dev>
> > > >
> > > > As pointed out by Dev, the PTE checks for disjoint conditions in the
> > > > scanning loops can be optimized. is_swap_pte, (pte_none && is_zero_pfn),
> > > > and pte_uffd_wp are mutually exclusive.
> > >
> > > But you're not using is_swap_pte anywhere :) This comes back to my review
> > > question on the series this is dependent upon.
> > >
> > > >
> > > > This patch refactors the loops in both
> > > > __collapse_huge_page_isolate() and
> > > > hpage_collapse_scan_pmd() to use a continuous if-else-if-else-if chain
> > > > instead of separate if blocks. While at it, the redundant pte_present()
> > > > check before is_zero_pfn() is also removed.
> > >
> > > I mean see review below, I don't see why you're doing this and I am
> > > unconvinced by how redundant that check is.
>
> Ah, good catch! Lorenzo, thanks!!!
>
> > >
> > > Also this just feels like it should be part of the series where you
> > > change
> > > these? I'm not sure why this is separate?
> >
> > I think Lance is trying to unify both scanning functions to look alike,
> > such that when he refactors them out in patch #3 it looks more
> > straightforward.
> >
> > The missing pte_present() check in hpage_collapse_scan_pmd() is interesting
>
> Yep, indeed ;)
>
> >
> > Likely there is one such check missing there?
>
> I think the risk is exactly how pte_pfn() would handle a swap PTE ...
>
> A swap PTE contains completely different data (swap type and offset).
> pte_pfn() doesn't know this, so if we feed a swap entry to it, it will
> spit out a junk PFN :)
>
> What if that junk PFN happens to match the zeropage's PFN by sheer
> chance? IMHO, it's really unlikely, but it would be really bad if it did.
>
> Clearly, pte_present() prevents this ;)
Yeah, 'not so clearly' is kinda the whole point I'm making here. I mean all this code
sucks because we have deeply nested conditions and you have to keep in your mind
that 'oh we already checked for X so we can do Y'.
But the THP code is horrible in general.
Anyway let's stay focused (so I can stand a chance of catching up with my
post-vacation backlog :), I will check the respin once you send!
>
> By the way, I noticed there are other places in khugepaged.c that
> call pte_pfn() without being under a pte_present() check.
>
> Perhaps those should all be fixed as well in a separate patch?
Yeah I'm not surprised and sure indeed.
Thanks, Lorenzo
* Re: [PATCH mm-new v3 1/3] mm/khugepaged: optimize PTE scanning with if-else-if-else-if chain
2025-10-15 9:16 ` Lorenzo Stoakes
@ 2025-10-15 9:31 ` Lance Yang
0 siblings, 0 replies; 19+ messages in thread
From: Lance Yang @ 2025-10-15 9:31 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: David Hildenbrand, akpm, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, ioworker0, richard.weiyang,
linux-kernel, linux-mm
On 2025/10/15 17:16, Lorenzo Stoakes wrote:
> On Wed, Oct 15, 2025 at 12:49:26PM +0800, Lance Yang wrote:
>>
>>
>> On 2025/10/14 20:27, David Hildenbrand wrote:
>>> On 14.10.25 14:17, Lorenzo Stoakes wrote:
>>>> On Wed, Oct 08, 2025 at 12:37:46PM +0800, Lance Yang wrote:
>>>>> From: Lance Yang <lance.yang@linux.dev>
>>>>>
>>>>> As pointed out by Dev, the PTE checks for disjoint conditions in the
>>>>> scanning loops can be optimized. is_swap_pte, (pte_none && is_zero_pfn),
>>>>> and pte_uffd_wp are mutually exclusive.
>>>>
>>>> But you're not using is_swap_pte anywhere :) This comes back to my review
>>>> question on the series this is dependent upon.
>>>>
>>>>>
>>>>> This patch refactors the loops in both
>>>>> __collapse_huge_page_isolate() and
>>>>> hpage_collapse_scan_pmd() to use a continuous if-else-if-else-if chain
>>>>> instead of separate if blocks. While at it, the redundant pte_present()
>>>>> check before is_zero_pfn() is also removed.
>>>>
>>>> I mean see review below, I don't see why you're doing this and I am
>>>> unconvinced by how redundant that check is.
>>
>> Ah, good catch! Lorenzo, thanks!!!
>>
>>>>
>>>> Also this just feels like it should be part of the series where you
>>>> change
>>>> these? I'm not sure why this is separate?
>>>
>>> I think Lance is trying to unify both scanning functions to look alike,
>>> such that when he refactors them out in patch #3 it looks more
>>> straightforward.
>>>
>>> The missing pte_present() check in hpage_collapse_scan_pmd() is interesting
>>
>> Yep, indeed ;)
>>
>>>
>>> Likely there is one such check missing there?
>>
>> I think the risk is exactly how pte_pfn() would handle a swap PTE ...
>>
>> A swap PTE contains completely different data (swap type and offset).
>> pte_pfn() doesn't know this, so if we feed a swap entry to it, it will
>> spit out a junk PFN :)
>>
>> What if that junk PFN happens to match the zeropage's PFN by sheer
>> chance? IMHO, it's really unlikely, but it would be really bad if it did.
>>
>> Clearly, pte_present() prevents this ;)
>
> Yeah, 'not so clearly' is kinda the whole point I'm making here. I mean all this code
> sucks because we have deeply nested conditions and you have to keep in your mind
> that 'oh we already checked for X so we can do Y'.
I see :)
>
> But the THP code is horrible in general.
Yep, you'll get no argument from me on that one ;p
The code is indeed tricky ...
>
> Anyway let's stay focused (so I can stand a chance of catching up with my
> post-vacation backlog :), I will check the respin once you send!
Cheers!
This series has been dropped from mm-new. I'm going to take a bit to
rethink the approach based on the feedback.
>
>>
>> By the way, I noticed there are other places in khugepaged.c that
>> call pte_pfn() without being under a pte_present() check.
>>
>> Perhaps those should all be fixed as well in a separate patch?
>
> Yeah I'm not surprised and sure indeed.
>
> Thanks, Lorenzo
In the meantime, I'll send out a separate patch for the missing
pte_present() checks first ;)
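Roughly along these lines -- untested, sketch only:

	-	if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
	+	if (pte_none(pteval) ||
	+	    (pte_present(pteval) && is_zero_pfn(pte_pfn(pteval)))) {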
Thanks,
Lance