linux-mm.kvack.org archive mirror
* [PATCH mm-new v3 0/3] refactor and merge PTE scanning logic
@ 2025-10-08  4:37 Lance Yang
  2025-10-08  4:37 ` [PATCH mm-new v3 1/3] mm/khugepaged: optimize PTE scanning with if-else-if-else-if chain Lance Yang
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Lance Yang @ 2025-10-08  4:37 UTC (permalink / raw)
  To: akpm, david, lorenzo.stoakes
  Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
	baohua, ioworker0, richard.weiyang, linux-kernel, linux-mm

Hi all,

This series cleans up the almost-duplicated PTE scanning logic in the
collapse path.

The first patch is a preparatory step that refactors both scanning
loops to use a single if-else-if-else-if chain for checking the
disjoint PTE states.

The second replaces VM_BUG_ON_FOLIO() with the more graceful
VM_WARN_ON_FOLIO() when a non-anonymous folio is encountered.

The last one then extracts the common logic into a shared helper.

Thanks,
Lance

---
This series applies on top of patch[1], which is based on mm-new.
[1] https://lore.kernel.org/linux-mm/20251008032657.72406-1-lance.yang@linux.dev

v2 -> v3:
- #02 Collect Reviewed-by from Wei and Dev - thanks!
- #03 Use vm_normal_folio() and drop struct page altogether (per Dev)
- #03 Move the cc parameter to the end (per Dev)
- https://lore.kernel.org/linux-mm/20251006144338.96519-1-lance.yang@linux.dev

v1 -> v2:
- #01 Update the changelog (per Dev)
- #01 Collect Reviewed-by from Wei, Dev and Zi - thanks!
- #03 Make more of the scanning logic common between scan_pmd() and
      _isolate() (per Dev)
- https://lore.kernel.org/linux-mm/20251002073255.14867-1-lance.yang@linux.dev

Lance Yang (3):
  mm/khugepaged: optimize PTE scanning with if-else-if-else-if chain
  mm/khugepaged: use VM_WARN_ON_FOLIO instead of VM_BUG_ON_FOLIO for
    non-anon folios
  mm/khugepaged: merge PTE scanning logic into a new helper

 mm/khugepaged.c | 242 ++++++++++++++++++++++++++----------------------
 1 file changed, 130 insertions(+), 112 deletions(-)

-- 
2.49.0



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH mm-new v3 1/3] mm/khugepaged: optimize PTE scanning with if-else-if-else-if chain
  2025-10-08  4:37 [PATCH mm-new v3 0/3] refactor and merge PTE scanning logic Lance Yang
@ 2025-10-08  4:37 ` Lance Yang
  2025-10-14 12:17   ` Lorenzo Stoakes
  2025-10-08  4:37 ` [PATCH mm-new v3 2/3] mm/khugepaged: use VM_WARN_ON_FOLIO instead of VM_BUG_ON_FOLIO for non-anon folios Lance Yang
  2025-10-08  4:37 ` [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper Lance Yang
  2 siblings, 1 reply; 19+ messages in thread
From: Lance Yang @ 2025-10-08  4:37 UTC (permalink / raw)
  To: akpm, david, lorenzo.stoakes
  Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
	baohua, ioworker0, richard.weiyang, linux-kernel, linux-mm,
	Lance Yang

From: Lance Yang <lance.yang@linux.dev>

As pointed out by Dev, the PTE checks for disjoint conditions in the
scanning loops can be optimized: is_swap_pte(), (pte_none() ||
is_zero_pfn()), and pte_uffd_wp() are mutually exclusive.

This patch refactors the loops in both __collapse_huge_page_isolate() and
hpage_collapse_scan_pmd() to use a continuous if-else-if-else-if chain
instead of separate if blocks. While at it, the redundant pte_present()
check before is_zero_pfn() is also removed.

Also, this is a preparatory step to make it easier to merge the
almost-duplicated scanning logic in these two functions, as suggested
by David.
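
The equivalence relied on here can be sketched in plain user-space C.
This is a simplified stand-in, not the kernel helpers: fake_pte and the
two check functions are hypothetical models, and the claim being
illustrated is only that, for mutually exclusive predicates, chaining
the ifs with else-if gives the same result as evaluating them
separately:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the three disjoint PTE states; per the
 * changelog, at most one of these flags can be set for a given entry. */
struct fake_pte {
	bool none_or_zero; /* pte_none() || is_zero_pfn() */
	bool non_present;  /* !pte_present(), e.g. a swap entry */
	bool uffd_wp;      /* pte_uffd_wp() on a present entry */
};

enum scan_result {
	SCAN_OK,
	SCAN_EXCEED_NONE_PTE,
	SCAN_PTE_NON_PRESENT,
	SCAN_PTE_UFFD_WP,
};

/* Separate-if form (before the patch): every condition is evaluated. */
static enum scan_result check_separate(struct fake_pte p)
{
	enum scan_result r = SCAN_OK;

	if (p.none_or_zero)
		r = SCAN_EXCEED_NONE_PTE;
	if (p.non_present)
		r = SCAN_PTE_NON_PRESENT;
	if (p.uffd_wp)
		r = SCAN_PTE_UFFD_WP;
	return r;
}

/* Chained form (after the patch): at most one branch is taken. Because
 * the three predicates are disjoint, the result is identical. */
static enum scan_result check_chained(struct fake_pte p)
{
	if (p.none_or_zero)
		return SCAN_EXCEED_NONE_PTE;
	else if (p.non_present)
		return SCAN_PTE_NON_PRESENT;
	else if (p.uffd_wp)
		return SCAN_PTE_UFFD_WP;
	return SCAN_OK;
}
```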

Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Suggested-by: Dev Jain <dev.jain@arm.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
 mm/khugepaged.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index bec3e268dc76..e3e27223137a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -548,8 +548,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
 	     _pte++, addr += PAGE_SIZE) {
 		pte_t pteval = ptep_get(_pte);
-		if (pte_none(pteval) || (pte_present(pteval) &&
-				is_zero_pfn(pte_pfn(pteval)))) {
+		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
 			++none_or_zero;
 			if (!userfaultfd_armed(vma) &&
 			    (!cc->is_khugepaged ||
@@ -560,12 +559,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 				goto out;
 			}
-		}
-		if (!pte_present(pteval)) {
+		} else if (!pte_present(pteval)) {
 			result = SCAN_PTE_NON_PRESENT;
 			goto out;
-		}
-		if (pte_uffd_wp(pteval)) {
+		} else if (pte_uffd_wp(pteval)) {
 			result = SCAN_PTE_UFFD_WP;
 			goto out;
 		}
@@ -1321,8 +1318,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
 				goto out_unmap;
 			}
-		}
-		if (pte_uffd_wp(pteval)) {
+		} else if (pte_uffd_wp(pteval)) {
 			/*
 			 * Don't collapse the page if any of the small
 			 * PTEs are armed with uffd write protection.
-- 
2.49.0




* [PATCH mm-new v3 2/3] mm/khugepaged: use VM_WARN_ON_FOLIO instead of VM_BUG_ON_FOLIO for non-anon folios
  2025-10-08  4:37 [PATCH mm-new v3 0/3] refactor and merge PTE scanning logic Lance Yang
  2025-10-08  4:37 ` [PATCH mm-new v3 1/3] mm/khugepaged: optimize PTE scanning with if-else-if-else-if chain Lance Yang
@ 2025-10-08  4:37 ` Lance Yang
  2025-10-14 12:25   ` Lorenzo Stoakes
  2025-10-08  4:37 ` [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper Lance Yang
  2 siblings, 1 reply; 19+ messages in thread
From: Lance Yang @ 2025-10-08  4:37 UTC (permalink / raw)
  To: akpm, david, lorenzo.stoakes
  Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
	baohua, ioworker0, richard.weiyang, linux-kernel, linux-mm,
	Lance Yang

From: Lance Yang <lance.yang@linux.dev>

As Zi pointed out, we should avoid crashing the kernel for conditions
that can be handled gracefully. Encountering a non-anonymous folio in an
anonymous VMA is a bug, but a warning is sufficient.

This patch changes the VM_BUG_ON_FOLIO(!folio_test_anon(folio)) to a
VM_WARN_ON_FOLIO() in both __collapse_huge_page_isolate() and
hpage_collapse_scan_pmd(), and then aborts the scan with SCAN_PAGE_ANON.

This also makes more of the scanning logic common between
hpage_collapse_scan_pmd() and __collapse_huge_page_isolate(), as
suggested by Dev.
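
The warn-and-recover pattern can be sketched in user-space C. WARN_SIM
and check_anon() below are made-up stand-ins for VM_WARN_ON_FOLIO() and
the scan-loop check, shown only to illustrate reporting the bug while
letting the caller back out gracefully instead of crashing:

```c
#include <assert.h>
#include <stdio.h>

static int warned;

/* Simplified analogue of a kernel WARN macro: log and count, don't panic. */
#define WARN_SIM(cond) \
	do { if (cond) { warned++; fprintf(stderr, "warn: %s\n", #cond); } } while (0)

enum scan_result { SCAN_SUCCEED, SCAN_PAGE_ANON };

static enum scan_result check_anon(int folio_is_anon)
{
	if (!folio_is_anon) {
		WARN_SIM(!folio_is_anon);	/* report the "impossible" state... */
		return SCAN_PAGE_ANON;		/* ...but recover: abort this scan */
	}
	return SCAN_SUCCEED;
}
```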

Suggested-by: Dev Jain <dev.jain@arm.com>
Suggested-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
 mm/khugepaged.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e3e27223137a..b5c0295c3414 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -573,7 +573,11 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		}
 
 		folio = page_folio(page);
-		VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
+		if (!folio_test_anon(folio)) {
+			VM_WARN_ON_FOLIO(true, folio);
+			result = SCAN_PAGE_ANON;
+			goto out;
+		}
 
 		/* See hpage_collapse_scan_pmd(). */
 		if (folio_maybe_mapped_shared(folio)) {
@@ -1340,6 +1344,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 		folio = page_folio(page);
 
 		if (!folio_test_anon(folio)) {
+			VM_WARN_ON_FOLIO(true, folio);
 			result = SCAN_PAGE_ANON;
 			goto out_unmap;
 		}
-- 
2.49.0




* [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper
  2025-10-08  4:37 [PATCH mm-new v3 0/3] refactor and merge PTE scanning logic Lance Yang
  2025-10-08  4:37 ` [PATCH mm-new v3 1/3] mm/khugepaged: optimize PTE scanning with if-else-if-else-if chain Lance Yang
  2025-10-08  4:37 ` [PATCH mm-new v3 2/3] mm/khugepaged: use VM_WARN_ON_FOLIO instead of VM_BUG_ON_FOLIO for non-anon folios Lance Yang
@ 2025-10-08  4:37 ` Lance Yang
  2025-10-09  1:07   ` Andrew Morton
                     ` (4 more replies)
  2 siblings, 5 replies; 19+ messages in thread
From: Lance Yang @ 2025-10-08  4:37 UTC (permalink / raw)
  To: akpm, david, lorenzo.stoakes
  Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
	baohua, ioworker0, richard.weiyang, linux-kernel, linux-mm,
	Lance Yang

From: Lance Yang <lance.yang@linux.dev>

As David pointed out, the PTE scanning logic in hpage_collapse_scan_pmd()
and __collapse_huge_page_isolate() is almost entirely duplicated.

This patch cleans things up by moving all the common PTE checking logic
into a new shared helper, thp_collapse_check_pte(). While at it, we use
vm_normal_folio() instead of vm_normal_page().
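
The tri-state return pattern used by the new helper can be sketched in
user-space C. check_entry() and scan() below are simplified stand-ins
(not the kernel API) showing how a shared classifier lets both scan
loops react identically to skip/fail/proceed:

```c
#include <assert.h>

enum pte_check_result {
	PTE_CHECK_SUCCEED,
	PTE_CHECK_CONTINUE,
	PTE_CHECK_FAIL,
};

/* Toy classifier: 0 = none/zero entry (skip), <0 = unsuitable (abort),
 * >0 = usable. The real helper inspects the PTE and its folio instead. */
static enum pte_check_result check_entry(int e)
{
	if (e == 0)
		return PTE_CHECK_CONTINUE;
	if (e < 0)
		return PTE_CHECK_FAIL;
	return PTE_CHECK_SUCCEED;
}

/* A loop shaped like the two callers: returns the number of accepted
 * entries, or -1 if the scan had to abort. */
static int scan(const int *entries, int n)
{
	int accepted = 0;

	for (int i = 0; i < n; i++) {
		switch (check_entry(entries[i])) {
		case PTE_CHECK_CONTINUE:
			continue;
		case PTE_CHECK_FAIL:
			return -1;
		case PTE_CHECK_SUCCEED:
			accepted++;
			break;
		}
	}
	return accepted;
}
```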

Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
 mm/khugepaged.c | 243 ++++++++++++++++++++++++++----------------------
 1 file changed, 130 insertions(+), 113 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b5c0295c3414..7116caae1fa4 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -61,6 +61,12 @@ enum scan_result {
 	SCAN_PAGE_FILLED,
 };
 
+enum pte_check_result {
+	PTE_CHECK_SUCCEED,
+	PTE_CHECK_CONTINUE,
+	PTE_CHECK_FAIL,
+};
+
 #define CREATE_TRACE_POINTS
 #include <trace/events/huge_memory.h>
 
@@ -533,62 +539,139 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
 	}
 }
 
+/*
+ * thp_collapse_check_pte - Check if a PTE is suitable for THP collapse
+ * @pte:           The PTE to check
+ * @vma:           The VMA the PTE belongs to
+ * @addr:          The virtual address corresponding to this PTE
+ * @foliop:        On success, used to return a pointer to the folio
+ *                 Must be non-NULL
+ * @none_or_zero:  Counter for none/zero PTEs. Must be non-NULL
+ * @unmapped:      Counter for swap PTEs. Can be NULL if not scanning swaps
+ * @shared:        Counter for shared pages. Must be non-NULL
+ * @scan_result:   Used to return the failure reason (SCAN_*) on a
+ *                 PTE_CHECK_FAIL return. Must be non-NULL
+ * @cc:            Collapse control settings
+ *
+ * Returns:
+ *   PTE_CHECK_SUCCEED  - PTE is suitable, proceed with further checks
+ *   PTE_CHECK_CONTINUE - Skip this PTE and continue scanning
+ *   PTE_CHECK_FAIL     - Abort collapse scan
+ */
+static inline int thp_collapse_check_pte(pte_t pte, struct vm_area_struct *vma,
+		unsigned long addr, struct folio **foliop, int *none_or_zero,
+		int *unmapped, int *shared, int *scan_result,
+		struct collapse_control *cc)
+{
+	struct folio *folio = NULL;
+
+	if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {
+		(*none_or_zero)++;
+		if (!userfaultfd_armed(vma) &&
+		    (!cc->is_khugepaged ||
+		     *none_or_zero <= khugepaged_max_ptes_none)) {
+			return PTE_CHECK_CONTINUE;
+		} else {
+			*scan_result = SCAN_EXCEED_NONE_PTE;
+			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+			return PTE_CHECK_FAIL;
+		}
+	} else if (!pte_present(pte)) {
+		if (!unmapped) {
+			*scan_result = SCAN_PTE_NON_PRESENT;
+			return PTE_CHECK_FAIL;
+		}
+
+		if (non_swap_entry(pte_to_swp_entry(pte))) {
+			*scan_result = SCAN_PTE_NON_PRESENT;
+			return PTE_CHECK_FAIL;
+		}
+
+		(*unmapped)++;
+		if (!cc->is_khugepaged ||
+		    *unmapped <= khugepaged_max_ptes_swap) {
+			/*
+			 * Always be strict with uffd-wp enabled swap
+			 * entries. Please see comment below for
+			 * pte_uffd_wp().
+			 */
+			if (pte_swp_uffd_wp(pte)) {
+				*scan_result = SCAN_PTE_UFFD_WP;
+				return PTE_CHECK_FAIL;
+			}
+			return PTE_CHECK_CONTINUE;
+		} else {
+			*scan_result = SCAN_EXCEED_SWAP_PTE;
+			count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
+			return PTE_CHECK_FAIL;
+		}
+	} else if (pte_uffd_wp(pte)) {
+		/*
+		 * Don't collapse the page if any of the small PTEs are
+		 * armed with uffd write protection. Here we can also mark
+		 * the new huge pmd as write protected if any of the small
+		 * ones is marked but that could bring unknown userfault
+		 * messages that falls outside of the registered range.
+		 * So, just be simple.
+		 */
+		*scan_result = SCAN_PTE_UFFD_WP;
+		return PTE_CHECK_FAIL;
+	}
+
+	folio = vm_normal_folio(vma, addr, pte);
+	if (unlikely(!folio) || unlikely(folio_is_zone_device(folio))) {
+		*scan_result = SCAN_PAGE_NULL;
+		return PTE_CHECK_FAIL;
+	}
+
+	if (!folio_test_anon(folio)) {
+		VM_WARN_ON_FOLIO(true, folio);
+		*scan_result = SCAN_PAGE_ANON;
+		return PTE_CHECK_FAIL;
+	}
+
+	/*
+	 * We treat a single page as shared if any part of the THP
+	 * is shared.
+	 */
+	if (folio_maybe_mapped_shared(folio)) {
+		(*shared)++;
+		if (cc->is_khugepaged && *shared > khugepaged_max_ptes_shared) {
+			*scan_result = SCAN_EXCEED_SHARED_PTE;
+			count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
+			return PTE_CHECK_FAIL;
+		}
+	}
+
+	*foliop = folio;
+
+	return PTE_CHECK_SUCCEED;
+}
+
 static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 					unsigned long start_addr,
 					pte_t *pte,
 					struct collapse_control *cc,
 					struct list_head *compound_pagelist)
 {
-	struct page *page = NULL;
 	struct folio *folio = NULL;
 	unsigned long addr = start_addr;
 	pte_t *_pte;
 	int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
+	int pte_check_res;
 
 	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
 	     _pte++, addr += PAGE_SIZE) {
 		pte_t pteval = ptep_get(_pte);
-		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
-			++none_or_zero;
-			if (!userfaultfd_armed(vma) &&
-			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
-				continue;
-			} else {
-				result = SCAN_EXCEED_NONE_PTE;
-				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
-				goto out;
-			}
-		} else if (!pte_present(pteval)) {
-			result = SCAN_PTE_NON_PRESENT;
-			goto out;
-		} else if (pte_uffd_wp(pteval)) {
-			result = SCAN_PTE_UFFD_WP;
-			goto out;
-		}
-		page = vm_normal_page(vma, addr, pteval);
-		if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
-			result = SCAN_PAGE_NULL;
-			goto out;
-		}
 
-		folio = page_folio(page);
-		if (!folio_test_anon(folio)) {
-			VM_WARN_ON_FOLIO(true, folio);
-			result = SCAN_PAGE_ANON;
-			goto out;
-		}
+		pte_check_res = thp_collapse_check_pte(pteval, vma, addr,
+					&folio, &none_or_zero, NULL, &shared,
+					&result, cc);
 
-		/* See hpage_collapse_scan_pmd(). */
-		if (folio_maybe_mapped_shared(folio)) {
-			++shared;
-			if (cc->is_khugepaged &&
-			    shared > khugepaged_max_ptes_shared) {
-				result = SCAN_EXCEED_SHARED_PTE;
-				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
-				goto out;
-			}
-		}
+		if (pte_check_res == PTE_CHECK_CONTINUE)
+			continue;
+		else if (pte_check_res == PTE_CHECK_FAIL)
+			goto out;
 
 		if (folio_test_large(folio)) {
 			struct folio *f;
@@ -1264,11 +1347,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 	pte_t *pte, *_pte;
 	int result = SCAN_FAIL, referenced = 0;
 	int none_or_zero = 0, shared = 0;
-	struct page *page = NULL;
 	struct folio *folio = NULL;
 	unsigned long addr;
 	spinlock_t *ptl;
 	int node = NUMA_NO_NODE, unmapped = 0;
+	int pte_check_res;
 
 	VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
 
@@ -1287,81 +1370,15 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
 	     _pte++, addr += PAGE_SIZE) {
 		pte_t pteval = ptep_get(_pte);
-		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
-			++none_or_zero;
-			if (!userfaultfd_armed(vma) &&
-			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
-				continue;
-			} else {
-				result = SCAN_EXCEED_NONE_PTE;
-				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
-				goto out_unmap;
-			}
-		} else if (!pte_present(pteval)) {
-			if (non_swap_entry(pte_to_swp_entry(pteval))) {
-				result = SCAN_PTE_NON_PRESENT;
-				goto out_unmap;
-			}
 
-			++unmapped;
-			if (!cc->is_khugepaged ||
-			    unmapped <= khugepaged_max_ptes_swap) {
-				/*
-				 * Always be strict with uffd-wp
-				 * enabled swap entries.  Please see
-				 * comment below for pte_uffd_wp().
-				 */
-				if (pte_swp_uffd_wp(pteval)) {
-					result = SCAN_PTE_UFFD_WP;
-					goto out_unmap;
-				}
-				continue;
-			} else {
-				result = SCAN_EXCEED_SWAP_PTE;
-				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
-				goto out_unmap;
-			}
-		} else if (pte_uffd_wp(pteval)) {
-			/*
-			 * Don't collapse the page if any of the small
-			 * PTEs are armed with uffd write protection.
-			 * Here we can also mark the new huge pmd as
-			 * write protected if any of the small ones is
-			 * marked but that could bring unknown
-			 * userfault messages that falls outside of
-			 * the registered range.  So, just be simple.
-			 */
-			result = SCAN_PTE_UFFD_WP;
-			goto out_unmap;
-		}
+		pte_check_res = thp_collapse_check_pte(pteval, vma, addr,
+					&folio, &none_or_zero, &unmapped,
+					&shared, &result, cc);
 
-		page = vm_normal_page(vma, addr, pteval);
-		if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
-			result = SCAN_PAGE_NULL;
-			goto out_unmap;
-		}
-		folio = page_folio(page);
-
-		if (!folio_test_anon(folio)) {
-			VM_WARN_ON_FOLIO(true, folio);
-			result = SCAN_PAGE_ANON;
+		if (pte_check_res == PTE_CHECK_CONTINUE)
+			continue;
+		else if (pte_check_res == PTE_CHECK_FAIL)
 			goto out_unmap;
-		}
-
-		/*
-		 * We treat a single page as shared if any part of the THP
-		 * is shared.
-		 */
-		if (folio_maybe_mapped_shared(folio)) {
-			++shared;
-			if (cc->is_khugepaged &&
-			    shared > khugepaged_max_ptes_shared) {
-				result = SCAN_EXCEED_SHARED_PTE;
-				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
-				goto out_unmap;
-			}
-		}
 
 		/*
 		 * Record which node the original page is from and save this
-- 
2.49.0




* Re: [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper
  2025-10-08  4:37 ` [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper Lance Yang
@ 2025-10-09  1:07   ` Andrew Morton
  2025-10-09  1:49     ` Lance Yang
  2025-10-10  9:10   ` Dev Jain
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2025-10-09  1:07 UTC (permalink / raw)
  To: Lance Yang
  Cc: david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, ioworker0, richard.weiyang,
	linux-kernel, linux-mm

On Wed,  8 Oct 2025 12:37:48 +0800 Lance Yang <lance.yang@linux.dev> wrote:

> +		if (!cc->is_khugepaged ||
> +		    *unmapped <= khugepaged_max_ptes_swap) {
> +			/*
> +			 * Always be strict with uffd-wp enabled swap
> +			 * entries. Please see comment below for
> +			 * pte_uffd_wp().
> +			 */
> +			if (pte_swp_uffd_wp(pte)) {
> +				*scan_result = SCAN_PTE_UFFD_WP;
> +				return PTE_CHECK_FAIL;
> +			}
> +			return PTE_CHECK_CONTINUE;
> +		} else {
> +			*scan_result = SCAN_EXCEED_SWAP_PTE;
> +			count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> +			return PTE_CHECK_FAIL;
> +		}

I'm inclined to agree with checkpatch here.

WARNING: else is not generally useful after a break or return
#81: FILE: mm/khugepaged.c:574:
+			return PTE_CHECK_CONTINUE;
+		} else {

did you see this and disagree or did you forget to run checkpatch?

--- a/mm/khugepaged.c~mm-khugepaged-merge-pte-scanning-logic-into-a-new-helper-checkpatch-fixes
+++ a/mm/khugepaged.c
@@ -571,11 +571,10 @@ static inline int thp_collapse_check_pte
 		    (!cc->is_khugepaged ||
 		     *none_or_zero <= khugepaged_max_ptes_none)) {
 			return PTE_CHECK_CONTINUE;
-		} else {
-			*scan_result = SCAN_EXCEED_NONE_PTE;
-			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
-			return PTE_CHECK_FAIL;
 		}
+		*scan_result = SCAN_EXCEED_NONE_PTE;
+		count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+		return PTE_CHECK_FAIL;
 	} else if (!pte_present(pte)) {
 		if (!unmapped) {
 			*scan_result = SCAN_PTE_NON_PRESENT;
@@ -600,11 +599,10 @@ static inline int thp_collapse_check_pte
 				return PTE_CHECK_FAIL;
 			}
 			return PTE_CHECK_CONTINUE;
-		} else {
-			*scan_result = SCAN_EXCEED_SWAP_PTE;
-			count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
-			return PTE_CHECK_FAIL;
 		}
+		*scan_result = SCAN_EXCEED_SWAP_PTE;
+		count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
+		return PTE_CHECK_FAIL;
 	} else if (pte_uffd_wp(pte)) {
 		/*
 		 * Don't collapse the page if any of the small PTEs are
_




* Re: [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper
  2025-10-09  1:07   ` Andrew Morton
@ 2025-10-09  1:49     ` Lance Yang
  0 siblings, 0 replies; 19+ messages in thread
From: Lance Yang @ 2025-10-09  1:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, ioworker0, richard.weiyang,
	linux-kernel, linux-mm



On 2025/10/9 09:07, Andrew Morton wrote:
> On Wed,  8 Oct 2025 12:37:48 +0800 Lance Yang <lance.yang@linux.dev> wrote:
> 
>> +		if (!cc->is_khugepaged ||
>> +		    *unmapped <= khugepaged_max_ptes_swap) {
>> +			/*
>> +			 * Always be strict with uffd-wp enabled swap
>> +			 * entries. Please see comment below for
>> +			 * pte_uffd_wp().
>> +			 */
>> +			if (pte_swp_uffd_wp(pte)) {
>> +				*scan_result = SCAN_PTE_UFFD_WP;
>> +				return PTE_CHECK_FAIL;
>> +			}
>> +			return PTE_CHECK_CONTINUE;
>> +		} else {
>> +			*scan_result = SCAN_EXCEED_SWAP_PTE;
>> +			count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
>> +			return PTE_CHECK_FAIL;
>> +		}
> 
> I'm inclined to agree with checkpatch here.

Thanks!

> 
> WARNING: else is not generally useful after a break or return
> #81: FILE: mm/khugepaged.c:574:
> +			return PTE_CHECK_CONTINUE;
> +		} else {
> 
> did you see this and disagree or did you forget to run checkpatch?

Yes, I saw the warning. I kept the original style because this is just a 
code move ...

> 
> --- a/mm/khugepaged.c~mm-khugepaged-merge-pte-scanning-logic-into-a-new-helper-checkpatch-fixes
> +++ a/mm/khugepaged.c
> @@ -571,11 +571,10 @@ static inline int thp_collapse_check_pte
>   		    (!cc->is_khugepaged ||
>   		     *none_or_zero <= khugepaged_max_ptes_none)) {
>   			return PTE_CHECK_CONTINUE;
> -		} else {
> -			*scan_result = SCAN_EXCEED_NONE_PTE;
> -			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> -			return PTE_CHECK_FAIL;
>   		}
> +		*scan_result = SCAN_EXCEED_NONE_PTE;
> +		count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> +		return PTE_CHECK_FAIL;
>   	} else if (!pte_present(pte)) {
>   		if (!unmapped) {
>   			*scan_result = SCAN_PTE_NON_PRESENT;
> @@ -600,11 +599,10 @@ static inline int thp_collapse_check_pte
>   				return PTE_CHECK_FAIL;
>   			}
>   			return PTE_CHECK_CONTINUE;
> -		} else {
> -			*scan_result = SCAN_EXCEED_SWAP_PTE;
> -			count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> -			return PTE_CHECK_FAIL;
>   		}
> +		*scan_result = SCAN_EXCEED_SWAP_PTE;
> +		count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> +		return PTE_CHECK_FAIL;
>   	} else if (pte_uffd_wp(pte)) {
>   		/*
>   		 * Don't collapse the page if any of the small PTEs are
> _
> 




* Re: [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper
  2025-10-08  4:37 ` [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper Lance Yang
  2025-10-09  1:07   ` Andrew Morton
@ 2025-10-10  9:10   ` Dev Jain
  2025-10-10 10:42     ` Lance Yang
  2025-10-10 13:29   ` Wei Yang
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Dev Jain @ 2025-10-10  9:10 UTC (permalink / raw)
  To: Lance Yang, akpm, david, lorenzo.stoakes
  Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, baohua,
	ioworker0, richard.weiyang, linux-kernel, linux-mm



On 08/10/25 10:07 am, Lance Yang wrote:
>   	}
>   }
>   
> +/*
> + * thp_collapse_check_pte - Check if a PTE is suitable for THP collapse
> + * @pte:           The PTE to check
> + * @vma:           The VMA the PTE belongs to
> + * @addr:          The virtual address corresponding to this PTE
> + * @foliop:        On success, used to return a pointer to the folio
> + *                 Must be non-NULL
> + * @none_or_zero:  Counter for none/zero PTEs. Must be non-NULL
> + * @unmapped:      Counter for swap PTEs. Can be NULL if not scanning swaps
> + * @shared:        Counter for shared pages. Must be non-NULL
> + * @scan_result:   Used to return the failure reason (SCAN_*) on a
> + *                 PTE_CHECK_FAIL return. Must be non-NULL
> + * @cc:            Collapse control settings
> + *
> + * Returns:
> + *   PTE_CHECK_SUCCEED  - PTE is suitable, proceed with further checks
> + *   PTE_CHECK_CONTINUE - Skip this PTE and continue scanning
> + *   PTE_CHECK_FAIL     - Abort collapse scan
> + */
> +static inline int thp_collapse_check_pte(pte_t pte, struct vm_area_struct *vma,
> +		unsigned long addr, struct folio **foliop, int *none_or_zero,
> +		int *unmapped, int *shared, int *scan_result,
> +		struct collapse_control *cc)
> +{
> +	struct folio *folio = NULL;

I think initialization is not needed here? Otherwise, LGTM.

Reviewed-by: Dev Jain <dev.jain@arm.com>



* Re: [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper
  2025-10-10  9:10   ` Dev Jain
@ 2025-10-10 10:42     ` Lance Yang
  0 siblings, 0 replies; 19+ messages in thread
From: Lance Yang @ 2025-10-10 10:42 UTC (permalink / raw)
  To: Dev Jain
  Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, baohua,
	ioworker0, david, richard.weiyang, lorenzo.stoakes, linux-kernel,
	akpm, linux-mm



On 2025/10/10 17:10, Dev Jain wrote:
> 
> On 08/10/25 10:07 am, Lance Yang wrote:
>>   	}
>>   }
>>   
>> +/*
>> + * thp_collapse_check_pte - Check if a PTE is suitable for THP collapse
>> + * @pte:           The PTE to check
>> + * @vma:           The VMA the PTE belongs to
>> + * @addr:          The virtual address corresponding to this PTE
>> + * @foliop:        On success, used to return a pointer to the folio
>> + *                 Must be non-NULL
>> + * @none_or_zero:  Counter for none/zero PTEs. Must be non-NULL
>> + * @unmapped:      Counter for swap PTEs. Can be NULL if not scanning swaps
>> + * @shared:        Counter for shared pages. Must be non-NULL
>> + * @scan_result:   Used to return the failure reason (SCAN_*) on a
>> + *                 PTE_CHECK_FAIL return. Must be non-NULL
>> + * @cc:            Collapse control settings
>> + *
>> + * Returns:
>> + *   PTE_CHECK_SUCCEED  - PTE is suitable, proceed with further checks
>> + *   PTE_CHECK_CONTINUE - Skip this PTE and continue scanning
>> + *   PTE_CHECK_FAIL     - Abort collapse scan
>> + */
>> +static inline int thp_collapse_check_pte(pte_t pte, struct vm_area_struct *vma,
>> +		unsigned long addr, struct folio **foliop, int *none_or_zero,
>> +		int *unmapped, int *shared, int *scan_result,
>> +		struct collapse_control *cc)
>> +{
>> +	struct folio *folio = NULL;
> 
> I think initialization is not needed here? Otherwise, LGTM.
> 
> Reviewed-by: Dev Jain <dev.jain@arm.com>

Yep. It's a minor thing, so I'll fold that in if a new version is 
needed. Thanks!



* Re: [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper
  2025-10-08  4:37 ` [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper Lance Yang
  2025-10-09  1:07   ` Andrew Morton
  2025-10-10  9:10   ` Dev Jain
@ 2025-10-10 13:29   ` Wei Yang
  2025-10-10 13:55     ` Lance Yang
  2025-10-14 12:36   ` Lorenzo Stoakes
  2025-10-14 17:41   ` Lorenzo Stoakes
  4 siblings, 1 reply; 19+ messages in thread
From: Wei Yang @ 2025-10-10 13:29 UTC (permalink / raw)
  To: Lance Yang
  Cc: akpm, david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, ioworker0,
	richard.weiyang, linux-kernel, linux-mm

On Wed, Oct 08, 2025 at 12:37:48PM +0800, Lance Yang wrote:
>From: Lance Yang <lance.yang@linux.dev>
>
>As David suggested, the PTE scanning logic in hpage_collapse_scan_pmd()
>and __collapse_huge_page_isolate() was almost duplicated.
>
>This patch cleans things up by moving all the common PTE checking logic
>into a new shared helper, thp_collapse_check_pte(). While at it, we use
>vm_normal_folio() instead of vm_normal_page().
>
>Suggested-by: David Hildenbrand <david@redhat.com>
>Suggested-by: Dev Jain <dev.jain@arm.com>
>Signed-off-by: Lance Yang <lance.yang@linux.dev>
>---
> mm/khugepaged.c | 243 ++++++++++++++++++++++++++----------------------
> 1 file changed, 130 insertions(+), 113 deletions(-)
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index b5c0295c3414..7116caae1fa4 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -61,6 +61,12 @@ enum scan_result {
> 	SCAN_PAGE_FILLED,
> };
> 
>+enum pte_check_result {
>+	PTE_CHECK_SUCCEED,
>+	PTE_CHECK_CONTINUE,
>+	PTE_CHECK_FAIL,
>+};
>+
> #define CREATE_TRACE_POINTS
> #include <trace/events/huge_memory.h>
> 
>@@ -533,62 +539,139 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
> 	}
> }
> 
>+/*
>+ * thp_collapse_check_pte - Check if a PTE is suitable for THP collapse
>+ * @pte:           The PTE to check
>+ * @vma:           The VMA the PTE belongs to
>+ * @addr:          The virtual address corresponding to this PTE
>+ * @foliop:        On success, used to return a pointer to the folio
>+ *                 Must be non-NULL
>+ * @none_or_zero:  Counter for none/zero PTEs. Must be non-NULL
>+ * @unmapped:      Counter for swap PTEs. Can be NULL if not scanning swaps
>+ * @shared:        Counter for shared pages. Must be non-NULL
>+ * @scan_result:   Used to return the failure reason (SCAN_*) on a
>+ *                 PTE_CHECK_FAIL return. Must be non-NULL
>+ * @cc:            Collapse control settings
>+ *
>+ * Returns:
>+ *   PTE_CHECK_SUCCEED  - PTE is suitable, proceed with further checks
>+ *   PTE_CHECK_CONTINUE - Skip this PTE and continue scanning
>+ *   PTE_CHECK_FAIL     - Abort collapse scan
>+ */
>+static inline int thp_collapse_check_pte(pte_t pte, struct vm_area_struct *vma,
>+		unsigned long addr, struct folio **foliop, int *none_or_zero,
>+		int *unmapped, int *shared, int *scan_result,
>+		struct collapse_control *cc)
>+{
>+	struct folio *folio = NULL;
>+
>+	if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {
>+		(*none_or_zero)++;
>+		if (!userfaultfd_armed(vma) &&
>+		    (!cc->is_khugepaged ||
>+		     *none_or_zero <= khugepaged_max_ptes_none)) {
>+			return PTE_CHECK_CONTINUE;
>+		} else {
>+			*scan_result = SCAN_EXCEED_NONE_PTE;
>+			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>+			return PTE_CHECK_FAIL;
>+		}
>+	} else if (!pte_present(pte)) {
>+		if (!unmapped) {
>+			*scan_result = SCAN_PTE_NON_PRESENT;
>+			return PTE_CHECK_FAIL;
>+		}
>+
>+		if (non_swap_entry(pte_to_swp_entry(pte))) {
>+			*scan_result = SCAN_PTE_NON_PRESENT;
>+			return PTE_CHECK_FAIL;
>+		}
>+
>+		(*unmapped)++;
>+		if (!cc->is_khugepaged ||
>+		    *unmapped <= khugepaged_max_ptes_swap) {
>+			/*
>+			 * Always be strict with uffd-wp enabled swap
>+			 * entries. Please see comment below for
>+			 * pte_uffd_wp().
>+			 */
>+			if (pte_swp_uffd_wp(pte)) {
>+				*scan_result = SCAN_PTE_UFFD_WP;
>+				return PTE_CHECK_FAIL;
>+			}
>+			return PTE_CHECK_CONTINUE;
>+		} else {
>+			*scan_result = SCAN_EXCEED_SWAP_PTE;
>+			count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
>+			return PTE_CHECK_FAIL;
>+		}
>+	} else if (pte_uffd_wp(pte)) {
>+		/*
>+		 * Don't collapse the page if any of the small PTEs are
>+		 * armed with uffd write protection. Here we can also mark
>+		 * the new huge pmd as write protected if any of the small
>+		 * ones is marked but that could bring unknown userfault
>+		 * messages that falls outside of the registered range.
>+		 * So, just be simple.
>+		 */
>+		*scan_result = SCAN_PTE_UFFD_WP;
>+		return PTE_CHECK_FAIL;
>+	}
>+
>+	folio = vm_normal_folio(vma, addr, pte);
>+	if (unlikely(!folio) || unlikely(folio_is_zone_device(folio))) {
>+		*scan_result = SCAN_PAGE_NULL;
>+		return PTE_CHECK_FAIL;
>+	}
>+
>+	if (!folio_test_anon(folio)) {
>+		VM_WARN_ON_FOLIO(true, folio);
>+		*scan_result = SCAN_PAGE_ANON;
>+		return PTE_CHECK_FAIL;
>+	}
>+
>+	/*
>+	 * We treat a single page as shared if any part of the THP
>+	 * is shared.
>+	 */
>+	if (folio_maybe_mapped_shared(folio)) {
>+		(*shared)++;
>+		if (cc->is_khugepaged && *shared > khugepaged_max_ptes_shared) {
>+			*scan_result = SCAN_EXCEED_SHARED_PTE;
>+			count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>+			return PTE_CHECK_FAIL;
>+		}
>+	}
>+
>+	*foliop = folio;
>+
>+	return PTE_CHECK_SUCCEED;
>+}
>+

This one looks much better.

That said, my personal feeling is that this is not complete work for merging the
scanning logic. We still have the folio_expected_ref_count() and pte_young()
checks present in both __collapse_huge_page_isolate() and hpage_collapse_scan_pmd().

-- 
Wei Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper
  2025-10-10 13:29   ` Wei Yang
@ 2025-10-10 13:55     ` Lance Yang
  0 siblings, 0 replies; 19+ messages in thread
From: Lance Yang @ 2025-10-10 13:55 UTC (permalink / raw)
  To: Wei Yang
  Cc: akpm, david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, ioworker0, linux-kernel,
	linux-mm



On 2025/10/10 21:29, Wei Yang wrote:
> On Wed, Oct 08, 2025 at 12:37:48PM +0800, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> As David suggested, the PTE scanning logic in hpage_collapse_scan_pmd()
>> and __collapse_huge_page_isolate() was almost duplicated.
>>
>> This patch cleans things up by moving all the common PTE checking logic
>> into a new shared helper, thp_collapse_check_pte(). While at it, we use
>> vm_normal_folio() instead of vm_normal_page().
>>
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Suggested-by: Dev Jain <dev.jain@arm.com>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>> mm/khugepaged.c | 243 ++++++++++++++++++++++++++----------------------
>> 1 file changed, 130 insertions(+), 113 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index b5c0295c3414..7116caae1fa4 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -61,6 +61,12 @@ enum scan_result {
>> 	SCAN_PAGE_FILLED,
>> };
>>
>> +enum pte_check_result {
>> +	PTE_CHECK_SUCCEED,
>> +	PTE_CHECK_CONTINUE,
>> +	PTE_CHECK_FAIL,
>> +};
>> +
>> #define CREATE_TRACE_POINTS
>> #include <trace/events/huge_memory.h>
>>
>> @@ -533,62 +539,139 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
>> 	}
>> }
>>
>> +/*
>> + * thp_collapse_check_pte - Check if a PTE is suitable for THP collapse
>> + * @pte:           The PTE to check
>> + * @vma:           The VMA the PTE belongs to
>> + * @addr:          The virtual address corresponding to this PTE
>> + * @foliop:        On success, used to return a pointer to the folio
>> + *                 Must be non-NULL
>> + * @none_or_zero:  Counter for none/zero PTEs. Must be non-NULL
>> + * @unmapped:      Counter for swap PTEs. Can be NULL if not scanning swaps
>> + * @shared:        Counter for shared pages. Must be non-NULL
>> + * @scan_result:   Used to return the failure reason (SCAN_*) on a
>> + *                 PTE_CHECK_FAIL return. Must be non-NULL
>> + * @cc:            Collapse control settings
>> + *
>> + * Returns:
>> + *   PTE_CHECK_SUCCEED  - PTE is suitable, proceed with further checks
>> + *   PTE_CHECK_CONTINUE - Skip this PTE and continue scanning
>> + *   PTE_CHECK_FAIL     - Abort collapse scan
>> + */
>> +static inline int thp_collapse_check_pte(pte_t pte, struct vm_area_struct *vma,
>> +		unsigned long addr, struct folio **foliop, int *none_or_zero,
>> +		int *unmapped, int *shared, int *scan_result,
>> +		struct collapse_control *cc)
>> +{
>> +	struct folio *folio = NULL;
>> +
>> +	if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {
>> +		(*none_or_zero)++;
>> +		if (!userfaultfd_armed(vma) &&
>> +		    (!cc->is_khugepaged ||
>> +		     *none_or_zero <= khugepaged_max_ptes_none)) {
>> +			return PTE_CHECK_CONTINUE;
>> +		} else {
>> +			*scan_result = SCAN_EXCEED_NONE_PTE;
>> +			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>> +			return PTE_CHECK_FAIL;
>> +		}
>> +	} else if (!pte_present(pte)) {
>> +		if (!unmapped) {
>> +			*scan_result = SCAN_PTE_NON_PRESENT;
>> +			return PTE_CHECK_FAIL;
>> +		}
>> +
>> +		if (non_swap_entry(pte_to_swp_entry(pte))) {
>> +			*scan_result = SCAN_PTE_NON_PRESENT;
>> +			return PTE_CHECK_FAIL;
>> +		}
>> +
>> +		(*unmapped)++;
>> +		if (!cc->is_khugepaged ||
>> +		    *unmapped <= khugepaged_max_ptes_swap) {
>> +			/*
>> +			 * Always be strict with uffd-wp enabled swap
>> +			 * entries. Please see comment below for
>> +			 * pte_uffd_wp().
>> +			 */
>> +			if (pte_swp_uffd_wp(pte)) {
>> +				*scan_result = SCAN_PTE_UFFD_WP;
>> +				return PTE_CHECK_FAIL;
>> +			}
>> +			return PTE_CHECK_CONTINUE;
>> +		} else {
>> +			*scan_result = SCAN_EXCEED_SWAP_PTE;
>> +			count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
>> +			return PTE_CHECK_FAIL;
>> +		}
>> +	} else if (pte_uffd_wp(pte)) {
>> +		/*
>> +		 * Don't collapse the page if any of the small PTEs are
>> +		 * armed with uffd write protection. Here we can also mark
>> +		 * the new huge pmd as write protected if any of the small
>> +		 * ones is marked but that could bring unknown userfault
>> +		 * messages that falls outside of the registered range.
>> +		 * So, just be simple.
>> +		 */
>> +		*scan_result = SCAN_PTE_UFFD_WP;
>> +		return PTE_CHECK_FAIL;
>> +	}
>> +
>> +	folio = vm_normal_folio(vma, addr, pte);
>> +	if (unlikely(!folio) || unlikely(folio_is_zone_device(folio))) {
>> +		*scan_result = SCAN_PAGE_NULL;
>> +		return PTE_CHECK_FAIL;
>> +	}
>> +
>> +	if (!folio_test_anon(folio)) {
>> +		VM_WARN_ON_FOLIO(true, folio);
>> +		*scan_result = SCAN_PAGE_ANON;
>> +		return PTE_CHECK_FAIL;
>> +	}
>> +
>> +	/*
>> +	 * We treat a single page as shared if any part of the THP
>> +	 * is shared.
>> +	 */
>> +	if (folio_maybe_mapped_shared(folio)) {
>> +		(*shared)++;
>> +		if (cc->is_khugepaged && *shared > khugepaged_max_ptes_shared) {
>> +			*scan_result = SCAN_EXCEED_SHARED_PTE;
>> +			count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>> +			return PTE_CHECK_FAIL;
>> +		}
>> +	}
>> +
>> +	*foliop = folio;
>> +
>> +	return PTE_CHECK_SUCCEED;
>> +}
>> +
> 
> This one looks much better.
> 
> That said, my personal feeling is that this is not complete work for merging the
> scanning logic. We still have the folio_expected_ref_count() and pte_young()
> checks present in both __collapse_huge_page_isolate() and hpage_collapse_scan_pmd().

Yep, good catch!

There's definitely more that can be done. For now, let's keep this patch
as-is to avoid making it too complex. Let's get this one in first, and
then we can work on the next step :)




* Re: [PATCH mm-new v3 1/3] mm/khugepaged: optimize PTE scanning with if-else-if-else-if chain
  2025-10-08  4:37 ` [PATCH mm-new v3 1/3] mm/khugepaged: optimize PTE scanning with if-else-if-else-if chain Lance Yang
@ 2025-10-14 12:17   ` Lorenzo Stoakes
  2025-10-14 12:27     ` David Hildenbrand
  0 siblings, 1 reply; 19+ messages in thread
From: Lorenzo Stoakes @ 2025-10-14 12:17 UTC (permalink / raw)
  To: Lance Yang
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, ioworker0, richard.weiyang,
	linux-kernel, linux-mm

On Wed, Oct 08, 2025 at 12:37:46PM +0800, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> As pointed out by Dev, the PTE checks for disjoint conditions in the
> scanning loops can be optimized. is_swap_pte, (pte_none && is_zero_pfn),
> and pte_uffd_wp are mutually exclusive.

But you're not using is_swap_pte anywhere :) This comes back to my review
question on the series this is dependent upon.

>
> This patch refactors the loops in both __collapse_huge_page_isolate() and
> hpage_collapse_scan_pmd() to use a continuous if-else-if-else-if chain
> instead of separate if blocks. While at it, the redundant pte_present()
> check before is_zero_pfn() is also removed.

I mean see review below, I don't see why you're doing this and I am
unconvinced by how redundant that check is.

Also this just feels like it should be part of the series where you change
these? I'm not sure why this is separate?

>
> Also, this is a preparatory step to make it easier to merge the
> almost-duplicated scanning logic in these two functions, as suggested
> by David.
>
> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
> Reviewed-by: Dev Jain <dev.jain@arm.com>
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Suggested-by: Dev Jain <dev.jain@arm.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
>  mm/khugepaged.c | 12 ++++--------
>  1 file changed, 4 insertions(+), 8 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index bec3e268dc76..e3e27223137a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -548,8 +548,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>  	     _pte++, addr += PAGE_SIZE) {
>  		pte_t pteval = ptep_get(_pte);
> -		if (pte_none(pteval) || (pte_present(pteval) &&
> -				is_zero_pfn(pte_pfn(pteval)))) {
> +		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {

You can have non-pte_none() non-present entries, no? Isn't pte_present() a
prerequisite for pte_pfn() to be valid? If it's a swap entry, couldn't you
end up accidentally (unlikely but still) hitting this?

Seems like this is required isn't it? I may be missing something here...
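To make the concern concrete, here is a toy userspace model of the aliasing worry (the bit layout below is invented purely for illustration; real pte encodings are arch-specific and none of these names are the kernel's):

```c
#include <assert.h>

/* Toy model: pte_pfn() is only meaningful for present ptes. For a
 * non-present (swap) entry the same bits encode type/offset, so the
 * extracted "pfn" can coincidentally match anything, including the
 * zero pfn. */
#define MODEL_PRESENT_BIT 0x1UL
#define MODEL_ZERO_PFN    0x42UL	/* arbitrary for this model */

static unsigned long model_pte_pfn(unsigned long pte)
{
	return pte >> 1;		/* bits above the present bit */
}

static int model_pte_present(unsigned long pte)
{
	return pte & MODEL_PRESENT_BIT;
}

static int model_is_zero_pfn(unsigned long pfn)
{
	return pfn == MODEL_ZERO_PFN;
}
```

A swap entry whose offset bits happen to line up would then pass an unguarded is_zero_pfn(pte_pfn(pte)) check, which is exactly the accidental hit described above.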

>  			++none_or_zero;
>  			if (!userfaultfd_armed(vma) &&
>  			    (!cc->is_khugepaged ||
> @@ -560,12 +559,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>  				goto out;
>  			}
> -		}
> -		if (!pte_present(pteval)) {
> +		} else if (!pte_present(pteval)) {

This seems pointless, since the above logic will either continue or goto
out, right?

>  			result = SCAN_PTE_NON_PRESENT;
>  			goto out;
> -		}
> -		if (pte_uffd_wp(pteval)) {
> +		} else if (pte_uffd_wp(pteval)) {

Again, what is the point of an else when the if() branch unconditionally
->out?

>  			result = SCAN_PTE_UFFD_WP;
>  			goto out;
>  		}
> @@ -1321,8 +1318,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>  				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
>  				goto out_unmap;
>  			}
> -		}
> -		if (pte_uffd_wp(pteval)) {
> +		} else if (pte_uffd_wp(pteval)) {

Same comment as above, I'm really confused about the purpose of this logic?


>  			/*
>  			 * Don't collapse the page if any of the small
>  			 * PTEs are armed with uffd write protection.
> --
> 2.49.0
>



* Re: [PATCH mm-new v3 2/3] mm/khugepaged: use VM_WARN_ON_FOLIO instead of VM_BUG_ON_FOLIO for non-anon folios
  2025-10-08  4:37 ` [PATCH mm-new v3 2/3] mm/khugepaged: use VM_WARN_ON_FOLIO instead of VM_BUG_ON_FOLIO for non-anon folios Lance Yang
@ 2025-10-14 12:25   ` Lorenzo Stoakes
  0 siblings, 0 replies; 19+ messages in thread
From: Lorenzo Stoakes @ 2025-10-14 12:25 UTC (permalink / raw)
  To: Lance Yang
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, ioworker0, richard.weiyang,
	linux-kernel, linux-mm

On Wed, Oct 08, 2025 at 12:37:47PM +0800, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> As Zi pointed out, we should avoid crashing the kernel for conditions
> that can be handled gracefully. Encountering a non-anonymous folio in an
> anonymous VMA is a bug, but a warning is sufficient.
>
> This patch changes the VM_BUG_ON_FOLIO(!folio_test_anon(folio)) to a
> VM_WARN_ON_FOLIO() in both __collapse_huge_page_isolate() and
> hpage_collapse_scan_pmd(), and then aborts the scan with SCAN_PAGE_ANON.

Well no, in hpage_collapse_scan_pmd() there is no warning at all.

>
> Making more of the scanning logic common between hpage_collapse_scan_pmd()
> and __collapse_huge_page_isolate(), as suggested by Dev.

I mean I guess it's fine but I'm not sure it's really necessary to give a
blow-by-blow of who suggested what in the actual commit message :) This
isn't really useful information for somebody looking at this code in the
future.

>
> Suggested-by: Dev Jain <dev.jain@arm.com>
> Suggested-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
> Reviewed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
>  mm/khugepaged.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index e3e27223137a..b5c0295c3414 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -573,7 +573,11 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		}
>
>  		folio = page_folio(page);
> -		VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
> +		if (!folio_test_anon(folio)) {
> +			VM_WARN_ON_FOLIO(true, folio);
> +			result = SCAN_PAGE_ANON;
> +			goto out;

Hmm this is iffy I'm not sure I agree with Zi here - the purpose of
VM_WARN_ON() etc. is for things that programmatically _should not
happen_.

Now every single time we run this code we're doing this check.

AND it implies that it is an actual possibility, at run time, for this to be
the case.

I really don't like this.

Also if it's a runtime check this should be a WARN_ON_ONCE() not a
VM_WARN_ON_ONCE(). Of course you lose the folio output then. So this code
is very confused.

In general I don't think we should be doing this at all, rather we should
just convert the VM_BUG_ON_FOLIO() to a VM_WARN_ON_FOLIO().
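For what it's worth, the warn-then-bail-gracefully shape under discussion can be modelled in userspace roughly like this (toy macros and a hypothetical scan_one(), not the kernel's API):

```c
#include <assert.h>

/* Toy model: a BUG-style assert takes the whole system down, whereas a
 * WARN-style check complains once and lets the caller fail gracefully
 * with a SCAN_*-like error code. */
static int warn_count;
#define MODEL_WARN_ON(cond) do { if (cond) warn_count++; } while (0)

enum { MODEL_SCAN_SUCCEED, MODEL_SCAN_PAGE_ANON };

static int scan_one(int folio_is_anon)
{
	if (!folio_is_anon) {
		MODEL_WARN_ON(1);		/* "should never happen"... */
		return MODEL_SCAN_PAGE_ANON;	/* ...abort the scan, not the kernel */
	}
	return MODEL_SCAN_SUCCEED;
}
```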


> +		}
>
>  		/* See hpage_collapse_scan_pmd(). */
>  		if (folio_maybe_mapped_shared(folio)) {
> @@ -1340,6 +1344,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>  		folio = page_folio(page);
>
>  		if (!folio_test_anon(folio)) {
> +			VM_WARN_ON_FOLIO(true, folio);

Err, what? This is a condition that should never, ever happen to the point
that we warn on it?

This surely is a condition that we expect to happen sometimes otherwise we
wouldn't do this no?

Either way the comments above still apply. Also VM_WARN_ON_FOLIO(true, ...)
is kinda gross... if this is an actual pattern that exists, VM_WARN_FOLIO()
would be preferable.

>  			result = SCAN_PAGE_ANON;
>  			goto out_unmap;
>  		}
> --
> 2.49.0
>



* Re: [PATCH mm-new v3 1/3] mm/khugepaged: optimize PTE scanning with if-else-if-else-if chain
  2025-10-14 12:17   ` Lorenzo Stoakes
@ 2025-10-14 12:27     ` David Hildenbrand
  2025-10-15  4:49       ` Lance Yang
  0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2025-10-14 12:27 UTC (permalink / raw)
  To: Lorenzo Stoakes, Lance Yang
  Cc: akpm, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, baohua, ioworker0, richard.weiyang, linux-kernel,
	linux-mm

On 14.10.25 14:17, Lorenzo Stoakes wrote:
> On Wed, Oct 08, 2025 at 12:37:46PM +0800, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> As pointed out by Dev, the PTE checks for disjoint conditions in the
>> scanning loops can be optimized. is_swap_pte, (pte_none && is_zero_pfn),
>> and pte_uffd_wp are mutually exclusive.
> 
> But you're not using is_swap_pte anywhere :) This comes back to my review
> question on the series this is dependent upon.
> 
>>
>> This patch refactors the loops in both __collapse_huge_page_isolate() and
>> hpage_collapse_scan_pmd() to use a continuous if-else-if-else-if chain
>> instead of separate if blocks. While at it, the redundant pte_present()
>> check before is_zero_pfn() is also removed.
> 
> I mean see review below, I don't see why you're doing this and I am
> unconvinced by how redundant that check is.
> 
> Also this just feels like it should be part of the series where you change
> these? I'm not sure why this is separate?

I think Lance is trying to unify both scanning functions to look alike,
so that when he factors them out in patch #3 the result looks more
straightforward.

The missing pte_present() check in hpage_collapse_scan_pmd() is interesting.

Likely there is one such check missing there?

-- 
Cheers

David / dhildenb




* Re: [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper
  2025-10-08  4:37 ` [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper Lance Yang
                     ` (2 preceding siblings ...)
  2025-10-10 13:29   ` Wei Yang
@ 2025-10-14 12:36   ` Lorenzo Stoakes
  2025-10-14 17:41   ` Lorenzo Stoakes
  4 siblings, 0 replies; 19+ messages in thread
From: Lorenzo Stoakes @ 2025-10-14 12:36 UTC (permalink / raw)
  To: Lance Yang
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, ioworker0, richard.weiyang,
	linux-kernel, linux-mm

On Wed, Oct 08, 2025 at 12:37:48PM +0800, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> As David suggested, the PTE scanning logic in hpage_collapse_scan_pmd()
> and __collapse_huge_page_isolate() was almost duplicated.
>
> This patch cleans things up by moving all the common PTE checking logic
> into a new shared helper, thp_collapse_check_pte(). While at it, we use
> vm_normal_folio() instead of vm_normal_page().
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>

In general I like the idea, the implementation needs work though.

Will check in more detail on respin

> ---
>  mm/khugepaged.c | 243 ++++++++++++++++++++++++++----------------------
>  1 file changed, 130 insertions(+), 113 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b5c0295c3414..7116caae1fa4 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -61,6 +61,12 @@ enum scan_result {
>  	SCAN_PAGE_FILLED,
>  };
>
> +enum pte_check_result {
> +	PTE_CHECK_SUCCEED,
> +	PTE_CHECK_CONTINUE,

Don't love this logic - it feels like we're essentially abstracting
control flow. What does 'continue' mean here, other than 'you're in a
loop, please continue'? That relies a little too much on what the caller
does.

If we retain this logic, something like PTE_CHECK_SKIP would make more sense.


> +	PTE_CHECK_FAIL,
> +};
> +
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/huge_memory.h>
>
> @@ -533,62 +539,139 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
>  	}
>  }
>
> +/*
> + * thp_collapse_check_pte - Check if a PTE is suitable for THP collapse
> + * @pte:           The PTE to check
> + * @vma:           The VMA the PTE belongs to
> + * @addr:          The virtual address corresponding to this PTE
> + * @foliop:        On success, used to return a pointer to the folio
> + *                 Must be non-NULL
> + * @none_or_zero:  Counter for none/zero PTEs. Must be non-NULL
> + * @unmapped:      Counter for swap PTEs. Can be NULL if not scanning swaps
> + * @shared:        Counter for shared pages. Must be non-NULL
> + * @scan_result:   Used to return the failure reason (SCAN_*) on a
> + *                 PTE_CHECK_FAIL return. Must be non-NULL
> + * @cc:            Collapse control settings

Do we really need this many parameters? This is hard to follow.

Of course it being me, I'd immediately prefer a helper struct :)
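Something like this, perhaps - a rough userspace sketch with hypothetical names, modelling just the none/zero branch with the kernel predicates stubbed out as plain booleans:

```c
#include <assert.h>
#include <stdbool.h>

enum pte_check_result { PTE_CHECK_SUCCEED, PTE_CHECK_SKIP, PTE_CHECK_FAIL };

/* Bundle the in/out counters so the helper takes one pointer instead
 * of five. All names here are hypothetical, not the kernel's. */
struct pte_scan_state {
	int none_or_zero;
	int unmapped;
	int shared;
	int scan_result;	/* failure reason on PTE_CHECK_FAIL */
};

static enum pte_check_result check_none_or_zero(struct pte_scan_state *st,
						bool uffd_armed,
						bool is_khugepaged,
						int max_ptes_none)
{
	st->none_or_zero++;
	if (!uffd_armed &&
	    (!is_khugepaged || st->none_or_zero <= max_ptes_none))
		return PTE_CHECK_SKIP;
	st->scan_result = 1;	/* stands in for SCAN_EXCEED_NONE_PTE */
	return PTE_CHECK_FAIL;
}
```

The caller would then pass a single &state and read the counters/result back out of it, instead of threading five separate int pointers through.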

> + *
> + * Returns:
> + *   PTE_CHECK_SUCCEED  - PTE is suitable, proceed with further checks
> + *   PTE_CHECK_CONTINUE - Skip this PTE and continue scanning
> + *   PTE_CHECK_FAIL     - Abort collapse scan
> + */
> +static inline int thp_collapse_check_pte(pte_t pte, struct vm_area_struct *vma,

There's no need for inline in an internal static function in a file.

> +		unsigned long addr, struct folio **foliop, int *none_or_zero,
> +		int *unmapped, int *shared, int *scan_result,
> +		struct collapse_control *cc)
> +{
> +	struct folio *folio = NULL;
> +
> +	if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {
> +		(*none_or_zero)++;
> +		if (!userfaultfd_armed(vma) &&
> +		    (!cc->is_khugepaged ||
> +		     *none_or_zero <= khugepaged_max_ptes_none)) {
> +			return PTE_CHECK_CONTINUE;
> +		} else {
> +			*scan_result = SCAN_EXCEED_NONE_PTE;
> +			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> +			return PTE_CHECK_FAIL;
> +		}
> +	} else if (!pte_present(pte)) {

You're now making the if/else issues from the previous patches worse by
returning, which gets even checkpatch to warn you.

	if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {

(of course note that I am not convinced you can drop the pte_present()
check here)

		(*none_or_zero)++;
		if (!userfaultfd_armed(vma) &&
		    (!cc->is_khugepaged ||
		     *none_or_zero <= khugepaged_max_ptes_none))
			return PTE_CHECK_CONTINUE;

		*scan_result = SCAN_EXCEED_NONE_PTE;
		count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
		return PTE_CHECK_FAIL;
	}

	if (!pte_present(pte)) {
		...
	}

But even that really needs more serious refactoring - the condition above
is horrible, for instance. So, e.g.:

	if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {
		(*none_or_zero)++;

		/* Cases where this is acceptable. */
		if (!userfaultfd_armed(vma))
			return PTE_CHECK_SKIP;
		if (!cc->is_khugepaged)
			return PTE_CHECK_SKIP;
		if (*none_or_zero <= khugepaged_max_ptes_none)
			return PTE_CHECK_SKIP;

		/* Otherwise, we must abort the scan. */
		*scan_result = SCAN_EXCEED_NONE_PTE;
		count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
		return PTE_CHECK_FAIL;
	}

	if (!pte_present(pte)) {
		...
	}

Improves things a lot.

I do however _hate_ this (*blah)++; thing. A helper struct would help :)



> +		if (!unmapped) {
> +			*scan_result = SCAN_PTE_NON_PRESENT;
> +			return PTE_CHECK_FAIL;

Can't we have a helper that sets the result, e.g.:

	return pte_check_fail(scan_result, SCAN_PTE_NON_PRESENT);

static enum pte_check_result pte_check_fail(int *scan_result,
		enum scan_result result)
{
	*scan_result = result;
	return PTE_CHECK_FAIL;
}

And same for success/skip

Then all of these horrible if (blah) { *foo = bar; return baz; } can be
made into

	if (blah)
		return pte_check_xxx(..., SCAN_PTE_...);

> +		}
> +
> +		if (non_swap_entry(pte_to_swp_entry(pte))) {
> +			*scan_result = SCAN_PTE_NON_PRESENT;
> +			return PTE_CHECK_FAIL;
> +		}
> +
> +		(*unmapped)++;
> +		if (!cc->is_khugepaged ||
> +		    *unmapped <= khugepaged_max_ptes_swap) {
> +			/*
> +			 * Always be strict with uffd-wp enabled swap
> +			 * entries. Please see comment below for
> +			 * pte_uffd_wp().
> +			 */
> +			if (pte_swp_uffd_wp(pte)) {
> +				*scan_result = SCAN_PTE_UFFD_WP;
> +				return PTE_CHECK_FAIL;
> +			}
> +			return PTE_CHECK_CONTINUE;
> +		} else {
> +			*scan_result = SCAN_EXCEED_SWAP_PTE;
> +			count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> +			return PTE_CHECK_FAIL;
> +		}
> +	} else if (pte_uffd_wp(pte)) {
> +		/*
> +		 * Don't collapse the page if any of the small PTEs are
> +		 * armed with uffd write protection. Here we can also mark
> +		 * the new huge pmd as write protected if any of the small
> +		 * ones is marked but that could bring unknown userfault
> +		 * messages that falls outside of the registered range.
> +		 * So, just be simple.
> +		 */
> +		*scan_result = SCAN_PTE_UFFD_WP;
> +		return PTE_CHECK_FAIL;
> +	}
> +

Again, all of the comments on the previous series still apply here, so
obviously I don't like this as-is :)

And as Andrew has pointed out, now checkpatch is finding the 'pointless
else' issue (as mentioned above also).

> +	folio = vm_normal_folio(vma, addr, pte);
> +	if (unlikely(!folio) || unlikely(folio_is_zone_device(folio))) {
> +		*scan_result = SCAN_PAGE_NULL;
> +		return PTE_CHECK_FAIL;
> +	}
> +
> +	if (!folio_test_anon(folio)) {
> +		VM_WARN_ON_FOLIO(true, folio);
> +		*scan_result = SCAN_PAGE_ANON;
> +		return PTE_CHECK_FAIL;
> +	}
> +
> +	/*
> +	 * We treat a single page as shared if any part of the THP
> +	 * is shared.
> +	 */
> +	if (folio_maybe_mapped_shared(folio)) {
> +		(*shared)++;
> +		if (cc->is_khugepaged && *shared > khugepaged_max_ptes_shared) {
> +			*scan_result = SCAN_EXCEED_SHARED_PTE;
> +			count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> +			return PTE_CHECK_FAIL;
> +		}
> +	}
> +
> +	*foliop = folio;
> +
> +	return PTE_CHECK_SUCCEED;
> +}
> +
>  static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  					unsigned long start_addr,
>  					pte_t *pte,
>  					struct collapse_control *cc,
>  					struct list_head *compound_pagelist)
>  {
> -	struct page *page = NULL;
>  	struct folio *folio = NULL;
>  	unsigned long addr = start_addr;
>  	pte_t *_pte;
>  	int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
> +	int pte_check_res;
>
>  	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>  	     _pte++, addr += PAGE_SIZE) {
>  		pte_t pteval = ptep_get(_pte);
> -		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> -			++none_or_zero;
> -			if (!userfaultfd_armed(vma) &&
> -			    (!cc->is_khugepaged ||
> -			     none_or_zero <= khugepaged_max_ptes_none)) {
> -				continue;
> -			} else {
> -				result = SCAN_EXCEED_NONE_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> -				goto out;
> -			}
> -		} else if (!pte_present(pteval)) {
> -			result = SCAN_PTE_NON_PRESENT;
> -			goto out;
> -		} else if (pte_uffd_wp(pteval)) {
> -			result = SCAN_PTE_UFFD_WP;
> -			goto out;
> -		}
> -		page = vm_normal_page(vma, addr, pteval);
> -		if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
> -			result = SCAN_PAGE_NULL;
> -			goto out;
> -		}
>
> -		folio = page_folio(page);
> -		if (!folio_test_anon(folio)) {
> -			VM_WARN_ON_FOLIO(true, folio);
> -			result = SCAN_PAGE_ANON;
> -			goto out;
> -		}
> +		pte_check_res = thp_collapse_check_pte(pteval, vma, addr,
> +					&folio, &none_or_zero, NULL, &shared,
> +					&result, cc);
>
> -		/* See hpage_collapse_scan_pmd(). */
> -		if (folio_maybe_mapped_shared(folio)) {
> -			++shared;
> -			if (cc->is_khugepaged &&
> -			    shared > khugepaged_max_ptes_shared) {
> -				result = SCAN_EXCEED_SHARED_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> -				goto out;
> -			}
> -		}
> +		if (pte_check_res == PTE_CHECK_CONTINUE)
> +			continue;
> +		else if (pte_check_res == PTE_CHECK_FAIL)
> +			goto out;
>
>  		if (folio_test_large(folio)) {
>  			struct folio *f;
> @@ -1264,11 +1347,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>  	pte_t *pte, *_pte;
>  	int result = SCAN_FAIL, referenced = 0;
>  	int none_or_zero = 0, shared = 0;
> -	struct page *page = NULL;
>  	struct folio *folio = NULL;
>  	unsigned long addr;
>  	spinlock_t *ptl;
>  	int node = NUMA_NO_NODE, unmapped = 0;
> +	int pte_check_res;
>
>  	VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
>
> @@ -1287,81 +1370,15 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>  	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>  	     _pte++, addr += PAGE_SIZE) {
>  		pte_t pteval = ptep_get(_pte);
> -		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> -			++none_or_zero;
> -			if (!userfaultfd_armed(vma) &&
> -			    (!cc->is_khugepaged ||
> -			     none_or_zero <= khugepaged_max_ptes_none)) {
> -				continue;
> -			} else {
> -				result = SCAN_EXCEED_NONE_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> -				goto out_unmap;
> -			}
> -		} else if (!pte_present(pteval)) {
> -			if (non_swap_entry(pte_to_swp_entry(pteval))) {
> -				result = SCAN_PTE_NON_PRESENT;
> -				goto out_unmap;
> -			}
>
> -			++unmapped;
> -			if (!cc->is_khugepaged ||
> -			    unmapped <= khugepaged_max_ptes_swap) {
> -				/*
> -				 * Always be strict with uffd-wp
> -				 * enabled swap entries.  Please see
> -				 * comment below for pte_uffd_wp().
> -				 */
> -				if (pte_swp_uffd_wp(pteval)) {
> -					result = SCAN_PTE_UFFD_WP;
> -					goto out_unmap;
> -				}
> -				continue;
> -			} else {
> -				result = SCAN_EXCEED_SWAP_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> -				goto out_unmap;
> -			}
> -		} else if (pte_uffd_wp(pteval)) {
> -			/*
> -			 * Don't collapse the page if any of the small
> -			 * PTEs are armed with uffd write protection.
> -			 * Here we can also mark the new huge pmd as
> -			 * write protected if any of the small ones is
> -			 * marked but that could bring unknown
> -			 * userfault messages that falls outside of
> -			 * the registered range.  So, just be simple.
> -			 */
> -			result = SCAN_PTE_UFFD_WP;
> -			goto out_unmap;
> -		}
> +		pte_check_res = thp_collapse_check_pte(pteval, vma, addr,
> +					&folio, &none_or_zero, &unmapped,
> +					&shared, &result, cc);
>
> -		page = vm_normal_page(vma, addr, pteval);
> -		if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
> -			result = SCAN_PAGE_NULL;
> -			goto out_unmap;
> -		}
> -		folio = page_folio(page);
> -
> -		if (!folio_test_anon(folio)) {
> -			VM_WARN_ON_FOLIO(true, folio);
> -			result = SCAN_PAGE_ANON;
> +		if (pte_check_res == PTE_CHECK_CONTINUE)
> +			continue;
> +		else if (pte_check_res == PTE_CHECK_FAIL)
>  			goto out_unmap;
> -		}
> -
> -		/*
> -		 * We treat a single page as shared if any part of the THP
> -		 * is shared.
> -		 */
> -		if (folio_maybe_mapped_shared(folio)) {
> -			++shared;
> -			if (cc->is_khugepaged &&
> -			    shared > khugepaged_max_ptes_shared) {
> -				result = SCAN_EXCEED_SHARED_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> -				goto out_unmap;
> -			}
> -		}
>
>  		/*
>  		 * Record which node the original page is from and save this
> --
> 2.49.0
>



* Re: [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper
  2025-10-08  4:37 ` [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper Lance Yang
                     ` (3 preceding siblings ...)
  2025-10-14 12:36   ` Lorenzo Stoakes
@ 2025-10-14 17:41   ` Lorenzo Stoakes
  2025-10-15  1:48     ` Lance Yang
  4 siblings, 1 reply; 19+ messages in thread
From: Lorenzo Stoakes @ 2025-10-14 17:41 UTC (permalink / raw)
  To: Lance Yang
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, ioworker0, richard.weiyang,
	linux-kernel, linux-mm

OK so this patch _is_ based on [0] already being applied then?

I don't see any is_swap_pte() here, so presumably so, right?

Can we just respin these series and put them together? Because now I've
reviewed a bunch of stuff in [0], which this series depends upon, and
you're saying we should now review this instead of that, and it's a bit
of a mess.

Once review here is dealt with can you combine [0] into this series please?

For [0] I would like you to reinstate the is_swap_pte() conditional as
(copiously :) discussed over there.

[0]: https://lore.kernel.org/all/20251008032657.72406-1-lance.yang@linux.dev/

Thanks, Lorenzo

On Wed, Oct 08, 2025 at 12:37:48PM +0800, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> As David suggested, the PTE scanning logic in hpage_collapse_scan_pmd()
> and __collapse_huge_page_isolate() was almost duplicated.
>
> This patch cleans things up by moving all the common PTE checking logic
> into a new shared helper, thp_collapse_check_pte(). While at it, we use
> vm_normal_folio() instead of vm_normal_page().
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
>  mm/khugepaged.c | 243 ++++++++++++++++++++++++++----------------------
>  1 file changed, 130 insertions(+), 113 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b5c0295c3414..7116caae1fa4 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -61,6 +61,12 @@ enum scan_result {
>  	SCAN_PAGE_FILLED,
>  };
>
> +enum pte_check_result {
> +	PTE_CHECK_SUCCEED,
> +	PTE_CHECK_CONTINUE,
> +	PTE_CHECK_FAIL,
> +};
> +
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/huge_memory.h>
>
> @@ -533,62 +539,139 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
>  	}
>  }
>
> +/*
> + * thp_collapse_check_pte - Check if a PTE is suitable for THP collapse
> + * @pte:           The PTE to check
> + * @vma:           The VMA the PTE belongs to
> + * @addr:          The virtual address corresponding to this PTE
> + * @foliop:        On success, used to return a pointer to the folio
> + *                 Must be non-NULL
> + * @none_or_zero:  Counter for none/zero PTEs. Must be non-NULL
> + * @unmapped:      Counter for swap PTEs. Can be NULL if not scanning swaps
> + * @shared:        Counter for shared pages. Must be non-NULL
> + * @scan_result:   Used to return the failure reason (SCAN_*) on a
> + *                 PTE_CHECK_FAIL return. Must be non-NULL
> + * @cc:            Collapse control settings
> + *
> + * Returns:
> + *   PTE_CHECK_SUCCEED  - PTE is suitable, proceed with further checks
> + *   PTE_CHECK_CONTINUE - Skip this PTE and continue scanning
> + *   PTE_CHECK_FAIL     - Abort collapse scan
> + */
> +static inline int thp_collapse_check_pte(pte_t pte, struct vm_area_struct *vma,
> +		unsigned long addr, struct folio **foliop, int *none_or_zero,
> +		int *unmapped, int *shared, int *scan_result,
> +		struct collapse_control *cc)
> +{
> +	struct folio *folio = NULL;
> +
> +	if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {
> +		(*none_or_zero)++;
> +		if (!userfaultfd_armed(vma) &&
> +		    (!cc->is_khugepaged ||
> +		     *none_or_zero <= khugepaged_max_ptes_none)) {
> +			return PTE_CHECK_CONTINUE;
> +		} else {
> +			*scan_result = SCAN_EXCEED_NONE_PTE;
> +			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> +			return PTE_CHECK_FAIL;
> +		}
> +	} else if (!pte_present(pte)) {
> +		if (!unmapped) {
> +			*scan_result = SCAN_PTE_NON_PRESENT;
> +			return PTE_CHECK_FAIL;
> +		}
> +
> +		if (non_swap_entry(pte_to_swp_entry(pte))) {
> +			*scan_result = SCAN_PTE_NON_PRESENT;
> +			return PTE_CHECK_FAIL;
> +		}
> +
> +		(*unmapped)++;
> +		if (!cc->is_khugepaged ||
> +		    *unmapped <= khugepaged_max_ptes_swap) {
> +			/*
> +			 * Always be strict with uffd-wp enabled swap
> +			 * entries. Please see comment below for
> +			 * pte_uffd_wp().
> +			 */
> +			if (pte_swp_uffd_wp(pte)) {
> +				*scan_result = SCAN_PTE_UFFD_WP;
> +				return PTE_CHECK_FAIL;
> +			}
> +			return PTE_CHECK_CONTINUE;
> +		} else {
> +			*scan_result = SCAN_EXCEED_SWAP_PTE;
> +			count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> +			return PTE_CHECK_FAIL;
> +		}
> +	} else if (pte_uffd_wp(pte)) {
> +		/*
> +		 * Don't collapse the page if any of the small PTEs are
> +		 * armed with uffd write protection. Here we can also mark
> +		 * the new huge pmd as write protected if any of the small
> +		 * ones is marked but that could bring unknown userfault
> +		 * messages that falls outside of the registered range.
> +		 * So, just be simple.
> +		 */
> +		*scan_result = SCAN_PTE_UFFD_WP;
> +		return PTE_CHECK_FAIL;
> +	}
> +
> +	folio = vm_normal_folio(vma, addr, pte);
> +	if (unlikely(!folio) || unlikely(folio_is_zone_device(folio))) {
> +		*scan_result = SCAN_PAGE_NULL;
> +		return PTE_CHECK_FAIL;
> +	}
> +
> +	if (!folio_test_anon(folio)) {
> +		VM_WARN_ON_FOLIO(true, folio);
> +		*scan_result = SCAN_PAGE_ANON;
> +		return PTE_CHECK_FAIL;
> +	}
> +
> +	/*
> +	 * We treat a single page as shared if any part of the THP
> +	 * is shared.
> +	 */
> +	if (folio_maybe_mapped_shared(folio)) {
> +		(*shared)++;
> +		if (cc->is_khugepaged && *shared > khugepaged_max_ptes_shared) {
> +			*scan_result = SCAN_EXCEED_SHARED_PTE;
> +			count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> +			return PTE_CHECK_FAIL;
> +		}
> +	}
> +
> +	*foliop = folio;
> +
> +	return PTE_CHECK_SUCCEED;
> +}
> +
>  static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  					unsigned long start_addr,
>  					pte_t *pte,
>  					struct collapse_control *cc,
>  					struct list_head *compound_pagelist)
>  {
> -	struct page *page = NULL;
>  	struct folio *folio = NULL;
>  	unsigned long addr = start_addr;
>  	pte_t *_pte;
>  	int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
> +	int pte_check_res;
>
>  	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>  	     _pte++, addr += PAGE_SIZE) {
>  		pte_t pteval = ptep_get(_pte);
> -		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> -			++none_or_zero;
> -			if (!userfaultfd_armed(vma) &&
> -			    (!cc->is_khugepaged ||
> -			     none_or_zero <= khugepaged_max_ptes_none)) {
> -				continue;
> -			} else {
> -				result = SCAN_EXCEED_NONE_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> -				goto out;
> -			}
> -		} else if (!pte_present(pteval)) {
> -			result = SCAN_PTE_NON_PRESENT;
> -			goto out;
> -		} else if (pte_uffd_wp(pteval)) {
> -			result = SCAN_PTE_UFFD_WP;
> -			goto out;
> -		}
> -		page = vm_normal_page(vma, addr, pteval);
> -		if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
> -			result = SCAN_PAGE_NULL;
> -			goto out;
> -		}
>
> -		folio = page_folio(page);
> -		if (!folio_test_anon(folio)) {
> -			VM_WARN_ON_FOLIO(true, folio);
> -			result = SCAN_PAGE_ANON;
> -			goto out;
> -		}
> +		pte_check_res = thp_collapse_check_pte(pteval, vma, addr,
> +					&folio, &none_or_zero, NULL, &shared,
> +					&result, cc);
>
> -		/* See hpage_collapse_scan_pmd(). */
> -		if (folio_maybe_mapped_shared(folio)) {
> -			++shared;
> -			if (cc->is_khugepaged &&
> -			    shared > khugepaged_max_ptes_shared) {
> -				result = SCAN_EXCEED_SHARED_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> -				goto out;
> -			}
> -		}
> +		if (pte_check_res == PTE_CHECK_CONTINUE)
> +			continue;
> +		else if (pte_check_res == PTE_CHECK_FAIL)
> +			goto out;
>
>  		if (folio_test_large(folio)) {
>  			struct folio *f;
> @@ -1264,11 +1347,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>  	pte_t *pte, *_pte;
>  	int result = SCAN_FAIL, referenced = 0;
>  	int none_or_zero = 0, shared = 0;
> -	struct page *page = NULL;
>  	struct folio *folio = NULL;
>  	unsigned long addr;
>  	spinlock_t *ptl;
>  	int node = NUMA_NO_NODE, unmapped = 0;
> +	int pte_check_res;
>
>  	VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
>
> @@ -1287,81 +1370,15 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>  	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>  	     _pte++, addr += PAGE_SIZE) {
>  		pte_t pteval = ptep_get(_pte);
> -		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> -			++none_or_zero;
> -			if (!userfaultfd_armed(vma) &&
> -			    (!cc->is_khugepaged ||
> -			     none_or_zero <= khugepaged_max_ptes_none)) {
> -				continue;
> -			} else {
> -				result = SCAN_EXCEED_NONE_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> -				goto out_unmap;
> -			}
> -		} else if (!pte_present(pteval)) {
> -			if (non_swap_entry(pte_to_swp_entry(pteval))) {
> -				result = SCAN_PTE_NON_PRESENT;
> -				goto out_unmap;
> -			}
>
> -			++unmapped;
> -			if (!cc->is_khugepaged ||
> -			    unmapped <= khugepaged_max_ptes_swap) {
> -				/*
> -				 * Always be strict with uffd-wp
> -				 * enabled swap entries.  Please see
> -				 * comment below for pte_uffd_wp().
> -				 */
> -				if (pte_swp_uffd_wp(pteval)) {
> -					result = SCAN_PTE_UFFD_WP;
> -					goto out_unmap;
> -				}
> -				continue;
> -			} else {
> -				result = SCAN_EXCEED_SWAP_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> -				goto out_unmap;
> -			}
> -		} else if (pte_uffd_wp(pteval)) {
> -			/*
> -			 * Don't collapse the page if any of the small
> -			 * PTEs are armed with uffd write protection.
> -			 * Here we can also mark the new huge pmd as
> -			 * write protected if any of the small ones is
> -			 * marked but that could bring unknown
> -			 * userfault messages that falls outside of
> -			 * the registered range.  So, just be simple.
> -			 */
> -			result = SCAN_PTE_UFFD_WP;
> -			goto out_unmap;
> -		}
> +		pte_check_res = thp_collapse_check_pte(pteval, vma, addr,
> +					&folio, &none_or_zero, &unmapped,
> +					&shared, &result, cc);
>
> -		page = vm_normal_page(vma, addr, pteval);
> -		if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
> -			result = SCAN_PAGE_NULL;
> -			goto out_unmap;
> -		}
> -		folio = page_folio(page);
> -
> -		if (!folio_test_anon(folio)) {
> -			VM_WARN_ON_FOLIO(true, folio);
> -			result = SCAN_PAGE_ANON;
> +		if (pte_check_res == PTE_CHECK_CONTINUE)
> +			continue;
> +		else if (pte_check_res == PTE_CHECK_FAIL)
>  			goto out_unmap;
> -		}
> -
> -		/*
> -		 * We treat a single page as shared if any part of the THP
> -		 * is shared.
> -		 */
> -		if (folio_maybe_mapped_shared(folio)) {
> -			++shared;
> -			if (cc->is_khugepaged &&
> -			    shared > khugepaged_max_ptes_shared) {
> -				result = SCAN_EXCEED_SHARED_PTE;
> -				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> -				goto out_unmap;
> -			}
> -		}
>
>  		/*
>  		 * Record which node the original page is from and save this
> --
> 2.49.0
>



* Re: [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper
  2025-10-14 17:41   ` Lorenzo Stoakes
@ 2025-10-15  1:48     ` Lance Yang
  0 siblings, 0 replies; 19+ messages in thread
From: Lance Yang @ 2025-10-15  1:48 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, ioworker0, richard.weiyang,
	linux-kernel, linux-mm



On 2025/10/15 01:41, Lorenzo Stoakes wrote:
> OK so this patch _is_ based on [0] already being applied then?
> 
> I don't see any is_swap_pte() here so presumably so, right?
> 
> Can we just respin these series and put them together? Because now I've
> reviewed a bunch of stuff in [0], which this series depends upon, and
> you're saying we should now review on this instead of that and it's a bit
> of a mess.
> 
> Once review here is dealt with can you combine [0] into this series please?

Got it, thanks!

> 
> For [0] I would like you to reinstate the is_swap_pte() conditional as
> (copiously :) discussed over there.
> 
> [0]: https://lore.kernel.org/all/20251008032657.72406-1-lance.yang@linux.dev/
> 
> Thanks, Lorenzo
> 
> On Wed, Oct 08, 2025 at 12:37:48PM +0800, Lance Yang wrote:
>> [...]




* Re: [PATCH mm-new v3 1/3] mm/khugepaged: optimize PTE scanning with if-else-if-else-if chain
  2025-10-14 12:27     ` David Hildenbrand
@ 2025-10-15  4:49       ` Lance Yang
  2025-10-15  9:16         ` Lorenzo Stoakes
  0 siblings, 1 reply; 19+ messages in thread
From: Lance Yang @ 2025-10-15  4:49 UTC (permalink / raw)
  To: David Hildenbrand, Lorenzo Stoakes
  Cc: akpm, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, baohua, ioworker0, richard.weiyang, linux-kernel,
	linux-mm



On 2025/10/14 20:27, David Hildenbrand wrote:
> On 14.10.25 14:17, Lorenzo Stoakes wrote:
>> On Wed, Oct 08, 2025 at 12:37:46PM +0800, Lance Yang wrote:
>>> From: Lance Yang <lance.yang@linux.dev>
>>>
>>> As pointed out by Dev, the PTE checks for disjoint conditions in the
>>> scanning loops can be optimized. is_swap_pte, (pte_none && is_zero_pfn),
>>> and pte_uffd_wp are mutually exclusive.
>>
>> But you're not using is_swap_pte anywhere :) This comes back to my review
>> question on the series this is dependent upon.
>>
>>>
>>> This patch refactors the loops in both __collapse_huge_page_isolate() 
>>> and
>>> hpage_collapse_scan_pmd() to use a continuous if-else-if-else-if chain
>>> instead of separate if blocks. While at it, the redundant pte_present()
>>> check before is_zero_pfn() is also removed.
>>
>> I mean see review below, I don't see why you're doing this and I am
>> unconvinced by how redundant that check is.

Ah, good catch! Lorenzo, thanks!!!

>>
>> Also this just feels like it should be part of the series where you 
>> change
>> these? I'm not sure why this is separate?
> 
> I think Lance is trying to unify both scanning functions to look alike, 
> such that when he refactors them out in patch #3 it looks more
> straightforward.
> 
> The missing pte_present() check in hpage_collapse_scan_pmd() is interesting

Yep, indeed ;)

> 
> Likely there is one such check missing there?

I think the risk is exactly how pte_pfn() would handle a swap PTE ...

A swap PTE contains completely different data (swap type and offset).
pte_pfn() doesn't know this, so if we feed a swap entry to it, it will
spit out a junk PFN :)

What if that junk PFN happens to match the zeropage's PFN by sheer
chance? IMHO, it's really unlikely, but it would be really bad if it did.

Clearly, pte_present() prevents this ;)

By the way, I noticed there are other places in khugepaged.c that
call pte_pfn() without being under a pte_present() check.

Perhaps those should all be fixed as well in a separate patch?



* Re: [PATCH mm-new v3 1/3] mm/khugepaged: optimize PTE scanning with if-else-if-else-if chain
  2025-10-15  4:49       ` Lance Yang
@ 2025-10-15  9:16         ` Lorenzo Stoakes
  2025-10-15  9:31           ` Lance Yang
  0 siblings, 1 reply; 19+ messages in thread
From: Lorenzo Stoakes @ 2025-10-15  9:16 UTC (permalink / raw)
  To: Lance Yang
  Cc: David Hildenbrand, akpm, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, ioworker0, richard.weiyang,
	linux-kernel, linux-mm

On Wed, Oct 15, 2025 at 12:49:26PM +0800, Lance Yang wrote:
>
>
> On 2025/10/14 20:27, David Hildenbrand wrote:
> > On 14.10.25 14:17, Lorenzo Stoakes wrote:
> > > On Wed, Oct 08, 2025 at 12:37:46PM +0800, Lance Yang wrote:
> > > > From: Lance Yang <lance.yang@linux.dev>
> > > >
> > > > As pointed out by Dev, the PTE checks for disjoint conditions in the
> > > > scanning loops can be optimized. is_swap_pte, (pte_none && is_zero_pfn),
> > > > and pte_uffd_wp are mutually exclusive.
> > >
> > > But you're not using is_swap_pte anywhere :) This comes back to my review
> > > question on the series this is dependent upon.
> > >
> > > >
> > > > This patch refactors the loops in both
> > > > __collapse_huge_page_isolate() and
> > > > hpage_collapse_scan_pmd() to use a continuous if-else-if-else-if chain
> > > > instead of separate if blocks. While at it, the redundant pte_present()
> > > > check before is_zero_pfn() is also removed.
> > >
> > > I mean see review below, I don't see why you're doing this and I am
> > > unconvinced by how redundant that check is.
>
> Ah, good catch! Lorenzo, thanks!!!
>
> > >
> > > Also this just feels like it should be part of the series where you
> > > change
> > > these? I'm not sure why this is separate?
> >
> > I think Lance is trying to unify both scanning functions to look alike,
> > such that when he refactors them out in patch #3 it looks more
> > straightforward.
> >
> > The missing pte_present() check in hpage_collapse_scan_pmd() is interesting
>
> Yep, indeed ;)
>
> >
> > Likely there is one such check missing there?
>
> I think the risk is exactly how pte_pfn() would handle a swap PTE ...
>
> A swap PTE contains completely different data (swap type and offset).
> pte_pfn() doesn't know this, so if we feed a swap entry to it, it will
> spit out a junk PFN :)
>
> What if that junk PFN happens to match the zeropage's PFN by sheer
> chance? IMHO, it's really unlikely, but it would be really bad if it did.
>
> Clearly, pte_present() prevents this ;)

Yeah, "not so clearly" is kinda the whole point I'm making here. I mean all this code
sucks because we have deeply nested conditions and you have to keep in your mind
that 'oh we already checked for X so we can do Y'.

But the THP code is horrible in general.

Anyway let's stay focused (so I can stand a chance of catching up with my
post-vacation backlog :), I will check the respin once you send it!

>
> By the way, I noticed there are other places in khugepaged.c that
> call pte_pfn() without being under a pte_present() check.
>
> Perhaps those should all be fixed as well in a separate patch?

Yeah I'm not surprised and sure indeed.

Thanks, Lorenzo



* Re: [PATCH mm-new v3 1/3] mm/khugepaged: optimize PTE scanning with if-else-if-else-if chain
  2025-10-15  9:16         ` Lorenzo Stoakes
@ 2025-10-15  9:31           ` Lance Yang
  0 siblings, 0 replies; 19+ messages in thread
From: Lance Yang @ 2025-10-15  9:31 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, akpm, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, ioworker0, richard.weiyang,
	linux-kernel, linux-mm



On 2025/10/15 17:16, Lorenzo Stoakes wrote:
> On Wed, Oct 15, 2025 at 12:49:26PM +0800, Lance Yang wrote:
>>
>>
>> On 2025/10/14 20:27, David Hildenbrand wrote:
>>> On 14.10.25 14:17, Lorenzo Stoakes wrote:
>>>> On Wed, Oct 08, 2025 at 12:37:46PM +0800, Lance Yang wrote:
>>>>> From: Lance Yang <lance.yang@linux.dev>
>>>>>
>>>>> As pointed out by Dev, the PTE checks for disjoint conditions in the
>>>>> scanning loops can be optimized. is_swap_pte, (pte_none && is_zero_pfn),
>>>>> and pte_uffd_wp are mutually exclusive.
>>>>
>>>> But you're not using is_swap_pte anywhere :) This comes back to my review
>>>> question on the series this is dependent upon.
>>>>
>>>>>
>>>>> This patch refactors the loops in both
>>>>> __collapse_huge_page_isolate() and
>>>>> hpage_collapse_scan_pmd() to use a continuous if-else-if-else-if chain
>>>>> instead of separate if blocks. While at it, the redundant pte_present()
>>>>> check before is_zero_pfn() is also removed.
>>>>
>>>> I mean see review below, I don't see why you're doing this and I am
>>>> unconvinced by how redundant that check is.
>>
>> Ah, good catch! Lorenzo, thanks!!!
>>
>>>>
>>>> Also this just feels like it should be part of the series where you
>>>> change
>>>> these? I'm not sure why this is separate?
>>>
>>> I think Lance is trying to unify both scanning functions to look alike,
>>> such that when he refactors them out in patch #3 it looks more
>>> straightforward.
>>>
>>> The missing pte_present() check in hpage_collapse_scan_pmd() is interesting
>>
>> Yep, indeed ;)
>>
>>>
>>> Likely there is one such check missing there?
>>
>> I think the risk is exactly how pte_pfn() would handle a swap PTE ...
>>
>> A swap PTE contains completely different data (swap type and offset).
>> pte_pfn() doesn't know this, so if we feed a swap entry to it, it will
>> spit out a junk PFN :)
>>
>> What if that junk PFN happens to match the zeropage's PFN by sheer
>> chance? IMHO, it's really unlikely, but it would be really bad if it did.
>>
>> Clearly, pte_present() prevents this ;)
> 
> Yeah, "not so clearly" is kinda the whole point I'm making here. I mean all this code
> sucks because we have deeply nested conditions and you have to keep in your mind
> that 'oh we already checked for X so we can do Y'.

I see :)

> 
> But the THP code is horrible in general.

Yep, you'll get no argument from me on that one ;p

The code is indeed tricky ...

> 
> Anyway let's stay focused (so I can stand a chance of catching up with my
> post-vacation backlog :), I will check the respin once you send it!

Cheers!

This series has been dropped from mm-new. I'm going to take a bit to
rethink the approach based on the feedback.

> 
>>
>> By the way, I noticed there are other places in khugepaged.c that
>> call pte_pfn() without being under a pte_present() check.
>>
>> Perhaps those should all be fixed as well in a separate patch?
> 
> Yeah I'm not surprised and sure indeed.
> 
> Thanks, Lorenzo

In the meantime, I'll send out a separate patch for the missing
pte_present() checks first ;)

Thanks,
Lance



end of thread, other threads:[~2025-10-15  9:31 UTC | newest]

Thread overview: 19+ messages
-- links below jump to the message on this page --
2025-10-08  4:37 [PATCH mm-new v3 0/3] refactor and merge PTE scanning logic Lance Yang
2025-10-08  4:37 ` [PATCH mm-new v3 1/3] mm/khugepaged: optimize PTE scanning with if-else-if-else-if chain Lance Yang
2025-10-14 12:17   ` Lorenzo Stoakes
2025-10-14 12:27     ` David Hildenbrand
2025-10-15  4:49       ` Lance Yang
2025-10-15  9:16         ` Lorenzo Stoakes
2025-10-15  9:31           ` Lance Yang
2025-10-08  4:37 ` [PATCH mm-new v3 2/3] mm/khugepaged: use VM_WARN_ON_FOLIO instead of VM_BUG_ON_FOLIO for non-anon folios Lance Yang
2025-10-14 12:25   ` Lorenzo Stoakes
2025-10-08  4:37 ` [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper Lance Yang
2025-10-09  1:07   ` Andrew Morton
2025-10-09  1:49     ` Lance Yang
2025-10-10  9:10   ` Dev Jain
2025-10-10 10:42     ` Lance Yang
2025-10-10 13:29   ` Wei Yang
2025-10-10 13:55     ` Lance Yang
2025-10-14 12:36   ` Lorenzo Stoakes
2025-10-14 17:41   ` Lorenzo Stoakes
2025-10-15  1:48     ` Lance Yang
