linux-mm.kvack.org archive mirror
* [PATCH mm-new v5 0/5] Improve khugepaged scan logic
@ 2026-01-23  8:22 Vernon Yang
  2026-01-23  8:22 ` [PATCH mm-new v5 1/5] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
                   ` (4 more replies)
  0 siblings, 5 replies; 31+ messages in thread
From: Vernon Yang @ 2026-01-23  8:22 UTC (permalink / raw)
  To: akpm, david
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang, linux-mm,
	linux-kernel, Vernon Yang

From: Vernon Yang <yanglincheng@kylinos.cn>

Hi all,

This series improves the khugepaged scan logic to reduce CPU consumption
and prioritize scanning tasks that access memory frequently.

The following data is traced by bpftrace[1] on a desktop system. After
the system has been left idle for 10 minutes upon booting, a lot of
SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE are observed during a full scan by
khugepaged.

@scan_pmd_status[1]: 1           ## SCAN_SUCCEED
@scan_pmd_status[6]: 2           ## SCAN_EXCEED_SHARED_PTE
@scan_pmd_status[3]: 142         ## SCAN_PMD_MAPPED
@scan_pmd_status[2]: 178         ## SCAN_NO_PTE_TABLE
total progress size: 674 MB
Total time         : 419 seconds ## include khugepaged_scan_sleep_millisecs

khugepaged has the following behavior: the khugepaged list is scanned
in a FIFO manner, and as long as a task is not destroyed,
1. a task that no longer has any memory that can be collapsed into
   hugepages keeps being scanned anyway.
2. a task at the front of the khugepaged scan list is scanned first
   even if it is cold.
3. every scan pass is separated by khugepaged_scan_sleep_millisecs
   (default 10s). If the two cases above are always scanned first, the
   valid scans have to wait for a long time.

For the first case, when the memory is reported as SCAN_PMD_MAPPED or
SCAN_NO_PTE_TABLE, simply skip it.

For the second case, if the user has explicitly informed us via
MADV_FREE that these folios will be freed, simply skip them.
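
For illustration, a minimal user-space sketch of how the cold task in
the test described below could mark its memory as lazy-free with
MADV_FREE (this is only an approximation; the actual test program is
[2], and the buffer size and structure here are assumptions):

    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 128UL << 20;      /* 128 MB, as in the test below */
        char *buf = aligned_alloc(2UL << 20, len);

        memset(buf, 1, len);           /* touch the memory briefly */
        /*
         * Tell the kernel these pages may be freed lazily. With this
         * series, khugepaged skips such folios instead of scanning
         * and collapsing them.
         */
        madvise(buf, len, MADV_FREE);
        pause();                       /* stay alive but stay cold */
        return 0;
    }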

Below are some performance test results.

kernbench results (testing on x86_64 machine):

                     baseline w/o patches   test w/ patches
Amean     user-32    18522.51 (   0.00%)    18333.64 *   1.02%*
Amean     syst-32     1137.96 (   0.00%)     1113.79 *   2.12%*
Amean     elsp-32      666.04 (   0.00%)      659.44 *   0.99%*
BAmean-95 user-32    18520.01 (   0.00%)    18323.57 (   1.06%)
BAmean-95 syst-32     1137.68 (   0.00%)     1110.50 (   2.39%)
BAmean-95 elsp-32      665.92 (   0.00%)      659.06 (   1.03%)
BAmean-99 user-32    18520.01 (   0.00%)    18323.57 (   1.06%)
BAmean-99 syst-32     1137.68 (   0.00%)     1110.50 (   2.39%)
BAmean-99 elsp-32      665.92 (   0.00%)      659.06 (   1.03%)

Create three tasks[2]: hot1 -> cold -> hot2. After all three tasks are
created, each allocates 128 MB of memory. The hot1/hot2 tasks
continuously access their 128 MB of memory, while the cold task only
accesses its memory briefly and then calls madvise(MADV_FREE). Here are
the performance test results:
(Throughput bigger is better, other smaller is better)

Testing on x86_64 machine:

| task hot2           | without patch | with patch    |  delta  |
|---------------------|---------------|---------------|---------|
| total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
| cycles per access   |  4.96         |  2.21         | -55.44% |
| Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
| dTLB-load-misses    |  284814532    |  69597236     | -75.56% |

Testing on qemu-system-x86_64 -enable-kvm:

| task hot2           | without patch | with patch    |  delta  |
|---------------------|---------------|---------------|---------|
| total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
| cycles per access   |  7.29         |  2.07         | -71.60% |
| Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
| dTLB-load-misses    |  241600871    |  3216108      | -98.67% |

This series is based on mm-new.

Thank you very much for your comments and discussions.


V4 -> V5:
- Patch #3 is squashed into Patch #2.
- The file-scan patch uses "xas->xa_index" to fix an issue.
- Rename folio_is_lazyfree() to folio_test_lazyfree().
- Simply skip lazy-free folios.
- Re-run kernbench in performance mode to improve stability.
- Pick up Acked-by and Reviewed-by tags.

V3 -> V4:
- Rebase on mm-new.
- Make Patch #2 cleaner.
- Fix lazy-free folios continuing to be collapsed when they should have
  been skipped.

V2 -> V3:
- Refine the scan progress number; add the folio_is_lazyfree helper.
- Fix warnings at SCAN_PTE_MAPPED_HUGEPAGE.
- For MADV_FREE, we will skip the lazy-free folios instead.
- For MADV_COLD, remove it.
- Use hpage_collapse_test_exit_or_disable() instead of vma = NULL.
- Pick up Reviewed-by tags.

V1 -> V2:
- Rename full to full_scan_finished; pick up Acked-by.
- Just skip SCAN_PMD_MAPPED/SCAN_NO_PTE_TABLE memory instead of removing
  the mm.
- Set the VM_NOHUGEPAGE flag on MADV_COLD/MADV_FREE so the memory is
  just skipped, instead of moving the mm.
- Re-test performance on v6.19-rc2.

V4 : https://lore.kernel.org/linux-mm/20260111121909.8410-1-yanglincheng@kylinos.cn
V3 : https://lore.kernel.org/linux-mm/20260104054112.4541-1-yanglincheng@kylinos.cn
V2 : https://lore.kernel.org/linux-mm/20251229055151.54887-1-yanglincheng@kylinos.cn
V1 : https://lore.kernel.org/linux-mm/20251215090419.174418-1-yanglincheng@kylinos.cn

[1] https://github.com/vernon2gh/app_and_module/blob/main/khugepaged/khugepaged_mm.bt
[2] https://github.com/vernon2gh/app_and_module/blob/main/khugepaged/app.c

Vernon Yang (5):
  mm: khugepaged: add trace_mm_khugepaged_scan event
  mm: khugepaged: refine scan progress number
  mm: add folio_test_lazyfree helper
  mm: khugepaged: skip lazy-free folios
  mm: khugepaged: set to next mm direct when mm has
    MMF_DISABLE_THP_COMPLETELY

 include/linux/page-flags.h         |  5 +++
 include/linux/xarray.h             |  9 +++++
 include/trace/events/huge_memory.h | 26 +++++++++++++
 mm/khugepaged.c                    | 62 +++++++++++++++++++++++++-----
 mm/rmap.c                          |  2 +-
 mm/vmscan.c                        |  5 +--
 6 files changed, 95 insertions(+), 14 deletions(-)

--
2.51.0



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH mm-new v5 1/5] mm: khugepaged: add trace_mm_khugepaged_scan event
  2026-01-23  8:22 [PATCH mm-new v5 0/5] Improve khugepaged scan logic Vernon Yang
@ 2026-01-23  8:22 ` Vernon Yang
  2026-01-23 10:25   ` Dev Jain
  2026-01-23  8:22 ` [PATCH mm-new v5 2/5] mm: khugepaged: refine scan progress number Vernon Yang
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 31+ messages in thread
From: Vernon Yang @ 2026-01-23  8:22 UTC (permalink / raw)
  To: akpm, david
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang, linux-mm,
	linux-kernel, Vernon Yang

From: Vernon Yang <yanglincheng@kylinos.cn>

Add the mm_khugepaged_scan event to track the total time of a full
khugepaged scan and the total number of pages scanned.
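
For reference, with the TP_printk format added below, a resulting trace
line would look roughly like this (field values are purely
illustrative):

  khugepaged-61  [002] .....  1234.567890: mm_khugepaged_scan: mm=00000000abcd1234, progress=512, full_scan_finished=1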

Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
---
 include/trace/events/huge_memory.h | 25 +++++++++++++++++++++++++
 mm/khugepaged.c                    |  2 ++
 2 files changed, 27 insertions(+)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 4e41bff31888..384e29f6bef0 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -237,5 +237,30 @@ TRACE_EVENT(mm_khugepaged_collapse_file,
 		__print_symbolic(__entry->result, SCAN_STATUS))
 );
 
+TRACE_EVENT(mm_khugepaged_scan,
+
+	TP_PROTO(struct mm_struct *mm, unsigned int progress,
+		 bool full_scan_finished),
+
+	TP_ARGS(mm, progress, full_scan_finished),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned int, progress)
+		__field(bool, full_scan_finished)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->progress = progress;
+		__entry->full_scan_finished = full_scan_finished;
+	),
+
+	TP_printk("mm=%p, progress=%u, full_scan_finished=%d",
+		__entry->mm,
+		__entry->progress,
+		__entry->full_scan_finished)
+);
+
 #endif /* __HUGE_MEMORY_H */
 #include <trace/define_trace.h>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e76f42243534..6f0f05148765 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2535,6 +2535,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
 		collect_mm_slot(slot);
 	}
 
+	trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
+
 	return progress;
 }
 
-- 
2.51.0



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH mm-new v5 2/5] mm: khugepaged: refine scan progress number
  2026-01-23  8:22 [PATCH mm-new v5 0/5] Improve khugepaged scan logic Vernon Yang
  2026-01-23  8:22 ` [PATCH mm-new v5 1/5] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
@ 2026-01-23  8:22 ` Vernon Yang
  2026-01-23 10:46   ` Dev Jain
                     ` (2 more replies)
  2026-01-23  8:22 ` [PATCH mm-new v5 3/5] mm: add folio_test_lazyfree helper Vernon Yang
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 31+ messages in thread
From: Vernon Yang @ 2026-01-23  8:22 UTC (permalink / raw)
  To: akpm, david
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang, linux-mm,
	linux-kernel, Vernon Yang

From: Vernon Yang <yanglincheng@kylinos.cn>

Currently, each scan always increases "progress" by HPAGE_PMD_NR,
even if only scanning a single PTE/PMD entry.

- When only scanning a single PTE entry, let me provide a detailed
  example:

static int hpage_collapse_scan_pmd()
{
	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
	     _pte++, addr += PAGE_SIZE) {
		pte_t pteval = ptep_get(_pte);
		...
		if (pte_uffd_wp(pteval)) { <-- first scan hit
			result = SCAN_PTE_UFFD_WP;
			goto out_unmap;
		}
	}
}

During the first scan, if pte_uffd_wp(pteval) is true, the loop exits
immediately. In practice, only one PTE is scanned before termination.
Here, "progress += 1" reflects the actual number of PTEs scanned,
whereas previously it was always "progress += HPAGE_PMD_NR".

- When the memory has been collapsed to PMD, let me provide a detailed
  example:

The following data is traced by bpftrace on a desktop system. After
the system has been left idle for 10 minutes upon booting, a lot of
SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE are observed during a full scan
by khugepaged.

@scan_pmd_status[1]: 1           ## SCAN_SUCCEED
@scan_pmd_status[6]: 2           ## SCAN_EXCEED_SHARED_PTE
@scan_pmd_status[3]: 142         ## SCAN_PMD_MAPPED
@scan_pmd_status[2]: 178         ## SCAN_NO_PTE_TABLE
total progress size: 674 MB
Total time         : 419 seconds ## include khugepaged_scan_sleep_millisecs

The khugepaged_scan list holds all tasks that are eligible for
collapsing into hugepages; as long as a task is not destroyed,
khugepaged will not remove it from the khugepaged_scan list. This leads
to a situation where a task has already collapsed all of its memory
regions into hugepages, yet khugepaged keeps scanning it, which wastes
CPU time on work that cannot succeed. Combined with
khugepaged_scan_sleep_millisecs (default 10s), scanning a large number
of such invalid tasks causes a long wait, so the really valid tasks are
scanned much later.

After applying this patch, when the memory is either SCAN_PMD_MAPPED or
SCAN_NO_PTE_TABLE, just skip it, as follows:

@scan_pmd_status[6]: 2
@scan_pmd_status[3]: 147
@scan_pmd_status[2]: 173
total progress size: 45 MB
Total time         : 20 seconds

Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
 include/linux/xarray.h |  9 ++++++++
 mm/khugepaged.c        | 47 ++++++++++++++++++++++++++++++++++--------
 2 files changed, 47 insertions(+), 9 deletions(-)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index be850174e802..f77d97d7b957 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -1646,6 +1646,15 @@ static inline void xas_set(struct xa_state *xas, unsigned long index)
 	xas->xa_node = XAS_RESTART;
 }
 
+/**
+ * xas_get_index() - Get the current index of the XArray operation state.
+ * @xas: XArray operation state.
+ */
+static inline unsigned long xas_get_index(struct xa_state *xas)
+{
+	return xas->xa_index;
+}
+
 /**
  * xas_advance() - Skip over sibling entries.
  * @xas: XArray operation state.
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6f0f05148765..de95029e3763 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -68,7 +68,10 @@ enum scan_result {
 static struct task_struct *khugepaged_thread __read_mostly;
 static DEFINE_MUTEX(khugepaged_mutex);
 
-/* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 second */
+/*
+ * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
+ * every 10 seconds.
+ */
 static unsigned int khugepaged_pages_to_scan __read_mostly;
 static unsigned int khugepaged_pages_collapsed;
 static unsigned int khugepaged_full_scans;
@@ -1240,7 +1243,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 }
 
 static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
-		struct vm_area_struct *vma, unsigned long start_addr, bool *mmap_locked,
+		struct vm_area_struct *vma, unsigned long start_addr,
+		bool *mmap_locked, unsigned int *cur_progress,
 		struct collapse_control *cc)
 {
 	pmd_t *pmd;
@@ -1255,6 +1259,9 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
 
 	VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
 
+	if (cur_progress)
+		*cur_progress += 1;
+
 	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
 	if (result != SCAN_SUCCEED)
 		goto out;
@@ -1396,6 +1403,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
 		result = SCAN_SUCCEED;
 	}
 out_unmap:
+	if (cur_progress) {
+		if (_pte >= pte + HPAGE_PMD_NR)
+			*cur_progress += HPAGE_PMD_NR - 1;
+		else
+			*cur_progress += _pte - pte;
+	}
 	pte_unmap_unlock(pte, ptl);
 	if (result == SCAN_SUCCEED) {
 		result = collapse_huge_page(mm, start_addr, referenced,
@@ -2286,8 +2299,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
 	return result;
 }
 
-static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
-		struct file *file, pgoff_t start, struct collapse_control *cc)
+static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
+		unsigned long addr, struct file *file, pgoff_t start,
+		unsigned int *cur_progress, struct collapse_control *cc)
 {
 	struct folio *folio = NULL;
 	struct address_space *mapping = file->f_mapping;
@@ -2376,6 +2390,18 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
 			cond_resched_rcu();
 		}
 	}
+	if (cur_progress) {
+		unsigned long idx = xas_get_index(&xas) - start;
+
+		if (folio == NULL)
+			*cur_progress += HPAGE_PMD_NR;
+		else if (xa_is_value(folio))
+			*cur_progress += idx + (1 << xas_get_order(&xas));
+		else if (folio_order(folio) == HPAGE_PMD_ORDER)
+			*cur_progress += idx + 1;
+		else
+			*cur_progress += idx + folio_nr_pages(folio);
+	}
 	rcu_read_unlock();
 
 	if (result == SCAN_SUCCEED) {
@@ -2456,6 +2482,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
 
 		while (khugepaged_scan.address < hend) {
 			bool mmap_locked = true;
+			unsigned int cur_progress = 0;
 
 			cond_resched();
 			if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
@@ -2472,7 +2499,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
 				mmap_read_unlock(mm);
 				mmap_locked = false;
 				*result = hpage_collapse_scan_file(mm,
-					khugepaged_scan.address, file, pgoff, cc);
+					khugepaged_scan.address, file, pgoff,
+					&cur_progress, cc);
 				fput(file);
 				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
 					mmap_read_lock(mm);
@@ -2486,7 +2514,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
 				}
 			} else {
 				*result = hpage_collapse_scan_pmd(mm, vma,
-					khugepaged_scan.address, &mmap_locked, cc);
+					khugepaged_scan.address, &mmap_locked,
+					&cur_progress, cc);
 			}
 
 			if (*result == SCAN_SUCCEED)
@@ -2494,7 +2523,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
 
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
-			progress += HPAGE_PMD_NR;
+			progress += cur_progress;
 			if (!mmap_locked)
 				/*
 				 * We released mmap_lock so break loop.  Note
@@ -2817,7 +2846,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 			mmap_locked = false;
 			*lock_dropped = true;
 			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
-							  cc);
+							  NULL, cc);
 
 			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
 			    mapping_can_writeback(file->f_mapping)) {
@@ -2832,7 +2861,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 			fput(file);
 		} else {
 			result = hpage_collapse_scan_pmd(mm, vma, addr,
-							 &mmap_locked, cc);
+							 &mmap_locked, NULL, cc);
 		}
 		if (!mmap_locked)
 			*lock_dropped = true;
-- 
2.51.0



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH mm-new v5 3/5] mm: add folio_test_lazyfree helper
  2026-01-23  8:22 [PATCH mm-new v5 0/5] Improve khugepaged scan logic Vernon Yang
  2026-01-23  8:22 ` [PATCH mm-new v5 1/5] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
  2026-01-23  8:22 ` [PATCH mm-new v5 2/5] mm: khugepaged: refine scan progress number Vernon Yang
@ 2026-01-23  8:22 ` Vernon Yang
  2026-01-23 10:54   ` Dev Jain
  2026-01-26  1:52   ` Barry Song
  2026-01-23  8:22 ` [PATCH mm-new v5 4/5] mm: khugepaged: skip lazy-free folios Vernon Yang
  2026-01-23  8:22 ` [PATCH mm-new v5 5/5] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY Vernon Yang
  4 siblings, 2 replies; 31+ messages in thread
From: Vernon Yang @ 2026-01-23  8:22 UTC (permalink / raw)
  To: akpm, david
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang, linux-mm,
	linux-kernel, Vernon Yang

From: Vernon Yang <yanglincheng@kylinos.cn>

Add the folio_test_lazyfree() helper to identify lazy-free folios and
improve code readability.

Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
---
 include/linux/page-flags.h | 5 +++++
 mm/rmap.c                  | 2 +-
 mm/vmscan.c                | 5 ++---
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f7a0e4af0c73..415e9f2ef616 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -724,6 +724,11 @@ static __always_inline bool folio_test_anon(const struct folio *folio)
 	return ((unsigned long)folio->mapping & FOLIO_MAPPING_ANON) != 0;
 }
 
+static __always_inline bool folio_test_lazyfree(const struct folio *folio)
+{
+	return folio_test_anon(folio) && !folio_test_swapbacked(folio);
+}
+
 static __always_inline bool PageAnonNotKsm(const struct page *page)
 {
 	unsigned long flags = (unsigned long)page_folio(page)->mapping;
diff --git a/mm/rmap.c b/mm/rmap.c
index 088723f909e4..6b09be06c707 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2049,7 +2049,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		}
 
 		if (!pvmw.pte) {
-			if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
+			if (folio_test_lazyfree(folio)) {
 				if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
 					goto walk_done;
 				/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8643c0fb9162..1f32c74ec738 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -944,8 +944,7 @@ static void folio_check_dirty_writeback(struct folio *folio,
 	 * They could be mistakenly treated as file lru. So further anon
 	 * test is needed.
 	 */
-	if (!folio_is_file_lru(folio) ||
-	    (folio_test_anon(folio) && !folio_test_swapbacked(folio))) {
+	if (!folio_is_file_lru(folio) || folio_test_lazyfree(folio)) {
 		*dirty = false;
 		*writeback = false;
 		return;
@@ -1489,7 +1488,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			}
 		}
 
-		if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
+		if (folio_test_lazyfree(folio)) {
 			/* follow __remove_mapping for reference */
 			if (!folio_ref_freeze(folio, 1))
 				goto keep_locked;
-- 
2.51.0



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH mm-new v5 4/5] mm: khugepaged: skip lazy-free folios
  2026-01-23  8:22 [PATCH mm-new v5 0/5] Improve khugepaged scan logic Vernon Yang
                   ` (2 preceding siblings ...)
  2026-01-23  8:22 ` [PATCH mm-new v5 3/5] mm: add folio_test_lazyfree helper Vernon Yang
@ 2026-01-23  8:22 ` Vernon Yang
  2026-01-23  9:09   ` Lance Yang
  2026-01-23  8:22 ` [PATCH mm-new v5 5/5] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY Vernon Yang
  4 siblings, 1 reply; 31+ messages in thread
From: Vernon Yang @ 2026-01-23  8:22 UTC (permalink / raw)
  To: akpm, david
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang, linux-mm,
	linux-kernel, Vernon Yang

From: Vernon Yang <yanglincheng@kylinos.cn>

For example, create three tasks: hot1 -> cold -> hot2. After all three
tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
continuously access their 128 MB of memory, while the cold task only
accesses its memory briefly and then calls madvise(MADV_FREE). However,
khugepaged still prioritizes scanning the cold task and only scans the
hot2 task after completing the scan of the cold task.

And if we collapse a region containing a lazy-free page, its content
will never become none and the deferred shrinker cannot reclaim it.

So if the user has explicitly informed us via MADV_FREE that this
memory will be freed, it is appropriate for khugepaged to simply skip
it, thereby avoiding unnecessary scan and collapse operations and
reducing CPU waste.
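
Note that the check pairs folio_test_lazyfree() with !pte_dirty(): a
write after MADV_FREE re-dirties the PTE, meaning the data is live
again, so such folios remain eligible for collapse. A minimal
illustrative user-space sketch (the buffer size and layout here are
assumptions, not part of the patch):

    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 2UL << 20;        /* one PMD-sized region */
        char *buf = aligned_alloc(2UL << 20, len);

        memset(buf, 1, len);
        madvise(buf, len, MADV_FREE);  /* lazy-free: khugepaged reports
                                          SCAN_PAGE_LAZYFREE and skips it */
        buf[0] = 1;                    /* the write re-dirties this PTE, so
                                          the folio it maps is no longer
                                          treated as lazy-free above */
        return 0;
    }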

Here are the performance test results:
(Throughput bigger is better, other smaller is better)

Testing on x86_64 machine:

| task hot2           | without patch | with patch    |  delta  |
|---------------------|---------------|---------------|---------|
| total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
| cycles per access   |  4.96         |  2.21         | -55.44% |
| Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
| dTLB-load-misses    |  284814532    |  69597236     | -75.56% |

Testing on qemu-system-x86_64 -enable-kvm:

| task hot2           | without patch | with patch    |  delta  |
|---------------------|---------------|---------------|---------|
| total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
| cycles per access   |  7.29         |  2.07         | -71.60% |
| Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
| dTLB-load-misses    |  241600871    |  3216108      | -98.67% |

Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
 include/trace/events/huge_memory.h |  1 +
 mm/khugepaged.c                    | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 384e29f6bef0..bcdc57eea270 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -25,6 +25,7 @@
 	EM( SCAN_PAGE_LRU,		"page_not_in_lru")		\
 	EM( SCAN_PAGE_LOCK,		"page_locked")			\
 	EM( SCAN_PAGE_ANON,		"page_not_anon")		\
+	EM( SCAN_PAGE_LAZYFREE,		"page_lazyfree")		\
 	EM( SCAN_PAGE_COMPOUND,		"page_compound")		\
 	EM( SCAN_ANY_PROCESS,		"no_process_for_page")		\
 	EM( SCAN_VMA_NULL,		"vma_null")			\
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index de95029e3763..be1c09842ea2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -46,6 +46,7 @@ enum scan_result {
 	SCAN_PAGE_LRU,
 	SCAN_PAGE_LOCK,
 	SCAN_PAGE_ANON,
+	SCAN_PAGE_LAZYFREE,
 	SCAN_PAGE_COMPOUND,
 	SCAN_ANY_PROCESS,
 	SCAN_VMA_NULL,
@@ -583,6 +584,11 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		folio = page_folio(page);
 		VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
 
+		if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {
+			result = SCAN_PAGE_LAZYFREE;
+			goto out;
+		}
+
 		/* See hpage_collapse_scan_pmd(). */
 		if (folio_maybe_mapped_shared(folio)) {
 			++shared;
@@ -1330,6 +1336,11 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
 		}
 		folio = page_folio(page);
 
+		if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {
+			result = SCAN_PAGE_LAZYFREE;
+			goto out_unmap;
+		}
+
 		if (!folio_test_anon(folio)) {
 			result = SCAN_PAGE_ANON;
 			goto out_unmap;
-- 
2.51.0



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH mm-new v5 5/5] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
  2026-01-23  8:22 [PATCH mm-new v5 0/5] Improve khugepaged scan logic Vernon Yang
                   ` (3 preceding siblings ...)
  2026-01-23  8:22 ` [PATCH mm-new v5 4/5] mm: khugepaged: skip lazy-free folios Vernon Yang
@ 2026-01-23  8:22 ` Vernon Yang
  2026-01-23 12:40   ` Dev Jain
  4 siblings, 1 reply; 31+ messages in thread
From: Vernon Yang @ 2026-01-23  8:22 UTC (permalink / raw)
  To: akpm, david
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang, linux-mm,
	linux-kernel, Vernon Yang

From: Vernon Yang <yanglincheng@kylinos.cn>

When an mm with the MMF_DISABLE_THP_COMPLETELY flag is detected during
scanning, directly set khugepaged_scan.mm_slot to the next mm_slot to
reduce redundant operations.

Without this patch, khugepaged_scan.mm_slot is only advanced to the
next mm_slot the next time khugepaged_scan_mm_slot() is entered.

With this patch, khugepaged_scan.mm_slot is advanced to the next
mm_slot directly.
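
For context, an mm typically ends up with MMF_DISABLE_THP_COMPLETELY
after the process disables THP via prctl(). A minimal sketch of such a
process (illustrative only, assuming the PR_SET_THP_DISABLE interface):

    #include <stdio.h>
    #include <sys/prctl.h>

    int main(void)
    {
        /*
         * Disable THP for this mm. khugepaged will not collapse its
         * memory, and with this patch it moves straight on to the
         * next mm_slot when it encounters this mm.
         */
        if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
            perror("prctl(PR_SET_THP_DISABLE)");

        /* ... run the workload ... */
        return 0;
    }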

Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
---
 mm/khugepaged.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index be1c09842ea2..1dc19c1b1f97 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2558,7 +2558,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
 	 * Release the current mm_slot if this mm is about to die, or
 	 * if we scanned all vmas of this mm.
 	 */
-	if (hpage_collapse_test_exit(mm) || !vma) {
+	if (hpage_collapse_test_exit_or_disable(mm) || !vma) {
 		/*
 		 * Make sure that if mm_users is reaching zero while
 		 * khugepaged runs here, khugepaged_exit will find
-- 
2.51.0



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 4/5] mm: khugepaged: skip lazy-free folios
  2026-01-23  8:22 ` [PATCH mm-new v5 4/5] mm: khugepaged: skip lazy-free folios Vernon Yang
@ 2026-01-23  9:09   ` Lance Yang
  2026-01-23 15:08     ` Vernon Yang
  0 siblings, 1 reply; 31+ messages in thread
From: Lance Yang @ 2026-01-23  9:09 UTC (permalink / raw)
  To: Vernon Yang
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, linux-mm, linux-kernel,
	david, Vernon Yang, akpm



On 2026/1/23 16:22, Vernon Yang wrote:
> From: Vernon Yang <yanglincheng@kylinos.cn>
> 
> For example, create three task: hot1 -> cold -> hot2. After all three
> task are created, each allocate memory 128MB. the hot1/hot2 task
> continuously access 128 MB memory, while the cold task only accesses
> its memory briefly andthen call madvise(MADV_FREE). However, khugepaged
> still prioritizes scanning the cold task and only scans the hot2 task
> after completing the scan of the cold task.
> 
> And if we collapse with a lazyfree page, that content will never be none
> and the deferred shrinker cannot reclaim them.
> 
> So if the user has explicitly informed us via MADV_FREE that this memory
> will be freed, it is appropriate for khugepaged to skip it only, thereby
> avoiding unnecessary scan and collapse operations to reducing CPU
> wastage.
> 
> Here are the performance test results:
> (Throughput bigger is better, other smaller is better)
> 
> Testing on x86_64 machine:
> 
> | task hot2           | without patch | with patch    |  delta  |
> |---------------------|---------------|---------------|---------|
> | total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
> | cycles per access   |  4.96         |  2.21         | -55.44% |
> | Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
> | dTLB-load-misses    |  284814532    |  69597236     | -75.56% |
> 
> Testing on qemu-system-x86_64 -enable-kvm:
> 
> | task hot2           | without patch | with patch    |  delta  |
> |---------------------|---------------|---------------|---------|
> | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
> | cycles per access   |  7.29         |  2.07         | -71.60% |
> | Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
> | dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
> 
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> ---
>   include/trace/events/huge_memory.h |  1 +
>   mm/khugepaged.c                    | 11 +++++++++++
>   2 files changed, 12 insertions(+)
> 
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 384e29f6bef0..bcdc57eea270 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -25,6 +25,7 @@
>   	EM( SCAN_PAGE_LRU,		"page_not_in_lru")		\
>   	EM( SCAN_PAGE_LOCK,		"page_locked")			\
>   	EM( SCAN_PAGE_ANON,		"page_not_anon")		\
> +	EM( SCAN_PAGE_LAZYFREE,		"page_lazyfree")		\
>   	EM( SCAN_PAGE_COMPOUND,		"page_compound")		\
>   	EM( SCAN_ANY_PROCESS,		"no_process_for_page")		\
>   	EM( SCAN_VMA_NULL,		"vma_null")			\
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index de95029e3763..be1c09842ea2 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -46,6 +46,7 @@ enum scan_result {
>   	SCAN_PAGE_LRU,
>   	SCAN_PAGE_LOCK,
>   	SCAN_PAGE_ANON,
> +	SCAN_PAGE_LAZYFREE,
>   	SCAN_PAGE_COMPOUND,
>   	SCAN_ANY_PROCESS,
>   	SCAN_VMA_NULL,
> @@ -583,6 +584,11 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>   		folio = page_folio(page);
>   		VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>   
> +		if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {

I'm wondering if we need "cc->is_khugepaged &&" as well here?

We should allow users to enforce collapse via the madvise_collapse()
path even if pages are marked lazyfree, IMHO.

> +			result = SCAN_PAGE_LAZYFREE;
> +			goto out;
> +		}
> +
>   		/* See hpage_collapse_scan_pmd(). */
>   		if (folio_maybe_mapped_shared(folio)) {
>   			++shared;
> @@ -1330,6 +1336,11 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>   		}
>   		folio = page_folio(page);
>   
> +		if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {

Ditto.

> +			result = SCAN_PAGE_LAZYFREE;
> +			goto out_unmap;
> +		}
> +
>   		if (!folio_test_anon(folio)) {
>   			result = SCAN_PAGE_ANON;
>   			goto out_unmap;



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 1/5] mm: khugepaged: add trace_mm_khugepaged_scan event
  2026-01-23  8:22 ` [PATCH mm-new v5 1/5] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
@ 2026-01-23 10:25   ` Dev Jain
  0 siblings, 0 replies; 31+ messages in thread
From: Dev Jain @ 2026-01-23 10:25 UTC (permalink / raw)
  To: Vernon Yang, akpm, david
  Cc: lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm, linux-kernel,
	Vernon Yang


On 23/01/26 1:52 pm, Vernon Yang wrote:
> From: Vernon Yang <yanglincheng@kylinos.cn>
>
> Add mm_khugepaged_scan event to track the total time for full scan
> and the total number of pages scanned of khugepaged.
>
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
> Reviewed-by: Barry Song <baohua@kernel.org>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>

Reviewed-by: Dev Jain <dev.jain@arm.com>

> ---
>  include/trace/events/huge_memory.h | 25 +++++++++++++++++++++++++
>  mm/khugepaged.c                    |  2 ++
>  2 files changed, 27 insertions(+)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 4e41bff31888..384e29f6bef0 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -237,5 +237,30 @@ TRACE_EVENT(mm_khugepaged_collapse_file,
>  		__print_symbolic(__entry->result, SCAN_STATUS))
>  );
>  
> +TRACE_EVENT(mm_khugepaged_scan,
> +
> +	TP_PROTO(struct mm_struct *mm, unsigned int progress,
> +		 bool full_scan_finished),
> +
> +	TP_ARGS(mm, progress, full_scan_finished),
> +
> +	TP_STRUCT__entry(
> +		__field(struct mm_struct *, mm)
> +		__field(unsigned int, progress)
> +		__field(bool, full_scan_finished)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->mm = mm;
> +		__entry->progress = progress;
> +		__entry->full_scan_finished = full_scan_finished;
> +	),
> +
> +	TP_printk("mm=%p, progress=%u, full_scan_finished=%d",
> +		__entry->mm,
> +		__entry->progress,
> +		__entry->full_scan_finished)
> +);
> +
>  #endif /* __HUGE_MEMORY_H */
>  #include <trace/define_trace.h>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index e76f42243534..6f0f05148765 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2535,6 +2535,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>  		collect_mm_slot(slot);
>  	}
>  
> +	trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
> +
>  	return progress;
>  }
>  


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 2/5] mm: khugepaged: refine scan progress number
  2026-01-23  8:22 ` [PATCH mm-new v5 2/5] mm: khugepaged: refine scan progress number Vernon Yang
@ 2026-01-23 10:46   ` Dev Jain
  2026-01-23 15:25     ` Vernon Yang
  2026-01-23 15:19   ` Matthew Wilcox
  2026-01-28  8:29   ` Dev Jain
  2 siblings, 1 reply; 31+ messages in thread
From: Dev Jain @ 2026-01-23 10:46 UTC (permalink / raw)
  To: Vernon Yang, akpm, david
  Cc: lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm, linux-kernel,
	Vernon Yang


On 23/01/26 1:52 pm, Vernon Yang wrote:
> From: Vernon Yang <yanglincheng@kylinos.cn>
>
> Currently, each scan always increases "progress" by HPAGE_PMD_NR,
> even if only scanning a single PTE/PMD entry.
>
> - When only scanning a sigle PTE entry, let me provide a detailed
>   example:
>
> static int hpage_collapse_scan_pmd()
> {
> 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> 	     _pte++, addr += PAGE_SIZE) {
> 		pte_t pteval = ptep_get(_pte);
> 		...
> 		if (pte_uffd_wp(pteval)) { <-- first scan hit
> 			result = SCAN_PTE_UFFD_WP;
> 			goto out_unmap;
> 		}
> 	}
> }
>
> During the first scan, if pte_uffd_wp(pteval) is true, the loop exits
> directly. In practice, only one PTE is scanned before termination.
> Here, "progress += 1" reflects the actual number of PTEs scanned, but
> previously "progress += HPAGE_PMD_NR" always.
>
> - When the memory has been collapsed to PMD, let me provide a detailed
>   example:
>
> The following data is traced by bpftrace on a desktop system. After
> the system has been left idle for 10 minutes upon booting, a lot of
> SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE are observed during a full scan
> by khugepaged.
>
> @scan_pmd_status[1]: 1           ## SCAN_SUCCEED
> @scan_pmd_status[6]: 2           ## SCAN_EXCEED_SHARED_PTE
> @scan_pmd_status[3]: 142         ## SCAN_PMD_MAPPED
> @scan_pmd_status[2]: 178         ## SCAN_NO_PTE_TABLE
> total progress size: 674 MB
> Total time         : 419 seconds ## include khugepaged_scan_sleep_millisecs
>
> The khugepaged_scan list save all task that support collapse into hugepage,
> as long as the task is not destroyed, khugepaged will not remove it from
> the khugepaged_scan list. This exist a phenomenon where task has already
> collapsed all memory regions into hugepage, but khugepaged continues to
> scan it, which wastes CPU time and invalid, and due to
> khugepaged_scan_sleep_millisecs (default 10s) causes a long wait for
> scanning a large number of invalid task, so scanning really valid task
> is later.
>
> After applying this patch, when the memory is either SCAN_PMD_MAPPED or
> SCAN_NO_PTE_TABLE, just skip it, as follow:
>
> @scan_pmd_status[6]: 2
> @scan_pmd_status[3]: 147
> @scan_pmd_status[2]: 173
> total progress size: 45 MB
> Total time         : 20 seconds
>
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> ---
>  include/linux/xarray.h |  9 ++++++++
>  mm/khugepaged.c        | 47 ++++++++++++++++++++++++++++++++++--------
>  2 files changed, 47 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/xarray.h b/include/linux/xarray.h
> index be850174e802..f77d97d7b957 100644
> --- a/include/linux/xarray.h
> +++ b/include/linux/xarray.h
> @@ -1646,6 +1646,15 @@ static inline void xas_set(struct xa_state *xas, unsigned long index)
>  	xas->xa_node = XAS_RESTART;
>  }
>  
> +/**
> + * xas_get_index() - Get XArray operation state for a different index.
> + * @xas: XArray operation state.
> + */
> +static inline unsigned long xas_get_index(struct xa_state *xas)
> +{
> +	return xas->xa_index;
> +}
> +

Why is this needed?

>  /**
>   * xas_advance() - Skip over sibling entries.
>   * @xas: XArray operation state.
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 6f0f05148765..de95029e3763 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -68,7 +68,10 @@ enum scan_result {
>  static struct task_struct *khugepaged_thread __read_mostly;
>  static DEFINE_MUTEX(khugepaged_mutex);
>  
> -/* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 second */
> +/*
> + * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
> + * every 10 second.
> + */
>  static unsigned int khugepaged_pages_to_scan __read_mostly;
>  static unsigned int khugepaged_pages_collapsed;
>  static unsigned int khugepaged_full_scans;
> @@ -1240,7 +1243,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  }
>  
>  static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> -		struct vm_area_struct *vma, unsigned long start_addr, bool *mmap_locked,
> +		struct vm_area_struct *vma, unsigned long start_addr,
> +		bool *mmap_locked, unsigned int *cur_progress,
>  		struct collapse_control *cc)
>  {
>  	pmd_t *pmd;
> @@ -1255,6 +1259,9 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>  
>  	VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
>  
> +	if (cur_progress)
> +		*cur_progress += 1;
> +

1. Why do we need to do if (cur_progress). Isn't it guaranteed that the pointer will never be NULL.
2. Why do we increment this on function entry?

>  	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
>  	if (result != SCAN_SUCCEED)
>  		goto out;
> @@ -1396,6 +1403,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>  		result = SCAN_SUCCEED;
>  	}
>  out_unmap:
> +	if (cur_progress) {
> +		if (_pte >= pte + HPAGE_PMD_NR)
> +			*cur_progress += HPAGE_PMD_NR - 1;
> +		else
> +			*cur_progress += _pte - pte;
> +	}

Why are two cases required here - shouldn't it just be _pte - pte?

>  	pte_unmap_unlock(pte, ptl);
>  	if (result == SCAN_SUCCEED) {
>  		result = collapse_huge_page(mm, start_addr, referenced,
> @@ -2286,8 +2299,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>  	return result;
>  }
>  
> -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> -		struct file *file, pgoff_t start, struct collapse_control *cc)
> +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
> +		unsigned long addr, struct file *file, pgoff_t start,
> +		unsigned int *cur_progress, struct collapse_control *cc)
>  {
>  	struct folio *folio = NULL;
>  	struct address_space *mapping = file->f_mapping;
> @@ -2376,6 +2390,18 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
>  			cond_resched_rcu();
>  		}
>  	}
> +	if (cur_progress) {
> +		unsigned long idx = xas_get_index(&xas) - start;
> +
> +		if (folio == NULL)
> +			*cur_progress += HPAGE_PMD_NR;
> +		else if (xa_is_value(folio))
> +			*cur_progress += idx + (1 << xas_get_order(&xas));
> +		else if (folio_order(folio) == HPAGE_PMD_ORDER)
> +			*cur_progress += idx + 1;
> +		else
> +			*cur_progress += idx + folio_nr_pages(folio);
> +	}
>  	rcu_read_unlock();
>  
>  	if (result == SCAN_SUCCEED) {
> @@ -2456,6 +2482,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>  
>  		while (khugepaged_scan.address < hend) {
>  			bool mmap_locked = true;
> +			unsigned int cur_progress = 0;
>  
>  			cond_resched();
>  			if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> @@ -2472,7 +2499,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>  				mmap_read_unlock(mm);
>  				mmap_locked = false;
>  				*result = hpage_collapse_scan_file(mm,
> -					khugepaged_scan.address, file, pgoff, cc);
> +					khugepaged_scan.address, file, pgoff,
> +					&cur_progress, cc);
>  				fput(file);
>  				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
>  					mmap_read_lock(mm);
> @@ -2486,7 +2514,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>  				}
>  			} else {
>  				*result = hpage_collapse_scan_pmd(mm, vma,
> -					khugepaged_scan.address, &mmap_locked, cc);
> +					khugepaged_scan.address, &mmap_locked,
> +					&cur_progress, cc);
>  			}
>  
>  			if (*result == SCAN_SUCCEED)
> @@ -2494,7 +2523,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>  
>  			/* move to next address */
>  			khugepaged_scan.address += HPAGE_PMD_SIZE;
> -			progress += HPAGE_PMD_NR;
> +			progress += cur_progress;
>  			if (!mmap_locked)
>  				/*
>  				 * We released mmap_lock so break loop.  Note
> @@ -2817,7 +2846,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  			mmap_locked = false;
>  			*lock_dropped = true;
>  			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> -							  cc);
> +							  NULL, cc);
>  
>  			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
>  			    mapping_can_writeback(file->f_mapping)) {
> @@ -2832,7 +2861,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  			fput(file);
>  		} else {
>  			result = hpage_collapse_scan_pmd(mm, vma, addr,
> -							 &mmap_locked, cc);
> +							 &mmap_locked, NULL, cc);
>  		}
>  		if (!mmap_locked)
>  			*lock_dropped = true;


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 3/5] mm: add folio_test_lazyfree helper
  2026-01-23  8:22 ` [PATCH mm-new v5 3/5] mm: add folio_test_lazyfree helper Vernon Yang
@ 2026-01-23 10:54   ` Dev Jain
  2026-01-26  1:52   ` Barry Song
  1 sibling, 0 replies; 31+ messages in thread
From: Dev Jain @ 2026-01-23 10:54 UTC (permalink / raw)
  To: Vernon Yang, akpm, david
  Cc: lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm, linux-kernel,
	Vernon Yang


On 23/01/26 1:52 pm, Vernon Yang wrote:
> From: Vernon Yang <yanglincheng@kylinos.cn>
>
> Add folio_test_lazyfree() function to identify lazy-free folios to improve
> code readability.
>
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>

Thanks, this is very useful.

Reviewed-by: Dev Jain <dev.jain@arm.com>

> ---
>  include/linux/page-flags.h | 5 +++++
>  mm/rmap.c                  | 2 +-
>  mm/vmscan.c                | 5 ++---
>  3 files changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index f7a0e4af0c73..415e9f2ef616 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -724,6 +724,11 @@ static __always_inline bool folio_test_anon(const struct folio *folio)
>  	return ((unsigned long)folio->mapping & FOLIO_MAPPING_ANON) != 0;
>  }
>  
> +static __always_inline bool folio_test_lazyfree(const struct folio *folio)
> +{
> +	return folio_test_anon(folio) && !folio_test_swapbacked(folio);
> +}
> +
>  static __always_inline bool PageAnonNotKsm(const struct page *page)
>  {
>  	unsigned long flags = (unsigned long)page_folio(page)->mapping;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 088723f909e4..6b09be06c707 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2049,7 +2049,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  		}
>  
>  		if (!pvmw.pte) {
> -			if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
> +			if (folio_test_lazyfree(folio)) {
>  				if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
>  					goto walk_done;
>  				/*
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8643c0fb9162..1f32c74ec738 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -944,8 +944,7 @@ static void folio_check_dirty_writeback(struct folio *folio,
>  	 * They could be mistakenly treated as file lru. So further anon
>  	 * test is needed.
>  	 */
> -	if (!folio_is_file_lru(folio) ||
> -	    (folio_test_anon(folio) && !folio_test_swapbacked(folio))) {
> +	if (!folio_is_file_lru(folio) || folio_test_lazyfree(folio)) {
>  		*dirty = false;
>  		*writeback = false;
>  		return;
> @@ -1489,7 +1488,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  			}
>  		}
>  
> -		if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
> +		if (folio_test_lazyfree(folio)) {
>  			/* follow __remove_mapping for reference */
>  			if (!folio_ref_freeze(folio, 1))
>  				goto keep_locked;


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 5/5] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
  2026-01-23  8:22 ` [PATCH mm-new v5 5/5] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY Vernon Yang
@ 2026-01-23 12:40   ` Dev Jain
  2026-01-23 15:32     ` Vernon Yang
  2026-01-26  2:18     ` Barry Song
  0 siblings, 2 replies; 31+ messages in thread
From: Dev Jain @ 2026-01-23 12:40 UTC (permalink / raw)
  To: Vernon Yang, akpm, david
  Cc: lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm, linux-kernel,
	Vernon Yang


On 23/01/26 1:52 pm, Vernon Yang wrote:
> From: Vernon Yang <yanglincheng@kylinos.cn>
>
> When an mm with the MMF_DISABLE_THP_COMPLETELY flag is detected during
> scanning, directly set khugepaged_scan.mm_slot to the next mm_slot,
> reduce redundant operation.
>
> Without this patch, entering khugepaged_scan_mm_slot() next time, we
> will set khugepaged_scan.mm_slot to the next mm_slot.
>
> With this patch, we will directly set khugepaged_scan.mm_slot to the
> next mm_slot.
>
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> ---
>  mm/khugepaged.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index be1c09842ea2..1dc19c1b1f97 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2558,7 +2558,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>  	 * Release the current mm_slot if this mm is about to die, or
>  	 * if we scanned all vmas of this mm.
>  	 */

Need to update comment: "...or THP got disabled".

Reviewed-by: Dev Jain <dev.jain@arm.com>

> -	if (hpage_collapse_test_exit(mm) || !vma) {
> +	if (hpage_collapse_test_exit_or_disable(mm) || !vma) {
>  		/*
>  		 * Make sure that if mm_users is reaching zero while
>  		 * khugepaged runs here, khugepaged_exit will find


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 4/5] mm: khugepaged: skip lazy-free folios
  2026-01-23  9:09   ` Lance Yang
@ 2026-01-23 15:08     ` Vernon Yang
  2026-01-23 16:32       ` Lance Yang
  0 siblings, 1 reply; 31+ messages in thread
From: Vernon Yang @ 2026-01-23 15:08 UTC (permalink / raw)
  To: Lance Yang
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, linux-mm, linux-kernel,
	david, Vernon Yang, akpm

On Fri, Jan 23, 2026 at 5:09 PM Lance Yang <lance.yang@linux.dev> wrote:
>
> On 2026/1/23 16:22, Vernon Yang wrote:
> > From: Vernon Yang <yanglincheng@kylinos.cn>
> >
> > For example, create three task: hot1 -> cold -> hot2. After all three
> > task are created, each allocate memory 128MB. the hot1/hot2 task
> > continuously access 128 MB memory, while the cold task only accesses
> > its memory briefly andthen call madvise(MADV_FREE). However, khugepaged
> > still prioritizes scanning the cold task and only scans the hot2 task
> > after completing the scan of the cold task.
> >
> > And if we collapse with a lazyfree page, that content will never be none
> > and the deferred shrinker cannot reclaim them.
> >
> > So if the user has explicitly informed us via MADV_FREE that this memory
> > will be freed, it is appropriate for khugepaged to skip it only, thereby
> > avoiding unnecessary scan and collapse operations to reducing CPU
> > wastage.
> >
> > Here are the performance test results:
> > (Throughput bigger is better, other smaller is better)
> >
> > Testing on x86_64 machine:
> >
> > | task hot2           | without patch | with patch    |  delta  |
> > |---------------------|---------------|---------------|---------|
> > | total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
> > | cycles per access   |  4.96         |  2.21         | -55.44% |
> > | Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
> > | dTLB-load-misses    |  284814532    |  69597236     | -75.56% |
> >
> > Testing on qemu-system-x86_64 -enable-kvm:
> >
> > | task hot2           | without patch | with patch    |  delta  |
> > |---------------------|---------------|---------------|---------|
> > | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
> > | cycles per access   |  7.29         |  2.07         | -71.60% |
> > | Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
> > | dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
> >
> > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > ---
> >   include/trace/events/huge_memory.h |  1 +
> >   mm/khugepaged.c                    | 11 +++++++++++
> >   2 files changed, 12 insertions(+)
> >
> > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > index 384e29f6bef0..bcdc57eea270 100644
> > --- a/include/trace/events/huge_memory.h
> > +++ b/include/trace/events/huge_memory.h
> > @@ -25,6 +25,7 @@
> >       EM( SCAN_PAGE_LRU,              "page_not_in_lru")              \
> >       EM( SCAN_PAGE_LOCK,             "page_locked")                  \
> >       EM( SCAN_PAGE_ANON,             "page_not_anon")                \
> > +     EM( SCAN_PAGE_LAZYFREE,         "page_lazyfree")                \
> >       EM( SCAN_PAGE_COMPOUND,         "page_compound")                \
> >       EM( SCAN_ANY_PROCESS,           "no_process_for_page")          \
> >       EM( SCAN_VMA_NULL,              "vma_null")                     \
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index de95029e3763..be1c09842ea2 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -46,6 +46,7 @@ enum scan_result {
> >       SCAN_PAGE_LRU,
> >       SCAN_PAGE_LOCK,
> >       SCAN_PAGE_ANON,
> > +     SCAN_PAGE_LAZYFREE,
> >       SCAN_PAGE_COMPOUND,
> >       SCAN_ANY_PROCESS,
> >       SCAN_VMA_NULL,
> > @@ -583,6 +584,11 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >               folio = page_folio(page);
> >               VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
> >
> > +             if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {
>
> I'm wondering if we need "cc->is_khugepaged &&" as well here?
>
> We should allow users to enforce collapse via the madvise_collapse()
> path even if pages are marked lazyfree, IMHO.

$ man madvise
MADV_COLLAPSE
        Perform a best-effort synchronous collapse of the native pages
        mapped by the memory range into Transparent Huge Pages (THPs).

The semantics of MADV_COLLAPSE are best-effort and do not imply enforced
collapsing, so we don't need "cc->is_khugepaged" here.

We can imagine that if a user simultaneously uses MADV_FREE and
MADV_COLLAPSE, it indicates a misunderstanding of their semantics.
As the kernel, we need to safeguard the baseline.

> > +                     result = SCAN_PAGE_LAZYFREE;
> > +                     goto out;
> > +             }
> > +
> >               /* See hpage_collapse_scan_pmd(). */
> >               if (folio_maybe_mapped_shared(folio)) {
> >                       ++shared;
> > @@ -1330,6 +1336,11 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >               }
> >               folio = page_folio(page);
> >
> > +             if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {
>
> Ditto.
>
> > +                     result = SCAN_PAGE_LAZYFREE;
> > +                     goto out_unmap;
> > +             }
> > +
> >               if (!folio_test_anon(folio)) {
> >                       result = SCAN_PAGE_ANON;
> >                       goto out_unmap;
>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 2/5] mm: khugepaged: refine scan progress number
  2026-01-23  8:22 ` [PATCH mm-new v5 2/5] mm: khugepaged: refine scan progress number Vernon Yang
  2026-01-23 10:46   ` Dev Jain
@ 2026-01-23 15:19   ` Matthew Wilcox
  2026-01-23 15:29     ` Vernon Yang
  2026-01-28  8:29   ` Dev Jain
  2 siblings, 1 reply; 31+ messages in thread
From: Matthew Wilcox @ 2026-01-23 15:19 UTC (permalink / raw)
  To: Vernon Yang
  Cc: akpm, david, lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang,
	linux-mm, linux-kernel, Vernon Yang

On Fri, Jan 23, 2026 at 04:22:29PM +0800, Vernon Yang wrote:
> +	if (cur_progress) {
> +		unsigned long idx = xas_get_index(&xas) - start;

I agree with Dev.  This is less readable than:

		unsigned long idx = xas.xa_index - start;



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 2/5] mm: khugepaged: refine scan progress number
  2026-01-23 10:46   ` Dev Jain
@ 2026-01-23 15:25     ` Vernon Yang
  0 siblings, 0 replies; 31+ messages in thread
From: Vernon Yang @ 2026-01-23 15:25 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
	linux-kernel, Vernon Yang

On Fri, Jan 23, 2026 at 6:46 PM Dev Jain <dev.jain@arm.com> wrote:
>
> On 23/01/26 1:52 pm, Vernon Yang wrote:
> > From: Vernon Yang <yanglincheng@kylinos.cn>
> >
> > Currently, each scan always increases "progress" by HPAGE_PMD_NR,
> > even if only scanning a single PTE/PMD entry.
> >
> > - When only scanning a sigle PTE entry, let me provide a detailed
> >   example:
> >
> > static int hpage_collapse_scan_pmd()
> > {
> >       for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> >            _pte++, addr += PAGE_SIZE) {
> >               pte_t pteval = ptep_get(_pte);
> >               ...
> >               if (pte_uffd_wp(pteval)) { <-- first scan hit
> >                       result = SCAN_PTE_UFFD_WP;
> >                       goto out_unmap;
> >               }
> >       }
> > }
> >
> > During the first scan, if pte_uffd_wp(pteval) is true, the loop exits
> > directly. In practice, only one PTE is scanned before termination.
> > Here, "progress += 1" reflects the actual number of PTEs scanned, but
> > previously "progress += HPAGE_PMD_NR" always.
> >
> > - When the memory has been collapsed to PMD, let me provide a detailed
> >   example:
> >
> > The following data is traced by bpftrace on a desktop system. After
> > the system has been left idle for 10 minutes upon booting, a lot of
> > SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE are observed during a full scan
> > by khugepaged.
> >
> > @scan_pmd_status[1]: 1           ## SCAN_SUCCEED
> > @scan_pmd_status[6]: 2           ## SCAN_EXCEED_SHARED_PTE
> > @scan_pmd_status[3]: 142         ## SCAN_PMD_MAPPED
> > @scan_pmd_status[2]: 178         ## SCAN_NO_PTE_TABLE
> > total progress size: 674 MB
> > Total time         : 419 seconds ## include khugepaged_scan_sleep_millisecs
> >
> > The khugepaged_scan list save all task that support collapse into hugepage,
> > as long as the task is not destroyed, khugepaged will not remove it from
> > the khugepaged_scan list. This exist a phenomenon where task has already
> > collapsed all memory regions into hugepage, but khugepaged continues to
> > scan it, which wastes CPU time and invalid, and due to
> > khugepaged_scan_sleep_millisecs (default 10s) causes a long wait for
> > scanning a large number of invalid task, so scanning really valid task
> > is later.
> >
> > After applying this patch, when the memory is either SCAN_PMD_MAPPED or
> > SCAN_NO_PTE_TABLE, just skip it, as follow:
> >
> > @scan_pmd_status[6]: 2
> > @scan_pmd_status[3]: 147
> > @scan_pmd_status[2]: 173
> > total progress size: 45 MB
> > Total time         : 20 seconds
> >
> > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > ---
> >  include/linux/xarray.h |  9 ++++++++
> >  mm/khugepaged.c        | 47 ++++++++++++++++++++++++++++++++++--------
> >  2 files changed, 47 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/xarray.h b/include/linux/xarray.h
> > index be850174e802..f77d97d7b957 100644
> > --- a/include/linux/xarray.h
> > +++ b/include/linux/xarray.h
> > @@ -1646,6 +1646,15 @@ static inline void xas_set(struct xa_state *xas, unsigned long index)
> >       xas->xa_node = XAS_RESTART;
> >  }
> >
> > +/**
> > + * xas_get_index() - Get XArray operation state for a different index.
> > + * @xas: XArray operation state.
> > + */
> > +static inline unsigned long xas_get_index(struct xa_state *xas)
> > +{
> > +     return xas->xa_index;
> > +}
> > +
>
> Why is this needed?

Please refer to hpage_collapse_scan_file(), which obtains the starting
index of the last folio when exiting xas_for_each().

Using "xas->xa_index" directly is also acceptable.

> >  /**
> >   * xas_advance() - Skip over sibling entries.
> >   * @xas: XArray operation state.
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 6f0f05148765..de95029e3763 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -68,7 +68,10 @@ enum scan_result {
> >  static struct task_struct *khugepaged_thread __read_mostly;
> >  static DEFINE_MUTEX(khugepaged_mutex);
> >
> > -/* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 second */
> > +/*
> > + * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
> > + * every 10 second.
> > + */
> >  static unsigned int khugepaged_pages_to_scan __read_mostly;
> >  static unsigned int khugepaged_pages_collapsed;
> >  static unsigned int khugepaged_full_scans;
> > @@ -1240,7 +1243,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >  }
> >
> >  static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> > -             struct vm_area_struct *vma, unsigned long start_addr, bool *mmap_locked,
> > +             struct vm_area_struct *vma, unsigned long start_addr,
> > +             bool *mmap_locked, unsigned int *cur_progress,
> >               struct collapse_control *cc)
> >  {
> >       pmd_t *pmd;
> > @@ -1255,6 +1259,9 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >
> >       VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
> >
> > +     if (cur_progress)
> > +             *cur_progress += 1;
> > +
>
> 1. Why do we need to do if (cur_progress). Isn't it guaranteed that the pointer will never be NULL.

cur_progress is NULL when called from madvise_collapse().

> 2. Why do we increment this on function entry?

When the memory has already been collapsed to a PMD, we increment by one
only, the same as PATCH #3 of the previous version [1].

[1] https://lore.kernel.org/linux-mm/20260111121909.8410-4-yanglincheng@kylinos.cn

>
> >       result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
> >       if (result != SCAN_SUCCEED)
> >               goto out;
> > @@ -1396,6 +1403,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >               result = SCAN_SUCCEED;
> >       }
> >  out_unmap:
> > +     if (cur_progress) {
> > +             if (_pte >= pte + HPAGE_PMD_NR)
> > +                     *cur_progress += HPAGE_PMD_NR - 1;
> > +             else
> > +                     *cur_progress += _pte - pte;
> > +     }
>
> Why are two cases required here - shouldn't it just be _pte - pte?

At function entry we do "cur_progress += 1". If this PMD is SCAN_PMD_MAPPED
or SCAN_NO_PTE_TABLE, we return directly; otherwise:

When the loop exits with "_pte == pte + HPAGE_PMD_NR", HPAGE_PMD_NR
(i.e. "_pte - pte") PTEs have been scanned, so "cur_progress" needs to be
incremented by "HPAGE_PMD_NR - 1".

When the loop breaks early with "_pte < pte + HPAGE_PMD_NR", "_pte - pte + 1"
PTEs have been scanned, so "cur_progress" needs to be incremented by
"_pte - pte".
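
To put numbers on it (assuming HPAGE_PMD_NR == 512, i.e. x86_64 with 4K
pages):

1. SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE: we return before the loop, so
   cur_progress = 1.
2. full scan, loop exits with "_pte == pte + HPAGE_PMD_NR":
   cur_progress = 1 + (HPAGE_PMD_NR - 1) = 512, every PTE counted once.
3. early break at the third PTE ("_pte - pte == 2"):
   cur_progress = 1 + 2 = 3, matching the three PTEs actually scanned.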

> >       pte_unmap_unlock(pte, ptl);
> >       if (result == SCAN_SUCCEED) {
> >               result = collapse_huge_page(mm, start_addr, referenced,
> > @@ -2286,8 +2299,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> >       return result;
> >  }
> >
> > -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> > -             struct file *file, pgoff_t start, struct collapse_control *cc)
> > +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
> > +             unsigned long addr, struct file *file, pgoff_t start,
> > +             unsigned int *cur_progress, struct collapse_control *cc)
> >  {
> >       struct folio *folio = NULL;
> >       struct address_space *mapping = file->f_mapping;
> > @@ -2376,6 +2390,18 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
> >                       cond_resched_rcu();
> >               }
> >       }
> > +     if (cur_progress) {
> > +             unsigned long idx = xas_get_index(&xas) - start;
> > +
> > +             if (folio == NULL)
> > +                     *cur_progress += HPAGE_PMD_NR;
> > +             else if (xa_is_value(folio))
> > +                     *cur_progress += idx + (1 << xas_get_order(&xas));
> > +             else if (folio_order(folio) == HPAGE_PMD_ORDER)
> > +                     *cur_progress += idx + 1;
> > +             else
> > +                     *cur_progress += idx + folio_nr_pages(folio);
> > +     }
> >       rcu_read_unlock();
> >
> >       if (result == SCAN_SUCCEED) {
> > @@ -2456,6 +2482,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
> >
> >               while (khugepaged_scan.address < hend) {
> >                       bool mmap_locked = true;
> > +                     unsigned int cur_progress = 0;
> >
> >                       cond_resched();
> >                       if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> > @@ -2472,7 +2499,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
> >                               mmap_read_unlock(mm);
> >                               mmap_locked = false;
> >                               *result = hpage_collapse_scan_file(mm,
> > -                                     khugepaged_scan.address, file, pgoff, cc);
> > +                                     khugepaged_scan.address, file, pgoff,
> > +                                     &cur_progress, cc);
> >                               fput(file);
> >                               if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> >                                       mmap_read_lock(mm);
> > @@ -2486,7 +2514,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
> >                               }
> >                       } else {
> >                               *result = hpage_collapse_scan_pmd(mm, vma,
> > -                                     khugepaged_scan.address, &mmap_locked, cc);
> > +                                     khugepaged_scan.address, &mmap_locked,
> > +                                     &cur_progress, cc);
> >                       }
> >
> >                       if (*result == SCAN_SUCCEED)
> > @@ -2494,7 +2523,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
> >
> >                       /* move to next address */
> >                       khugepaged_scan.address += HPAGE_PMD_SIZE;
> > -                     progress += HPAGE_PMD_NR;
> > +                     progress += cur_progress;
> >                       if (!mmap_locked)
> >                               /*
> >                                * We released mmap_lock so break loop.  Note
> > @@ -2817,7 +2846,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> >                       mmap_locked = false;
> >                       *lock_dropped = true;
> >                       result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> > -                                                       cc);
> > +                                                       NULL, cc);
> >
> >                       if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
> >                           mapping_can_writeback(file->f_mapping)) {
> > @@ -2832,7 +2861,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> >                       fput(file);
> >               } else {
> >                       result = hpage_collapse_scan_pmd(mm, vma, addr,
> > -                                                      &mmap_locked, cc);
> > +                                                      &mmap_locked, NULL, cc);
> >               }
> >               if (!mmap_locked)
> >                       *lock_dropped = true;


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 2/5] mm: khugepaged: refine scan progress number
  2026-01-23 15:19   ` Matthew Wilcox
@ 2026-01-23 15:29     ` Vernon Yang
  0 siblings, 0 replies; 31+ messages in thread
From: Vernon Yang @ 2026-01-23 15:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, david, lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang,
	linux-mm, linux-kernel, Vernon Yang

On Fri, Jan 23, 2026 at 11:19 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Jan 23, 2026 at 04:22:29PM +0800, Vernon Yang wrote:
> > +     if (cur_progress) {
> > +             unsigned long idx = xas_get_index(&xas) - start;
>
> I agree with Dev.  This is less readable than:
>
>                 unsigned long idx = xas.xa_index - start;

LGTM, I will use "xas->xa_index" directly in the next version. Thanks!


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 5/5] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
  2026-01-23 12:40   ` Dev Jain
@ 2026-01-23 15:32     ` Vernon Yang
  2026-01-26  2:18     ` Barry Song
  1 sibling, 0 replies; 31+ messages in thread
From: Vernon Yang @ 2026-01-23 15:32 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
	linux-kernel, Vernon Yang

On Fri, Jan 23, 2026 at 8:40 PM Dev Jain <dev.jain@arm.com> wrote:
>
> On 23/01/26 1:52 pm, Vernon Yang wrote:
> > From: Vernon Yang <yanglincheng@kylinos.cn>
> >
> > When an mm with the MMF_DISABLE_THP_COMPLETELY flag is detected during
> > scanning, directly set khugepaged_scan.mm_slot to the next mm_slot,
> > reduce redundant operation.
> >
> > Without this patch, entering khugepaged_scan_mm_slot() next time, we
> > will set khugepaged_scan.mm_slot to the next mm_slot.
> >
> > With this patch, we will directly set khugepaged_scan.mm_slot to the
> > next mm_slot.
> >
> > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
> > Reviewed-by: Lance Yang <lance.yang@linux.dev>
> > ---
> >  mm/khugepaged.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index be1c09842ea2..1dc19c1b1f97 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2558,7 +2558,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
> >        * Release the current mm_slot if this mm is about to die, or
> >        * if we scanned all vmas of this mm.
> >        */
>
> Need to update comment: "...or THP got disabled".

LGTM, I will update the comment in the next version. Thanks!

> Reviewed-by: Dev Jain <dev.jain@arm.com>
>
> > -     if (hpage_collapse_test_exit(mm) || !vma) {
> > +     if (hpage_collapse_test_exit_or_disable(mm) || !vma) {
> >               /*
> >                * Make sure that if mm_users is reaching zero while
> >                * khugepaged runs here, khugepaged_exit will find


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 4/5] mm: khugepaged: skip lazy-free folios
  2026-01-23 15:08     ` Vernon Yang
@ 2026-01-23 16:32       ` Lance Yang
  2026-01-24  3:22         ` Vernon Yang
  0 siblings, 1 reply; 31+ messages in thread
From: Lance Yang @ 2026-01-23 16:32 UTC (permalink / raw)
  To: Vernon Yang
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, linux-mm, linux-kernel,
	david, Vernon Yang, akpm



On 2026/1/23 23:08, Vernon Yang wrote:
> On Fri, Jan 23, 2026 at 5:09 PM Lance Yang <lance.yang@linux.dev> wrote:
>>
>> On 2026/1/23 16:22, Vernon Yang wrote:
>>> From: Vernon Yang <yanglincheng@kylinos.cn>
>>>

[...]

>>> @@ -583,6 +584,11 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>>                folio = page_folio(page);
>>>                VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>>>
>>> +             if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {
>>
>> I'm wondering if we need "cc->is_khugepaged &&" as well here?
>>
>> We should allow users to enforce collapse via the madvise_collapse()
>> path even if pages are marked lazyfree, IMHO.
> 
> $ man madvise
> MADV_COLLAPSE
>          Perform a best-effort synchronous collapse of the native pages
>          mapped by the memory range into Transparent Huge Pages (THPs).
> 
> The semantics of MADV_COLLAPSE are best-effort and do not imply to enforce
> collapsing, so we don't need "cc->is_khugepaged" here.
> 
> We can imagine that if a user simultaneously uses MADV_FREE and
> MADV_COLLAPSE, it indicates a misunderstanding of their semantics.
> As the kernel, we need to safeguard the baseline.

No. Afraid I don't think so.

To be clear, what I meant by "enforce":

Yep, MADV_COLLAPSE is best-effort - it can fail. But when users
call MADV_COLLAPSE, they're explicitly asking for collapse.

Compared to khugepaged just scanning around, that's already "enforce"
- users are actively requesting it, not passively waiting for it.

Note that you're *breaking* userspace. Users would not be able
to collapse the range where there are any lazyfree pages anymore,
even when they explicitly call MADV_COLLAPSE.

For khugepaged, skipping lazyfree makes sense.

> 
>>> +                     result = SCAN_PAGE_LAZYFREE;
>>> +                     goto out;
>>> +             }
>>> +
>>>                /* See hpage_collapse_scan_pmd(). */
>>>                if (folio_maybe_mapped_shared(folio)) {
>>>                        ++shared;
>>> @@ -1330,6 +1336,11 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>                }
>>>                folio = page_folio(page);
>>>
>>> +             if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {
>>
>> Ditto.
>>
>>> +                     result = SCAN_PAGE_LAZYFREE;
>>> +                     goto out_unmap;
>>> +             }
>>> +
>>>                if (!folio_test_anon(folio)) {
>>>                        result = SCAN_PAGE_ANON;
>>>                        goto out_unmap;
>>



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 4/5] mm: khugepaged: skip lazy-free folios
  2026-01-23 16:32       ` Lance Yang
@ 2026-01-24  3:22         ` Vernon Yang
  2026-01-24  6:48           ` Dev Jain
  0 siblings, 1 reply; 31+ messages in thread
From: Vernon Yang @ 2026-01-24  3:22 UTC (permalink / raw)
  To: david, Lance Yang, baohua, dev.jain
  Cc: lorenzo.stoakes, ziy, linux-mm, linux-kernel, Vernon Yang, akpm

On Sat, Jan 24, 2026 at 12:32 AM Lance Yang <lance.yang@linux.dev> wrote:
>
> On 2026/1/23 23:08, Vernon Yang wrote:
> > On Fri, Jan 23, 2026 at 5:09 PM Lance Yang <lance.yang@linux.dev> wrote:
> >>
> >> On 2026/1/23 16:22, Vernon Yang wrote:
> >>> From: Vernon Yang <yanglincheng@kylinos.cn>
> >>>
>
> [...]
>
> >>> @@ -583,6 +584,11 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >>>                folio = page_folio(page);
> >>>                VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
> >>>
> >>> +             if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {
> >>
> >> I'm wondering if we need "cc->is_khugepaged &&" as well here?
> >>
> >> We should allow users to enforce collapse via the madvise_collapse()
> >> path even if pages are marked lazyfree, IMHO.
> >
> > $ man madvise
> > MADV_COLLAPSE
> >          Perform a best-effort synchronous collapse of the native pages
> >          mapped by the memory range into Transparent Huge Pages (THPs).
> >
> > The semantics of MADV_COLLAPSE are best-effort and do not imply to enforce
> > collapsing, so we don't need "cc->is_khugepaged" here.
> >
> > We can imagine that if a user simultaneously uses MADV_FREE and
> > MADV_COLLAPSE, it indicates a misunderstanding of their semantics.
> > As the kernel, we need to safeguard the baseline.
>
> No. Afraid I don't think so.
>
> To be clear, what I meant by "enforce":
>
> Yep, MADV_COLLAPSE is best-effort - it can fail. But when users
> call MADV_COLLAPSE, they're explicitly asking for collapse.
>
> Compared to khugepaged just scanning around, that's already "enforce"
> - users are actively requesting it, not passively waiting for.
>
> Note that you're *breaking* userspace. Users would not be able
> to collapse the range where there are any lazyfree pages anymore,
> even when they explicitly call MADV_COLLAPSE.
>
> For khugepaged, skipping lazyfree makes sense.

I get your meaning; this comes down to two questions:

1. Do the semantics of best-effort imply any "enforce" meaning?
2. When a range gets both MADV_FREE and MADV_COLLAPSE, do we want to
   collapse lazyfree folios?

This is a question of semantics, and I'd like to hear others' opinions.

> >
> >>> +                     result = SCAN_PAGE_LAZYFREE;
> >>> +                     goto out;
> >>> +             }
> >>> +
> >>>                /* See hpage_collapse_scan_pmd(). */
> >>>                if (folio_maybe_mapped_shared(folio)) {
> >>>                        ++shared;
> >>> @@ -1330,6 +1336,11 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >>>                }
> >>>                folio = page_folio(page);
> >>>
> >>> +             if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {
> >>
> >> Ditto.
> >>
> >>> +                     result = SCAN_PAGE_LAZYFREE;
> >>> +                     goto out_unmap;
> >>> +             }
> >>> +
> >>>                if (!folio_test_anon(folio)) {
> >>>                        result = SCAN_PAGE_ANON;
> >>>                        goto out_unmap;
> >>
>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 4/5] mm: khugepaged: skip lazy-free folios
  2026-01-24  3:22         ` Vernon Yang
@ 2026-01-24  6:48           ` Dev Jain
  2026-01-26  2:06             ` Barry Song
  0 siblings, 1 reply; 31+ messages in thread
From: Dev Jain @ 2026-01-24  6:48 UTC (permalink / raw)
  To: Vernon Yang, david, Lance Yang, baohua
  Cc: lorenzo.stoakes, ziy, linux-mm, linux-kernel, Vernon Yang, akpm


On 24/01/26 8:52 am, Vernon Yang wrote:
> On Sat, Jan 24, 2026 at 12:32 AM Lance Yang <lance.yang@linux.dev> wrote:
>> On 2026/1/23 23:08, Vernon Yang wrote:
>>> On Fri, Jan 23, 2026 at 5:09 PM Lance Yang <lance.yang@linux.dev> wrote:
>>>> On 2026/1/23 16:22, Vernon Yang wrote:
>>>>> From: Vernon Yang <yanglincheng@kylinos.cn>
>>>>>
>> [...]
>>
>>>>> @@ -583,6 +584,11 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>>>>                folio = page_folio(page);
>>>>>                VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>>>>>
>>>>> +             if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {
>>>> I'm wondering if we need "cc->is_khugepaged &&" as well here?
>>>>
>>>> We should allow users to enforce collapse via the madvise_collapse()
>>>> path even if pages are marked lazyfree, IMHO.
>>> $ man madvise
>>> MADV_COLLAPSE
>>>          Perform a best-effort synchronous collapse of the native pages
>>>          mapped by the memory range into Transparent Huge Pages (THPs).
>>>
>>> The semantics of MADV_COLLAPSE are best-effort and do not imply to enforce
>>> collapsing, so we don't need "cc->is_khugepaged" here.
>>>
>>> We can imagine that if a user simultaneously uses MADV_FREE and
>>> MADV_COLLAPSE, it indicates a misunderstanding of their semantics.
>>> As the kernel, we need to safeguard the baseline.
>> No. Afraid I don't think so.
>>
>> To be clear, what I meant by "enforce":
>>
>> Yep, MADV_COLLAPSE is best-effort - it can fail. But when users
>> call MADV_COLLAPSE, they're explicitly asking for collapse.
>>
>> Compared to khugepaged just scanning around, that's already "enforce"
>> - users are actively requesting it, not passively waiting for.
>>
>> Note that you're *breaking* userspace. Users would not be able
>> to collapse the range where there are any lazyfree pages anymore,
>> even when they explicitly call MADV_COLLAPSE.
>>
>> For khugepaged, skipping lazyfree makes sense.
> I got your meaning, this is equivalent to two questions:
>
> 1. Does the semantics of best-effort imply any "enforce" meaning?
> 2. When madvise(MADV_FREE| MADV_COLLAPSE), do we want to collapse
>    lazyfree folios?
>
> This is a semantic warning, and I'd like to hear others' opinions.

Lance is right. When the user does MADV_COLLAPSE, the kernel needs to try
its best to collapse. It may not be in the best interest of the user to
do MADV_FREE then MADV_COLLAPSE, but that is something the user has
to fix - the kernel does not need to think about it.

Regarding "best-effort": it is best-effort in the sense that
madvise(MADV_COLLAPSE) is a syscall needed not for correctness
but for optimization purposes. So it is not the end of the world
if the syscall fails. But since the user has decided to pay for an
expensive operation (a syscall), the kernel needs to try harder to
make sure those CPU cycles weren't wasted.

>
>>>>> +                     result = SCAN_PAGE_LAZYFREE;
>>>>> +                     goto out;
>>>>> +             }
>>>>> +
>>>>>                /* See hpage_collapse_scan_pmd(). */
>>>>>                if (folio_maybe_mapped_shared(folio)) {
>>>>>                        ++shared;
>>>>> @@ -1330,6 +1336,11 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>>>                }
>>>>>                folio = page_folio(page);
>>>>>
>>>>> +             if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {
>>>> Ditto.
>>>>
>>>>> +                     result = SCAN_PAGE_LAZYFREE;
>>>>> +                     goto out_unmap;
>>>>> +             }
>>>>> +
>>>>>                if (!folio_test_anon(folio)) {
>>>>>                        result = SCAN_PAGE_ANON;
>>>>>                        goto out_unmap;


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 3/5] mm: add folio_test_lazyfree helper
  2026-01-23  8:22 ` [PATCH mm-new v5 3/5] mm: add folio_test_lazyfree helper Vernon Yang
  2026-01-23 10:54   ` Dev Jain
@ 2026-01-26  1:52   ` Barry Song
  1 sibling, 0 replies; 31+ messages in thread
From: Barry Song @ 2026-01-26  1:52 UTC (permalink / raw)
  To: Vernon Yang
  Cc: akpm, david, lorenzo.stoakes, ziy, dev.jain, lance.yang,
	linux-mm, linux-kernel, Vernon Yang

On Fri, Jan 23, 2026 at 4:23 PM Vernon Yang <vernon2gm@gmail.com> wrote:
>
> From: Vernon Yang <yanglincheng@kylinos.cn>
>
> Add folio_test_lazyfree() function to identify lazy-free folios to improve
> code readability.
>
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>

LGTM,

Reviewed-by: Barry Song <baohua@kernel.org>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 4/5] mm: khugepaged: skip lazy-free folios
  2026-01-24  6:48           ` Dev Jain
@ 2026-01-26  2:06             ` Barry Song
  0 siblings, 0 replies; 31+ messages in thread
From: Barry Song @ 2026-01-26  2:06 UTC (permalink / raw)
  To: Dev Jain
  Cc: Vernon Yang, david, Lance Yang, lorenzo.stoakes, ziy, linux-mm,
	linux-kernel, Vernon Yang, akpm

On Sat, Jan 24, 2026 at 2:48 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
> On 24/01/26 8:52 am, Vernon Yang wrote:
> > On Sat, Jan 24, 2026 at 12:32 AM Lance Yang <lance.yang@linux.dev> wrote:
> >> On 2026/1/23 23:08, Vernon Yang wrote:
> >>> On Fri, Jan 23, 2026 at 5:09 PM Lance Yang <lance.yang@linux.dev> wrote:
> >>>> On 2026/1/23 16:22, Vernon Yang wrote:
> >>>>> From: Vernon Yang <yanglincheng@kylinos.cn>
> >>>>>
> >> [...]
> >>
> >>>>> @@ -583,6 +584,11 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >>>>>                folio = page_folio(page);
> >>>>>                VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
> >>>>>
> >>>>> +             if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {
> >>>> I'm wondering if we need "cc->is_khugepaged &&" as well here?
> >>>>
> >>>> We should allow users to enforce collapse via the madvise_collapse()
> >>>> path even if pages are marked lazyfree, IMHO.
> >>> $ man madvise
> >>> MADV_COLLAPSE
> >>>          Perform a best-effort synchronous collapse of the native pages
> >>>          mapped by the memory range into Transparent Huge Pages (THPs).
> >>>
> >>> The semantics of MADV_COLLAPSE are best-effort and do not imply to enforce
> >>> collapsing, so we don't need "cc->is_khugepaged" here.
> >>>
> >>> We can imagine that if a user simultaneously uses MADV_FREE and
> >>> MADV_COLLAPSE, it indicates a misunderstanding of their semantics.
> >>> As the kernel, we need to safeguard the baseline.
> >> No. Afraid I don't think so.
> >>
> >> To be clear, what I meant by "enforce":
> >>
> >> Yep, MADV_COLLAPSE is best-effort - it can fail. But when users
> >> call MADV_COLLAPSE, they're explicitly asking for collapse.
> >>
> >> Compared to khugepaged just scanning around, that's already "enforce"
> >> - users are actively requesting it, not passively waiting for.
> >>
> >> Note that you're *breaking* userspace. Users would not be able
> >> to collapse the range where there are any lazyfree pages anymore,
> >> even when they explicitly call MADV_COLLAPSE.
> >>
> >> For khugepaged, skipping lazyfree makes sense.
> > I got your meaning, this is equivalent to two questions:
> >
> > 1. Does the semantics of best-effort imply any "enforce" meaning?
> > 2. When madvise(MADV_FREE| MADV_COLLAPSE), do we want to collapse
> >    lazyfree folios?
> >
> > This is a semantic warning, and I'd like to hear others' opinions.
>

That said, it does feel a bit unfortunate. I was wondering whether we
want to give users a hint in this case, e.g. via something like:

pr_warn("Attempt to enforce hugepage collapse on lazyfree memory");

But I'm not sure whether this is actually worth a printk, or if it would
just add noise without providing actionable value.
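
If it were added, a minimal sketch of where it could live (this assumes the
cc->is_khugepaged check Lance suggested above, so khugepaged still skips
lazyfree folios while MADV_COLLAPSE proceeds, and uses pr_warn_once() to
keep the noise to a single line):

	if (!pte_dirty(pteval) && folio_test_lazyfree(folio)) {
		if (cc->is_khugepaged) {
			result = SCAN_PAGE_LAZYFREE;
			goto out_unmap;
		}
		/* MADV_COLLAPSE path: collapse anyway, but leave a hint */
		pr_warn_once("Attempt to enforce hugepage collapse on lazyfree memory\n");
	}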

> Regarding "best-effort", it is best-effort in the sense that, the
> madvise(MADV_COLLAPSE) is a syscall needed not for correctness,
> but for optimization purposes. So it is not the end of the world
> if the syscall fails. But, since the user has decided to do an
> expensive operation (syscall), kernel needs to try harder to
> make sure those CPU cycles weren't a waste.
>

Thanks
Barry


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 5/5] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
  2026-01-23 12:40   ` Dev Jain
  2026-01-23 15:32     ` Vernon Yang
@ 2026-01-26  2:18     ` Barry Song
  1 sibling, 0 replies; 31+ messages in thread
From: Barry Song @ 2026-01-26  2:18 UTC (permalink / raw)
  To: Dev Jain
  Cc: Vernon Yang, akpm, david, lorenzo.stoakes, ziy, lance.yang,
	linux-mm, linux-kernel, Vernon Yang

On Fri, Jan 23, 2026 at 8:40 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
> On 23/01/26 1:52 pm, Vernon Yang wrote:
> > From: Vernon Yang <yanglincheng@kylinos.cn>
> >
> > When an mm with the MMF_DISABLE_THP_COMPLETELY flag is detected during
> > scanning, directly set khugepaged_scan.mm_slot to the next mm_slot,
> > reduce redundant operation.
> >
> > Without this patch, entering khugepaged_scan_mm_slot() next time, we
> > will set khugepaged_scan.mm_slot to the next mm_slot.
> >
> > With this patch, we will directly set khugepaged_scan.mm_slot to the
> > next mm_slot.
> >
> > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
> > Reviewed-by: Lance Yang <lance.yang@linux.dev>
> > ---
> >  mm/khugepaged.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index be1c09842ea2..1dc19c1b1f97 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2558,7 +2558,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
> >        * Release the current mm_slot if this mm is about to die, or
> >        * if we scanned all vmas of this mm.
> >        */
>
> Need to update comment: "...or THP got disabled".
>

+1
With that,
Reviewed-by: Barry Song <baohua@kernel.org>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 2/5] mm: khugepaged: refine scan progress number
  2026-01-23  8:22 ` [PATCH mm-new v5 2/5] mm: khugepaged: refine scan progress number Vernon Yang
  2026-01-23 10:46   ` Dev Jain
  2026-01-23 15:19   ` Matthew Wilcox
@ 2026-01-28  8:29   ` Dev Jain
  2026-01-28 14:34     ` Vernon Yang
  2 siblings, 1 reply; 31+ messages in thread
From: Dev Jain @ 2026-01-28  8:29 UTC (permalink / raw)
  To: Vernon Yang, akpm, david
  Cc: lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm, linux-kernel,
	Vernon Yang


On 23/01/26 1:52 pm, Vernon Yang wrote:
> From: Vernon Yang <yanglincheng@kylinos.cn>
>
> Currently, each scan always increases "progress" by HPAGE_PMD_NR,
> even if only scanning a single PTE/PMD entry.
>
> - When only scanning a sigle PTE entry, let me provide a detailed
>   example:
>
> static int hpage_collapse_scan_pmd()
> {
> 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> 	     _pte++, addr += PAGE_SIZE) {
> 		pte_t pteval = ptep_get(_pte);
> 		...
> 		if (pte_uffd_wp(pteval)) { <-- first scan hit
> 			result = SCAN_PTE_UFFD_WP;
> 			goto out_unmap;
> 		}
> 	}
> }
>
> During the first scan, if pte_uffd_wp(pteval) is true, the loop exits
> directly. In practice, only one PTE is scanned before termination.
> Here, "progress += 1" reflects the actual number of PTEs scanned, but
> previously "progress += HPAGE_PMD_NR" always.
>
> - When the memory has been collapsed to PMD, let me provide a detailed
>   example:
>
> The following data is traced by bpftrace on a desktop system. After
> the system has been left idle for 10 minutes upon booting, a lot of
> SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE are observed during a full scan
> by khugepaged.
>
> @scan_pmd_status[1]: 1           ## SCAN_SUCCEED
> @scan_pmd_status[6]: 2           ## SCAN_EXCEED_SHARED_PTE
> @scan_pmd_status[3]: 142         ## SCAN_PMD_MAPPED
> @scan_pmd_status[2]: 178         ## SCAN_NO_PTE_TABLE

Could you elaborate on what [1], [6], etc. and 1, 2, 142, etc. mean?

> total progress size: 674 MB
> Total time         : 419 seconds ## include khugepaged_scan_sleep_millisecs
>
> The khugepaged_scan list save all task that support collapse into hugepage,
> as long as the task is not destroyed, khugepaged will not remove it from
> the khugepaged_scan list. This exist a phenomenon where task has already
> collapsed all memory regions into hugepage, but khugepaged continues to
> scan it, which wastes CPU time and invalid, and due to
> khugepaged_scan_sleep_millisecs (default 10s) causes a long wait for
> scanning a large number of invalid task, so scanning really valid task
> is later.
>
> After applying this patch, when the memory is either SCAN_PMD_MAPPED or
> SCAN_NO_PTE_TABLE, just skip it, as follow:
>
> @scan_pmd_status[6]: 2
> @scan_pmd_status[3]: 147
> @scan_pmd_status[2]: 173
> total progress size: 45 MB
> Total time         : 20 seconds
>
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> ---
>  include/linux/xarray.h |  9 ++++++++
>  mm/khugepaged.c        | 47 ++++++++++++++++++++++++++++++++++--------
>  2 files changed, 47 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/xarray.h b/include/linux/xarray.h
> index be850174e802..f77d97d7b957 100644
> --- a/include/linux/xarray.h
> +++ b/include/linux/xarray.h
> @@ -1646,6 +1646,15 @@ static inline void xas_set(struct xa_state *xas, unsigned long index)
>  	xas->xa_node = XAS_RESTART;
>  }
>  
> +/**
> + * xas_get_index() - Get XArray operation state for a different index.
> + * @xas: XArray operation state.
> + */
> +static inline unsigned long xas_get_index(struct xa_state *xas)
> +{
> +	return xas->xa_index;
> +}
> +
>  /**
>   * xas_advance() - Skip over sibling entries.
>   * @xas: XArray operation state.
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 6f0f05148765..de95029e3763 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -68,7 +68,10 @@ enum scan_result {
>  static struct task_struct *khugepaged_thread __read_mostly;
>  static DEFINE_MUTEX(khugepaged_mutex);
>  
> -/* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 second */
> +/*
> + * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
> + * every 10 second.
> + */
>  static unsigned int khugepaged_pages_to_scan __read_mostly;
>  static unsigned int khugepaged_pages_collapsed;
>  static unsigned int khugepaged_full_scans;
> @@ -1240,7 +1243,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>  }
>  
>  static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> -		struct vm_area_struct *vma, unsigned long start_addr, bool *mmap_locked,
> +		struct vm_area_struct *vma, unsigned long start_addr,
> +		bool *mmap_locked, unsigned int *cur_progress,
>  		struct collapse_control *cc)
>  {
>  	pmd_t *pmd;
> @@ -1255,6 +1259,9 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>  
>  	VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
>  
> +	if (cur_progress)
> +		*cur_progress += 1;

Why not be a little more explicit, and do this addition if find_pmd_or_thp_or_none fails,
or pte_offset_map_lock fails? The way you do it right now is not readable - it gives no
idea as to why on function entry we do a +1 right away. Please do add some comments too.

> +
>  	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
>  	if (result != SCAN_SUCCEED)
>  		goto out;
> @@ -1396,6 +1403,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>  		result = SCAN_SUCCEED;
>  	}
>  out_unmap:
> +	if (cur_progress) {
> +		if (_pte >= pte + HPAGE_PMD_NR)
> +			*cur_progress += HPAGE_PMD_NR - 1;
> +		else
> +			*cur_progress += _pte - pte;
> +	}
>  	pte_unmap_unlock(pte, ptl);
>  	if (result == SCAN_SUCCEED) {
>  		result = collapse_huge_page(mm, start_addr, referenced,
> @@ -2286,8 +2299,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>  	return result;
>  }
>  
> -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> -		struct file *file, pgoff_t start, struct collapse_control *cc)
> +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
> +		unsigned long addr, struct file *file, pgoff_t start,
> +		unsigned int *cur_progress, struct collapse_control *cc)
>  {
>  	struct folio *folio = NULL;
>  	struct address_space *mapping = file->f_mapping;
> @@ -2376,6 +2390,18 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
>  			cond_resched_rcu();
>  		}
>  	}
> +	if (cur_progress) {
> +		unsigned long idx = xas_get_index(&xas) - start;
> +
> +		if (folio == NULL)
> +			*cur_progress += HPAGE_PMD_NR;

I think this whole block needs some comments. Can you explain why you
do a particular increment in each case?

> +		else if (xa_is_value(folio))
> +			*cur_progress += idx + (1 << xas_get_order(&xas));
> +		else if (folio_order(folio) == HPAGE_PMD_ORDER)
> +			*cur_progress += idx + 1;
> +		else
> +			*cur_progress += idx + folio_nr_pages(folio);
> +	}
>  	rcu_read_unlock();
>  
>  	if (result == SCAN_SUCCEED) {
> @@ -2456,6 +2482,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>  
>  		while (khugepaged_scan.address < hend) {
>  			bool mmap_locked = true;
> +			unsigned int cur_progress = 0;
>  
>  			cond_resched();
>  			if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> @@ -2472,7 +2499,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>  				mmap_read_unlock(mm);
>  				mmap_locked = false;
>  				*result = hpage_collapse_scan_file(mm,
> -					khugepaged_scan.address, file, pgoff, cc);
> +					khugepaged_scan.address, file, pgoff,
> +					&cur_progress, cc);
>  				fput(file);
>  				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
>  					mmap_read_lock(mm);
> @@ -2486,7 +2514,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>  				}
>  			} else {
>  				*result = hpage_collapse_scan_pmd(mm, vma,
> -					khugepaged_scan.address, &mmap_locked, cc);
> +					khugepaged_scan.address, &mmap_locked,
> +					&cur_progress, cc);
>  			}
>  
>  			if (*result == SCAN_SUCCEED)
> @@ -2494,7 +2523,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>  
>  			/* move to next address */
>  			khugepaged_scan.address += HPAGE_PMD_SIZE;
> -			progress += HPAGE_PMD_NR;
> +			progress += cur_progress;
>  			if (!mmap_locked)
>  				/*
>  				 * We released mmap_lock so break loop.  Note
> @@ -2817,7 +2846,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  			mmap_locked = false;
>  			*lock_dropped = true;
>  			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> -							  cc);
> +							  NULL, cc);
>  
>  			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
>  			    mapping_can_writeback(file->f_mapping)) {
> @@ -2832,7 +2861,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  			fput(file);
>  		} else {
>  			result = hpage_collapse_scan_pmd(mm, vma, addr,
> -							 &mmap_locked, cc);
> +							 &mmap_locked, NULL, cc);
>  		}
>  		if (!mmap_locked)
>  			*lock_dropped = true;


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 2/5] mm: khugepaged: refine scan progress number
  2026-01-28  8:29   ` Dev Jain
@ 2026-01-28 14:34     ` Vernon Yang
  2026-01-29  5:35       ` Dev Jain
  0 siblings, 1 reply; 31+ messages in thread
From: Vernon Yang @ 2026-01-28 14:34 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
	linux-kernel, Vernon Yang

On Wed, Jan 28, 2026 at 01:59:33PM +0530, Dev Jain wrote:
>
> On 23/01/26 1:52 pm, Vernon Yang wrote:
> > From: Vernon Yang <yanglincheng@kylinos.cn>
> >
> > Currently, each scan always increases "progress" by HPAGE_PMD_NR,
> > even if only scanning a single PTE/PMD entry.
> >
> > - When only scanning a sigle PTE entry, let me provide a detailed
> >   example:
> >
> > static int hpage_collapse_scan_pmd()
> > {
> > 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> > 	     _pte++, addr += PAGE_SIZE) {
> > 		pte_t pteval = ptep_get(_pte);
> > 		...
> > 		if (pte_uffd_wp(pteval)) { <-- first scan hit
> > 			result = SCAN_PTE_UFFD_WP;
> > 			goto out_unmap;
> > 		}
> > 	}
> > }
> >
> > During the first scan, if pte_uffd_wp(pteval) is true, the loop exits
> > directly. In practice, only one PTE is scanned before termination.
> > Here, "progress += 1" reflects the actual number of PTEs scanned, but
> > previously "progress += HPAGE_PMD_NR" always.
> >
> > - When the memory has been collapsed to PMD, let me provide a detailed
> >   example:
> >
> > The following data is traced by bpftrace on a desktop system. After
> > the system has been left idle for 10 minutes upon booting, a lot of
> > SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE are observed during a full scan
> > by khugepaged.
> >
> > @scan_pmd_status[1]: 1           ## SCAN_SUCCEED
> > @scan_pmd_status[6]: 2           ## SCAN_EXCEED_SHARED_PTE
> > @scan_pmd_status[3]: 142         ## SCAN_PMD_MAPPED
> > @scan_pmd_status[2]: 178         ## SCAN_NO_PTE_TABLE
>
> Could you elaborate what is [1], [6] etc and 1,2,142, etc?

The 1, 6, etc. in brackets are values of "enum scan_result"; you can refer
directly to the notes on the right.

The 1, 2, 142, 178 are the counts of the different "enum scan_result"
values, taken from trace_mm_khugepaged_scan_pmd and
trace_mm_khugepaged_scan_file.

For example, SCAN_PMD_MAPPED was hit 142 times during a full scan by
khugepaged.

> > total progress size: 674 MB
> > Total time         : 419 seconds ## include khugepaged_scan_sleep_millisecs
> >
> > The khugepaged_scan list save all task that support collapse into hugepage,
> > as long as the task is not destroyed, khugepaged will not remove it from
> > the khugepaged_scan list. This exist a phenomenon where task has already
> > collapsed all memory regions into hugepage, but khugepaged continues to
> > scan it, which wastes CPU time and invalid, and due to
> > khugepaged_scan_sleep_millisecs (default 10s) causes a long wait for
> > scanning a large number of invalid task, so scanning really valid task
> > is later.
> >
> > After applying this patch, when the memory is either SCAN_PMD_MAPPED or
> > SCAN_NO_PTE_TABLE, just skip it, as follow:
> >
> > @scan_pmd_status[6]: 2
> > @scan_pmd_status[3]: 147
> > @scan_pmd_status[2]: 173
> > total progress size: 45 MB
> > Total time         : 20 seconds
> >
> > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > ---
> >  include/linux/xarray.h |  9 ++++++++
> >  mm/khugepaged.c        | 47 ++++++++++++++++++++++++++++++++++--------
> >  2 files changed, 47 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/xarray.h b/include/linux/xarray.h
> > index be850174e802..f77d97d7b957 100644
> > --- a/include/linux/xarray.h
> > +++ b/include/linux/xarray.h
> > @@ -1646,6 +1646,15 @@ static inline void xas_set(struct xa_state *xas, unsigned long index)
> >  	xas->xa_node = XAS_RESTART;
> >  }
> >
> > +/**
> > + * xas_get_index() - Get XArray operation state for a different index.
> > + * @xas: XArray operation state.
> > + */
> > +static inline unsigned long xas_get_index(struct xa_state *xas)
> > +{
> > +	return xas->xa_index;
> > +}
> > +
> >  /**
> >   * xas_advance() - Skip over sibling entries.
> >   * @xas: XArray operation state.
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 6f0f05148765..de95029e3763 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -68,7 +68,10 @@ enum scan_result {
> >  static struct task_struct *khugepaged_thread __read_mostly;
> >  static DEFINE_MUTEX(khugepaged_mutex);
> >
> > -/* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 second */
> > +/*
> > + * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
> > + * every 10 second.
> > + */
> >  static unsigned int khugepaged_pages_to_scan __read_mostly;
> >  static unsigned int khugepaged_pages_collapsed;
> >  static unsigned int khugepaged_full_scans;
> > @@ -1240,7 +1243,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >  }
> >
> >  static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> > -		struct vm_area_struct *vma, unsigned long start_addr, bool *mmap_locked,
> > +		struct vm_area_struct *vma, unsigned long start_addr,
> > +		bool *mmap_locked, unsigned int *cur_progress,
> >  		struct collapse_control *cc)
> >  {
> >  	pmd_t *pmd;
> > @@ -1255,6 +1259,9 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >
> >  	VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
> >
> > +	if (cur_progress)
> > +		*cur_progress += 1;
>
> Why not be a little more explicit, and do this addition if find_pmd_or_thp_or_none fails,
> or pte_offset_map_lock fails? The way you do it right now is not readable - it gives no
> idea as to why on function entry we do a +1 right away. Please do add some comments too.

If this way is not clear enough, we can add 1 directly where
find_pmd_or_thp_or_none() etc. fail, BUT it's a bit redundant.
Please take a look at which one is better.

case 1:
as in the V4 PATCH #2 [1] and #3 [2]; only hpage_collapse_scan_pmd() is shown.
[1] https://lore.kernel.org/linux-mm/20260111121909.8410-3-yanglincheng@kylinos.cn
[2] https://lore.kernel.org/linux-mm/20260111121909.8410-4-yanglincheng@kylinos.cn

static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
		struct vm_area_struct *vma, unsigned long start_addr,
		bool *mmap_locked, unsigned int *cur_progress,
		struct collapse_control *cc)
{
	...
	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
	if (result != SCAN_SUCCEED) {
		if (cur_progress)
			*cur_progress += 1; // here
		goto out;
	}
	...
	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
	if (!pte) {
		if (cur_progress)
			*cur_progress += 1; // here
		result = SCAN_NO_PTE_TABLE;
		goto out;
	}

	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
	     _pte++, addr += PAGE_SIZE) {
		if (cur_progress)
			*cur_progress += 1; // here
		...
	}
}

case 2:

static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
		struct vm_area_struct *vma, unsigned long start_addr,
		bool *mmap_locked, unsigned int *cur_progress,
		struct collapse_control *cc)
{
	...
	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
	if (result != SCAN_SUCCEED) {
		if (cur_progress)
			*cur_progress += 1; // here
		goto out;
	}
	...
	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
	if (!pte) {
		if (cur_progress)
			*cur_progress += 1; // here
		result = SCAN_NO_PTE_TABLE;
		goto out;
	}

	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
	     _pte++, addr += PAGE_SIZE) {
		...
	}
	...
out_unmap:
	if (cur_progress) {
		if (_pte >= pte + HPAGE_PMD_NR)
			*cur_progress += HPAGE_PMD_NR;   // here
		else
			*cur_progress += _pte - pte + 1; // here
	}
}

case 3:
	the current patch, with more comments added for clarity.

> > +
> >  	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
> >  	if (result != SCAN_SUCCEED)
> >  		goto out;
> > @@ -1396,6 +1403,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >  		result = SCAN_SUCCEED;
> >  	}
> >  out_unmap:
> > +	if (cur_progress) {
> > +		if (_pte >= pte + HPAGE_PMD_NR)
> > +			*cur_progress += HPAGE_PMD_NR - 1;
> > +		else
> > +			*cur_progress += _pte - pte;
> > +	}
> >  	pte_unmap_unlock(pte, ptl);
> >  	if (result == SCAN_SUCCEED) {
> >  		result = collapse_huge_page(mm, start_addr, referenced,
> > @@ -2286,8 +2299,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> >  	return result;
> >  }
> >
> > -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> > -		struct file *file, pgoff_t start, struct collapse_control *cc)
> > +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
> > +		unsigned long addr, struct file *file, pgoff_t start,
> > +		unsigned int *cur_progress, struct collapse_control *cc)
> >  {
> >  	struct folio *folio = NULL;
> >  	struct address_space *mapping = file->f_mapping;
> > @@ -2376,6 +2390,18 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
> >  			cond_resched_rcu();
> >  		}
> >  	}
> > +	if (cur_progress) {
> > +		unsigned long idx = xas_get_index(&xas) - start;
> > +
> > +		if (folio == NULL)
> > +			*cur_progress += HPAGE_PMD_NR;
>
> I think this whole block needs some comments. Can you explain, why you
> do a particular increment in each case?
>
> > +		else if (xa_is_value(folio))
> > +			*cur_progress += idx + (1 << xas_get_order(&xas));
> > +		else if (folio_order(folio) == HPAGE_PMD_ORDER)
> > +			*cur_progress += idx + 1;
> > +		else
> > +			*cur_progress += idx + folio_nr_pages(folio);
> > +	}

The "idx" represent PTEs number already scanned when exiting
xas_for_each().

However, the last valid folio size was not counted in "idx" (except
folio == NULL, "idx" equal to HPAGE_PMD_NR), which can be further
divided into three cases:

1. shmem swap entries (xa_is_value), add folio size.
2. the folio is HPAGE_PMD_ORDER, the memory has been collapsed
   to PMD, so add 1 only.
3. Normal folio, add folio size.

Is the version below more readable?

	if (cur_progress) {
		*cur_progress += xas.xa_index - start;

		if (folio) {
			if (xa_is_value(folio))
				*cur_progress += 1 << xas_get_order(&xas);
			else if (folio_order(folio) == HPAGE_PMD_ORDER)
				*cur_progress += 1;
			else
				*cur_progress += folio_nr_pages(folio);
		}
	}

> >  	rcu_read_unlock();
> >
> >  	if (result == SCAN_SUCCEED) {
> > @@ -2456,6 +2482,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
> >
> >  		while (khugepaged_scan.address < hend) {
> >  			bool mmap_locked = true;
> > +			unsigned int cur_progress = 0;
> >
> >  			cond_resched();
> >  			if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> > @@ -2472,7 +2499,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
> >  				mmap_read_unlock(mm);
> >  				mmap_locked = false;
> >  				*result = hpage_collapse_scan_file(mm,
> > -					khugepaged_scan.address, file, pgoff, cc);
> > +					khugepaged_scan.address, file, pgoff,
> > +					&cur_progress, cc);
> >  				fput(file);
> >  				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> >  					mmap_read_lock(mm);
> > @@ -2486,7 +2514,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
> >  				}
> >  			} else {
> >  				*result = hpage_collapse_scan_pmd(mm, vma,
> > -					khugepaged_scan.address, &mmap_locked, cc);
> > +					khugepaged_scan.address, &mmap_locked,
> > +					&cur_progress, cc);
> >  			}
> >
> >  			if (*result == SCAN_SUCCEED)
> > @@ -2494,7 +2523,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
> >
> >  			/* move to next address */
> >  			khugepaged_scan.address += HPAGE_PMD_SIZE;
> > -			progress += HPAGE_PMD_NR;
> > +			progress += cur_progress;
> >  			if (!mmap_locked)
> >  				/*
> >  				 * We released mmap_lock so break loop.  Note
> > @@ -2817,7 +2846,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> >  			mmap_locked = false;
> >  			*lock_dropped = true;
> >  			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> > -							  cc);
> > +							  NULL, cc);
> >
> >  			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
> >  			    mapping_can_writeback(file->f_mapping)) {
> > @@ -2832,7 +2861,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> >  			fput(file);
> >  		} else {
> >  			result = hpage_collapse_scan_pmd(mm, vma, addr,
> > -							 &mmap_locked, cc);
> > +							 &mmap_locked, NULL, cc);
> >  		}
> >  		if (!mmap_locked)
> >  			*lock_dropped = true;
>

--
Thanks,
Vernon


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 2/5] mm: khugepaged: refine scan progress number
  2026-01-28 14:34     ` Vernon Yang
@ 2026-01-29  5:35       ` Dev Jain
  2026-01-29  7:59         ` Vernon Yang
  2026-01-29  9:18         ` Lance Yang
  0 siblings, 2 replies; 31+ messages in thread
From: Dev Jain @ 2026-01-29  5:35 UTC (permalink / raw)
  To: Vernon Yang
  Cc: akpm, david, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
	linux-kernel, Vernon Yang


On 28/01/26 8:04 pm, Vernon Yang wrote:
> On Wed, Jan 28, 2026 at 01:59:33PM +0530, Dev Jain wrote:
>> On 23/01/26 1:52 pm, Vernon Yang wrote:
>>> From: Vernon Yang <yanglincheng@kylinos.cn>
>>>
>>> Currently, each scan always increases "progress" by HPAGE_PMD_NR,
>>> even if only scanning a single PTE/PMD entry.
>>>
>>> - When only scanning a sigle PTE entry, let me provide a detailed
>>>   example:
>>>
>>> static int hpage_collapse_scan_pmd()
>>> {
>>> 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>>> 	     _pte++, addr += PAGE_SIZE) {
>>> 		pte_t pteval = ptep_get(_pte);
>>> 		...
>>> 		if (pte_uffd_wp(pteval)) { <-- first scan hit
>>> 			result = SCAN_PTE_UFFD_WP;
>>> 			goto out_unmap;
>>> 		}
>>> 	}
>>> }
>>>
>>> During the first scan, if pte_uffd_wp(pteval) is true, the loop exits
>>> directly. In practice, only one PTE is scanned before termination.
>>> Here, "progress += 1" reflects the actual number of PTEs scanned, but
>>> previously "progress += HPAGE_PMD_NR" always.
>>>
>>> - When the memory has been collapsed to PMD, let me provide a detailed
>>>   example:
>>>
>>> The following data is traced by bpftrace on a desktop system. After
>>> the system has been left idle for 10 minutes upon booting, a lot of
>>> SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE are observed during a full scan
>>> by khugepaged.
>>>
>>> @scan_pmd_status[1]: 1           ## SCAN_SUCCEED
>>> @scan_pmd_status[6]: 2           ## SCAN_EXCEED_SHARED_PTE
>>> @scan_pmd_status[3]: 142         ## SCAN_PMD_MAPPED
>>> @scan_pmd_status[2]: 178         ## SCAN_NO_PTE_TABLE
>> Could you elaborate what is [1], [6] etc and 1,2,142, etc?
> These 1,6 are value of "enum scan_result", you can directly refer to the
> notes on the right.
>
> These 1,2,142,178 are number of different "enum scan_result" from
> trace_mm_khugepaged_scan_pmd and trace_mm_khugepaged_scan_file.
>
> as example, SCAN_PMD_MAPPED has 142 times during a full scan by
> khugepaged.

Thanks. Can you please mention this in the patch description? You can simply
write it like this:

"From trace_mm_khugepaged_scan_pmd and trace_mm_khugepaged_scan_file, the
following statuses were observed, with frequency mentioned next to them:

SCAN_SUCCEED: 1
SCAN_PMD_MAPPED: 142
....."

and so on.

>
>>> total progress size: 674 MB
>>> Total time         : 419 seconds ## include khugepaged_scan_sleep_millisecs
>>>
>>> The khugepaged_scan list save all task that support collapse into hugepage,
>>> as long as the task is not destroyed, khugepaged will not remove it from
>>> the khugepaged_scan list. This exist a phenomenon where task has already
>>> collapsed all memory regions into hugepage, but khugepaged continues to
>>> scan it, which wastes CPU time and invalid, and due to
>>> khugepaged_scan_sleep_millisecs (default 10s) causes a long wait for
>>> scanning a large number of invalid task, so scanning really valid task
>>> is later.
>>>
>>> After applying this patch, when the memory is either SCAN_PMD_MAPPED or
>>> SCAN_NO_PTE_TABLE, just skip it, as follow:
>>>
>>> @scan_pmd_status[6]: 2
>>> @scan_pmd_status[3]: 147
>>> @scan_pmd_status[2]: 173
>>> total progress size: 45 MB
>>> Total time         : 20 seconds
>>>
>>> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
>>> ---
>>>  include/linux/xarray.h |  9 ++++++++
>>>  mm/khugepaged.c        | 47 ++++++++++++++++++++++++++++++++++--------
>>>  2 files changed, 47 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/include/linux/xarray.h b/include/linux/xarray.h
>>> index be850174e802..f77d97d7b957 100644
>>> --- a/include/linux/xarray.h
>>> +++ b/include/linux/xarray.h
>>> @@ -1646,6 +1646,15 @@ static inline void xas_set(struct xa_state *xas, unsigned long index)
>>>  	xas->xa_node = XAS_RESTART;
>>>  }
>>>
>>> +/**
>>> + * xas_get_index() - Get XArray operation state for a different index.
>>> + * @xas: XArray operation state.
>>> + */
>>> +static inline unsigned long xas_get_index(struct xa_state *xas)
>>> +{
>>> +	return xas->xa_index;
>>> +}
>>> +
>>>  /**
>>>   * xas_advance() - Skip over sibling entries.
>>>   * @xas: XArray operation state.
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 6f0f05148765..de95029e3763 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -68,7 +68,10 @@ enum scan_result {
>>>  static struct task_struct *khugepaged_thread __read_mostly;
>>>  static DEFINE_MUTEX(khugepaged_mutex);
>>>
>>> -/* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 second */
>>> +/*
>>> + * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
>>> + * every 10 second.
>>> + */
>>>  static unsigned int khugepaged_pages_to_scan __read_mostly;
>>>  static unsigned int khugepaged_pages_collapsed;
>>>  static unsigned int khugepaged_full_scans;
>>> @@ -1240,7 +1243,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>>  }
>>>
>>>  static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>>> -		struct vm_area_struct *vma, unsigned long start_addr, bool *mmap_locked,
>>> +		struct vm_area_struct *vma, unsigned long start_addr,
>>> +		bool *mmap_locked, unsigned int *cur_progress,
>>>  		struct collapse_control *cc)
>>>  {
>>>  	pmd_t *pmd;
>>> @@ -1255,6 +1259,9 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>
>>>  	VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
>>>
>>> +	if (cur_progress)
>>> +		*cur_progress += 1;
>> Why not be a little more explicit, and do this addition if find_pmd_or_thp_or_none fails,
>> or pte_offset_map_lock fails? The way you do it right now is not readable - it gives no
>> idea as to why on function entry we do a +1 right away. Please do add some comments too.
> If this way is not clear enough, we can directly add 1 in
> find_pmd_or_thp_or_none() etc, BUT it's a bit redundant.
> Please take a look at which one is better.
>
> case 1:
> as the V4 PATCH #2 [1] and #3 [2], only hpage_collapse_scan_pmd().
> [1] https://lore.kernel.org/linux-mm/20260111121909.8410-3-yanglincheng@kylinos.cn
> [2] https://lore.kernel.org/linux-mm/20260111121909.8410-4-yanglincheng@kylinos.cn
>
> static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> 		struct vm_area_struct *vma, unsigned long start_addr,
> 		bool *mmap_locked, unsigned int *cur_progress,
> 		struct collapse_control *cc)
> {
> 	...
> 	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
> 	if (result != SCAN_SUCCEED) {
> 		if (cur_progress)
> 			*cur_progress += 1; // here
> 		goto out;
> 	}
> 	...
> 	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> 	if (!pte) {
> 		if (cur_progress)
> 			*cur_progress += 1; // here
> 		result = SCAN_NO_PTE_TABLE;
> 		goto out;
> 	}
>
> 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> 	     _pte++, addr += PAGE_SIZE) {
> 		if (cur_progress)
> 			*cur_progress += 1; // here
> 		...
> 	}
> }
>
> case 2:
>
> static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> 		struct vm_area_struct *vma, unsigned long start_addr,
> 		bool *mmap_locked, unsigned int *cur_progress,
> 		struct collapse_control *cc)
> {
> 	...
> 	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
> 	if (result != SCAN_SUCCEED) {
> 		if (cur_progress)
> 			*cur_progress += 1; // here

Let us be more explicit and set this equal to 1, instead of adding 1.

> 		goto out;
> 	}
> 	...
> 	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> 	if (!pte) {
> 		if (cur_progress)
> 			*cur_progress += 1; // here

Same comment as above.

> 		result = SCAN_NO_PTE_TABLE;
> 		goto out;
> 	}
>
> 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> 	     _pte++, addr += PAGE_SIZE) {
> 		...
> 	}
> 	...
> out_unmap:
> 	if (cur_progress) {
> 		if (_pte >= pte + HPAGE_PMD_NR)
> 			*cur_progress += HPAGE_PMD_NR;   // here
> 		else
> 			*cur_progress += _pte - pte + 1; // here
> 	}
> }

I vote for case 2. In case 1 I don't like the fact that the if (cur_progress)
branch will be checked each iteration - and I don't think the compiler can
optimize this since the body of the loop is complex, so this check cannot
be hoisted out of the loop.


>
> case 3:
> 	keep the current patch, and add more comments to make it clearer.
>
>>> +
>>>  	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
>>>  	if (result != SCAN_SUCCEED)
>>>  		goto out;
>>> @@ -1396,6 +1403,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>  		result = SCAN_SUCCEED;
>>>  	}
>>>  out_unmap:
>>> +	if (cur_progress) {
>>> +		if (_pte >= pte + HPAGE_PMD_NR)
>>> +			*cur_progress += HPAGE_PMD_NR - 1;
>>> +		else
>>> +			*cur_progress += _pte - pte;
>>> +	}
>>>  	pte_unmap_unlock(pte, ptl);
>>>  	if (result == SCAN_SUCCEED) {
>>>  		result = collapse_huge_page(mm, start_addr, referenced,
>>> @@ -2286,8 +2299,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>>>  	return result;
>>>  }
>>>
>>> -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
>>> -		struct file *file, pgoff_t start, struct collapse_control *cc)
>>> +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
>>> +		unsigned long addr, struct file *file, pgoff_t start,
>>> +		unsigned int *cur_progress, struct collapse_control *cc)
>>>  {
>>>  	struct folio *folio = NULL;
>>>  	struct address_space *mapping = file->f_mapping;
>>> @@ -2376,6 +2390,18 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
>>>  			cond_resched_rcu();
>>>  		}
>>>  	}
>>> +	if (cur_progress) {
>>> +		unsigned long idx = xas_get_index(&xas) - start;
>>> +
>>> +		if (folio == NULL)
>>> +			*cur_progress += HPAGE_PMD_NR;
>> I think this whole block needs some comments. Can you explain, why you
>> do a particular increment in each case?
>>
>>> +		else if (xa_is_value(folio))
>>> +			*cur_progress += idx + (1 << xas_get_order(&xas));
>>> +		else if (folio_order(folio) == HPAGE_PMD_ORDER)
>>> +			*cur_progress += idx + 1;
>>> +		else
>>> +			*cur_progress += idx + folio_nr_pages(folio);
>>> +	}
> The "idx" represents the number of PTEs already scanned when exiting
> xas_for_each().
>
> However, the size of the last valid folio is not counted in "idx" (except
> when folio == NULL, where "idx" equals HPAGE_PMD_NR); that last folio can
> be further divided into three cases:

But, the number of PTEs you account in these three cases, are *not*
scanned, right? So we can simply drop these 3 cases.

>
> 1. shmem swap entries (xa_is_value), add folio size.
> 2. the folio is HPAGE_PMD_ORDER, the memory has been collapsed
>    to PMD, so add 1 only.
> 3. Normal folio, add folio size.
>
> Is the version below more readable?
>
> 	if (cur_progress) {
> 		*cur_progress += xas.xa_index - start;
>
> 		if (folio) {
> 			if (xa_is_value(folio))
> 				*cur_progress += 1 << xas_get_order(&xas);
> 			else if (folio_order(folio) == HPAGE_PMD_ORDER)
> 				*cur_progress += 1;
> 			else
> 				*cur_progress += folio_nr_pages(folio);
> 		}
> 	}

Yep, this is unneeded complexity. This looks really ugly and the benefits of
this are not clear. You can simply do

if (cur_progress)
	*cur_progress = xas.xa_index - start;

>
>>>  	rcu_read_unlock();
>>>
>>>  	if (result == SCAN_SUCCEED) {
>>> @@ -2456,6 +2482,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>>>
>>>  		while (khugepaged_scan.address < hend) {
>>>  			bool mmap_locked = true;
>>> +			unsigned int cur_progress = 0;
>>>
>>>  			cond_resched();
>>>  			if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
>>> @@ -2472,7 +2499,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>>>  				mmap_read_unlock(mm);
>>>  				mmap_locked = false;
>>>  				*result = hpage_collapse_scan_file(mm,
>>> -					khugepaged_scan.address, file, pgoff, cc);
>>> +					khugepaged_scan.address, file, pgoff,
>>> +					&cur_progress, cc);
>>>  				fput(file);
>>>  				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
>>>  					mmap_read_lock(mm);
>>> @@ -2486,7 +2514,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>>>  				}
>>>  			} else {
>>>  				*result = hpage_collapse_scan_pmd(mm, vma,
>>> -					khugepaged_scan.address, &mmap_locked, cc);
>>> +					khugepaged_scan.address, &mmap_locked,
>>> +					&cur_progress, cc);
>>>  			}
>>>
>>>  			if (*result == SCAN_SUCCEED)
>>> @@ -2494,7 +2523,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>>>
>>>  			/* move to next address */
>>>  			khugepaged_scan.address += HPAGE_PMD_SIZE;
>>> -			progress += HPAGE_PMD_NR;
>>> +			progress += cur_progress;
>>>  			if (!mmap_locked)
>>>  				/*
>>>  				 * We released mmap_lock so break loop.  Note
>>> @@ -2817,7 +2846,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>>  			mmap_locked = false;
>>>  			*lock_dropped = true;
>>>  			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
>>> -							  cc);
>>> +							  NULL, cc);
>>>
>>>  			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
>>>  			    mapping_can_writeback(file->f_mapping)) {
>>> @@ -2832,7 +2861,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>>  			fput(file);
>>>  		} else {
>>>  			result = hpage_collapse_scan_pmd(mm, vma, addr,
>>> -							 &mmap_locked, cc);
>>> +							 &mmap_locked, NULL, cc);
>>>  		}
>>>  		if (!mmap_locked)
>>>  			*lock_dropped = true;
> --
> Thanks,
> Vernon


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 2/5] mm: khugepaged: refine scan progress number
  2026-01-29  5:35       ` Dev Jain
@ 2026-01-29  7:59         ` Vernon Yang
  2026-01-29  8:32           ` Dev Jain
  2026-01-29  9:18         ` Lance Yang
  1 sibling, 1 reply; 31+ messages in thread
From: Vernon Yang @ 2026-01-29  7:59 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
	linux-kernel, Vernon Yang

On Thu, Jan 29, 2026 at 11:05:36AM +0530, Dev Jain wrote:
>
> On 28/01/26 8:04 pm, Vernon Yang wrote:
> > On Wed, Jan 28, 2026 at 01:59:33PM +0530, Dev Jain wrote:
> >> On 23/01/26 1:52 pm, Vernon Yang wrote:
> >>> From: Vernon Yang <yanglincheng@kylinos.cn>
> >>>
> >>> Currently, each scan always increases "progress" by HPAGE_PMD_NR,
> >>> even if only scanning a single PTE/PMD entry.
> >>>
> >>> - When only scanning a single PTE entry, let me provide a detailed
> >>>   example:
> >>>
> >>> static int hpage_collapse_scan_pmd()
> >>> {
> >>> 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> >>> 	     _pte++, addr += PAGE_SIZE) {
> >>> 		pte_t pteval = ptep_get(_pte);
> >>> 		...
> >>> 		if (pte_uffd_wp(pteval)) { <-- first scan hit
> >>> 			result = SCAN_PTE_UFFD_WP;
> >>> 			goto out_unmap;
> >>> 		}
> >>> 	}
> >>> }
> >>>
> >>> During the first scan, if pte_uffd_wp(pteval) is true, the loop exits
> >>> directly. In practice, only one PTE is scanned before termination.
> >>> Here, "progress += 1" reflects the actual number of PTEs scanned, but
> >>> previously "progress += HPAGE_PMD_NR" always.
> >>>
> >>> - When the memory has been collapsed to PMD, let me provide a detailed
> >>>   example:
> >>>
> >>> The following data is traced by bpftrace on a desktop system. After
> >>> the system has been left idle for 10 minutes upon booting, a lot of
> >>> SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE are observed during a full scan
> >>> by khugepaged.
> >>>
> >>> @scan_pmd_status[1]: 1           ## SCAN_SUCCEED
> >>> @scan_pmd_status[6]: 2           ## SCAN_EXCEED_SHARED_PTE
> >>> @scan_pmd_status[3]: 142         ## SCAN_PMD_MAPPED
> >>> @scan_pmd_status[2]: 178         ## SCAN_NO_PTE_TABLE
> >> Could you elaborate what is [1], [6] etc and 1,2,142, etc?
> > These 1,6 are values of "enum scan_result"; you can directly refer to
> > the notes on the right.
> >
> > These 1,2,142,178 are the counts of the different "enum scan_result"
> > values from trace_mm_khugepaged_scan_pmd and trace_mm_khugepaged_scan_file.
> >
> > For example, SCAN_PMD_MAPPED was hit 142 times during a full scan by
> > khugepaged.
>
> Thanks. Can you please mention this in the patch description? You can simply
> write it like this:
>
> "From trace_mm_khugepaged_scan_pmd and trace_mm_khugepaged_scan_file, the
> following statuses were observed, with frequency mentioned next to them:
>
> SCAN_SUCCEED: 1
> SCAN_PMD_MAPPED: 142
> ....."
>
> and so on.

LGTM, I will do it in the next version. Thanks!

> >
> >>> total progress size: 674 MB
> >>> Total time         : 419 seconds ## include khugepaged_scan_sleep_millisecs
> >>>
> >>> The khugepaged_scan list saves all tasks that support collapsing into
> >>> hugepages; as long as a task is not destroyed, khugepaged will not remove
> >>> it from the khugepaged_scan list. This leads to a situation where a task
> >>> has already collapsed all of its memory regions into hugepages, yet
> >>> khugepaged keeps scanning it, which wastes CPU time for no benefit, and
> >>> because of khugepaged_scan_sleep_millisecs (default 10s) the large number
> >>> of pointless scans delays the genuinely useful ones.
> >>>
> >>> After applying this patch, when the scan result is either SCAN_PMD_MAPPED
> >>> or SCAN_NO_PTE_TABLE, just skip it, as follows:
> >>>
> >>> @scan_pmd_status[6]: 2
> >>> @scan_pmd_status[3]: 147
> >>> @scan_pmd_status[2]: 173
> >>> total progress size: 45 MB
> >>> Total time         : 20 seconds
> >>>
> >>> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> >>> ---
> >>>  include/linux/xarray.h |  9 ++++++++
> >>>  mm/khugepaged.c        | 47 ++++++++++++++++++++++++++++++++++--------
> >>>  2 files changed, 47 insertions(+), 9 deletions(-)
> >>>
> >>> diff --git a/include/linux/xarray.h b/include/linux/xarray.h
> >>> index be850174e802..f77d97d7b957 100644
> >>> --- a/include/linux/xarray.h
> >>> +++ b/include/linux/xarray.h
> >>> @@ -1646,6 +1646,15 @@ static inline void xas_set(struct xa_state *xas, unsigned long index)
> >>>  	xas->xa_node = XAS_RESTART;
> >>>  }
> >>>
> >>> +/**
> >>> + * xas_get_index() - Get XArray operation state for a different index.
> >>> + * @xas: XArray operation state.
> >>> + */
> >>> +static inline unsigned long xas_get_index(struct xa_state *xas)
> >>> +{
> >>> +	return xas->xa_index;
> >>> +}
> >>> +
> >>>  /**
> >>>   * xas_advance() - Skip over sibling entries.
> >>>   * @xas: XArray operation state.
> >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >>> index 6f0f05148765..de95029e3763 100644
> >>> --- a/mm/khugepaged.c
> >>> +++ b/mm/khugepaged.c
> >>> @@ -68,7 +68,10 @@ enum scan_result {
> >>>  static struct task_struct *khugepaged_thread __read_mostly;
> >>>  static DEFINE_MUTEX(khugepaged_mutex);
> >>>
> >>> -/* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 second */
> >>> +/*
> >>> + * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
> >>> + * every 10 second.
> >>> + */
> >>>  static unsigned int khugepaged_pages_to_scan __read_mostly;
> >>>  static unsigned int khugepaged_pages_collapsed;
> >>>  static unsigned int khugepaged_full_scans;
> >>> @@ -1240,7 +1243,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >>>  }
> >>>
> >>>  static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >>> -		struct vm_area_struct *vma, unsigned long start_addr, bool *mmap_locked,
> >>> +		struct vm_area_struct *vma, unsigned long start_addr,
> >>> +		bool *mmap_locked, unsigned int *cur_progress,
> >>>  		struct collapse_control *cc)
> >>>  {
> >>>  	pmd_t *pmd;
> >>> @@ -1255,6 +1259,9 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >>>
> >>>  	VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
> >>>
> >>> +	if (cur_progress)
> >>> +		*cur_progress += 1;
> >> Why not be a little more explicit, and do this addition if find_pmd_or_thp_or_none fails,
> >> or pte_offset_map_lock fails? The way you do it right now is not readable - it gives no
> >> idea as to why on function entry we do a +1 right away. Please do add some comments too.
> > If this way is not clear enough, we can directly add 1 in
> > find_pmd_or_thp_or_none() etc, BUT it's a bit redundant.
> > Please take a look at which one is better.
> >
> > case 1:
> > as the V4 PATCH #2 [1] and #3 [2], only hpage_collapse_scan_pmd().
> > [1] https://lore.kernel.org/linux-mm/20260111121909.8410-3-yanglincheng@kylinos.cn
> > [2] https://lore.kernel.org/linux-mm/20260111121909.8410-4-yanglincheng@kylinos.cn
> >
> > static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> > 		struct vm_area_struct *vma, unsigned long start_addr,
> > 		bool *mmap_locked, unsigned int *cur_progress,
> > 		struct collapse_control *cc)
> > {
> > 	...
> > 	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
> > 	if (result != SCAN_SUCCEED) {
> > 		if (cur_progress)
> > 			*cur_progress += 1; // here
> > 		goto out;
> > 	}
> > 	...
> > 	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> > 	if (!pte) {
> > 		if (cur_progress)
> > 			*cur_progress += 1; // here
> > 		result = SCAN_NO_PTE_TABLE;
> > 		goto out;
> > 	}
> >
> > 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> > 	     _pte++, addr += PAGE_SIZE) {
> > 		if (cur_progress)
> > 			*cur_progress += 1; // here
> > 		...
> > 	}
> > }
> >
> > case 2:
> >
> > static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> > 		struct vm_area_struct *vma, unsigned long start_addr,
> > 		bool *mmap_locked, unsigned int *cur_progress,
> > 		struct collapse_control *cc)
> > {
> > 	...
> > 	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
> > 	if (result != SCAN_SUCCEED) {
> > 		if (cur_progress)
> > 			*cur_progress += 1; // here
>
> Let us be more explicit and set this equal to 1, instead of adding 1.

LGTM, I will do it in the next version. Thanks!

> > 		goto out;
> > 	}
> > 	...
> > 	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> > 	if (!pte) {
> > 		if (cur_progress)
> > 			*cur_progress += 1; // here
>
> Same comment as above.
>
> > 		result = SCAN_NO_PTE_TABLE;
> > 		goto out;
> > 	}
> >
> > 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> > 	     _pte++, addr += PAGE_SIZE) {
> > 		...
> > 	}
> > 	...
> > out_unmap:
> > 	if (cur_progress) {
> > 		if (_pte >= pte + HPAGE_PMD_NR)
> > 			*cur_progress += HPAGE_PMD_NR;   // here
> > 		else
> > 			*cur_progress += _pte - pte + 1; // here
> > 	}
> > }
>
> I vote for case 2. In case 1 I don't like the fact that the if (cur_progress)
> branch will be checked each iteration - and I don't think the compiler can
> optimize this since the body of the loop is complex, so this check cannot
> be hoisted out of the loop.
>
>
> >
> > case 3:
> > 	keep the current patch, and add more comments to make it clearer.
> >
> >>> +
> >>>  	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
> >>>  	if (result != SCAN_SUCCEED)
> >>>  		goto out;
> >>> @@ -1396,6 +1403,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >>>  		result = SCAN_SUCCEED;
> >>>  	}
> >>>  out_unmap:
> >>> +	if (cur_progress) {
> >>> +		if (_pte >= pte + HPAGE_PMD_NR)
> >>> +			*cur_progress += HPAGE_PMD_NR - 1;
> >>> +		else
> >>> +			*cur_progress += _pte - pte;
> >>> +	}
> >>>  	pte_unmap_unlock(pte, ptl);
> >>>  	if (result == SCAN_SUCCEED) {
> >>>  		result = collapse_huge_page(mm, start_addr, referenced,
> >>> @@ -2286,8 +2299,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> >>>  	return result;
> >>>  }
> >>>
> >>> -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> >>> -		struct file *file, pgoff_t start, struct collapse_control *cc)
> >>> +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
> >>> +		unsigned long addr, struct file *file, pgoff_t start,
> >>> +		unsigned int *cur_progress, struct collapse_control *cc)
> >>>  {
> >>>  	struct folio *folio = NULL;
> >>>  	struct address_space *mapping = file->f_mapping;
> >>> @@ -2376,6 +2390,18 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
> >>>  			cond_resched_rcu();
> >>>  		}
> >>>  	}
> >>> +	if (cur_progress) {
> >>> +		unsigned long idx = xas_get_index(&xas) - start;
> >>> +
> >>> +		if (folio == NULL)
> >>> +			*cur_progress += HPAGE_PMD_NR;
> >> I think this whole block needs some comments. Can you explain, why you
> >> do a particular increment in each case?
> >>
> >>> +		else if (xa_is_value(folio))
> >>> +			*cur_progress += idx + (1 << xas_get_order(&xas));
> >>> +		else if (folio_order(folio) == HPAGE_PMD_ORDER)
> >>> +			*cur_progress += idx + 1;
> >>> +		else
> >>> +			*cur_progress += idx + folio_nr_pages(folio);
> >>> +	}
> > The "idx" represents the number of PTEs already scanned when exiting
> > xas_for_each().
> >
> > However, the size of the last valid folio is not counted in "idx" (except
> > when folio == NULL, where "idx" equals HPAGE_PMD_NR); that last folio can
> > be further divided into three cases:
>
> But, the number of PTEs you account in these three cases, are *not*
> scanned, right? So we can simply drop these 3 cases.

No, these three cases describe the last folio scanned before the loop
breaks, so I think we need to account for it. Imagine that the very first
entry we hit is an HPAGE_PMD_ORDER folio: then "idx" is equal to 0.
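
To illustrate, here is a rough sketch of the v5 xas_for_each() shape in
hpage_collapse_scan_file() (illustrative only, not the exact code):

        xas_for_each(&xas, folio, start + HPAGE_PMD_NR - 1) {
                ...
                /* suppose the very first entry is already a PMD-sized folio */
                if (folio_order(folio) == HPAGE_PMD_ORDER) {
                        result = SCAN_PTE_MAPPED_HUGEPAGE;
                        break;  /* xas.xa_index is still equal to start */
                }
                ...
        }
        /* xas.xa_index - start == 0 here, although one entry was scanned */

Without adding something for that last folio, this PMD range would
contribute nothing to "progress".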

> >
> > 1. shmem swap entries (xa_is_value), add folio size.
> > 2. the folio is HPAGE_PMD_ORDER, the memory has been collapsed
> >    to PMD, so add 1 only.
> > 3. Normal folio, add folio size.
> >
> > Is the version below more readable?
> >
> > 	if (cur_progress) {
> > 		*cur_progress += xas.xa_index - start;
> >
> > 		if (folio) {
> > 			if (xa_is_value(folio))
> > 				*cur_progress += 1 << xas_get_order(&xas);
> > 			else if (folio_order(folio) == HPAGE_PMD_ORDER)
> > 				*cur_progress += 1;
> > 			else
> > 				*cur_progress += folio_nr_pages(folio);
> > 		}
> > 	}
>
> Yep, this is unneeded complexity. This looks really ugly and the benefits of
> this are not clear. You can simply do
>
> if (cur_progress)
> 	*cur_progress = xas.xa_index - start;
>
> >
> >>>  	rcu_read_unlock();
> >>>
> >>>  	if (result == SCAN_SUCCEED) {
> >>> @@ -2456,6 +2482,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
> >>>
> >>>  		while (khugepaged_scan.address < hend) {
> >>>  			bool mmap_locked = true;
> >>> +			unsigned int cur_progress = 0;
> >>>
> >>>  			cond_resched();
> >>>  			if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> >>> @@ -2472,7 +2499,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
> >>>  				mmap_read_unlock(mm);
> >>>  				mmap_locked = false;
> >>>  				*result = hpage_collapse_scan_file(mm,
> >>> -					khugepaged_scan.address, file, pgoff, cc);
> >>> +					khugepaged_scan.address, file, pgoff,
> >>> +					&cur_progress, cc);
> >>>  				fput(file);
> >>>  				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> >>>  					mmap_read_lock(mm);
> >>> @@ -2486,7 +2514,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
> >>>  				}
> >>>  			} else {
> >>>  				*result = hpage_collapse_scan_pmd(mm, vma,
> >>> -					khugepaged_scan.address, &mmap_locked, cc);
> >>> +					khugepaged_scan.address, &mmap_locked,
> >>> +					&cur_progress, cc);
> >>>  			}
> >>>
> >>>  			if (*result == SCAN_SUCCEED)
> >>> @@ -2494,7 +2523,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
> >>>
> >>>  			/* move to next address */
> >>>  			khugepaged_scan.address += HPAGE_PMD_SIZE;
> >>> -			progress += HPAGE_PMD_NR;
> >>> +			progress += cur_progress;
> >>>  			if (!mmap_locked)
> >>>  				/*
> >>>  				 * We released mmap_lock so break loop.  Note
> >>> @@ -2817,7 +2846,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> >>>  			mmap_locked = false;
> >>>  			*lock_dropped = true;
> >>>  			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> >>> -							  cc);
> >>> +							  NULL, cc);
> >>>
> >>>  			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
> >>>  			    mapping_can_writeback(file->f_mapping)) {
> >>> @@ -2832,7 +2861,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> >>>  			fput(file);
> >>>  		} else {
> >>>  			result = hpage_collapse_scan_pmd(mm, vma, addr,
> >>> -							 &mmap_locked, cc);
> >>> +							 &mmap_locked, NULL, cc);
> >>>  		}
> >>>  		if (!mmap_locked)
> >>>  			*lock_dropped = true;
> > --
> > Thanks,
> > Vernon
>



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 2/5] mm: khugepaged: refine scan progress number
  2026-01-29  7:59         ` Vernon Yang
@ 2026-01-29  8:32           ` Dev Jain
  2026-01-29 12:24             ` Vernon Yang
  0 siblings, 1 reply; 31+ messages in thread
From: Dev Jain @ 2026-01-29  8:32 UTC (permalink / raw)
  To: Vernon Yang
  Cc: akpm, david, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
	linux-kernel, Vernon Yang


On 29/01/26 1:29 pm, Vernon Yang wrote:
> On Thu, Jan 29, 2026 at 11:05:36AM +0530, Dev Jain wrote:
>> On 28/01/26 8:04 pm, Vernon Yang wrote:
>>> On Wed, Jan 28, 2026 at 01:59:33PM +0530, Dev Jain wrote:
>>>> On 23/01/26 1:52 pm, Vernon Yang wrote:
>>>>> From: Vernon Yang <yanglincheng@kylinos.cn>
>>>>>
>>>>> Currently, each scan always increases "progress" by HPAGE_PMD_NR,
>>>>> even if only scanning a single PTE/PMD entry.
>>>>>
>>>>> - When only scanning a single PTE entry, let me provide a detailed
>>>>>   example:
>>>>>
>>>>> static int hpage_collapse_scan_pmd()
>>>>> {
>>>>> 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>>>>> 	     _pte++, addr += PAGE_SIZE) {
>>>>> 		pte_t pteval = ptep_get(_pte);
>>>>> 		...
>>>>> 		if (pte_uffd_wp(pteval)) { <-- first scan hit
>>>>> 			result = SCAN_PTE_UFFD_WP;
>>>>> 			goto out_unmap;
>>>>> 		}
>>>>> 	}
>>>>> }
>>>>>
>>>>> During the first scan, if pte_uffd_wp(pteval) is true, the loop exits
>>>>> directly. In practice, only one PTE is scanned before termination.
>>>>> Here, "progress += 1" reflects the actual number of PTEs scanned, but
>>>>> previously "progress += HPAGE_PMD_NR" always.
>>>>>
>>>>> - When the memory has been collapsed to PMD, let me provide a detailed
>>>>>   example:
>>>>>
>>>>> The following data is traced by bpftrace on a desktop system. After
>>>>> the system has been left idle for 10 minutes upon booting, a lot of
>>>>> SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE are observed during a full scan
>>>>> by khugepaged.
>>>>>
>>>>> @scan_pmd_status[1]: 1           ## SCAN_SUCCEED
>>>>> @scan_pmd_status[6]: 2           ## SCAN_EXCEED_SHARED_PTE
>>>>> @scan_pmd_status[3]: 142         ## SCAN_PMD_MAPPED
>>>>> @scan_pmd_status[2]: 178         ## SCAN_NO_PTE_TABLE
>>>> Could you elaborate what is [1], [6] etc and 1,2,142, etc?
>>> These 1,6 are values of "enum scan_result"; you can directly refer to
>>> the notes on the right.
>>>
>>> These 1,2,142,178 are the counts of the different "enum scan_result"
>>> values from trace_mm_khugepaged_scan_pmd and trace_mm_khugepaged_scan_file.
>>>
>>> For example, SCAN_PMD_MAPPED was hit 142 times during a full scan by
>>> khugepaged.
>> Thanks. Can you please mention this in the patch description? You can simply
>> write it like this:
>>
>> "From trace_mm_khugepaged_scan_pmd and trace_mm_khugepaged_scan_file, the
>> following statuses were observed, with frequency mentioned next to them:
>>
>> SCAN_SUCCEED: 1
>> SCAN_PMD_MAPPED: 142
>> ....."
>>
>> and so on.
> LGTM, I will do it in the next version. Thanks!
>
>>>>> total progress size: 674 MB
>>>>> Total time         : 419 seconds ## include khugepaged_scan_sleep_millisecs
>>>>>
>>>>> The khugepaged_scan list saves all tasks that support collapsing into
>>>>> hugepages; as long as a task is not destroyed, khugepaged will not remove
>>>>> it from the khugepaged_scan list. This leads to a situation where a task
>>>>> has already collapsed all of its memory regions into hugepages, yet
>>>>> khugepaged keeps scanning it, which wastes CPU time for no benefit, and
>>>>> because of khugepaged_scan_sleep_millisecs (default 10s) the large number
>>>>> of pointless scans delays the genuinely useful ones.
>>>>>
>>>>> After applying this patch, when the scan result is either SCAN_PMD_MAPPED
>>>>> or SCAN_NO_PTE_TABLE, just skip it, as follows:
>>>>>
>>>>> @scan_pmd_status[6]: 2
>>>>> @scan_pmd_status[3]: 147
>>>>> @scan_pmd_status[2]: 173
>>>>> total progress size: 45 MB
>>>>> Total time         : 20 seconds
>>>>>
>>>>> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
>>>>> ---
>>>>>  include/linux/xarray.h |  9 ++++++++
>>>>>  mm/khugepaged.c        | 47 ++++++++++++++++++++++++++++++++++--------
>>>>>  2 files changed, 47 insertions(+), 9 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/xarray.h b/include/linux/xarray.h
>>>>> index be850174e802..f77d97d7b957 100644
>>>>> --- a/include/linux/xarray.h
>>>>> +++ b/include/linux/xarray.h
>>>>> @@ -1646,6 +1646,15 @@ static inline void xas_set(struct xa_state *xas, unsigned long index)
>>>>>  	xas->xa_node = XAS_RESTART;
>>>>>  }
>>>>>
>>>>> +/**
>>>>> + * xas_get_index() - Get XArray operation state for a different index.
>>>>> + * @xas: XArray operation state.
>>>>> + */
>>>>> +static inline unsigned long xas_get_index(struct xa_state *xas)
>>>>> +{
>>>>> +	return xas->xa_index;
>>>>> +}
>>>>> +
>>>>>  /**
>>>>>   * xas_advance() - Skip over sibling entries.
>>>>>   * @xas: XArray operation state.
>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>>> index 6f0f05148765..de95029e3763 100644
>>>>> --- a/mm/khugepaged.c
>>>>> +++ b/mm/khugepaged.c
>>>>> @@ -68,7 +68,10 @@ enum scan_result {
>>>>>  static struct task_struct *khugepaged_thread __read_mostly;
>>>>>  static DEFINE_MUTEX(khugepaged_mutex);
>>>>>
>>>>> -/* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 second */
>>>>> +/*
>>>>> + * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
>>>>> + * every 10 second.
>>>>> + */
>>>>>  static unsigned int khugepaged_pages_to_scan __read_mostly;
>>>>>  static unsigned int khugepaged_pages_collapsed;
>>>>>  static unsigned int khugepaged_full_scans;
>>>>> @@ -1240,7 +1243,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>>>>  }
>>>>>
>>>>>  static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>>> -		struct vm_area_struct *vma, unsigned long start_addr, bool *mmap_locked,
>>>>> +		struct vm_area_struct *vma, unsigned long start_addr,
>>>>> +		bool *mmap_locked, unsigned int *cur_progress,
>>>>>  		struct collapse_control *cc)
>>>>>  {
>>>>>  	pmd_t *pmd;
>>>>> @@ -1255,6 +1259,9 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>>>
>>>>>  	VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
>>>>>
>>>>> +	if (cur_progress)
>>>>> +		*cur_progress += 1;
>>>> Why not be a little more explicit, and do this addition if find_pmd_or_thp_or_none fails,
>>>> or pte_offset_map_lock fails? The way you do it right now is not readable - it gives no
>>>> idea as to why on function entry we do a +1 right away. Please do add some comments too.
>>> If this way is not clear enough, we can directly add 1 in
>>> find_pmd_or_thp_or_none() etc, BUT it's a bit redundant.
>>> Please take a look at which one is better.
>>>
>>> case 1:
>>> as the V4 PATCH #2 [1] and #3 [2], only hpage_collapse_scan_pmd().
>>> [1] https://lore.kernel.org/linux-mm/20260111121909.8410-3-yanglincheng@kylinos.cn
>>> [2] https://lore.kernel.org/linux-mm/20260111121909.8410-4-yanglincheng@kylinos.cn
>>>
>>> static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>>> 		struct vm_area_struct *vma, unsigned long start_addr,
>>> 		bool *mmap_locked, unsigned int *cur_progress,
>>> 		struct collapse_control *cc)
>>> {
>>> 	...
>>> 	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
>>> 	if (result != SCAN_SUCCEED) {
>>> 		if (cur_progress)
>>> 			*cur_progress += 1; // here
>>> 		goto out;
>>> 	}
>>> 	...
>>> 	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
>>> 	if (!pte) {
>>> 		if (cur_progress)
>>> 			*cur_progress += 1; // here
>>> 		result = SCAN_NO_PTE_TABLE;
>>> 		goto out;
>>> 	}
>>>
>>> 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>>> 	     _pte++, addr += PAGE_SIZE) {
>>> 		if (cur_progress)
>>> 			*cur_progress += 1; // here
>>> 		...
>>> 	}
>>> }
>>>
>>> case 2:
>>>
>>> static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>>> 		struct vm_area_struct *vma, unsigned long start_addr,
>>> 		bool *mmap_locked, unsigned int *cur_progress,
>>> 		struct collapse_control *cc)
>>> {
>>> 	...
>>> 	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
>>> 	if (result != SCAN_SUCCEED) {
>>> 		if (cur_progress)
>>> 			*cur_progress += 1; // here
>> Let us be more explicit and set this equal to 1, instead of adding 1.
> LGTM, I will do it in the next version. Thanks!
>
>>> 		goto out;
>>> 	}
>>> 	...
>>> 	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
>>> 	if (!pte) {
>>> 		if (cur_progress)
>>> 			*cur_progress += 1; // here
>> Same comment as above.
>>
>>> 		result = SCAN_NO_PTE_TABLE;
>>> 		goto out;
>>> 	}
>>>
>>> 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>>> 	     _pte++, addr += PAGE_SIZE) {
>>> 		...
>>> 	}
>>> 	...
>>> out_unmap:
>>> 	if (cur_progress) {
>>> 		if (_pte >= pte + HPAGE_PMD_NR)
>>> 			*cur_progress += HPAGE_PMD_NR;   // here
>>> 		else
>>> 			*cur_progress += _pte - pte + 1; // here
>>> 	}
>>> }
>> I vote for case 2. In case 1 I don't like the fact that the if (cur_progress)
>> branch will be checked each iteration - and I don't think the compiler can
>> optimize this since the body of the loop is complex, so this check cannot
>> be hoisted out of the loop.
>>
>>
>>> case 3:
>>> 	keep the current patch, and add more comments to make it clearer.
>>>
>>>>> +
>>>>>  	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
>>>>>  	if (result != SCAN_SUCCEED)
>>>>>  		goto out;
>>>>> @@ -1396,6 +1403,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>>>  		result = SCAN_SUCCEED;
>>>>>  	}
>>>>>  out_unmap:
>>>>> +	if (cur_progress) {
>>>>> +		if (_pte >= pte + HPAGE_PMD_NR)
>>>>> +			*cur_progress += HPAGE_PMD_NR - 1;
>>>>> +		else
>>>>> +			*cur_progress += _pte - pte;
>>>>> +	}
>>>>>  	pte_unmap_unlock(pte, ptl);
>>>>>  	if (result == SCAN_SUCCEED) {
>>>>>  		result = collapse_huge_page(mm, start_addr, referenced,
>>>>> @@ -2286,8 +2299,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>>>>>  	return result;
>>>>>  }
>>>>>
>>>>> -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
>>>>> -		struct file *file, pgoff_t start, struct collapse_control *cc)
>>>>> +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
>>>>> +		unsigned long addr, struct file *file, pgoff_t start,
>>>>> +		unsigned int *cur_progress, struct collapse_control *cc)
>>>>>  {
>>>>>  	struct folio *folio = NULL;
>>>>>  	struct address_space *mapping = file->f_mapping;
>>>>> @@ -2376,6 +2390,18 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
>>>>>  			cond_resched_rcu();
>>>>>  		}
>>>>>  	}
>>>>> +	if (cur_progress) {
>>>>> +		unsigned long idx = xas_get_index(&xas) - start;
>>>>> +
>>>>> +		if (folio == NULL)
>>>>> +			*cur_progress += HPAGE_PMD_NR;
>>>> I think this whole block needs some comments. Can you explain, why you
>>>> do a particular increment in each case?
>>>>
>>>>> +		else if (xa_is_value(folio))
>>>>> +			*cur_progress += idx + (1 << xas_get_order(&xas));
>>>>> +		else if (folio_order(folio) == HPAGE_PMD_ORDER)
>>>>> +			*cur_progress += idx + 1;
>>>>> +		else
>>>>> +			*cur_progress += idx + folio_nr_pages(folio);
>>>>> +	}
>>> The "idx" represents the number of PTEs already scanned when exiting
>>> xas_for_each().
>>>
>>> However, the size of the last valid folio is not counted in "idx" (except
>>> when folio == NULL, where "idx" equals HPAGE_PMD_NR); that last folio can
>>> be further divided into three cases:
>> But, the number of PTEs you account in these three cases, are *not*
>> scanned, right? So we can simply drop these 3 cases.
> No, these three cases describe the last folio scanned before the loop
> breaks, so I think we need to account for it. Imagine that the very first
> entry we hit is an HPAGE_PMD_ORDER folio: then "idx" is equal to 0.

If you hit a large folio and break, you don't consume any extra iterations.
The number of iterations is completely determined by xa_index. This is kind
of analogous to SCAN_PMD_MAPPED - in one "iteration", you decide to stop
scanning, and set progress to 1.

In any case, the problem which you describe in the patch description is
completely solved by setting progress to 1 upon failure of find_pmd_or_thp_or_none.
So anything else which you do (like counting iterations in hpage_collapse_scan_pmd
or hpage_collapse_scan_file) is extra and it is not clear if it has a practical
benefit. The patch benefits us because there are a lot of SCAN_PMD_MAPPED and SCAN_PMD_NONE.
Will we practically also encounter a large number of SCAN_EXCEED_PTE_SWAP, PTE_NONE, etc?

I tilt towards keeping the other bits of the patch, if they can be simplified, and
because this patch is relatively harmless. Just like you count the number of iterations
in hpage_collapse_scan_pmd(), you can do the same here using xa_index.
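
Roughly, as a sketch only (reusing the names from this patch, not a final
implementation):

        /* hpage_collapse_scan_pmd(): loop position from the PTE pointers */
        if (cur_progress)
                *cur_progress = _pte - pte;

        /* hpage_collapse_scan_file(): loop position from the xarray index */
        if (cur_progress)
                *cur_progress = xas.xa_index - start;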

>
>>> 1. shmem swap entries (xa_is_value), add folio size.
>>> 2. the folio is HPAGE_PMD_ORDER, the memory has been collapsed
>>>    to PMD, so add 1 only.
>>> 3. Normal folio, add folio size.
>>>
>>> Is the version below more readable?
>>>
>>> 	if (cur_progress) {
>>> 		*cur_progress += xas.xa_index - start;
>>>
>>> 		if (folio) {
>>> 			if (xa_is_value(folio))
>>> 				*cur_progress += 1 << xas_get_order(&xas);
>>> 			else if (folio_order(folio) == HPAGE_PMD_ORDER)
>>> 				*cur_progress += 1;
>>> 			else
>>> 				*cur_progress += folio_nr_pages(folio);
>>> 		}
>>> 	}
>> Yep, this is unneeded complexity. This looks really ugly and the benefits of
>> this are not clear. You can simply do
>>
>> if (cur_progress)
>> 	*cur_progress = xas.xa_index - start;
>>
>>>>>  	rcu_read_unlock();
>>>>>
>>>>>  	if (result == SCAN_SUCCEED) {
>>>>> @@ -2456,6 +2482,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>>>>>
>>>>>  		while (khugepaged_scan.address < hend) {
>>>>>  			bool mmap_locked = true;
>>>>> +			unsigned int cur_progress = 0;
>>>>>
>>>>>  			cond_resched();
>>>>>  			if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
>>>>> @@ -2472,7 +2499,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>>>>>  				mmap_read_unlock(mm);
>>>>>  				mmap_locked = false;
>>>>>  				*result = hpage_collapse_scan_file(mm,
>>>>> -					khugepaged_scan.address, file, pgoff, cc);
>>>>> +					khugepaged_scan.address, file, pgoff,
>>>>> +					&cur_progress, cc);
>>>>>  				fput(file);
>>>>>  				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
>>>>>  					mmap_read_lock(mm);
>>>>> @@ -2486,7 +2514,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>>>>>  				}
>>>>>  			} else {
>>>>>  				*result = hpage_collapse_scan_pmd(mm, vma,
>>>>> -					khugepaged_scan.address, &mmap_locked, cc);
>>>>> +					khugepaged_scan.address, &mmap_locked,
>>>>> +					&cur_progress, cc);
>>>>>  			}
>>>>>
>>>>>  			if (*result == SCAN_SUCCEED)
>>>>> @@ -2494,7 +2523,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>>>>>
>>>>>  			/* move to next address */
>>>>>  			khugepaged_scan.address += HPAGE_PMD_SIZE;
>>>>> -			progress += HPAGE_PMD_NR;
>>>>> +			progress += cur_progress;
>>>>>  			if (!mmap_locked)
>>>>>  				/*
>>>>>  				 * We released mmap_lock so break loop.  Note
>>>>> @@ -2817,7 +2846,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>>>>  			mmap_locked = false;
>>>>>  			*lock_dropped = true;
>>>>>  			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
>>>>> -							  cc);
>>>>> +							  NULL, cc);
>>>>>
>>>>>  			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
>>>>>  			    mapping_can_writeback(file->f_mapping)) {
>>>>> @@ -2832,7 +2861,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>>>>  			fput(file);
>>>>>  		} else {
>>>>>  			result = hpage_collapse_scan_pmd(mm, vma, addr,
>>>>> -							 &mmap_locked, cc);
>>>>> +							 &mmap_locked, NULL, cc);
>>>>>  		}
>>>>>  		if (!mmap_locked)
>>>>>  			*lock_dropped = true;
>>> --
>>> Thanks,
>>> Vernon


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 2/5] mm: khugepaged: refine scan progress number
  2026-01-29  5:35       ` Dev Jain
  2026-01-29  7:59         ` Vernon Yang
@ 2026-01-29  9:18         ` Lance Yang
  2026-01-29 12:28           ` Vernon Yang
  1 sibling, 1 reply; 31+ messages in thread
From: Lance Yang @ 2026-01-29  9:18 UTC (permalink / raw)
  To: Vernon Yang
  Cc: akpm, david, lorenzo.stoakes, ziy, baohua, linux-mm,
	linux-kernel, Dev Jain, Vernon Yang



On 2026/1/29 13:35, Dev Jain wrote:
> 
> On 28/01/26 8:04 pm, Vernon Yang wrote:
>> On Wed, Jan 28, 2026 at 01:59:33PM +0530, Dev Jain wrote:
>>> On 23/01/26 1:52 pm, Vernon Yang wrote:
>>>> From: Vernon Yang <yanglincheng@kylinos.cn>
>>>>
>>>> Currently, each scan always increases "progress" by HPAGE_PMD_NR,
>>>> even if only scanning a single PTE/PMD entry.
>>>>
>>>> - When only scanning a single PTE entry, let me provide a detailed
>>>>    example:
>>>>
>>>> static int hpage_collapse_scan_pmd()
>>>> {
>>>> 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>>>> 	     _pte++, addr += PAGE_SIZE) {
>>>> 		pte_t pteval = ptep_get(_pte);
>>>> 		...
>>>> 		if (pte_uffd_wp(pteval)) { <-- first scan hit
>>>> 			result = SCAN_PTE_UFFD_WP;
>>>> 			goto out_unmap;
>>>> 		}
>>>> 	}
>>>> }
>>>>
>>>> During the first scan, if pte_uffd_wp(pteval) is true, the loop exits
>>>> directly. In practice, only one PTE is scanned before termination.
>>>> Here, "progress += 1" reflects the actual number of PTEs scanned, but
>>>> previously "progress += HPAGE_PMD_NR" always.
>>>>
>>>> - When the memory has been collapsed to PMD, let me provide a detailed
>>>>    example:
>>>>
>>>> The following data is traced by bpftrace on a desktop system. After
>>>> the system has been left idle for 10 minutes upon booting, a lot of
>>>> SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE are observed during a full scan
>>>> by khugepaged.
>>>>
>>>> @scan_pmd_status[1]: 1           ## SCAN_SUCCEED
>>>> @scan_pmd_status[6]: 2           ## SCAN_EXCEED_SHARED_PTE
>>>> @scan_pmd_status[3]: 142         ## SCAN_PMD_MAPPED
>>>> @scan_pmd_status[2]: 178         ## SCAN_NO_PTE_TABLE
>>> Could you elaborate what is [1], [6] etc and 1,2,142, etc?
>> These 1,6 are values of "enum scan_result"; you can directly refer to
>> the notes on the right.
>>
>> These 1,2,142,178 are the counts of the different "enum scan_result"
>> values from trace_mm_khugepaged_scan_pmd and trace_mm_khugepaged_scan_file.
>>
>> For example, SCAN_PMD_MAPPED was hit 142 times during a full scan by
>> khugepaged.
> 
> Thanks. Can you please mention this in the patch description? You can simply
> write it like this:
> 
> "From trace_mm_khugepaged_scan_pmd and trace_mm_khugepaged_scan_file, the
> following statuses were observed, with frequency mentioned next to them:
> 
> SCAN_SUCCEED: 1
> SCAN_PMD_MAPPED: 142
> ....."
> 
> and so on.
> 
>>
>>>> total progress size: 674 MB
>>>> Total time         : 419 seconds ## include khugepaged_scan_sleep_millisecs
>>>>
>>>> The khugepaged_scan list saves all tasks that support collapsing into
>>>> hugepages; as long as a task is not destroyed, khugepaged will not remove
>>>> it from the khugepaged_scan list. This leads to a situation where a task
>>>> has already collapsed all of its memory regions into hugepages, yet
>>>> khugepaged keeps scanning it, which wastes CPU time for no benefit, and
>>>> because of khugepaged_scan_sleep_millisecs (default 10s) the large number
>>>> of pointless scans delays the genuinely useful ones.
>>>>
>>>> After applying this patch, when the scan result is either SCAN_PMD_MAPPED
>>>> or SCAN_NO_PTE_TABLE, just skip it, as follows:
>>>>
>>>> @scan_pmd_status[6]: 2
>>>> @scan_pmd_status[3]: 147
>>>> @scan_pmd_status[2]: 173
>>>> total progress size: 45 MB
>>>> Total time         : 20 seconds
>>>>
>>>> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
>>>> ---
>>>>   include/linux/xarray.h |  9 ++++++++
>>>>   mm/khugepaged.c        | 47 ++++++++++++++++++++++++++++++++++--------
>>>>   2 files changed, 47 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/include/linux/xarray.h b/include/linux/xarray.h
>>>> index be850174e802..f77d97d7b957 100644
>>>> --- a/include/linux/xarray.h
>>>> +++ b/include/linux/xarray.h
>>>> @@ -1646,6 +1646,15 @@ static inline void xas_set(struct xa_state *xas, unsigned long index)
>>>>   	xas->xa_node = XAS_RESTART;
>>>>   }
>>>>
>>>> +/**
>>>> + * xas_get_index() - Get XArray operation state for a different index.
>>>> + * @xas: XArray operation state.
>>>> + */
>>>> +static inline unsigned long xas_get_index(struct xa_state *xas)
>>>> +{
>>>> +	return xas->xa_index;
>>>> +}
>>>> +
>>>>   /**
>>>>    * xas_advance() - Skip over sibling entries.
>>>>    * @xas: XArray operation state.
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index 6f0f05148765..de95029e3763 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -68,7 +68,10 @@ enum scan_result {
>>>>   static struct task_struct *khugepaged_thread __read_mostly;
>>>>   static DEFINE_MUTEX(khugepaged_mutex);
>>>>
>>>> -/* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 second */
>>>> +/*
>>>> + * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
>>>> + * every 10 second.
>>>> + */
>>>>   static unsigned int khugepaged_pages_to_scan __read_mostly;
>>>>   static unsigned int khugepaged_pages_collapsed;
>>>>   static unsigned int khugepaged_full_scans;
>>>> @@ -1240,7 +1243,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>>>   }
>>>>
>>>>   static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>> -		struct vm_area_struct *vma, unsigned long start_addr, bool *mmap_locked,
>>>> +		struct vm_area_struct *vma, unsigned long start_addr,
>>>> +		bool *mmap_locked, unsigned int *cur_progress,
>>>>   		struct collapse_control *cc)
>>>>   {
>>>>   	pmd_t *pmd;
>>>> @@ -1255,6 +1259,9 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>>
>>>>   	VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
>>>>
>>>> +	if (cur_progress)
>>>> +		*cur_progress += 1;
>>> Why not be a little more explicit, and do this addition if find_pmd_or_thp_or_none fails,
>>> or pte_offset_map_lock fails? The way you do it right now is not readable - it gives no
>>> idea as to why on function entry we do a +1 right away. Please do add some comments too.
>> If this way is not clear enough, we can directly add 1 in
>> find_pmd_or_thp_or_none() etc, BUT it's a bit redundant.
>> Please take a look at which one is better.
>>
>> case 1:
>> as the V4 PATCH #2 [1] and #3 [2], only hpage_collapse_scan_pmd().
>> [1] https://lore.kernel.org/linux-mm/20260111121909.8410-3-yanglincheng@kylinos.cn
>> [2] https://lore.kernel.org/linux-mm/20260111121909.8410-4-yanglincheng@kylinos.cn
>>
>> static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>> 		struct vm_area_struct *vma, unsigned long start_addr,
>> 		bool *mmap_locked, unsigned int *cur_progress,
>> 		struct collapse_control *cc)
>> {
>> 	...
>> 	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
>> 	if (result != SCAN_SUCCEED) {
>> 		if (cur_progress)
>> 			*cur_progress += 1; // here
>> 		goto out;
>> 	}
>> 	...
>> 	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
>> 	if (!pte) {
>> 		if (cur_progress)
>> 			*cur_progress += 1; // here
>> 		result = SCAN_NO_PTE_TABLE;
>> 		goto out;
>> 	}
>>
>> 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>> 	     _pte++, addr += PAGE_SIZE) {
>> 		if (cur_progress)
>> 			*cur_progress += 1; // here
>> 		...
>> 	}
>> }
>>
>> case 2:
>>
>> static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>> 		struct vm_area_struct *vma, unsigned long start_addr,
>> 		bool *mmap_locked, unsigned int *cur_progress,
>> 		struct collapse_control *cc)
>> {
>> 	...
>> 	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
>> 	if (result != SCAN_SUCCEED) {
>> 		if (cur_progress)
>> 			*cur_progress += 1; // here
> 
> Let us be more explicit and set this equal to 1, instead of adding 1.
> 
>> 		goto out;
>> 	}
>> 	...
>> 	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
>> 	if (!pte) {
>> 		if (cur_progress)
>> 			*cur_progress += 1; // here
> 
> Same comment as above.
> 
>> 		result = SCAN_NO_PTE_TABLE;
>> 		goto out;
>> 	}
>>
>> 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>> 	     _pte++, addr += PAGE_SIZE) {
>> 		...
>> 	}
>> 	...
>> out_unmap:
>> 	if (cur_progress) {
>> 		if (_pte >= pte + HPAGE_PMD_NR)
>> 			*cur_progress += HPAGE_PMD_NR;   // here
>> 		else
>> 			*cur_progress += _pte - pte + 1; // here
>> 	}
>> }
> 
> I vote for case 2. In case 1 I don't like the fact that the if (cur_progress)
> branch will be checked each iteration - and I don't think the compiler can
> optimize this since the body of the loop is complex, so this check cannot
> be hoisted out of the loop.
> 
> 
>>
>> case 3:
>> 	keep the current patch, and add more comments to make it clearer.
>>
>>>> +
>>>>   	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
>>>>   	if (result != SCAN_SUCCEED)
>>>>   		goto out;
>>>> @@ -1396,6 +1403,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>>   		result = SCAN_SUCCEED;
>>>>   	}
>>>>   out_unmap:
>>>> +	if (cur_progress) {
>>>> +		if (_pte >= pte + HPAGE_PMD_NR)
>>>> +			*cur_progress += HPAGE_PMD_NR - 1;
>>>> +		else
>>>> +			*cur_progress += _pte - pte;
>>>> +	}
>>>>   	pte_unmap_unlock(pte, ptl);
>>>>   	if (result == SCAN_SUCCEED) {
>>>>   		result = collapse_huge_page(mm, start_addr, referenced,
>>>> @@ -2286,8 +2299,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>>>>   	return result;
>>>>   }
>>>>
>>>> -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
>>>> -		struct file *file, pgoff_t start, struct collapse_control *cc)
>>>> +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
>>>> +		unsigned long addr, struct file *file, pgoff_t start,
>>>> +		unsigned int *cur_progress, struct collapse_control *cc)
>>>>   {
>>>>   	struct folio *folio = NULL;
>>>>   	struct address_space *mapping = file->f_mapping;
>>>> @@ -2376,6 +2390,18 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
>>>>   			cond_resched_rcu();
>>>>   		}
>>>>   	}
>>>> +	if (cur_progress) {
>>>> +		unsigned long idx = xas_get_index(&xas) - start;
>>>> +
>>>> +		if (folio == NULL)
>>>> +			*cur_progress += HPAGE_PMD_NR;
>>> I think this whole block needs some comments. Can you explain, why you
>>> do a particular increment in each case?
>>>
>>>> +		else if (xa_is_value(folio))
>>>> +			*cur_progress += idx + (1 << xas_get_order(&xas));
>>>> +		else if (folio_order(folio) == HPAGE_PMD_ORDER)
>>>> +			*cur_progress += idx + 1;
>>>> +		else
>>>> +			*cur_progress += idx + folio_nr_pages(folio);
>>>> +	}
>> The "idx" represents the number of PTEs already scanned when exiting
>> xas_for_each().
>>
>> However, the size of the last valid folio is not counted in "idx" (except
>> when folio == NULL, where "idx" equals HPAGE_PMD_NR); that last folio can
>> be further divided into three cases:
> 
> But, the number of PTEs you account in these three cases, are *not*
> scanned, right? So we can simply drop these 3 cases.
> 
>>
>> 1. shmem swap entries (xa_is_value), add folio size.
>> 2. the folio is HPAGE_PMD_ORDER, the memory has been collapsed
>>     to PMD, so add 1 only.
>> 3. Normal folio, add folio size.
>>
>> Is the version below more readable?
>>
>> 	if (cur_progress) {
>> 		*cur_progress += xas.xa_index - start;
>>
>> 		if (folio) {
>> 			if (xa_is_value(folio))
>> 				*cur_progress += 1 << xas_get_order(&xas);
>> 			else if (folio_order(folio) == HPAGE_PMD_ORDER)
>> 				*cur_progress += 1;
>> 			else
>> 				*cur_progress += folio_nr_pages(folio);
>> 		}
>> 	}
> 
> Yep, this is unneeded complexity. This looks really ugly and the benefits of
> this are not clear. You can simply do
> 
> if (cur_progress)
> 	*cur_progress = xas.xa_index - start;
> 

I agree with Dev here. The extra complexity in hpage_collapse_scan_file()
doesn't seem worth it.

Suggest:

if (cur_progress)
	*cur_progress = max(xas.xa_index - start, 1UL);

Just keeps it simple, and handles the idx=0 case you mentioned as well.
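
Spelled out with comments (illustration only):

	/*
	 * break on the very first entry: xas.xa_index == start, so the
	 * difference is 0 and max() reports 1;
	 * break at some later index i: the difference is i - start.
	 */
	if (cur_progress)
		*cur_progress = max(xas.xa_index - start, 1UL);

Either way every scanned PMD range adds at least 1 to "progress", so the
pages budget in khugepaged_scan_mm_slot() still moves forward.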

[...]


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 2/5] mm: khugepaged: refine scan progress number
  2026-01-29  8:32           ` Dev Jain
@ 2026-01-29 12:24             ` Vernon Yang
  2026-01-29 12:46               ` Dev Jain
  0 siblings, 1 reply; 31+ messages in thread
From: Vernon Yang @ 2026-01-29 12:24 UTC (permalink / raw)
  To: Dev Jain, lance.yang
  Cc: akpm, david, lorenzo.stoakes, ziy, baohua, linux-mm,
	linux-kernel, Vernon Yang

On Thu, Jan 29, 2026 at 4:32 PM Dev Jain <dev.jain@arm.com> wrote:
>
> On 29/01/26 1:29 pm, Vernon Yang wrote:
> > On Thu, Jan 29, 2026 at 11:05:36AM +0530, Dev Jain wrote:
> >> On 28/01/26 8:04 pm, Vernon Yang wrote:
> >>> On Wed, Jan 28, 2026 at 01:59:33PM +0530, Dev Jain wrote:
> >>>> On 23/01/26 1:52 pm, Vernon Yang wrote:
> >>>>> From: Vernon Yang <yanglincheng@kylinos.cn>
> >>>>>
> >>>>> Currently, each scan always increases "progress" by HPAGE_PMD_NR,
> >>>>> even if only scanning a single PTE/PMD entry.
> >>>>>
> >>>>> - When only scanning a single PTE entry, let me provide a detailed
> >>>>>   example:
> >>>>>
> >>>>> static int hpage_collapse_scan_pmd()
> >>>>> {
> >>>>>   for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> >>>>>        _pte++, addr += PAGE_SIZE) {
> >>>>>           pte_t pteval = ptep_get(_pte);
> >>>>>           ...
> >>>>>           if (pte_uffd_wp(pteval)) { <-- first scan hit
> >>>>>                   result = SCAN_PTE_UFFD_WP;
> >>>>>                   goto out_unmap;
> >>>>>           }
> >>>>>   }
> >>>>> }
> >>>>>
> >>>>> During the first scan, if pte_uffd_wp(pteval) is true, the loop exits
> >>>>> directly. In practice, only one PTE is scanned before termination.
> >>>>> Here, "progress += 1" reflects the actual number of PTEs scanned, but
> >>>>> previously "progress += HPAGE_PMD_NR" always.
> >>>>>
> >>>>> - When the memory has been collapsed to PMD, let me provide a detailed
> >>>>>   example:
> >>>>>
> >>>>> The following data is traced by bpftrace on a desktop system. After
> >>>>> the system has been left idle for 10 minutes upon booting, a lot of
> >>>>> SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE are observed during a full scan
> >>>>> by khugepaged.
> >>>>>
> >>>>> @scan_pmd_status[1]: 1           ## SCAN_SUCCEED
> >>>>> @scan_pmd_status[6]: 2           ## SCAN_EXCEED_SHARED_PTE
> >>>>> @scan_pmd_status[3]: 142         ## SCAN_PMD_MAPPED
> >>>>> @scan_pmd_status[2]: 178         ## SCAN_NO_PTE_TABLE
> >>>> Could you elaborate what is [1], [6] etc and 1,2,142, etc?
> >>> These 1,6 are values of "enum scan_result"; you can directly refer to
> >>> the notes on the right.
> >>>
> >>> These 1,2,142,178 are the counts of the different "enum scan_result"
> >>> values from trace_mm_khugepaged_scan_pmd and trace_mm_khugepaged_scan_file.
> >>>
> >>> For example, SCAN_PMD_MAPPED was hit 142 times during a full scan by
> >>> khugepaged.
> >> Thanks. Can you please mention this in the patch description? You can simply
> >> write it like this:
> >>
> >> "From trace_mm_khugepaged_scan_pmd and trace_mm_khugepaged_scan_file, the
> >> following statuses were observed, with frequency mentioned next to them:
> >>
> >> SCAN_SUCCEED: 1
> >> SCAN_PMD_MAPPED: 142
> >> ....."
> >>
> >> and so on.
> > LGTM, I will do it in the next version. Thanks!
> >
> >>>>> total progress size: 674 MB
> >>>>> Total time         : 419 seconds ## include khugepaged_scan_sleep_millisecs
> >>>>>
> >>>>> The khugepaged_scan list saves all tasks that support collapsing into
> >>>>> hugepages; as long as a task is not destroyed, khugepaged will not remove
> >>>>> it from the khugepaged_scan list. This leads to a situation where a task
> >>>>> has already collapsed all of its memory regions into hugepages, yet
> >>>>> khugepaged keeps scanning it, which wastes CPU time for no benefit, and
> >>>>> because of khugepaged_scan_sleep_millisecs (default 10s) the large number
> >>>>> of pointless scans delays the genuinely useful ones.
> >>>>>
> >>>>> After applying this patch, when the scan result is either SCAN_PMD_MAPPED
> >>>>> or SCAN_NO_PTE_TABLE, just skip it, as follows:
> >>>>>
> >>>>> @scan_pmd_status[6]: 2
> >>>>> @scan_pmd_status[3]: 147
> >>>>> @scan_pmd_status[2]: 173
> >>>>> total progress size: 45 MB
> >>>>> Total time         : 20 seconds
> >>>>>
> >>>>> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> >>>>> ---
> >>>>>  include/linux/xarray.h |  9 ++++++++
> >>>>>  mm/khugepaged.c        | 47 ++++++++++++++++++++++++++++++++++--------
> >>>>>  2 files changed, 47 insertions(+), 9 deletions(-)
> >>>>>
> >>>>> diff --git a/include/linux/xarray.h b/include/linux/xarray.h
> >>>>> index be850174e802..f77d97d7b957 100644
> >>>>> --- a/include/linux/xarray.h
> >>>>> +++ b/include/linux/xarray.h
> >>>>> @@ -1646,6 +1646,15 @@ static inline void xas_set(struct xa_state *xas, unsigned long index)
> >>>>>   xas->xa_node = XAS_RESTART;
> >>>>>  }
> >>>>>
> >>>>> +/**
> >>>>> + * xas_get_index() - Get XArray operation state for a different index.
> >>>>> + * @xas: XArray operation state.
> >>>>> + */
> >>>>> +static inline unsigned long xas_get_index(struct xa_state *xas)
> >>>>> +{
> >>>>> + return xas->xa_index;
> >>>>> +}
> >>>>> +
> >>>>>  /**
> >>>>>   * xas_advance() - Skip over sibling entries.
> >>>>>   * @xas: XArray operation state.
> >>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >>>>> index 6f0f05148765..de95029e3763 100644
> >>>>> --- a/mm/khugepaged.c
> >>>>> +++ b/mm/khugepaged.c
> >>>>> @@ -68,7 +68,10 @@ enum scan_result {
> >>>>>  static struct task_struct *khugepaged_thread __read_mostly;
> >>>>>  static DEFINE_MUTEX(khugepaged_mutex);
> >>>>>
> >>>>> -/* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 second */
> >>>>> +/*
> >>>>> + * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
> >>>>> + * every 10 second.
> >>>>> + */
> >>>>>  static unsigned int khugepaged_pages_to_scan __read_mostly;
> >>>>>  static unsigned int khugepaged_pages_collapsed;
> >>>>>  static unsigned int khugepaged_full_scans;
> >>>>> @@ -1240,7 +1243,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >>>>>  }
> >>>>>
> >>>>>  static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >>>>> -         struct vm_area_struct *vma, unsigned long start_addr, bool *mmap_locked,
> >>>>> +         struct vm_area_struct *vma, unsigned long start_addr,
> >>>>> +         bool *mmap_locked, unsigned int *cur_progress,
> >>>>>           struct collapse_control *cc)
> >>>>>  {
> >>>>>   pmd_t *pmd;
> >>>>> @@ -1255,6 +1259,9 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >>>>>
> >>>>>   VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
> >>>>>
> >>>>> + if (cur_progress)
> >>>>> +         *cur_progress += 1;
> >>>> Why not be a little more explicit, and do this addition if find_pmd_or_thp_or_none fails,
> >>>> or pte_offset_map_lock fails? The way you do it right now is not readable - it gives no
> >>>> idea as to why on function entry we do a +1 right away. Please do add some comments too.
> >>> If this way is not clear enough, we can directly add 1 in
> >>> find_pmd_or_thp_or_none() etc, BUT it's a bit redundant.
> >>> Please take a look at which one is better.
> >>>
> >>> case 1:
> >>> as the V4 PATCH #2 [1] and #3 [2], only hpage_collapse_scan_pmd().
> >>> [1] https://lore.kernel.org/linux-mm/20260111121909.8410-3-yanglincheng@kylinos.cn
> >>> [2] https://lore.kernel.org/linux-mm/20260111121909.8410-4-yanglincheng@kylinos.cn
> >>>
> >>> static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >>>             struct vm_area_struct *vma, unsigned long start_addr,
> >>>             bool *mmap_locked, unsigned int *cur_progress,
> >>>             struct collapse_control *cc)
> >>> {
> >>>     ...
> >>>     result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
> >>>     if (result != SCAN_SUCCEED) {
> >>>             if (cur_progress)
> >>>                     *cur_progress += 1; // here
> >>>             goto out;
> >>>     }
> >>>     ...
> >>>     pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> >>>     if (!pte) {
> >>>             if (cur_progress)
> >>>                     *cur_progress += 1; // here
> >>>             result = SCAN_NO_PTE_TABLE;
> >>>             goto out;
> >>>     }
> >>>
> >>>     for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> >>>          _pte++, addr += PAGE_SIZE) {
> >>>             if (cur_progress)
> >>>                     *cur_progress += 1; // here
> >>>             ...
> >>>     }
> >>> }
> >>>
> >>> case 2:
> >>>
> >>> static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >>>             struct vm_area_struct *vma, unsigned long start_addr,
> >>>             bool *mmap_locked, unsigned int *cur_progress,
> >>>             struct collapse_control *cc)
> >>> {
> >>>     ...
> >>>     result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
> >>>     if (result != SCAN_SUCCEED) {
> >>>             if (cur_progress)
> >>>                     *cur_progress += 1; // here
> >> Let us be more explicit and set this equal to 1, instead of adding 1.
> > LGTM, I will do it in the next version. Thanks!
> >
> >>>             goto out;
> >>>     }
> >>>     ...
> >>>     pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> >>>     if (!pte) {
> >>>             if (cur_progress)
> >>>                     *cur_progress += 1; // here
> >> Same comment as above.
> >>
> >>>             result = SCAN_NO_PTE_TABLE;
> >>>             goto out;
> >>>     }
> >>>
> >>>     for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> >>>          _pte++, addr += PAGE_SIZE) {
> >>>             ...
> >>>     }
> >>>     ...
> >>> out_unmap:
> >>>     if (cur_progress) {
> >>>             if (_pte >= pte + HPAGE_PMD_NR)
> >>>                     *cur_progress += HPAGE_PMD_NR;   // here
> >>>             else
> >>>                     *cur_progress += _pte - pte + 1; // here
> >>>     }
> >>> }
> >> I will vote case 2. In case 1 I don't like the fact that the if (cur_progress)
> >> branch will be checked each iteration - and I don't think the compiler can
> >> optimize this since the body of the loop is complex, so this check cannot
> >> be hoisted out of the loop.
> >>
> >>
> >>> case 3:
> >>>     current patch, and add more comments to clearer.
> >>>
> >>>>> +
> >>>>>   result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
> >>>>>   if (result != SCAN_SUCCEED)
> >>>>>           goto out;
> >>>>> @@ -1396,6 +1403,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >>>>>           result = SCAN_SUCCEED;
> >>>>>   }
> >>>>>  out_unmap:
> >>>>> + if (cur_progress) {
> >>>>> +         if (_pte >= pte + HPAGE_PMD_NR)
> >>>>> +                 *cur_progress += HPAGE_PMD_NR - 1;
> >>>>> +         else
> >>>>> +                 *cur_progress += _pte - pte;
> >>>>> + }
> >>>>>   pte_unmap_unlock(pte, ptl);
> >>>>>   if (result == SCAN_SUCCEED) {
> >>>>>           result = collapse_huge_page(mm, start_addr, referenced,
> >>>>> @@ -2286,8 +2299,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> >>>>>   return result;
> >>>>>  }
> >>>>>
> >>>>> -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> >>>>> -         struct file *file, pgoff_t start, struct collapse_control *cc)
> >>>>> +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
> >>>>> +         unsigned long addr, struct file *file, pgoff_t start,
> >>>>> +         unsigned int *cur_progress, struct collapse_control *cc)
> >>>>>  {
> >>>>>   struct folio *folio = NULL;
> >>>>>   struct address_space *mapping = file->f_mapping;
> >>>>> @@ -2376,6 +2390,18 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
> >>>>>                   cond_resched_rcu();
> >>>>>           }
> >>>>>   }
> >>>>> + if (cur_progress) {
> >>>>> +         unsigned long idx = xas_get_index(&xas) - start;
> >>>>> +
> >>>>> +         if (folio == NULL)
> >>>>> +                 *cur_progress += HPAGE_PMD_NR;
> >>>> I think this whole block needs some comments. Can you explain, why you
> >>>> do a particular increment in each case?
> >>>>
> >>>>> +         else if (xa_is_value(folio))
> >>>>> +                 *cur_progress += idx + (1 << xas_get_order(&xas));
> >>>>> +         else if (folio_order(folio) == HPAGE_PMD_ORDER)
> >>>>> +                 *cur_progress += idx + 1;
> >>>>> +         else
> >>>>> +                 *cur_progress += idx + folio_nr_pages(folio);
> >>>>> + }
> >>> The "idx" represent PTEs number already scanned when exiting
> >>> xas_for_each().
> >>>
> >>> However, the last valid folio size was not counted in "idx" (except
> >>> folio == NULL, "idx" equal to HPAGE_PMD_NR), which can be further
> >>> divided into three cases:
> >> But the PTEs you account for in these three cases are *not*
> >> scanned, right? So we can simply drop these 3 cases.
> > No, these three cases cover the last folio being scanned when we break, so
> > I think we need to add them. Imagine that the first folio we hit is of
> > HPAGE_PMD_ORDER: then "idx" is equal to 0.
>
> If you hit a large folio and break, you don't consume any extra iterations.
> The number of iterations is completely determined by xa_index. This is kind
> of analogous to SCAN_PMD_MAPPED - in one "iteration", you decide to stop
> scanning, and set progress to 1.
>
> In any case, the problem which you describe in the patch description is
> completely solved by setting progress to 1 upon failure of find_pmd_or_thp_or_none.
> So anything else which you do (like counting iterations in hpage_collapse_scan_pmd
> or hpage_collapse_scan_file) is extra and it is not clear if it has a practical
> benefit. The patch benefits us because there are a lot of SCAN_PMD_MAPPED and SCAN_PMD_NONE.
> Will we practically also encounter a large number of SCAN_EXCEED_PTE_SWAP, PTE_NONE, etc?
>
> I tilt towards keeping the other bits of the patch, if they can be simplified, and
> because this patch is relatively harmless. Just like you count the number of iterations
> in hpage_collapse_scan_pmd(), you can do the same here using xa_index.

The semantics of _pte in hpage_collapse_scan_pmd() and of xas.xa_index in
hpage_collapse_scan_file() are different.

Let me give a detailed example :)

xas->xa_index represents the starting index of the last folio when exiting
xas_for_each().

Assuming each folio size is 32KB, when we iterate "folio1 -> folio2" and break,
xas->xa_index equals the starting index of folio2, so "idx = 8". However,
folio2 is not counted in "idx".

In reality, folio2 has been scanned, so we need to add the folio2 size, making
"idx = 16" the correct value.

There are two possible cases for folio2:
1. shmem swap entries (xa_is_value), breaking out of the xas_for_each loop
   due to SCAN_EXCEED_SWAP_PTE.
2. Normal folio, breaking out of the xas_for_each loop due to
   SCAN_SCAN_ABORT/SCAN_PAGE_LRU/SCAN_PAGE_COUNT.

Moving Lance's suggestion here:
> if (cur_progress)
>       *cur_progress = max(xas.xa_index - start, 1UL);

If we use the code suggested by Lance, we will miss the folio2 size, and the
result is "idx = 8".
Is this the result we want? If yes, Lance's suggested code is perfect.

Another more specific scenario is when the first folio iterated is of
HPAGE_PMD_ORDER, so "idx = 0". Yep, we can directly set "cur_progress = 1";
it's simple enough.
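
For concreteness, a minimal sketch of the two accounting choices for the
"folio1 -> folio2" example above (32KB folios, i.e. 8 PTEs each; this is only
an illustration reusing the cur_progress/start names from the patch, not the
posted code):

        /* xas_for_each() iterated folio1 fully, then broke on folio2 */
        idx = xas.xa_index - start;             /* == 8, folio2 not counted */

        /* this patch (normal-folio case): also charge folio2's pages */
        *cur_progress = idx + folio_nr_pages(folio);    /* == 16 */

        /* Lance's suggestion: charge completed iterations only, at least 1 */
        *cur_progress = max(xas.xa_index - start, 1UL); /* == 8 */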

> >
> >>> 1. shmem swap entries (xa_is_value), add folio size.
> >>> 2. the folio is HPAGE_PMD_ORDER, the memory has been collapsed
> >>>    to PMD, so add 1 only.
> >>> 3. Normal folio, add folio size.
> >>>
> >>> Is the version below more readable?
> >>>
> >>>     if (cur_progress) {
> >>>             *cur_progress += xas.xa_index - start;
> >>>
> >>>             if (folio) {
> >>>                     if (xa_is_value(folio))
> >>>                             *cur_progress += 1 << xas_get_order(&xas);
> >>>                     else if (folio_order(folio) == HPAGE_PMD_ORDER)
> >>>                             *cur_progress += 1;
> >>>                     else
> >>>                             *cur_progress += folio_nr_pages(folio);
> >>>             }
> >>>     }
> >> Yep, this is unneeded complexity. This looks really ugly and the benefits of
> >> this are not clear. You can simply do
> >>
> >> if (cur_progress)
> >>      *cur_progress = xas.xa_index - start;
> >>
> >>>>>   rcu_read_unlock();
> >>>>>
> >>>>>   if (result == SCAN_SUCCEED) {
> >>>>> @@ -2456,6 +2482,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
> >>>>>
> >>>>>           while (khugepaged_scan.address < hend) {
> >>>>>                   bool mmap_locked = true;
> >>>>> +                 unsigned int cur_progress = 0;
> >>>>>
> >>>>>                   cond_resched();
> >>>>>                   if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> >>>>> @@ -2472,7 +2499,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
> >>>>>                           mmap_read_unlock(mm);
> >>>>>                           mmap_locked = false;
> >>>>>                           *result = hpage_collapse_scan_file(mm,
> >>>>> -                                 khugepaged_scan.address, file, pgoff, cc);
> >>>>> +                                 khugepaged_scan.address, file, pgoff,
> >>>>> +                                 &cur_progress, cc);
> >>>>>                           fput(file);
> >>>>>                           if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> >>>>>                                   mmap_read_lock(mm);
> >>>>> @@ -2486,7 +2514,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
> >>>>>                           }
> >>>>>                   } else {
> >>>>>                           *result = hpage_collapse_scan_pmd(mm, vma,
> >>>>> -                                 khugepaged_scan.address, &mmap_locked, cc);
> >>>>> +                                 khugepaged_scan.address, &mmap_locked,
> >>>>> +                                 &cur_progress, cc);
> >>>>>                   }
> >>>>>
> >>>>>                   if (*result == SCAN_SUCCEED)
> >>>>> @@ -2494,7 +2523,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
> >>>>>
> >>>>>                   /* move to next address */
> >>>>>                   khugepaged_scan.address += HPAGE_PMD_SIZE;
> >>>>> -                 progress += HPAGE_PMD_NR;
> >>>>> +                 progress += cur_progress;
> >>>>>                   if (!mmap_locked)
> >>>>>                           /*
> >>>>>                            * We released mmap_lock so break loop.  Note
> >>>>> @@ -2817,7 +2846,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> >>>>>                   mmap_locked = false;
> >>>>>                   *lock_dropped = true;
> >>>>>                   result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> >>>>> -                                                   cc);
> >>>>> +                                                   NULL, cc);
> >>>>>
> >>>>>                   if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
> >>>>>                       mapping_can_writeback(file->f_mapping)) {
> >>>>> @@ -2832,7 +2861,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> >>>>>                   fput(file);
> >>>>>           } else {
> >>>>>                   result = hpage_collapse_scan_pmd(mm, vma, addr,
> >>>>> -                                                  &mmap_locked, cc);
> >>>>> +                                                  &mmap_locked, NULL, cc);
> >>>>>           }
> >>>>>           if (!mmap_locked)
> >>>>>                   *lock_dropped = true;
> >>> --
> >>> Thanks,
> >>> Vernon


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 2/5] mm: khugepaged: refine scan progress number
  2026-01-29  9:18         ` Lance Yang
@ 2026-01-29 12:28           ` Vernon Yang
  0 siblings, 0 replies; 31+ messages in thread
From: Vernon Yang @ 2026-01-29 12:28 UTC (permalink / raw)
  To: Lance Yang
  Cc: akpm, david, lorenzo.stoakes, ziy, baohua, linux-mm,
	linux-kernel, Dev Jain, Vernon Yang

On Thu, Jan 29, 2026 at 5:18 PM Lance Yang <lance.yang@linux.dev> wrote:
>
> On 2026/1/29 13:35, Dev Jain wrote:
> >
> > On 28/01/26 8:04 pm, Vernon Yang wrote:
> >> On Wed, Jan 28, 2026 at 01:59:33PM +0530, Dev Jain wrote:
> >>> On 23/01/26 1:52 pm, Vernon Yang wrote:
> >>>> From: Vernon Yang <yanglincheng@kylinos.cn>
> >>>>
> >>>> Currently, each scan always increases "progress" by HPAGE_PMD_NR,
> >>>> even if only scanning a single PTE/PMD entry.
> >>>>
> >>>>> - When only scanning a single PTE entry, let me provide a detailed
> >>>>    example:
> >>>>
> >>>> static int hpage_collapse_scan_pmd()
> >>>> {
> >>>>    for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> >>>>         _pte++, addr += PAGE_SIZE) {
> >>>>            pte_t pteval = ptep_get(_pte);
> >>>>            ...
> >>>>            if (pte_uffd_wp(pteval)) { <-- first scan hit
> >>>>                    result = SCAN_PTE_UFFD_WP;
> >>>>                    goto out_unmap;
> >>>>            }
> >>>>    }
> >>>> }
> >>>>
> >>>> During the first scan, if pte_uffd_wp(pteval) is true, the loop exits
> >>>> directly. In practice, only one PTE is scanned before termination.
> >>>> Here, "progress += 1" reflects the actual number of PTEs scanned, but
> >>>> previously "progress += HPAGE_PMD_NR" always.
> >>>>
> >>>> - When the memory has been collapsed to PMD, let me provide a detailed
> >>>>    example:
> >>>>
> >>>> The following data is traced by bpftrace on a desktop system. After
> >>>> the system has been left idle for 10 minutes upon booting, a lot of
> >>>> SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE are observed during a full scan
> >>>> by khugepaged.
> >>>>
> >>>> @scan_pmd_status[1]: 1           ## SCAN_SUCCEED
> >>>> @scan_pmd_status[6]: 2           ## SCAN_EXCEED_SHARED_PTE
> >>>> @scan_pmd_status[3]: 142         ## SCAN_PMD_MAPPED
> >>>> @scan_pmd_status[2]: 178         ## SCAN_NO_PTE_TABLE
> >>> Could you elaborate what is [1], [6] etc and 1,2,142, etc?
> >> These 1,6 are value of "enum scan_result", you can directly refer to the
> >> notes on the right.
> >>
> >> These 1,2,142,178 are number of different "enum scan_result" from
> >> trace_mm_khugepaged_scan_pmd and trace_mm_khugepaged_scan_file.
> >>
> >> as example, SCAN_PMD_MAPPED has 142 times during a full scan by
> >> khugepaged.
> >
> > Thanks. Can you please mention this in the patch description? You can simply
> > write it like this:
> >
> > "From trace_mm_khugepaged_scan_pmd and trace_mm_khugepaged_scan_file, the
> > following statuses were observed, with frequency mentioned next to them:
> >
> > SCAN_SUCCEED: 1
> > SCAN_PMD_MAPPED: 142
> > ....."
> >
> > and so on.
> >
> >>
> >>>> total progress size: 674 MB
> >>>> Total time         : 419 seconds ## include khugepaged_scan_sleep_millisecs
> >>>>
> >>>> The khugepaged_scan list saves all tasks that support collapsing into
> >>>> hugepages; as long as a task is not destroyed, khugepaged will not
> >>>> remove it from the khugepaged_scan list. This leads to a phenomenon
> >>>> where a task has already collapsed all of its memory regions into
> >>>> hugepages, yet khugepaged continues to scan it, which wastes CPU time
> >>>> to no effect, and because of khugepaged_scan_sleep_millisecs (default
> >>>> 10s) scanning a large number of such invalid tasks makes the really
> >>>> valid tasks wait a long time to be scanned.
> >>>>
> >>>> After applying this patch, when the scan result is either
> >>>> SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE, the range is simply skipped, as
> >>>> follows:
> >>>>
> >>>> @scan_pmd_status[6]: 2
> >>>> @scan_pmd_status[3]: 147
> >>>> @scan_pmd_status[2]: 173
> >>>> total progress size: 45 MB
> >>>> Total time         : 20 seconds
> >>>>
> >>>> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> >>>> ---
> >>>>   include/linux/xarray.h |  9 ++++++++
> >>>>   mm/khugepaged.c        | 47 ++++++++++++++++++++++++++++++++++--------
> >>>>   2 files changed, 47 insertions(+), 9 deletions(-)
> >>>>
> >>>> diff --git a/include/linux/xarray.h b/include/linux/xarray.h
> >>>> index be850174e802..f77d97d7b957 100644
> >>>> --- a/include/linux/xarray.h
> >>>> +++ b/include/linux/xarray.h
> >>>> @@ -1646,6 +1646,15 @@ static inline void xas_set(struct xa_state *xas, unsigned long index)
> >>>>    xas->xa_node = XAS_RESTART;
> >>>>   }
> >>>>
> >>>> +/**
> >>>> + * xas_get_index() - Get XArray operation state for a different index.
> >>>> + * @xas: XArray operation state.
> >>>> + */
> >>>> +static inline unsigned long xas_get_index(struct xa_state *xas)
> >>>> +{
> >>>> +  return xas->xa_index;
> >>>> +}
> >>>> +
> >>>>   /**
> >>>>    * xas_advance() - Skip over sibling entries.
> >>>>    * @xas: XArray operation state.
> >>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >>>> index 6f0f05148765..de95029e3763 100644
> >>>> --- a/mm/khugepaged.c
> >>>> +++ b/mm/khugepaged.c
> >>>> @@ -68,7 +68,10 @@ enum scan_result {
> >>>>   static struct task_struct *khugepaged_thread __read_mostly;
> >>>>   static DEFINE_MUTEX(khugepaged_mutex);
> >>>>
> >>>> -/* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 second */
> >>>> +/*
> >>>> + * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
> >>>> + * every 10 second.
> >>>> + */
> >>>>   static unsigned int khugepaged_pages_to_scan __read_mostly;
> >>>>   static unsigned int khugepaged_pages_collapsed;
> >>>>   static unsigned int khugepaged_full_scans;
> >>>> @@ -1240,7 +1243,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >>>>   }
> >>>>
> >>>>   static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >>>> -          struct vm_area_struct *vma, unsigned long start_addr, bool *mmap_locked,
> >>>> +          struct vm_area_struct *vma, unsigned long start_addr,
> >>>> +          bool *mmap_locked, unsigned int *cur_progress,
> >>>>            struct collapse_control *cc)
> >>>>   {
> >>>>    pmd_t *pmd;
> >>>> @@ -1255,6 +1259,9 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >>>>
> >>>>    VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
> >>>>
> >>>> +  if (cur_progress)
> >>>> +          *cur_progress += 1;
> >>> Why not be a little more explicit, and do this addition if find_pmd_or_thp_or_none fails,
> >>> or pte_offset_map_lock fails? The way you do it right now is not readable - it gives no
> >>> idea as to why on function entry we do a +1 right away. Please do add some comments too.
> >> If this way is not clear enough, we can directly add 1 in
> >> find_pmd_or_thp_or_none() etc, BUT it's a bit redundant.
> >> Please take a look at which one is better.
> >>
> >> case 1:
> >> as the V4 PATCH #2 [1] and #3 [2], only hpage_collapse_scan_pmd().
> >> [1] https://lore.kernel.org/linux-mm/20260111121909.8410-3-yanglincheng@kylinos.cn
> >> [2] https://lore.kernel.org/linux-mm/20260111121909.8410-4-yanglincheng@kylinos.cn
> >>
> >> static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >>              struct vm_area_struct *vma, unsigned long start_addr,
> >>              bool *mmap_locked, unsigned int *cur_progress,
> >>              struct collapse_control *cc)
> >> {
> >>      ...
> >>      result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
> >>      if (result != SCAN_SUCCEED) {
> >>              if (cur_progress)
> >>                      *cur_progress += 1; // here
> >>              goto out;
> >>      }
> >>      ...
> >>      pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> >>      if (!pte) {
> >>              if (cur_progress)
> >>                      *cur_progress += 1; // here
> >>              result = SCAN_NO_PTE_TABLE;
> >>              goto out;
> >>      }
> >>
> >>      for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> >>           _pte++, addr += PAGE_SIZE) {
> >>              if (cur_progress)
> >>                      *cur_progress += 1; // here
> >>              ...
> >>      }
> >> }
> >>
> >> case 2:
> >>
> >> static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >>              struct vm_area_struct *vma, unsigned long start_addr,
> >>              bool *mmap_locked, unsigned int *cur_progress,
> >>              struct collapse_control *cc)
> >> {
> >>      ...
> >>      result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
> >>      if (result != SCAN_SUCCEED) {
> >>              if (cur_progress)
> >>                      *cur_progress += 1; // here
> >
> > Let us be more explicit and set this equal to 1, instead of adding 1.
> >
> >>              goto out;
> >>      }
> >>      ...
> >>      pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> >>      if (!pte) {
> >>              if (cur_progress)
> >>                      *cur_progress += 1; // here
> >
> > Same comment as above.
> >
> >>              result = SCAN_NO_PTE_TABLE;
> >>              goto out;
> >>      }
> >>
> >>      for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> >>           _pte++, addr += PAGE_SIZE) {
> >>              ...
> >>      }
> >>      ...
> >> out_unmap:
> >>      if (cur_progress) {
> >>              if (_pte >= pte + HPAGE_PMD_NR)
> >>                      *cur_progress += HPAGE_PMD_NR;   // here
> >>              else
> >>                      *cur_progress += _pte - pte + 1; // here
> >>      }
> >> }
> >
> > I will vote case 2. In case 1 I don't like the fact that the if (cur_progress)
> > branch will be checked each iteration - and I don't think the compiler can
> > optimize this since the body of the loop is complex, so this check cannot
> > be hoisted out of the loop.
> >
> >
> >>
> >> case 3:
> >>      current patch, and add more comments to clearer.
> >>
> >>>> +
> >>>>    result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
> >>>>    if (result != SCAN_SUCCEED)
> >>>>            goto out;
> >>>> @@ -1396,6 +1403,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >>>>            result = SCAN_SUCCEED;
> >>>>    }
> >>>>   out_unmap:
> >>>> +  if (cur_progress) {
> >>>> +          if (_pte >= pte + HPAGE_PMD_NR)
> >>>> +                  *cur_progress += HPAGE_PMD_NR - 1;
> >>>> +          else
> >>>> +                  *cur_progress += _pte - pte;
> >>>> +  }
> >>>>    pte_unmap_unlock(pte, ptl);
> >>>>    if (result == SCAN_SUCCEED) {
> >>>>            result = collapse_huge_page(mm, start_addr, referenced,
> >>>> @@ -2286,8 +2299,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> >>>>    return result;
> >>>>   }
> >>>>
> >>>> -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> >>>> -          struct file *file, pgoff_t start, struct collapse_control *cc)
> >>>> +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
> >>>> +          unsigned long addr, struct file *file, pgoff_t start,
> >>>> +          unsigned int *cur_progress, struct collapse_control *cc)
> >>>>   {
> >>>>    struct folio *folio = NULL;
> >>>>    struct address_space *mapping = file->f_mapping;
> >>>> @@ -2376,6 +2390,18 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
> >>>>                    cond_resched_rcu();
> >>>>            }
> >>>>    }
> >>>> +  if (cur_progress) {
> >>>> +          unsigned long idx = xas_get_index(&xas) - start;
> >>>> +
> >>>> +          if (folio == NULL)
> >>>> +                  *cur_progress += HPAGE_PMD_NR;
> >>> I think this whole block needs some comments. Can you explain, why you
> >>> do a particular increment in each case?
> >>>
> >>>> +          else if (xa_is_value(folio))
> >>>> +                  *cur_progress += idx + (1 << xas_get_order(&xas));
> >>>> +          else if (folio_order(folio) == HPAGE_PMD_ORDER)
> >>>> +                  *cur_progress += idx + 1;
> >>>> +          else
> >>>> +                  *cur_progress += idx + folio_nr_pages(folio);
> >>>> +  }
> >> The "idx" represent PTEs number already scanned when exiting
> >> xas_for_each().
> >>
> >> However, the last valid folio size was not counted in "idx" (except
> >> folio == NULL, "idx" equal to HPAGE_PMD_NR), which can be further
> >> divided into three cases:
> >
> > But the PTEs you account for in these three cases are *not*
> > scanned, right? So we can simply drop these 3 cases.
> >
> >>
> >> 1. shmem swap entries (xa_is_value), add folio size.
> >> 2. the folio is HPAGE_PMD_ORDER, the memory has been collapsed
> >>     to PMD, so add 1 only.
> >> 3. Normal folio, add folio size.
> >>
> >> Is the version below more readable?
> >>
> >>      if (cur_progress) {
> >>              *cur_progress += xas.xa_index - start;
> >>
> >>              if (folio) {
> >>                      if (xa_is_value(folio))
> >>                              *cur_progress += 1 << xas_get_order(&xas);
> >>                      else if (folio_order(folio) == HPAGE_PMD_ORDER)
> >>                              *cur_progress += 1;
> >>                      else
> >>                              *cur_progress += folio_nr_pages(folio);
> >>              }
> >>      }
> >
> > Yep, this is unneeded complexity. This looks really ugly and the benefits of
> > this are not clear. You can simply do
> >
> > if (cur_progress)
> >       *cur_progress = xas.xa_index - start;
> >
>
> I agree with Dev here. The extra complexity in hpage_collapse_scan_file()
> doesn't seem worth it.
>
> Suggest:
>
> if (cur_progress)
>         *cur_progress = max(xas.xa_index - start, 1UL);
>
> Just keeps it simple, and handles the idx=0 case you mentioned as well.

Thank you for your suggestion, but two scenarios are still missing. For a
detailed explanation, please refer to the link below.

https://lore.kernel.org/linux-mm/20260123082232.16413-1-vernon2gm@gmail.com/T/#mc3969cb0d52e28ffea2fb96260f0880a5f005848


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH mm-new v5 2/5] mm: khugepaged: refine scan progress number
  2026-01-29 12:24             ` Vernon Yang
@ 2026-01-29 12:46               ` Dev Jain
  0 siblings, 0 replies; 31+ messages in thread
From: Dev Jain @ 2026-01-29 12:46 UTC (permalink / raw)
  To: Vernon Yang, lance.yang
  Cc: akpm, david, lorenzo.stoakes, ziy, baohua, linux-mm,
	linux-kernel, Vernon Yang


On 29/01/26 5:54 pm, Vernon Yang wrote:
> On Thu, Jan 29, 2026 at 4:32 PM Dev Jain <dev.jain@arm.com> wrote:
>> On 29/01/26 1:29 pm, Vernon Yang wrote:
>>> On Thu, Jan 29, 2026 at 11:05:36AM +0530, Dev Jain wrote:
>>>> On 28/01/26 8:04 pm, Vernon Yang wrote:
>>>>> On Wed, Jan 28, 2026 at 01:59:33PM +0530, Dev Jain wrote:
>>>>>> On 23/01/26 1:52 pm, Vernon Yang wrote:
>>>>>>> From: Vernon Yang <yanglincheng@kylinos.cn>
>>>>>>>
>>>>>>> Currently, each scan always increases "progress" by HPAGE_PMD_NR,
>>>>>>> even if only scanning a single PTE/PMD entry.
>>>>>>>
>>>>>>> - When only scanning a single PTE entry, let me provide a detailed
>>>>>>>   example:
>>>>>>>
>>>>>>> static int hpage_collapse_scan_pmd()
>>>>>>> {
>>>>>>>   for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>>>>>>>        _pte++, addr += PAGE_SIZE) {
>>>>>>>           pte_t pteval = ptep_get(_pte);
>>>>>>>           ...
>>>>>>>           if (pte_uffd_wp(pteval)) { <-- first scan hit
>>>>>>>                   result = SCAN_PTE_UFFD_WP;
>>>>>>>                   goto out_unmap;
>>>>>>>           }
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> During the first scan, if pte_uffd_wp(pteval) is true, the loop exits
>>>>>>> directly. In practice, only one PTE is scanned before termination.
>>>>>>> Here, "progress += 1" reflects the actual number of PTEs scanned, but
>>>>>>> previously "progress += HPAGE_PMD_NR" always.
>>>>>>>
>>>>>>> - When the memory has been collapsed to PMD, let me provide a detailed
>>>>>>>   example:
>>>>>>>
>>>>>>> The following data is traced by bpftrace on a desktop system. After
>>>>>>> the system has been left idle for 10 minutes upon booting, a lot of
>>>>>>> SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE are observed during a full scan
>>>>>>> by khugepaged.
>>>>>>>
>>>>>>> @scan_pmd_status[1]: 1           ## SCAN_SUCCEED
>>>>>>> @scan_pmd_status[6]: 2           ## SCAN_EXCEED_SHARED_PTE
>>>>>>> @scan_pmd_status[3]: 142         ## SCAN_PMD_MAPPED
>>>>>>> @scan_pmd_status[2]: 178         ## SCAN_NO_PTE_TABLE
>>>>>> Could you elaborate what is [1], [6] etc and 1,2,142, etc?
>>>>> These 1,6 are value of "enum scan_result", you can directly refer to the
>>>>> notes on the right.
>>>>>
>>>>> These 1,2,142,178 are number of different "enum scan_result" from
>>>>> trace_mm_khugepaged_scan_pmd and trace_mm_khugepaged_scan_file.
>>>>>
>>>>> as example, SCAN_PMD_MAPPED has 142 times during a full scan by
>>>>> khugepaged.
>>>> Thanks. Can you please mention this in the patch description? You can simply
>>>> write it like this:
>>>>
>>>> "From trace_mm_khugepaged_scan_pmd and trace_mm_khugepaged_scan_file, the
>>>> following statuses were observed, with frequency mentioned next to them:
>>>>
>>>> SCAN_SUCCEED: 1
>>>> SCAN_PMD_MAPPED: 142
>>>> ....."
>>>>
>>>> and so on.
>>> LGTM, I will do it in the next version. Thanks!
>>>
>>>>>>> total progress size: 674 MB
>>>>>>> Total time         : 419 seconds ## include khugepaged_scan_sleep_millisecs
>>>>>>>
>>>>>>> The khugepaged_scan list saves all tasks that support collapsing into
>>>>>>> hugepages; as long as a task is not destroyed, khugepaged will not
>>>>>>> remove it from the khugepaged_scan list. This leads to a phenomenon
>>>>>>> where a task has already collapsed all of its memory regions into
>>>>>>> hugepages, yet khugepaged continues to scan it, which wastes CPU time
>>>>>>> to no effect, and because of khugepaged_scan_sleep_millisecs (default
>>>>>>> 10s) scanning a large number of such invalid tasks makes the really
>>>>>>> valid tasks wait a long time to be scanned.
>>>>>>>
>>>>>>> After applying this patch, when the scan result is either
>>>>>>> SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE, the range is simply skipped, as
>>>>>>> follows:
>>>>>>>
>>>>>>> @scan_pmd_status[6]: 2
>>>>>>> @scan_pmd_status[3]: 147
>>>>>>> @scan_pmd_status[2]: 173
>>>>>>> total progress size: 45 MB
>>>>>>> Total time         : 20 seconds
>>>>>>>
>>>>>>> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
>>>>>>> ---
>>>>>>>  include/linux/xarray.h |  9 ++++++++
>>>>>>>  mm/khugepaged.c        | 47 ++++++++++++++++++++++++++++++++++--------
>>>>>>>  2 files changed, 47 insertions(+), 9 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/xarray.h b/include/linux/xarray.h
>>>>>>> index be850174e802..f77d97d7b957 100644
>>>>>>> --- a/include/linux/xarray.h
>>>>>>> +++ b/include/linux/xarray.h
>>>>>>> @@ -1646,6 +1646,15 @@ static inline void xas_set(struct xa_state *xas, unsigned long index)
>>>>>>>   xas->xa_node = XAS_RESTART;
>>>>>>>  }
>>>>>>>
>>>>>>> +/**
>>>>>>> + * xas_get_index() - Get XArray operation state for a different index.
>>>>>>> + * @xas: XArray operation state.
>>>>>>> + */
>>>>>>> +static inline unsigned long xas_get_index(struct xa_state *xas)
>>>>>>> +{
>>>>>>> + return xas->xa_index;
>>>>>>> +}
>>>>>>> +
>>>>>>>  /**
>>>>>>>   * xas_advance() - Skip over sibling entries.
>>>>>>>   * @xas: XArray operation state.
>>>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>>>>> index 6f0f05148765..de95029e3763 100644
>>>>>>> --- a/mm/khugepaged.c
>>>>>>> +++ b/mm/khugepaged.c
>>>>>>> @@ -68,7 +68,10 @@ enum scan_result {
>>>>>>>  static struct task_struct *khugepaged_thread __read_mostly;
>>>>>>>  static DEFINE_MUTEX(khugepaged_mutex);
>>>>>>>
>>>>>>> -/* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 second */
>>>>>>> +/*
>>>>>>> + * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
>>>>>>> + * every 10 second.
>>>>>>> + */
>>>>>>>  static unsigned int khugepaged_pages_to_scan __read_mostly;
>>>>>>>  static unsigned int khugepaged_pages_collapsed;
>>>>>>>  static unsigned int khugepaged_full_scans;
>>>>>>> @@ -1240,7 +1243,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>>>>>>>  }
>>>>>>>
>>>>>>>  static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>>>>> -         struct vm_area_struct *vma, unsigned long start_addr, bool *mmap_locked,
>>>>>>> +         struct vm_area_struct *vma, unsigned long start_addr,
>>>>>>> +         bool *mmap_locked, unsigned int *cur_progress,
>>>>>>>           struct collapse_control *cc)
>>>>>>>  {
>>>>>>>   pmd_t *pmd;
>>>>>>> @@ -1255,6 +1259,9 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>>>>>
>>>>>>>   VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
>>>>>>>
>>>>>>> + if (cur_progress)
>>>>>>> +         *cur_progress += 1;
>>>>>> Why not be a little more explicit, and do this addition if find_pmd_or_thp_or_none fails,
>>>>>> or pte_offset_map_lock fails? The way you do it right now is not readable - it gives no
>>>>>> idea as to why on function entry we do a +1 right away. Please do add some comments too.
>>>>> If this way is not clear enough, we can directly add 1 in
>>>>> find_pmd_or_thp_or_none() etc, BUT it's a bit redundant.
>>>>> Please take a look at which one is better.
>>>>>
>>>>> case 1:
>>>>> as the V4 PATCH #2 [1] and #3 [2], only hpage_collapse_scan_pmd().
>>>>> [1] https://lore.kernel.org/linux-mm/20260111121909.8410-3-yanglincheng@kylinos.cn
>>>>> [2] https://lore.kernel.org/linux-mm/20260111121909.8410-4-yanglincheng@kylinos.cn
>>>>>
>>>>> static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>>>             struct vm_area_struct *vma, unsigned long start_addr,
>>>>>             bool *mmap_locked, unsigned int *cur_progress,
>>>>>             struct collapse_control *cc)
>>>>> {
>>>>>     ...
>>>>>     result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
>>>>>     if (result != SCAN_SUCCEED) {
>>>>>             if (cur_progress)
>>>>>                     *cur_progress += 1; // here
>>>>>             goto out;
>>>>>     }
>>>>>     ...
>>>>>     pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
>>>>>     if (!pte) {
>>>>>             if (cur_progress)
>>>>>                     *cur_progress += 1; // here
>>>>>             result = SCAN_NO_PTE_TABLE;
>>>>>             goto out;
>>>>>     }
>>>>>
>>>>>     for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>>>>>          _pte++, addr += PAGE_SIZE) {
>>>>>             if (cur_progress)
>>>>>                     *cur_progress += 1; // here
>>>>>             ...
>>>>>     }
>>>>> }
>>>>>
>>>>> case 2:
>>>>>
>>>>> static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>>>             struct vm_area_struct *vma, unsigned long start_addr,
>>>>>             bool *mmap_locked, unsigned int *cur_progress,
>>>>>             struct collapse_control *cc)
>>>>> {
>>>>>     ...
>>>>>     result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
>>>>>     if (result != SCAN_SUCCEED) {
>>>>>             if (cur_progress)
>>>>>                     *cur_progress += 1; // here
>>>> Let us be more explicit and set this equal to 1, instead of adding 1.
>>> LGTM, I will do it in the next version. Thanks!
>>>
>>>>>             goto out;
>>>>>     }
>>>>>     ...
>>>>>     pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
>>>>>     if (!pte) {
>>>>>             if (cur_progress)
>>>>>                     *cur_progress += 1; // here
>>>> Same comment as above.
>>>>
>>>>>             result = SCAN_NO_PTE_TABLE;
>>>>>             goto out;
>>>>>     }
>>>>>
>>>>>     for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>>>>>          _pte++, addr += PAGE_SIZE) {
>>>>>             ...
>>>>>     }
>>>>>     ...
>>>>> out_unmap:
>>>>>     if (cur_progress) {
>>>>>             if (_pte >= pte + HPAGE_PMD_NR)
>>>>>                     *cur_progress += HPAGE_PMD_NR;   // here
>>>>>             else
>>>>>                     *cur_progress += _pte - pte + 1; // here
>>>>>     }
>>>>> }
>>>> I will vote case 2. In case 1 I don't like the fact that the if (cur_progress)
>>>> branch will be checked each iteration - and I don't think the compiler can
>>>> optimize this since the body of the loop is complex, so this check cannot
>>>> be hoisted out of the loop.
>>>>
>>>>
>>>>> case 3:
>>>>>     current patch, and add more comments to clearer.
>>>>>
>>>>>>> +
>>>>>>>   result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
>>>>>>>   if (result != SCAN_SUCCEED)
>>>>>>>           goto out;
>>>>>>> @@ -1396,6 +1403,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>>>>>           result = SCAN_SUCCEED;
>>>>>>>   }
>>>>>>>  out_unmap:
>>>>>>> + if (cur_progress) {
>>>>>>> +         if (_pte >= pte + HPAGE_PMD_NR)
>>>>>>> +                 *cur_progress += HPAGE_PMD_NR - 1;
>>>>>>> +         else
>>>>>>> +                 *cur_progress += _pte - pte;
>>>>>>> + }
>>>>>>>   pte_unmap_unlock(pte, ptl);
>>>>>>>   if (result == SCAN_SUCCEED) {
>>>>>>>           result = collapse_huge_page(mm, start_addr, referenced,
>>>>>>> @@ -2286,8 +2299,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>>>>>>>   return result;
>>>>>>>  }
>>>>>>>
>>>>>>> -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
>>>>>>> -         struct file *file, pgoff_t start, struct collapse_control *cc)
>>>>>>> +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
>>>>>>> +         unsigned long addr, struct file *file, pgoff_t start,
>>>>>>> +         unsigned int *cur_progress, struct collapse_control *cc)
>>>>>>>  {
>>>>>>>   struct folio *folio = NULL;
>>>>>>>   struct address_space *mapping = file->f_mapping;
>>>>>>> @@ -2376,6 +2390,18 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
>>>>>>>                   cond_resched_rcu();
>>>>>>>           }
>>>>>>>   }
>>>>>>> + if (cur_progress) {
>>>>>>> +         unsigned long idx = xas_get_index(&xas) - start;
>>>>>>> +
>>>>>>> +         if (folio == NULL)
>>>>>>> +                 *cur_progress += HPAGE_PMD_NR;
>>>>>> I think this whole block needs some comments. Can you explain, why you
>>>>>> do a particular increment in each case?
>>>>>>
>>>>>>> +         else if (xa_is_value(folio))
>>>>>>> +                 *cur_progress += idx + (1 << xas_get_order(&xas));
>>>>>>> +         else if (folio_order(folio) == HPAGE_PMD_ORDER)
>>>>>>> +                 *cur_progress += idx + 1;
>>>>>>> +         else
>>>>>>> +                 *cur_progress += idx + folio_nr_pages(folio);
>>>>>>> + }
>>>>> The "idx" represent PTEs number already scanned when exiting
>>>>> xas_for_each().
>>>>>
>>>>> However, the last valid folio size was not counted in "idx" (except
>>>>> folio == NULL, "idx" equal to HPAGE_PMD_NR), which can be further
>>>>> divided into three cases:
>>>> But the PTEs you account for in these three cases are *not*
>>>> scanned, right? So we can simply drop these 3 cases.
>>> No, these three cases cover the last folio being scanned when we break, so
>>> I think we need to add them. Imagine that the first folio we hit is of
>>> HPAGE_PMD_ORDER: then "idx" is equal to 0.
>> If you hit a large folio and break, you don't consume any extra iterations.
>> The number of iterations is completely determined by xa_index. This is kind
>> of analogous to SCAN_PMD_MAPPED - in one "iteration", you decide to stop
>> scanning, and set progress to 1.
>>
>> In any case, the problem which you describe in the patch description is
>> completely solved by setting progress to 1 upon failure of find_pmd_or_thp_or_none.
>> So anything else which you do (like counting iterations in hpage_collapse_scan_pmd
>> or hpage_collapse_scan_file) is extra and it is not clear if it has a practical
>> benefit. The patch benefits us because there are a lot of SCAN_PMD_MAPPED and SCAN_PMD_NONE.
>> Will we practically also encounter a large number of SCAN_EXCEED_PTE_SWAP, PTE_NONE, etc?
>>
>> I tilt towards keeping the other bits of the patch, if they can be simplified, and
>> because this patch is relatively harmless. Just like you count the number of iterations
>> in hpage_collapse_scan_pmd(), you can do the same here using xa_index.
> The semantics of _pte in hpage_collapse_scan_pmd() and of xas.xa_index in
> hpage_collapse_scan_file() are different.
>
> Let me give a detailed example :)
>
> xas->xa_index represents the starting index of the last folio when exiting
> xas_for_each().
>
> Assuming each folio size is 32KB, when we iterate "folio1 -> folio2" and break,
> xas->xa_index equals the starting index of folio2, so "idx = 8". However,
> folio2 is not counted in "idx".
>
> In reality, folio2 has been scanned, so we need to add the folio2 size, making
> "idx = 16" the correct value.

The "scanning" of folio2 is O(1). It is *not* proportional to the number of ptes
covered by folio2 (which is 8), as we break immediately.

As I said above, you will be contradicting the patch's objective here - if you
see a PMD_ORDER folio and set progress = HPAGE_PMD_NR, that exactly corresponds
to seeing a PMD mapping in the anon collapse path, and setting progress = HPAGE_PMD_NR.
But that is not appropriate - deducing that the PMD maps a PMD folio is an O(1)
operation w.r.t. the number of ptes.
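
To make the analogy concrete, a rough sketch (not the posted patch; it mirrors
the "case 2" anon-path variant quoted earlier, with the "= 1" change discussed
above) of how an already-collapsed PMD is charged as a single unit of work:

        result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
        if (result != SCAN_SUCCEED) {           /* e.g. SCAN_PMD_MAPPED */
                if (cur_progress)
                        *cur_progress = 1;      /* one O(1) check, not HPAGE_PMD_NR */
                goto out;
        }

Seeing a PMD-order folio in the file path would be the same kind of O(1)
decision.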

>
> There are two possible cases for folio2:
> 1. shmem swap entries (xa_is_value), breaking out of the xas_for_each loop
>    due to SCAN_EXCEED_SWAP_PTE.
> 2. Normal folio, breaking out of the xas_for_each loop due to
>    SCAN_SCAN_ABORT/SCAN_PAGE_LRU/SCAN_PAGE_COUNT.
>
> Moving Lance's suggestion here:
>> if (cur_progress)
>>       *cur_progress = max(xas.xa_index - start, 1UL);
> If we use the code suggested by Lance, we will miss the folio2 size, and the
> result is "idx = 8".
> Is this the result we want? If yes, Lance's suggested code is perfect.
>
> Another more specific scenario is when the first folio iterated is of
> HPAGE_PMD_ORDER, so "idx = 0". Yep, we can directly set "cur_progress = 1";
> it's simple enough.

Yes, let us do that.

>
>>>>> 1. shmem swap entries (xa_is_value), add folio size.
>>>>> 2. the folio is HPAGE_PMD_ORDER, the memory has been collapsed
>>>>>    to PMD, so add 1 only.
>>>>> 3. Normal folio, add folio size.
>>>>>
>>>>> Is the version below more readable?
>>>>>
>>>>>     if (cur_progress) {
>>>>>             *cur_progress += xas.xa_index - start;
>>>>>
>>>>>             if (folio) {
>>>>>                     if (xa_is_value(folio))
>>>>>                             *cur_progress += 1 << xas_get_order(&xas);
>>>>>                     else if (folio_order(folio) == HPAGE_PMD_ORDER)
>>>>>                             *cur_progress += 1;
>>>>>                     else
>>>>>                             *cur_progress += folio_nr_pages(folio);
>>>>>             }
>>>>>     }
>>>> Yep, this is unneeded complexity. This looks really ugly and the benefits of
>>>> this are not clear. You can simply do
>>>>
>>>> if (cur_progress)
>>>>      *cur_progress = xas.xa_index - start;
>>>>
>>>>>>>   rcu_read_unlock();
>>>>>>>
>>>>>>>   if (result == SCAN_SUCCEED) {
>>>>>>> @@ -2456,6 +2482,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>>>>>>>
>>>>>>>           while (khugepaged_scan.address < hend) {
>>>>>>>                   bool mmap_locked = true;
>>>>>>> +                 unsigned int cur_progress = 0;
>>>>>>>
>>>>>>>                   cond_resched();
>>>>>>>                   if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
>>>>>>> @@ -2472,7 +2499,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>>>>>>>                           mmap_read_unlock(mm);
>>>>>>>                           mmap_locked = false;
>>>>>>>                           *result = hpage_collapse_scan_file(mm,
>>>>>>> -                                 khugepaged_scan.address, file, pgoff, cc);
>>>>>>> +                                 khugepaged_scan.address, file, pgoff,
>>>>>>> +                                 &cur_progress, cc);
>>>>>>>                           fput(file);
>>>>>>>                           if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
>>>>>>>                                   mmap_read_lock(mm);
>>>>>>> @@ -2486,7 +2514,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>>>>>>>                           }
>>>>>>>                   } else {
>>>>>>>                           *result = hpage_collapse_scan_pmd(mm, vma,
>>>>>>> -                                 khugepaged_scan.address, &mmap_locked, cc);
>>>>>>> +                                 khugepaged_scan.address, &mmap_locked,
>>>>>>> +                                 &cur_progress, cc);
>>>>>>>                   }
>>>>>>>
>>>>>>>                   if (*result == SCAN_SUCCEED)
>>>>>>> @@ -2494,7 +2523,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
>>>>>>>
>>>>>>>                   /* move to next address */
>>>>>>>                   khugepaged_scan.address += HPAGE_PMD_SIZE;
>>>>>>> -                 progress += HPAGE_PMD_NR;
>>>>>>> +                 progress += cur_progress;
>>>>>>>                   if (!mmap_locked)
>>>>>>>                           /*
>>>>>>>                            * We released mmap_lock so break loop.  Note
>>>>>>> @@ -2817,7 +2846,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>>>>>>                   mmap_locked = false;
>>>>>>>                   *lock_dropped = true;
>>>>>>>                   result = hpage_collapse_scan_file(mm, addr, file, pgoff,
>>>>>>> -                                                   cc);
>>>>>>> +                                                   NULL, cc);
>>>>>>>
>>>>>>>                   if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
>>>>>>>                       mapping_can_writeback(file->f_mapping)) {
>>>>>>> @@ -2832,7 +2861,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>>>>>>                   fput(file);
>>>>>>>           } else {
>>>>>>>                   result = hpage_collapse_scan_pmd(mm, vma, addr,
>>>>>>> -                                                  &mmap_locked, cc);
>>>>>>> +                                                  &mmap_locked, NULL, cc);
>>>>>>>           }
>>>>>>>           if (!mmap_locked)
>>>>>>>                   *lock_dropped = true;
>>>>> --
>>>>> Thanks,
>>>>> Vernon


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2026-01-29 12:46 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-23  8:22 [PATCH mm-new v5 0/5] Improve khugepaged scan logic Vernon Yang
2026-01-23  8:22 ` [PATCH mm-new v5 1/5] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
2026-01-23 10:25   ` Dev Jain
2026-01-23  8:22 ` [PATCH mm-new v5 2/5] mm: khugepaged: refine scan progress number Vernon Yang
2026-01-23 10:46   ` Dev Jain
2026-01-23 15:25     ` Vernon Yang
2026-01-23 15:19   ` Matthew Wilcox
2026-01-23 15:29     ` Vernon Yang
2026-01-28  8:29   ` Dev Jain
2026-01-28 14:34     ` Vernon Yang
2026-01-29  5:35       ` Dev Jain
2026-01-29  7:59         ` Vernon Yang
2026-01-29  8:32           ` Dev Jain
2026-01-29 12:24             ` Vernon Yang
2026-01-29 12:46               ` Dev Jain
2026-01-29  9:18         ` Lance Yang
2026-01-29 12:28           ` Vernon Yang
2026-01-23  8:22 ` [PATCH mm-new v5 3/5] mm: add folio_test_lazyfree helper Vernon Yang
2026-01-23 10:54   ` Dev Jain
2026-01-26  1:52   ` Barry Song
2026-01-23  8:22 ` [PATCH mm-new v5 4/5] mm: khugepaged: skip lazy-free folios Vernon Yang
2026-01-23  9:09   ` Lance Yang
2026-01-23 15:08     ` Vernon Yang
2026-01-23 16:32       ` Lance Yang
2026-01-24  3:22         ` Vernon Yang
2026-01-24  6:48           ` Dev Jain
2026-01-26  2:06             ` Barry Song
2026-01-23  8:22 ` [PATCH mm-new v5 5/5] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY Vernon Yang
2026-01-23 12:40   ` Dev Jain
2026-01-23 15:32     ` Vernon Yang
2026-01-26  2:18     ` Barry Song

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox