* [PATCH v3 0/6] Improve khugepaged scan logic
@ 2026-01-04  5:41 Vernon Yang
  2026-01-04  5:41 ` [PATCH v3 1/6] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: Vernon Yang @ 2026-01-04  5:41 UTC (permalink / raw)
  To: akpm, david
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang,
	richard.weiyang, linux-mm, linux-kernel, Vernon Yang

Hi all,

This series improves the khugepaged scan logic to reduce CPU consumption
and prioritize scanning tasks that access memory frequently.

The following data was traced with bpftrace[1] on a desktop system. After
the system had been left idle for 10 minutes following boot, a lot of
SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE results were observed during a full
scan by khugepaged.

@scan_pmd_status[1]: 1           ## SCAN_SUCCEED
@scan_pmd_status[6]: 2           ## SCAN_EXCEED_SHARED_PTE
@scan_pmd_status[3]: 142         ## SCAN_PMD_MAPPED
@scan_pmd_status[2]: 178         ## SCAN_NO_PTE_TABLE
total progress size: 674 MB
Total time         : 419 seconds ## includes khugepaged_scan_sleep_millisecs

khugepaged currently behaves as follows: the khugepaged list is scanned in
a FIFO manner, and as long as a task is not destroyed,
1. a task that no longer has any memory that can be collapsed into a
   hugepage is still scanned again and again.
2. a task at the front of the khugepaged scan list is scanned first even
   when it is cold.
3. every scan pass is separated by khugepaged_scan_sleep_millisecs
   (default 10s). If the above two cases are always scanned first, the
   useful scans have to wait for a long time.

For the first case, when the scan result is either SCAN_PMD_MAPPED or
SCAN_NO_PTE_TABLE, just skip it.

For the second case, if the user has explicitly informed us via MADV_FREE
that these folios will be freed, just skip them as well.
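
In code terms, both changes boil down to the following sketch, simplified
from patches 3 and 5 below (the surrounding scan loop, locking and error
handling are elided; the identifiers match the patches):

	/* hpage_collapse_scan_pmd(): an already-collapsed or PTE-less PMD
	 * now only accounts a single page of progress, so such regions
	 * barely consume the scan quota. */
	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
	if (result != SCAN_SUCCEED) {
		_progress = 1;
		goto out;
	}

	/* Inside the per-PTE loop: the user told us via MADV_FREE that
	 * this folio is disposable, so do not collapse it. */
	if (folio_is_lazyfree(folio)) {
		result = SCAN_PAGE_LAZYFREE;
		goto out_unmap;
	}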

Below are some performance test results.

kernbench results (testing on x86_64 machine):

                     baseline w/o patches   test w/ patches
Amean     user-32    18586.99 (   0.00%)    18562.36 *   0.13%*
Amean     syst-32     1133.61 (   0.00%)     1126.02 *   0.67%*
Amean     elsp-32      668.05 (   0.00%)      667.13 *   0.14%*
BAmean-95 user-32    18585.23 (   0.00%)    18559.71 (   0.14%)
BAmean-95 syst-32     1133.22 (   0.00%)     1125.49 (   0.68%)
BAmean-95 elsp-32      667.94 (   0.00%)      667.08 (   0.13%)
BAmean-99 user-32    18585.23 (   0.00%)    18559.71 (   0.14%)
BAmean-99 syst-32     1133.22 (   0.00%)     1125.49 (   0.68%)
BAmean-99 elsp-32      667.94 (   0.00%)      667.08 (   0.13%)

Create three tasks[2]: hot1 -> cold -> hot2. After all three tasks are
created, each allocates 128 MB of memory. The hot1/hot2 tasks continuously
access their 128 MB of memory, while the cold task only accesses its
memory briefly and then calls madvise(MADV_FREE). Here are the performance
test results (bigger is better for throughput, smaller is better for
everything else):

Testing on x86_64 machine:

| task hot2           | without patch | with patch    |  delta  |
|---------------------|---------------|---------------|---------|
| total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
| cycles per access   |  4.96         |  2.21         | -55.44% |
| Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
| dTLB-load-misses    |  284814532    |  69597236     | -75.56% |

Testing on qemu-system-x86_64 -enable-kvm:

| task hot2           | without patch | with patch    |  delta  |
|---------------------|---------------|---------------|---------|
| total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
| cycles per access   |  7.29         |  2.07         | -71.60% |
| Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
| dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
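
For reference, the cold task above does roughly the following (the actual
test program is app.c[2]; this is only a minimal standalone sketch, and
the real allocation and access pattern may differ):

	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = 128UL << 20;	/* 128 MB */
		char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return 1;
		memset(buf, 1, len);		/* touch the memory briefly */
		madvise(buf, len, MADV_FREE);	/* mark it lazy-free */
		pause();			/* stay alive, stay cold */
		return 0;
	}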

This series is based on Linux v6.19-rc3.

Thank you very much for your comments and discussions :)


V2 -> V3:
- Refine the scan progress number; add the folio_is_lazyfree helper.
- Fix warnings at SCAN_PTE_MAPPED_HUGEPAGE.
- For MADV_FREE, skip the lazy-free folios instead.
- Drop the MADV_COLD handling.
- Use hpage_collapse_test_exit_or_disable() instead of vma = NULL.
- Pick up Reviewed-by tags.

V1 -> V2:
- Rename full to full_scan_finished; pick up Acked-by tags.
- Just skip SCAN_PMD_MAPPED/SCAN_NO_PTE_TABLE memory instead of removing
  the mm.
- Set the VM_NOHUGEPAGE flag on MADV_COLD/MADV_FREE to just skip it,
  instead of moving the mm.
- Re-run the performance tests on v6.19-rc2.

V1 : https://lore.kernel.org/linux-mm/20251215090419.174418-1-yanglincheng@kylinos.cn
V2 : https://lore.kernel.org/linux-mm/20251229055151.54887-1-yanglincheng@kylinos.cn

[1] https://github.com/vernon2gh/app_and_module/blob/main/khugepaged/khugepaged_mm.bt
[2] https://github.com/vernon2gh/app_and_module/blob/main/khugepaged/app.c

Vernon Yang (6):
  mm: khugepaged: add trace_mm_khugepaged_scan event
  mm: khugepaged: refine scan progress number
  mm: khugepaged: just skip when the memory has been collapsed
  mm: add folio_is_lazyfree helper
  mm: khugepaged: skip lazy-free folios at scanning
  mm: khugepaged: set to next mm direct when mm has
    MMF_DISABLE_THP_COMPLETELY

 include/linux/mm_inline.h          |  5 ++++
 include/trace/events/huge_memory.h | 25 ++++++++++++++++
 mm/khugepaged.c                    | 47 +++++++++++++++++++++++-------
 mm/rmap.c                          |  4 +--
 mm/vmscan.c                        |  5 ++--
 5 files changed, 71 insertions(+), 15 deletions(-)

--
2.51.0




* [PATCH v3 1/6] mm: khugepaged: add trace_mm_khugepaged_scan event
  2026-01-04  5:41 [PATCH v3 0/6] Improve khugepaged scan logic Vernon Yang
@ 2026-01-04  5:41 ` Vernon Yang
  2026-01-04  5:41 ` [PATCH v3 2/6] mm: khugepaged: refine scan progress number Vernon Yang
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Vernon Yang @ 2026-01-04  5:41 UTC (permalink / raw)
  To: akpm, david
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang,
	richard.weiyang, linux-mm, linux-kernel, Vernon Yang

Add an mm_khugepaged_scan event to track the total time of a full
khugepaged scan and the total number of pages scanned.

Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Barry Song <baohua@kernel.org>
---
 include/trace/events/huge_memory.h | 24 ++++++++++++++++++++++++
 mm/khugepaged.c                    |  2 ++
 2 files changed, 26 insertions(+)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 4cde53b45a85..01225dd27ad5 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -236,5 +236,29 @@ TRACE_EVENT(mm_khugepaged_collapse_file,
 		__print_symbolic(__entry->result, SCAN_STATUS))
 );
 
+TRACE_EVENT(mm_khugepaged_scan,
+
+	TP_PROTO(struct mm_struct *mm, int progress, bool full_scan_finished),
+
+	TP_ARGS(mm, progress, full_scan_finished),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(int, progress)
+		__field(bool, full_scan_finished)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->progress = progress;
+		__entry->full_scan_finished = full_scan_finished;
+	),
+
+	TP_printk("mm=%p, progress=%d, full_scan_finished=%d",
+		__entry->mm,
+		__entry->progress,
+		__entry->full_scan_finished)
+);
+
 #endif /* __HUGE_MEMORY_H */
 #include <trace/define_trace.h>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 97d1b2824386..9f99f61689f8 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2533,6 +2533,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 		collect_mm_slot(slot);
 	}
 
+	trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
+
 	return progress;
 }
 
-- 
2.51.0




* [PATCH v3 2/6] mm: khugepaged: refine scan progress number
  2026-01-04  5:41 [PATCH v3 0/6] Improve khugepaged scan logic Vernon Yang
  2026-01-04  5:41 ` [PATCH v3 1/6] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
@ 2026-01-04  5:41 ` Vernon Yang
  2026-01-05 16:49   ` David Hildenbrand (Red Hat)
  2026-01-04  5:41 ` [PATCH v3 3/6] mm: khugepaged: just skip when the memory has been collapsed Vernon Yang
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Vernon Yang @ 2026-01-04  5:41 UTC (permalink / raw)
  To: akpm, david
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang,
	richard.weiyang, linux-mm, linux-kernel, Vernon Yang

Currently, each PMD scan always increases `progress` by HPAGE_PMD_NR,
even if only a single page was scanned. Count the actual number of pages
scanned instead, so that `progress` is tracked accurately: for example, a
PMD scan that bails out after visiting 3 PTEs now contributes 3 to
`progress` rather than HPAGE_PMD_NR (512 with 4K base pages).

Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
 mm/khugepaged.c | 31 +++++++++++++++++++++++--------
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 9f99f61689f8..4b124e854e2e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1247,7 +1247,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 				   struct vm_area_struct *vma,
 				   unsigned long start_addr, bool *mmap_locked,
-				   struct collapse_control *cc)
+				   int *progress, struct collapse_control *cc)
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
@@ -1258,23 +1258,28 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 	unsigned long addr;
 	spinlock_t *ptl;
 	int node = NUMA_NO_NODE, unmapped = 0;
+	int _progress = 0;
 
 	VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
 
 	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
-	if (result != SCAN_SUCCEED)
+	if (result != SCAN_SUCCEED) {
+		_progress = HPAGE_PMD_NR;
 		goto out;
+	}
 
 	memset(cc->node_load, 0, sizeof(cc->node_load));
 	nodes_clear(cc->alloc_nmask);
 	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
 	if (!pte) {
+		_progress = HPAGE_PMD_NR;
 		result = SCAN_NO_PTE_TABLE;
 		goto out;
 	}
 
 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
 	     _pte++, addr += PAGE_SIZE) {
+		_progress++;
 		pte_t pteval = ptep_get(_pte);
 		if (pte_none_or_zero(pteval)) {
 			++none_or_zero;
@@ -1410,6 +1415,9 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 		*mmap_locked = false;
 	}
 out:
+	if (progress)
+		*progress += _progress;
+
 	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
 				     none_or_zero, result, unmapped);
 	return result;
@@ -2287,7 +2295,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 
 static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 				    struct file *file, pgoff_t start,
-				    struct collapse_control *cc)
+				    int *progress, struct collapse_control *cc)
 {
 	struct folio *folio = NULL;
 	struct address_space *mapping = file->f_mapping;
@@ -2295,6 +2303,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 	int present, swap;
 	int node = NUMA_NO_NODE;
 	int result = SCAN_SUCCEED;
+	int _progress = 0;
 
 	present = 0;
 	swap = 0;
@@ -2327,6 +2336,8 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 			continue;
 		}
 
+		_progress += folio_nr_pages(folio);
+
 		if (folio_order(folio) == HPAGE_PMD_ORDER &&
 		    folio->index == start) {
 			/* Maybe PMD-mapped */
@@ -2388,6 +2399,9 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 		}
 	}
 
+	if (progress)
+		*progress += _progress;
+
 	trace_mm_khugepaged_scan_file(mm, folio, file, present, swap, result);
 	return result;
 }
@@ -2470,7 +2484,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 				mmap_read_unlock(mm);
 				mmap_locked = false;
 				*result = hpage_collapse_scan_file(mm,
-					khugepaged_scan.address, file, pgoff, cc);
+					khugepaged_scan.address, file, pgoff,
+					&progress, cc);
 				fput(file);
 				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
 					mmap_read_lock(mm);
@@ -2484,7 +2499,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 				}
 			} else {
 				*result = hpage_collapse_scan_pmd(mm, vma,
-					khugepaged_scan.address, &mmap_locked, cc);
+					khugepaged_scan.address, &mmap_locked,
+					&progress, cc);
 			}
 
 			if (*result == SCAN_SUCCEED)
@@ -2492,7 +2508,6 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
-			progress += HPAGE_PMD_NR;
 			if (!mmap_locked)
 				/*
 				 * We released mmap_lock so break loop.  Note
@@ -2810,11 +2825,11 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 			mmap_read_unlock(mm);
 			mmap_locked = false;
 			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
-							  cc);
+							  NULL, cc);
 			fput(file);
 		} else {
 			result = hpage_collapse_scan_pmd(mm, vma, addr,
-							 &mmap_locked, cc);
+							 &mmap_locked, NULL, cc);
 		}
 		if (!mmap_locked)
 			*lock_dropped = true;
-- 
2.51.0




* [PATCH v3 3/6] mm: khugepaged: just skip when the memory has been collapsed
  2026-01-04  5:41 [PATCH v3 0/6] Improve khugepaged scan logic Vernon Yang
  2026-01-04  5:41 ` [PATCH v3 1/6] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
  2026-01-04  5:41 ` [PATCH v3 2/6] mm: khugepaged: refine scan progress number Vernon Yang
@ 2026-01-04  5:41 ` Vernon Yang
  2026-01-04  5:41 ` [PATCH v3 4/6] mm: add folio_is_lazyfree helper Vernon Yang
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Vernon Yang @ 2026-01-04  5:41 UTC (permalink / raw)
  To: akpm, david
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang,
	richard.weiyang, linux-mm, linux-kernel, Vernon Yang

The following data was traced with bpftrace on a desktop system. After
the system had been left idle for 10 minutes following boot, a lot of
SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE results were observed during a full
scan by khugepaged.

@scan_pmd_status[1]: 1           ## SCAN_SUCCEED
@scan_pmd_status[6]: 2           ## SCAN_EXCEED_SHARED_PTE
@scan_pmd_status[3]: 142         ## SCAN_PMD_MAPPED
@scan_pmd_status[2]: 178         ## SCAN_NO_PTE_TABLE
total progress size: 674 MB
Total time         : 419 seconds ## includes khugepaged_scan_sleep_millisecs

The khugepaged_scan list holds all tasks that support collapsing into
hugepages, and as long as a task is not destroyed, khugepaged will not
remove it from the khugepaged_scan list. This leads to a situation where
a task has already collapsed all of its memory regions into hugepages,
yet khugepaged keeps scanning it, wasting CPU time for no benefit. With
khugepaged_scan_sleep_millisecs (default 10s) between passes, scanning a
large number of such invalid tasks makes the really valid tasks wait a
long time before they are scanned.

After applying this patch, when the scan result is either SCAN_PMD_MAPPED
or SCAN_NO_PTE_TABLE, the memory is simply skipped, as follows:

@scan_pmd_status[6]: 2
@scan_pmd_status[3]: 147
@scan_pmd_status[2]: 173
total progress size: 45 MB
Total time         : 20 seconds

Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
 mm/khugepaged.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4b124e854e2e..30786c706c4a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -66,7 +66,10 @@ enum scan_result {
 static struct task_struct *khugepaged_thread __read_mostly;
 static DEFINE_MUTEX(khugepaged_mutex);
 
-/* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 second */
+/*
+ * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
+ * every 10 seconds.
+ */
 static unsigned int khugepaged_pages_to_scan __read_mostly;
 static unsigned int khugepaged_pages_collapsed;
 static unsigned int khugepaged_full_scans;
@@ -1264,7 +1267,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 
 	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
 	if (result != SCAN_SUCCEED) {
-		_progress = HPAGE_PMD_NR;
+		_progress = 1;
 		goto out;
 	}
 
@@ -1272,7 +1275,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 	nodes_clear(cc->alloc_nmask);
 	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
 	if (!pte) {
-		_progress = HPAGE_PMD_NR;
+		_progress = 1;
 		result = SCAN_NO_PTE_TABLE;
 		goto out;
 	}
@@ -2336,8 +2339,6 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 			continue;
 		}
 
-		_progress += folio_nr_pages(folio);
-
 		if (folio_order(folio) == HPAGE_PMD_ORDER &&
 		    folio->index == start) {
 			/* Maybe PMD-mapped */
@@ -2349,9 +2350,12 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 			 * returning.
 			 */
 			folio_put(folio);
+			_progress++;
 			break;
 		}
 
+		_progress += folio_nr_pages(folio);
+
 		node = folio_nid(folio);
 		if (hpage_collapse_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
-- 
2.51.0




* [PATCH v3 4/6] mm: add folio_is_lazyfree helper
  2026-01-04  5:41 [PATCH v3 0/6] Improve khugepaged scan logic Vernon Yang
                   ` (2 preceding siblings ...)
  2026-01-04  5:41 ` [PATCH v3 3/6] mm: khugepaged: just skip when the memory has been collapsed Vernon Yang
@ 2026-01-04  5:41 ` Vernon Yang
  2026-01-04 11:42   ` Lance Yang
  2026-01-04  5:41 ` [PATCH v3 5/6] mm: khugepaged: skip lazy-free folios at scanning Vernon Yang
  2026-01-04  5:41 ` [PATCH v3 6/6] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY Vernon Yang
  5 siblings, 1 reply; 24+ messages in thread
From: Vernon Yang @ 2026-01-04  5:41 UTC (permalink / raw)
  To: akpm, david
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang,
	richard.weiyang, linux-mm, linux-kernel, Vernon Yang

Add a folio_is_lazyfree() helper to identify lazy-free folios (anonymous
folios whose swapbacked flag has been cleared, as MADV_FREE does) and
improve code readability.
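
For instance, in folio_unmap_pte_batch() (the mm/rmap.c hunk below) the
check becomes:

	/* before */
	if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
		return 1;
	/* after */
	if (!folio_is_lazyfree(folio))
		return 1;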

Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
 include/linux/mm_inline.h | 5 +++++
 mm/rmap.c                 | 4 ++--
 mm/vmscan.c               | 5 ++---
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index fa2d6ba811b5..65a4ae52d915 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -35,6 +35,11 @@ static inline int page_is_file_lru(struct page *page)
 	return folio_is_file_lru(page_folio(page));
 }
 
+static inline int folio_is_lazyfree(const struct folio *folio)
+{
+	return folio_test_anon(folio) && !folio_test_swapbacked(folio);
+}
+
 static __always_inline void __update_lru_size(struct lruvec *lruvec,
 				enum lru_list lru, enum zone_type zid,
 				long nr_pages)
diff --git a/mm/rmap.c b/mm/rmap.c
index f955f02d570e..7241a3fa8574 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1838,7 +1838,7 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
 	max_nr = (end_addr - addr) >> PAGE_SHIFT;
 
 	/* We only support lazyfree batching for now ... */
-	if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
+	if (!folio_is_lazyfree(folio))
 		return 1;
 	if (pte_unused(pte))
 		return 1;
@@ -1934,7 +1934,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		}
 
 		if (!pvmw.pte) {
-			if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
+			if (folio_is_lazyfree(folio)) {
 				if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
 					goto walk_done;
 				/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 670fe9fae5ba..f357f74b5a35 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -963,8 +963,7 @@ static void folio_check_dirty_writeback(struct folio *folio,
 	 * They could be mistakenly treated as file lru. So further anon
 	 * test is needed.
 	 */
-	if (!folio_is_file_lru(folio) ||
-	    (folio_test_anon(folio) && !folio_test_swapbacked(folio))) {
+	if (!folio_is_file_lru(folio) || folio_is_lazyfree(folio)) {
 		*dirty = false;
 		*writeback = false;
 		return;
@@ -1501,7 +1500,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			}
 		}
 
-		if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
+		if (folio_is_lazyfree(folio)) {
 			/* follow __remove_mapping for reference */
 			if (!folio_ref_freeze(folio, 1))
 				goto keep_locked;
-- 
2.51.0




* [PATCH v3 5/6] mm: khugepaged: skip lazy-free folios at scanning
  2026-01-04  5:41 [PATCH v3 0/6] Improve khugepaged scan logic Vernon Yang
                   ` (3 preceding siblings ...)
  2026-01-04  5:41 ` [PATCH v3 4/6] mm: add folio_is_lazyfree helper Vernon Yang
@ 2026-01-04  5:41 ` Vernon Yang
  2026-01-04 12:10   ` Lance Yang
  2026-01-04  5:41 ` [PATCH v3 6/6] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY Vernon Yang
  5 siblings, 1 reply; 24+ messages in thread
From: Vernon Yang @ 2026-01-04  5:41 UTC (permalink / raw)
  To: akpm, david
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang,
	richard.weiyang, linux-mm, linux-kernel, Vernon Yang

For example, create three tasks: hot1 -> cold -> hot2. After all three
tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
continuously access their 128 MB of memory, while the cold task only
accesses its memory briefly and then calls madvise(MADV_FREE). However,
khugepaged still prioritizes scanning the cold task and only scans the
hot2 task after completing the scan of the cold task.

So if the user has explicitly informed us via MADV_FREE that this memory
will be freed, it is appropriate for khugepaged to simply skip it,
avoiding unnecessary scan and collapse operations and reducing wasted
CPU time.

Here are the performance test results (bigger is better for throughput,
smaller is better for everything else):

Testing on x86_64 machine:

| task hot2           | without patch | with patch    |  delta  |
|---------------------|---------------|---------------|---------|
| total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
| cycles per access   |  4.96         |  2.21         | -55.44% |
| Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
| dTLB-load-misses    |  284814532    |  69597236     | -75.56% |

Testing on qemu-system-x86_64 -enable-kvm:

| task hot2           | without patch | with patch    |  delta  |
|---------------------|---------------|---------------|---------|
| total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
| cycles per access   |  7.29         |  2.07         | -71.60% |
| Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
| dTLB-load-misses    |  241600871    |  3216108      | -98.67% |

Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
 include/trace/events/huge_memory.h | 1 +
 mm/khugepaged.c                    | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 01225dd27ad5..e99d5f71f2a4 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -25,6 +25,7 @@
 	EM( SCAN_PAGE_LRU,		"page_not_in_lru")		\
 	EM( SCAN_PAGE_LOCK,		"page_locked")			\
 	EM( SCAN_PAGE_ANON,		"page_not_anon")		\
+	EM( SCAN_PAGE_LAZYFREE,		"page_lazyfree")		\
 	EM( SCAN_PAGE_COMPOUND,		"page_compound")		\
 	EM( SCAN_ANY_PROCESS,		"no_process_for_page")		\
 	EM( SCAN_VMA_NULL,		"vma_null")			\
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 30786c706c4a..1ca034a5f653 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -45,6 +45,7 @@ enum scan_result {
 	SCAN_PAGE_LRU,
 	SCAN_PAGE_LOCK,
 	SCAN_PAGE_ANON,
+	SCAN_PAGE_LAZYFREE,
 	SCAN_PAGE_COMPOUND,
 	SCAN_ANY_PROCESS,
 	SCAN_VMA_NULL,
@@ -1337,6 +1338,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 		}
 		folio = page_folio(page);
 
+		if (folio_is_lazyfree(folio)) {
+			result = SCAN_PAGE_LAZYFREE;
+			goto out_unmap;
+		}
+
 		if (!folio_test_anon(folio)) {
 			result = SCAN_PAGE_ANON;
 			goto out_unmap;
-- 
2.51.0




* [PATCH v3 6/6] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
  2026-01-04  5:41 [PATCH v3 0/6] Improve khugepaged scan logic Vernon Yang
                   ` (4 preceding siblings ...)
  2026-01-04  5:41 ` [PATCH v3 5/6] mm: khugepaged: skip lazy-free folios at scanning Vernon Yang
@ 2026-01-04  5:41 ` Vernon Yang
  2026-01-04 12:20   ` Lance Yang
  5 siblings, 1 reply; 24+ messages in thread
From: Vernon Yang @ 2026-01-04  5:41 UTC (permalink / raw)
  To: akpm, david
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang,
	richard.weiyang, linux-mm, linux-kernel, Vernon Yang

When an mm with the MMF_DISABLE_THP_COMPLETELY flag is detected during
scanning, directly set khugepaged_scan.mm_slot to the next mm_slot to
reduce redundant operations.
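
For context, hpage_collapse_test_exit_or_disable() folds the flag test
into the exit test; roughly (a sketch only, the exact flag accessor
depends on the tree):

	static bool hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
	{
		/* mm_flags_test() is an assumption here; other trees
		 * spell this as a test_bit() on mm->flags. */
		return hpage_collapse_test_exit(mm) ||
		       mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
	}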

Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
 mm/khugepaged.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1ca034a5f653..d4ed0f397335 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2541,7 +2541,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 	 * Release the current mm_slot if this mm is about to die, or
 	 * if we scanned all vmas of this mm.
 	 */
-	if (hpage_collapse_test_exit(mm) || !vma) {
+	if (hpage_collapse_test_exit_or_disable(mm) || !vma) {
 		/*
 		 * Make sure that if mm_users is reaching zero while
 		 * khugepaged runs here, khugepaged_exit will find
-- 
2.51.0




* Re: [PATCH v3 4/6] mm: add folio_is_lazyfree helper
  2026-01-04  5:41 ` [PATCH v3 4/6] mm: add folio_is_lazyfree helper Vernon Yang
@ 2026-01-04 11:42   ` Lance Yang
  2026-01-05  2:09     ` Vernon Yang
  0 siblings, 1 reply; 24+ messages in thread
From: Lance Yang @ 2026-01-04 11:42 UTC (permalink / raw)
  To: Vernon Yang, baolin.wang
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, richard.weiyang,
	linux-mm, linux-kernel, Vernon Yang, akpm, david



On 2026/1/4 13:41, Vernon Yang wrote:
> Add folio_is_lazyfree() function to identify lazy-free folios to improve
> code readability.
> 
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> ---
>   include/linux/mm_inline.h | 5 +++++
>   mm/rmap.c                 | 4 ++--
>   mm/vmscan.c               | 5 ++---
>   3 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index fa2d6ba811b5..65a4ae52d915 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -35,6 +35,11 @@ static inline int page_is_file_lru(struct page *page)
>   	return folio_is_file_lru(page_folio(page));
>   }
>   
> +static inline int folio_is_lazyfree(const struct folio *folio)
> +{
> +	return folio_test_anon(folio) && !folio_test_swapbacked(folio);
> +}
> +
>   static __always_inline void __update_lru_size(struct lruvec *lruvec,
>   				enum lru_list lru, enum zone_type zid,
>   				long nr_pages)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index f955f02d570e..7241a3fa8574 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1838,7 +1838,7 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>   	max_nr = (end_addr - addr) >> PAGE_SHIFT;
>   
>   	/* We only support lazyfree batching for now ... */
> -	if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
> +	if (!folio_is_lazyfree(folio))

Please rebase against mm-new. Commit [1] already supports batching file
folios in folio_unmap_pte_batch():

+	/* We only support lazyfree or file folios batching for now ... */
+	if (folio_test_anon(folio) && folio_test_swapbacked(folio))

[1] 
https://lore.kernel.org/all/142919ac14d3cf70cba370808d85debe089df7b4.1766631066.git.baolin.wang@linux.alibaba.com/

Thanks,
Lance

>   		return 1;
>   	if (pte_unused(pte))
>   		return 1;
> @@ -1934,7 +1934,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>   		}
>   
>   		if (!pvmw.pte) {
> -			if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
> +			if (folio_is_lazyfree(folio)) {
>   				if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
>   					goto walk_done;
>   				/*
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 670fe9fae5ba..f357f74b5a35 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -963,8 +963,7 @@ static void folio_check_dirty_writeback(struct folio *folio,
>   	 * They could be mistakenly treated as file lru. So further anon
>   	 * test is needed.
>   	 */
> -	if (!folio_is_file_lru(folio) ||
> -	    (folio_test_anon(folio) && !folio_test_swapbacked(folio))) {
> +	if (!folio_is_file_lru(folio) || folio_is_lazyfree(folio)) {
>   		*dirty = false;
>   		*writeback = false;
>   		return;
> @@ -1501,7 +1500,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>   			}
>   		}
>   
> -		if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
> +		if (folio_is_lazyfree(folio)) {
>   			/* follow __remove_mapping for reference */
>   			if (!folio_ref_freeze(folio, 1))
>   				goto keep_locked;




* Re: [PATCH v3 5/6] mm: khugepaged: skip lazy-free folios at scanning
  2026-01-04  5:41 ` [PATCH v3 5/6] mm: khugepaged: skip lazy-free folios at scanning Vernon Yang
@ 2026-01-04 12:10   ` Lance Yang
  2026-01-05  1:48     ` Vernon Yang
  0 siblings, 1 reply; 24+ messages in thread
From: Lance Yang @ 2026-01-04 12:10 UTC (permalink / raw)
  To: Vernon Yang
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, richard.weiyang,
	linux-mm, linux-kernel, Vernon Yang, akpm, david



On 2026/1/4 13:41, Vernon Yang wrote:
> For example, create three task: hot1 -> cold -> hot2. After all three
> task are created, each allocate memory 128MB. the hot1/hot2 task
> continuously access 128 MB memory, while the cold task only accesses
> its memory briefly andthen call madvise(MADV_FREE). However, khugepaged
> still prioritizes scanning the cold task and only scans the hot2 task
> after completing the scan of the cold task.
> 
> So if the user has explicitly informed us via MADV_FREE that this memory
> will be freed, it is appropriate for khugepaged to skip it only, thereby
> avoiding unnecessary scan and collapse operations to reducing CPU
> wastage.
> 
> Here are the performance test results:
> (Throughput bigger is better, other smaller is better)
> 
> Testing on x86_64 machine:
> 
> | task hot2           | without patch | with patch    |  delta  |
> |---------------------|---------------|---------------|---------|
> | total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
> | cycles per access   |  4.96         |  2.21         | -55.44% |
> | Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
> | dTLB-load-misses    |  284814532    |  69597236     | -75.56% |
> 
> Testing on qemu-system-x86_64 -enable-kvm:
> 
> | task hot2           | without patch | with patch    |  delta  |
> |---------------------|---------------|---------------|---------|
> | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
> | cycles per access   |  7.29         |  2.07         | -71.60% |
> | Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
> | dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
> 
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> ---
>   include/trace/events/huge_memory.h | 1 +
>   mm/khugepaged.c                    | 6 ++++++
>   2 files changed, 7 insertions(+)
> 
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 01225dd27ad5..e99d5f71f2a4 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -25,6 +25,7 @@
>   	EM( SCAN_PAGE_LRU,		"page_not_in_lru")		\
>   	EM( SCAN_PAGE_LOCK,		"page_locked")			\
>   	EM( SCAN_PAGE_ANON,		"page_not_anon")		\
> +	EM( SCAN_PAGE_LAZYFREE,		"page_lazyfree")		\
>   	EM( SCAN_PAGE_COMPOUND,		"page_compound")		\
>   	EM( SCAN_ANY_PROCESS,		"no_process_for_page")		\
>   	EM( SCAN_VMA_NULL,		"vma_null")			\
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 30786c706c4a..1ca034a5f653 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -45,6 +45,7 @@ enum scan_result {
>   	SCAN_PAGE_LRU,
>   	SCAN_PAGE_LOCK,
>   	SCAN_PAGE_ANON,
> +	SCAN_PAGE_LAZYFREE,
>   	SCAN_PAGE_COMPOUND,
>   	SCAN_ANY_PROCESS,
>   	SCAN_VMA_NULL,
> @@ -1337,6 +1338,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>   		}
>   		folio = page_folio(page);
>   
> +		if (folio_is_lazyfree(folio)) {
> +			result = SCAN_PAGE_LAZYFREE;
> +			goto out_unmap;
> +		}

That's a bit tricky ... I don't think we need to handle MADV_FREE pages
differently :)

MADV_FREE pages are likely cold memory, but what if there are just
a few MADV_FREE pages in a hot memory region? Skipping the entire
region would be unfortunate ...

Also, even if we skip these pages now, after they are reclaimed, they
become pte_none. Then khugepaged will try to collapse them anyway
(based on khugepaged_max_ptes_none). So skipping them just delays
things; it does not really change the final result ;)

Thanks,
Lance

> +
>   		if (!folio_test_anon(folio)) {
>   			result = SCAN_PAGE_ANON;
>   			goto out_unmap;




* Re: [PATCH v3 6/6] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
  2026-01-04  5:41 ` [PATCH v3 6/6] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY Vernon Yang
@ 2026-01-04 12:20   ` Lance Yang
  2026-01-05  0:31     ` Wei Yang
  2026-01-05  2:06     ` Vernon Yang
  0 siblings, 2 replies; 24+ messages in thread
From: Lance Yang @ 2026-01-04 12:20 UTC (permalink / raw)
  To: Vernon Yang
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, richard.weiyang,
	linux-mm, linux-kernel, david, Vernon Yang, akpm



On 2026/1/4 13:41, Vernon Yang wrote:
> When an mm with the MMF_DISABLE_THP_COMPLETELY flag is detected during
> scanning, directly set khugepaged_scan.mm_slot to the next mm_slot,
> reduce redundant operation.
> 
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> ---
>   mm/khugepaged.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 1ca034a5f653..d4ed0f397335 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2541,7 +2541,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>   	 * Release the current mm_slot if this mm is about to die, or
>   	 * if we scanned all vmas of this mm.
>   	 */
> -	if (hpage_collapse_test_exit(mm) || !vma) {
> +	if (hpage_collapse_test_exit_or_disable(mm) || !vma) {
>   		/*
>   		 * Make sure that if mm_users is reaching zero while
>   		 * khugepaged runs here, khugepaged_exit will find


Let's convert hpage_collapse_test_exit() in collect_mm_slot() as well,
otherwise the mm_slot would not be freed and will be scanned again ...

static void collect_mm_slot(struct mm_slot *slot)
{
	struct mm_struct *mm = slot->mm;

	lockdep_assert_held(&khugepaged_mm_lock);

	if (hpage_collapse_test_exit(mm)) { <-

		hash_del(&slot->hash);
		list_del(&slot->mm_node);

		mm_slot_free(mm_slot_cache, slot);
		mmdrop(mm);
	}
}



* Re: [PATCH v3 6/6] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
  2026-01-04 12:20   ` Lance Yang
@ 2026-01-05  0:31     ` Wei Yang
  2026-01-05  2:09       ` Lance Yang
  2026-01-05  2:06     ` Vernon Yang
  1 sibling, 1 reply; 24+ messages in thread
From: Wei Yang @ 2026-01-05  0:31 UTC (permalink / raw)
  To: Lance Yang
  Cc: Vernon Yang, lorenzo.stoakes, ziy, dev.jain, baohua,
	richard.weiyang, linux-mm, linux-kernel, david, Vernon Yang,
	akpm

On Sun, Jan 04, 2026 at 08:20:29PM +0800, Lance Yang wrote:
>
>
>On 2026/1/4 13:41, Vernon Yang wrote:
>> When an mm with the MMF_DISABLE_THP_COMPLETELY flag is detected during
>> scanning, directly set khugepaged_scan.mm_slot to the next mm_slot,
>> reduce redundant operation.
>> 
>> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
>> ---
>>   mm/khugepaged.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 1ca034a5f653..d4ed0f397335 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -2541,7 +2541,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>>   	 * Release the current mm_slot if this mm is about to die, or
>>   	 * if we scanned all vmas of this mm.
>>   	 */
>> -	if (hpage_collapse_test_exit(mm) || !vma) {
>> +	if (hpage_collapse_test_exit_or_disable(mm) || !vma) {
>>   		/*
>>   		 * Make sure that if mm_users is reaching zero while
>>   		 * khugepaged runs here, khugepaged_exit will find
>
>
>Let's convert hpage_collapse_test_exit() in collect_mm_slot() as well,
>otherwise the mm_slot would not be freed and will be scanned again ...
>
>static void collect_mm_slot(struct mm_slot *slot)
>{
>	struct mm_struct *mm = slot->mm;
>
>	lockdep_assert_held(&khugepaged_mm_lock);
>
>	if (hpage_collapse_test_exit(mm)) { <-
>

What if the user toggles the MMF_DISABLE_THP_COMPLETELY flag again?

>		hash_del(&slot->hash);
>		list_del(&slot->mm_node);
>
>		mm_slot_free(mm_slot_cache, slot);
>		mmdrop(mm);
>	}
>}

-- 
Wei Yang
Help you, Help me



* Re: [PATCH v3 5/6] mm: khugepaged: skip lazy-free folios at scanning
  2026-01-04 12:10   ` Lance Yang
@ 2026-01-05  1:48     ` Vernon Yang
  2026-01-05  2:51       ` Lance Yang
  0 siblings, 1 reply; 24+ messages in thread
From: Vernon Yang @ 2026-01-05  1:48 UTC (permalink / raw)
  To: Lance Yang
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, richard.weiyang,
	linux-mm, linux-kernel, Vernon Yang, akpm, david

On Sun, Jan 04, 2026 at 08:10:17PM +0800, Lance Yang wrote:
>
>
> On 2026/1/4 13:41, Vernon Yang wrote:
> > For example, create three task: hot1 -> cold -> hot2. After all three
> > task are created, each allocate memory 128MB. the hot1/hot2 task
> > continuously access 128 MB memory, while the cold task only accesses
> > its memory briefly andthen call madvise(MADV_FREE). However, khugepaged
> > still prioritizes scanning the cold task and only scans the hot2 task
> > after completing the scan of the cold task.
> >
> > So if the user has explicitly informed us via MADV_FREE that this memory
> > will be freed, it is appropriate for khugepaged to skip it only, thereby
> > avoiding unnecessary scan and collapse operations to reducing CPU
> > wastage.
> >
> > Here are the performance test results:
> > (Throughput bigger is better, other smaller is better)
> >
> > Testing on x86_64 machine:
> >
> > | task hot2           | without patch | with patch    |  delta  |
> > |---------------------|---------------|---------------|---------|
> > | total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
> > | cycles per access   |  4.96         |  2.21         | -55.44% |
> > | Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
> > | dTLB-load-misses    |  284814532    |  69597236     | -75.56% |
> >
> > Testing on qemu-system-x86_64 -enable-kvm:
> >
> > | task hot2           | without patch | with patch    |  delta  |
> > |---------------------|---------------|---------------|---------|
> > | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
> > | cycles per access   |  7.29         |  2.07         | -71.60% |
> > | Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
> > | dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
> >
> > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > ---
> >   include/trace/events/huge_memory.h | 1 +
> >   mm/khugepaged.c                    | 6 ++++++
> >   2 files changed, 7 insertions(+)
> >
> > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > index 01225dd27ad5..e99d5f71f2a4 100644
> > --- a/include/trace/events/huge_memory.h
> > +++ b/include/trace/events/huge_memory.h
> > @@ -25,6 +25,7 @@
> >   	EM( SCAN_PAGE_LRU,		"page_not_in_lru")		\
> >   	EM( SCAN_PAGE_LOCK,		"page_locked")			\
> >   	EM( SCAN_PAGE_ANON,		"page_not_anon")		\
> > +	EM( SCAN_PAGE_LAZYFREE,		"page_lazyfree")		\
> >   	EM( SCAN_PAGE_COMPOUND,		"page_compound")		\
> >   	EM( SCAN_ANY_PROCESS,		"no_process_for_page")		\
> >   	EM( SCAN_VMA_NULL,		"vma_null")			\
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 30786c706c4a..1ca034a5f653 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -45,6 +45,7 @@ enum scan_result {
> >   	SCAN_PAGE_LRU,
> >   	SCAN_PAGE_LOCK,
> >   	SCAN_PAGE_ANON,
> > +	SCAN_PAGE_LAZYFREE,
> >   	SCAN_PAGE_COMPOUND,
> >   	SCAN_ANY_PROCESS,
> >   	SCAN_VMA_NULL,
> > @@ -1337,6 +1338,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> >   		}
> >   		folio = page_folio(page);
> > +		if (folio_is_lazyfree(folio)) {
> > +			result = SCAN_PAGE_LAZYFREE;
> > +			goto out_unmap;
> > +		}
>
> That's a bit tricky ... I don't think we need to handle MADV_FREE pages
> differently :)
>
> MADV_FREE pages are likely cold memory, but what if there are just
> a few MADV_FREE pages in a hot memory region? Skipping the entire
> region would be unfortunate ...

If any of the lazyfree folios are hot, they will be set back to
non-lazyfree in the memory reclaim path, so they are not skipped in
khugepaged's next scan:

shrink_folio_list()
  try_to_unmap()
    folio_set_swapbacked()
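
Roughly, as a simplified sketch of the relevant try_to_unmap_one() logic
in mm/rmap.c (not the exact code):

	/* A lazy-free folio that was redirtied cannot be discarded:
	 * it is flipped back to normal swap-backed anon memory. */
	if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
		if (pte_dirty(pteval) || folio_test_dirty(folio)) {
			folio_set_swapbacked(folio);
			ret = false;	/* abort the unmap */
		}
	}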

If none of the lazyfree folios are hot, continuing the collapse would
waste CPU and require a long wait (khugepaged_scan_sleep_millisecs).
Additionally, collapsing them makes the resulting hugepage non-lazyfree,
preventing the rapid release of lazyfree folios in the memory reclaim
path.

So skipping lazy-free folios makes sense here for us.

If I missed something, please let me know, thanks!

> Also, even if we skip these pages now, after they are reclaimed, they
> become pte_none. Then khugepaged will try to collapse them anyway
> (based on khugepaged_max_ptes_none). So skipping them just delays
> things, it does not really change the final result ;)

This patch just addresses the hot1 -> cold -> hot2 scenario.

--
Thanks,
Vernon



* Re: [PATCH v3 6/6] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
  2026-01-04 12:20   ` Lance Yang
  2026-01-05  0:31     ` Wei Yang
@ 2026-01-05  2:06     ` Vernon Yang
  2026-01-05  2:20       ` Lance Yang
  1 sibling, 1 reply; 24+ messages in thread
From: Vernon Yang @ 2026-01-05  2:06 UTC (permalink / raw)
  To: Lance Yang
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, richard.weiyang,
	linux-mm, linux-kernel, david, Vernon Yang, akpm

On Sun, Jan 4, 2026 at 8:20 PM Lance Yang <lance.yang@linux.dev> wrote:
>
> On 2026/1/4 13:41, Vernon Yang wrote:
> > When an mm with the MMF_DISABLE_THP_COMPLETELY flag is detected during
> > scanning, directly set khugepaged_scan.mm_slot to the next mm_slot,
> > reduce redundant operation.
> >
> > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > ---
> >   mm/khugepaged.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 1ca034a5f653..d4ed0f397335 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2541,7 +2541,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >        * Release the current mm_slot if this mm is about to die, or
> >        * if we scanned all vmas of this mm.
> >        */
> > -     if (hpage_collapse_test_exit(mm) || !vma) {
> > +     if (hpage_collapse_test_exit_or_disable(mm) || !vma) {
> >               /*
> >                * Make sure that if mm_users is reaching zero while
> >                * khugepaged runs here, khugepaged_exit will find
>
>
> Let's convert hpage_collapse_test_exit() in collect_mm_slot() as well,
> otherwise the mm_slot would not be freed and will be scanned again ...
>
> static void collect_mm_slot(struct mm_slot *slot)
> {
>         struct mm_struct *mm = slot->mm;
>
>         lockdep_assert_held(&khugepaged_mm_lock);
>
>         if (hpage_collapse_test_exit(mm)) { <-
>
>                 hash_del(&slot->hash);
>                 list_del(&slot->mm_node);
>
>                 mm_slot_free(mm_slot_cache, slot);
>                 mmdrop(mm);
>         }
> }

This patch just reduces redundant operations; see [1] for a detailed
discussion.

You already committed 5dad604809c5 ("mm/khugepaged: keep mm in mm_slot
without MMF_DISABLE_THP check"), so I assume there is some problem here,
e.g. the mm cannot be re-added? a data race? etc. Can you explain the
root cause? Thanks!

[1] https://lore.kernel.org/linux-mm/CACZaFFOvDad09MUopairAoAjZG6X5gffMaQbnfy0sCHGz8xSfg@mail.gmail.com

--
Thanks,
Vernon



* Re: [PATCH v3 6/6] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
  2026-01-05  0:31     ` Wei Yang
@ 2026-01-05  2:09       ` Lance Yang
  0 siblings, 0 replies; 24+ messages in thread
From: Lance Yang @ 2026-01-05  2:09 UTC (permalink / raw)
  To: Wei Yang
  Cc: Vernon Yang, lorenzo.stoakes, ziy, dev.jain, baohua, linux-mm,
	linux-kernel, david, Vernon Yang, akpm



On 2026/1/5 08:31, Wei Yang wrote:
> On Sun, Jan 04, 2026 at 08:20:29PM +0800, Lance Yang wrote:
>>
>>
>> On 2026/1/4 13:41, Vernon Yang wrote:
>>> When an mm with the MMF_DISABLE_THP_COMPLETELY flag is detected during
>>> scanning, directly set khugepaged_scan.mm_slot to the next mm_slot,
>>> reduce redundant operation.
>>>
>>> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
>>> ---
>>>    mm/khugepaged.c | 2 +-
>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 1ca034a5f653..d4ed0f397335 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -2541,7 +2541,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>>>    	 * Release the current mm_slot if this mm is about to die, or
>>>    	 * if we scanned all vmas of this mm.
>>>    	 */
>>> -	if (hpage_collapse_test_exit(mm) || !vma) {
>>> +	if (hpage_collapse_test_exit_or_disable(mm) || !vma) {
>>>    		/*
>>>    		 * Make sure that if mm_users is reaching zero while
>>>    		 * khugepaged runs here, khugepaged_exit will find
>>
>>
>> Let's convert hpage_collapse_test_exit() in collect_mm_slot() as well,
>> otherwise the mm_slot would not be freed and will be scanned again ...
>>
>> static void collect_mm_slot(struct mm_slot *slot)
>> {
>> 	struct mm_struct *mm = slot->mm;
>>
>> 	lockdep_assert_held(&khugepaged_mm_lock);
>>
>> 	if (hpage_collapse_test_exit(mm)) { <-
>>
> 
> What if user toggle the MMF_DISABLE_THP_COMPLETELY flag again?

Maybe it's fine :)

If the user sets MMF_DISABLE_THP_COMPLETELY, they probably would not
clear it soon. Keeping the slot wastes memory.

If they do clear it later, page faults will trigger
do_huge_pmd_anonymous_page() -> khugepaged_enter_vma(), which
re-adds the mm.

Anyway, no strong opinion on that.

> 
>> 		hash_del(&slot->hash);
>> 		list_del(&slot->mm_node);
>>
>> 		mm_slot_free(mm_slot_cache, slot);
>> 		mmdrop(mm);
>> 	}
>> }
> 




* Re: [PATCH v3 4/6] mm: add folio_is_lazyfree helper
  2026-01-04 11:42   ` Lance Yang
@ 2026-01-05  2:09     ` Vernon Yang
  0 siblings, 0 replies; 24+ messages in thread
From: Vernon Yang @ 2026-01-05  2:09 UTC (permalink / raw)
  To: Lance Yang
  Cc: baolin.wang, lorenzo.stoakes, ziy, dev.jain, baohua,
	richard.weiyang, linux-mm, linux-kernel, Vernon Yang, akpm,
	david

On Sun, Jan 4, 2026 at 7:42 PM Lance Yang <lance.yang@linux.dev> wrote:
>
> On 2026/1/4 13:41, Vernon Yang wrote:
> > Add folio_is_lazyfree() function to identify lazy-free folios to improve
> > code readability.
> >
> > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > ---
> >   include/linux/mm_inline.h | 5 +++++
> >   mm/rmap.c                 | 4 ++--
> >   mm/vmscan.c               | 5 ++---
> >   3 files changed, 9 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> > index fa2d6ba811b5..65a4ae52d915 100644
> > --- a/include/linux/mm_inline.h
> > +++ b/include/linux/mm_inline.h
> > @@ -35,6 +35,11 @@ static inline int page_is_file_lru(struct page *page)
> >       return folio_is_file_lru(page_folio(page));
> >   }
> >
> > +static inline int folio_is_lazyfree(const struct folio *folio)
> > +{
> > +     return folio_test_anon(folio) && !folio_test_swapbacked(folio);
> > +}
> > +
> >   static __always_inline void __update_lru_size(struct lruvec *lruvec,
> >                               enum lru_list lru, enum zone_type zid,
> >                               long nr_pages)
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index f955f02d570e..7241a3fa8574 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1838,7 +1838,7 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
> >       max_nr = (end_addr - addr) >> PAGE_SHIFT;
> >
> >       /* We only support lazyfree batching for now ... */
> > -     if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
> > +     if (!folio_is_lazyfree(folio))
>
> Please rebase against mm-new. Commit[1] already supports file folios
> batching in folio_unmap_pte_batch()

Ok, thanks! I will rebase in the next version.

> +       /* We only support lazyfree or file folios batching for now ... */
> +       if (folio_test_anon(folio) && folio_test_swapbacked(folio))
>
> [1]
> https://lore.kernel.org/all/142919ac14d3cf70cba370808d85debe089df7b4.1766631066.git.baolin.wang@linux.alibaba.com/
>
> Thanks,
> Lance
>
> >               return 1;
> >       if (pte_unused(pte))
> >               return 1;
> > @@ -1934,7 +1934,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >               }
> >
> >               if (!pvmw.pte) {
> > -                     if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
> > +                     if (folio_is_lazyfree(folio)) {
> >                               if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
> >                                       goto walk_done;
> >                               /*
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 670fe9fae5ba..f357f74b5a35 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -963,8 +963,7 @@ static void folio_check_dirty_writeback(struct folio *folio,
> >        * They could be mistakenly treated as file lru. So further anon
> >        * test is needed.
> >        */
> > -     if (!folio_is_file_lru(folio) ||
> > -         (folio_test_anon(folio) && !folio_test_swapbacked(folio))) {
> > +     if (!folio_is_file_lru(folio) || folio_is_lazyfree(folio)) {
> >               *dirty = false;
> >               *writeback = false;
> >               return;
> > @@ -1501,7 +1500,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >                       }
> >               }
> >
> > -             if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
> > +             if (folio_is_lazyfree(folio)) {
> >                       /* follow __remove_mapping for reference */
> >                       if (!folio_ref_freeze(folio, 1))
> >                               goto keep_locked;
>



* Re: [PATCH v3 6/6] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
  2026-01-05  2:06     ` Vernon Yang
@ 2026-01-05  2:20       ` Lance Yang
  0 siblings, 0 replies; 24+ messages in thread
From: Lance Yang @ 2026-01-05  2:20 UTC (permalink / raw)
  To: Vernon Yang
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, richard.weiyang,
	linux-mm, linux-kernel, david, Vernon Yang, akpm



On 2026/1/5 10:06, Vernon Yang wrote:
> On Sun, Jan 4, 2026 at 8:20 PM Lance Yang <lance.yang@linux.dev> wrote:
>>
>> On 2026/1/4 13:41, Vernon Yang wrote:
>>> When an mm with the MMF_DISABLE_THP_COMPLETELY flag is detected during
>>> scanning, directly set khugepaged_scan.mm_slot to the next mm_slot,
>>> reduce redundant operation.
>>>
>>> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
>>> ---
>>>    mm/khugepaged.c | 2 +-
>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 1ca034a5f653..d4ed0f397335 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -2541,7 +2541,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>>>         * Release the current mm_slot if this mm is about to die, or
>>>         * if we scanned all vmas of this mm.
>>>         */
>>> -     if (hpage_collapse_test_exit(mm) || !vma) {
>>> +     if (hpage_collapse_test_exit_or_disable(mm) || !vma) {
>>>                /*
>>>                 * Make sure that if mm_users is reaching zero while
>>>                 * khugepaged runs here, khugepaged_exit will find
>>
>>
>> Let's convert hpage_collapse_test_exit() in collect_mm_slot() as well,
>> otherwise the mm_slot would not be freed and will be scanned again ...
>>
>> static void collect_mm_slot(struct mm_slot *slot)
>> {
>>          struct mm_struct *mm = slot->mm;
>>
>>          lockdep_assert_held(&khugepaged_mm_lock);
>>
>>          if (hpage_collapse_test_exit(mm)) { <-
>>
>>                  hash_del(&slot->hash);
>>                  list_del(&slot->mm_node);
>>
>>                  mm_slot_free(mm_slot_cache, slot);
>>                  mmdrop(mm);
>>          }
>> }
> 
> This patch just reduces redundant operation, For a detailed
> discussion[1].
> 
> You already commit 5dad604809c5 ("mm/khugepaged: keep mm in mm_slot
> without MMF_DISABLE_THP check"), I assume there is some problem here,
> e.g. not can readd? data-race? etc. Can you explain the root cause? Thanks!

Ah, I didn't fully recall that ...

Maybe I kept the slot because it's hard for khugepaged to re-add the mm 
later.

But looking at the code again, I'm not sure if that was the right call :(

> 
> [1] https://lore.kernel.org/linux-mm/CACZaFFOvDad09MUopairAoAjZG6X5gffMaQbnfy0sCHGz8xSfg@mail.gmail.com
> 
> --
> Thanks,
> Vernon




* Re: [PATCH v3 5/6] mm: khugepaged: skip lazy-free folios at scanning
  2026-01-05  1:48     ` Vernon Yang
@ 2026-01-05  2:51       ` Lance Yang
  2026-01-05  3:12         ` Vernon Yang
  0 siblings, 1 reply; 24+ messages in thread
From: Lance Yang @ 2026-01-05  2:51 UTC (permalink / raw)
  To: Vernon Yang
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, richard.weiyang,
	linux-mm, linux-kernel, Vernon Yang, akpm, david



On 2026/1/5 09:48, Vernon Yang wrote:
> On Sun, Jan 04, 2026 at 08:10:17PM +0800, Lance Yang wrote:
>>
>>
>> On 2026/1/4 13:41, Vernon Yang wrote:
>>> For example, create three task: hot1 -> cold -> hot2. After all three
>>> task are created, each allocate memory 128MB. the hot1/hot2 task
>>> continuously access 128 MB memory, while the cold task only accesses
>>> its memory briefly andthen call madvise(MADV_FREE). However, khugepaged
>>> still prioritizes scanning the cold task and only scans the hot2 task
>>> after completing the scan of the cold task.
>>>
>>> So if the user has explicitly informed us via MADV_FREE that this memory
>>> will be freed, it is appropriate for khugepaged to skip it only, thereby
>>> avoiding unnecessary scan and collapse operations to reducing CPU
>>> wastage.
>>>
>>> Here are the performance test results:
>>> (Throughput bigger is better, other smaller is better)
>>>
>>> Testing on x86_64 machine:
>>>
>>> | task hot2           | without patch | with patch    |  delta  |
>>> |---------------------|---------------|---------------|---------|
>>> | total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
>>> | cycles per access   |  4.96         |  2.21         | -55.44% |
>>> | Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
>>> | dTLB-load-misses    |  284814532    |  69597236     | -75.56% |
>>>
>>> Testing on qemu-system-x86_64 -enable-kvm:
>>>
>>> | task hot2           | without patch | with patch    |  delta  |
>>> |---------------------|---------------|---------------|---------|
>>> | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
>>> | cycles per access   |  7.29         |  2.07         | -71.60% |
>>> | Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
>>> | dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
>>>
>>> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
>>> ---
>>>    include/trace/events/huge_memory.h | 1 +
>>>    mm/khugepaged.c                    | 6 ++++++
>>>    2 files changed, 7 insertions(+)
>>>
>>> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
>>> index 01225dd27ad5..e99d5f71f2a4 100644
>>> --- a/include/trace/events/huge_memory.h
>>> +++ b/include/trace/events/huge_memory.h
>>> @@ -25,6 +25,7 @@
>>>    	EM( SCAN_PAGE_LRU,		"page_not_in_lru")		\
>>>    	EM( SCAN_PAGE_LOCK,		"page_locked")			\
>>>    	EM( SCAN_PAGE_ANON,		"page_not_anon")		\
>>> +	EM( SCAN_PAGE_LAZYFREE,		"page_lazyfree")		\
>>>    	EM( SCAN_PAGE_COMPOUND,		"page_compound")		\
>>>    	EM( SCAN_ANY_PROCESS,		"no_process_for_page")		\
>>>    	EM( SCAN_VMA_NULL,		"vma_null")			\
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 30786c706c4a..1ca034a5f653 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -45,6 +45,7 @@ enum scan_result {
>>>    	SCAN_PAGE_LRU,
>>>    	SCAN_PAGE_LOCK,
>>>    	SCAN_PAGE_ANON,
>>> +	SCAN_PAGE_LAZYFREE,
>>>    	SCAN_PAGE_COMPOUND,
>>>    	SCAN_ANY_PROCESS,
>>>    	SCAN_VMA_NULL,
>>> @@ -1337,6 +1338,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>    		}
>>>    		folio = page_folio(page);
>>> +		if (folio_is_lazyfree(folio)) {
>>> +			result = SCAN_PAGE_LAZYFREE;
>>> +			goto out_unmap;
>>> +		}
>>
>> That's a bit tricky ... I don't think we need to handle MADV_FREE pages
>> differently :)
>>
>> MADV_FREE pages are likely cold memory, but what if there are just
>> a few MADV_FREE pages in a hot memory region? Skipping the entire
>> region would be unfortunate ...
> 
> If a lazyfree folio is hot, it will be marked non-lazyfree in the
> memory reclaim path, so it is not skipped on khugepaged's next scan:
> 
> shrink_folio_list()
>    try_to_unmap()
>      folio_set_swapbacked()
> 
> If none of the lazyfree folios are hot, continuing with the collapse
> would waste CPU and require a long wait (khugepaged_scan_sleep_millisecs).
> Additionally, the collapsed hugepage becomes non-lazyfree, preventing
> the rapid release of the lazyfree folios in the memory reclaim path.
> 
> So skipping lazy-free folios makes sense here for us.
> 
> If I missed something, please let me know, thanks!

I'm not saying lazyfree pages become hot :)

If a PMD region has mostly hot pages but just a few lazyfree
pages, we would skip the entire region. Those hot pages won't
be collapsed.

> 
>> Also, even if we skip these pages now, after they are reclaimed, they
>> become pte_none. Then khugepaged will try to collapse them anyway
>> (based on khugepaged_max_ptes_none). So skipping them just delays
>> things, it does not really change the final result ;)
> 
> This patch just resolves the hot1 -> cold -> hot2 scenario.
> 
> --
> Thanks,
> Vernon



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 5/6] mm: khugepaged: skip lazy-free folios at scanning
  2026-01-05  2:51       ` Lance Yang
@ 2026-01-05  3:12         ` Vernon Yang
  2026-01-05  3:35           ` Lance Yang
  0 siblings, 1 reply; 24+ messages in thread
From: Vernon Yang @ 2026-01-05  3:12 UTC (permalink / raw)
  To: Lance Yang
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, richard.weiyang,
	linux-mm, linux-kernel, Vernon Yang, akpm, david

On Mon, Jan 5, 2026 at 10:51 AM Lance Yang <lance.yang@linux.dev> wrote:
>
> On 2026/1/5 09:48, Vernon Yang wrote:
> > On Sun, Jan 04, 2026 at 08:10:17PM +0800, Lance Yang wrote:
> >>
> >>
> >> On 2026/1/4 13:41, Vernon Yang wrote:
> >>> For example, create three tasks: hot1 -> cold -> hot2. After all three
> >>> tasks are created, each allocates 128MB of memory. The hot1/hot2 tasks
> >>> continuously access their 128 MB of memory, while the cold task only
> >>> accesses its memory briefly and then calls madvise(MADV_FREE). However,
> >>> khugepaged still prioritizes scanning the cold task and only scans the
> >>> hot2 task after completing the scan of the cold task.
> >>>
> >>> So if the user has explicitly informed us via MADV_FREE that this
> >>> memory will be freed, it is appropriate for khugepaged to simply skip
> >>> it, thereby avoiding unnecessary scan and collapse operations and
> >>> reducing CPU wastage.
> >>>
> >>> Here are the performance test results:
> >>> (Throughput: higher is better; all other metrics: lower is better)
> >>>
> >>> Testing on x86_64 machine:
> >>>
> >>> | task hot2           | without patch | with patch    |  delta  |
> >>> |---------------------|---------------|---------------|---------|
> >>> | total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
> >>> | cycles per access   |  4.96         |  2.21         | -55.44% |
> >>> | Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
> >>> | dTLB-load-misses    |  284814532    |  69597236     | -75.56% |
> >>>
> >>> Testing on qemu-system-x86_64 -enable-kvm:
> >>>
> >>> | task hot2           | without patch | with patch    |  delta  |
> >>> |---------------------|---------------|---------------|---------|
> >>> | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
> >>> | cycles per access   |  7.29         |  2.07         | -71.60% |
> >>> | Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
> >>> | dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
> >>>
> >>> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> >>> ---
> >>>    include/trace/events/huge_memory.h | 1 +
> >>>    mm/khugepaged.c                    | 6 ++++++
> >>>    2 files changed, 7 insertions(+)
> >>>
> >>> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> >>> index 01225dd27ad5..e99d5f71f2a4 100644
> >>> --- a/include/trace/events/huge_memory.h
> >>> +++ b/include/trace/events/huge_memory.h
> >>> @@ -25,6 +25,7 @@
> >>>     EM( SCAN_PAGE_LRU,              "page_not_in_lru")              \
> >>>     EM( SCAN_PAGE_LOCK,             "page_locked")                  \
> >>>     EM( SCAN_PAGE_ANON,             "page_not_anon")                \
> >>> +   EM( SCAN_PAGE_LAZYFREE,         "page_lazyfree")                \
> >>>     EM( SCAN_PAGE_COMPOUND,         "page_compound")                \
> >>>     EM( SCAN_ANY_PROCESS,           "no_process_for_page")          \
> >>>     EM( SCAN_VMA_NULL,              "vma_null")                     \
> >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >>> index 30786c706c4a..1ca034a5f653 100644
> >>> --- a/mm/khugepaged.c
> >>> +++ b/mm/khugepaged.c
> >>> @@ -45,6 +45,7 @@ enum scan_result {
> >>>     SCAN_PAGE_LRU,
> >>>     SCAN_PAGE_LOCK,
> >>>     SCAN_PAGE_ANON,
> >>> +   SCAN_PAGE_LAZYFREE,
> >>>     SCAN_PAGE_COMPOUND,
> >>>     SCAN_ANY_PROCESS,
> >>>     SCAN_VMA_NULL,
> >>> @@ -1337,6 +1338,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> >>>             }
> >>>             folio = page_folio(page);
> >>> +           if (folio_is_lazyfree(folio)) {
> >>> +                   result = SCAN_PAGE_LAZYFREE;
> >>> +                   goto out_unmap;
> >>> +           }
> >>
> >> That's a bit tricky ... I don't think we need to handle MADV_FREE pages
> >> differently :)
> >>
> >> MADV_FREE pages are likely cold memory, but what if there are just
> >> a few MADV_FREE pages in a hot memory region? Skipping the entire
> >> region would be unfortunate ...
> >
> > If a lazyfree folio is hot, it will be marked non-lazyfree in the
> > memory reclaim path, so it is not skipped on khugepaged's next scan:
> >
> > shrink_folio_list()
> >    try_to_unmap()
> >      folio_set_swapbacked()
> >
> > If none of the lazyfree folios are hot, continuing with the collapse
> > would waste CPU and require a long wait (khugepaged_scan_sleep_millisecs).
> > Additionally, the collapsed hugepage becomes non-lazyfree, preventing
> > the rapid release of the lazyfree folios in the memory reclaim path.
> >
> > So skipping lazy-free folios makes sense here for us.
> >
> > If I missed something, please let me know, thanks!
>
> I'm not saying lazyfree pages become hot :)
>
> If a PMD region has mostly hot pages but just a few lazyfree
> pages, we would skip the entire region. Those hot pages won't
> be collapsed.

Same as above: hot lazyfree folios will be marked non-lazyfree
in the memory reclaim path, so they are not skipped on the next scan,
and the PMD region will collapse :)
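
For reference, a simplified sketch of that re-marking in the reclaim
path (condensed from try_to_unmap_one(); not the literal mainline code):

	/* anon lazyfree folio found during reclaim */
	if (!folio_test_swapbacked(folio)) {
		if (!pte_dirty(pteval) && !folio_test_dirty(folio)) {
			/* still clean since MADV_FREE: discard it for real */
			dec_mm_counter(mm, MM_ANONPAGES);
		} else {
			/* redirtied: restore the PTE and cancel the
			 * lazyfree state, so khugepaged sees it again */
			set_pte_at(mm, address, pvmw.pte, pteval);
			folio_set_swapbacked(folio);
		}
	}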

> >
> >> Also, even if we skip these pages now, after they are reclaimed, they
> >> become pte_none. Then khugepaged will try to collapse them anyway
> >> (based on khugepaged_max_ptes_none). So skipping them just delays
> >> things, it does not really change the final result ;)
> >
> > This patch just resolves the hot1 -> cold -> hot2 scenario.
> >
> > --
> > Thanks,
> > Vernon
>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 5/6] mm: khugepaged: skip lazy-free folios at scanning
  2026-01-05  3:12         ` Vernon Yang
@ 2026-01-05  3:35           ` Lance Yang
  2026-01-05 12:30             ` Vernon Yang
  0 siblings, 1 reply; 24+ messages in thread
From: Lance Yang @ 2026-01-05  3:35 UTC (permalink / raw)
  To: Vernon Yang
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, richard.weiyang,
	linux-mm, linux-kernel, Vernon Yang, akpm, david



On 2026/1/5 11:12, Vernon Yang wrote:
> On Mon, Jan 5, 2026 at 10:51 AM Lance Yang <lance.yang@linux.dev> wrote:
>>
>> On 2026/1/5 09:48, Vernon Yang wrote:
>>> On Sun, Jan 04, 2026 at 08:10:17PM +0800, Lance Yang wrote:
>>>>
>>>>
>>>> On 2026/1/4 13:41, Vernon Yang wrote:
>>>>> For example, create three tasks: hot1 -> cold -> hot2. After all three
>>>>> tasks are created, each allocates 128MB of memory. The hot1/hot2 tasks
>>>>> continuously access their 128 MB of memory, while the cold task only
>>>>> accesses its memory briefly and then calls madvise(MADV_FREE). However,
>>>>> khugepaged still prioritizes scanning the cold task and only scans the
>>>>> hot2 task after completing the scan of the cold task.
>>>>>
>>>>> So if the user has explicitly informed us via MADV_FREE that this
>>>>> memory will be freed, it is appropriate for khugepaged to simply skip
>>>>> it, thereby avoiding unnecessary scan and collapse operations and
>>>>> reducing CPU wastage.
>>>>>
>>>>> Here are the performance test results:
>>>>> (Throughput: higher is better; all other metrics: lower is better)
>>>>>
>>>>> Testing on x86_64 machine:
>>>>>
>>>>> | task hot2           | without patch | with patch    |  delta  |
>>>>> |---------------------|---------------|---------------|---------|
>>>>> | total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
>>>>> | cycles per access   |  4.96         |  2.21         | -55.44% |
>>>>> | Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
>>>>> | dTLB-load-misses    |  284814532    |  69597236     | -75.56% |
>>>>>
>>>>> Testing on qemu-system-x86_64 -enable-kvm:
>>>>>
>>>>> | task hot2           | without patch | with patch    |  delta  |
>>>>> |---------------------|---------------|---------------|---------|
>>>>> | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
>>>>> | cycles per access   |  7.29         |  2.07         | -71.60% |
>>>>> | Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
>>>>> | dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
>>>>>
>>>>> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
>>>>> ---
>>>>>     include/trace/events/huge_memory.h | 1 +
>>>>>     mm/khugepaged.c                    | 6 ++++++
>>>>>     2 files changed, 7 insertions(+)
>>>>>
>>>>> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
>>>>> index 01225dd27ad5..e99d5f71f2a4 100644
>>>>> --- a/include/trace/events/huge_memory.h
>>>>> +++ b/include/trace/events/huge_memory.h
>>>>> @@ -25,6 +25,7 @@
>>>>>      EM( SCAN_PAGE_LRU,              "page_not_in_lru")              \
>>>>>      EM( SCAN_PAGE_LOCK,             "page_locked")                  \
>>>>>      EM( SCAN_PAGE_ANON,             "page_not_anon")                \
>>>>> +   EM( SCAN_PAGE_LAZYFREE,         "page_lazyfree")                \
>>>>>      EM( SCAN_PAGE_COMPOUND,         "page_compound")                \
>>>>>      EM( SCAN_ANY_PROCESS,           "no_process_for_page")          \
>>>>>      EM( SCAN_VMA_NULL,              "vma_null")                     \
>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>>> index 30786c706c4a..1ca034a5f653 100644
>>>>> --- a/mm/khugepaged.c
>>>>> +++ b/mm/khugepaged.c
>>>>> @@ -45,6 +45,7 @@ enum scan_result {
>>>>>      SCAN_PAGE_LRU,
>>>>>      SCAN_PAGE_LOCK,
>>>>>      SCAN_PAGE_ANON,
>>>>> +   SCAN_PAGE_LAZYFREE,
>>>>>      SCAN_PAGE_COMPOUND,
>>>>>      SCAN_ANY_PROCESS,
>>>>>      SCAN_VMA_NULL,
>>>>> @@ -1337,6 +1338,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>>>              }
>>>>>              folio = page_folio(page);
>>>>> +           if (folio_is_lazyfree(folio)) {
>>>>> +                   result = SCAN_PAGE_LAZYFREE;
>>>>> +                   goto out_unmap;
>>>>> +           }
>>>>
>>>> That's a bit tricky ... I don't think we need to handle MADV_FREE pages
>>>> differently :)
>>>>
>>>> MADV_FREE pages are likely cold memory, but what if there are just
>>>> a few MADV_FREE pages in a hot memory region? Skipping the entire
>>>> region would be unfortunate ...
>>>
>>> If a lazyfree folio is hot, it will be marked non-lazyfree in the
>>> memory reclaim path, so it is not skipped on khugepaged's next scan:
>>>
>>> shrink_folio_list()
>>>     try_to_unmap()
>>>       folio_set_swapbacked()
>>>
>>> If none of the lazyfree folios are hot, continuing with the collapse
>>> would waste CPU and require a long wait (khugepaged_scan_sleep_millisecs).
>>> Additionally, the collapsed hugepage becomes non-lazyfree, preventing
>>> the rapid release of the lazyfree folios in the memory reclaim path.
>>>
>>> So skipping lazy-free folios makes sense here for us.
>>>
>>> If I missed something, please let me know, thanks!
>>
>> I'm not saying lazyfree pages become hot :)
>>
>> If a PMD region has mostly hot pages but just a few lazyfree
>> pages, we would skip the entire region. Those hot pages won't
>> be collapsed.
> 
> Same as above: hot lazyfree folios will be marked non-lazyfree

Nop ...

> in the memory reclaim path, so they are not skipped on the next scan,
> and the PMD region will collapse :)

Let me be more specific:

Assume we have a PMD region (512 pages):
- Pages 0-499: hot pages (frequently accessed, NOT lazyfree)
- Pages 500-511: lazyfree pages (MADV_FREE'd and clean)

This patch skips the entire region when it hits page 500. So pages
0-499 can't be collapsed, even though they are hot.

I'm NOT saying lazyfree pages themselves become hot ;)

As I mentioned earlier, even if we skip these pages now, after they
are reclaimed they become pte_none. Then khugepaged will try to
collapse them anyway (based on khugepaged_max_ptes_none). So
skipping them just delays things, it does not really change the
final result ...
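
For reference, the pte_none accounting I mean looks roughly like this
in hpage_collapse_scan_pmd() (simplified; the default
khugepaged_max_ptes_none is HPAGE_PMD_NR - 1 = 511 with 4K pages):

	if (pte_none_or_zero(pteval)) {
		++none_or_zero;
		if (!userfaultfd_armed(vma) &&
		    (!cc->is_khugepaged ||
		     none_or_zero <= khugepaged_max_ptes_none))
			continue;	/* still collapsible */
		result = SCAN_EXCEED_NONE_PTE;
		goto out_unmap;
	}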

> 
>>>
>>>> Also, even if we skip these pages now, after they are reclaimed, they
>>>> become pte_none. Then khugepaged will try to collapse them anyway
>>>> (based on khugepaged_max_ptes_none). So skipping them just delays
>>>> things, it does not really change the final result ;)
>>>
>>> This patch just resolves the hot1 -> cold -> hot2 scenario.
>>>
>>> --
>>> Thanks,
>>> Vernon
>>



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 5/6] mm: khugepaged: skip lazy-free folios at scanning
  2026-01-05  3:35           ` Lance Yang
@ 2026-01-05 12:30             ` Vernon Yang
  2026-01-06 10:33               ` Barry Song
  0 siblings, 1 reply; 24+ messages in thread
From: Vernon Yang @ 2026-01-05 12:30 UTC (permalink / raw)
  To: Lance Yang
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, richard.weiyang,
	linux-mm, linux-kernel, Vernon Yang, akpm, david

On Mon, Jan 05, 2026 at 11:35:58AM +0800, Lance Yang wrote:
>
>
> On 2026/1/5 11:12, Vernon Yang wrote:
> > On Mon, Jan 5, 2026 at 10:51 AM Lance Yang <lance.yang@linux.dev> wrote:
> > >
> > > On 2026/1/5 09:48, Vernon Yang wrote:
> > > > On Sun, Jan 04, 2026 at 08:10:17PM +0800, Lance Yang wrote:
> > > > >
> > > > >
> > > > > On 2026/1/4 13:41, Vernon Yang wrote:
> > > > > > For example, create three tasks: hot1 -> cold -> hot2. After all three
> > > > > > tasks are created, each allocates 128MB of memory. The hot1/hot2 tasks
> > > > > > continuously access their 128 MB of memory, while the cold task only
> > > > > > accesses its memory briefly and then calls madvise(MADV_FREE). However,
> > > > > > khugepaged still prioritizes scanning the cold task and only scans the
> > > > > > hot2 task after completing the scan of the cold task.
> > > > > >
> > > > > > So if the user has explicitly informed us via MADV_FREE that this
> > > > > > memory will be freed, it is appropriate for khugepaged to simply skip
> > > > > > it, thereby avoiding unnecessary scan and collapse operations and
> > > > > > reducing CPU wastage.
> > > > > >
> > > > > > Here are the performance test results:
> > > > > > (Throughput: higher is better; all other metrics: lower is better)
> > > > > >
> > > > > > Testing on x86_64 machine:
> > > > > >
> > > > > > | task hot2           | without patch | with patch    |  delta  |
> > > > > > |---------------------|---------------|---------------|---------|
> > > > > > | total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
> > > > > > | cycles per access   |  4.96         |  2.21         | -55.44% |
> > > > > > | Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
> > > > > > | dTLB-load-misses    |  284814532    |  69597236     | -75.56% |
> > > > > >
> > > > > > Testing on qemu-system-x86_64 -enable-kvm:
> > > > > >
> > > > > > | task hot2           | without patch | with patch    |  delta  |
> > > > > > |---------------------|---------------|---------------|---------|
> > > > > > | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
> > > > > > | cycles per access   |  7.29         |  2.07         | -71.60% |
> > > > > > | Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
> > > > > > | dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
> > > > > >
> > > > > > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > > > > > ---
> > > > > >     include/trace/events/huge_memory.h | 1 +
> > > > > >     mm/khugepaged.c                    | 6 ++++++
> > > > > >     2 files changed, 7 insertions(+)
> > > > > >
> > > > > > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > > > > > index 01225dd27ad5..e99d5f71f2a4 100644
> > > > > > --- a/include/trace/events/huge_memory.h
> > > > > > +++ b/include/trace/events/huge_memory.h
> > > > > > @@ -25,6 +25,7 @@
> > > > > >      EM( SCAN_PAGE_LRU,              "page_not_in_lru")              \
> > > > > >      EM( SCAN_PAGE_LOCK,             "page_locked")                  \
> > > > > >      EM( SCAN_PAGE_ANON,             "page_not_anon")                \
> > > > > > +   EM( SCAN_PAGE_LAZYFREE,         "page_lazyfree")                \
> > > > > >      EM( SCAN_PAGE_COMPOUND,         "page_compound")                \
> > > > > >      EM( SCAN_ANY_PROCESS,           "no_process_for_page")          \
> > > > > >      EM( SCAN_VMA_NULL,              "vma_null")                     \
> > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > > index 30786c706c4a..1ca034a5f653 100644
> > > > > > --- a/mm/khugepaged.c
> > > > > > +++ b/mm/khugepaged.c
> > > > > > @@ -45,6 +45,7 @@ enum scan_result {
> > > > > >      SCAN_PAGE_LRU,
> > > > > >      SCAN_PAGE_LOCK,
> > > > > >      SCAN_PAGE_ANON,
> > > > > > +   SCAN_PAGE_LAZYFREE,
> > > > > >      SCAN_PAGE_COMPOUND,
> > > > > >      SCAN_ANY_PROCESS,
> > > > > >      SCAN_VMA_NULL,
> > > > > > @@ -1337,6 +1338,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> > > > > >              }
> > > > > >              folio = page_folio(page);
> > > > > > +           if (folio_is_lazyfree(folio)) {
> > > > > > +                   result = SCAN_PAGE_LAZYFREE;
> > > > > > +                   goto out_unmap;
> > > > > > +           }
> > > > >
> > > > > That's a bit tricky ... I don't think we need to handle MADV_FREE pages
> > > > > differently :)
> > > > >
> > > > > MADV_FREE pages are likely cold memory, but what if there are just
> > > > > a few MADV_FREE pages in a hot memory region? Skipping the entire
> > > > > region would be unfortunate ...
> > > >
> > > > If a lazyfree folio is hot, it will be marked non-lazyfree in the
> > > > memory reclaim path, so it is not skipped on khugepaged's next scan:
> > > >
> > > > shrink_folio_list()
> > > >     try_to_unmap()
> > > >       folio_set_swapbacked()
> > > >
> > > > If none of the lazyfree folios are hot, continuing with the collapse
> > > > would waste CPU and require a long wait (khugepaged_scan_sleep_millisecs).
> > > > Additionally, the collapsed hugepage becomes non-lazyfree, preventing
> > > > the rapid release of the lazyfree folios in the memory reclaim path.
> > > >
> > > > So skipping lazy-free folios makes sense here for us.
> > > >
> > > > If I missed something, please let me know, thanks!
> > >
> > > I'm not saying lazyfree pages become hot :)
> > >
> > > If a PMD region has mostly hot pages but just a few lazyfree
> > > pages, we would skip the entire region. Those hot pages won't
> > > be collapsed.
> >
> > Same as above: hot lazyfree folios will be marked non-lazyfree
>
> Nop ...
>
> > in the memory reclaim path, so they are not skipped on the next scan,
> > and the PMD region will collapse :)
>
> Let me be more specific:
>
> Assume we have a PMD region (512 pages):
> - Pages 0-499: hot pages (frequently accessed, NOT lazyfree)
> - Pages 500-511: lazyfree pages (MADV_FREE'd and clean)
>
> This patch skips the entire region when it hits page 500. So pages
> 0-499 can't be collapsed, even though they are hot.
>
> I'm NOT saying lazyfree pages themselves become hot ;)
>
> As I mentioned earlier, even if we skip these pages now, after they
> are reclaimed they become pte_none. Then khugepaged will try to
> collapse them anyway (based on khugepaged_max_ptes_none). So
> skipping them just delays things, it does not really change the
> final result ...

I got it, thank you for explaining.
I refined the code to resolve this issue, as follows:

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 30786c706c4a..afea2e12394e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -45,6 +45,7 @@ enum scan_result {
 	SCAN_PAGE_LRU,
 	SCAN_PAGE_LOCK,
 	SCAN_PAGE_ANON,
+	SCAN_PAGE_LAZYFREE,
 	SCAN_PAGE_COMPOUND,
 	SCAN_ANY_PROCESS,
 	SCAN_VMA_NULL,
@@ -1256,6 +1257,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 	pte_t *pte, *_pte;
 	int result = SCAN_FAIL, referenced = 0;
 	int none_or_zero = 0, shared = 0;
+	int lazyfree = 0;
 	struct page *page = NULL;
 	struct folio *folio = NULL;
 	unsigned long addr;
@@ -1337,6 +1339,21 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 		}
 		folio = page_folio(page);

+		if (cc->is_khugepaged && !pte_dirty(pteval) &&
+		    folio_is_lazyfree(folio)) {
+			++lazyfree;
+
+			/*
+			 * Lazyfree folios become pte_none once they are
+			 * reclaimed, so make sure the region does not get
+			 * collapsed anyway after we skip it here.
+			 */
+			if ((lazyfree + none_or_zero) > khugepaged_max_ptes_none) {
+				result = SCAN_PAGE_LAZYFREE;
+				goto out_unmap;
+			}
+		}
+
 		if (!folio_test_anon(folio)) {
 			result = SCAN_PAGE_ANON;
 			goto out_unmap;


If there are any bugs or you have a better idea, please let me know, thanks!
If not, I will send it in the next version.
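
A worked check against Lance's example above, assuming the default
khugepaged_max_ptes_none of HPAGE_PMD_NR - 1 = 511: with pages 0-499
hot and pages 500-511 clean lazyfree, lazyfree + none_or_zero is
12 + 0 = 12 <= 511, so that region still collapses; only a region whose
512 entries are all lazyfree or none would be skipped.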

--
Thanks,
Vernon


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 2/6] mm: khugepaged: refine scan progress number
  2026-01-04  5:41 ` [PATCH v3 2/6] mm: khugepaged: refine scan progress number Vernon Yang
@ 2026-01-05 16:49   ` David Hildenbrand (Red Hat)
  2026-01-06  5:55     ` Vernon Yang
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand (Red Hat) @ 2026-01-05 16:49 UTC (permalink / raw)
  To: Vernon Yang, akpm
  Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang,
	richard.weiyang, linux-mm, linux-kernel, Vernon Yang

On 1/4/26 06:41, Vernon Yang wrote:
> Currently, each PMD scan always increases `progress` by HPAGE_PMD_NR,
> even if only scanning a single page. By counting the actual number of

"... a single pmd" ?

> pages scanned, the `progress` is tracked accurately.

"page table entries / pages scanned" ?

> 
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> ---
>   mm/khugepaged.c | 31 +++++++++++++++++++++++--------
>   1 file changed, 23 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 9f99f61689f8..4b124e854e2e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1247,7 +1247,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>   				   struct vm_area_struct *vma,
>   				   unsigned long start_addr, bool *mmap_locked,
> -				   struct collapse_control *cc)
> +				   int *progress, struct collapse_control *cc)
>   {
>   	pmd_t *pmd;
>   	pte_t *pte, *_pte;
> @@ -1258,23 +1258,28 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>   	unsigned long addr;
>   	spinlock_t *ptl;
>   	int node = NUMA_NO_NODE, unmapped = 0;
> +	int _progress = 0;

"cur_progress" ?

>   
>   	VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
>   
>   	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
> -	if (result != SCAN_SUCCEED)
> +	if (result != SCAN_SUCCEED) {
> +		_progress = HPAGE_PMD_NR;
>   		goto out;
> +	}
>   
>   	memset(cc->node_load, 0, sizeof(cc->node_load));
>   	nodes_clear(cc->alloc_nmask);
>   	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
>   	if (!pte) {
> +		_progress = HPAGE_PMD_NR;
>   		result = SCAN_NO_PTE_TABLE;
>   		goto out;
>   	}
>   
>   	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>   	     _pte++, addr += PAGE_SIZE) {
> +		_progress++;
>   		pte_t pteval = ptep_get(_pte);
>   		if (pte_none_or_zero(pteval)) {
>   			++none_or_zero;
> @@ -1410,6 +1415,9 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>   		*mmap_locked = false;
>   	}
>   out:
> +	if (progress)
> +		*progress += _progress;
> +
>   	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
>   				     none_or_zero, result, unmapped);
>   	return result;
> @@ -2287,7 +2295,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>   
>   static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
>   				    struct file *file, pgoff_t start,
> -				    struct collapse_control *cc)
> +				    int *progress, struct collapse_control *cc)
>   {
>   	struct folio *folio = NULL;
>   	struct address_space *mapping = file->f_mapping;
> @@ -2295,6 +2303,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
>   	int present, swap;
>   	int node = NUMA_NO_NODE;
>   	int result = SCAN_SUCCEED;
> +	int _progress = 0;

Same here.


Not sure if it would be cleaner to just let the parent increment its
counter and instead return the "cur_progress" from the function.
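
Something like this at the call site, as a rough sketch of that
alternative (hypothetical signature, not the posted patch):

	/* scan helpers return the number of PTEs examined and report
	 * the scan status through a new out-parameter instead */
	progress += hpage_collapse_scan_pmd(mm, vma, khugepaged_scan.address,
					    &mmap_locked, cc, result);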

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 2/6] mm: khugepaged: refine scan progress number
  2026-01-05 16:49   ` David Hildenbrand (Red Hat)
@ 2026-01-06  5:55     ` Vernon Yang
  0 siblings, 0 replies; 24+ messages in thread
From: Vernon Yang @ 2026-01-06  5:55 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: akpm, lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang,
	richard.weiyang, linux-mm, linux-kernel, Vernon Yang

On Mon, Jan 05, 2026 at 05:49:22PM +0100, David Hildenbrand (Red Hat) wrote:
> On 1/4/26 06:41, Vernon Yang wrote:
> > Currently, each PMD scan always increases `progress` by HPAGE_PMD_NR,
> > even if only scanning a single page. By counting the actual number of
>
> "... a single pmd" ?
>
> > pages scanned, the `progress` is tracked accurately.
>
> "page table entries / pages scanned" ?

A single page here means one 4KB PTE only. This patch does not change
the original semantics of "progress"; it simply replaces HPAGE_PMD_NR
with the exact number of PTEs counted.

Let me provide a detailed example:

	static int hpage_collapse_scan_pmd()
	{
		for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
		     _pte++, addr += PAGE_SIZE) {
			_progress++;
			pte_t pteval = ptep_get(_pte);
			...
			if (pte_uffd_wp(pteval)) { <-- first scan hit
				result = SCAN_PTE_UFFD_WP;
				goto out_unmap;
			}
		}
	}

During the first scan, if pte_uffd_wp(pteval) is true, the loop exits
directly. In practice, only one PTE is scanned before termination.
Here, "progress += 1" reflects the actual number of PTEs scanned,
whereas previously it was always "progress += HPAGE_PMD_NR".

As previously discussed, skipping SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE
is handled in patch #3, not this patch #2.

> >
> > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > ---
> >   mm/khugepaged.c | 31 +++++++++++++++++++++++--------
> >   1 file changed, 23 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 9f99f61689f8..4b124e854e2e 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1247,7 +1247,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >   static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> >   				   struct vm_area_struct *vma,
> >   				   unsigned long start_addr, bool *mmap_locked,
> > -				   struct collapse_control *cc)
> > +				   int *progress, struct collapse_control *cc)
> >   {
> >   	pmd_t *pmd;
> >   	pte_t *pte, *_pte;
> > @@ -1258,23 +1258,28 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> >   	unsigned long addr;
> >   	spinlock_t *ptl;
> >   	int node = NUMA_NO_NODE, unmapped = 0;
> > +	int _progress = 0;
>
> "cur_progress" ?

Yes.

> >   	VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
> >   	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
> > -	if (result != SCAN_SUCCEED)
> > +	if (result != SCAN_SUCCEED) {
> > +		_progress = HPAGE_PMD_NR;
> >   		goto out;
> > +	}
> >   	memset(cc->node_load, 0, sizeof(cc->node_load));
> >   	nodes_clear(cc->alloc_nmask);
> >   	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> >   	if (!pte) {
> > +		_progress = HPAGE_PMD_NR;
> >   		result = SCAN_NO_PTE_TABLE;
> >   		goto out;
> >   	}
> >   	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> >   	     _pte++, addr += PAGE_SIZE) {
> > +		_progress++;
> >   		pte_t pteval = ptep_get(_pte);
> >   		if (pte_none_or_zero(pteval)) {
> >   			++none_or_zero;
> > @@ -1410,6 +1415,9 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> >   		*mmap_locked = false;
> >   	}
> >   out:
> > +	if (progress)
> > +		*progress += _progress;
> > +
> >   	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> >   				     none_or_zero, result, unmapped);
> >   	return result;
> > @@ -2287,7 +2295,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >   static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> >   				    struct file *file, pgoff_t start,
> > -				    struct collapse_control *cc)
> > +				    int *progress, struct collapse_control *cc)
> >   {
> >   	struct folio *folio = NULL;
> >   	struct address_space *mapping = file->f_mapping;
> > @@ -2295,6 +2303,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> >   	int present, swap;
> >   	int node = NUMA_NO_NODE;
> >   	int result = SCAN_SUCCEED;
> > +	int _progress = 0;
>
> Same here.
>
>
> Not sure if it would be cleaner to just let the parent increment its counter
> and instead return the "cur_progress" from the function.

Both are good for me. I have implemented one version as follows;
please see if it is cleaner.

--
Thanks,
Vernon


diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 9f99f61689f8..4cf24553c2bd 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1247,6 +1247,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 				   struct vm_area_struct *vma,
 				   unsigned long start_addr, bool *mmap_locked,
+				   int *cur_progress,
 				   struct collapse_control *cc)
 {
 	pmd_t *pmd;
@@ -1262,19 +1263,27 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 	VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);

 	result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
-	if (result != SCAN_SUCCEED)
+	if (result != SCAN_SUCCEED) {
+		if (cur_progress)
+			*cur_progress = HPAGE_PMD_NR;
 		goto out;
+	}

 	memset(cc->node_load, 0, sizeof(cc->node_load));
 	nodes_clear(cc->alloc_nmask);
 	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
 	if (!pte) {
+		if (cur_progress)
+			*cur_progress = HPAGE_PMD_NR;
 		result = SCAN_NO_PTE_TABLE;
 		goto out;
 	}

 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
 	     _pte++, addr += PAGE_SIZE) {
+		if (cur_progress)
+			*cur_progress += 1;
+
 		pte_t pteval = ptep_get(_pte);
 		if (pte_none_or_zero(pteval)) {
 			++none_or_zero;
@@ -2287,6 +2296,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,

 static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 				    struct file *file, pgoff_t start,
+				    int *cur_progress,
 				    struct collapse_control *cc)
 {
 	struct folio *folio = NULL;
@@ -2327,6 +2337,9 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 			continue;
 		}

+		if (cur_progress)
+			*cur_progress += folio_nr_pages(folio);
+
 		if (folio_order(folio) == HPAGE_PMD_ORDER &&
 		    folio->index == start) {
 			/* Maybe PMD-mapped */
@@ -2454,6 +2467,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,

 		while (khugepaged_scan.address < hend) {
 			bool mmap_locked = true;
+			int cur_progress = 0;

 			cond_resched();
 			if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
@@ -2470,7 +2484,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 				mmap_read_unlock(mm);
 				mmap_locked = false;
 				*result = hpage_collapse_scan_file(mm,
-					khugepaged_scan.address, file, pgoff, cc);
+					khugepaged_scan.address, file, pgoff,
+					&cur_progress, cc);
 				fput(file);
 				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
 					mmap_read_lock(mm);
@@ -2484,7 +2499,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 				}
 			} else {
 				*result = hpage_collapse_scan_pmd(mm, vma,
-					khugepaged_scan.address, &mmap_locked, cc);
+					khugepaged_scan.address, &mmap_locked,
+					&cur_progress, cc);
 			}

 			if (*result == SCAN_SUCCEED)
@@ -2492,7 +2508,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,

 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
-			progress += HPAGE_PMD_NR;
+			progress += cur_progress;
 			if (!mmap_locked)
 				/*
 				 * We released mmap_lock so break loop.  Note
@@ -2810,11 +2826,11 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 			mmap_read_unlock(mm);
 			mmap_locked = false;
 			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
-							  cc);
+							  NULL, cc);
 			fput(file);
 		} else {
 			result = hpage_collapse_scan_pmd(mm, vma, addr,
-							 &mmap_locked, cc);
+							 &mmap_locked, NULL, cc);
 		}
 		if (!mmap_locked)
 			*lock_dropped = true;
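
One note on the version above: madvise_collapse() passes NULL for
cur_progress, so the MADV_COLLAPSE path keeps its existing behavior and
only the khugepaged scan loop does the finer-grained accounting.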


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 5/6] mm: khugepaged: skip lazy-free folios at scanning
  2026-01-05 12:30             ` Vernon Yang
@ 2026-01-06 10:33               ` Barry Song
  2026-01-07  8:36                 ` Vernon Yang
  0 siblings, 1 reply; 24+ messages in thread
From: Barry Song @ 2026-01-06 10:33 UTC (permalink / raw)
  To: Vernon Yang
  Cc: Lance Yang, lorenzo.stoakes, ziy, dev.jain, richard.weiyang,
	linux-mm, linux-kernel, Vernon Yang, akpm, david

On Tue, Jan 6, 2026 at 1:31 AM Vernon Yang <vernon2gm@gmail.com> wrote:
>
> On Mon, Jan 05, 2026 at 11:35:58AM +0800, Lance Yang wrote:
> >
> >
> > On 2026/1/5 11:12, Vernon Yang wrote:
> > > On Mon, Jan 5, 2026 at 10:51 AM Lance Yang <lance.yang@linux.dev> wrote:
> > > >
> > > > On 2026/1/5 09:48, Vernon Yang wrote:
> > > > > On Sun, Jan 04, 2026 at 08:10:17PM +0800, Lance Yang wrote:
> > > > > >
> > > > > >
> > > > > > On 2026/1/4 13:41, Vernon Yang wrote:
> > > > > > > For example, create three tasks: hot1 -> cold -> hot2. After all three
> > > > > > > tasks are created, each allocates 128MB of memory. The hot1/hot2 tasks
> > > > > > > continuously access their 128 MB of memory, while the cold task only
> > > > > > > accesses its memory briefly and then calls madvise(MADV_FREE). However,
> > > > > > > khugepaged still prioritizes scanning the cold task and only scans the
> > > > > > > hot2 task after completing the scan of the cold task.
> > > > > > >
> > > > > > > So if the user has explicitly informed us via MADV_FREE that this
> > > > > > > memory will be freed, it is appropriate for khugepaged to simply skip
> > > > > > > it, thereby avoiding unnecessary scan and collapse operations and
> > > > > > > reducing CPU wastage.
> > > > > > >
> > > > > > > Here are the performance test results:
> > > > > > > (Throughput: higher is better; all other metrics: lower is better)
> > > > > > >
> > > > > > > Testing on x86_64 machine:
> > > > > > >
> > > > > > > | task hot2           | without patch | with patch    |  delta  |
> > > > > > > |---------------------|---------------|---------------|---------|
> > > > > > > | total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
> > > > > > > | cycles per access   |  4.96         |  2.21         | -55.44% |
> > > > > > > | Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
> > > > > > > | dTLB-load-misses    |  284814532    |  69597236     | -75.56% |
> > > > > > >
> > > > > > > Testing on qemu-system-x86_64 -enable-kvm:
> > > > > > >
> > > > > > > | task hot2           | without patch | with patch    |  delta  |
> > > > > > > |---------------------|---------------|---------------|---------|
> > > > > > > | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
> > > > > > > | cycles per access   |  7.29         |  2.07         | -71.60% |
> > > > > > > | Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
> > > > > > > | dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
> > > > > > >
> > > > > > > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > > > > > > ---
> > > > > > >     include/trace/events/huge_memory.h | 1 +
> > > > > > >     mm/khugepaged.c                    | 6 ++++++
> > > > > > >     2 files changed, 7 insertions(+)
> > > > > > >
> > > > > > > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > > > > > > index 01225dd27ad5..e99d5f71f2a4 100644
> > > > > > > --- a/include/trace/events/huge_memory.h
> > > > > > > +++ b/include/trace/events/huge_memory.h
> > > > > > > @@ -25,6 +25,7 @@
> > > > > > >      EM( SCAN_PAGE_LRU,              "page_not_in_lru")              \
> > > > > > >      EM( SCAN_PAGE_LOCK,             "page_locked")                  \
> > > > > > >      EM( SCAN_PAGE_ANON,             "page_not_anon")                \
> > > > > > > +   EM( SCAN_PAGE_LAZYFREE,         "page_lazyfree")                \
> > > > > > >      EM( SCAN_PAGE_COMPOUND,         "page_compound")                \
> > > > > > >      EM( SCAN_ANY_PROCESS,           "no_process_for_page")          \
> > > > > > >      EM( SCAN_VMA_NULL,              "vma_null")                     \
> > > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > > > index 30786c706c4a..1ca034a5f653 100644
> > > > > > > --- a/mm/khugepaged.c
> > > > > > > +++ b/mm/khugepaged.c
> > > > > > > @@ -45,6 +45,7 @@ enum scan_result {
> > > > > > >      SCAN_PAGE_LRU,
> > > > > > >      SCAN_PAGE_LOCK,
> > > > > > >      SCAN_PAGE_ANON,
> > > > > > > +   SCAN_PAGE_LAZYFREE,
> > > > > > >      SCAN_PAGE_COMPOUND,
> > > > > > >      SCAN_ANY_PROCESS,
> > > > > > >      SCAN_VMA_NULL,
> > > > > > > @@ -1337,6 +1338,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> > > > > > >              }
> > > > > > >              folio = page_folio(page);
> > > > > > > +           if (folio_is_lazyfree(folio)) {
> > > > > > > +                   result = SCAN_PAGE_LAZYFREE;
> > > > > > > +                   goto out_unmap;
> > > > > > > +           }
> > > > > >
> > > > > > That's a bit tricky ... I don't think we need to handle MADV_FREE pages
> > > > > > differently :)
> > > > > >
> > > > > > MADV_FREE pages are likely cold memory, but what if there are just
> > > > > > a few MADV_FREE pages in a hot memory region? Skipping the entire
> > > > > > region would be unfortunate ...
> > > > >
> > > > > If a lazyfree folio is hot, it will be marked non-lazyfree in the
> > > > > memory reclaim path, so it is not skipped on khugepaged's next scan:
> > > > >
> > > > > shrink_folio_list()
> > > > >     try_to_unmap()
> > > > >       folio_set_swapbacked()
> > > > >
> > > > > If none of the lazyfree folios are hot, continuing with the collapse
> > > > > would waste CPU and require a long wait (khugepaged_scan_sleep_millisecs).
> > > > > Additionally, the collapsed hugepage becomes non-lazyfree, preventing
> > > > > the rapid release of the lazyfree folios in the memory reclaim path.
> > > > >
> > > > > So skipping lazy-free folios makes sense here for us.
> > > > >
> > > > > If I missed something, please let me know, thanks!
> > > >
> > > > I'm not saying lazyfree pages become hot :)
> > > >
> > > > If a PMD region has mostly hot pages but just a few lazyfree
> > > > pages, we would skip the entire region. Those hot pages won't
> > > > be collapsed.
> > >
> > > Same as above: hot lazyfree folios will be marked non-lazyfree
> >
> > Nop ...
> >
> > > in the memory reclaim path, so they are not skipped on the next scan,
> > > and the PMD region will collapse :)
> >
> > Let me be more specific:
> >
> > Assume we have a PMD region (512 pages):
> > - Pages 0-499: hot pages (frequently accessed, NOT lazyfree)
> > - Pages 500-511: lazyfree pages (MADV_FREE'd and clean)
> >
> > This patch skips the entire region when it hits page 500. So pages
> > 0-499 can't be collapsed, even though they are hot.
> >
> > I'm NOT saying lazyfree pages themselves become hot ;)
> >
> > As I mentioned earlier, even if we skip these pages now, after they
> > are reclaimed they become pte_none. Then khugepaged will try to
> > collapse them anyway (based on khugepaged_max_ptes_none). So
> > skipping them just delays things, it does not really change the
> > final result ...
>
> I got it, thank you for explaining.
> I refined the code to resolve this issue, as follows:
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 30786c706c4a..afea2e12394e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -45,6 +45,7 @@ enum scan_result {
>         SCAN_PAGE_LRU,
>         SCAN_PAGE_LOCK,
>         SCAN_PAGE_ANON,
> +       SCAN_PAGE_LAZYFREE,
>         SCAN_PAGE_COMPOUND,
>         SCAN_ANY_PROCESS,
>         SCAN_VMA_NULL,
> @@ -1256,6 +1257,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>         pte_t *pte, *_pte;
>         int result = SCAN_FAIL, referenced = 0;
>         int none_or_zero = 0, shared = 0;
> +       int lazyfree = 0;
>         struct page *page = NULL;
>         struct folio *folio = NULL;
>         unsigned long addr;
> @@ -1337,6 +1339,21 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>                 }
>                 folio = page_folio(page);
>
> +               if (cc->is_khugepaged && !pte_dirty(pteval) &&
> +                   folio_is_lazyfree(folio)) {
> +                       ++lazyfree;
> +
> +                       /*
> +                        * Lazyfree folios become pte_none once they are
> +                        * reclaimed, so make sure the region does not get
> +                        * collapsed anyway after we skip it here.
> +                        */
> +                       if ((lazyfree + none_or_zero) > khugepaged_max_ptes_none) {
> +                               result = SCAN_PAGE_LAZYFREE;
> +                               goto out_unmap;
> +                       }
> +               }
> +

I am still not fully convinced that this is the correct approach. You may
want to look at jemalloc or scudo to see how userspace heaps use
MADV_FREE for small size classes. In practice, it can be quite
difficult to form a large range of PTEs that are all marked lazyfree.
From my perspective, it would make more sense not to collapse the
entire range if only part of it is lazyfree.
I mean:
for ptes as below,
    lazyfree, lazyfree, non-lazyfree, non-lazyfree

Collapsing the range is unnecessary, as the first two entries are likely
to be freed soon.
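
For illustration, the small-size-class pattern looks like this from
userspace (a minimal sketch; the run size of 16 pages is arbitrary):

	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 16 * 4096;	/* one small run, far below a 2MB PMD */
		char *run = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (run == MAP_FAILED)
			return 1;
		memset(run, 1, len);		/* use it: pages become resident */
		madvise(run, len, MADV_FREE);	/* done: mark lazyfree only */
		/* the PTEs stay mapped until reclaim, so a PMD region ends
		 * up a mix of lazyfree and live entries, rarely fully
		 * lazyfree */
		return 0;
	}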

>                 if (!folio_test_anon(folio)) {
>                         result = SCAN_PAGE_ANON;
>                         goto out_unmap;
>
>
> If there are any bugs or you have a better idea, please let me know, thanks!
> If not, I will send it in the next version.
>
> --
> Thanks,
> Vernon


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 5/6] mm: khugepaged: skip lazy-free folios at scanning
  2026-01-06 10:33               ` Barry Song
@ 2026-01-07  8:36                 ` Vernon Yang
  0 siblings, 0 replies; 24+ messages in thread
From: Vernon Yang @ 2026-01-07  8:36 UTC (permalink / raw)
  To: Barry Song
  Cc: Lance Yang, lorenzo.stoakes, ziy, dev.jain, richard.weiyang,
	linux-mm, linux-kernel, Vernon Yang, akpm, david

On Tue, Jan 06, 2026 at 11:33:35PM +1300, Barry Song wrote:
> On Tue, Jan 6, 2026 at 1:31 AM Vernon Yang <vernon2gm@gmail.com> wrote:
> >
> > On Mon, Jan 05, 2026 at 11:35:58AM +0800, Lance Yang wrote:
> > >
> > >
> > > On 2026/1/5 11:12, Vernon Yang wrote:
> > > > On Mon, Jan 5, 2026 at 10:51 AM Lance Yang <lance.yang@linux.dev> wrote:
> > > > >
> > > > > On 2026/1/5 09:48, Vernon Yang wrote:
> > > > > > On Sun, Jan 04, 2026 at 08:10:17PM +0800, Lance Yang wrote:
> > > > > > >
> > > > > > >
> > > > > > > On 2026/1/4 13:41, Vernon Yang wrote:
> > > > > > > > For example, create three tasks: hot1 -> cold -> hot2. After all three
> > > > > > > > tasks are created, each allocates 128MB of memory. The hot1/hot2 tasks
> > > > > > > > continuously access their 128 MB of memory, while the cold task only
> > > > > > > > accesses its memory briefly and then calls madvise(MADV_FREE). However,
> > > > > > > > khugepaged still prioritizes scanning the cold task and only scans the
> > > > > > > > hot2 task after completing the scan of the cold task.
> > > > > > > >
> > > > > > > > So if the user has explicitly informed us via MADV_FREE that this
> > > > > > > > memory will be freed, it is appropriate for khugepaged to simply skip
> > > > > > > > it, thereby avoiding unnecessary scan and collapse operations and
> > > > > > > > reducing CPU wastage.
> > > > > > > >
> > > > > > > > Here are the performance test results:
> > > > > > > > (Throughput: higher is better; all other metrics: lower is better)
> > > > > > > >
> > > > > > > > Testing on x86_64 machine:
> > > > > > > >
> > > > > > > > | task hot2           | without patch | with patch    |  delta  |
> > > > > > > > |---------------------|---------------|---------------|---------|
> > > > > > > > | total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
> > > > > > > > | cycles per access   |  4.96         |  2.21         | -55.44% |
> > > > > > > > | Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
> > > > > > > > | dTLB-load-misses    |  284814532    |  69597236     | -75.56% |
> > > > > > > >
> > > > > > > > Testing on qemu-system-x86_64 -enable-kvm:
> > > > > > > >
> > > > > > > > | task hot2           | without patch | with patch    |  delta  |
> > > > > > > > |---------------------|---------------|---------------|---------|
> > > > > > > > | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
> > > > > > > > | cycles per access   |  7.29         |  2.07         | -71.60% |
> > > > > > > > | Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
> > > > > > > > | dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
> > > > > > > >
> > > > > > > > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > > > > > > > ---
> > > > > > > >     include/trace/events/huge_memory.h | 1 +
> > > > > > > >     mm/khugepaged.c                    | 6 ++++++
> > > > > > > >     2 files changed, 7 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > > > > > > > index 01225dd27ad5..e99d5f71f2a4 100644
> > > > > > > > --- a/include/trace/events/huge_memory.h
> > > > > > > > +++ b/include/trace/events/huge_memory.h
> > > > > > > > @@ -25,6 +25,7 @@
> > > > > > > >      EM( SCAN_PAGE_LRU,              "page_not_in_lru")              \
> > > > > > > >      EM( SCAN_PAGE_LOCK,             "page_locked")                  \
> > > > > > > >      EM( SCAN_PAGE_ANON,             "page_not_anon")                \
> > > > > > > > +   EM( SCAN_PAGE_LAZYFREE,         "page_lazyfree")                \
> > > > > > > >      EM( SCAN_PAGE_COMPOUND,         "page_compound")                \
> > > > > > > >      EM( SCAN_ANY_PROCESS,           "no_process_for_page")          \
> > > > > > > >      EM( SCAN_VMA_NULL,              "vma_null")                     \
> > > > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > > > > index 30786c706c4a..1ca034a5f653 100644
> > > > > > > > --- a/mm/khugepaged.c
> > > > > > > > +++ b/mm/khugepaged.c
> > > > > > > > @@ -45,6 +45,7 @@ enum scan_result {
> > > > > > > >      SCAN_PAGE_LRU,
> > > > > > > >      SCAN_PAGE_LOCK,
> > > > > > > >      SCAN_PAGE_ANON,
> > > > > > > > +   SCAN_PAGE_LAZYFREE,
> > > > > > > >      SCAN_PAGE_COMPOUND,
> > > > > > > >      SCAN_ANY_PROCESS,
> > > > > > > >      SCAN_VMA_NULL,
> > > > > > > > @@ -1337,6 +1338,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> > > > > > > >              }
> > > > > > > >              folio = page_folio(page);
> > > > > > > > +           if (folio_is_lazyfree(folio)) {
> > > > > > > > +                   result = SCAN_PAGE_LAZYFREE;
> > > > > > > > +                   goto out_unmap;
> > > > > > > > +           }
> > > > > > >
> > > > > > > That's a bit tricky ... I don't think we need to handle MADV_FREE pages
> > > > > > > differently :)
> > > > > > >
> > > > > > > MADV_FREE pages are likely cold memory, but what if there are just
> > > > > > > a few MADV_FREE pages in a hot memory region? Skipping the entire
> > > > > > > region would be unfortunate ...
> > > > > >
> > > > > > If a lazyfree folio is hot, it will be marked non-lazyfree in the
> > > > > > memory reclaim path, so it is not skipped on khugepaged's next scan:
> > > > > >
> > > > > > shrink_folio_list()
> > > > > >     try_to_unmap()
> > > > > >       folio_set_swapbacked()
> > > > > >
> > > > > > If none of the lazyfree folios are hot, continuing with the collapse
> > > > > > would waste CPU and require a long wait (khugepaged_scan_sleep_millisecs).
> > > > > > Additionally, the collapsed hugepage becomes non-lazyfree, preventing
> > > > > > the rapid release of the lazyfree folios in the memory reclaim path.
> > > > > >
> > > > > > So skipping lazy-free folios makes sense here for us.
> > > > > >
> > > > > > If I missed something, please let me know, thanks!
> > > > >
> > > > > I'm not saying lazyfree pages become hot :)
> > > > >
> > > > > If a PMD region has mostly hot pages but just a few lazyfree
> > > > > pages, we would skip the entire region. Those hot pages won't
> > > > > be collapsed.
> > > >
> > > > Same as above: hot lazyfree folios will be marked non-lazyfree
> > >
> > > Nop ...
> > >
> > > > in the memory reclaim path, so they are not skipped on the next scan,
> > > > and the PMD region will collapse :)
> > >
> > > Let me be more specific:
> > >
> > > Assume we have a PMD region (512 pages):
> > > - Pages 0-499: hot pages (frequently accessed, NOT lazyfree)
> > > - Pages 500-511: lazyfree pages (MADV_FREE'd and clean)
> > >
> > > This patch skips the entire region when it hits page 500. So pages
> > > 0-499 can't be collapsed, even though they are hot.
> > >
> > > I'm NOT saying lazyfree pages themselves become hot ;)
> > >
> > > As I mentioned earlier, even if we skip these pages now, after they
> > > are reclaimed they become pte_none. Then khugepaged will try to
> > > collapse them anyway (based on khugepaged_max_ptes_none). So
> > > skipping them just delays things, it does not really change the
> > > final result ...

here

> >
> > I got it, thank you for explaining.
> > I refined the code to resolve this issue, as follows:
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 30786c706c4a..afea2e12394e 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -45,6 +45,7 @@ enum scan_result {
> >         SCAN_PAGE_LRU,
> >         SCAN_PAGE_LOCK,
> >         SCAN_PAGE_ANON,
> > +       SCAN_PAGE_LAZYFREE,
> >         SCAN_PAGE_COMPOUND,
> >         SCAN_ANY_PROCESS,
> >         SCAN_VMA_NULL,
> > @@ -1256,6 +1257,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> >         pte_t *pte, *_pte;
> >         int result = SCAN_FAIL, referenced = 0;
> >         int none_or_zero = 0, shared = 0;
> > +       int lazyfree = 0;
> >         struct page *page = NULL;
> >         struct folio *folio = NULL;
> >         unsigned long addr;
> > @@ -1337,6 +1339,21 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> >                 }
> >                 folio = page_folio(page);
> >
> > +               if (cc->is_khugepaged && !pte_dirty(pteval) &&
> > +                   folio_is_lazyfree(folio)) {
> > +                       ++lazyfree;
> > +
> > +                       /*
> > +                        * Lazyfree folios become pte_none once they are
> > +                        * reclaimed, so make sure the region does not get
> > +                        * collapsed anyway after we skip it here.
> > +                        */
> > +                       if ((lazyfree + none_or_zero) > khugepaged_max_ptes_none) {
> > +                               result = SCAN_PAGE_LAZYFREE;
> > +                               goto out_unmap;
> > +                       }
> > +               }
> > +
>
> I am still not fully convinced that this is the correct approach. You may
> want to look at jemalloc or scudo to see how userspace heaps use
> MADV_FREE for small size classes. In practice, it can be quite
> difficult to form a large range of PTEs that are all marked lazyfree.
> From my perspective, it would make more sense not to collapse the
> entire range if only part of it is lazyfree.
> I mean:
> for ptes as below,
>     lazyfree, lazyfree, non-lazyfree, non-lazyfree
>
> Collapsing the range is unnecessary, as the first two entries are likely
> to be freed soon.

But if the latter two entries are hot and we do not collapse, the
scenario Lance describes may occur.

> >                 if (!folio_test_anon(folio)) {
> >                         result = SCAN_PAGE_ANON;
> >                         goto out_unmap;
> >
> >
> > If there are any bugs or you have a better idea, please let me know, thanks!
> > If not, I will send it in the next version.
> >
> > --
> > Thanks,
> > Vernon
>


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2026-01-07  8:37 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-04  5:41 [PATCH v3 0/6] Improve khugepaged scan logic Vernon Yang
2026-01-04  5:41 ` [PATCH v3 1/6] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
2026-01-04  5:41 ` [PATCH v3 2/6] mm: khugepaged: refine scan progress number Vernon Yang
2026-01-05 16:49   ` David Hildenbrand (Red Hat)
2026-01-06  5:55     ` Vernon Yang
2026-01-04  5:41 ` [PATCH v3 3/6] mm: khugepaged: just skip when the memory has been collapsed Vernon Yang
2026-01-04  5:41 ` [PATCH v3 4/6] mm: add folio_is_lazyfree helper Vernon Yang
2026-01-04 11:42   ` Lance Yang
2026-01-05  2:09     ` Vernon Yang
2026-01-04  5:41 ` [PATCH v3 5/6] mm: khugepaged: skip lazy-free folios at scanning Vernon Yang
2026-01-04 12:10   ` Lance Yang
2026-01-05  1:48     ` Vernon Yang
2026-01-05  2:51       ` Lance Yang
2026-01-05  3:12         ` Vernon Yang
2026-01-05  3:35           ` Lance Yang
2026-01-05 12:30             ` Vernon Yang
2026-01-06 10:33               ` Barry Song
2026-01-07  8:36                 ` Vernon Yang
2026-01-04  5:41 ` [PATCH v3 6/6] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY Vernon Yang
2026-01-04 12:20   ` Lance Yang
2026-01-05  0:31     ` Wei Yang
2026-01-05  2:09       ` Lance Yang
2026-01-05  2:06     ` Vernon Yang
2026-01-05  2:20       ` Lance Yang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox