* [PATCH 0/4] Improve khugepaged scan logic
@ 2025-12-15 9:04 Vernon Yang
2025-12-15 9:04 ` [PATCH 1/4] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
` (3 more replies)
0 siblings, 4 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-15 9:04 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes
Cc: ziy, npache, baohua, lance.yang, linux-mm, linux-kernel, Vernon Yang
Hi all,

This series improves the khugepaged scan logic to reduce CPU consumption
and to prioritize scanning tasks that access memory frequently.
The following data was traced with bpftrace[1] on a desktop system. After
the system had been left idle for 10 minutes after booting, a large number
of SCAN_PMD_MAPPED or SCAN_PMD_NONE results were observed during a full
scan by khugepaged.
@scan_pmd_status[1]: 1 ## SCAN_SUCCEED
@scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
@scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
total progress size: 701 MB
Total time : 440 seconds ## includes khugepaged_scan_sleep_millisecs
khugepaged shows the following behavior: the khugepaged list is scanned in
a FIFO manner, and as long as a task is not destroyed,

1. a task that no longer has any memory that can be collapsed into
   hugepages is still scanned over and over.
2. a task at the front of the khugepaged scan list is scanned first even
   when it is cold.
3. each pass runs at intervals of khugepaged_scan_sleep_millisecs
   (default 10s). If the two cases above are always scanned first, the
   useful scans have to wait a long time.
For the first case, once all memory has been collapsed, the mm is
automatically removed from khugepaged's scan list. If the memory is
faulted in again or MADV_HUGEPAGE is called, it is added back to
khugepaged.
For the second case, if the user has explicitly informed us via
MADV_COLD/MADV_FREE that the memory is cold or will be freed, the mm is
moved to the tail of the khugepaged scan list to be scanned later.
Below are some performance test results.
kernbench results (testing on x86_64 machine):
                      6.18.0-baseline       6.18.0-test
Amean     user-32    18652.80 ( 0.00%)   18640.85 ( 0.06%)
Amean     syst-32     1165.09 ( 0.00%)    1159.15 * 0.51%*
Amean     elsp-32      667.71 ( 0.00%)     667.02 * 0.10%*
BAmean-95 user-32    18652.02 ( 0.00%)   18638.11 ( 0.07%)
BAmean-95 syst-32     1165.04 ( 0.00%)    1158.41 ( 0.57%)
BAmean-95 elsp-32      667.65 ( 0.00%)     666.90 ( 0.11%)
BAmean-99 user-32    18652.02 ( 0.00%)   18638.11 ( 0.07%)
BAmean-99 syst-32     1165.04 ( 0.00%)    1158.41 ( 0.57%)
BAmean-99 elsp-32      667.65 ( 0.00%)     666.90 ( 0.11%)
Create three tasks[2]: hot1 -> cold -> hot2. After all three tasks are
created, each allocates 128MB of memory. The hot1/hot2 tasks continuously
access their 128MB, while the cold task only accesses its memory briefly
and then calls madvise(MADV_COLD); a sketch of this kind of test program
follows the tables below. Here are the performance test results:
(Higher is better for Throughput; lower is better for all other metrics)
Testing on x86_64 machine:
| task hot2           | without patch | with patch    | delta   |
|---------------------|---------------|---------------|---------|
| total access time   | 3.14 sec      | 2.92 sec      | -7.01%  |
| cycles per access   | 4.91          | 2.07          | -57.84% |
| Throughput          | 104.38 M/sec  | 112.12 M/sec  | +7.42%  |
| dTLB-load-misses    | 288966432     | 1292908       | -99.55% |
Testing on qemu-system-x86_64 -enable-kvm:
| task hot2           | without patch | with patch    | delta   |
|---------------------|---------------|---------------|---------|
| total access time   | 3.35 sec      | 2.96 sec      | -11.64% |
| cycles per access   | 7.23          | 2.12          | -70.68% |
| Throughput          | 97.88 M/sec   | 110.76 M/sec  | +13.16% |
| dTLB-load-misses    | 237406497     | 3189194       | -98.66% |
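For reference, here is a minimal sketch of the kind of test program used
above. The actual test is app.c[2]; the sizes, access stride, and
command-line handling below are illustrative assumptions, not the exact
code:

#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define SIZE (128UL << 20)	/* 128MB per task */

int main(int argc, char **argv)
{
	unsigned long i;
	char *buf;

	/* Anonymous mapping that khugepaged may collapse. */
	buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* Populate every page once. */
	memset(buf, 1, SIZE);

	if (argc > 1 && !strcmp(argv[1], "cold")) {
		/* Cold task: touch briefly, then mark the range cold. */
		madvise(buf, SIZE, MADV_COLD);
		pause();
	}

	/* Hot task: keep accessing the whole range. */
	for (;;) {
		for (i = 0; i < SIZE; i += 64)
			buf[i]++;
	}
}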
This series is based on Linux v6.18.
Thank you very much for your comments and discussions :)
[1] https://github.com/vernon2gh/app_and_module/blob/main/khugepaged/khugepaged_mm.bt
[2] https://github.com/vernon2gh/app_and_module/blob/main/khugepaged/app.c
Vernon Yang (4):
mm: khugepaged: add trace_mm_khugepaged_scan event
mm: khugepaged: remove mm when all memory has been collapsed
mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
mm: khugepaged: set to next mm direct when mm has
MMF_DISABLE_THP_COMPLETELY
include/linux/khugepaged.h | 1 +
include/trace/events/huge_memory.h | 24 ++++++++++++
mm/khugepaged.c | 60 ++++++++++++++++++++++++------
mm/madvise.c | 3 ++
4 files changed, 76 insertions(+), 12 deletions(-)
--
2.51.0
* [PATCH 1/4] mm: khugepaged: add trace_mm_khugepaged_scan event
2025-12-15 9:04 [PATCH 0/4] Improve khugepaged scan logic Vernon Yang
@ 2025-12-15 9:04 ` Vernon Yang
2025-12-18 9:24 ` David Hildenbrand (Red Hat)
2025-12-15 9:04 ` [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed Vernon Yang
` (2 subsequent siblings)
3 siblings, 1 reply; 42+ messages in thread
From: Vernon Yang @ 2025-12-15 9:04 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes
Cc: ziy, npache, baohua, lance.yang, linux-mm, linux-kernel, Vernon Yang
Add the mm_khugepaged_scan event to track the total time of a full
khugepaged scan and the total number of pages scanned.
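The event can then be consumed through the usual tracefs interface, for
example (assuming tracefs is mounted at /sys/kernel/tracing):

  # echo 1 > /sys/kernel/tracing/events/huge_memory/mm_khugepaged_scan/enable
  # cat /sys/kernel/tracing/trace_pipe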
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
include/trace/events/huge_memory.h | 24 ++++++++++++++++++++++++
mm/khugepaged.c | 2 ++
2 files changed, 26 insertions(+)
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index dd94d14a2427..b2824c2f8238 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -237,5 +237,29 @@ TRACE_EVENT(mm_khugepaged_collapse_file,
__print_symbolic(__entry->result, SCAN_STATUS))
);
+TRACE_EVENT(mm_khugepaged_scan,
+
+ TP_PROTO(struct mm_struct *mm, int progress, bool full),
+
+ TP_ARGS(mm, progress, full),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct *, mm)
+ __field(int, progress)
+ __field(bool, full)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __entry->progress = progress;
+ __entry->full = full;
+ ),
+
+ TP_printk("mm=%p, progress=%d, full=%d",
+ __entry->mm,
+ __entry->progress,
+ __entry->full)
+);
+
#endif /* __HUGE_MEMORY_H */
#include <trace/define_trace.h>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index abe54f0043c7..0598a19a98cc 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2516,6 +2516,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
collect_mm_slot(slot);
}
+ trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
+
return progress;
}
--
2.51.0
* [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-15 9:04 [PATCH 0/4] Improve khugepaged scan logic Vernon Yang
2025-12-15 9:04 ` [PATCH 1/4] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
@ 2025-12-15 9:04 ` Vernon Yang
2025-12-15 11:52 ` Lance Yang
` (5 more replies)
2025-12-15 9:04 ` [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE Vernon Yang
2025-12-15 9:04 ` [PATCH 4/4] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY Vernon Yang
3 siblings, 6 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-15 9:04 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes
Cc: ziy, npache, baohua, lance.yang, linux-mm, linux-kernel, Vernon Yang
The following data is traced by bpftrace on a desktop system. After
the system has been left idle for 10 minutes upon booting, a lot of
SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
khugepaged.
@scan_pmd_status[1]: 1 ## SCAN_SUCCEED
@scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
@scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
total progress size: 701 MB
Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
The khugepaged_scan list save all task that support collapse into hugepage,
as long as the take is not destroyed, khugepaged will not remove it from
the khugepaged_scan list. This exist a phenomenon where task has already
collapsed all memory regions into hugepage, but khugepaged continues to
scan it, which wastes CPU time and invalid, and due to
khugepaged_scan_sleep_millisecs (default 10s) causes a long wait for
scanning a large number of invalid task, so scanning really valid task
is later.
After applying this patch, when all memory is either SCAN_PMD_MAPPED or
SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
list. If the page fault or MADV_HUGEPAGE again, it is added back to
khugepaged.
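For reference, the re-add relies on the existing fault-path entry point;
roughly (paraphrased here from mm/khugepaged.c as an illustration, not
part of this patch):

void khugepaged_enter_vma(struct vm_area_struct *vma,
			  vm_flags_t vm_flags)
{
	/*
	 * Once collect_mm_slot() has cleared MMF_VM_HUGEPAGE, the next
	 * page fault in a THP-eligible VMA re-registers the mm here.
	 */
	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
	    hugepage_pmd_enabled()) {
		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED,
					    PMD_ORDER))
			__khugepaged_enter(vma->vm_mm);
	}
}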
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
mm/khugepaged.c | 35 +++++++++++++++++++++++++----------
1 file changed, 25 insertions(+), 10 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 0598a19a98cc..1ec1af5be3c8 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -115,6 +115,7 @@ struct khugepaged_scan {
struct list_head mm_head;
struct mm_slot *mm_slot;
unsigned long address;
+ bool maybe_collapse;
};
static struct khugepaged_scan khugepaged_scan = {
@@ -1420,22 +1421,19 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
return result;
}
-static void collect_mm_slot(struct mm_slot *slot)
+static void collect_mm_slot(struct mm_slot *slot, bool maybe_collapse)
{
struct mm_struct *mm = slot->mm;
lockdep_assert_held(&khugepaged_mm_lock);
- if (hpage_collapse_test_exit(mm)) {
+ if (hpage_collapse_test_exit(mm) || !maybe_collapse) {
/* free mm_slot */
hash_del(&slot->hash);
list_del(&slot->mm_node);
- /*
- * Not strictly needed because the mm exited already.
- *
- * mm_flags_clear(MMF_VM_HUGEPAGE, mm);
- */
+ if (!maybe_collapse)
+ mm_flags_clear(MMF_VM_HUGEPAGE, mm);
/* khugepaged_mm_lock actually not necessary for the below */
mm_slot_free(mm_slot_cache, slot);
@@ -2397,6 +2395,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
struct mm_slot, mm_node);
khugepaged_scan.address = 0;
khugepaged_scan.mm_slot = slot;
+ khugepaged_scan.maybe_collapse = false;
}
spin_unlock(&khugepaged_mm_lock);
@@ -2470,8 +2469,18 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
khugepaged_scan.address, &mmap_locked, cc);
}
- if (*result == SCAN_SUCCEED)
+ switch (*result) {
+ case SCAN_PMD_NULL:
+ case SCAN_PMD_NONE:
+ case SCAN_PMD_MAPPED:
+ case SCAN_PTE_MAPPED_HUGEPAGE:
+ break;
+ case SCAN_SUCCEED:
++khugepaged_pages_collapsed;
+ fallthrough;
+ default:
+ khugepaged_scan.maybe_collapse = true;
+ }
/* move to next address */
khugepaged_scan.address += HPAGE_PMD_SIZE;
@@ -2500,6 +2509,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
* if we scanned all vmas of this mm.
*/
if (hpage_collapse_test_exit(mm) || !vma) {
+ bool maybe_collapse = khugepaged_scan.maybe_collapse;
+
+ if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
+ maybe_collapse = true;
+
/*
* Make sure that if mm_users is reaching zero while
* khugepaged runs here, khugepaged_exit will find
@@ -2508,12 +2522,13 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
if (!list_is_last(&slot->mm_node, &khugepaged_scan.mm_head)) {
khugepaged_scan.mm_slot = list_next_entry(slot, mm_node);
khugepaged_scan.address = 0;
+ khugepaged_scan.maybe_collapse = false;
} else {
khugepaged_scan.mm_slot = NULL;
khugepaged_full_scans++;
}
- collect_mm_slot(slot);
+ collect_mm_slot(slot, maybe_collapse);
}
trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
@@ -2616,7 +2631,7 @@ static int khugepaged(void *none)
slot = khugepaged_scan.mm_slot;
khugepaged_scan.mm_slot = NULL;
if (slot)
- collect_mm_slot(slot);
+ collect_mm_slot(slot, true);
spin_unlock(&khugepaged_mm_lock);
return 0;
}
--
2.51.0
* [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-15 9:04 [PATCH 0/4] Improve khugepaged scan logic Vernon Yang
2025-12-15 9:04 ` [PATCH 1/4] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
2025-12-15 9:04 ` [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed Vernon Yang
@ 2025-12-15 9:04 ` Vernon Yang
2025-12-15 21:12 ` kernel test robot
` (3 more replies)
2025-12-15 9:04 ` [PATCH 4/4] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY Vernon Yang
3 siblings, 4 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-15 9:04 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes
Cc: ziy, npache, baohua, lance.yang, linux-mm, linux-kernel, Vernon Yang
For example, create three tasks: hot1 -> cold -> hot2. After all three
tasks are created, each allocates 128MB of memory. The hot1/hot2 tasks
continuously access their 128MB, while the cold task only accesses its
memory briefly and then calls madvise(MADV_COLD). However, khugepaged
still prioritizes scanning the cold task and only scans the hot2 task
after completing the scan of the cold task.

So if the user has explicitly informed us via MADV_COLD/MADV_FREE that
this memory is cold or will be freed, it is appropriate for khugepaged to
scan it only at the latest possible moment, thereby avoiding unnecessary
scan and collapse operations and reducing CPU wastage.
Here are the performance test results:
(Higher is better for Throughput; lower is better for all other metrics)
Testing on x86_64 machine:
| task hot2           | without patch | with patch    | delta   |
|---------------------|---------------|---------------|---------|
| total access time   | 3.14 sec      | 2.92 sec      | -7.01%  |
| cycles per access   | 4.91          | 2.07          | -57.84% |
| Throughput          | 104.38 M/sec  | 112.12 M/sec  | +7.42%  |
| dTLB-load-misses    | 288966432     | 1292908       | -99.55% |
Testing on qemu-system-x86_64 -enable-kvm:
| task hot2           | without patch | with patch    | delta   |
|---------------------|---------------|---------------|---------|
| total access time   | 3.35 sec      | 2.96 sec      | -11.64% |
| cycles per access   | 7.23          | 2.12          | -70.68% |
| Throughput          | 97.88 M/sec   | 110.76 M/sec  | +13.16% |
| dTLB-load-misses    | 237406497     | 3189194       | -98.66% |
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
include/linux/khugepaged.h | 1 +
mm/khugepaged.c | 14 ++++++++++++++
mm/madvise.c | 3 +++
3 files changed, 18 insertions(+)
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index eb1946a70cff..726e99de84e9 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -15,6 +15,7 @@ extern void __khugepaged_enter(struct mm_struct *mm);
extern void __khugepaged_exit(struct mm_struct *mm);
extern void khugepaged_enter_vma(struct vm_area_struct *vma,
vm_flags_t vm_flags);
+void khugepaged_move_tail(struct mm_struct *mm);
extern void khugepaged_min_free_kbytes_update(void);
extern bool current_is_khugepaged(void);
extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1ec1af5be3c8..91836dda2015 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -468,6 +468,20 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
}
}
+void khugepaged_move_tail(struct mm_struct *mm)
+{
+ struct mm_slot *slot;
+
+ if (!mm_flags_test(MMF_VM_HUGEPAGE, mm))
+ return;
+
+ spin_lock(&khugepaged_mm_lock);
+ slot = mm_slot_lookup(mm_slots_hash, mm);
+ if (slot && khugepaged_scan.mm_slot != slot)
+ list_move_tail(&slot->mm_node, &khugepaged_scan.mm_head);
+ spin_unlock(&khugepaged_mm_lock);
+}
+
void __khugepaged_exit(struct mm_struct *mm)
{
struct mm_slot *slot;
diff --git a/mm/madvise.c b/mm/madvise.c
index fb1c86e630b6..3f9ca7af2c82 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -608,6 +608,8 @@ static long madvise_cold(struct madvise_behavior *madv_behavior)
madvise_cold_page_range(&tlb, madv_behavior);
tlb_finish_mmu(&tlb);
+ khugepaged_move_tail(vma->vm_mm);
+
return 0;
}
@@ -835,6 +837,7 @@ static int madvise_free_single_vma(struct madvise_behavior *madv_behavior)
&walk_ops, tlb);
tlb_end_vma(tlb, vma);
mmu_notifier_invalidate_range_end(&range);
+ khugepaged_move_tail(mm);
return 0;
}
--
2.51.0
* [PATCH 4/4] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
2025-12-15 9:04 [PATCH 0/4] Improve khugepaged scan logic Vernon Yang
` (2 preceding siblings ...)
2025-12-15 9:04 ` [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE Vernon Yang
@ 2025-12-15 9:04 ` Vernon Yang
2025-12-18 9:33 ` David Hildenbrand (Red Hat)
3 siblings, 1 reply; 42+ messages in thread
From: Vernon Yang @ 2025-12-15 9:04 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes
Cc: ziy, npache, baohua, lance.yang, linux-mm, linux-kernel, Vernon Yang
When an mm with the MMF_DISABLE_THP_COMPLETELY flag is detected during
scanning, directly set khugepaged_scan.mm_slot to the next mm_slot to
avoid redundant work. Setting vma to NULL makes the slot-release check at
the end of khugepaged_scan_mm_slot() ("hpage_collapse_test_exit(mm) ||
!vma") trigger, so khugepaged advances to the next mm immediately instead
of revisiting the disabled one.
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
mm/khugepaged.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 91836dda2015..a8723eea12f1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2432,6 +2432,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
cond_resched();
if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
+ vma = NULL;
progress++;
break;
}
@@ -2452,8 +2453,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
bool mmap_locked = true;
cond_resched();
- if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+ if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
+ vma = NULL;
goto breakouterloop;
+ }
VM_BUG_ON(khugepaged_scan.address < hstart ||
khugepaged_scan.address + HPAGE_PMD_SIZE >
@@ -2470,8 +2473,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
fput(file);
if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
mmap_read_lock(mm);
- if (hpage_collapse_test_exit_or_disable(mm))
+ if (hpage_collapse_test_exit_or_disable(mm)) {
+ vma = NULL;
goto breakouterloop;
+ }
*result = collapse_pte_mapped_thp(mm,
khugepaged_scan.address, false);
if (*result == SCAN_PMD_MAPPED)
--
2.51.0
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-15 9:04 ` [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed Vernon Yang
@ 2025-12-15 11:52 ` Lance Yang
2025-12-16 6:27 ` Vernon Yang
2025-12-15 21:45 ` kernel test robot
` (4 subsequent siblings)
5 siblings, 1 reply; 42+ messages in thread
From: Lance Yang @ 2025-12-15 11:52 UTC (permalink / raw)
To: Vernon Yang
Cc: ziy, npache, baohua, linux-mm, linux-kernel, Vernon Yang, akpm,
lorenzo.stoakes, david
Hi Vernon,
Thanks for the patches!
On 2025/12/15 17:04, Vernon Yang wrote:
> The following data is traced by bpftrace on a desktop system. After
> the system has been left idle for 10 minutes upon booting, a lot of
> SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
> khugepaged.
>
> @scan_pmd_status[1]: 1 ## SCAN_SUCCEED
> @scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
> @scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
> total progress size: 701 MB
> Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
>
> The khugepaged_scan list save all task that support collapse into hugepage,
> as long as the take is not destroyed, khugepaged will not remove it from
Nit: s/take/task/
> the khugepaged_scan list. This exist a phenomenon where task has already
> collapsed all memory regions into hugepage, but khugepaged continues to
> scan it, which wastes CPU time and invalid, and due to
> khugepaged_scan_sleep_millisecs (default 10s) causes a long wait for
> scanning a large number of invalid task, so scanning really valid task
> is later.
>
> After applying this patch, when all memory is either SCAN_PMD_MAPPED or
> SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
> list. If the page fault or MADV_HUGEPAGE again, it is added back to
> khugepaged.
>
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> ---
> mm/khugepaged.c | 35 +++++++++++++++++++++++++----------
> 1 file changed, 25 insertions(+), 10 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 0598a19a98cc..1ec1af5be3c8 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -115,6 +115,7 @@ struct khugepaged_scan {
> struct list_head mm_head;
> struct mm_slot *mm_slot;
> unsigned long address;
> + bool maybe_collapse;
At a quick glance, the name of "maybe_collapse" is a bit ambiguous ...
Perhaps "scan_needed" or "collapse_possible" would be clearer to
indicate that the mm should be kept in the scan list?
> };
>
> static struct khugepaged_scan khugepaged_scan = {
> @@ -1420,22 +1421,19 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> return result;
> }
>
> -static void collect_mm_slot(struct mm_slot *slot)
> +static void collect_mm_slot(struct mm_slot *slot, bool maybe_collapse)
> {
> struct mm_struct *mm = slot->mm;
>
> lockdep_assert_held(&khugepaged_mm_lock);
>
> - if (hpage_collapse_test_exit(mm)) {
> + if (hpage_collapse_test_exit(mm) || !maybe_collapse) {
> /* free mm_slot */
> hash_del(&slot->hash);
> list_del(&slot->mm_node);
>
> - /*
> - * Not strictly needed because the mm exited already.
> - *
> - * mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> - */
> + if (!maybe_collapse)
> + mm_flags_clear(MMF_VM_HUGEPAGE, mm);
>
> /* khugepaged_mm_lock actually not necessary for the below */
> mm_slot_free(mm_slot_cache, slot);
> @@ -2397,6 +2395,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> struct mm_slot, mm_node);
> khugepaged_scan.address = 0;
> khugepaged_scan.mm_slot = slot;
> + khugepaged_scan.maybe_collapse = false;
> }
> spin_unlock(&khugepaged_mm_lock);
>
> @@ -2470,8 +2469,18 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> khugepaged_scan.address, &mmap_locked, cc);
> }
>
> - if (*result == SCAN_SUCCEED)
> + switch (*result) {
> + case SCAN_PMD_NULL:
> + case SCAN_PMD_NONE:
> + case SCAN_PMD_MAPPED:
> + case SCAN_PTE_MAPPED_HUGEPAGE:
> + break;
> + case SCAN_SUCCEED:
> ++khugepaged_pages_collapsed;
> + fallthrough;
> + default:
> + khugepaged_scan.maybe_collapse = true;
> + }
>
> /* move to next address */
> khugepaged_scan.address += HPAGE_PMD_SIZE;
> @@ -2500,6 +2509,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> * if we scanned all vmas of this mm.
> */
> if (hpage_collapse_test_exit(mm) || !vma) {
> + bool maybe_collapse = khugepaged_scan.maybe_collapse;
> +
> + if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
> + maybe_collapse = true;
> +
> /*
> * Make sure that if mm_users is reaching zero while
> * khugepaged runs here, khugepaged_exit will find
> @@ -2508,12 +2522,13 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> if (!list_is_last(&slot->mm_node, &khugepaged_scan.mm_head)) {
> khugepaged_scan.mm_slot = list_next_entry(slot, mm_node);
> khugepaged_scan.address = 0;
> + khugepaged_scan.maybe_collapse = false;
> } else {
> khugepaged_scan.mm_slot = NULL;
> khugepaged_full_scans++;
> }
>
> - collect_mm_slot(slot);
> + collect_mm_slot(slot, maybe_collapse);
> }
>
> trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
> @@ -2616,7 +2631,7 @@ static int khugepaged(void *none)
> slot = khugepaged_scan.mm_slot;
> khugepaged_scan.mm_slot = NULL;
> if (slot)
> - collect_mm_slot(slot);
> + collect_mm_slot(slot, true);
> spin_unlock(&khugepaged_mm_lock);
> return 0;
> }
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-15 9:04 ` [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE Vernon Yang
@ 2025-12-15 21:12 ` kernel test robot
2025-12-16 7:00 ` Vernon Yang
2025-12-16 13:08 ` kernel test robot
` (2 subsequent siblings)
3 siblings, 1 reply; 42+ messages in thread
From: kernel test robot @ 2025-12-15 21:12 UTC (permalink / raw)
To: Vernon Yang, akpm, david, lorenzo.stoakes
Cc: oe-kbuild-all, ziy, npache, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
Hi Vernon,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.19-rc1 next-20251215]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Vernon-Yang/mm-khugepaged-add-trace_mm_khugepaged_scan-event/20251215-171046
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20251215090419.174418-4-yanglincheng%40kylinos.cn
patch subject: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
config: arc-allnoconfig (https://download.01.org/0day-ci/archive/20251216/202512160400.pTmarqg6-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251216/202512160400.pTmarqg6-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512160400.pTmarqg6-lkp@intel.com/
All errors (new ones prefixed by >>):
mm/madvise.c: In function 'madvise_cold':
>> mm/madvise.c:609:9: error: implicit declaration of function 'khugepaged_move_tail'; did you mean 'khugepaged_exit'? [-Wimplicit-function-declaration]
609 | khugepaged_move_tail(vma->vm_mm);
| ^~~~~~~~~~~~~~~~~~~~
| khugepaged_exit
vim +609 mm/madvise.c
595
596 static long madvise_cold(struct madvise_behavior *madv_behavior)
597 {
598 struct vm_area_struct *vma = madv_behavior->vma;
599 struct mmu_gather tlb;
600
601 if (!can_madv_lru_vma(vma))
602 return -EINVAL;
603
604 lru_add_drain();
605 tlb_gather_mmu(&tlb, madv_behavior->mm);
606 madvise_cold_page_range(&tlb, madv_behavior);
607 tlb_finish_mmu(&tlb);
608
> 609 khugepaged_move_tail(vma->vm_mm);
610
611 return 0;
612 }
613
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-15 9:04 ` [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed Vernon Yang
2025-12-15 11:52 ` Lance Yang
@ 2025-12-15 21:45 ` kernel test robot
2025-12-16 6:30 ` Vernon Yang
2025-12-15 23:01 ` kernel test robot
` (3 subsequent siblings)
5 siblings, 1 reply; 42+ messages in thread
From: kernel test robot @ 2025-12-15 21:45 UTC (permalink / raw)
To: Vernon Yang, akpm, david, lorenzo.stoakes
Cc: llvm, oe-kbuild-all, ziy, npache, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
Hi Vernon,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.19-rc1 next-20251215]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Vernon-Yang/mm-khugepaged-add-trace_mm_khugepaged_scan-event/20251215-171046
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20251215090419.174418-3-yanglincheng%40kylinos.cn
patch subject: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20251216/202512160533.KuHwyJTP-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251216/202512160533.KuHwyJTP-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512160533.KuHwyJTP-lkp@intel.com/
All errors (new ones prefixed by >>):
>> mm/khugepaged.c:2490:9: error: use of undeclared identifier 'SCAN_PMD_NULL'; did you mean 'SCAN_VMA_NULL'?
2490 | case SCAN_PMD_NULL:
| ^~~~~~~~~~~~~
| SCAN_VMA_NULL
mm/khugepaged.c:50:2: note: 'SCAN_VMA_NULL' declared here
50 | SCAN_VMA_NULL,
| ^
>> mm/khugepaged.c:2491:9: error: use of undeclared identifier 'SCAN_PMD_NONE'
2491 | case SCAN_PMD_NONE:
| ^
2 errors generated.
vim +2490 mm/khugepaged.c
2392
2393 static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
2394 struct collapse_control *cc)
2395 __releases(&khugepaged_mm_lock)
2396 __acquires(&khugepaged_mm_lock)
2397 {
2398 struct vma_iterator vmi;
2399 struct mm_slot *slot;
2400 struct mm_struct *mm;
2401 struct vm_area_struct *vma;
2402 int progress = 0;
2403
2404 VM_BUG_ON(!pages);
2405 lockdep_assert_held(&khugepaged_mm_lock);
2406 *result = SCAN_FAIL;
2407
2408 if (khugepaged_scan.mm_slot) {
2409 slot = khugepaged_scan.mm_slot;
2410 } else {
2411 slot = list_first_entry(&khugepaged_scan.mm_head,
2412 struct mm_slot, mm_node);
2413 khugepaged_scan.address = 0;
2414 khugepaged_scan.mm_slot = slot;
2415 khugepaged_scan.maybe_collapse = false;
2416 }
2417 spin_unlock(&khugepaged_mm_lock);
2418
2419 mm = slot->mm;
2420 /*
2421 * Don't wait for semaphore (to avoid long wait times). Just move to
2422 * the next mm on the list.
2423 */
2424 vma = NULL;
2425 if (unlikely(!mmap_read_trylock(mm)))
2426 goto breakouterloop_mmap_lock;
2427
2428 progress++;
2429 if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
2430 goto breakouterloop;
2431
2432 vma_iter_init(&vmi, mm, khugepaged_scan.address);
2433 for_each_vma(vmi, vma) {
2434 unsigned long hstart, hend;
2435
2436 cond_resched();
2437 if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
2438 progress++;
2439 break;
2440 }
2441 if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
2442 skip:
2443 progress++;
2444 continue;
2445 }
2446 hstart = round_up(vma->vm_start, HPAGE_PMD_SIZE);
2447 hend = round_down(vma->vm_end, HPAGE_PMD_SIZE);
2448 if (khugepaged_scan.address > hend)
2449 goto skip;
2450 if (khugepaged_scan.address < hstart)
2451 khugepaged_scan.address = hstart;
2452 VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
2453
2454 while (khugepaged_scan.address < hend) {
2455 bool mmap_locked = true;
2456
2457 cond_resched();
2458 if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
2459 goto breakouterloop;
2460
2461 VM_BUG_ON(khugepaged_scan.address < hstart ||
2462 khugepaged_scan.address + HPAGE_PMD_SIZE >
2463 hend);
2464 if (!vma_is_anonymous(vma)) {
2465 struct file *file = get_file(vma->vm_file);
2466 pgoff_t pgoff = linear_page_index(vma,
2467 khugepaged_scan.address);
2468
2469 mmap_read_unlock(mm);
2470 mmap_locked = false;
2471 *result = hpage_collapse_scan_file(mm,
2472 khugepaged_scan.address, file, pgoff, cc);
2473 fput(file);
2474 if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
2475 mmap_read_lock(mm);
2476 if (hpage_collapse_test_exit_or_disable(mm))
2477 goto breakouterloop;
2478 *result = collapse_pte_mapped_thp(mm,
2479 khugepaged_scan.address, false);
2480 if (*result == SCAN_PMD_MAPPED)
2481 *result = SCAN_SUCCEED;
2482 mmap_read_unlock(mm);
2483 }
2484 } else {
2485 *result = hpage_collapse_scan_pmd(mm, vma,
2486 khugepaged_scan.address, &mmap_locked, cc);
2487 }
2488
2489 switch (*result) {
> 2490 case SCAN_PMD_NULL:
> 2491 case SCAN_PMD_NONE:
2492 case SCAN_PMD_MAPPED:
2493 case SCAN_PTE_MAPPED_HUGEPAGE:
2494 break;
2495 case SCAN_SUCCEED:
2496 ++khugepaged_pages_collapsed;
2497 fallthrough;
2498 default:
2499 khugepaged_scan.maybe_collapse = true;
2500 }
2501
2502 /* move to next address */
2503 khugepaged_scan.address += HPAGE_PMD_SIZE;
2504 progress += HPAGE_PMD_NR;
2505 if (!mmap_locked)
2506 /*
2507 * We released mmap_lock so break loop. Note
2508 * that we drop mmap_lock before all hugepage
2509 * allocations, so if allocation fails, we are
2510 * guaranteed to break here and report the
2511 * correct result back to caller.
2512 */
2513 goto breakouterloop_mmap_lock;
2514 if (progress >= pages)
2515 goto breakouterloop;
2516 }
2517 }
2518 breakouterloop:
2519 mmap_read_unlock(mm); /* exit_mmap will destroy ptes after this */
2520 breakouterloop_mmap_lock:
2521
2522 spin_lock(&khugepaged_mm_lock);
2523 VM_BUG_ON(khugepaged_scan.mm_slot != slot);
2524 /*
2525 * Release the current mm_slot if this mm is about to die, or
2526 * if we scanned all vmas of this mm.
2527 */
2528 if (hpage_collapse_test_exit(mm) || !vma) {
2529 bool maybe_collapse = khugepaged_scan.maybe_collapse;
2530
2531 if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
2532 maybe_collapse = true;
2533
2534 /*
2535 * Make sure that if mm_users is reaching zero while
2536 * khugepaged runs here, khugepaged_exit will find
2537 * mm_slot not pointing to the exiting mm.
2538 */
2539 if (!list_is_last(&slot->mm_node, &khugepaged_scan.mm_head)) {
2540 khugepaged_scan.mm_slot = list_next_entry(slot, mm_node);
2541 khugepaged_scan.address = 0;
2542 khugepaged_scan.maybe_collapse = false;
2543 } else {
2544 khugepaged_scan.mm_slot = NULL;
2545 khugepaged_full_scans++;
2546 }
2547
2548 collect_mm_slot(slot, maybe_collapse);
2549 }
2550
2551 trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
2552
2553 return progress;
2554 }
2555
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-15 9:04 ` [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed Vernon Yang
2025-12-15 11:52 ` Lance Yang
2025-12-15 21:45 ` kernel test robot
@ 2025-12-15 23:01 ` kernel test robot
2025-12-16 6:32 ` Vernon Yang
2025-12-17 3:31 ` Wei Yang
` (2 subsequent siblings)
5 siblings, 1 reply; 42+ messages in thread
From: kernel test robot @ 2025-12-15 23:01 UTC (permalink / raw)
To: Vernon Yang, akpm, david, lorenzo.stoakes
Cc: oe-kbuild-all, ziy, npache, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
Hi Vernon,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.19-rc1 next-20251215]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Vernon-Yang/mm-khugepaged-add-trace_mm_khugepaged_scan-event/20251215-171046
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20251215090419.174418-3-yanglincheng%40kylinos.cn
patch subject: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
config: x86_64-rhel-9.4 (https://download.01.org/0day-ci/archive/20251216/202512160619.3Ut4sxaJ-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251216/202512160619.3Ut4sxaJ-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512160619.3Ut4sxaJ-lkp@intel.com/
All errors (new ones prefixed by >>):
mm/khugepaged.c: In function 'khugepaged_scan_mm_slot':
>> mm/khugepaged.c:2490:30: error: 'SCAN_PMD_NULL' undeclared (first use in this function); did you mean 'SCAN_VMA_NULL'?
2490 | case SCAN_PMD_NULL:
| ^~~~~~~~~~~~~
| SCAN_VMA_NULL
mm/khugepaged.c:2490:30: note: each undeclared identifier is reported only once for each function it appears in
>> mm/khugepaged.c:2491:30: error: 'SCAN_PMD_NONE' undeclared (first use in this function)
2491 | case SCAN_PMD_NONE:
| ^~~~~~~~~~~~~
vim +2490 mm/khugepaged.c
2392
2393 static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
2394 struct collapse_control *cc)
2395 __releases(&khugepaged_mm_lock)
2396 __acquires(&khugepaged_mm_lock)
2397 {
2398 struct vma_iterator vmi;
2399 struct mm_slot *slot;
2400 struct mm_struct *mm;
2401 struct vm_area_struct *vma;
2402 int progress = 0;
2403
2404 VM_BUG_ON(!pages);
2405 lockdep_assert_held(&khugepaged_mm_lock);
2406 *result = SCAN_FAIL;
2407
2408 if (khugepaged_scan.mm_slot) {
2409 slot = khugepaged_scan.mm_slot;
2410 } else {
2411 slot = list_first_entry(&khugepaged_scan.mm_head,
2412 struct mm_slot, mm_node);
2413 khugepaged_scan.address = 0;
2414 khugepaged_scan.mm_slot = slot;
2415 khugepaged_scan.maybe_collapse = false;
2416 }
2417 spin_unlock(&khugepaged_mm_lock);
2418
2419 mm = slot->mm;
2420 /*
2421 * Don't wait for semaphore (to avoid long wait times). Just move to
2422 * the next mm on the list.
2423 */
2424 vma = NULL;
2425 if (unlikely(!mmap_read_trylock(mm)))
2426 goto breakouterloop_mmap_lock;
2427
2428 progress++;
2429 if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
2430 goto breakouterloop;
2431
2432 vma_iter_init(&vmi, mm, khugepaged_scan.address);
2433 for_each_vma(vmi, vma) {
2434 unsigned long hstart, hend;
2435
2436 cond_resched();
2437 if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
2438 progress++;
2439 break;
2440 }
2441 if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
2442 skip:
2443 progress++;
2444 continue;
2445 }
2446 hstart = round_up(vma->vm_start, HPAGE_PMD_SIZE);
2447 hend = round_down(vma->vm_end, HPAGE_PMD_SIZE);
2448 if (khugepaged_scan.address > hend)
2449 goto skip;
2450 if (khugepaged_scan.address < hstart)
2451 khugepaged_scan.address = hstart;
2452 VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
2453
2454 while (khugepaged_scan.address < hend) {
2455 bool mmap_locked = true;
2456
2457 cond_resched();
2458 if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
2459 goto breakouterloop;
2460
2461 VM_BUG_ON(khugepaged_scan.address < hstart ||
2462 khugepaged_scan.address + HPAGE_PMD_SIZE >
2463 hend);
2464 if (!vma_is_anonymous(vma)) {
2465 struct file *file = get_file(vma->vm_file);
2466 pgoff_t pgoff = linear_page_index(vma,
2467 khugepaged_scan.address);
2468
2469 mmap_read_unlock(mm);
2470 mmap_locked = false;
2471 *result = hpage_collapse_scan_file(mm,
2472 khugepaged_scan.address, file, pgoff, cc);
2473 fput(file);
2474 if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
2475 mmap_read_lock(mm);
2476 if (hpage_collapse_test_exit_or_disable(mm))
2477 goto breakouterloop;
2478 *result = collapse_pte_mapped_thp(mm,
2479 khugepaged_scan.address, false);
2480 if (*result == SCAN_PMD_MAPPED)
2481 *result = SCAN_SUCCEED;
2482 mmap_read_unlock(mm);
2483 }
2484 } else {
2485 *result = hpage_collapse_scan_pmd(mm, vma,
2486 khugepaged_scan.address, &mmap_locked, cc);
2487 }
2488
2489 switch (*result) {
> 2490 case SCAN_PMD_NULL:
> 2491 case SCAN_PMD_NONE:
2492 case SCAN_PMD_MAPPED:
2493 case SCAN_PTE_MAPPED_HUGEPAGE:
2494 break;
2495 case SCAN_SUCCEED:
2496 ++khugepaged_pages_collapsed;
2497 fallthrough;
2498 default:
2499 khugepaged_scan.maybe_collapse = true;
2500 }
2501
2502 /* move to next address */
2503 khugepaged_scan.address += HPAGE_PMD_SIZE;
2504 progress += HPAGE_PMD_NR;
2505 if (!mmap_locked)
2506 /*
2507 * We released mmap_lock so break loop. Note
2508 * that we drop mmap_lock before all hugepage
2509 * allocations, so if allocation fails, we are
2510 * guaranteed to break here and report the
2511 * correct result back to caller.
2512 */
2513 goto breakouterloop_mmap_lock;
2514 if (progress >= pages)
2515 goto breakouterloop;
2516 }
2517 }
2518 breakouterloop:
2519 mmap_read_unlock(mm); /* exit_mmap will destroy ptes after this */
2520 breakouterloop_mmap_lock:
2521
2522 spin_lock(&khugepaged_mm_lock);
2523 VM_BUG_ON(khugepaged_scan.mm_slot != slot);
2524 /*
2525 * Release the current mm_slot if this mm is about to die, or
2526 * if we scanned all vmas of this mm.
2527 */
2528 if (hpage_collapse_test_exit(mm) || !vma) {
2529 bool maybe_collapse = khugepaged_scan.maybe_collapse;
2530
2531 if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
2532 maybe_collapse = true;
2533
2534 /*
2535 * Make sure that if mm_users is reaching zero while
2536 * khugepaged runs here, khugepaged_exit will find
2537 * mm_slot not pointing to the exiting mm.
2538 */
2539 if (!list_is_last(&slot->mm_node, &khugepaged_scan.mm_head)) {
2540 khugepaged_scan.mm_slot = list_next_entry(slot, mm_node);
2541 khugepaged_scan.address = 0;
2542 khugepaged_scan.maybe_collapse = false;
2543 } else {
2544 khugepaged_scan.mm_slot = NULL;
2545 khugepaged_full_scans++;
2546 }
2547
2548 collect_mm_slot(slot, maybe_collapse);
2549 }
2550
2551 trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
2552
2553 return progress;
2554 }
2555
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-15 11:52 ` Lance Yang
@ 2025-12-16 6:27 ` Vernon Yang
0 siblings, 0 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-16 6:27 UTC (permalink / raw)
To: Lance Yang
Cc: ziy, baohua, linux-mm, linux-kernel, Vernon Yang, akpm,
lorenzo.stoakes, david
On Mon, Dec 15, 2025 at 07:52:41PM +0800, Lance Yang wrote:
> Hi Vernon,
>
> Thanks for the patches!
>
> On 2025/12/15 17:04, Vernon Yang wrote:
> > The following data is traced by bpftrace on a desktop system. After
> > the system has been left idle for 10 minutes upon booting, a lot of
> > SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
> > khugepaged.
> >
> > @scan_pmd_status[1]: 1 ## SCAN_SUCCEED
> > @scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
> > @scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
> > total progress size: 701 MB
> > Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
> >
> > The khugepaged_scan list save all task that support collapse into hugepage,
> > as long as the take is not destroyed, khugepaged will not remove it from
>
> Nit: s/take/task/
Thanks, I'll fix it in the next version.
> > the khugepaged_scan list. This exist a phenomenon where task has already
> > collapsed all memory regions into hugepage, but khugepaged continues to
> > scan it, which wastes CPU time and invalid, and due to
> > khugepaged_scan_sleep_millisecs (default 10s) causes a long wait for
> > scanning a large number of invalid task, so scanning really valid task
> > is later.
> >
> > After applying this patch, when all memory is either SCAN_PMD_MAPPED or
> > SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
> > list. If the page fault or MADV_HUGEPAGE again, it is added back to
> > khugepaged.
> >
> > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > ---
> > mm/khugepaged.c | 35 +++++++++++++++++++++++++----------
> > 1 file changed, 25 insertions(+), 10 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 0598a19a98cc..1ec1af5be3c8 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -115,6 +115,7 @@ struct khugepaged_scan {
> > struct list_head mm_head;
> > struct mm_slot *mm_slot;
> > unsigned long address;
> > + bool maybe_collapse;
>
> At a quick glance, the name of "maybe_collapse" is a bit ambiguous ...
>
> Perhaps "scan_needed" or "collapse_possible" would be clearer to
> indicate that the mm should be kept in the scan list?
The "collapse_possible" sounds good to me, Thanks! I will do it in the
next version.
--
Thanks,
Vernon
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-15 21:45 ` kernel test robot
@ 2025-12-16 6:30 ` Vernon Yang
0 siblings, 0 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-16 6:30 UTC (permalink / raw)
To: kernel test robot
Cc: akpm, david, lorenzo.stoakes, llvm, oe-kbuild-all, ziy, baohua,
lance.yang, linux-mm, linux-kernel, Vernon Yang
On Tue, Dec 16, 2025 at 05:45:31AM +0800, kernel test robot wrote:
> Hi Vernon,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on akpm-mm/mm-everything]
> [also build test ERROR on linus/master v6.19-rc1 next-20251215]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Vernon-Yang/mm-khugepaged-add-trace_mm_khugepaged_scan-event/20251215-171046
> base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/r/20251215090419.174418-3-yanglincheng%40kylinos.cn
> patch subject: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
> config: x86_64-kexec (https://download.01.org/0day-ci/archive/20251216/202512160533.KuHwyJTP-lkp@intel.com/config)
> compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251216/202512160533.KuHwyJTP-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202512160533.KuHwyJTP-lkp@intel.com/
>
> All errors (new ones prefixed by >>):
>
> >> mm/khugepaged.c:2490:9: error: use of undeclared identifier 'SCAN_PMD_NULL'; did you mean 'SCAN_VMA_NULL'?
> 2490 | case SCAN_PMD_NULL:
> | ^~~~~~~~~~~~~
> | SCAN_VMA_NULL
> mm/khugepaged.c:50:2: note: 'SCAN_VMA_NULL' declared here
> 50 | SCAN_VMA_NULL,
> | ^
> >> mm/khugepaged.c:2491:9: error: use of undeclared identifier 'SCAN_PMD_NONE'
> 2491 | case SCAN_PMD_NONE:
> | ^
> 2 errors generated.
This series is based on Linux v6.18; v6.19-rc1 added "mm/khugepaged:
unify SCAN_PMD_NONE and SCAN_PMD_NULL into SCAN_NO_PTE_TABLE"[1], which
triggers these build errors. I'll fix it in the next version, thanks!
[1] https://lkml.kernel.org/r/20251114030028.7035-4-richard.weiyang@gmail.com
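For what it's worth, on top of that change the switch would presumably
become something like this (untested sketch against the unified status
code):

	switch (*result) {
	case SCAN_NO_PTE_TABLE:	/* was SCAN_PMD_NULL/SCAN_PMD_NONE */
	case SCAN_PMD_MAPPED:
	case SCAN_PTE_MAPPED_HUGEPAGE:
		break;
	case SCAN_SUCCEED:
		++khugepaged_pages_collapsed;
		fallthrough;
	default:
		khugepaged_scan.maybe_collapse = true;
	}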
>
> vim +2490 mm/khugepaged.c
>
> 2392
> 2393 static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> 2394 struct collapse_control *cc)
> 2395 __releases(&khugepaged_mm_lock)
> 2396 __acquires(&khugepaged_mm_lock)
> 2397 {
> 2398 struct vma_iterator vmi;
> 2399 struct mm_slot *slot;
> 2400 struct mm_struct *mm;
> 2401 struct vm_area_struct *vma;
> 2402 int progress = 0;
> 2403
> 2404 VM_BUG_ON(!pages);
> 2405 lockdep_assert_held(&khugepaged_mm_lock);
> 2406 *result = SCAN_FAIL;
> 2407
> 2408 if (khugepaged_scan.mm_slot) {
> 2409 slot = khugepaged_scan.mm_slot;
> 2410 } else {
> 2411 slot = list_first_entry(&khugepaged_scan.mm_head,
> 2412 struct mm_slot, mm_node);
> 2413 khugepaged_scan.address = 0;
> 2414 khugepaged_scan.mm_slot = slot;
> 2415 khugepaged_scan.maybe_collapse = false;
> 2416 }
> 2417 spin_unlock(&khugepaged_mm_lock);
> 2418
> 2419 mm = slot->mm;
> 2420 /*
> 2421 * Don't wait for semaphore (to avoid long wait times). Just move to
> 2422 * the next mm on the list.
> 2423 */
> 2424 vma = NULL;
> 2425 if (unlikely(!mmap_read_trylock(mm)))
> 2426 goto breakouterloop_mmap_lock;
> 2427
> 2428 progress++;
> 2429 if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> 2430 goto breakouterloop;
> 2431
> 2432 vma_iter_init(&vmi, mm, khugepaged_scan.address);
> 2433 for_each_vma(vmi, vma) {
> 2434 unsigned long hstart, hend;
> 2435
> 2436 cond_resched();
> 2437 if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
> 2438 progress++;
> 2439 break;
> 2440 }
> 2441 if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
> 2442 skip:
> 2443 progress++;
> 2444 continue;
> 2445 }
> 2446 hstart = round_up(vma->vm_start, HPAGE_PMD_SIZE);
> 2447 hend = round_down(vma->vm_end, HPAGE_PMD_SIZE);
> 2448 if (khugepaged_scan.address > hend)
> 2449 goto skip;
> 2450 if (khugepaged_scan.address < hstart)
> 2451 khugepaged_scan.address = hstart;
> 2452 VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
> 2453
> 2454 while (khugepaged_scan.address < hend) {
> 2455 bool mmap_locked = true;
> 2456
> 2457 cond_resched();
> 2458 if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> 2459 goto breakouterloop;
> 2460
> 2461 VM_BUG_ON(khugepaged_scan.address < hstart ||
> 2462 khugepaged_scan.address + HPAGE_PMD_SIZE >
> 2463 hend);
> 2464 if (!vma_is_anonymous(vma)) {
> 2465 struct file *file = get_file(vma->vm_file);
> 2466 pgoff_t pgoff = linear_page_index(vma,
> 2467 khugepaged_scan.address);
> 2468
> 2469 mmap_read_unlock(mm);
> 2470 mmap_locked = false;
> 2471 *result = hpage_collapse_scan_file(mm,
> 2472 khugepaged_scan.address, file, pgoff, cc);
> 2473 fput(file);
> 2474 if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> 2475 mmap_read_lock(mm);
> 2476 if (hpage_collapse_test_exit_or_disable(mm))
> 2477 goto breakouterloop;
> 2478 *result = collapse_pte_mapped_thp(mm,
> 2479 khugepaged_scan.address, false);
> 2480 if (*result == SCAN_PMD_MAPPED)
> 2481 *result = SCAN_SUCCEED;
> 2482 mmap_read_unlock(mm);
> 2483 }
> 2484 } else {
> 2485 *result = hpage_collapse_scan_pmd(mm, vma,
> 2486 khugepaged_scan.address, &mmap_locked, cc);
> 2487 }
> 2488
> 2489 switch (*result) {
> > 2490 case SCAN_PMD_NULL:
> > 2491 case SCAN_PMD_NONE:
> 2492 case SCAN_PMD_MAPPED:
> 2493 case SCAN_PTE_MAPPED_HUGEPAGE:
> 2494 break;
> 2495 case SCAN_SUCCEED:
> 2496 ++khugepaged_pages_collapsed;
> 2497 fallthrough;
> 2498 default:
> 2499 khugepaged_scan.maybe_collapse = true;
> 2500 }
> 2501
> 2502 /* move to next address */
> 2503 khugepaged_scan.address += HPAGE_PMD_SIZE;
> 2504 progress += HPAGE_PMD_NR;
> 2505 if (!mmap_locked)
> 2506 /*
> 2507 * We released mmap_lock so break loop. Note
> 2508 * that we drop mmap_lock before all hugepage
> 2509 * allocations, so if allocation fails, we are
> 2510 * guaranteed to break here and report the
> 2511 * correct result back to caller.
> 2512 */
> 2513 goto breakouterloop_mmap_lock;
> 2514 if (progress >= pages)
> 2515 goto breakouterloop;
> 2516 }
> 2517 }
> 2518 breakouterloop:
> 2519 mmap_read_unlock(mm); /* exit_mmap will destroy ptes after this */
> 2520 breakouterloop_mmap_lock:
> 2521
> 2522 spin_lock(&khugepaged_mm_lock);
> 2523 VM_BUG_ON(khugepaged_scan.mm_slot != slot);
> 2524 /*
> 2525 * Release the current mm_slot if this mm is about to die, or
> 2526 * if we scanned all vmas of this mm.
> 2527 */
> 2528 if (hpage_collapse_test_exit(mm) || !vma) {
> 2529 bool maybe_collapse = khugepaged_scan.maybe_collapse;
> 2530
> 2531 if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
> 2532 maybe_collapse = true;
> 2533
> 2534 /*
> 2535 * Make sure that if mm_users is reaching zero while
> 2536 * khugepaged runs here, khugepaged_exit will find
> 2537 * mm_slot not pointing to the exiting mm.
> 2538 */
> 2539 if (!list_is_last(&slot->mm_node, &khugepaged_scan.mm_head)) {
> 2540 khugepaged_scan.mm_slot = list_next_entry(slot, mm_node);
> 2541 khugepaged_scan.address = 0;
> 2542 khugepaged_scan.maybe_collapse = false;
> 2543 } else {
> 2544 khugepaged_scan.mm_slot = NULL;
> 2545 khugepaged_full_scans++;
> 2546 }
> 2547
> 2548 collect_mm_slot(slot, maybe_collapse);
> 2549 }
> 2550
> 2551 trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
> 2552
> 2553 return progress;
> 2554 }
> 2555
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
--
Thanks,
Vernon
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-15 23:01 ` kernel test robot
@ 2025-12-16 6:32 ` Vernon Yang
0 siblings, 0 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-16 6:32 UTC (permalink / raw)
To: kernel test robot
Cc: akpm, david, lorenzo.stoakes, oe-kbuild-all, ziy, baohua,
lance.yang, linux-mm, linux-kernel, Vernon Yang
On Tue, Dec 16, 2025 at 07:01:18AM +0800, kernel test robot wrote:
> Hi Vernon,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on akpm-mm/mm-everything]
> [also build test ERROR on linus/master v6.19-rc1 next-20251215]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Vernon-Yang/mm-khugepaged-add-trace_mm_khugepaged_scan-event/20251215-171046
> base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/r/20251215090419.174418-3-yanglincheng%40kylinos.cn
> patch subject: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
> config: x86_64-rhel-9.4 (https://download.01.org/0day-ci/archive/20251216/202512160619.3Ut4sxaJ-lkp@intel.com/config)
> compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251216/202512160619.3Ut4sxaJ-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202512160619.3Ut4sxaJ-lkp@intel.com/
>
> All errors (new ones prefixed by >>):
>
> mm/khugepaged.c: In function 'khugepaged_scan_mm_slot':
> >> mm/khugepaged.c:2490:30: error: 'SCAN_PMD_NULL' undeclared (first use in this function); did you mean 'SCAN_VMA_NULL'?
> 2490 | case SCAN_PMD_NULL:
> | ^~~~~~~~~~~~~
> | SCAN_VMA_NULL
> mm/khugepaged.c:2490:30: note: each undeclared identifier is reported only once for each function it appears in
> >> mm/khugepaged.c:2491:30: error: 'SCAN_PMD_NONE' undeclared (first use in this function)
> 2491 | case SCAN_PMD_NONE:
> | ^~~~~~~~~~~~~
Same as above, thanks.
>
> vim +2490 mm/khugepaged.c
>
> 2392
> 2393 static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> 2394 struct collapse_control *cc)
> 2395 __releases(&khugepaged_mm_lock)
> 2396 __acquires(&khugepaged_mm_lock)
> 2397 {
> 2398 struct vma_iterator vmi;
> 2399 struct mm_slot *slot;
> 2400 struct mm_struct *mm;
> 2401 struct vm_area_struct *vma;
> 2402 int progress = 0;
> 2403
> 2404 VM_BUG_ON(!pages);
> 2405 lockdep_assert_held(&khugepaged_mm_lock);
> 2406 *result = SCAN_FAIL;
> 2407
> 2408 if (khugepaged_scan.mm_slot) {
> 2409 slot = khugepaged_scan.mm_slot;
> 2410 } else {
> 2411 slot = list_first_entry(&khugepaged_scan.mm_head,
> 2412 struct mm_slot, mm_node);
> 2413 khugepaged_scan.address = 0;
> 2414 khugepaged_scan.mm_slot = slot;
> 2415 khugepaged_scan.maybe_collapse = false;
> 2416 }
> 2417 spin_unlock(&khugepaged_mm_lock);
> 2418
> 2419 mm = slot->mm;
> 2420 /*
> 2421 * Don't wait for semaphore (to avoid long wait times). Just move to
> 2422 * the next mm on the list.
> 2423 */
> 2424 vma = NULL;
> 2425 if (unlikely(!mmap_read_trylock(mm)))
> 2426 goto breakouterloop_mmap_lock;
> 2427
> 2428 progress++;
> 2429 if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> 2430 goto breakouterloop;
> 2431
> 2432 vma_iter_init(&vmi, mm, khugepaged_scan.address);
> 2433 for_each_vma(vmi, vma) {
> 2434 unsigned long hstart, hend;
> 2435
> 2436 cond_resched();
> 2437 if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
> 2438 progress++;
> 2439 break;
> 2440 }
> 2441 if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
> 2442 skip:
> 2443 progress++;
> 2444 continue;
> 2445 }
> 2446 hstart = round_up(vma->vm_start, HPAGE_PMD_SIZE);
> 2447 hend = round_down(vma->vm_end, HPAGE_PMD_SIZE);
> 2448 if (khugepaged_scan.address > hend)
> 2449 goto skip;
> 2450 if (khugepaged_scan.address < hstart)
> 2451 khugepaged_scan.address = hstart;
> 2452 VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
> 2453
> 2454 while (khugepaged_scan.address < hend) {
> 2455 bool mmap_locked = true;
> 2456
> 2457 cond_resched();
> 2458 if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> 2459 goto breakouterloop;
> 2460
> 2461 VM_BUG_ON(khugepaged_scan.address < hstart ||
> 2462 khugepaged_scan.address + HPAGE_PMD_SIZE >
> 2463 hend);
> 2464 if (!vma_is_anonymous(vma)) {
> 2465 struct file *file = get_file(vma->vm_file);
> 2466 pgoff_t pgoff = linear_page_index(vma,
> 2467 khugepaged_scan.address);
> 2468
> 2469 mmap_read_unlock(mm);
> 2470 mmap_locked = false;
> 2471 *result = hpage_collapse_scan_file(mm,
> 2472 khugepaged_scan.address, file, pgoff, cc);
> 2473 fput(file);
> 2474 if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> 2475 mmap_read_lock(mm);
> 2476 if (hpage_collapse_test_exit_or_disable(mm))
> 2477 goto breakouterloop;
> 2478 *result = collapse_pte_mapped_thp(mm,
> 2479 khugepaged_scan.address, false);
> 2480 if (*result == SCAN_PMD_MAPPED)
> 2481 *result = SCAN_SUCCEED;
> 2482 mmap_read_unlock(mm);
> 2483 }
> 2484 } else {
> 2485 *result = hpage_collapse_scan_pmd(mm, vma,
> 2486 khugepaged_scan.address, &mmap_locked, cc);
> 2487 }
> 2488
> 2489 switch (*result) {
> > 2490 case SCAN_PMD_NULL:
> > 2491 case SCAN_PMD_NONE:
> 2492 case SCAN_PMD_MAPPED:
> 2493 case SCAN_PTE_MAPPED_HUGEPAGE:
> 2494 break;
> 2495 case SCAN_SUCCEED:
> 2496 ++khugepaged_pages_collapsed;
> 2497 fallthrough;
> 2498 default:
> 2499 khugepaged_scan.maybe_collapse = true;
> 2500 }
> 2501
> 2502 /* move to next address */
> 2503 khugepaged_scan.address += HPAGE_PMD_SIZE;
> 2504 progress += HPAGE_PMD_NR;
> 2505 if (!mmap_locked)
> 2506 /*
> 2507 * We released mmap_lock so break loop. Note
> 2508 * that we drop mmap_lock before all hugepage
> 2509 * allocations, so if allocation fails, we are
> 2510 * guaranteed to break here and report the
> 2511 * correct result back to caller.
> 2512 */
> 2513 goto breakouterloop_mmap_lock;
> 2514 if (progress >= pages)
> 2515 goto breakouterloop;
> 2516 }
> 2517 }
> 2518 breakouterloop:
> 2519 mmap_read_unlock(mm); /* exit_mmap will destroy ptes after this */
> 2520 breakouterloop_mmap_lock:
> 2521
> 2522 spin_lock(&khugepaged_mm_lock);
> 2523 VM_BUG_ON(khugepaged_scan.mm_slot != slot);
> 2524 /*
> 2525 * Release the current mm_slot if this mm is about to die, or
> 2526 * if we scanned all vmas of this mm.
> 2527 */
> 2528 if (hpage_collapse_test_exit(mm) || !vma) {
> 2529 bool maybe_collapse = khugepaged_scan.maybe_collapse;
> 2530
> 2531 if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
> 2532 maybe_collapse = true;
> 2533
> 2534 /*
> 2535 * Make sure that if mm_users is reaching zero while
> 2536 * khugepaged runs here, khugepaged_exit will find
> 2537 * mm_slot not pointing to the exiting mm.
> 2538 */
> 2539 if (!list_is_last(&slot->mm_node, &khugepaged_scan.mm_head)) {
> 2540 khugepaged_scan.mm_slot = list_next_entry(slot, mm_node);
> 2541 khugepaged_scan.address = 0;
> 2542 khugepaged_scan.maybe_collapse = false;
> 2543 } else {
> 2544 khugepaged_scan.mm_slot = NULL;
> 2545 khugepaged_full_scans++;
> 2546 }
> 2547
> 2548 collect_mm_slot(slot, maybe_collapse);
> 2549 }
> 2550
> 2551 trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
> 2552
> 2553 return progress;
> 2554 }
> 2555
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-15 21:12 ` kernel test robot
@ 2025-12-16 7:00 ` Vernon Yang
0 siblings, 0 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-16 7:00 UTC (permalink / raw)
To: kernel test robot
Cc: akpm, david, lorenzo.stoakes, oe-kbuild-all, ziy, baohua,
lance.yang, linux-mm, linux-kernel, Vernon Yang
On Tue, Dec 16, 2025 at 05:12:16AM +0800, kernel test robot wrote:
> Hi Vernon,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on akpm-mm/mm-everything]
> [also build test ERROR on linus/master v6.19-rc1 next-20251215]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Vernon-Yang/mm-khugepaged-add-trace_mm_khugepaged_scan-event/20251215-171046
> base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/r/20251215090419.174418-4-yanglincheng%40kylinos.cn
> patch subject: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
> config: arc-allnoconfig (https://download.01.org/0day-ci/archive/20251216/202512160400.pTmarqg6-lkp@intel.com/config)
> compiler: arc-linux-gcc (GCC) 15.1.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251216/202512160400.pTmarqg6-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202512160400.pTmarqg6-lkp@intel.com/
>
> All errors (new ones prefixed by >>):
>
> mm/madvise.c: In function 'madvise_cold':
> >> mm/madvise.c:609:9: error: implicit declaration of function 'khugepaged_move_tail'; did you mean 'khugepaged_exit'? [-Wimplicit-function-declaration]
> 609 | khugepaged_move_tail(vma->vm_mm);
> | ^~~~~~~~~~~~~~~~~~~~
> | khugepaged_exit
This build error is triggered when CONFIG_TRANSPARENT_HUGEPAGE is disabled.
I'll fix it in the next version, thanks!
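For reference, a minimal sketch of the usual fix pattern, assuming the
declaration lives in include/linux/khugepaged.h (the placement and exact
prototype are my assumptions, not the actual next version):

	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	void khugepaged_move_tail(struct mm_struct *mm);
	#else
	static inline void khugepaged_move_tail(struct mm_struct *mm)
	{
		/* Without THP there is no khugepaged scan list to reorder. */
	}
	#endif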
>
> vim +609 mm/madvise.c
>
> 595
> 596 static long madvise_cold(struct madvise_behavior *madv_behavior)
> 597 {
> 598 struct vm_area_struct *vma = madv_behavior->vma;
> 599 struct mmu_gather tlb;
> 600
> 601 if (!can_madv_lru_vma(vma))
> 602 return -EINVAL;
> 603
> 604 lru_add_drain();
> 605 tlb_gather_mmu(&tlb, madv_behavior->mm);
> 606 madvise_cold_page_range(&tlb, madv_behavior);
> 607 tlb_finish_mmu(&tlb);
> 608
> > 609 khugepaged_move_tail(vma->vm_mm);
> 610
> 611 return 0;
> 612 }
> 613
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-15 9:04 ` [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE Vernon Yang
2025-12-15 21:12 ` kernel test robot
@ 2025-12-16 13:08 ` kernel test robot
2025-12-16 13:31 ` kernel test robot
2025-12-18 9:31 ` David Hildenbrand (Red Hat)
3 siblings, 0 replies; 42+ messages in thread
From: kernel test robot @ 2025-12-16 13:08 UTC (permalink / raw)
To: Vernon Yang, akpm, david, lorenzo.stoakes
Cc: llvm, oe-kbuild-all, ziy, npache, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
Hi Vernon,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.19-rc1 next-20251216]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Vernon-Yang/mm-khugepaged-add-trace_mm_khugepaged_scan-event/20251215-171046
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20251215090419.174418-4-yanglincheng%40kylinos.cn
patch subject: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20251216/202512161406.RfF1dIYB-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251216/202512161406.RfF1dIYB-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512161406.RfF1dIYB-lkp@intel.com/
All errors (new ones prefixed by >>):
>> mm/madvise.c:609:2: error: call to undeclared function 'khugepaged_move_tail'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
609 | khugepaged_move_tail(vma->vm_mm);
| ^
mm/madvise.c:837:2: error: call to undeclared function 'khugepaged_move_tail'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
837 | khugepaged_move_tail(mm);
| ^
2 errors generated.
vim +/khugepaged_move_tail +609 mm/madvise.c
595
596 static long madvise_cold(struct madvise_behavior *madv_behavior)
597 {
598 struct vm_area_struct *vma = madv_behavior->vma;
599 struct mmu_gather tlb;
600
601 if (!can_madv_lru_vma(vma))
602 return -EINVAL;
603
604 lru_add_drain();
605 tlb_gather_mmu(&tlb, madv_behavior->mm);
606 madvise_cold_page_range(&tlb, madv_behavior);
607 tlb_finish_mmu(&tlb);
608
> 609 khugepaged_move_tail(vma->vm_mm);
610
611 return 0;
612 }
613
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-15 9:04 ` [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE Vernon Yang
2025-12-15 21:12 ` kernel test robot
2025-12-16 13:08 ` kernel test robot
@ 2025-12-16 13:31 ` kernel test robot
2025-12-18 9:31 ` David Hildenbrand (Red Hat)
3 siblings, 0 replies; 42+ messages in thread
From: kernel test robot @ 2025-12-16 13:31 UTC (permalink / raw)
To: Vernon Yang, akpm, david, lorenzo.stoakes
Cc: oe-kbuild-all, ziy, npache, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
Hi Vernon,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on next-20251216]
[cannot apply to linus/master v6.16-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Vernon-Yang/mm-khugepaged-add-trace_mm_khugepaged_scan-event/20251215-171046
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20251215090419.174418-4-yanglincheng%40kylinos.cn
patch subject: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
config: x86_64-rhel-9.4 (https://download.01.org/0day-ci/archive/20251216/202512161405.8IVTXVcr-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251216/202512161405.8IVTXVcr-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512161405.8IVTXVcr-lkp@intel.com/
All errors (new ones prefixed by >>):
mm/madvise.c: In function 'madvise_cold':
>> mm/madvise.c:609:9: error: implicit declaration of function 'khugepaged_move_tail'; did you mean 'khugepaged_exit'? [-Wimplicit-function-declaration]
609 | khugepaged_move_tail(vma->vm_mm);
| ^~~~~~~~~~~~~~~~~~~~
| khugepaged_exit
vim +609 mm/madvise.c
595
596 static long madvise_cold(struct madvise_behavior *madv_behavior)
597 {
598 struct vm_area_struct *vma = madv_behavior->vma;
599 struct mmu_gather tlb;
600
601 if (!can_madv_lru_vma(vma))
602 return -EINVAL;
603
604 lru_add_drain();
605 tlb_gather_mmu(&tlb, madv_behavior->mm);
606 madvise_cold_page_range(&tlb, madv_behavior);
607 tlb_finish_mmu(&tlb);
608
> 609 khugepaged_move_tail(vma->vm_mm);
610
611 return 0;
612 }
613
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-15 9:04 ` [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed Vernon Yang
` (2 preceding siblings ...)
2025-12-15 23:01 ` kernel test robot
@ 2025-12-17 3:31 ` Wei Yang
2025-12-18 3:27 ` Vernon Yang
2025-12-18 9:29 ` David Hildenbrand (Red Hat)
2025-12-22 19:00 ` kernel test robot
5 siblings, 1 reply; 42+ messages in thread
From: Wei Yang @ 2025-12-17 3:31 UTC (permalink / raw)
To: Vernon Yang
Cc: akpm, david, lorenzo.stoakes, ziy, npache, baohua, lance.yang,
linux-mm, linux-kernel, Vernon Yang
On Mon, Dec 15, 2025 at 05:04:17PM +0800, Vernon Yang wrote:
>The following data is traced by bpftrace on a desktop system. After
>the system has been left idle for 10 minutes upon booting, a lot of
>SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
>khugepaged.
>
>@scan_pmd_status[1]: 1 ## SCAN_SUCCEED
>@scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
>@scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
>total progress size: 701 MB
>Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
>
>The khugepaged_scan list saves all tasks that support collapsing into
>hugepages; as long as a task is not destroyed, khugepaged will not
>remove it from the khugepaged_scan list. This leads to a phenomenon
>where a task has already collapsed all of its memory regions into
>hugepages, yet khugepaged continues to scan it, which wastes CPU time
>for no benefit; and because of khugepaged_scan_sleep_millisecs (default
>10s), scanning a large number of such invalid tasks causes a long wait,
>so the really valid tasks are scanned later.
>
>After applying this patch, when all memory is either SCAN_PMD_MAPPED or
>SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
>list. If a page fault occurs or MADV_HUGEPAGE is called again, it is
>added back to khugepaged.
Two things come to my mind:
* what happens if we split the huge page under memory pressure?
* would this interfere with mTHP collapse?
>
>Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
>---
> mm/khugepaged.c | 35 +++++++++++++++++++++++++----------
> 1 file changed, 25 insertions(+), 10 deletions(-)
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index 0598a19a98cc..1ec1af5be3c8 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -115,6 +115,7 @@ struct khugepaged_scan {
> struct list_head mm_head;
> struct mm_slot *mm_slot;
> unsigned long address;
>+ bool maybe_collapse;
> };
>
> static struct khugepaged_scan khugepaged_scan = {
>@@ -1420,22 +1421,19 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> return result;
> }
>
>-static void collect_mm_slot(struct mm_slot *slot)
>+static void collect_mm_slot(struct mm_slot *slot, bool maybe_collapse)
> {
> struct mm_struct *mm = slot->mm;
>
> lockdep_assert_held(&khugepaged_mm_lock);
>
>- if (hpage_collapse_test_exit(mm)) {
>+ if (hpage_collapse_test_exit(mm) || !maybe_collapse) {
> /* free mm_slot */
> hash_del(&slot->hash);
> list_del(&slot->mm_node);
>
>- /*
>- * Not strictly needed because the mm exited already.
>- *
>- * mm_flags_clear(MMF_VM_HUGEPAGE, mm);
>- */
>+ if (!maybe_collapse)
>+ mm_flags_clear(MMF_VM_HUGEPAGE, mm);
>
> /* khugepaged_mm_lock actually not necessary for the below */
> mm_slot_free(mm_slot_cache, slot);
>@@ -2397,6 +2395,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> struct mm_slot, mm_node);
> khugepaged_scan.address = 0;
> khugepaged_scan.mm_slot = slot;
>+ khugepaged_scan.maybe_collapse = false;
> }
> spin_unlock(&khugepaged_mm_lock);
>
>@@ -2470,8 +2469,18 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> khugepaged_scan.address, &mmap_locked, cc);
> }
>
>- if (*result == SCAN_SUCCEED)
>+ switch (*result) {
>+ case SCAN_PMD_NULL:
>+ case SCAN_PMD_NONE:
>+ case SCAN_PMD_MAPPED:
>+ case SCAN_PTE_MAPPED_HUGEPAGE:
>+ break;
>+ case SCAN_SUCCEED:
> ++khugepaged_pages_collapsed;
>+ fallthrough;
If the collapse succeeds, don't we need to set maybe_collapse to true?
>+ default:
>+ khugepaged_scan.maybe_collapse = true;
>+ }
>
> /* move to next address */
> khugepaged_scan.address += HPAGE_PMD_SIZE;
>@@ -2500,6 +2509,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> * if we scanned all vmas of this mm.
> */
> if (hpage_collapse_test_exit(mm) || !vma) {
>+ bool maybe_collapse = khugepaged_scan.maybe_collapse;
>+
>+ if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
>+ maybe_collapse = true;
>+
> /*
> * Make sure that if mm_users is reaching zero while
> * khugepaged runs here, khugepaged_exit will find
>@@ -2508,12 +2522,13 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> if (!list_is_last(&slot->mm_node, &khugepaged_scan.mm_head)) {
> khugepaged_scan.mm_slot = list_next_entry(slot, mm_node);
> khugepaged_scan.address = 0;
>+ khugepaged_scan.maybe_collapse = false;
> } else {
> khugepaged_scan.mm_slot = NULL;
> khugepaged_full_scans++;
> }
>
>- collect_mm_slot(slot);
>+ collect_mm_slot(slot, maybe_collapse);
> }
>
> trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
>@@ -2616,7 +2631,7 @@ static int khugepaged(void *none)
> slot = khugepaged_scan.mm_slot;
> khugepaged_scan.mm_slot = NULL;
> if (slot)
>- collect_mm_slot(slot);
>+ collect_mm_slot(slot, true);
> spin_unlock(&khugepaged_mm_lock);
> return 0;
> }
>--
>2.51.0
>
--
Wei Yang
Help you, Help me
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-17 3:31 ` Wei Yang
@ 2025-12-18 3:27 ` Vernon Yang
2025-12-18 3:48 ` Wei Yang
0 siblings, 1 reply; 42+ messages in thread
From: Vernon Yang @ 2025-12-18 3:27 UTC (permalink / raw)
To: Wei Yang
Cc: akpm, david, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On Wed, Dec 17, 2025 at 03:31:55AM +0000, Wei Yang wrote:
> On Mon, Dec 15, 2025 at 05:04:17PM +0800, Vernon Yang wrote:
> >The following data is traced by bpftrace on a desktop system. After
> >the system has been left idle for 10 minutes upon booting, a lot of
> >SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
> >khugepaged.
> >
> >@scan_pmd_status[1]: 1 ## SCAN_SUCCEED
> >@scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
> >@scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
> >total progress size: 701 MB
> >Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
> >
> >The khugepaged_scan list saves all tasks that support collapsing into
> >hugepages; as long as a task is not destroyed, khugepaged will not
> >remove it from the khugepaged_scan list. This leads to a phenomenon
> >where a task has already collapsed all of its memory regions into
> >hugepages, yet khugepaged continues to scan it, which wastes CPU time
> >for no benefit; and because of khugepaged_scan_sleep_millisecs (default
> >10s), scanning a large number of such invalid tasks causes a long wait,
> >so the really valid tasks are scanned later.
> >
> >After applying this patch, when all memory is either SCAN_PMD_MAPPED or
> >SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
> >list. If a page fault occurs or MADV_HUGEPAGE is called again, it is
> >added back to khugepaged.
>
> Two things come to my mind:
>
> * what happens if we split the huge page under memory pressure?
static unsigned int shrink_folio_list(struct list_head *folio_list,
struct pglist_data *pgdat, struct scan_control *sc,
struct reclaim_stat *stat, bool ignore_references,
struct mem_cgroup *memcg)
{
...
folio = lru_to_folio(folio_list);
...
references = folio_check_references(folio, sc);
switch (references) {
case FOLIOREF_ACTIVATE:
goto activate_locked;
case FOLIOREF_KEEP:
stat->nr_ref_keep += nr_pages;
goto keep_locked;
case FOLIOREF_RECLAIM:
case FOLIOREF_RECLAIM_CLEAN:
; /* try to reclaim the folio below */
}
...
split_folio_to_list(folio, folio_list);
}
During memory reclaim above, only inactive folios are split. This also
implies that the folio is cold, meaning it hasn't been used recently, so
we do not expect to put the mm back onto the khugepaged scan list to
continue scanning/collapsing. khugepaged needs to scan hot folios with
priority as much as possible and collapse them to avoid wasting CPU.
> * would this interfere with mTHP collapse?
It has no impact on mTHP collapse; the mm is removed automatically only
when all memory is either SCAN_PMD_MAPPED or SCAN_PMD_NONE. In other
cases it will not be removed.
Please let me know if I missed something, thanks!
>
> >
> >Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> >---
> > mm/khugepaged.c | 35 +++++++++++++++++++++++++----------
> > 1 file changed, 25 insertions(+), 10 deletions(-)
> >
> >diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >index 0598a19a98cc..1ec1af5be3c8 100644
> >--- a/mm/khugepaged.c
> >+++ b/mm/khugepaged.c
> >@@ -115,6 +115,7 @@ struct khugepaged_scan {
> > struct list_head mm_head;
> > struct mm_slot *mm_slot;
> > unsigned long address;
> >+ bool maybe_collapse;
> > };
> >
> > static struct khugepaged_scan khugepaged_scan = {
> >@@ -1420,22 +1421,19 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> > return result;
> > }
> >
> >-static void collect_mm_slot(struct mm_slot *slot)
> >+static void collect_mm_slot(struct mm_slot *slot, bool maybe_collapse)
> > {
> > struct mm_struct *mm = slot->mm;
> >
> > lockdep_assert_held(&khugepaged_mm_lock);
> >
> >- if (hpage_collapse_test_exit(mm)) {
> >+ if (hpage_collapse_test_exit(mm) || !maybe_collapse) {
> > /* free mm_slot */
> > hash_del(&slot->hash);
> > list_del(&slot->mm_node);
> >
> >- /*
> >- * Not strictly needed because the mm exited already.
> >- *
> >- * mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> >- */
> >+ if (!maybe_collapse)
> >+ mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> >
> > /* khugepaged_mm_lock actually not necessary for the below */
> > mm_slot_free(mm_slot_cache, slot);
> >@@ -2397,6 +2395,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > struct mm_slot, mm_node);
> > khugepaged_scan.address = 0;
> > khugepaged_scan.mm_slot = slot;
> >+ khugepaged_scan.maybe_collapse = false;
> > }
> > spin_unlock(&khugepaged_mm_lock);
> >
> >@@ -2470,8 +2469,18 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > khugepaged_scan.address, &mmap_locked, cc);
> > }
> >
> >- if (*result == SCAN_SUCCEED)
> >+ switch (*result) {
> >+ case SCAN_PMD_NULL:
> >+ case SCAN_PMD_NONE:
> >+ case SCAN_PMD_MAPPED:
> >+ case SCAN_PTE_MAPPED_HUGEPAGE:
> >+ break;
> >+ case SCAN_SUCCEED:
> > ++khugepaged_pages_collapsed;
> >+ fallthrough;
>
> If collapse successfully, we don't need to set maybe_collapse to true?
Above "fallthrough" explicitly tells the compiler that when the collapse is
successful, run below "khugepaged_scan.maybe_collapse = true" :)
> >+ default:
> >+ khugepaged_scan.maybe_collapse = true;
> >+ }
> >
> > /* move to next address */
> > khugepaged_scan.address += HPAGE_PMD_SIZE;
> >@@ -2500,6 +2509,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > * if we scanned all vmas of this mm.
> > */
> > if (hpage_collapse_test_exit(mm) || !vma) {
> >+ bool maybe_collapse = khugepaged_scan.maybe_collapse;
> >+
> >+ if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
> >+ maybe_collapse = true;
> >+
> > /*
> > * Make sure that if mm_users is reaching zero while
> > * khugepaged runs here, khugepaged_exit will find
> >@@ -2508,12 +2522,13 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > if (!list_is_last(&slot->mm_node, &khugepaged_scan.mm_head)) {
> > khugepaged_scan.mm_slot = list_next_entry(slot, mm_node);
> > khugepaged_scan.address = 0;
> >+ khugepaged_scan.maybe_collapse = false;
> > } else {
> > khugepaged_scan.mm_slot = NULL;
> > khugepaged_full_scans++;
> > }
> >
> >- collect_mm_slot(slot);
> >+ collect_mm_slot(slot, maybe_collapse);
> > }
> >
> > trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
> >@@ -2616,7 +2631,7 @@ static int khugepaged(void *none)
> > slot = khugepaged_scan.mm_slot;
> > khugepaged_scan.mm_slot = NULL;
> > if (slot)
> >- collect_mm_slot(slot);
> >+ collect_mm_slot(slot, true);
> > spin_unlock(&khugepaged_mm_lock);
> > return 0;
> > }
> >--
> >2.51.0
> >
>
> --
> Wei Yang
> Help you, Help me
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-18 3:27 ` Vernon Yang
@ 2025-12-18 3:48 ` Wei Yang
2025-12-18 4:41 ` Vernon Yang
0 siblings, 1 reply; 42+ messages in thread
From: Wei Yang @ 2025-12-18 3:48 UTC (permalink / raw)
To: Vernon Yang
Cc: Wei Yang, akpm, david, lorenzo.stoakes, ziy, baohua, lance.yang,
linux-mm, linux-kernel, Vernon Yang
On Thu, Dec 18, 2025 at 11:27:24AM +0800, Vernon Yang wrote:
>On Wed, Dec 17, 2025 at 03:31:55AM +0000, Wei Yang wrote:
>> On Mon, Dec 15, 2025 at 05:04:17PM +0800, Vernon Yang wrote:
>> >The following data is traced by bpftrace on a desktop system. After
>> >the system has been left idle for 10 minutes upon booting, a lot of
>> >SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
>> >khugepaged.
>> >
>> >@scan_pmd_status[1]: 1 ## SCAN_SUCCEED
>> >@scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
>> >@scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
>> >total progress size: 701 MB
>> >Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
>> >
>> >The khugepaged_scan list saves all tasks that support collapsing into
>> >hugepages; as long as a task is not destroyed, khugepaged will not
>> >remove it from the khugepaged_scan list. This leads to a phenomenon
>> >where a task has already collapsed all of its memory regions into
>> >hugepages, yet khugepaged continues to scan it, which wastes CPU time
>> >for no benefit; and because of khugepaged_scan_sleep_millisecs (default
>> >10s), scanning a large number of such invalid tasks causes a long wait,
>> >so the really valid tasks are scanned later.
>> >
>> >After applying this patch, when all memory is either SCAN_PMD_MAPPED or
>> >SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
>> >list. If a page fault occurs or MADV_HUGEPAGE is called again, it is
>> >added back to khugepaged.
>>
>> Two things come to my mind:
>>
>> * what happens if we split the huge page under memory pressure?
>
>static unsigned int shrink_folio_list(struct list_head *folio_list,
> struct pglist_data *pgdat, struct scan_control *sc,
> struct reclaim_stat *stat, bool ignore_references,
> struct mem_cgroup *memcg)
>{
> ...
>
> folio = lru_to_folio(folio_list);
>
> ...
>
> references = folio_check_references(folio, sc);
> switch (references) {
> case FOLIOREF_ACTIVATE:
> goto activate_locked;
> case FOLIOREF_KEEP:
> stat->nr_ref_keep += nr_pages;
> goto keep_locked;
> case FOLIOREF_RECLAIM:
> case FOLIOREF_RECLAIM_CLEAN:
> ; /* try to reclaim the folio below */
> }
>
> ...
>
> split_folio_to_list(folio, folio_list);
>}
>
>During memory reclaim above, only inactive folios are split. This also
>implies that the folio is cold, meaning it hasn't been used recently, so
>we do not expect to put the mm back onto the khugepaged scan list to
>continue scanning/collapsing. khugepaged needs to scan hot folios with
>priority as much as possible and collapse them to avoid wasting CPU.
>
So we will never put this process back onto the scan list, right?
>> * would this interfere with mTHP collapse?
>
>It has no impact on mTHP collapse; the mm is removed automatically only
>when all memory is either SCAN_PMD_MAPPED or SCAN_PMD_NONE. In other
>cases it will not be removed.
>
>Please let me know if I missed something, thanks!
>
>>
>> >
>> >Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
>> >---
>> > mm/khugepaged.c | 35 +++++++++++++++++++++++++----------
>> > 1 file changed, 25 insertions(+), 10 deletions(-)
>> >
>> >diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> >index 0598a19a98cc..1ec1af5be3c8 100644
>> >--- a/mm/khugepaged.c
>> >+++ b/mm/khugepaged.c
>> >@@ -115,6 +115,7 @@ struct khugepaged_scan {
>> > struct list_head mm_head;
>> > struct mm_slot *mm_slot;
>> > unsigned long address;
>> >+ bool maybe_collapse;
>> > };
>> >
>> > static struct khugepaged_scan khugepaged_scan = {
>> >@@ -1420,22 +1421,19 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>> > return result;
>> > }
>> >
>> >-static void collect_mm_slot(struct mm_slot *slot)
>> >+static void collect_mm_slot(struct mm_slot *slot, bool maybe_collapse)
>> > {
>> > struct mm_struct *mm = slot->mm;
>> >
>> > lockdep_assert_held(&khugepaged_mm_lock);
>> >
>> >- if (hpage_collapse_test_exit(mm)) {
>> >+ if (hpage_collapse_test_exit(mm) || !maybe_collapse) {
>> > /* free mm_slot */
>> > hash_del(&slot->hash);
>> > list_del(&slot->mm_node);
>> >
>> >- /*
>> >- * Not strictly needed because the mm exited already.
>> >- *
>> >- * mm_flags_clear(MMF_VM_HUGEPAGE, mm);
>> >- */
>> >+ if (!maybe_collapse)
>> >+ mm_flags_clear(MMF_VM_HUGEPAGE, mm);
>> >
>> > /* khugepaged_mm_lock actually not necessary for the below */
>> > mm_slot_free(mm_slot_cache, slot);
>> >@@ -2397,6 +2395,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>> > struct mm_slot, mm_node);
>> > khugepaged_scan.address = 0;
>> > khugepaged_scan.mm_slot = slot;
>> >+ khugepaged_scan.maybe_collapse = false;
>> > }
>> > spin_unlock(&khugepaged_mm_lock);
>> >
>> >@@ -2470,8 +2469,18 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>> > khugepaged_scan.address, &mmap_locked, cc);
>> > }
>> >
>> >- if (*result == SCAN_SUCCEED)
>> >+ switch (*result) {
>> >+ case SCAN_PMD_NULL:
>> >+ case SCAN_PMD_NONE:
>> >+ case SCAN_PMD_MAPPED:
>> >+ case SCAN_PTE_MAPPED_HUGEPAGE:
>> >+ break;
>> >+ case SCAN_SUCCEED:
>> > ++khugepaged_pages_collapsed;
>> >+ fallthrough;
>>
>> If the collapse succeeds, don't we need to set maybe_collapse to true?
>
>Above "fallthrough" explicitly tells the compiler that when the collapse is
>successful, run below "khugepaged_scan.maybe_collapse = true" :)
>
Got it, thanks.
>> >+ default:
>> >+ khugepaged_scan.maybe_collapse = true;
>> >+ }
>> >
>> > /* move to next address */
>> > khugepaged_scan.address += HPAGE_PMD_SIZE;
>> >@@ -2500,6 +2509,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>> > * if we scanned all vmas of this mm.
>> > */
>> > if (hpage_collapse_test_exit(mm) || !vma) {
>> >+ bool maybe_collapse = khugepaged_scan.maybe_collapse;
>> >+
>> >+ if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
>> >+ maybe_collapse = true;
>> >+
>> > /*
>> > * Make sure that if mm_users is reaching zero while
>> > * khugepaged runs here, khugepaged_exit will find
>> >@@ -2508,12 +2522,13 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>> > if (!list_is_last(&slot->mm_node, &khugepaged_scan.mm_head)) {
>> > khugepaged_scan.mm_slot = list_next_entry(slot, mm_node);
>> > khugepaged_scan.address = 0;
>> >+ khugepaged_scan.maybe_collapse = false;
>> > } else {
>> > khugepaged_scan.mm_slot = NULL;
>> > khugepaged_full_scans++;
>> > }
>> >
>> >- collect_mm_slot(slot);
>> >+ collect_mm_slot(slot, maybe_collapse);
>> > }
>> >
>> > trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
>> >@@ -2616,7 +2631,7 @@ static int khugepaged(void *none)
>> > slot = khugepaged_scan.mm_slot;
>> > khugepaged_scan.mm_slot = NULL;
>> > if (slot)
>> >- collect_mm_slot(slot);
>> >+ collect_mm_slot(slot, true);
>> > spin_unlock(&khugepaged_mm_lock);
>> > return 0;
>> > }
>> >--
>> >2.51.0
>> >
>>
>> --
>> Wei Yang
>> Help you, Help me
>
>--
>Thanks,
>Vernon
--
Wei Yang
Help you, Help me
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-18 3:48 ` Wei Yang
@ 2025-12-18 4:41 ` Vernon Yang
0 siblings, 0 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-18 4:41 UTC (permalink / raw)
To: Wei Yang
Cc: akpm, david, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On Thu, Dec 18, 2025 at 03:48:01AM +0000, Wei Yang wrote:
> On Thu, Dec 18, 2025 at 11:27:24AM +0800, Vernon Yang wrote:
> >On Wed, Dec 17, 2025 at 03:31:55AM +0000, Wei Yang wrote:
> >> On Mon, Dec 15, 2025 at 05:04:17PM +0800, Vernon Yang wrote:
> >> >The following data is traced by bpftrace on a desktop system. After
> >> >the system has been left idle for 10 minutes upon booting, a lot of
> >> >SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
> >> >khugepaged.
> >> >
> >> >@scan_pmd_status[1]: 1 ## SCAN_SUCCEED
> >> >@scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
> >> >@scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
> >> >total progress size: 701 MB
> >> >Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
> >> >
> >> >The khugepaged_scan list saves all tasks that support collapsing into
> >> >hugepages; as long as a task is not destroyed, khugepaged will not
> >> >remove it from the khugepaged_scan list. This leads to a phenomenon
> >> >where a task has already collapsed all of its memory regions into
> >> >hugepages, yet khugepaged continues to scan it, which wastes CPU time
> >> >for no benefit; and because of khugepaged_scan_sleep_millisecs (default
> >> >10s), scanning a large number of such invalid tasks causes a long wait,
> >> >so the really valid tasks are scanned later.
> >> >
> >> >After applying this patch, when all memory is either SCAN_PMD_MAPPED or
> >> >SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
> >> >list. If a page fault occurs or MADV_HUGEPAGE is called again, it is
> >> >added back to khugepaged.
> >>
> >> Two things come to my mind:
> >>
> >> * what happens if we split the huge page under memory pressure?
> >
> >static unsigned int shrink_folio_list(struct list_head *folio_list,
> > struct pglist_data *pgdat, struct scan_control *sc,
> > struct reclaim_stat *stat, bool ignore_references,
> > struct mem_cgroup *memcg)
> >{
> > ...
> >
> > folio = lru_to_folio(folio_list);
> >
> > ...
> >
> > references = folio_check_references(folio, sc);
> > switch (references) {
> > case FOLIOREF_ACTIVATE:
> > goto activate_locked;
> > case FOLIOREF_KEEP:
> > stat->nr_ref_keep += nr_pages;
> > goto keep_locked;
> > case FOLIOREF_RECLAIM:
> > case FOLIOREF_RECLAIM_CLEAN:
> > ; /* try to reclaim the folio below */
> > }
> >
> > ...
> >
> > split_folio_to_list(folio, folio_list);
> >}
> >
> >During memory reclaim above, only inactive folios are split. This also
> >implies that the folio is cold, meaning it hasn't been used recently, so
> >we do not expect to put the mm back onto the khugepaged scan list to
> >continue scanning/collapsing. khugepaged needs to scan hot folios with
> >priority as much as possible and collapse them to avoid wasting CPU.
> >
>
> So we will never put this process back onto the scan list, right?
No. If a page fault occurs or MADV_HUGEPAGE is called again, the task is
added back to the khugepaged scan list; we just don't actively put it
back onto the scan list after a split.
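To make the re-add path concrete, roughly (a paraphrased sketch of the
existing hook, not verbatim kernel code):

	/* Called on the THP fault path and from hugepage_madvise(). */
	void khugepaged_enter_vma(struct vm_area_struct *vma, vm_flags_t vm_flags)
	{
		/* Put the mm (back) on the scan list if it is not tracked yet. */
		if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
		    thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
			__khugepaged_enter(vma->vm_mm);
	}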
>
> >> * would this interfere with mTHP collapse?
> >
> >It has no impact on mTHP collapse; the mm is removed automatically only
> >when all memory is either SCAN_PMD_MAPPED or SCAN_PMD_NONE. In other
> >cases it will not be removed.
> >
> >Please let me know if I missed something, thanks!
> >
> >>
> >> >
> >> >Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> >> >---
> >> > mm/khugepaged.c | 35 +++++++++++++++++++++++++----------
> >> > 1 file changed, 25 insertions(+), 10 deletions(-)
> >> >
> >> >diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >> >index 0598a19a98cc..1ec1af5be3c8 100644
> >> >--- a/mm/khugepaged.c
> >> >+++ b/mm/khugepaged.c
> >> >@@ -115,6 +115,7 @@ struct khugepaged_scan {
> >> > struct list_head mm_head;
> >> > struct mm_slot *mm_slot;
> >> > unsigned long address;
> >> >+ bool maybe_collapse;
> >> > };
> >> >
> >> > static struct khugepaged_scan khugepaged_scan = {
> >> >@@ -1420,22 +1421,19 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> >> > return result;
> >> > }
> >> >
> >> >-static void collect_mm_slot(struct mm_slot *slot)
> >> >+static void collect_mm_slot(struct mm_slot *slot, bool maybe_collapse)
> >> > {
> >> > struct mm_struct *mm = slot->mm;
> >> >
> >> > lockdep_assert_held(&khugepaged_mm_lock);
> >> >
> >> >- if (hpage_collapse_test_exit(mm)) {
> >> >+ if (hpage_collapse_test_exit(mm) || !maybe_collapse) {
> >> > /* free mm_slot */
> >> > hash_del(&slot->hash);
> >> > list_del(&slot->mm_node);
> >> >
> >> >- /*
> >> >- * Not strictly needed because the mm exited already.
> >> >- *
> >> >- * mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> >> >- */
> >> >+ if (!maybe_collapse)
> >> >+ mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> >> >
> >> > /* khugepaged_mm_lock actually not necessary for the below */
> >> > mm_slot_free(mm_slot_cache, slot);
> >> >@@ -2397,6 +2395,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >> > struct mm_slot, mm_node);
> >> > khugepaged_scan.address = 0;
> >> > khugepaged_scan.mm_slot = slot;
> >> >+ khugepaged_scan.maybe_collapse = false;
> >> > }
> >> > spin_unlock(&khugepaged_mm_lock);
> >> >
> >> >@@ -2470,8 +2469,18 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >> > khugepaged_scan.address, &mmap_locked, cc);
> >> > }
> >> >
> >> >- if (*result == SCAN_SUCCEED)
> >> >+ switch (*result) {
> >> >+ case SCAN_PMD_NULL:
> >> >+ case SCAN_PMD_NONE:
> >> >+ case SCAN_PMD_MAPPED:
> >> >+ case SCAN_PTE_MAPPED_HUGEPAGE:
> >> >+ break;
> >> >+ case SCAN_SUCCEED:
> >> > ++khugepaged_pages_collapsed;
> >> >+ fallthrough;
> >>
> >> If the collapse succeeds, don't we need to set maybe_collapse to true?
> >
> >Above "fallthrough" explicitly tells the compiler that when the collapse is
> >successful, run below "khugepaged_scan.maybe_collapse = true" :)
> >
>
> Got it, thanks.
>
> >> >+ default:
> >> >+ khugepaged_scan.maybe_collapse = true;
> >> >+ }
> >> >
> >> > /* move to next address */
> >> > khugepaged_scan.address += HPAGE_PMD_SIZE;
> >> >@@ -2500,6 +2509,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >> > * if we scanned all vmas of this mm.
> >> > */
> >> > if (hpage_collapse_test_exit(mm) || !vma) {
> >> >+ bool maybe_collapse = khugepaged_scan.maybe_collapse;
> >> >+
> >> >+ if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
> >> >+ maybe_collapse = true;
> >> >+
> >> > /*
> >> > * Make sure that if mm_users is reaching zero while
> >> > * khugepaged runs here, khugepaged_exit will find
> >> >@@ -2508,12 +2522,13 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >> > if (!list_is_last(&slot->mm_node, &khugepaged_scan.mm_head)) {
> >> > khugepaged_scan.mm_slot = list_next_entry(slot, mm_node);
> >> > khugepaged_scan.address = 0;
> >> >+ khugepaged_scan.maybe_collapse = false;
> >> > } else {
> >> > khugepaged_scan.mm_slot = NULL;
> >> > khugepaged_full_scans++;
> >> > }
> >> >
> >> >- collect_mm_slot(slot);
> >> >+ collect_mm_slot(slot, maybe_collapse);
> >> > }
> >> >
> >> > trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
> >> >@@ -2616,7 +2631,7 @@ static int khugepaged(void *none)
> >> > slot = khugepaged_scan.mm_slot;
> >> > khugepaged_scan.mm_slot = NULL;
> >> > if (slot)
> >> >- collect_mm_slot(slot);
> >> >+ collect_mm_slot(slot, true);
> >> > spin_unlock(&khugepaged_mm_lock);
> >> > return 0;
> >> > }
> >> >--
> >> >2.51.0
> >> >
> >>
> >> --
> >> Wei Yang
> >> Help you, Help me
> >
> >--
> >Thanks,
> >Vernon
>
> --
> Wei Yang
> Help you, Help me
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 1/4] mm: khugepaged: add trace_mm_khugepaged_scan event
2025-12-15 9:04 ` [PATCH 1/4] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
@ 2025-12-18 9:24 ` David Hildenbrand (Red Hat)
2025-12-19 5:21 ` Vernon Yang
0 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18 9:24 UTC (permalink / raw)
To: Vernon Yang, akpm, lorenzo.stoakes
Cc: ziy, npache, baohua, lance.yang, linux-mm, linux-kernel, Vernon Yang
On 12/15/25 10:04, Vernon Yang wrote:
> Add an mm_khugepaged_scan event to track the total time of a full scan
> and the total number of pages scanned by khugepaged.
>
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> ---
> include/trace/events/huge_memory.h | 24 ++++++++++++++++++++++++
> mm/khugepaged.c | 2 ++
> 2 files changed, 26 insertions(+)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index dd94d14a2427..b2824c2f8238 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -237,5 +237,29 @@ TRACE_EVENT(mm_khugepaged_collapse_file,
> __print_symbolic(__entry->result, SCAN_STATUS))
> );
>
> +TRACE_EVENT(mm_khugepaged_scan,
> +
> + TP_PROTO(struct mm_struct *mm, int progress, bool full),
> +
> + TP_ARGS(mm, progress, full),
> +
> + TP_STRUCT__entry(
> + __field(struct mm_struct *, mm)
> + __field(int, progress)
> + __field(bool, full)
> + ),
> +
> + TP_fast_assign(
> + __entry->mm = mm;
> + __entry->progress = progress;
> + __entry->full = full;
> + ),
> +
> + TP_printk("mm=%p, progress=%d, full=%d",
> + __entry->mm,
> + __entry->progress,
> + __entry->full)
> +);
> +
> #endif /* __HUGE_MEMORY_H */
> #include <trace/define_trace.h>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index abe54f0043c7..0598a19a98cc 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2516,6 +2516,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> collect_mm_slot(slot);
> }
>
> + trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
> +
> return progress;
> }
>
Nothing jumped out at me, except that "full" could be called
"full_scan_finished" or something like that.
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
--
Cheers
David
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-15 9:04 ` [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed Vernon Yang
` (3 preceding siblings ...)
2025-12-17 3:31 ` Wei Yang
@ 2025-12-18 9:29 ` David Hildenbrand (Red Hat)
2025-12-19 5:24 ` Vernon Yang
2025-12-19 8:35 ` Vernon Yang
2025-12-22 19:00 ` kernel test robot
5 siblings, 2 replies; 42+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18 9:29 UTC (permalink / raw)
To: Vernon Yang, akpm, lorenzo.stoakes
Cc: ziy, npache, baohua, lance.yang, linux-mm, linux-kernel, Vernon Yang
On 12/15/25 10:04, Vernon Yang wrote:
> The following data is traced by bpftrace on a desktop system. After
> the system has been left idle for 10 minutes upon booting, a lot of
> SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
> khugepaged.
>
> @scan_pmd_status[1]: 1 ## SCAN_SUCCEED
> @scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
> @scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
> total progress size: 701 MB
> Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
>
> The khugepaged_scan list saves all tasks that support collapsing into
> hugepages; as long as a task is not destroyed, khugepaged will not
> remove it from the khugepaged_scan list. This leads to a phenomenon
> where a task has already collapsed all of its memory regions into
> hugepages, yet khugepaged continues to scan it, which wastes CPU time
> for no benefit; and because of khugepaged_scan_sleep_millisecs (default
> 10s), scanning a large number of such invalid tasks causes a long wait,
> so the really valid tasks are scanned later.
>
> After applying this patch, when all memory is either SCAN_PMD_MAPPED or
> SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
> list. If a page fault occurs or MADV_HUGEPAGE is called again, it is
> added back to khugepaged.
I don't like that, as it assumes that memory within such a process would
be rather static, which is easily not the case (e.g., allocators just
doing MADV_DONTNEED to free memory).
If most stuff is collapsed to PMDs already, can't we just skip over
these regions a bit faster?
--
Cheers
David
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-15 9:04 ` [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE Vernon Yang
` (2 preceding siblings ...)
2025-12-16 13:31 ` kernel test robot
@ 2025-12-18 9:31 ` David Hildenbrand (Red Hat)
2025-12-19 5:29 ` Vernon Yang
3 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18 9:31 UTC (permalink / raw)
To: Vernon Yang, akpm, lorenzo.stoakes
Cc: ziy, npache, baohua, lance.yang, linux-mm, linux-kernel, Vernon Yang
On 12/15/25 10:04, Vernon Yang wrote:
> For example, create three tasks: hot1 -> cold -> hot2. After all three
> tasks are created, each allocates 128MB of memory. The hot1/hot2 tasks
> continuously access their 128 MB of memory, while the cold task only
> accesses its memory briefly and then calls madvise(MADV_COLD). However,
> khugepaged still prioritizes scanning the cold task and only scans the
> hot2 task after completing the scan of the cold task.
>
> So if the user has explicitly informed us via MADV_COLD/FREE that this
> memory is cold or will be freed, it is appropriate for khugepaged to
> scan it only at the latest possible moment, thereby avoiding unnecessary
> scan and collapse operations to reduce CPU wastage.
>
> Here are the performance test results:
> (Throughput bigger is better, other smaller is better)
>
> Testing on x86_64 machine:
>
> | task hot2 | without patch | with patch | delta |
> |---------------------|---------------|---------------|---------|
> | total accesses time | 3.14 sec | 2.92 sec | -7.01% |
> | cycles per access | 4.91 | 2.07 | -57.84% |
> | Throughput | 104.38 M/sec | 112.12 M/sec | +7.42% |
> | dTLB-load-misses | 288966432 | 1292908 | -99.55% |
>
> Testing on qemu-system-x86_64 -enable-kvm:
>
> | task hot2 | without patch | with patch | delta |
> |---------------------|---------------|---------------|---------|
> | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
> | cycles per access | 7.23 | 2.12 | -70.68% |
> | Throughput | 97.88 M/sec | 110.76 M/sec | +13.16% |
> | dTLB-load-misses | 237406497 | 3189194 | -98.66% |
Again, I also don't like that, because you make assumptions about a full
process based on some part of its address space.
E.g., if a library issues a MADV_COLD on some part of the memory the
library manages, why should the remaining part of the process suffer as
well?
This seems to be a heuristic focused on some specific workloads, no?
--
Cheers
David
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 4/4] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
2025-12-15 9:04 ` [PATCH 4/4] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY Vernon Yang
@ 2025-12-18 9:33 ` David Hildenbrand (Red Hat)
2025-12-19 5:31 ` Vernon Yang
0 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18 9:33 UTC (permalink / raw)
To: Vernon Yang, akpm, lorenzo.stoakes
Cc: ziy, npache, baohua, lance.yang, linux-mm, linux-kernel, Vernon Yang
On 12/15/25 10:04, Vernon Yang wrote:
> When an mm with the MMF_DISABLE_THP_COMPLETELY flag is detected during
> scanning, directly set khugepaged_scan.mm_slot to the next mm_slot,
> reduce redundant operation.
That conceptually makes sense to me. How much does that save in
practice? Do you have some performance numbers for processes with a
rather large number of VMAs?
--
Cheers
David
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 1/4] mm: khugepaged: add trace_mm_khugepaged_scan event
2025-12-18 9:24 ` David Hildenbrand (Red Hat)
@ 2025-12-19 5:21 ` Vernon Yang
0 siblings, 0 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-19 5:21 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On Thu, Dec 18, 2025 at 10:24:05AM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/15/25 10:04, Vernon Yang wrote:
> > Add an mm_khugepaged_scan event to track the total time of a full scan
> > and the total number of pages scanned by khugepaged.
> >
> > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > ---
> > include/trace/events/huge_memory.h | 24 ++++++++++++++++++++++++
> > mm/khugepaged.c | 2 ++
> > 2 files changed, 26 insertions(+)
> >
> > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > index dd94d14a2427..b2824c2f8238 100644
> > --- a/include/trace/events/huge_memory.h
> > +++ b/include/trace/events/huge_memory.h
> > @@ -237,5 +237,29 @@ TRACE_EVENT(mm_khugepaged_collapse_file,
> > __print_symbolic(__entry->result, SCAN_STATUS))
> > );
> > +TRACE_EVENT(mm_khugepaged_scan,
> > +
> > + TP_PROTO(struct mm_struct *mm, int progress, bool full),
> > +
> > + TP_ARGS(mm, progress, full),
> > +
> > + TP_STRUCT__entry(
> > + __field(struct mm_struct *, mm)
> > + __field(int, progress)
> > + __field(bool, full)
> > + ),
> > +
> > + TP_fast_assign(
> > + __entry->mm = mm;
> > + __entry->progress = progress;
> > + __entry->full = full;
> > + ),
> > +
> > + TP_printk("mm=%p, progress=%d, full=%d",
> > + __entry->mm,
> > + __entry->progress,
> > + __entry->full)
> > +);
> > +
> > #endif /* __HUGE_MEMORY_H */
> > #include <trace/define_trace.h>
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index abe54f0043c7..0598a19a98cc 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2516,6 +2516,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > collect_mm_slot(slot);
> > }
> > + trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
> > +
> > return progress;
> > }
>
> Nothing jumped out at me, except that "full" could be called
> "full_scan_finished" or something like that.
Thanks for your review. full_scan_finished sounds good to me; I'll use
it in the next version.
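A sketch of what the rename might look like in the event definition
(assuming only the field and format-string names change):

	__field(bool, full_scan_finished)
	...
	__entry->full_scan_finished = full_scan_finished;
	...
	TP_printk("mm=%p, progress=%d, full_scan_finished=%d",
		  __entry->mm,
		  __entry->progress,
		  __entry->full_scan_finished)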
>
> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
>
> --
> Cheers
>
> David
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-18 9:29 ` David Hildenbrand (Red Hat)
@ 2025-12-19 5:24 ` Vernon Yang
2025-12-19 9:00 ` David Hildenbrand (Red Hat)
2025-12-19 8:35 ` Vernon Yang
1 sibling, 1 reply; 42+ messages in thread
From: Vernon Yang @ 2025-12-19 5:24 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On Thu, Dec 18, 2025 at 10:29:18AM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/15/25 10:04, Vernon Yang wrote:
> > The following data is traced by bpftrace on a desktop system. After
> > the system has been left idle for 10 minutes upon booting, a lot of
> > SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
> > khugepaged.
> >
> > @scan_pmd_status[1]: 1 ## SCAN_SUCCEED
> > @scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
> > @scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
> > total progress size: 701 MB
> > Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
> >
> > The khugepaged_scan list saves all tasks that support collapsing into
> > hugepages; as long as a task is not destroyed, khugepaged will not
> > remove it from the khugepaged_scan list. This leads to a phenomenon
> > where a task has already collapsed all of its memory regions into
> > hugepages, yet khugepaged continues to scan it, which wastes CPU time
> > for no benefit; and because of khugepaged_scan_sleep_millisecs (default
> > 10s), scanning a large number of such invalid tasks causes a long wait,
> > so the really valid tasks are scanned later.
> >
> > After applying this patch, when all memory is either SCAN_PMD_MAPPED or
> > SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
> > list. If a page fault occurs or MADV_HUGEPAGE is called again, it is
> > added back to khugepaged.
>
> I don't like that, as it assumes that memory within such a process would be
> rather static, which is easily not the case (e.g., allocators just doing
> MADV_DONTNEED to free memory).
>
> If most stuff is collapsed to PMDs already, can't we just skip over these
> regions a bit faster?
/* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 second */
static unsigned int khugepaged_pages_to_scan __read_mostly;
The observed phenomenon is that when scanning these regions, the loop is
broken upon reaching khugepaged_pages_to_scan, and khugepaged therefore
enters its 10s sleep. So if we just skip over these regions, it will
break the semantics of khugepaged_pages_to_scan.
I also think that approach is great because it is sufficiently simple.
If we can skip over these regions directly, that's excellent.
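For context, the budget works roughly like this (a paraphrased sketch of
khugepaged_do_scan(), not verbatim):

	unsigned int progress = 0;
	unsigned int pages = READ_ONCE(khugepaged_pages_to_scan);

	while (progress < pages) {
		/* Each PMD range visited accounts HPAGE_PMD_NR of progress,
		 * so already-collapsed ranges burn the budget just as fast
		 * as ranges that really get scanned. */
		progress += khugepaged_scan_mm_slot(pages - progress,
						    &result, cc);
	}
	/* then sleep khugepaged_scan_sleep_millisecs before the next batch */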
> --
> Cheers
>
> David
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-18 9:31 ` David Hildenbrand (Red Hat)
@ 2025-12-19 5:29 ` Vernon Yang
2025-12-19 8:58 ` David Hildenbrand (Red Hat)
0 siblings, 1 reply; 42+ messages in thread
From: Vernon Yang @ 2025-12-19 5:29 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On Thu, Dec 18, 2025 at 10:31:58AM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/15/25 10:04, Vernon Yang wrote:
> > For example, create three tasks: hot1 -> cold -> hot2. After all three
> > tasks are created, each allocates 128MB of memory. The hot1/hot2 tasks
> > continuously access their 128 MB of memory, while the cold task only
> > accesses its memory briefly and then calls madvise(MADV_COLD). However,
> > khugepaged still prioritizes scanning the cold task and only scans the
> > hot2 task after completing the scan of the cold task.
> >
> > So if the user has explicitly informed us via MADV_COLD/FREE that this
> > memory is cold or will be freed, it is appropriate for khugepaged to
> > scan it only at the latest possible moment, thereby avoiding unnecessary
> > scan and collapse operations to reduce CPU wastage.
> >
> > Here are the performance test results:
> > (Throughput bigger is better, other smaller is better)
> >
> > Testing on x86_64 machine:
> >
> > | task hot2 | without patch | with patch | delta |
> > |---------------------|---------------|---------------|---------|
> > | total accesses time | 3.14 sec | 2.92 sec | -7.01% |
> > | cycles per access | 4.91 | 2.07 | -57.84% |
> > | Throughput | 104.38 M/sec | 112.12 M/sec | +7.42% |
> > | dTLB-load-misses | 288966432 | 1292908 | -99.55% |
> >
> > Testing on qemu-system-x86_64 -enable-kvm:
> >
> > | task hot2 | without patch | with patch | delta |
> > |---------------------|---------------|---------------|---------|
> > | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
> > | cycles per access | 7.23 | 2.12 | -70.68% |
> > | Throughput | 97.88 M/sec | 110.76 M/sec | +13.16% |
> > | dTLB-load-misses | 237406497 | 3189194 | -98.66% |
>
> Again, I also don't like that, because you make assumptions about a full
> process based on some part of its address space.
>
> E.g., if a library issues a MADV_COLD on some part of the memory the library
> manages, why should the remaining part of the process suffer as well?
Yes, you make a good point, thanks!
> This seems to be a heuristic focused on some specific workloads, no?
Right.
Could we use the VM_NOHUGEPAGE flag to indicate that this region should
not be collapsed, so that khugepaged can simply skip this VMA during
scanning? This way, it won't affect the remaining part of the task's
memory regions.
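For reference, the per-VMA skip would then fall out of the existing check
in khugepaged_scan_mm_slot() (quoted earlier in this thread), since
thp_vma_allowable_order() returns no allowable orders for a
VM_NOHUGEPAGE region:

	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
		progress++;
		continue;	/* VM_NOHUGEPAGE (among others) ends up here */
	}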
> --
> Cheers
>
> David
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 4/4] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
2025-12-18 9:33 ` David Hildenbrand (Red Hat)
@ 2025-12-19 5:31 ` Vernon Yang
0 siblings, 0 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-19 5:31 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On Thu, Dec 18, 2025 at 10:33:16AM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/15/25 10:04, Vernon Yang wrote:
> > When an mm with the MMF_DISABLE_THP_COMPLETELY flag is detected during
> > scanning, directly set khugepaged_scan.mm_slot to the next mm_slot,
> > reduce redundant operation.
>
> That conceptually makes sense to me. How much does that save in practice? Do
> you have some performance numbers for processes with a rather large number of
> VMAs?
I only came to this possibility through theoretical analysis and haven't
done any separate performance tests for this patch yet.
If you have anything you'd like to test, please let me know, and I can
measure the performance benefits.
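If it helps, a hypothetical userspace sketch for such a test (untested,
and entirely my assumption about the setup): fault in memory first so
the mm lands on the khugepaged list, then disable THP for the process
and split the mapping into many VMAs:

	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/prctl.h>
	#include <unistd.h>

	#ifndef PR_SET_THP_DISABLE
	#define PR_SET_THP_DISABLE 41
	#endif

	int main(int argc, char **argv)
	{
		long nr = argc > 1 ? atol(argv[1]) : 10000;
		long psz = sysconf(_SC_PAGESIZE);
		char *p = mmap(NULL, 2 * nr * psz, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		/* Fault the range in so the mm is put on the khugepaged list. */
		memset(p, 1, 2 * nr * psz);
		/* Now disable THP for the whole process. */
		prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
		/* Split the mapping into ~2 * nr VMAs by alternating protections. */
		for (long i = 0; i < nr; i++)
			mprotect(p + 2 * i * psz, psz, PROT_READ);
		pause();	/* keep the VMAs alive while khugepaged runs */
		return 0;
	}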
> --
> Cheers
>
> David
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-18 9:29 ` David Hildenbrand (Red Hat)
2025-12-19 5:24 ` Vernon Yang
@ 2025-12-19 8:35 ` Vernon Yang
2025-12-19 8:55 ` David Hildenbrand (Red Hat)
2025-12-23 11:18 ` Dev Jain
1 sibling, 2 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-19 8:35 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On Thu, Dec 18, 2025 at 10:29:18AM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/15/25 10:04, Vernon Yang wrote:
> > The following data is traced by bpftrace on a desktop system. After
> > the system has been left idle for 10 minutes upon booting, a lot of
> > SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
> > khugepaged.
> >
> > @scan_pmd_status[1]: 1 ## SCAN_SUCCEED
> > @scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
> > @scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
> > total progress size: 701 MB
> > Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
> >
> > The khugepaged_scan list saves all tasks that support collapsing into
> > hugepages; as long as a task is not destroyed, khugepaged will not remove
> > it from the khugepaged_scan list. This leads to a phenomenon where a task
> > has already collapsed all of its memory regions into hugepages, but
> > khugepaged continues to scan it, which wastes CPU time for no benefit,
> > and because of khugepaged_scan_sleep_millisecs (default 10s) scanning a
> > large number of such stale tasks makes the scans that could actually
> > collapse something happen much later.
> >
> > After applying this patch, when all memory is either SCAN_PMD_MAPPED or
> > SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
> > list. If a page fault occurs or MADV_HUGEPAGE is called again, the mm is
> > added back to khugepaged.
>
> I don't like that, as it assumes that memory within such a process would be
> rather static, which is easily not the case (e.g., allocators just doing
> MADV_DONTNEED to free memory).
>
> If most stuff is collapsed to PMDs already, can't we just skip over these
> regions a bit faster?
I had a flash of inspiration and came up with an idea.
If these regions have already been collapsed into hugepages, rechecking
them is very fast. Since khugepaged_pages_to_scan can also
represent the number of VMAs to skip, we can extend its semantics as
follows:
/*
 * default: scan 8*HPAGE_PMD_NR ptes (or that many already-mapped PMDs,
 * missing PTE tables, or VMAs) every 10 seconds.
 */
static unsigned int khugepaged_pages_to_scan __read_mostly;

switch (*result) {
case SCAN_NO_PTE_TABLE:
case SCAN_PMD_MAPPED:
case SCAN_PTE_MAPPED_HUGEPAGE:
	/* already huge or nothing to scan: charge a single unit */
	progress++;
	break;
case SCAN_SUCCEED:
	++khugepaged_pages_collapsed;
	fallthrough;
default:
	/* a real PTE-table scan: charge a full PMD's worth of pages */
	progress += HPAGE_PMD_NR;
}
This way we can achieve our goal. David, do you like it?
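To put rough numbers on it (my arithmetic, assuming x86-64 with 2 MiB
huge pages, so HPAGE_PMD_NR = 512 and the default budget above is
8 * 512 = 4096 units): today an already-collapsed PMD still costs
HPAGE_PMD_NR units of progress, so one wakeup skips at most 8 such PMDs
(16 MiB) before the next sleep; with the accounting above it costs one
unit, so one wakeup can skip up to 4096 already-collapsed PMDs, about
8 GiB of address space.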
> --
> Cheers
>
> David
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-19 8:35 ` Vernon Yang
@ 2025-12-19 8:55 ` David Hildenbrand (Red Hat)
2025-12-23 11:18 ` Dev Jain
1 sibling, 0 replies; 42+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-19 8:55 UTC (permalink / raw)
To: Vernon Yang
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 12/19/25 09:35, Vernon Yang wrote:
> On Thu, Dec 18, 2025 at 10:29:18AM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/15/25 10:04, Vernon Yang wrote:
>>> The following data is traced by bpftrace on a desktop system. After
>>> the system has been left idle for 10 minutes upon booting, a lot of
>>> SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
>>> khugepaged.
>>>
>>> @scan_pmd_status[1]: 1 ## SCAN_SUCCEED
>>> @scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
>>> @scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
>>> total progress size: 701 MB
>>> Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
>>>
>>> The khugepaged_scan list save all task that support collapse into hugepage,
>>> as long as the take is not destroyed, khugepaged will not remove it from
>>> the khugepaged_scan list. This exist a phenomenon where task has already
>>> collapsed all memory regions into hugepage, but khugepaged continues to
>>> scan it, which wastes CPU time and invalid, and due to
>>> khugepaged_scan_sleep_millisecs (default 10s) causes a long wait for
>>> scanning a large number of invalid task, so scanning really valid task
>>> is later.
>>>
>>> After applying this patch, when all memory is either SCAN_PMD_MAPPED or
>>> SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
>>> list. If the page fault or MADV_HUGEPAGE again, it is added back to
>>> khugepaged.
>>
>> I don't like that, as it assumes that memory within such a process would be
>> rather static, which is easily not the case (e.g., allocators just doing
>> MADV_DONTNEED to free memory).
>>
>> If most stuff is collapsed to PMDs already, can't we just skip over these
>> regions a bit faster?
>
> I had a flash of inspiration and came up with an idea.
>
> If these regions have already been collapsed into hugepages, rechecking
> them is very fast. Since khugepaged_pages_to_scan can also
> represent the number of VMAs to skip, we can extend its semantics as
> follows:
>
> /*
> * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
> * every 10 seconds.
> */
> static unsigned int khugepaged_pages_to_scan __read_mostly;
>
> switch (*result) {
> case SCAN_NO_PTE_TABLE:
> case SCAN_PMD_MAPPED:
> case SCAN_PTE_MAPPED_HUGEPAGE:
> progress++; // here
> break;
> case SCAN_SUCCEED:
> ++khugepaged_pages_collapsed;
> fallthrough;
> default:
> progress += HPAGE_PMD_NR;
> }
>
> This way we can achieve our goal. David, do you like it?
I'd have to see the full patch, but IMHO we should rather focus on
"how many pte/pmd entries did we check" and not "how many PMD areas did
we check".
Maybe there is a history to this, but conceptually I think we wanted to
limit the work we do in one operation to something reasonable. Reading a
single PMD is obviously faster than 512 PTEs.
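(For scale: on x86-64 a PMD entry is a single 8-byte read, while a
fully populated PTE table behind it holds 512 entries, a whole 4 KiB
page of page-table data, so a budget counted in entries examined tracks
the real work much more closely than counting PMD areas does.)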
--
Cheers
David
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-19 5:29 ` Vernon Yang
@ 2025-12-19 8:58 ` David Hildenbrand (Red Hat)
2025-12-21 2:10 ` Wei Yang
0 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-19 8:58 UTC (permalink / raw)
To: Vernon Yang
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 12/19/25 06:29, Vernon Yang wrote:
> On Thu, Dec 18, 2025 at 10:31:58AM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/15/25 10:04, Vernon Yang wrote:
>>> For example, create three tasks: hot1 -> cold -> hot2. After all three
>>> tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
>>> continuously access their 128 MB of memory, while the cold task only
>>> accesses its memory briefly and then calls madvise(MADV_COLD). However,
>>> khugepaged still prioritizes scanning the cold task and only scans the
>>> hot2 task after it has finished scanning the cold task.
>>>
>>> So if the user has explicitly informed us via MADV_COLD/FREE that this
>>> memory is cold or will be freed, it is appropriate for khugepaged to
>>> scan it only at the latest possible moment, thereby avoiding unnecessary
>>> scan and collapse operations and reducing CPU waste.
>>>
>>> Here are the performance test results:
>>> (Throughput bigger is better, other smaller is better)
>>>
>>> Testing on x86_64 machine:
>>>
>>> | task hot2 | without patch | with patch | delta |
>>> |---------------------|---------------|---------------|---------|
>>> | total accesses time | 3.14 sec | 2.92 sec | -7.01% |
>>> | cycles per access | 4.91 | 2.07 | -57.84% |
>>> | Throughput | 104.38 M/sec | 112.12 M/sec | +7.42% |
>>> | dTLB-load-misses | 288966432 | 1292908 | -99.55% |
>>>
>>> Testing on qemu-system-x86_64 -enable-kvm:
>>>
>>> | task hot2 | without patch | with patch | delta |
>>> |---------------------|---------------|---------------|---------|
>>> | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
>>> | cycles per access | 7.23 | 2.12 | -70.68% |
>>> | Throughput | 97.88 M/sec | 110.76 M/sec | +13.16% |
>>> | dTLB-load-misses | 237406497 | 3189194 | -98.66% |
>>
>> Again, I also don't like that because you make assumptions on a full process
>> based on some part of its address space.
>>
>> E.g., if a library issues a MADV_COLD on some part of the memory the library
>> manages, why should the remaining part of the process suffer as well?
>
> Yes, you make a good point, thanks!
>
>> This seems to be a heuristic focused on some specific workloads, no?
>
> Right.
>
> Could we use the VM_NOHUGEPAGE flag to indicate that this region should
> not be collapsed, so that khugepaged can simply skip this VMA during
> scanning? This way, it won't affect the remaining part of the task's
> memory regions.
I thought we would skip these regions already properly in khugepaged, or
maybe I misunderstood your question.
--
Cheers
David
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-19 5:24 ` Vernon Yang
@ 2025-12-19 9:00 ` David Hildenbrand (Red Hat)
0 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-19 9:00 UTC (permalink / raw)
To: Vernon Yang
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 12/19/25 06:24, Vernon Yang wrote:
> On Thu, Dec 18, 2025 at 10:29:18AM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/15/25 10:04, Vernon Yang wrote:
>>> The following data is traced by bpftrace on a desktop system. After
>>> the system has been left idle for 10 minutes upon booting, a lot of
>>> SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
>>> khugepaged.
>>>
>>> @scan_pmd_status[1]: 1 ## SCAN_SUCCEED
>>> @scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
>>> @scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
>>> total progress size: 701 MB
>>> Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
>>>
>>> The khugepaged_scan list saves all tasks that support collapsing into
>>> hugepages; as long as a task is not destroyed, khugepaged will not remove
>>> it from the khugepaged_scan list. This leads to a phenomenon where a task
>>> has already collapsed all of its memory regions into hugepages, but
>>> khugepaged continues to scan it, which wastes CPU time for no benefit,
>>> and because of khugepaged_scan_sleep_millisecs (default 10s) scanning a
>>> large number of such stale tasks makes the scans that could actually
>>> collapse something happen much later.
>>>
>>> After applying this patch, when all memory is either SCAN_PMD_MAPPED or
>>> SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
>>> list. If a page fault occurs or MADV_HUGEPAGE is called again, the mm is
>>> added back to khugepaged.
>>
>> I don't like that, as it assumes that memory within such a process would be
>> rather static, which is easily not the case (e.g., allocators just doing
>> MADV_DONTNEED to free memory).
>>
>> If most stuff is collapsed to PMDs already, can't we just skip over these
>> regions a bit faster?
>
> /* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 seconds */
> static unsigned int khugepaged_pages_to_scan __read_mostly;
>
> The observed phenomenon is that when scanning these regions, the loop is
> broken upon reaching khugepaged_pages_to_scan, and therefore
> khugepaged enters its 10s sleep.
BTW, the 10s sleep is ridiculous :)
I wonder whether we were more careful in the past regarding scanning
overhead due to the mmap read lock. Nowadays page faults typically use
per-vma locks, so I wonder whether the scanning overhead is still a
problem. (I assume there is more to optimize long-term.)
--
Cheers
David
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-19 8:58 ` David Hildenbrand (Red Hat)
@ 2025-12-21 2:10 ` Wei Yang
2025-12-21 4:25 ` Vernon Yang
0 siblings, 1 reply; 42+ messages in thread
From: Wei Yang @ 2025-12-21 2:10 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: Vernon Yang, akpm, lorenzo.stoakes, ziy, baohua, lance.yang,
linux-mm, linux-kernel, Vernon Yang
On Fri, Dec 19, 2025 at 09:58:17AM +0100, David Hildenbrand (Red Hat) wrote:
>On 12/19/25 06:29, Vernon Yang wrote:
>> On Thu, Dec 18, 2025 at 10:31:58AM +0100, David Hildenbrand (Red Hat) wrote:
>> > On 12/15/25 10:04, Vernon Yang wrote:
>> > > For example, create three tasks: hot1 -> cold -> hot2. After all three
>> > > tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
>> > > continuously access their 128 MB of memory, while the cold task only
>> > > accesses its memory briefly and then calls madvise(MADV_COLD). However,
>> > > khugepaged still prioritizes scanning the cold task and only scans the
>> > > hot2 task after it has finished scanning the cold task.
>> > >
>> > > So if the user has explicitly informed us via MADV_COLD/FREE that this
>> > > memory is cold or will be freed, it is appropriate for khugepaged to
>> > > scan it only at the latest possible moment, thereby avoiding unnecessary
>> > > scan and collapse operations and reducing CPU waste.
>> > >
>> > > Here are the performance test results:
>> > > (Throughput bigger is better, other smaller is better)
>> > >
>> > > Testing on x86_64 machine:
>> > >
>> > > | task hot2 | without patch | with patch | delta |
>> > > |---------------------|---------------|---------------|---------|
>> > > | total accesses time | 3.14 sec | 2.92 sec | -7.01% |
>> > > | cycles per access | 4.91 | 2.07 | -57.84% |
>> > > | Throughput | 104.38 M/sec | 112.12 M/sec | +7.42% |
>> > > | dTLB-load-misses | 288966432 | 1292908 | -99.55% |
>> > >
>> > > Testing on qemu-system-x86_64 -enable-kvm:
>> > >
>> > > | task hot2 | without patch | with patch | delta |
>> > > |---------------------|---------------|---------------|---------|
>> > > | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
>> > > | cycles per access | 7.23 | 2.12 | -70.68% |
>> > > | Throughput | 97.88 M/sec | 110.76 M/sec | +13.16% |
>> > > | dTLB-load-misses | 237406497 | 3189194 | -98.66% |
>> >
>> > Again, I also don't like that because you make assumptions on a full process
>> > based on some part of its address space.
>> >
>> > E.g., if a library issues a MADV_COLD on some part of the memory the library
>> > manages, why should the remaining part of the process suffer as well?
>>
>> Yes, you make a good point, thanks!
>>
>> > This seems to be a heuristic focused on some specific workloads, no?
>>
>> Right.
>>
>> Could we use the VM_NOHUGEPAGE flag to indicate that this region should
>> not be collapsed, so that khugepaged can simply skip this VMA during
>> scanning? This way, it won't affect the remaining part of the task's
>> memory regions.
>
>I thought we would skip these regions already properly in khugepaged, or
>maybe I misunderstood your question.
>
I think we should, but it seems we didn't do this for anonymous memory during
khugepaged.
We check the vma with thp_vma_allowable_order() during scan.
* For anonymous memory during khugepaged, if we always enable 2M collapse,
we will scan this vma even if VM_NOHUGEPAGE is set.
* For other cases, it looks good since __thp_vma_allowable_order() will skip
this vma with vma_thp_disabled().
>--
>Cheers
>
>David
--
Wei Yang
Help you, Help me
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-21 2:10 ` Wei Yang
@ 2025-12-21 4:25 ` Vernon Yang
2025-12-21 9:24 ` David Hildenbrand (Red Hat)
2025-12-21 12:38 ` Wei Yang
0 siblings, 2 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-21 4:25 UTC (permalink / raw)
To: Wei Yang, David Hildenbrand (Red Hat)
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On Sun, Dec 21, 2025 at 02:10:44AM +0000, Wei Yang wrote:
> On Fri, Dec 19, 2025 at 09:58:17AM +0100, David Hildenbrand (Red Hat) wrote:
> >On 12/19/25 06:29, Vernon Yang wrote:
> >> On Thu, Dec 18, 2025 at 10:31:58AM +0100, David Hildenbrand (Red Hat) wrote:
> >> > On 12/15/25 10:04, Vernon Yang wrote:
> >> > > For example, create three tasks: hot1 -> cold -> hot2. After all three
> >> > > tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
> >> > > continuously access their 128 MB of memory, while the cold task only
> >> > > accesses its memory briefly and then calls madvise(MADV_COLD). However,
> >> > > khugepaged still prioritizes scanning the cold task and only scans the
> >> > > hot2 task after it has finished scanning the cold task.
> >> > >
> >> > > So if the user has explicitly informed us via MADV_COLD/FREE that this
> >> > > memory is cold or will be freed, it is appropriate for khugepaged to
> >> > > scan it only at the latest possible moment, thereby avoiding unnecessary
> >> > > scan and collapse operations and reducing CPU waste.
> >> > >
> >> > > Here are the performance test results:
> >> > > (Throughput bigger is better, other smaller is better)
> >> > >
> >> > > Testing on x86_64 machine:
> >> > >
> >> > > | task hot2 | without patch | with patch | delta |
> >> > > |---------------------|---------------|---------------|---------|
> >> > > | total accesses time | 3.14 sec | 2.92 sec | -7.01% |
> >> > > | cycles per access | 4.91 | 2.07 | -57.84% |
> >> > > | Throughput | 104.38 M/sec | 112.12 M/sec | +7.42% |
> >> > > | dTLB-load-misses | 288966432 | 1292908 | -99.55% |
> >> > >
> >> > > Testing on qemu-system-x86_64 -enable-kvm:
> >> > >
> >> > > | task hot2 | without patch | with patch | delta |
> >> > > |---------------------|---------------|---------------|---------|
> >> > > | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
> >> > > | cycles per access | 7.23 | 2.12 | -70.68% |
> >> > > | Throughput | 97.88 M/sec | 110.76 M/sec | +13.16% |
> >> > > | dTLB-load-misses | 237406497 | 3189194 | -98.66% |
> >> >
> >> > Again, I also don't like that because you make assumptions on a full process
> >> > based on some part of its address space.
> >> >
> >> > E.g., if a library issues a MADV_COLD on some part of the memory the library
> >> > manages, why should the remaining part of the process suffer as well?
> >>
> >> Yes, you make a good point, thanks!
> >>
> >> > This seems to be a heuristic focused on some specific workloads, no?
> >>
> >> Right.
> >>
> >> Could we use the VM_NOHUGEPAGE flag to indicate that this region should
> >> not be collapsed, so that khugepaged can simply skip this VMA during
> >> scanning? This way, it won't affect the remaining part of the task's
> >> memory regions.
> >
> >I thought we would skip these regions already properly in khugepaged, or
> >maybe I misunderstood your question.
> >
>
> I think we should, but it seems we didn't do this for anonymous memory during
> khugepaged.
>
> We check the vma with thp_vma_allowable_order() during scan.
>
> * For anonymous memory during khugepaged, if we always enable 2M collapse,
> we will scan this vma even if VM_NOHUGEPAGE is set.
>
> * For other cases, it looks good since __thp_vma_allowable_order() will skip
> this vma with vma_thp_disabled().
Hi David, Wei,
khugepaged already checks the VM_NOHUGEPAGE flag for anonymous
memory during the scan, as below:
khugepaged_scan_mm_slot()
thp_vma_allowable_order()
thp_vma_allowable_orders()
__thp_vma_allowable_orders()
vma_thp_disabled() {
if (vm_flags & VM_NOHUGEPAGE)
return true;
}
REAL ISSUE: madvise(MADV_COLD) does not set the VM_NOHUGEPAGE flag on the
vma, so khugepaged will continue to scan this vma.
I set the VM_NOHUGEPAGE flag on the vma in madvise(MADV_COLD), and the test
was successful. I will send it in the next version.
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-21 4:25 ` Vernon Yang
@ 2025-12-21 9:24 ` David Hildenbrand (Red Hat)
2025-12-21 12:34 ` Vernon Yang
2025-12-21 12:38 ` Wei Yang
1 sibling, 1 reply; 42+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-21 9:24 UTC (permalink / raw)
To: Vernon Yang, Wei Yang
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 12/21/25 05:25, Vernon Yang wrote:
> On Sun, Dec 21, 2025 at 02:10:44AM +0000, Wei Yang wrote:
>> On Fri, Dec 19, 2025 at 09:58:17AM +0100, David Hildenbrand (Red Hat) wrote:
>>> On 12/19/25 06:29, Vernon Yang wrote:
>>>> On Thu, Dec 18, 2025 at 10:31:58AM +0100, David Hildenbrand (Red Hat) wrote:
>>>>> On 12/15/25 10:04, Vernon Yang wrote:
>>>>>> For example, create three tasks: hot1 -> cold -> hot2. After all three
>>>>>> tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
>>>>>> continuously access their 128 MB of memory, while the cold task only
>>>>>> accesses its memory briefly and then calls madvise(MADV_COLD). However,
>>>>>> khugepaged still prioritizes scanning the cold task and only scans the
>>>>>> hot2 task after it has finished scanning the cold task.
>>>>>>
>>>>>> So if the user has explicitly informed us via MADV_COLD/FREE that this
>>>>>> memory is cold or will be freed, it is appropriate for khugepaged to
>>>>>> scan it only at the latest possible moment, thereby avoiding unnecessary
>>>>>> scan and collapse operations and reducing CPU waste.
>>>>>>
>>>>>> Here are the performance test results:
>>>>>> (Throughput bigger is better, other smaller is better)
>>>>>>
>>>>>> Testing on x86_64 machine:
>>>>>>
>>>>>> | task hot2 | without patch | with patch | delta |
>>>>>> |---------------------|---------------|---------------|---------|
>>>>>> | total accesses time | 3.14 sec | 2.92 sec | -7.01% |
>>>>>> | cycles per access | 4.91 | 2.07 | -57.84% |
>>>>>> | Throughput | 104.38 M/sec | 112.12 M/sec | +7.42% |
>>>>>> | dTLB-load-misses | 288966432 | 1292908 | -99.55% |
>>>>>>
>>>>>> Testing on qemu-system-x86_64 -enable-kvm:
>>>>>>
>>>>>> | task hot2 | without patch | with patch | delta |
>>>>>> |---------------------|---------------|---------------|---------|
>>>>>> | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
>>>>>> | cycles per access | 7.23 | 2.12 | -70.68% |
>>>>>> | Throughput | 97.88 M/sec | 110.76 M/sec | +13.16% |
>>>>>> | dTLB-load-misses | 237406497 | 3189194 | -98.66% |
>>>>>
>>>>> Again, I also don't like that because you make assumptions on a full process
>>>>> based on some part of its address space.
>>>>>
>>>>> E.g., if a library issues a MADV_COLD on some part of the memory the library
>>>>> manages, why should the remaining part of the process suffer as well?
>>>>
>>>> Yes, you make a good point, thanks!
>>>>
>>>>> This seems to be a heuristic focused on some specific workloads, no?
>>>>
>>>> Right.
>>>>
>>>> Could we use the VM_NOHUGEPAGE flag to indicate that this region should
>>>> not be collapsed, so that khugepaged can simply skip this VMA during
>>>> scanning? This way, it won't affect the remaining part of the task's
>>>> memory regions.
>>>
>>> I thought we would skip these regions already properly in khugepaged, or
>>> maybe I misunderstood your question.
>>>
>>
>> I think we should, but it seems we didn't do this for anonymous memory during
>> khugepaged.
>>
>> We check the vma with thp_vma_allowable_order() during scan.
>>
>> * For anonymous memory during khugepaged, if we always enable 2M collapse,
>> we will scan this vma even if VM_NOHUGEPAGE is set.
>>
>> * For other cases, it looks good since __thp_vma_allowable_order() will skip
>> this vma with vma_thp_disabled().
>
> Hi David, Wei,
>
> khugepaged already checks the VM_NOHUGEPAGE flag for anonymous
> memory during the scan, as below:
>
> khugepaged_scan_mm_slot()
> thp_vma_allowable_order()
> thp_vma_allowable_orders()
> __thp_vma_allowable_orders()
> vma_thp_disabled() {
> if (vm_flags & VM_NOHUGEPAGE)
> return true;
> }
>
> REAL ISSUE: madvise(MADV_COLD) does not set the VM_NOHUGEPAGE flag on the
> vma, so khugepaged will continue to scan this vma.
>
> I set the VM_NOHUGEPAGE flag on the vma in madvise(MADV_COLD), and the test
> was successful. I will send it in the next version.
No we must not do that. That's a user-space visible change. :/
--
Cheers
David
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-21 9:24 ` David Hildenbrand (Red Hat)
@ 2025-12-21 12:34 ` Vernon Yang
2025-12-23 9:59 ` David Hildenbrand (Red Hat)
0 siblings, 1 reply; 42+ messages in thread
From: Vernon Yang @ 2025-12-21 12:34 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: Wei Yang, akpm, lorenzo.stoakes, ziy, baohua, lance.yang,
linux-mm, linux-kernel, Vernon Yang
On Sun, Dec 21, 2025 at 10:24:11AM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/21/25 05:25, Vernon Yang wrote:
> > On Sun, Dec 21, 2025 at 02:10:44AM +0000, Wei Yang wrote:
> > > On Fri, Dec 19, 2025 at 09:58:17AM +0100, David Hildenbrand (Red Hat) wrote:
> > > > On 12/19/25 06:29, Vernon Yang wrote:
> > > > > On Thu, Dec 18, 2025 at 10:31:58AM +0100, David Hildenbrand (Red Hat) wrote:
> > > > > > On 12/15/25 10:04, Vernon Yang wrote:
> > > > > > > For example, create three tasks: hot1 -> cold -> hot2. After all three
> > > > > > > tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
> > > > > > > continuously access their 128 MB of memory, while the cold task only
> > > > > > > accesses its memory briefly and then calls madvise(MADV_COLD). However,
> > > > > > > khugepaged still prioritizes scanning the cold task and only scans the
> > > > > > > hot2 task after it has finished scanning the cold task.
> > > > > > >
> > > > > > > So if the user has explicitly informed us via MADV_COLD/FREE that this
> > > > > > > memory is cold or will be freed, it is appropriate for khugepaged to
> > > > > > > scan it only at the latest possible moment, thereby avoiding unnecessary
> > > > > > > scan and collapse operations and reducing CPU waste.
> > > > > > >
> > > > > > > Here are the performance test results:
> > > > > > > (Throughput bigger is better, other smaller is better)
> > > > > > >
> > > > > > > Testing on x86_64 machine:
> > > > > > >
> > > > > > > | task hot2 | without patch | with patch | delta |
> > > > > > > |---------------------|---------------|---------------|---------|
> > > > > > > | total accesses time | 3.14 sec | 2.92 sec | -7.01% |
> > > > > > > | cycles per access | 4.91 | 2.07 | -57.84% |
> > > > > > > | Throughput | 104.38 M/sec | 112.12 M/sec | +7.42% |
> > > > > > > | dTLB-load-misses | 288966432 | 1292908 | -99.55% |
> > > > > > >
> > > > > > > Testing on qemu-system-x86_64 -enable-kvm:
> > > > > > >
> > > > > > > | task hot2 | without patch | with patch | delta |
> > > > > > > |---------------------|---------------|---------------|---------|
> > > > > > > | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
> > > > > > > | cycles per access | 7.23 | 2.12 | -70.68% |
> > > > > > > | Throughput | 97.88 M/sec | 110.76 M/sec | +13.16% |
> > > > > > > | dTLB-load-misses | 237406497 | 3189194 | -98.66% |
> > > > > >
> > > > > > Again, I also don't like that because you make assumptions on a full process
> > > > > > based on some part of its address space.
> > > > > >
> > > > > > E.g., if a library issues a MADV_COLD on some part of the memory the library
> > > > > > manages, why should the remaining part of the process suffer as well?
> > > > >
> > > > > Yes, you make a good point, thanks!
> > > > >
> > > > > > This seems to be a heuristic focused on some specific workloads, no?
> > > > >
> > > > > Right.
> > > > >
> > > > > Could we use the VM_NOHUGEPAGE flag to indicate that this region should
> > > > > not be collapsed, so that khugepaged can simply skip this VMA during
> > > > > scanning? This way, it won't affect the remaining part of the task's
> > > > > memory regions.
> > > >
> > > > I thought we would skip these regions already properly in khugepaged, or
> > > > maybe I misunderstood your question.
> > > >
> > >
> > > I think we should, but it seems we didn't do this for anonymous memory during
> > > khugepaged.
> > >
> > > We check the vma with thp_vma_allowable_order() during scan.
> > >
> > > * For anonymous memory during khugepaged, if we always enable 2M collapse,
> > > we will scan this vma even if VM_NOHUGEPAGE is set.
> > >
> > > * For other cases, it looks good since __thp_vma_allowable_order() will skip
> > > this vma with vma_thp_disabled().
> >
> > Hi David, Wei,
> >
> > khugepaged already checks the VM_NOHUGEPAGE flag for anonymous
> > memory during the scan, as below:
> >
> > khugepaged_scan_mm_slot()
> > thp_vma_allowable_order()
> > thp_vma_allowable_orders()
> > __thp_vma_allowable_orders()
> > vma_thp_disabled() {
> > if (vm_flags & VM_NOHUGEPAGE)
> > return true;
> > }
> >
> > REAL ISSUE: madvise(MADV_COLD) does not set the VM_NOHUGEPAGE flag on the
> > vma, so khugepaged will continue to scan this vma.
> >
> > I set the VM_NOHUGEPAGE flag on the vma in madvise(MADV_COLD), and the test
> > was successful. I will send it in the next version.
>
> No we must not do that. That's a user-space visible change. :/
David, what ideas do you have for achieving this goal? Please let me
know, thanks!
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-21 4:25 ` Vernon Yang
2025-12-21 9:24 ` David Hildenbrand (Red Hat)
@ 2025-12-21 12:38 ` Wei Yang
1 sibling, 0 replies; 42+ messages in thread
From: Wei Yang @ 2025-12-21 12:38 UTC (permalink / raw)
To: Vernon Yang
Cc: Wei Yang, David Hildenbrand (Red Hat),
akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On Sun, Dec 21, 2025 at 12:25:44PM +0800, Vernon Yang wrote:
>On Sun, Dec 21, 2025 at 02:10:44AM +0000, Wei Yang wrote:
>> On Fri, Dec 19, 2025 at 09:58:17AM +0100, David Hildenbrand (Red Hat) wrote:
>> >On 12/19/25 06:29, Vernon Yang wrote:
>> >> On Thu, Dec 18, 2025 at 10:31:58AM +0100, David Hildenbrand (Red Hat) wrote:
>> >> > On 12/15/25 10:04, Vernon Yang wrote:
>> >> > > For example, create three tasks: hot1 -> cold -> hot2. After all three
>> >> > > tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
>> >> > > continuously access their 128 MB of memory, while the cold task only
>> >> > > accesses its memory briefly and then calls madvise(MADV_COLD). However,
>> >> > > khugepaged still prioritizes scanning the cold task and only scans the
>> >> > > hot2 task after it has finished scanning the cold task.
>> >> > >
>> >> > > So if the user has explicitly informed us via MADV_COLD/FREE that this
>> >> > > memory is cold or will be freed, it is appropriate for khugepaged to
>> >> > > scan it only at the latest possible moment, thereby avoiding unnecessary
>> >> > > scan and collapse operations and reducing CPU waste.
>> >> > >
>> >> > > Here are the performance test results:
>> >> > > (Throughput bigger is better, other smaller is better)
>> >> > >
>> >> > > Testing on x86_64 machine:
>> >> > >
>> >> > > | task hot2 | without patch | with patch | delta |
>> >> > > |---------------------|---------------|---------------|---------|
>> >> > > | total accesses time | 3.14 sec | 2.92 sec | -7.01% |
>> >> > > | cycles per access | 4.91 | 2.07 | -57.84% |
>> >> > > | Throughput | 104.38 M/sec | 112.12 M/sec | +7.42% |
>> >> > > | dTLB-load-misses | 288966432 | 1292908 | -99.55% |
>> >> > >
>> >> > > Testing on qemu-system-x86_64 -enable-kvm:
>> >> > >
>> >> > > | task hot2 | without patch | with patch | delta |
>> >> > > |---------------------|---------------|---------------|---------|
>> >> > > | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
>> >> > > | cycles per access | 7.23 | 2.12 | -70.68% |
>> >> > > | Throughput | 97.88 M/sec | 110.76 M/sec | +13.16% |
>> >> > > | dTLB-load-misses | 237406497 | 3189194 | -98.66% |
>> >> >
>> >> > Again, I also don't like that because you make assumptions on a full process
>> >> > based on some part of its address space.
>> >> >
>> >> > E.g., if a library issues a MADV_COLD on some part of the memory the library
>> >> > manages, why should the remaining part of the process suffer as well?
>> >>
>> >> Yes, you make a good point, thanks!
>> >>
>> >> > This seems to be a heuristic focused on some specific workloads, no?
>> >>
>> >> Right.
>> >>
>> >> Could we use the VM_NOHUGEPAGE flag to indicate that this region should
>> >> not be collapsed, so that khugepaged can simply skip this VMA during
>> >> scanning? This way, it won't affect the remaining part of the task's
>> >> memory regions.
>> >
>> >I thought we would skip these regions already properly in khugepaged, or
>> >maybe I misunderstood your question.
>> >
>>
>> I think we should, but it seems we didn't do this for anonymous memory during
>> khugepaged.
>>
>> We check the vma with thp_vma_allowable_order() during scan.
>>
>> * For anonymous memory during khugepaged, if we always enable 2M collapse,
>> we will scan this vma even if VM_NOHUGEPAGE is set.
>>
>> * For other cases, it looks good since __thp_vma_allowable_order() will skip
>> this vma with vma_thp_disabled().
>
>Hi David, Wei,
>
>khugepaged already checks the VM_NOHUGEPAGE flag for anonymous
>memory during the scan, as below:
>
>khugepaged_scan_mm_slot()
> thp_vma_allowable_order()
> thp_vma_allowable_orders()
Oops, you are right. It only bypasses __thp_vma_allowable_order() if
orders is 0.
> __thp_vma_allowable_orders()
> vma_thp_disabled() {
> if (vm_flags & VM_NOHUGEPAGE)
> return true;
> }
>
>REAL ISSUE: madvise(MADV_COLD) does not set the VM_NOHUGEPAGE flag on the
>vma, so khugepaged will continue to scan this vma.
>
>I set the VM_NOHUGEPAGE flag on the vma in madvise(MADV_COLD), and the test
>was successful. I will send it in the next version.
>
>--
>Thanks,
>Vernon
--
Wei Yang
Help you, Help me
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-15 9:04 ` [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed Vernon Yang
` (4 preceding siblings ...)
2025-12-18 9:29 ` David Hildenbrand (Red Hat)
@ 2025-12-22 19:00 ` kernel test robot
5 siblings, 0 replies; 42+ messages in thread
From: kernel test robot @ 2025-12-22 19:00 UTC (permalink / raw)
To: Vernon Yang, akpm, david, lorenzo.stoakes
Cc: oe-kbuild-all, ziy, npache, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
Hi Vernon,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.19-rc2 next-20251219]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Vernon-Yang/mm-khugepaged-add-trace_mm_khugepaged_scan-event/20251215-171046
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20251215090419.174418-3-yanglincheng%40kylinos.cn
patch subject: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
config: x86_64-rhel-9.4 (https://download.01.org/0day-ci/archive/20251222/202512221928.EnLvUgqT-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251222/202512221928.EnLvUgqT-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512221928.EnLvUgqT-lkp@intel.com/
All errors (new ones prefixed by >>):
mm/khugepaged.c: In function 'khugepaged_scan_mm_slot':
>> mm/khugepaged.c:2490:30: error: 'SCAN_PMD_NULL' undeclared (first use in this function); did you mean 'SCAN_VMA_NULL'?
2490 | case SCAN_PMD_NULL:
| ^~~~~~~~~~~~~
| SCAN_VMA_NULL
mm/khugepaged.c:2490:30: note: each undeclared identifier is reported only once for each function it appears in
>> mm/khugepaged.c:2491:30: error: 'SCAN_PMD_NONE' undeclared (first use in this function)
2491 | case SCAN_PMD_NONE:
| ^~~~~~~~~~~~~
vim +2490 mm/khugepaged.c
2392
2393 static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
2394 struct collapse_control *cc)
2395 __releases(&khugepaged_mm_lock)
2396 __acquires(&khugepaged_mm_lock)
2397 {
2398 struct vma_iterator vmi;
2399 struct mm_slot *slot;
2400 struct mm_struct *mm;
2401 struct vm_area_struct *vma;
2402 int progress = 0;
2403
2404 VM_BUG_ON(!pages);
2405 lockdep_assert_held(&khugepaged_mm_lock);
2406 *result = SCAN_FAIL;
2407
2408 if (khugepaged_scan.mm_slot) {
2409 slot = khugepaged_scan.mm_slot;
2410 } else {
2411 slot = list_first_entry(&khugepaged_scan.mm_head,
2412 struct mm_slot, mm_node);
2413 khugepaged_scan.address = 0;
2414 khugepaged_scan.mm_slot = slot;
2415 khugepaged_scan.maybe_collapse = false;
2416 }
2417 spin_unlock(&khugepaged_mm_lock);
2418
2419 mm = slot->mm;
2420 /*
2421 * Don't wait for semaphore (to avoid long wait times). Just move to
2422 * the next mm on the list.
2423 */
2424 vma = NULL;
2425 if (unlikely(!mmap_read_trylock(mm)))
2426 goto breakouterloop_mmap_lock;
2427
2428 progress++;
2429 if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
2430 goto breakouterloop;
2431
2432 vma_iter_init(&vmi, mm, khugepaged_scan.address);
2433 for_each_vma(vmi, vma) {
2434 unsigned long hstart, hend;
2435
2436 cond_resched();
2437 if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
2438 progress++;
2439 break;
2440 }
2441 if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
2442 skip:
2443 progress++;
2444 continue;
2445 }
2446 hstart = round_up(vma->vm_start, HPAGE_PMD_SIZE);
2447 hend = round_down(vma->vm_end, HPAGE_PMD_SIZE);
2448 if (khugepaged_scan.address > hend)
2449 goto skip;
2450 if (khugepaged_scan.address < hstart)
2451 khugepaged_scan.address = hstart;
2452 VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
2453
2454 while (khugepaged_scan.address < hend) {
2455 bool mmap_locked = true;
2456
2457 cond_resched();
2458 if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
2459 goto breakouterloop;
2460
2461 VM_BUG_ON(khugepaged_scan.address < hstart ||
2462 khugepaged_scan.address + HPAGE_PMD_SIZE >
2463 hend);
2464 if (!vma_is_anonymous(vma)) {
2465 struct file *file = get_file(vma->vm_file);
2466 pgoff_t pgoff = linear_page_index(vma,
2467 khugepaged_scan.address);
2468
2469 mmap_read_unlock(mm);
2470 mmap_locked = false;
2471 *result = hpage_collapse_scan_file(mm,
2472 khugepaged_scan.address, file, pgoff, cc);
2473 fput(file);
2474 if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
2475 mmap_read_lock(mm);
2476 if (hpage_collapse_test_exit_or_disable(mm))
2477 goto breakouterloop;
2478 *result = collapse_pte_mapped_thp(mm,
2479 khugepaged_scan.address, false);
2480 if (*result == SCAN_PMD_MAPPED)
2481 *result = SCAN_SUCCEED;
2482 mmap_read_unlock(mm);
2483 }
2484 } else {
2485 *result = hpage_collapse_scan_pmd(mm, vma,
2486 khugepaged_scan.address, &mmap_locked, cc);
2487 }
2488
2489 switch (*result) {
> 2490 case SCAN_PMD_NULL:
> 2491 case SCAN_PMD_NONE:
2492 case SCAN_PMD_MAPPED:
2493 case SCAN_PTE_MAPPED_HUGEPAGE:
2494 break;
2495 case SCAN_SUCCEED:
2496 ++khugepaged_pages_collapsed;
2497 fallthrough;
2498 default:
2499 khugepaged_scan.maybe_collapse = true;
2500 }
2501
2502 /* move to next address */
2503 khugepaged_scan.address += HPAGE_PMD_SIZE;
2504 progress += HPAGE_PMD_NR;
2505 if (!mmap_locked)
2506 /*
2507 * We released mmap_lock so break loop. Note
2508 * that we drop mmap_lock before all hugepage
2509 * allocations, so if allocation fails, we are
2510 * guaranteed to break here and report the
2511 * correct result back to caller.
2512 */
2513 goto breakouterloop_mmap_lock;
2514 if (progress >= pages)
2515 goto breakouterloop;
2516 }
2517 }
2518 breakouterloop:
2519 mmap_read_unlock(mm); /* exit_mmap will destroy ptes after this */
2520 breakouterloop_mmap_lock:
2521
2522 spin_lock(&khugepaged_mm_lock);
2523 VM_BUG_ON(khugepaged_scan.mm_slot != slot);
2524 /*
2525 * Release the current mm_slot if this mm is about to die, or
2526 * if we scanned all vmas of this mm.
2527 */
2528 if (hpage_collapse_test_exit(mm) || !vma) {
2529 bool maybe_collapse = khugepaged_scan.maybe_collapse;
2530
2531 if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
2532 maybe_collapse = true;
2533
2534 /*
2535 * Make sure that if mm_users is reaching zero while
2536 * khugepaged runs here, khugepaged_exit will find
2537 * mm_slot not pointing to the exiting mm.
2538 */
2539 if (!list_is_last(&slot->mm_node, &khugepaged_scan.mm_head)) {
2540 khugepaged_scan.mm_slot = list_next_entry(slot, mm_node);
2541 khugepaged_scan.address = 0;
2542 khugepaged_scan.maybe_collapse = false;
2543 } else {
2544 khugepaged_scan.mm_slot = NULL;
2545 khugepaged_full_scans++;
2546 }
2547
2548 collect_mm_slot(slot, maybe_collapse);
2549 }
2550
2551 trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
2552
2553 return progress;
2554 }
2555
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-21 12:34 ` Vernon Yang
@ 2025-12-23 9:59 ` David Hildenbrand (Red Hat)
2025-12-25 15:12 ` Vernon Yang
0 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-23 9:59 UTC (permalink / raw)
To: Vernon Yang
Cc: Wei Yang, akpm, lorenzo.stoakes, ziy, baohua, lance.yang,
linux-mm, linux-kernel, Vernon Yang
On 12/21/25 13:34, Vernon Yang wrote:
> On Sun, Dec 21, 2025 at 10:24:11AM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/21/25 05:25, Vernon Yang wrote:
>>> On Sun, Dec 21, 2025 at 02:10:44AM +0000, Wei Yang wrote:
>>>> On Fri, Dec 19, 2025 at 09:58:17AM +0100, David Hildenbrand (Red Hat) wrote:
>>>>> On 12/19/25 06:29, Vernon Yang wrote:
>>>>>> On Thu, Dec 18, 2025 at 10:31:58AM +0100, David Hildenbrand (Red Hat) wrote:
>>>>>>> On 12/15/25 10:04, Vernon Yang wrote:
>>>>>>>> For example, create three tasks: hot1 -> cold -> hot2. After all three
>>>>>>>> tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
>>>>>>>> continuously access their 128 MB of memory, while the cold task only
>>>>>>>> accesses its memory briefly and then calls madvise(MADV_COLD). However,
>>>>>>>> khugepaged still prioritizes scanning the cold task and only scans the
>>>>>>>> hot2 task after it has finished scanning the cold task.
>>>>>>>>
>>>>>>>> So if the user has explicitly informed us via MADV_COLD/FREE that this
>>>>>>>> memory is cold or will be freed, it is appropriate for khugepaged to
>>>>>>>> scan it only at the latest possible moment, thereby avoiding unnecessary
>>>>>>>> scan and collapse operations and reducing CPU waste.
>>>>>>>>
>>>>>>>> Here are the performance test results:
>>>>>>>> (Throughput bigger is better, other smaller is better)
>>>>>>>>
>>>>>>>> Testing on x86_64 machine:
>>>>>>>>
>>>>>>>> | task hot2 | without patch | with patch | delta |
>>>>>>>> |---------------------|---------------|---------------|---------|
>>>>>>>> | total accesses time | 3.14 sec | 2.92 sec | -7.01% |
>>>>>>>> | cycles per access | 4.91 | 2.07 | -57.84% |
>>>>>>>> | Throughput | 104.38 M/sec | 112.12 M/sec | +7.42% |
>>>>>>>> | dTLB-load-misses | 288966432 | 1292908 | -99.55% |
>>>>>>>>
>>>>>>>> Testing on qemu-system-x86_64 -enable-kvm:
>>>>>>>>
>>>>>>>> | task hot2 | without patch | with patch | delta |
>>>>>>>> |---------------------|---------------|---------------|---------|
>>>>>>>> | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
>>>>>>>> | cycles per access | 7.23 | 2.12 | -70.68% |
>>>>>>>> | Throughput | 97.88 M/sec | 110.76 M/sec | +13.16% |
>>>>>>>> | dTLB-load-misses | 237406497 | 3189194 | -98.66% |
>>>>>>>
>>>>>>> Again, I also don't like that because you make assumptions on a full process
>>>>>>> based on some part of its address space.
>>>>>>>
>>>>>>> E.g., if a library issues a MADV_COLD on some part of the memory the library
>>>>>>> manages, why should the remaining part of the process suffer as well?
>>>>>>
>>>>>> Yes, you make a good point, thanks!
>>>>>>
>>>>>>> This seems to be a heuristic focused on some specific workloads, no?
>>>>>>
>>>>>> Right.
>>>>>>
>>>>>> Could we use the VM_NOHUGEPAGE flag to indicate that this region should
>>>>>> not be collapsed, so that khugepaged can simply skip this VMA during
>>>>>> scanning? This way, it won't affect the remaining part of the task's
>>>>>> memory regions.
>>>>>
>>>>> I thought we would skip these regions already properly in khugepaged, or
>>>>> maybe I misunderstood your question.
>>>>>
>>>>
>>>> I think we should, but it seems we didn't do this for anonymous memory during
>>>> khugepaged.
>>>>
>>>> We check the vma with thp_vma_allowable_order() during scan.
>>>>
>>>> * For anonymous memory during khugepaged, if we always enable 2M collapse,
>>>> we will scan this vma even if VM_NOHUGEPAGE is set.
>>>>
>>>> * For other cases, it looks good since __thp_vma_allowable_order() will skip
>>>> this vma with vma_thp_disabled().
>>>
>>> Hi David, Wei,
>>>
>>> khugepaged already checks the VM_NOHUGEPAGE flag for anonymous
>>> memory during the scan, as below:
>>>
>>> khugepaged_scan_mm_slot()
>>> thp_vma_allowable_order()
>>> thp_vma_allowable_orders()
>>> __thp_vma_allowable_orders()
>>> vma_thp_disabled() {
>>> if (vm_flags & VM_NOHUGEPAGE)
>>> return true;
>>> }
>>>
>>> REAL ISSUE: madvise(MADV_COLD) does not set the VM_NOHUGEPAGE flag on the
>>> vma, so khugepaged will continue to scan this vma.
>>>
>>> I set the VM_NOHUGEPAGE flag on the vma in madvise(MADV_COLD), and the test
>>> was successful. I will send it in the next version.
>>
>> No we must not do that. That's a user-space visible change. :/
>
> David, what ideas do you have for achieving this goal? Please let me
> know, thanks!
Your idea would be to skip a VMA when we issue madvise(MADV_COLD).
That sounds like yet another heuristic that can easily be wrong? :/
In particular, imagine if the VMA is much larger than the madvise'd
region (other parts used for something else) or if the previously cold
memory area is used for something that is now hot.
With memory allocators that manage most of the memory in a single large
VMA, it's rather easy to see how such a heuristic would be bad, no?
--
Cheers
David
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-19 8:35 ` Vernon Yang
2025-12-19 8:55 ` David Hildenbrand (Red Hat)
@ 2025-12-23 11:18 ` Dev Jain
2025-12-25 16:07 ` Vernon Yang
2025-12-29 6:02 ` Vernon Yang
1 sibling, 2 replies; 42+ messages in thread
From: Dev Jain @ 2025-12-23 11:18 UTC (permalink / raw)
To: Vernon Yang, David Hildenbrand (Red Hat)
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 19/12/25 2:05 pm, Vernon Yang wrote:
> On Thu, Dec 18, 2025 at 10:29:18AM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/15/25 10:04, Vernon Yang wrote:
>>> The following data is traced by bpftrace on a desktop system. After
>>> the system has been left idle for 10 minutes upon booting, a lot of
>>> SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
>>> khugepaged.
>>>
>>> @scan_pmd_status[1]: 1 ## SCAN_SUCCEED
>>> @scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
>>> @scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
>>> total progress size: 701 MB
>>> Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
>>>
>>> The khugepaged_scan list saves all tasks that support collapsing into
>>> hugepages; as long as a task is not destroyed, khugepaged will not remove
>>> it from the khugepaged_scan list. This leads to a phenomenon where a task
>>> has already collapsed all of its memory regions into hugepages, but
>>> khugepaged continues to scan it, which wastes CPU time for no benefit,
>>> and because of khugepaged_scan_sleep_millisecs (default 10s) scanning a
>>> large number of such stale tasks makes the scans that could actually
>>> collapse something happen much later.
>>>
>>> After applying this patch, when all memory is either SCAN_PMD_MAPPED or
>>> SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
>>> list. If a page fault occurs or MADV_HUGEPAGE is called again, the mm is
>>> added back to khugepaged.
>> I don't like that, as it assumes that memory within such a process would be
>> rather static, which is easily not the case (e.g., allocators just doing
>> MADV_DONTNEED to free memory).
>>
>> If most stuff is collapsed to PMDs already, can't we just skip over these
>> regions a bit faster?
> I had a flash of inspiration and came up with an idea.
>
> If these regions have already been collapsed into hugepages, rechecking
> them is very fast. Since khugepaged_pages_to_scan can also
> represent the number of VMAs to skip, we can extend its semantics as
> follows:
>
> /*
> * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
> * every 10 seconds.
> */
> static unsigned int khugepaged_pages_to_scan __read_mostly;
>
> switch (*result) {
> case SCAN_NO_PTE_TABLE:
> case SCAN_PMD_MAPPED:
> case SCAN_PTE_MAPPED_HUGEPAGE:
> progress++; // here
> break;
> case SCAN_SUCCEED:
> ++khugepaged_pages_collapsed;
> fallthrough;
> default:
> progress += HPAGE_PMD_NR;
> }
>
> This way we can achieve our goal. David, do you like it?
This looks good. Can you formally test this and see whether it comes close
to the optimizations yielded by the current version of the patchset?
>
>> --
>> Cheers
>>
>> David
> --
> Thanks,
> Vernon
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-23 9:59 ` David Hildenbrand (Red Hat)
@ 2025-12-25 15:12 ` Vernon Yang
0 siblings, 0 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-25 15:12 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: Wei Yang, akpm, lorenzo.stoakes, ziy, baohua, lance.yang,
linux-mm, linux-kernel, Vernon Yang
On Tue, Dec 23, 2025 at 10:59:29AM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/21/25 13:34, Vernon Yang wrote:
> > On Sun, Dec 21, 2025 at 10:24:11AM +0100, David Hildenbrand (Red Hat) wrote:
> > > On 12/21/25 05:25, Vernon Yang wrote:
> > > > On Sun, Dec 21, 2025 at 02:10:44AM +0000, Wei Yang wrote:
> > > > > On Fri, Dec 19, 2025 at 09:58:17AM +0100, David Hildenbrand (Red Hat) wrote:
> > > > > > On 12/19/25 06:29, Vernon Yang wrote:
> > > > > > > On Thu, Dec 18, 2025 at 10:31:58AM +0100, David Hildenbrand (Red Hat) wrote:
> > > > > > > > On 12/15/25 10:04, Vernon Yang wrote:
> > > > > > > > > For example, create three tasks: hot1 -> cold -> hot2. After all three
> > > > > > > > > tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
> > > > > > > > > continuously access their 128 MB of memory, while the cold task only
> > > > > > > > > accesses its memory briefly and then calls madvise(MADV_COLD). However,
> > > > > > > > > khugepaged still prioritizes scanning the cold task and only scans the
> > > > > > > > > hot2 task after it has finished scanning the cold task.
> > > > > > > > >
> > > > > > > > > So if the user has explicitly informed us via MADV_COLD/FREE that this
> > > > > > > > > memory is cold or will be freed, it is appropriate for khugepaged to
> > > > > > > > > scan it only at the latest possible moment, thereby avoiding unnecessary
> > > > > > > > > scan and collapse operations and reducing CPU waste.
> > > > > > > > >
> > > > > > > > > Here are the performance test results:
> > > > > > > > > (Throughput bigger is better, other smaller is better)
> > > > > > > > >
> > > > > > > > > Testing on x86_64 machine:
> > > > > > > > >
> > > > > > > > > | task hot2 | without patch | with patch | delta |
> > > > > > > > > |---------------------|---------------|---------------|---------|
> > > > > > > > > | total accesses time | 3.14 sec | 2.92 sec | -7.01% |
> > > > > > > > > | cycles per access | 4.91 | 2.07 | -57.84% |
> > > > > > > > > | Throughput | 104.38 M/sec | 112.12 M/sec | +7.42% |
> > > > > > > > > | dTLB-load-misses | 288966432 | 1292908 | -99.55% |
> > > > > > > > >
> > > > > > > > > Testing on qemu-system-x86_64 -enable-kvm:
> > > > > > > > >
> > > > > > > > > | task hot2 | without patch | with patch | delta |
> > > > > > > > > |---------------------|---------------|---------------|---------|
> > > > > > > > > | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
> > > > > > > > > | cycles per access | 7.23 | 2.12 | -70.68% |
> > > > > > > > > | Throughput | 97.88 M/sec | 110.76 M/sec | +13.16% |
> > > > > > > > > | dTLB-load-misses | 237406497 | 3189194 | -98.66% |
> > > > > > > >
> > > > > > > > Again, I also don't like that because you make assumptions on a full process
> > > > > > > > based on some part of its address space.
> > > > > > > >
> > > > > > > > E.g., if a library issues a MADV_COLD on some part of the memory the library
> > > > > > > > manages, why should the remaining part of the process suffer as well?
> > > > > > >
> > > > > > > Yes, you make a good point, thanks!
> > > > > > >
> > > > > > > > This seems to be a heuristic focused on some specific workloads, no?
> > > > > > >
> > > > > > > Right.
> > > > > > >
> > > > > > > Could we use the VM_NOHUGEPAGE flag to indicate that this region should
> > > > > > > not be collapsed, so that khugepaged can simply skip this VMA during
> > > > > > > scanning? This way, it won't affect the remaining part of the task's
> > > > > > > memory regions.
> > > > > >
> > > > > > I thought we would skip these regions already properly in khugepaged, or
> > > > > > maybe I misunderstood your question.
> > > > > >
> > > > >
> > > > > I think we should, but it seems we didn't do this for anonymous memory during
> > > > > khugepaged.
> > > > >
> > > > > We check the vma with thp_vma_allowable_order() during scan.
> > > > >
> > > > > * For anonymous memory during khugepaged, if we always enable 2M collapse,
> > > > > we will scan this vma even if VM_NOHUGEPAGE is set.
> > > > >
> > > > > * For other cases, it looks good since __thp_vma_allowable_order() will skip
> > > > > this vma with vma_thp_disabled().
> > > >
> > > > Hi David, Wei,
> > > >
> > > > khugepaged already checks the VM_NOHUGEPAGE flag for anonymous
> > > > memory during the scan, as below:
> > > >
> > > > khugepaged_scan_mm_slot()
> > > > thp_vma_allowable_order()
> > > > thp_vma_allowable_orders()
> > > > __thp_vma_allowable_orders()
> > > > vma_thp_disabled() {
> > > > if (vm_flags & VM_NOHUGEPAGE)
> > > > return true;
> > > > }
> > > >
> > > > REAL ISSUE: madvise(MADV_COLD) does not set the VM_NOHUGEPAGE flag on the
> > > > vma, so khugepaged will continue to scan this vma.
> > > >
> > > > I set the VM_NOHUGEPAGE flag on the vma in madvise(MADV_COLD), and the test
> > > > was successful. I will send it in the next version.
> > >
> > > No we must not do that. That's a user-space visible change. :/
> >
> > David, what good ideas do you have to achieve this goal? let me know
> > please, thank!
>
> Your idea would be to skip a VMA when we issues madvise(MADV_COLD).
>
> That sounds like yet another heuristic that can easily be wrong? :/
>
> In particular, imagine if the VMA is much larger than the madvise'd region
> (other parts used for something else) or if the previously cold memory area
> is used for something that is now hot.
>
> With memory allocators that manage most of the memory in a single large VMA,
> it's rather easy to see how such a heuristic would be bad, no?
Thanks for your explanation, but my current approach is as follows; the
large VMA will be split in this case.
madvise_vma_behavior
madvise_cold
madvise_update_vma
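To illustrate (a hypothetical sketch only, not the posted patch, and
David's objection above about VM_NOHUGEPAGE being a user-visible flag
change still applies): the idea is to let MADV_COLD take the existing
flag-updating path in mm/madvise.c, so that madvise_update_vma() splits
the advised range out of a larger VMA and changes vm_flags for just
that piece:

	/* hypothetical: in madvise_vma_behavior(), mark only the
	 * advised range; madvise_update_vma() already performs the
	 * VMA split when start/end do not cover the whole VMA.
	 */
	case MADV_COLD:
		new_flags |= VM_NOHUGEPAGE;
		break;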
Maybe I'll send v2 first, and we'll discuss it more clearly :)
--
Merry Christmas,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-23 11:18 ` Dev Jain
@ 2025-12-25 16:07 ` Vernon Yang
2025-12-29 6:02 ` Vernon Yang
1 sibling, 0 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-25 16:07 UTC (permalink / raw)
To: Dev Jain
Cc: David Hildenbrand (Red Hat),
akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On Tue, Dec 23, 2025 at 04:48:57PM +0530, Dev Jain wrote:
>
> On 19/12/25 2:05 pm, Vernon Yang wrote:
> > On Thu, Dec 18, 2025 at 10:29:18AM +0100, David Hildenbrand (Red Hat) wrote:
> >> On 12/15/25 10:04, Vernon Yang wrote:
> >>> The following data is traced by bpftrace on a desktop system. After
> >>> the system has been left idle for 10 minutes upon booting, a lot of
> >>> SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
> >>> khugepaged.
> >>>
> >>> @scan_pmd_status[1]: 1 ## SCAN_SUCCEED
> >>> @scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
> >>> @scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
> >>> total progress size: 701 MB
> >>> Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
> >>>
> >>> The khugepaged_scan list saves all tasks that support collapsing into
> >>> hugepages; as long as a task is not destroyed, khugepaged will not remove
> >>> it from the khugepaged_scan list. This leads to a phenomenon where a task
> >>> has already collapsed all of its memory regions into hugepages, yet
> >>> khugepaged continues to scan it, which wastes CPU time for no benefit;
> >>> and because of khugepaged_scan_sleep_millisecs (default 10s), a large
> >>> number of such invalid tasks causes a long wait, so the really valid
> >>> tasks are scanned later.
> >>>
> >>> After applying this patch, when all memory is either SCAN_PMD_MAPPED or
> >>> SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
> >>> list. If a page fault occurs or MADV_HUGEPAGE is applied again, it is
> >>> added back to khugepaged.
> >> I don't like that, as it assumes that memory within such a process would be
> >> rather static, which is easily not the case (e.g., allocators just doing
> >> MADV_DONTNEED to free memory).
> >>
> >> If most stuff is collapsed to PMDs already, can't we just skip over these
> >> regions a bit faster?
> > I had a flash of inspiration and came up with an idea.
> >
> > If these regions have already been collapsed into hugepages, rechecking
> > them would be very fast. Since khugepaged_pages_to_scan can also
> > represent the number of VMAs to skip, we can extend its semantics as
> > follows:
> >
> > /*
> > * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
> > * every 10 seconds.
> > */
> > static unsigned int khugepaged_pages_to_scan __read_mostly;
> >
> > 	switch (*result) {
> > 	case SCAN_NO_PTE_TABLE:
> > 	case SCAN_PMD_MAPPED:
> > 	case SCAN_PTE_MAPPED_HUGEPAGE:
> > 		progress++;	/* here: charge 1 instead of HPAGE_PMD_NR */
> > 		break;
> > 	case SCAN_SUCCEED:
> > 		++khugepaged_pages_collapsed;
> > 		fallthrough;
> > 	default:
> > 		progress += HPAGE_PMD_NR;
> > 	}
> >
> > This way we can achieve our goal. David, do you like it?
>
> This looks good, can you formally test this and see if it comes close to the optimizations
> yielded by the current version of the patchset?
Both approaches achieve this, reducing the time of a full scan, as
previously tested.
As for the performance numbers, I will run a formal test.
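
For reference, a rough quota calculation (assuming x86_64 with 4K base
pages and 2M PMDs, so HPAGE_PMD_NR = 512): khugepaged_pages_to_scan
defaults to 8 * HPAGE_PMD_NR = 4096. Charging HPAGE_PMD_NR of progress
per already-collapsed PMD means one wakeup covers only 8 such PMDs
(16 MB) before sleeping again; charging 1 lets a single wakeup skip up
to 4096 collapsed PMDs (8 GB), so a mostly-collapsed mm is passed over
roughly 512x faster.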
--
Merry Christmas,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-23 11:18 ` Dev Jain
2025-12-25 16:07 ` Vernon Yang
@ 2025-12-29 6:02 ` Vernon Yang
1 sibling, 0 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-29 6:02 UTC (permalink / raw)
To: Dev Jain
Cc: David Hildenbrand (Red Hat),
akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On Tue, Dec 23, 2025 at 04:48:57PM +0530, Dev Jain wrote:
>
> On 19/12/25 2:05 pm, Vernon Yang wrote:
> > On Thu, Dec 18, 2025 at 10:29:18AM +0100, David Hildenbrand (Red Hat) wrote:
> >> On 12/15/25 10:04, Vernon Yang wrote:
> >>> The following data is traced by bpftrace on a desktop system. After
> >>> the system has been left idle for 10 minutes upon booting, a lot of
> >>> SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
> >>> khugepaged.
> >>>
> >>> @scan_pmd_status[1]: 1 ## SCAN_SUCCEED
> >>> @scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
> >>> @scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
> >>> total progress size: 701 MB
> >>> Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
> >>>
> >>> The khugepaged_scan list saves all tasks that support collapsing into
> >>> hugepages; as long as a task is not destroyed, khugepaged will not remove
> >>> it from the khugepaged_scan list. This leads to a phenomenon where a task
> >>> has already collapsed all of its memory regions into hugepages, yet
> >>> khugepaged continues to scan it, which wastes CPU time for no benefit;
> >>> and because of khugepaged_scan_sleep_millisecs (default 10s), a large
> >>> number of such invalid tasks causes a long wait, so the really valid
> >>> tasks are scanned later.
> >>>
> >>> After applying this patch, when all memory is either SCAN_PMD_MAPPED or
> >>> SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
> >>> list. If a page fault occurs or MADV_HUGEPAGE is applied again, it is
> >>> added back to khugepaged.
> >> I don't like that, as it assumes that memory within such a process would be
> >> rather static, which is easily not the case (e.g., allocators just doing
> >> MADV_DONTNEED to free memory).
> >>
> >> If most stuff is collapsed to PMDs already, can't we just skip over these
> >> regions a bit faster?
> > I had a flash of inspiration and came up with an idea.
> >
> > If these regions have already been collapsed into hugepages, rechecking
> > them would be very fast. Since khugepaged_pages_to_scan can also
> > represent the number of VMAs to skip, we can extend its semantics as
> > follows:
> >
> > /*
> > * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
> > * every 10 seconds.
> > */
> > static unsigned int khugepaged_pages_to_scan __read_mostly;
> >
> > 	switch (*result) {
> > 	case SCAN_NO_PTE_TABLE:
> > 	case SCAN_PMD_MAPPED:
> > 	case SCAN_PTE_MAPPED_HUGEPAGE:
> > 		progress++;	/* here: charge 1 instead of HPAGE_PMD_NR */
> > 		break;
> > 	case SCAN_SUCCEED:
> > 		++khugepaged_pages_collapsed;
> > 		fallthrough;
> > 	default:
> > 		progress += HPAGE_PMD_NR;
> > 	}
> >
> > This way we can achieve our goal. David, do you like it?
>
> This looks good, can you formally test this and see if it comes close to the optimizations
> yielded by the current version of the patchset?
Both have the same performance. For detailed data, see v2[1].
[1] https://lore.kernel.org/linux-mm/20251229055151.54887-1-yanglincheng@kylinos.cn/
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
end of thread, other threads:[~2025-12-29 6:02 UTC | newest]
Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-12-15 9:04 [PATCH 0/4] Improve khugepaged scan logic Vernon Yang
2025-12-15 9:04 ` [PATCH 1/4] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
2025-12-18 9:24 ` David Hildenbrand (Red Hat)
2025-12-19 5:21 ` Vernon Yang
2025-12-15 9:04 ` [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed Vernon Yang
2025-12-15 11:52 ` Lance Yang
2025-12-16 6:27 ` Vernon Yang
2025-12-15 21:45 ` kernel test robot
2025-12-16 6:30 ` Vernon Yang
2025-12-15 23:01 ` kernel test robot
2025-12-16 6:32 ` Vernon Yang
2025-12-17 3:31 ` Wei Yang
2025-12-18 3:27 ` Vernon Yang
2025-12-18 3:48 ` Wei Yang
2025-12-18 4:41 ` Vernon Yang
2025-12-18 9:29 ` David Hildenbrand (Red Hat)
2025-12-19 5:24 ` Vernon Yang
2025-12-19 9:00 ` David Hildenbrand (Red Hat)
2025-12-19 8:35 ` Vernon Yang
2025-12-19 8:55 ` David Hildenbrand (Red Hat)
2025-12-23 11:18 ` Dev Jain
2025-12-25 16:07 ` Vernon Yang
2025-12-29 6:02 ` Vernon Yang
2025-12-22 19:00 ` kernel test robot
2025-12-15 9:04 ` [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE Vernon Yang
2025-12-15 21:12 ` kernel test robot
2025-12-16 7:00 ` Vernon Yang
2025-12-16 13:08 ` kernel test robot
2025-12-16 13:31 ` kernel test robot
2025-12-18 9:31 ` David Hildenbrand (Red Hat)
2025-12-19 5:29 ` Vernon Yang
2025-12-19 8:58 ` David Hildenbrand (Red Hat)
2025-12-21 2:10 ` Wei Yang
2025-12-21 4:25 ` Vernon Yang
2025-12-21 9:24 ` David Hildenbrand (Red Hat)
2025-12-21 12:34 ` Vernon Yang
2025-12-23 9:59 ` David Hildenbrand (Red Hat)
2025-12-25 15:12 ` Vernon Yang
2025-12-21 12:38 ` Wei Yang
2025-12-15 9:04 ` [PATCH 4/4] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY Vernon Yang
2025-12-18 9:33 ` David Hildenbrand (Red Hat)
2025-12-19 5:31 ` Vernon Yang
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox