* [PATCH 0/4] Improve khugepaged scan logic
@ 2025-12-15 9:04 Vernon Yang
2025-12-15 9:04 ` [PATCH 1/4] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
` (3 more replies)
0 siblings, 4 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-15 9:04 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes
Cc: ziy, npache, baohua, lance.yang, linux-mm, linux-kernel, Vernon Yang
Hi all,

This series improves the khugepaged scan logic to reduce CPU consumption
and to prioritize scanning tasks that access memory frequently.
The following data was traced with bpftrace[1] on a desktop system. After
the system had been left idle for 10 minutes after booting, a large number
of SCAN_PMD_MAPPED or SCAN_PMD_NONE results were observed during a full
scan by khugepaged.
@scan_pmd_status[1]: 1 ## SCAN_SUCCEED
@scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
@scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
total progress size: 701 MB
Total time : 440 seconds ## includes khugepaged_scan_sleep_millisecs
khugepaged shows the following behavior: the khugepaged list is scanned in
a FIFO manner, and as long as a task is not destroyed,

1. a task that no longer has any memory that can be collapsed into
   hugepages is still scanned over and over.
2. a task at the front of the khugepaged scan list is scanned first even
   when it is cold.
3. each pass runs at intervals of khugepaged_scan_sleep_millisecs
   (default 10s). If the two cases above are always scanned first, the
   useful scans have to wait a long time.
For the first case, once all memory has been collapsed, the mm is
automatically removed from khugepaged's scan list. If the memory is
faulted in again or MADV_HUGEPAGE is called, it is added back to
khugepaged.
For the second case, if the user has explicitly informed us via
MADV_COLD/MADV_FREE that the memory is cold or will be freed, the mm is
moved to the tail of the khugepaged scan list to be scanned later.
Below are some performance test results.
kernbench results (testing on x86_64 machine):
                      6.18.0-baseline       6.18.0-test
Amean     user-32    18652.80 ( 0.00%)   18640.85 ( 0.06%)
Amean     syst-32     1165.09 ( 0.00%)    1159.15 * 0.51%*
Amean     elsp-32      667.71 ( 0.00%)     667.02 * 0.10%*
BAmean-95 user-32    18652.02 ( 0.00%)   18638.11 ( 0.07%)
BAmean-95 syst-32     1165.04 ( 0.00%)    1158.41 ( 0.57%)
BAmean-95 elsp-32      667.65 ( 0.00%)     666.90 ( 0.11%)
BAmean-99 user-32    18652.02 ( 0.00%)   18638.11 ( 0.07%)
BAmean-99 syst-32     1165.04 ( 0.00%)    1158.41 ( 0.57%)
BAmean-99 elsp-32      667.65 ( 0.00%)     666.90 ( 0.11%)
Create three tasks[2]: hot1 -> cold -> hot2. After all three tasks are
created, each allocates 128MB of memory. The hot1/hot2 tasks continuously
access their 128MB, while the cold task only accesses its memory briefly
and then calls madvise(MADV_COLD); a sketch of this kind of test program
follows the tables below. Here are the performance test results:
(Higher is better for Throughput; lower is better for all other metrics)
Testing on x86_64 machine:
| task hot2           | without patch | with patch    | delta   |
|---------------------|---------------|---------------|---------|
| total access time   | 3.14 sec      | 2.92 sec      | -7.01%  |
| cycles per access   | 4.91          | 2.07          | -57.84% |
| Throughput          | 104.38 M/sec  | 112.12 M/sec  | +7.42%  |
| dTLB-load-misses    | 288966432     | 1292908       | -99.55% |
Testing on qemu-system-x86_64 -enable-kvm:
| task hot2           | without patch | with patch    | delta   |
|---------------------|---------------|---------------|---------|
| total access time   | 3.35 sec      | 2.96 sec      | -11.64% |
| cycles per access   | 7.23          | 2.12          | -70.68% |
| Throughput          | 97.88 M/sec   | 110.76 M/sec  | +13.16% |
| dTLB-load-misses    | 237406497     | 3189194       | -98.66% |
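For reference, here is a minimal sketch of the kind of test program used
above. The actual test is app.c[2]; the sizes, access stride, and
command-line handling below are illustrative assumptions, not the exact
code:

#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define SIZE (128UL << 20)	/* 128MB per task */

int main(int argc, char **argv)
{
	unsigned long i;
	char *buf;

	/* Anonymous mapping that khugepaged may collapse. */
	buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* Populate every page once. */
	memset(buf, 1, SIZE);

	if (argc > 1 && !strcmp(argv[1], "cold")) {
		/* Cold task: touch briefly, then mark the range cold. */
		madvise(buf, SIZE, MADV_COLD);
		pause();
	}

	/* Hot task: keep accessing the whole range. */
	for (;;) {
		for (i = 0; i < SIZE; i += 64)
			buf[i]++;
	}
}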
This series is based on Linux v6.18.
Thank you very much for your comments and discussions :)
[1] https://github.com/vernon2gh/app_and_module/blob/main/khugepaged/khugepaged_mm.bt
[2] https://github.com/vernon2gh/app_and_module/blob/main/khugepaged/app.c
Vernon Yang (4):
mm: khugepaged: add trace_mm_khugepaged_scan event
mm: khugepaged: remove mm when all memory has been collapsed
mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
mm: khugepaged: set to next mm direct when mm has
MMF_DISABLE_THP_COMPLETELY
include/linux/khugepaged.h | 1 +
include/trace/events/huge_memory.h | 24 ++++++++++++
mm/khugepaged.c | 60 ++++++++++++++++++++++++------
mm/madvise.c | 3 ++
4 files changed, 76 insertions(+), 12 deletions(-)
--
2.51.0
* [PATCH 1/4] mm: khugepaged: add trace_mm_khugepaged_scan event
2025-12-15 9:04 [PATCH 0/4] Improve khugepaged scan logic Vernon Yang
@ 2025-12-15 9:04 ` Vernon Yang
2025-12-18 9:24 ` David Hildenbrand (Red Hat)
2025-12-15 9:04 ` [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed Vernon Yang
` (2 subsequent siblings)
3 siblings, 1 reply; 42+ messages in thread
From: Vernon Yang @ 2025-12-15 9:04 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes
Cc: ziy, npache, baohua, lance.yang, linux-mm, linux-kernel, Vernon Yang
Add the mm_khugepaged_scan event to track the total time of a full
khugepaged scan and the total number of pages scanned.
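The event can then be consumed through the usual tracefs interface, for
example (assuming tracefs is mounted at /sys/kernel/tracing):

  # echo 1 > /sys/kernel/tracing/events/huge_memory/mm_khugepaged_scan/enable
  # cat /sys/kernel/tracing/trace_pipe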
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
include/trace/events/huge_memory.h | 24 ++++++++++++++++++++++++
mm/khugepaged.c | 2 ++
2 files changed, 26 insertions(+)
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index dd94d14a2427..b2824c2f8238 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -237,5 +237,29 @@ TRACE_EVENT(mm_khugepaged_collapse_file,
__print_symbolic(__entry->result, SCAN_STATUS))
);
+TRACE_EVENT(mm_khugepaged_scan,
+
+ TP_PROTO(struct mm_struct *mm, int progress, bool full),
+
+ TP_ARGS(mm, progress, full),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct *, mm)
+ __field(int, progress)
+ __field(bool, full)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __entry->progress = progress;
+ __entry->full = full;
+ ),
+
+ TP_printk("mm=%p, progress=%d, full=%d",
+ __entry->mm,
+ __entry->progress,
+ __entry->full)
+);
+
#endif /* __HUGE_MEMORY_H */
#include <trace/define_trace.h>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index abe54f0043c7..0598a19a98cc 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2516,6 +2516,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
collect_mm_slot(slot);
}
+ trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
+
return progress;
}
--
2.51.0
* [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-15 9:04 [PATCH 0/4] Improve khugepaged scan logic Vernon Yang
2025-12-15 9:04 ` [PATCH 1/4] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
@ 2025-12-15 9:04 ` Vernon Yang
2025-12-15 11:52 ` Lance Yang
` (5 more replies)
2025-12-15 9:04 ` [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE Vernon Yang
2025-12-15 9:04 ` [PATCH 4/4] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY Vernon Yang
3 siblings, 6 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-15 9:04 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes
Cc: ziy, npache, baohua, lance.yang, linux-mm, linux-kernel, Vernon Yang
The following data is traced by bpftrace on a desktop system. After
the system has been left idle for 10 minutes upon booting, a lot of
SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
khugepaged.
@scan_pmd_status[1]: 1 ## SCAN_SUCCEED
@scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
@scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
total progress size: 701 MB
Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
The khugepaged_scan list save all task that support collapse into hugepage,
as long as the take is not destroyed, khugepaged will not remove it from
the khugepaged_scan list. This exist a phenomenon where task has already
collapsed all memory regions into hugepage, but khugepaged continues to
scan it, which wastes CPU time and invalid, and due to
khugepaged_scan_sleep_millisecs (default 10s) causes a long wait for
scanning a large number of invalid task, so scanning really valid task
is later.
After applying this patch, when all memory is either SCAN_PMD_MAPPED or
SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
list. If the page fault or MADV_HUGEPAGE again, it is added back to
khugepaged.
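For reference, the re-add relies on the existing fault-path entry point;
roughly (paraphrased here from mm/khugepaged.c as an illustration, not
part of this patch):

void khugepaged_enter_vma(struct vm_area_struct *vma,
			  vm_flags_t vm_flags)
{
	/*
	 * Once collect_mm_slot() has cleared MMF_VM_HUGEPAGE, the next
	 * page fault in a THP-eligible VMA re-registers the mm here.
	 */
	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
	    hugepage_pmd_enabled()) {
		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED,
					    PMD_ORDER))
			__khugepaged_enter(vma->vm_mm);
	}
}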
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
mm/khugepaged.c | 35 +++++++++++++++++++++++++----------
1 file changed, 25 insertions(+), 10 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 0598a19a98cc..1ec1af5be3c8 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -115,6 +115,7 @@ struct khugepaged_scan {
struct list_head mm_head;
struct mm_slot *mm_slot;
unsigned long address;
+ bool maybe_collapse;
};
static struct khugepaged_scan khugepaged_scan = {
@@ -1420,22 +1421,19 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
return result;
}
-static void collect_mm_slot(struct mm_slot *slot)
+static void collect_mm_slot(struct mm_slot *slot, bool maybe_collapse)
{
struct mm_struct *mm = slot->mm;
lockdep_assert_held(&khugepaged_mm_lock);
- if (hpage_collapse_test_exit(mm)) {
+ if (hpage_collapse_test_exit(mm) || !maybe_collapse) {
/* free mm_slot */
hash_del(&slot->hash);
list_del(&slot->mm_node);
- /*
- * Not strictly needed because the mm exited already.
- *
- * mm_flags_clear(MMF_VM_HUGEPAGE, mm);
- */
+ if (!maybe_collapse)
+ mm_flags_clear(MMF_VM_HUGEPAGE, mm);
/* khugepaged_mm_lock actually not necessary for the below */
mm_slot_free(mm_slot_cache, slot);
@@ -2397,6 +2395,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
struct mm_slot, mm_node);
khugepaged_scan.address = 0;
khugepaged_scan.mm_slot = slot;
+ khugepaged_scan.maybe_collapse = false;
}
spin_unlock(&khugepaged_mm_lock);
@@ -2470,8 +2469,18 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
khugepaged_scan.address, &mmap_locked, cc);
}
- if (*result == SCAN_SUCCEED)
+ switch (*result) {
+ case SCAN_PMD_NULL:
+ case SCAN_PMD_NONE:
+ case SCAN_PMD_MAPPED:
+ case SCAN_PTE_MAPPED_HUGEPAGE:
+ break;
+ case SCAN_SUCCEED:
++khugepaged_pages_collapsed;
+ fallthrough;
+ default:
+ khugepaged_scan.maybe_collapse = true;
+ }
/* move to next address */
khugepaged_scan.address += HPAGE_PMD_SIZE;
@@ -2500,6 +2509,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
* if we scanned all vmas of this mm.
*/
if (hpage_collapse_test_exit(mm) || !vma) {
+ bool maybe_collapse = khugepaged_scan.maybe_collapse;
+
+ if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
+ maybe_collapse = true;
+
/*
* Make sure that if mm_users is reaching zero while
* khugepaged runs here, khugepaged_exit will find
@@ -2508,12 +2522,13 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
if (!list_is_last(&slot->mm_node, &khugepaged_scan.mm_head)) {
khugepaged_scan.mm_slot = list_next_entry(slot, mm_node);
khugepaged_scan.address = 0;
+ khugepaged_scan.maybe_collapse = false;
} else {
khugepaged_scan.mm_slot = NULL;
khugepaged_full_scans++;
}
- collect_mm_slot(slot);
+ collect_mm_slot(slot, maybe_collapse);
}
trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
@@ -2616,7 +2631,7 @@ static int khugepaged(void *none)
slot = khugepaged_scan.mm_slot;
khugepaged_scan.mm_slot = NULL;
if (slot)
- collect_mm_slot(slot);
+ collect_mm_slot(slot, true);
spin_unlock(&khugepaged_mm_lock);
return 0;
}
--
2.51.0
* [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-15 9:04 [PATCH 0/4] Improve khugepaged scan logic Vernon Yang
2025-12-15 9:04 ` [PATCH 1/4] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
2025-12-15 9:04 ` [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed Vernon Yang
@ 2025-12-15 9:04 ` Vernon Yang
2025-12-15 21:12 ` kernel test robot
` (3 more replies)
2025-12-15 9:04 ` [PATCH 4/4] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY Vernon Yang
3 siblings, 4 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-15 9:04 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes
Cc: ziy, npache, baohua, lance.yang, linux-mm, linux-kernel, Vernon Yang
For example, create three tasks: hot1 -> cold -> hot2. After all three
tasks are created, each allocates 128MB of memory. The hot1/hot2 tasks
continuously access their 128MB, while the cold task only accesses its
memory briefly and then calls madvise(MADV_COLD). However, khugepaged
still prioritizes scanning the cold task and only scans the hot2 task
after completing the scan of the cold task.

So if the user has explicitly informed us via MADV_COLD/MADV_FREE that
this memory is cold or will be freed, it is appropriate for khugepaged to
scan it only at the latest possible moment, thereby avoiding unnecessary
scan and collapse operations and reducing CPU wastage.
Here are the performance test results:
(Higher is better for Throughput; lower is better for all other metrics)
Testing on x86_64 machine:
| task hot2           | without patch | with patch    | delta   |
|---------------------|---------------|---------------|---------|
| total access time   | 3.14 sec      | 2.92 sec      | -7.01%  |
| cycles per access   | 4.91          | 2.07          | -57.84% |
| Throughput          | 104.38 M/sec  | 112.12 M/sec  | +7.42%  |
| dTLB-load-misses    | 288966432     | 1292908       | -99.55% |
Testing on qemu-system-x86_64 -enable-kvm:
| task hot2           | without patch | with patch    | delta   |
|---------------------|---------------|---------------|---------|
| total access time   | 3.35 sec      | 2.96 sec      | -11.64% |
| cycles per access   | 7.23          | 2.12          | -70.68% |
| Throughput          | 97.88 M/sec   | 110.76 M/sec  | +13.16% |
| dTLB-load-misses    | 237406497     | 3189194       | -98.66% |
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
include/linux/khugepaged.h | 1 +
mm/khugepaged.c | 14 ++++++++++++++
mm/madvise.c | 3 +++
3 files changed, 18 insertions(+)
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index eb1946a70cff..726e99de84e9 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -15,6 +15,7 @@ extern void __khugepaged_enter(struct mm_struct *mm);
extern void __khugepaged_exit(struct mm_struct *mm);
extern void khugepaged_enter_vma(struct vm_area_struct *vma,
vm_flags_t vm_flags);
+void khugepaged_move_tail(struct mm_struct *mm);
extern void khugepaged_min_free_kbytes_update(void);
extern bool current_is_khugepaged(void);
extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1ec1af5be3c8..91836dda2015 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -468,6 +468,20 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
}
}
+void khugepaged_move_tail(struct mm_struct *mm)
+{
+ struct mm_slot *slot;
+
+ if (!mm_flags_test(MMF_VM_HUGEPAGE, mm))
+ return;
+
+ spin_lock(&khugepaged_mm_lock);
+ slot = mm_slot_lookup(mm_slots_hash, mm);
+ if (slot && khugepaged_scan.mm_slot != slot)
+ list_move_tail(&slot->mm_node, &khugepaged_scan.mm_head);
+ spin_unlock(&khugepaged_mm_lock);
+}
+
void __khugepaged_exit(struct mm_struct *mm)
{
struct mm_slot *slot;
diff --git a/mm/madvise.c b/mm/madvise.c
index fb1c86e630b6..3f9ca7af2c82 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -608,6 +608,8 @@ static long madvise_cold(struct madvise_behavior *madv_behavior)
madvise_cold_page_range(&tlb, madv_behavior);
tlb_finish_mmu(&tlb);
+ khugepaged_move_tail(vma->vm_mm);
+
return 0;
}
@@ -835,6 +837,7 @@ static int madvise_free_single_vma(struct madvise_behavior *madv_behavior)
&walk_ops, tlb);
tlb_end_vma(tlb, vma);
mmu_notifier_invalidate_range_end(&range);
+ khugepaged_move_tail(mm);
return 0;
}
--
2.51.0
* [PATCH 4/4] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
2025-12-15 9:04 [PATCH 0/4] Improve khugepaged scan logic Vernon Yang
` (2 preceding siblings ...)
2025-12-15 9:04 ` [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE Vernon Yang
@ 2025-12-15 9:04 ` Vernon Yang
2025-12-18 9:33 ` David Hildenbrand (Red Hat)
3 siblings, 1 reply; 42+ messages in thread
From: Vernon Yang @ 2025-12-15 9:04 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes
Cc: ziy, npache, baohua, lance.yang, linux-mm, linux-kernel, Vernon Yang
When an mm with the MMF_DISABLE_THP_COMPLETELY flag is detected during
scanning, directly set khugepaged_scan.mm_slot to the next mm_slot to
avoid redundant work. Setting vma to NULL makes the slot-release check at
the end of khugepaged_scan_mm_slot() ("hpage_collapse_test_exit(mm) ||
!vma") trigger, so khugepaged advances to the next mm immediately instead
of revisiting the disabled one.
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
mm/khugepaged.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 91836dda2015..a8723eea12f1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2432,6 +2432,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
cond_resched();
if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
+ vma = NULL;
progress++;
break;
}
@@ -2452,8 +2453,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
bool mmap_locked = true;
cond_resched();
- if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+ if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
+ vma = NULL;
goto breakouterloop;
+ }
VM_BUG_ON(khugepaged_scan.address < hstart ||
khugepaged_scan.address + HPAGE_PMD_SIZE >
@@ -2470,8 +2473,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
fput(file);
if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
mmap_read_lock(mm);
- if (hpage_collapse_test_exit_or_disable(mm))
+ if (hpage_collapse_test_exit_or_disable(mm)) {
+ vma = NULL;
goto breakouterloop;
+ }
*result = collapse_pte_mapped_thp(mm,
khugepaged_scan.address, false);
if (*result == SCAN_PMD_MAPPED)
--
2.51.0
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-15 9:04 ` [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed Vernon Yang
@ 2025-12-15 11:52 ` Lance Yang
2025-12-16 6:27 ` Vernon Yang
2025-12-15 21:45 ` kernel test robot
` (4 subsequent siblings)
5 siblings, 1 reply; 42+ messages in thread
From: Lance Yang @ 2025-12-15 11:52 UTC (permalink / raw)
To: Vernon Yang
Cc: ziy, npache, baohua, linux-mm, linux-kernel, Vernon Yang, akpm,
lorenzo.stoakes, david
Hi Vernon,
Thanks for the patches!
On 2025/12/15 17:04, Vernon Yang wrote:
> The following data is traced by bpftrace on a desktop system. After
> the system has been left idle for 10 minutes upon booting, a lot of
> SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
> khugepaged.
>
> @scan_pmd_status[1]: 1 ## SCAN_SUCCEED
> @scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
> @scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
> total progress size: 701 MB
> Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
>
> The khugepaged_scan list save all task that support collapse into hugepage,
> as long as the take is not destroyed, khugepaged will not remove it from
Nit: s/take/task/
> the khugepaged_scan list. This exist a phenomenon where task has already
> collapsed all memory regions into hugepage, but khugepaged continues to
> scan it, which wastes CPU time and invalid, and due to
> khugepaged_scan_sleep_millisecs (default 10s) causes a long wait for
> scanning a large number of invalid task, so scanning really valid task
> is later.
>
> After applying this patch, when all memory is either SCAN_PMD_MAPPED or
> SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
> list. If the page fault or MADV_HUGEPAGE again, it is added back to
> khugepaged.
>
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> ---
> mm/khugepaged.c | 35 +++++++++++++++++++++++++----------
> 1 file changed, 25 insertions(+), 10 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 0598a19a98cc..1ec1af5be3c8 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -115,6 +115,7 @@ struct khugepaged_scan {
> struct list_head mm_head;
> struct mm_slot *mm_slot;
> unsigned long address;
> + bool maybe_collapse;
At a quick glance, the name of "maybe_collapse" is a bit ambiguous ...
Perhaps "scan_needed" or "collapse_possible" would be clearer to
indicate that the mm should be kept in the scan list?
> };
>
> static struct khugepaged_scan khugepaged_scan = {
> @@ -1420,22 +1421,19 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> return result;
> }
>
> -static void collect_mm_slot(struct mm_slot *slot)
> +static void collect_mm_slot(struct mm_slot *slot, bool maybe_collapse)
> {
> struct mm_struct *mm = slot->mm;
>
> lockdep_assert_held(&khugepaged_mm_lock);
>
> - if (hpage_collapse_test_exit(mm)) {
> + if (hpage_collapse_test_exit(mm) || !maybe_collapse) {
> /* free mm_slot */
> hash_del(&slot->hash);
> list_del(&slot->mm_node);
>
> - /*
> - * Not strictly needed because the mm exited already.
> - *
> - * mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> - */
> + if (!maybe_collapse)
> + mm_flags_clear(MMF_VM_HUGEPAGE, mm);
>
> /* khugepaged_mm_lock actually not necessary for the below */
> mm_slot_free(mm_slot_cache, slot);
> @@ -2397,6 +2395,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> struct mm_slot, mm_node);
> khugepaged_scan.address = 0;
> khugepaged_scan.mm_slot = slot;
> + khugepaged_scan.maybe_collapse = false;
> }
> spin_unlock(&khugepaged_mm_lock);
>
> @@ -2470,8 +2469,18 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> khugepaged_scan.address, &mmap_locked, cc);
> }
>
> - if (*result == SCAN_SUCCEED)
> + switch (*result) {
> + case SCAN_PMD_NULL:
> + case SCAN_PMD_NONE:
> + case SCAN_PMD_MAPPED:
> + case SCAN_PTE_MAPPED_HUGEPAGE:
> + break;
> + case SCAN_SUCCEED:
> ++khugepaged_pages_collapsed;
> + fallthrough;
> + default:
> + khugepaged_scan.maybe_collapse = true;
> + }
>
> /* move to next address */
> khugepaged_scan.address += HPAGE_PMD_SIZE;
> @@ -2500,6 +2509,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> * if we scanned all vmas of this mm.
> */
> if (hpage_collapse_test_exit(mm) || !vma) {
> + bool maybe_collapse = khugepaged_scan.maybe_collapse;
> +
> + if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
> + maybe_collapse = true;
> +
> /*
> * Make sure that if mm_users is reaching zero while
> * khugepaged runs here, khugepaged_exit will find
> @@ -2508,12 +2522,13 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> if (!list_is_last(&slot->mm_node, &khugepaged_scan.mm_head)) {
> khugepaged_scan.mm_slot = list_next_entry(slot, mm_node);
> khugepaged_scan.address = 0;
> + khugepaged_scan.maybe_collapse = false;
> } else {
> khugepaged_scan.mm_slot = NULL;
> khugepaged_full_scans++;
> }
>
> - collect_mm_slot(slot);
> + collect_mm_slot(slot, maybe_collapse);
> }
>
> trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
> @@ -2616,7 +2631,7 @@ static int khugepaged(void *none)
> slot = khugepaged_scan.mm_slot;
> khugepaged_scan.mm_slot = NULL;
> if (slot)
> - collect_mm_slot(slot);
> + collect_mm_slot(slot, true);
> spin_unlock(&khugepaged_mm_lock);
> return 0;
> }
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-15 9:04 ` [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE Vernon Yang
@ 2025-12-15 21:12 ` kernel test robot
2025-12-16 7:00 ` Vernon Yang
2025-12-16 13:08 ` kernel test robot
` (2 subsequent siblings)
3 siblings, 1 reply; 42+ messages in thread
From: kernel test robot @ 2025-12-15 21:12 UTC (permalink / raw)
To: Vernon Yang, akpm, david, lorenzo.stoakes
Cc: oe-kbuild-all, ziy, npache, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
Hi Vernon,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.19-rc1 next-20251215]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Vernon-Yang/mm-khugepaged-add-trace_mm_khugepaged_scan-event/20251215-171046
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20251215090419.174418-4-yanglincheng%40kylinos.cn
patch subject: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
config: arc-allnoconfig (https://download.01.org/0day-ci/archive/20251216/202512160400.pTmarqg6-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251216/202512160400.pTmarqg6-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512160400.pTmarqg6-lkp@intel.com/
All errors (new ones prefixed by >>):
mm/madvise.c: In function 'madvise_cold':
>> mm/madvise.c:609:9: error: implicit declaration of function 'khugepaged_move_tail'; did you mean 'khugepaged_exit'? [-Wimplicit-function-declaration]
609 | khugepaged_move_tail(vma->vm_mm);
| ^~~~~~~~~~~~~~~~~~~~
| khugepaged_exit
vim +609 mm/madvise.c
595
596 static long madvise_cold(struct madvise_behavior *madv_behavior)
597 {
598 struct vm_area_struct *vma = madv_behavior->vma;
599 struct mmu_gather tlb;
600
601 if (!can_madv_lru_vma(vma))
602 return -EINVAL;
603
604 lru_add_drain();
605 tlb_gather_mmu(&tlb, madv_behavior->mm);
606 madvise_cold_page_range(&tlb, madv_behavior);
607 tlb_finish_mmu(&tlb);
608
> 609 khugepaged_move_tail(vma->vm_mm);
610
611 return 0;
612 }
613
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-15 9:04 ` [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed Vernon Yang
2025-12-15 11:52 ` Lance Yang
@ 2025-12-15 21:45 ` kernel test robot
2025-12-16 6:30 ` Vernon Yang
2025-12-15 23:01 ` kernel test robot
` (3 subsequent siblings)
5 siblings, 1 reply; 42+ messages in thread
From: kernel test robot @ 2025-12-15 21:45 UTC (permalink / raw)
To: Vernon Yang, akpm, david, lorenzo.stoakes
Cc: llvm, oe-kbuild-all, ziy, npache, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
Hi Vernon,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.19-rc1 next-20251215]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Vernon-Yang/mm-khugepaged-add-trace_mm_khugepaged_scan-event/20251215-171046
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20251215090419.174418-3-yanglincheng%40kylinos.cn
patch subject: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20251216/202512160533.KuHwyJTP-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251216/202512160533.KuHwyJTP-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512160533.KuHwyJTP-lkp@intel.com/
All errors (new ones prefixed by >>):
>> mm/khugepaged.c:2490:9: error: use of undeclared identifier 'SCAN_PMD_NULL'; did you mean 'SCAN_VMA_NULL'?
2490 | case SCAN_PMD_NULL:
| ^~~~~~~~~~~~~
| SCAN_VMA_NULL
mm/khugepaged.c:50:2: note: 'SCAN_VMA_NULL' declared here
50 | SCAN_VMA_NULL,
| ^
>> mm/khugepaged.c:2491:9: error: use of undeclared identifier 'SCAN_PMD_NONE'
2491 | case SCAN_PMD_NONE:
| ^
2 errors generated.
vim +2490 mm/khugepaged.c
2392
2393 static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
2394 struct collapse_control *cc)
2395 __releases(&khugepaged_mm_lock)
2396 __acquires(&khugepaged_mm_lock)
2397 {
2398 struct vma_iterator vmi;
2399 struct mm_slot *slot;
2400 struct mm_struct *mm;
2401 struct vm_area_struct *vma;
2402 int progress = 0;
2403
2404 VM_BUG_ON(!pages);
2405 lockdep_assert_held(&khugepaged_mm_lock);
2406 *result = SCAN_FAIL;
2407
2408 if (khugepaged_scan.mm_slot) {
2409 slot = khugepaged_scan.mm_slot;
2410 } else {
2411 slot = list_first_entry(&khugepaged_scan.mm_head,
2412 struct mm_slot, mm_node);
2413 khugepaged_scan.address = 0;
2414 khugepaged_scan.mm_slot = slot;
2415 khugepaged_scan.maybe_collapse = false;
2416 }
2417 spin_unlock(&khugepaged_mm_lock);
2418
2419 mm = slot->mm;
2420 /*
2421 * Don't wait for semaphore (to avoid long wait times). Just move to
2422 * the next mm on the list.
2423 */
2424 vma = NULL;
2425 if (unlikely(!mmap_read_trylock(mm)))
2426 goto breakouterloop_mmap_lock;
2427
2428 progress++;
2429 if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
2430 goto breakouterloop;
2431
2432 vma_iter_init(&vmi, mm, khugepaged_scan.address);
2433 for_each_vma(vmi, vma) {
2434 unsigned long hstart, hend;
2435
2436 cond_resched();
2437 if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
2438 progress++;
2439 break;
2440 }
2441 if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
2442 skip:
2443 progress++;
2444 continue;
2445 }
2446 hstart = round_up(vma->vm_start, HPAGE_PMD_SIZE);
2447 hend = round_down(vma->vm_end, HPAGE_PMD_SIZE);
2448 if (khugepaged_scan.address > hend)
2449 goto skip;
2450 if (khugepaged_scan.address < hstart)
2451 khugepaged_scan.address = hstart;
2452 VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
2453
2454 while (khugepaged_scan.address < hend) {
2455 bool mmap_locked = true;
2456
2457 cond_resched();
2458 if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
2459 goto breakouterloop;
2460
2461 VM_BUG_ON(khugepaged_scan.address < hstart ||
2462 khugepaged_scan.address + HPAGE_PMD_SIZE >
2463 hend);
2464 if (!vma_is_anonymous(vma)) {
2465 struct file *file = get_file(vma->vm_file);
2466 pgoff_t pgoff = linear_page_index(vma,
2467 khugepaged_scan.address);
2468
2469 mmap_read_unlock(mm);
2470 mmap_locked = false;
2471 *result = hpage_collapse_scan_file(mm,
2472 khugepaged_scan.address, file, pgoff, cc);
2473 fput(file);
2474 if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
2475 mmap_read_lock(mm);
2476 if (hpage_collapse_test_exit_or_disable(mm))
2477 goto breakouterloop;
2478 *result = collapse_pte_mapped_thp(mm,
2479 khugepaged_scan.address, false);
2480 if (*result == SCAN_PMD_MAPPED)
2481 *result = SCAN_SUCCEED;
2482 mmap_read_unlock(mm);
2483 }
2484 } else {
2485 *result = hpage_collapse_scan_pmd(mm, vma,
2486 khugepaged_scan.address, &mmap_locked, cc);
2487 }
2488
2489 switch (*result) {
> 2490 case SCAN_PMD_NULL:
> 2491 case SCAN_PMD_NONE:
2492 case SCAN_PMD_MAPPED:
2493 case SCAN_PTE_MAPPED_HUGEPAGE:
2494 break;
2495 case SCAN_SUCCEED:
2496 ++khugepaged_pages_collapsed;
2497 fallthrough;
2498 default:
2499 khugepaged_scan.maybe_collapse = true;
2500 }
2501
2502 /* move to next address */
2503 khugepaged_scan.address += HPAGE_PMD_SIZE;
2504 progress += HPAGE_PMD_NR;
2505 if (!mmap_locked)
2506 /*
2507 * We released mmap_lock so break loop. Note
2508 * that we drop mmap_lock before all hugepage
2509 * allocations, so if allocation fails, we are
2510 * guaranteed to break here and report the
2511 * correct result back to caller.
2512 */
2513 goto breakouterloop_mmap_lock;
2514 if (progress >= pages)
2515 goto breakouterloop;
2516 }
2517 }
2518 breakouterloop:
2519 mmap_read_unlock(mm); /* exit_mmap will destroy ptes after this */
2520 breakouterloop_mmap_lock:
2521
2522 spin_lock(&khugepaged_mm_lock);
2523 VM_BUG_ON(khugepaged_scan.mm_slot != slot);
2524 /*
2525 * Release the current mm_slot if this mm is about to die, or
2526 * if we scanned all vmas of this mm.
2527 */
2528 if (hpage_collapse_test_exit(mm) || !vma) {
2529 bool maybe_collapse = khugepaged_scan.maybe_collapse;
2530
2531 if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
2532 maybe_collapse = true;
2533
2534 /*
2535 * Make sure that if mm_users is reaching zero while
2536 * khugepaged runs here, khugepaged_exit will find
2537 * mm_slot not pointing to the exiting mm.
2538 */
2539 if (!list_is_last(&slot->mm_node, &khugepaged_scan.mm_head)) {
2540 khugepaged_scan.mm_slot = list_next_entry(slot, mm_node);
2541 khugepaged_scan.address = 0;
2542 khugepaged_scan.maybe_collapse = false;
2543 } else {
2544 khugepaged_scan.mm_slot = NULL;
2545 khugepaged_full_scans++;
2546 }
2547
2548 collect_mm_slot(slot, maybe_collapse);
2549 }
2550
2551 trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
2552
2553 return progress;
2554 }
2555
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-15 9:04 ` [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed Vernon Yang
2025-12-15 11:52 ` Lance Yang
2025-12-15 21:45 ` kernel test robot
@ 2025-12-15 23:01 ` kernel test robot
2025-12-16 6:32 ` Vernon Yang
2025-12-17 3:31 ` Wei Yang
` (2 subsequent siblings)
5 siblings, 1 reply; 42+ messages in thread
From: kernel test robot @ 2025-12-15 23:01 UTC (permalink / raw)
To: Vernon Yang, akpm, david, lorenzo.stoakes
Cc: oe-kbuild-all, ziy, npache, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
Hi Vernon,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.19-rc1 next-20251215]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Vernon-Yang/mm-khugepaged-add-trace_mm_khugepaged_scan-event/20251215-171046
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20251215090419.174418-3-yanglincheng%40kylinos.cn
patch subject: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
config: x86_64-rhel-9.4 (https://download.01.org/0day-ci/archive/20251216/202512160619.3Ut4sxaJ-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251216/202512160619.3Ut4sxaJ-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512160619.3Ut4sxaJ-lkp@intel.com/
All errors (new ones prefixed by >>):
mm/khugepaged.c: In function 'khugepaged_scan_mm_slot':
>> mm/khugepaged.c:2490:30: error: 'SCAN_PMD_NULL' undeclared (first use in this function); did you mean 'SCAN_VMA_NULL'?
2490 | case SCAN_PMD_NULL:
| ^~~~~~~~~~~~~
| SCAN_VMA_NULL
mm/khugepaged.c:2490:30: note: each undeclared identifier is reported only once for each function it appears in
>> mm/khugepaged.c:2491:30: error: 'SCAN_PMD_NONE' undeclared (first use in this function)
2491 | case SCAN_PMD_NONE:
| ^~~~~~~~~~~~~
vim +2490 mm/khugepaged.c
2392
2393 static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
2394 struct collapse_control *cc)
2395 __releases(&khugepaged_mm_lock)
2396 __acquires(&khugepaged_mm_lock)
2397 {
2398 struct vma_iterator vmi;
2399 struct mm_slot *slot;
2400 struct mm_struct *mm;
2401 struct vm_area_struct *vma;
2402 int progress = 0;
2403
2404 VM_BUG_ON(!pages);
2405 lockdep_assert_held(&khugepaged_mm_lock);
2406 *result = SCAN_FAIL;
2407
2408 if (khugepaged_scan.mm_slot) {
2409 slot = khugepaged_scan.mm_slot;
2410 } else {
2411 slot = list_first_entry(&khugepaged_scan.mm_head,
2412 struct mm_slot, mm_node);
2413 khugepaged_scan.address = 0;
2414 khugepaged_scan.mm_slot = slot;
2415 khugepaged_scan.maybe_collapse = false;
2416 }
2417 spin_unlock(&khugepaged_mm_lock);
2418
2419 mm = slot->mm;
2420 /*
2421 * Don't wait for semaphore (to avoid long wait times). Just move to
2422 * the next mm on the list.
2423 */
2424 vma = NULL;
2425 if (unlikely(!mmap_read_trylock(mm)))
2426 goto breakouterloop_mmap_lock;
2427
2428 progress++;
2429 if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
2430 goto breakouterloop;
2431
2432 vma_iter_init(&vmi, mm, khugepaged_scan.address);
2433 for_each_vma(vmi, vma) {
2434 unsigned long hstart, hend;
2435
2436 cond_resched();
2437 if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
2438 progress++;
2439 break;
2440 }
2441 if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
2442 skip:
2443 progress++;
2444 continue;
2445 }
2446 hstart = round_up(vma->vm_start, HPAGE_PMD_SIZE);
2447 hend = round_down(vma->vm_end, HPAGE_PMD_SIZE);
2448 if (khugepaged_scan.address > hend)
2449 goto skip;
2450 if (khugepaged_scan.address < hstart)
2451 khugepaged_scan.address = hstart;
2452 VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
2453
2454 while (khugepaged_scan.address < hend) {
2455 bool mmap_locked = true;
2456
2457 cond_resched();
2458 if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
2459 goto breakouterloop;
2460
2461 VM_BUG_ON(khugepaged_scan.address < hstart ||
2462 khugepaged_scan.address + HPAGE_PMD_SIZE >
2463 hend);
2464 if (!vma_is_anonymous(vma)) {
2465 struct file *file = get_file(vma->vm_file);
2466 pgoff_t pgoff = linear_page_index(vma,
2467 khugepaged_scan.address);
2468
2469 mmap_read_unlock(mm);
2470 mmap_locked = false;
2471 *result = hpage_collapse_scan_file(mm,
2472 khugepaged_scan.address, file, pgoff, cc);
2473 fput(file);
2474 if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
2475 mmap_read_lock(mm);
2476 if (hpage_collapse_test_exit_or_disable(mm))
2477 goto breakouterloop;
2478 *result = collapse_pte_mapped_thp(mm,
2479 khugepaged_scan.address, false);
2480 if (*result == SCAN_PMD_MAPPED)
2481 *result = SCAN_SUCCEED;
2482 mmap_read_unlock(mm);
2483 }
2484 } else {
2485 *result = hpage_collapse_scan_pmd(mm, vma,
2486 khugepaged_scan.address, &mmap_locked, cc);
2487 }
2488
2489 switch (*result) {
> 2490 case SCAN_PMD_NULL:
> 2491 case SCAN_PMD_NONE:
2492 case SCAN_PMD_MAPPED:
2493 case SCAN_PTE_MAPPED_HUGEPAGE:
2494 break;
2495 case SCAN_SUCCEED:
2496 ++khugepaged_pages_collapsed;
2497 fallthrough;
2498 default:
2499 khugepaged_scan.maybe_collapse = true;
2500 }
2501
2502 /* move to next address */
2503 khugepaged_scan.address += HPAGE_PMD_SIZE;
2504 progress += HPAGE_PMD_NR;
2505 if (!mmap_locked)
2506 /*
2507 * We released mmap_lock so break loop. Note
2508 * that we drop mmap_lock before all hugepage
2509 * allocations, so if allocation fails, we are
2510 * guaranteed to break here and report the
2511 * correct result back to caller.
2512 */
2513 goto breakouterloop_mmap_lock;
2514 if (progress >= pages)
2515 goto breakouterloop;
2516 }
2517 }
2518 breakouterloop:
2519 mmap_read_unlock(mm); /* exit_mmap will destroy ptes after this */
2520 breakouterloop_mmap_lock:
2521
2522 spin_lock(&khugepaged_mm_lock);
2523 VM_BUG_ON(khugepaged_scan.mm_slot != slot);
2524 /*
2525 * Release the current mm_slot if this mm is about to die, or
2526 * if we scanned all vmas of this mm.
2527 */
2528 if (hpage_collapse_test_exit(mm) || !vma) {
2529 bool maybe_collapse = khugepaged_scan.maybe_collapse;
2530
2531 if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
2532 maybe_collapse = true;
2533
2534 /*
2535 * Make sure that if mm_users is reaching zero while
2536 * khugepaged runs here, khugepaged_exit will find
2537 * mm_slot not pointing to the exiting mm.
2538 */
2539 if (!list_is_last(&slot->mm_node, &khugepaged_scan.mm_head)) {
2540 khugepaged_scan.mm_slot = list_next_entry(slot, mm_node);
2541 khugepaged_scan.address = 0;
2542 khugepaged_scan.maybe_collapse = false;
2543 } else {
2544 khugepaged_scan.mm_slot = NULL;
2545 khugepaged_full_scans++;
2546 }
2547
2548 collect_mm_slot(slot, maybe_collapse);
2549 }
2550
2551 trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
2552
2553 return progress;
2554 }
2555
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-15 11:52 ` Lance Yang
@ 2025-12-16 6:27 ` Vernon Yang
0 siblings, 0 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-16 6:27 UTC (permalink / raw)
To: Lance Yang
Cc: ziy, baohua, linux-mm, linux-kernel, Vernon Yang, akpm,
lorenzo.stoakes, david
On Mon, Dec 15, 2025 at 07:52:41PM +0800, Lance Yang wrote:
> Hi Vernon,
>
> Thanks for the patches!
>
> On 2025/12/15 17:04, Vernon Yang wrote:
> > The following data is traced by bpftrace on a desktop system. After
> > the system has been left idle for 10 minutes upon booting, a lot of
> > SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
> > khugepaged.
> >
> > @scan_pmd_status[1]: 1 ## SCAN_SUCCEED
> > @scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
> > @scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
> > total progress size: 701 MB
> > Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
> >
> > The khugepaged_scan list save all task that support collapse into hugepage,
> > as long as the take is not destroyed, khugepaged will not remove it from
>
> Nit: s/take/task/
Thanks, I'll fix it in the next version.
> > the khugepaged_scan list. This exist a phenomenon where task has already
> > collapsed all memory regions into hugepage, but khugepaged continues to
> > scan it, which wastes CPU time and invalid, and due to
> > khugepaged_scan_sleep_millisecs (default 10s) causes a long wait for
> > scanning a large number of invalid task, so scanning really valid task
> > is later.
> >
> > After applying this patch, when all memory is either SCAN_PMD_MAPPED or
> > SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
> > list. If the page fault or MADV_HUGEPAGE again, it is added back to
> > khugepaged.
> >
> > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > ---
> > mm/khugepaged.c | 35 +++++++++++++++++++++++++----------
> > 1 file changed, 25 insertions(+), 10 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 0598a19a98cc..1ec1af5be3c8 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -115,6 +115,7 @@ struct khugepaged_scan {
> > struct list_head mm_head;
> > struct mm_slot *mm_slot;
> > unsigned long address;
> > + bool maybe_collapse;
>
> At a quick glance, the name of "maybe_collapse" is a bit ambiguous ...
>
> Perhaps "scan_needed" or "collapse_possible" would be clearer to
> indicate that the mm should be kept in the scan list?
The "collapse_possible" sounds good to me, Thanks! I will do it in the
next version.
--
Thanks,
Vernon
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-15 21:45 ` kernel test robot
@ 2025-12-16 6:30 ` Vernon Yang
0 siblings, 0 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-16 6:30 UTC (permalink / raw)
To: kernel test robot
Cc: akpm, david, lorenzo.stoakes, llvm, oe-kbuild-all, ziy, baohua,
lance.yang, linux-mm, linux-kernel, Vernon Yang
On Tue, Dec 16, 2025 at 05:45:31AM +0800, kernel test robot wrote:
> Hi Vernon,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on akpm-mm/mm-everything]
> [also build test ERROR on linus/master v6.19-rc1 next-20251215]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Vernon-Yang/mm-khugepaged-add-trace_mm_khugepaged_scan-event/20251215-171046
> base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/r/20251215090419.174418-3-yanglincheng%40kylinos.cn
> patch subject: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
> config: x86_64-kexec (https://download.01.org/0day-ci/archive/20251216/202512160533.KuHwyJTP-lkp@intel.com/config)
> compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251216/202512160533.KuHwyJTP-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202512160533.KuHwyJTP-lkp@intel.com/
>
> All errors (new ones prefixed by >>):
>
> >> mm/khugepaged.c:2490:9: error: use of undeclared identifier 'SCAN_PMD_NULL'; did you mean 'SCAN_VMA_NULL'?
> 2490 | case SCAN_PMD_NULL:
> | ^~~~~~~~~~~~~
> | SCAN_VMA_NULL
> mm/khugepaged.c:50:2: note: 'SCAN_VMA_NULL' declared here
> 50 | SCAN_VMA_NULL,
> | ^
> >> mm/khugepaged.c:2491:9: error: use of undeclared identifier 'SCAN_PMD_NONE'
> 2491 | case SCAN_PMD_NONE:
> | ^
> 2 errors generated.
This series is based on Linux v6.18; v6.19-rc1 added "mm/khugepaged:
unify SCAN_PMD_NONE and SCAN_PMD_NULL into SCAN_NO_PTE_TABLE"[1], which
triggers these build errors. I'll fix it in the next version, thanks!
[1] https://lkml.kernel.org/r/20251114030028.7035-4-richard.weiyang@gmail.com
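For what it's worth, on top of that change the switch would presumably
become something like this (untested sketch against the unified status
code):

	switch (*result) {
	case SCAN_NO_PTE_TABLE:	/* was SCAN_PMD_NULL/SCAN_PMD_NONE */
	case SCAN_PMD_MAPPED:
	case SCAN_PTE_MAPPED_HUGEPAGE:
		break;
	case SCAN_SUCCEED:
		++khugepaged_pages_collapsed;
		fallthrough;
	default:
		khugepaged_scan.maybe_collapse = true;
	}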
>
> vim +2490 mm/khugepaged.c
>
> 2392
> 2393 static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> 2394 struct collapse_control *cc)
> 2395 __releases(&khugepaged_mm_lock)
> 2396 __acquires(&khugepaged_mm_lock)
> 2397 {
> 2398 struct vma_iterator vmi;
> 2399 struct mm_slot *slot;
> 2400 struct mm_struct *mm;
> 2401 struct vm_area_struct *vma;
> 2402 int progress = 0;
> 2403
> 2404 VM_BUG_ON(!pages);
> 2405 lockdep_assert_held(&khugepaged_mm_lock);
> 2406 *result = SCAN_FAIL;
> 2407
> 2408 if (khugepaged_scan.mm_slot) {
> 2409 slot = khugepaged_scan.mm_slot;
> 2410 } else {
> 2411 slot = list_first_entry(&khugepaged_scan.mm_head,
> 2412 struct mm_slot, mm_node);
> 2413 khugepaged_scan.address = 0;
> 2414 khugepaged_scan.mm_slot = slot;
> 2415 khugepaged_scan.maybe_collapse = false;
> 2416 }
> 2417 spin_unlock(&khugepaged_mm_lock);
> 2418
> 2419 mm = slot->mm;
> 2420 /*
> 2421 * Don't wait for semaphore (to avoid long wait times). Just move to
> 2422 * the next mm on the list.
> 2423 */
> 2424 vma = NULL;
> 2425 if (unlikely(!mmap_read_trylock(mm)))
> 2426 goto breakouterloop_mmap_lock;
> 2427
> 2428 progress++;
> 2429 if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> 2430 goto breakouterloop;
> 2431
> 2432 vma_iter_init(&vmi, mm, khugepaged_scan.address);
> 2433 for_each_vma(vmi, vma) {
> 2434 unsigned long hstart, hend;
> 2435
> 2436 cond_resched();
> 2437 if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
> 2438 progress++;
> 2439 break;
> 2440 }
> 2441 if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
> 2442 skip:
> 2443 progress++;
> 2444 continue;
> 2445 }
> 2446 hstart = round_up(vma->vm_start, HPAGE_PMD_SIZE);
> 2447 hend = round_down(vma->vm_end, HPAGE_PMD_SIZE);
> 2448 if (khugepaged_scan.address > hend)
> 2449 goto skip;
> 2450 if (khugepaged_scan.address < hstart)
> 2451 khugepaged_scan.address = hstart;
> 2452 VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
> 2453
> 2454 while (khugepaged_scan.address < hend) {
> 2455 bool mmap_locked = true;
> 2456
> 2457 cond_resched();
> 2458 if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> 2459 goto breakouterloop;
> 2460
> 2461 VM_BUG_ON(khugepaged_scan.address < hstart ||
> 2462 khugepaged_scan.address + HPAGE_PMD_SIZE >
> 2463 hend);
> 2464 if (!vma_is_anonymous(vma)) {
> 2465 struct file *file = get_file(vma->vm_file);
> 2466 pgoff_t pgoff = linear_page_index(vma,
> 2467 khugepaged_scan.address);
> 2468
> 2469 mmap_read_unlock(mm);
> 2470 mmap_locked = false;
> 2471 *result = hpage_collapse_scan_file(mm,
> 2472 khugepaged_scan.address, file, pgoff, cc);
> 2473 fput(file);
> 2474 if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> 2475 mmap_read_lock(mm);
> 2476 if (hpage_collapse_test_exit_or_disable(mm))
> 2477 goto breakouterloop;
> 2478 *result = collapse_pte_mapped_thp(mm,
> 2479 khugepaged_scan.address, false);
> 2480 if (*result == SCAN_PMD_MAPPED)
> 2481 *result = SCAN_SUCCEED;
> 2482 mmap_read_unlock(mm);
> 2483 }
> 2484 } else {
> 2485 *result = hpage_collapse_scan_pmd(mm, vma,
> 2486 khugepaged_scan.address, &mmap_locked, cc);
> 2487 }
> 2488
> 2489 switch (*result) {
> > 2490 case SCAN_PMD_NULL:
> > 2491 case SCAN_PMD_NONE:
> 2492 case SCAN_PMD_MAPPED:
> 2493 case SCAN_PTE_MAPPED_HUGEPAGE:
> 2494 break;
> 2495 case SCAN_SUCCEED:
> 2496 ++khugepaged_pages_collapsed;
> 2497 fallthrough;
> 2498 default:
> 2499 khugepaged_scan.maybe_collapse = true;
> 2500 }
> 2501
> 2502 /* move to next address */
> 2503 khugepaged_scan.address += HPAGE_PMD_SIZE;
> 2504 progress += HPAGE_PMD_NR;
> 2505 if (!mmap_locked)
> 2506 /*
> 2507 * We released mmap_lock so break loop. Note
> 2508 * that we drop mmap_lock before all hugepage
> 2509 * allocations, so if allocation fails, we are
> 2510 * guaranteed to break here and report the
> 2511 * correct result back to caller.
> 2512 */
> 2513 goto breakouterloop_mmap_lock;
> 2514 if (progress >= pages)
> 2515 goto breakouterloop;
> 2516 }
> 2517 }
> 2518 breakouterloop:
> 2519 mmap_read_unlock(mm); /* exit_mmap will destroy ptes after this */
> 2520 breakouterloop_mmap_lock:
> 2521
> 2522 spin_lock(&khugepaged_mm_lock);
> 2523 VM_BUG_ON(khugepaged_scan.mm_slot != slot);
> 2524 /*
> 2525 * Release the current mm_slot if this mm is about to die, or
> 2526 * if we scanned all vmas of this mm.
> 2527 */
> 2528 if (hpage_collapse_test_exit(mm) || !vma) {
> 2529 bool maybe_collapse = khugepaged_scan.maybe_collapse;
> 2530
> 2531 if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
> 2532 maybe_collapse = true;
> 2533
> 2534 /*
> 2535 * Make sure that if mm_users is reaching zero while
> 2536 * khugepaged runs here, khugepaged_exit will find
> 2537 * mm_slot not pointing to the exiting mm.
> 2538 */
> 2539 if (!list_is_last(&slot->mm_node, &khugepaged_scan.mm_head)) {
> 2540 khugepaged_scan.mm_slot = list_next_entry(slot, mm_node);
> 2541 khugepaged_scan.address = 0;
> 2542 khugepaged_scan.maybe_collapse = false;
> 2543 } else {
> 2544 khugepaged_scan.mm_slot = NULL;
> 2545 khugepaged_full_scans++;
> 2546 }
> 2547
> 2548 collect_mm_slot(slot, maybe_collapse);
> 2549 }
> 2550
> 2551 trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
> 2552
> 2553 return progress;
> 2554 }
> 2555
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
--
Thanks,
Vernon
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-15 23:01 ` kernel test robot
@ 2025-12-16 6:32 ` Vernon Yang
0 siblings, 0 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-16 6:32 UTC (permalink / raw)
To: kernel test robot
Cc: akpm, david, lorenzo.stoakes, oe-kbuild-all, ziy, baohua,
lance.yang, linux-mm, linux-kernel, Vernon Yang
On Tue, Dec 16, 2025 at 07:01:18AM +0800, kernel test robot wrote:
> Hi Vernon,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on akpm-mm/mm-everything]
> [also build test ERROR on linus/master v6.19-rc1 next-20251215]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Vernon-Yang/mm-khugepaged-add-trace_mm_khugepaged_scan-event/20251215-171046
> base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/r/20251215090419.174418-3-yanglincheng%40kylinos.cn
> patch subject: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
> config: x86_64-rhel-9.4 (https://download.01.org/0day-ci/archive/20251216/202512160619.3Ut4sxaJ-lkp@intel.com/config)
> compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251216/202512160619.3Ut4sxaJ-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202512160619.3Ut4sxaJ-lkp@intel.com/
>
> All errors (new ones prefixed by >>):
>
> mm/khugepaged.c: In function 'khugepaged_scan_mm_slot':
> >> mm/khugepaged.c:2490:30: error: 'SCAN_PMD_NULL' undeclared (first use in this function); did you mean 'SCAN_VMA_NULL'?
> 2490 | case SCAN_PMD_NULL:
> | ^~~~~~~~~~~~~
> | SCAN_VMA_NULL
> mm/khugepaged.c:2490:30: note: each undeclared identifier is reported only once for each function it appears in
> >> mm/khugepaged.c:2491:30: error: 'SCAN_PMD_NONE' undeclared (first use in this function)
> 2491 | case SCAN_PMD_NONE:
> | ^~~~~~~~~~~~~
Same as above, thanks.
>
> vim +2490 mm/khugepaged.c
>
> 2392
> 2393 static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> 2394 struct collapse_control *cc)
> 2395 __releases(&khugepaged_mm_lock)
> 2396 __acquires(&khugepaged_mm_lock)
> 2397 {
> 2398 struct vma_iterator vmi;
> 2399 struct mm_slot *slot;
> 2400 struct mm_struct *mm;
> 2401 struct vm_area_struct *vma;
> 2402 int progress = 0;
> 2403
> 2404 VM_BUG_ON(!pages);
> 2405 lockdep_assert_held(&khugepaged_mm_lock);
> 2406 *result = SCAN_FAIL;
> 2407
> 2408 if (khugepaged_scan.mm_slot) {
> 2409 slot = khugepaged_scan.mm_slot;
> 2410 } else {
> 2411 slot = list_first_entry(&khugepaged_scan.mm_head,
> 2412 struct mm_slot, mm_node);
> 2413 khugepaged_scan.address = 0;
> 2414 khugepaged_scan.mm_slot = slot;
> 2415 khugepaged_scan.maybe_collapse = false;
> 2416 }
> 2417 spin_unlock(&khugepaged_mm_lock);
> 2418
> 2419 mm = slot->mm;
> 2420 /*
> 2421 * Don't wait for semaphore (to avoid long wait times). Just move to
> 2422 * the next mm on the list.
> 2423 */
> 2424 vma = NULL;
> 2425 if (unlikely(!mmap_read_trylock(mm)))
> 2426 goto breakouterloop_mmap_lock;
> 2427
> 2428 progress++;
> 2429 if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> 2430 goto breakouterloop;
> 2431
> 2432 vma_iter_init(&vmi, mm, khugepaged_scan.address);
> 2433 for_each_vma(vmi, vma) {
> 2434 unsigned long hstart, hend;
> 2435
> 2436 cond_resched();
> 2437 if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
> 2438 progress++;
> 2439 break;
> 2440 }
> 2441 if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
> 2442 skip:
> 2443 progress++;
> 2444 continue;
> 2445 }
> 2446 hstart = round_up(vma->vm_start, HPAGE_PMD_SIZE);
> 2447 hend = round_down(vma->vm_end, HPAGE_PMD_SIZE);
> 2448 if (khugepaged_scan.address > hend)
> 2449 goto skip;
> 2450 if (khugepaged_scan.address < hstart)
> 2451 khugepaged_scan.address = hstart;
> 2452 VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
> 2453
> 2454 while (khugepaged_scan.address < hend) {
> 2455 bool mmap_locked = true;
> 2456
> 2457 cond_resched();
> 2458 if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> 2459 goto breakouterloop;
> 2460
> 2461 VM_BUG_ON(khugepaged_scan.address < hstart ||
> 2462 khugepaged_scan.address + HPAGE_PMD_SIZE >
> 2463 hend);
> 2464 if (!vma_is_anonymous(vma)) {
> 2465 struct file *file = get_file(vma->vm_file);
> 2466 pgoff_t pgoff = linear_page_index(vma,
> 2467 khugepaged_scan.address);
> 2468
> 2469 mmap_read_unlock(mm);
> 2470 mmap_locked = false;
> 2471 *result = hpage_collapse_scan_file(mm,
> 2472 khugepaged_scan.address, file, pgoff, cc);
> 2473 fput(file);
> 2474 if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> 2475 mmap_read_lock(mm);
> 2476 if (hpage_collapse_test_exit_or_disable(mm))
> 2477 goto breakouterloop;
> 2478 *result = collapse_pte_mapped_thp(mm,
> 2479 khugepaged_scan.address, false);
> 2480 if (*result == SCAN_PMD_MAPPED)
> 2481 *result = SCAN_SUCCEED;
> 2482 mmap_read_unlock(mm);
> 2483 }
> 2484 } else {
> 2485 *result = hpage_collapse_scan_pmd(mm, vma,
> 2486 khugepaged_scan.address, &mmap_locked, cc);
> 2487 }
> 2488
> 2489 switch (*result) {
> > 2490 case SCAN_PMD_NULL:
> > 2491 case SCAN_PMD_NONE:
> 2492 case SCAN_PMD_MAPPED:
> 2493 case SCAN_PTE_MAPPED_HUGEPAGE:
> 2494 break;
> 2495 case SCAN_SUCCEED:
> 2496 ++khugepaged_pages_collapsed;
> 2497 fallthrough;
> 2498 default:
> 2499 khugepaged_scan.maybe_collapse = true;
> 2500 }
> 2501
> 2502 /* move to next address */
> 2503 khugepaged_scan.address += HPAGE_PMD_SIZE;
> 2504 progress += HPAGE_PMD_NR;
> 2505 if (!mmap_locked)
> 2506 /*
> 2507 * We released mmap_lock so break loop. Note
> 2508 * that we drop mmap_lock before all hugepage
> 2509 * allocations, so if allocation fails, we are
> 2510 * guaranteed to break here and report the
> 2511 * correct result back to caller.
> 2512 */
> 2513 goto breakouterloop_mmap_lock;
> 2514 if (progress >= pages)
> 2515 goto breakouterloop;
> 2516 }
> 2517 }
> 2518 breakouterloop:
> 2519 mmap_read_unlock(mm); /* exit_mmap will destroy ptes after this */
> 2520 breakouterloop_mmap_lock:
> 2521
> 2522 spin_lock(&khugepaged_mm_lock);
> 2523 VM_BUG_ON(khugepaged_scan.mm_slot != slot);
> 2524 /*
> 2525 * Release the current mm_slot if this mm is about to die, or
> 2526 * if we scanned all vmas of this mm.
> 2527 */
> 2528 if (hpage_collapse_test_exit(mm) || !vma) {
> 2529 bool maybe_collapse = khugepaged_scan.maybe_collapse;
> 2530
> 2531 if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
> 2532 maybe_collapse = true;
> 2533
> 2534 /*
> 2535 * Make sure that if mm_users is reaching zero while
> 2536 * khugepaged runs here, khugepaged_exit will find
> 2537 * mm_slot not pointing to the exiting mm.
> 2538 */
> 2539 if (!list_is_last(&slot->mm_node, &khugepaged_scan.mm_head)) {
> 2540 khugepaged_scan.mm_slot = list_next_entry(slot, mm_node);
> 2541 khugepaged_scan.address = 0;
> 2542 khugepaged_scan.maybe_collapse = false;
> 2543 } else {
> 2544 khugepaged_scan.mm_slot = NULL;
> 2545 khugepaged_full_scans++;
> 2546 }
> 2547
> 2548 collect_mm_slot(slot, maybe_collapse);
> 2549 }
> 2550
> 2551 trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
> 2552
> 2553 return progress;
> 2554 }
> 2555
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-15 21:12 ` kernel test robot
@ 2025-12-16 7:00 ` Vernon Yang
0 siblings, 0 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-16 7:00 UTC (permalink / raw)
To: kernel test robot
Cc: akpm, david, lorenzo.stoakes, oe-kbuild-all, ziy, baohua,
lance.yang, linux-mm, linux-kernel, Vernon Yang
On Tue, Dec 16, 2025 at 05:12:16AM +0800, kernel test robot wrote:
> Hi Vernon,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on akpm-mm/mm-everything]
> [also build test ERROR on linus/master v6.19-rc1 next-20251215]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Vernon-Yang/mm-khugepaged-add-trace_mm_khugepaged_scan-event/20251215-171046
> base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/r/20251215090419.174418-4-yanglincheng%40kylinos.cn
> patch subject: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
> config: arc-allnoconfig (https://download.01.org/0day-ci/archive/20251216/202512160400.pTmarqg6-lkp@intel.com/config)
> compiler: arc-linux-gcc (GCC) 15.1.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251216/202512160400.pTmarqg6-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202512160400.pTmarqg6-lkp@intel.com/
>
> All errors (new ones prefixed by >>):
>
> mm/madvise.c: In function 'madvise_cold':
> >> mm/madvise.c:609:9: error: implicit declaration of function 'khugepaged_move_tail'; did you mean 'khugepaged_exit'? [-Wimplicit-function-declaration]
> 609 | khugepaged_move_tail(vma->vm_mm);
> | ^~~~~~~~~~~~~~~~~~~~
> | khugepaged_exit
This build error is triggered when CONFIG_TRANSPARENT_HUGEPAGE is disabled.
I'll fix it in the next version, thanks!
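For reference, a minimal sketch of the usual fix pattern, assuming the
declaration lives in include/linux/khugepaged.h (the placement and exact
prototype are my assumptions, not the actual next version):

	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	void khugepaged_move_tail(struct mm_struct *mm);
	#else
	static inline void khugepaged_move_tail(struct mm_struct *mm)
	{
		/* Without THP there is no khugepaged scan list to reorder. */
	}
	#endif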
>
> vim +609 mm/madvise.c
>
> 595
> 596 static long madvise_cold(struct madvise_behavior *madv_behavior)
> 597 {
> 598 struct vm_area_struct *vma = madv_behavior->vma;
> 599 struct mmu_gather tlb;
> 600
> 601 if (!can_madv_lru_vma(vma))
> 602 return -EINVAL;
> 603
> 604 lru_add_drain();
> 605 tlb_gather_mmu(&tlb, madv_behavior->mm);
> 606 madvise_cold_page_range(&tlb, madv_behavior);
> 607 tlb_finish_mmu(&tlb);
> 608
> > 609 khugepaged_move_tail(vma->vm_mm);
> 610
> 611 return 0;
> 612 }
> 613
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-15 9:04 ` [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE Vernon Yang
2025-12-15 21:12 ` kernel test robot
@ 2025-12-16 13:08 ` kernel test robot
2025-12-16 13:31 ` kernel test robot
2025-12-18 9:31 ` David Hildenbrand (Red Hat)
3 siblings, 0 replies; 42+ messages in thread
From: kernel test robot @ 2025-12-16 13:08 UTC (permalink / raw)
To: Vernon Yang, akpm, david, lorenzo.stoakes
Cc: llvm, oe-kbuild-all, ziy, npache, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
Hi Vernon,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.19-rc1 next-20251216]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Vernon-Yang/mm-khugepaged-add-trace_mm_khugepaged_scan-event/20251215-171046
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20251215090419.174418-4-yanglincheng%40kylinos.cn
patch subject: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20251216/202512161406.RfF1dIYB-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251216/202512161406.RfF1dIYB-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512161406.RfF1dIYB-lkp@intel.com/
All errors (new ones prefixed by >>):
>> mm/madvise.c:609:2: error: call to undeclared function 'khugepaged_move_tail'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
609 | khugepaged_move_tail(vma->vm_mm);
| ^
mm/madvise.c:837:2: error: call to undeclared function 'khugepaged_move_tail'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
837 | khugepaged_move_tail(mm);
| ^
2 errors generated.
vim +/khugepaged_move_tail +609 mm/madvise.c
595
596 static long madvise_cold(struct madvise_behavior *madv_behavior)
597 {
598 struct vm_area_struct *vma = madv_behavior->vma;
599 struct mmu_gather tlb;
600
601 if (!can_madv_lru_vma(vma))
602 return -EINVAL;
603
604 lru_add_drain();
605 tlb_gather_mmu(&tlb, madv_behavior->mm);
606 madvise_cold_page_range(&tlb, madv_behavior);
607 tlb_finish_mmu(&tlb);
608
> 609 khugepaged_move_tail(vma->vm_mm);
610
611 return 0;
612 }
613
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-15 9:04 ` [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE Vernon Yang
2025-12-15 21:12 ` kernel test robot
2025-12-16 13:08 ` kernel test robot
@ 2025-12-16 13:31 ` kernel test robot
2025-12-18 9:31 ` David Hildenbrand (Red Hat)
3 siblings, 0 replies; 42+ messages in thread
From: kernel test robot @ 2025-12-16 13:31 UTC (permalink / raw)
To: Vernon Yang, akpm, david, lorenzo.stoakes
Cc: oe-kbuild-all, ziy, npache, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
Hi Vernon,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on next-20251216]
[cannot apply to linus/master v6.16-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Vernon-Yang/mm-khugepaged-add-trace_mm_khugepaged_scan-event/20251215-171046
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20251215090419.174418-4-yanglincheng%40kylinos.cn
patch subject: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
config: x86_64-rhel-9.4 (https://download.01.org/0day-ci/archive/20251216/202512161405.8IVTXVcr-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251216/202512161405.8IVTXVcr-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512161405.8IVTXVcr-lkp@intel.com/
All errors (new ones prefixed by >>):
mm/madvise.c: In function 'madvise_cold':
>> mm/madvise.c:609:9: error: implicit declaration of function 'khugepaged_move_tail'; did you mean 'khugepaged_exit'? [-Wimplicit-function-declaration]
609 | khugepaged_move_tail(vma->vm_mm);
| ^~~~~~~~~~~~~~~~~~~~
| khugepaged_exit
vim +609 mm/madvise.c
595
596 static long madvise_cold(struct madvise_behavior *madv_behavior)
597 {
598 struct vm_area_struct *vma = madv_behavior->vma;
599 struct mmu_gather tlb;
600
601 if (!can_madv_lru_vma(vma))
602 return -EINVAL;
603
604 lru_add_drain();
605 tlb_gather_mmu(&tlb, madv_behavior->mm);
606 madvise_cold_page_range(&tlb, madv_behavior);
607 tlb_finish_mmu(&tlb);
608
> 609 khugepaged_move_tail(vma->vm_mm);
610
611 return 0;
612 }
613
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-15 9:04 ` [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed Vernon Yang
` (2 preceding siblings ...)
2025-12-15 23:01 ` kernel test robot
@ 2025-12-17 3:31 ` Wei Yang
2025-12-18 3:27 ` Vernon Yang
2025-12-18 9:29 ` David Hildenbrand (Red Hat)
2025-12-22 19:00 ` kernel test robot
5 siblings, 1 reply; 42+ messages in thread
From: Wei Yang @ 2025-12-17 3:31 UTC (permalink / raw)
To: Vernon Yang
Cc: akpm, david, lorenzo.stoakes, ziy, npache, baohua, lance.yang,
linux-mm, linux-kernel, Vernon Yang
On Mon, Dec 15, 2025 at 05:04:17PM +0800, Vernon Yang wrote:
>The following data is traced by bpftrace on a desktop system. After
>the system has been left idle for 10 minutes upon booting, a lot of
>SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
>khugepaged.
>
>@scan_pmd_status[1]: 1 ## SCAN_SUCCEED
>@scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
>@scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
>total progress size: 701 MB
>Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
>
>The khugepaged_scan list saves all tasks that support collapsing into
>hugepages; as long as a task is not destroyed, khugepaged will not
>remove it from the khugepaged_scan list. This leads to a phenomenon
>where a task has already collapsed all of its memory regions into
>hugepages, yet khugepaged continues to scan it, which wastes CPU time
>for no benefit; and because of khugepaged_scan_sleep_millisecs (default
>10s), scanning a large number of such invalid tasks causes a long wait,
>so the really valid tasks are scanned later.
>
>After applying this patch, when all memory is either SCAN_PMD_MAPPED or
>SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
>list. If a page fault occurs or MADV_HUGEPAGE is called again, it is
>added back to khugepaged.
Two things come to my mind:
* what happens if we split the huge page under memory pressure?
* would this interfere with mTHP collapse?
>
>Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
>---
> mm/khugepaged.c | 35 +++++++++++++++++++++++++----------
> 1 file changed, 25 insertions(+), 10 deletions(-)
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index 0598a19a98cc..1ec1af5be3c8 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -115,6 +115,7 @@ struct khugepaged_scan {
> struct list_head mm_head;
> struct mm_slot *mm_slot;
> unsigned long address;
>+ bool maybe_collapse;
> };
>
> static struct khugepaged_scan khugepaged_scan = {
>@@ -1420,22 +1421,19 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> return result;
> }
>
>-static void collect_mm_slot(struct mm_slot *slot)
>+static void collect_mm_slot(struct mm_slot *slot, bool maybe_collapse)
> {
> struct mm_struct *mm = slot->mm;
>
> lockdep_assert_held(&khugepaged_mm_lock);
>
>- if (hpage_collapse_test_exit(mm)) {
>+ if (hpage_collapse_test_exit(mm) || !maybe_collapse) {
> /* free mm_slot */
> hash_del(&slot->hash);
> list_del(&slot->mm_node);
>
>- /*
>- * Not strictly needed because the mm exited already.
>- *
>- * mm_flags_clear(MMF_VM_HUGEPAGE, mm);
>- */
>+ if (!maybe_collapse)
>+ mm_flags_clear(MMF_VM_HUGEPAGE, mm);
>
> /* khugepaged_mm_lock actually not necessary for the below */
> mm_slot_free(mm_slot_cache, slot);
>@@ -2397,6 +2395,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> struct mm_slot, mm_node);
> khugepaged_scan.address = 0;
> khugepaged_scan.mm_slot = slot;
>+ khugepaged_scan.maybe_collapse = false;
> }
> spin_unlock(&khugepaged_mm_lock);
>
>@@ -2470,8 +2469,18 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> khugepaged_scan.address, &mmap_locked, cc);
> }
>
>- if (*result == SCAN_SUCCEED)
>+ switch (*result) {
>+ case SCAN_PMD_NULL:
>+ case SCAN_PMD_NONE:
>+ case SCAN_PMD_MAPPED:
>+ case SCAN_PTE_MAPPED_HUGEPAGE:
>+ break;
>+ case SCAN_SUCCEED:
> ++khugepaged_pages_collapsed;
>+ fallthrough;
If the collapse succeeds, don't we need to set maybe_collapse to true?
>+ default:
>+ khugepaged_scan.maybe_collapse = true;
>+ }
>
> /* move to next address */
> khugepaged_scan.address += HPAGE_PMD_SIZE;
>@@ -2500,6 +2509,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> * if we scanned all vmas of this mm.
> */
> if (hpage_collapse_test_exit(mm) || !vma) {
>+ bool maybe_collapse = khugepaged_scan.maybe_collapse;
>+
>+ if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
>+ maybe_collapse = true;
>+
> /*
> * Make sure that if mm_users is reaching zero while
> * khugepaged runs here, khugepaged_exit will find
>@@ -2508,12 +2522,13 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> if (!list_is_last(&slot->mm_node, &khugepaged_scan.mm_head)) {
> khugepaged_scan.mm_slot = list_next_entry(slot, mm_node);
> khugepaged_scan.address = 0;
>+ khugepaged_scan.maybe_collapse = false;
> } else {
> khugepaged_scan.mm_slot = NULL;
> khugepaged_full_scans++;
> }
>
>- collect_mm_slot(slot);
>+ collect_mm_slot(slot, maybe_collapse);
> }
>
> trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
>@@ -2616,7 +2631,7 @@ static int khugepaged(void *none)
> slot = khugepaged_scan.mm_slot;
> khugepaged_scan.mm_slot = NULL;
> if (slot)
>- collect_mm_slot(slot);
>+ collect_mm_slot(slot, true);
> spin_unlock(&khugepaged_mm_lock);
> return 0;
> }
>--
>2.51.0
>
--
Wei Yang
Help you, Help me
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-17 3:31 ` Wei Yang
@ 2025-12-18 3:27 ` Vernon Yang
2025-12-18 3:48 ` Wei Yang
0 siblings, 1 reply; 42+ messages in thread
From: Vernon Yang @ 2025-12-18 3:27 UTC (permalink / raw)
To: Wei Yang
Cc: akpm, david, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On Wed, Dec 17, 2025 at 03:31:55AM +0000, Wei Yang wrote:
> On Mon, Dec 15, 2025 at 05:04:17PM +0800, Vernon Yang wrote:
> >The following data is traced by bpftrace on a desktop system. After
> >the system has been left idle for 10 minutes upon booting, a lot of
> >SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
> >khugepaged.
> >
> >@scan_pmd_status[1]: 1 ## SCAN_SUCCEED
> >@scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
> >@scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
> >total progress size: 701 MB
> >Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
> >
> >The khugepaged_scan list saves all tasks that support collapsing into
> >hugepages; as long as a task is not destroyed, khugepaged will not
> >remove it from the khugepaged_scan list. This leads to a phenomenon
> >where a task has already collapsed all of its memory regions into
> >hugepages, yet khugepaged continues to scan it, which wastes CPU time
> >for no benefit; and because of khugepaged_scan_sleep_millisecs (default
> >10s), scanning a large number of such invalid tasks causes a long wait,
> >so the really valid tasks are scanned later.
> >
> >After applying this patch, when all memory is either SCAN_PMD_MAPPED or
> >SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
> >list. If a page fault occurs or MADV_HUGEPAGE is called again, it is
> >added back to khugepaged.
>
> Two things come to my mind:
>
> * what happens if we split the huge page under memory pressure?
static unsigned int shrink_folio_list(struct list_head *folio_list,
struct pglist_data *pgdat, struct scan_control *sc,
struct reclaim_stat *stat, bool ignore_references,
struct mem_cgroup *memcg)
{
...
folio = lru_to_folio(folio_list);
...
references = folio_check_references(folio, sc);
switch (references) {
case FOLIOREF_ACTIVATE:
goto activate_locked;
case FOLIOREF_KEEP:
stat->nr_ref_keep += nr_pages;
goto keep_locked;
case FOLIOREF_RECLAIM:
case FOLIOREF_RECLAIM_CLEAN:
; /* try to reclaim the folio below */
}
...
split_folio_to_list(folio, folio_list);
}
During memory reclaim above, only inactive folios are split. This also
implies that the folio is cold, meaning it hasn't been used recently, so
we do not expect to put the mm back onto the khugepaged scan list to
continue scanning/collapsing. khugepaged needs to scan hot folios with
priority as much as possible and collapse them to avoid wasting CPU.
> * would this interfere with mTHP collapse?
It has no impact on mTHP collapse; the mm is removed automatically only
when all memory is either SCAN_PMD_MAPPED or SCAN_PMD_NONE. In other
cases it will not be removed.
Please let me know if I missed something, thanks!
>
> >
> >Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> >---
> > mm/khugepaged.c | 35 +++++++++++++++++++++++++----------
> > 1 file changed, 25 insertions(+), 10 deletions(-)
> >
> >diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >index 0598a19a98cc..1ec1af5be3c8 100644
> >--- a/mm/khugepaged.c
> >+++ b/mm/khugepaged.c
> >@@ -115,6 +115,7 @@ struct khugepaged_scan {
> > struct list_head mm_head;
> > struct mm_slot *mm_slot;
> > unsigned long address;
> >+ bool maybe_collapse;
> > };
> >
> > static struct khugepaged_scan khugepaged_scan = {
> >@@ -1420,22 +1421,19 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> > return result;
> > }
> >
> >-static void collect_mm_slot(struct mm_slot *slot)
> >+static void collect_mm_slot(struct mm_slot *slot, bool maybe_collapse)
> > {
> > struct mm_struct *mm = slot->mm;
> >
> > lockdep_assert_held(&khugepaged_mm_lock);
> >
> >- if (hpage_collapse_test_exit(mm)) {
> >+ if (hpage_collapse_test_exit(mm) || !maybe_collapse) {
> > /* free mm_slot */
> > hash_del(&slot->hash);
> > list_del(&slot->mm_node);
> >
> >- /*
> >- * Not strictly needed because the mm exited already.
> >- *
> >- * mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> >- */
> >+ if (!maybe_collapse)
> >+ mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> >
> > /* khugepaged_mm_lock actually not necessary for the below */
> > mm_slot_free(mm_slot_cache, slot);
> >@@ -2397,6 +2395,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > struct mm_slot, mm_node);
> > khugepaged_scan.address = 0;
> > khugepaged_scan.mm_slot = slot;
> >+ khugepaged_scan.maybe_collapse = false;
> > }
> > spin_unlock(&khugepaged_mm_lock);
> >
> >@@ -2470,8 +2469,18 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > khugepaged_scan.address, &mmap_locked, cc);
> > }
> >
> >- if (*result == SCAN_SUCCEED)
> >+ switch (*result) {
> >+ case SCAN_PMD_NULL:
> >+ case SCAN_PMD_NONE:
> >+ case SCAN_PMD_MAPPED:
> >+ case SCAN_PTE_MAPPED_HUGEPAGE:
> >+ break;
> >+ case SCAN_SUCCEED:
> > ++khugepaged_pages_collapsed;
> >+ fallthrough;
>
> If collapse successfully, we don't need to set maybe_collapse to true?
Above "fallthrough" explicitly tells the compiler that when the collapse is
successful, run below "khugepaged_scan.maybe_collapse = true" :)
> >+ default:
> >+ khugepaged_scan.maybe_collapse = true;
> >+ }
> >
> > /* move to next address */
> > khugepaged_scan.address += HPAGE_PMD_SIZE;
> >@@ -2500,6 +2509,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > * if we scanned all vmas of this mm.
> > */
> > if (hpage_collapse_test_exit(mm) || !vma) {
> >+ bool maybe_collapse = khugepaged_scan.maybe_collapse;
> >+
> >+ if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
> >+ maybe_collapse = true;
> >+
> > /*
> > * Make sure that if mm_users is reaching zero while
> > * khugepaged runs here, khugepaged_exit will find
> >@@ -2508,12 +2522,13 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > if (!list_is_last(&slot->mm_node, &khugepaged_scan.mm_head)) {
> > khugepaged_scan.mm_slot = list_next_entry(slot, mm_node);
> > khugepaged_scan.address = 0;
> >+ khugepaged_scan.maybe_collapse = false;
> > } else {
> > khugepaged_scan.mm_slot = NULL;
> > khugepaged_full_scans++;
> > }
> >
> >- collect_mm_slot(slot);
> >+ collect_mm_slot(slot, maybe_collapse);
> > }
> >
> > trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
> >@@ -2616,7 +2631,7 @@ static int khugepaged(void *none)
> > slot = khugepaged_scan.mm_slot;
> > khugepaged_scan.mm_slot = NULL;
> > if (slot)
> >- collect_mm_slot(slot);
> >+ collect_mm_slot(slot, true);
> > spin_unlock(&khugepaged_mm_lock);
> > return 0;
> > }
> >--
> >2.51.0
> >
>
> --
> Wei Yang
> Help you, Help me
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-18 3:27 ` Vernon Yang
@ 2025-12-18 3:48 ` Wei Yang
2025-12-18 4:41 ` Vernon Yang
0 siblings, 1 reply; 42+ messages in thread
From: Wei Yang @ 2025-12-18 3:48 UTC (permalink / raw)
To: Vernon Yang
Cc: Wei Yang, akpm, david, lorenzo.stoakes, ziy, baohua, lance.yang,
linux-mm, linux-kernel, Vernon Yang
On Thu, Dec 18, 2025 at 11:27:24AM +0800, Vernon Yang wrote:
>On Wed, Dec 17, 2025 at 03:31:55AM +0000, Wei Yang wrote:
>> On Mon, Dec 15, 2025 at 05:04:17PM +0800, Vernon Yang wrote:
>> >The following data is traced by bpftrace on a desktop system. After
>> >the system has been left idle for 10 minutes upon booting, a lot of
>> >SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
>> >khugepaged.
>> >
>> >@scan_pmd_status[1]: 1 ## SCAN_SUCCEED
>> >@scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
>> >@scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
>> >total progress size: 701 MB
>> >Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
>> >
>> >The khugepaged_scan list saves all tasks that support collapsing into
>> >hugepages; as long as a task is not destroyed, khugepaged will not
>> >remove it from the khugepaged_scan list. This leads to a phenomenon
>> >where a task has already collapsed all of its memory regions into
>> >hugepages, yet khugepaged continues to scan it, which wastes CPU time
>> >for no benefit; and because of khugepaged_scan_sleep_millisecs (default
>> >10s), scanning a large number of such invalid tasks causes a long wait,
>> >so the really valid tasks are scanned later.
>> >
>> >After applying this patch, when all memory is either SCAN_PMD_MAPPED or
>> >SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
>> >list. If a page fault occurs or MADV_HUGEPAGE is called again, it is
>> >added back to khugepaged.
>>
>> Two things come to my mind:
>>
>> * what happens if we split the huge page under memory pressure?
>
>static unsigned int shrink_folio_list(struct list_head *folio_list,
> struct pglist_data *pgdat, struct scan_control *sc,
> struct reclaim_stat *stat, bool ignore_references,
> struct mem_cgroup *memcg)
>{
> ...
>
> folio = lru_to_folio(folio_list);
>
> ...
>
> references = folio_check_references(folio, sc);
> switch (references) {
> case FOLIOREF_ACTIVATE:
> goto activate_locked;
> case FOLIOREF_KEEP:
> stat->nr_ref_keep += nr_pages;
> goto keep_locked;
> case FOLIOREF_RECLAIM:
> case FOLIOREF_RECLAIM_CLEAN:
> ; /* try to reclaim the folio below */
> }
>
> ...
>
> split_folio_to_list(folio, folio_list);
>}
>
>During memory reclaim above, only inactive folios are split. This also
>implies that the folio is cold, meaning it hasn't been used recently, so
>we do not expect to put the mm back onto the khugepaged scan list to
>continue scanning/collapsing. khugepaged needs to scan hot folios with
>priority as much as possible and collapse them to avoid wasting CPU.
>
So we will never put this process back onto the scan list, right?
>> * would this interfere with mTHP collapse?
>
>It has no impact on mTHP collapse; the mm is removed automatically only
>when all memory is either SCAN_PMD_MAPPED or SCAN_PMD_NONE. In other
>cases it will not be removed.
>
>Please let me know if I missed something, thanks!
>
>>
>> >
>> >Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
>> >---
>> > mm/khugepaged.c | 35 +++++++++++++++++++++++++----------
>> > 1 file changed, 25 insertions(+), 10 deletions(-)
>> >
>> >diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> >index 0598a19a98cc..1ec1af5be3c8 100644
>> >--- a/mm/khugepaged.c
>> >+++ b/mm/khugepaged.c
>> >@@ -115,6 +115,7 @@ struct khugepaged_scan {
>> > struct list_head mm_head;
>> > struct mm_slot *mm_slot;
>> > unsigned long address;
>> >+ bool maybe_collapse;
>> > };
>> >
>> > static struct khugepaged_scan khugepaged_scan = {
>> >@@ -1420,22 +1421,19 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>> > return result;
>> > }
>> >
>> >-static void collect_mm_slot(struct mm_slot *slot)
>> >+static void collect_mm_slot(struct mm_slot *slot, bool maybe_collapse)
>> > {
>> > struct mm_struct *mm = slot->mm;
>> >
>> > lockdep_assert_held(&khugepaged_mm_lock);
>> >
>> >- if (hpage_collapse_test_exit(mm)) {
>> >+ if (hpage_collapse_test_exit(mm) || !maybe_collapse) {
>> > /* free mm_slot */
>> > hash_del(&slot->hash);
>> > list_del(&slot->mm_node);
>> >
>> >- /*
>> >- * Not strictly needed because the mm exited already.
>> >- *
>> >- * mm_flags_clear(MMF_VM_HUGEPAGE, mm);
>> >- */
>> >+ if (!maybe_collapse)
>> >+ mm_flags_clear(MMF_VM_HUGEPAGE, mm);
>> >
>> > /* khugepaged_mm_lock actually not necessary for the below */
>> > mm_slot_free(mm_slot_cache, slot);
>> >@@ -2397,6 +2395,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>> > struct mm_slot, mm_node);
>> > khugepaged_scan.address = 0;
>> > khugepaged_scan.mm_slot = slot;
>> >+ khugepaged_scan.maybe_collapse = false;
>> > }
>> > spin_unlock(&khugepaged_mm_lock);
>> >
>> >@@ -2470,8 +2469,18 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>> > khugepaged_scan.address, &mmap_locked, cc);
>> > }
>> >
>> >- if (*result == SCAN_SUCCEED)
>> >+ switch (*result) {
>> >+ case SCAN_PMD_NULL:
>> >+ case SCAN_PMD_NONE:
>> >+ case SCAN_PMD_MAPPED:
>> >+ case SCAN_PTE_MAPPED_HUGEPAGE:
>> >+ break;
>> >+ case SCAN_SUCCEED:
>> > ++khugepaged_pages_collapsed;
>> >+ fallthrough;
>>
>> If the collapse succeeds, don't we need to set maybe_collapse to true?
>
>Above "fallthrough" explicitly tells the compiler that when the collapse is
>successful, run below "khugepaged_scan.maybe_collapse = true" :)
>
Got it, thanks.
>> >+ default:
>> >+ khugepaged_scan.maybe_collapse = true;
>> >+ }
>> >
>> > /* move to next address */
>> > khugepaged_scan.address += HPAGE_PMD_SIZE;
>> >@@ -2500,6 +2509,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>> > * if we scanned all vmas of this mm.
>> > */
>> > if (hpage_collapse_test_exit(mm) || !vma) {
>> >+ bool maybe_collapse = khugepaged_scan.maybe_collapse;
>> >+
>> >+ if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
>> >+ maybe_collapse = true;
>> >+
>> > /*
>> > * Make sure that if mm_users is reaching zero while
>> > * khugepaged runs here, khugepaged_exit will find
>> >@@ -2508,12 +2522,13 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>> > if (!list_is_last(&slot->mm_node, &khugepaged_scan.mm_head)) {
>> > khugepaged_scan.mm_slot = list_next_entry(slot, mm_node);
>> > khugepaged_scan.address = 0;
>> >+ khugepaged_scan.maybe_collapse = false;
>> > } else {
>> > khugepaged_scan.mm_slot = NULL;
>> > khugepaged_full_scans++;
>> > }
>> >
>> >- collect_mm_slot(slot);
>> >+ collect_mm_slot(slot, maybe_collapse);
>> > }
>> >
>> > trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
>> >@@ -2616,7 +2631,7 @@ static int khugepaged(void *none)
>> > slot = khugepaged_scan.mm_slot;
>> > khugepaged_scan.mm_slot = NULL;
>> > if (slot)
>> >- collect_mm_slot(slot);
>> >+ collect_mm_slot(slot, true);
>> > spin_unlock(&khugepaged_mm_lock);
>> > return 0;
>> > }
>> >--
>> >2.51.0
>> >
>>
>> --
>> Wei Yang
>> Help you, Help me
>
>--
>Thanks,
>Vernon
--
Wei Yang
Help you, Help me
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-18 3:48 ` Wei Yang
@ 2025-12-18 4:41 ` Vernon Yang
0 siblings, 0 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-18 4:41 UTC (permalink / raw)
To: Wei Yang
Cc: akpm, david, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On Thu, Dec 18, 2025 at 03:48:01AM +0000, Wei Yang wrote:
> On Thu, Dec 18, 2025 at 11:27:24AM +0800, Vernon Yang wrote:
> >On Wed, Dec 17, 2025 at 03:31:55AM +0000, Wei Yang wrote:
> >> On Mon, Dec 15, 2025 at 05:04:17PM +0800, Vernon Yang wrote:
> >> >The following data is traced by bpftrace on a desktop system. After
> >> >the system has been left idle for 10 minutes upon booting, a lot of
> >> >SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
> >> >khugepaged.
> >> >
> >> >@scan_pmd_status[1]: 1 ## SCAN_SUCCEED
> >> >@scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
> >> >@scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
> >> >total progress size: 701 MB
> >> >Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
> >> >
> >> >The khugepaged_scan list saves all tasks that support collapsing into
> >> >hugepages; as long as a task is not destroyed, khugepaged will not
> >> >remove it from the khugepaged_scan list. This leads to a phenomenon
> >> >where a task has already collapsed all of its memory regions into
> >> >hugepages, yet khugepaged continues to scan it, which wastes CPU time
> >> >for no benefit; and because of khugepaged_scan_sleep_millisecs (default
> >> >10s), scanning a large number of such invalid tasks causes a long wait,
> >> >so the really valid tasks are scanned later.
> >> >
> >> >After applying this patch, when all memory is either SCAN_PMD_MAPPED or
> >> >SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
> >> >list. If a page fault occurs or MADV_HUGEPAGE is called again, it is
> >> >added back to khugepaged.
> >>
> >> Two things come to my mind:
> >>
> >> * what happens if we split the huge page under memory pressure?
> >
> >static unsigned int shrink_folio_list(struct list_head *folio_list,
> > struct pglist_data *pgdat, struct scan_control *sc,
> > struct reclaim_stat *stat, bool ignore_references,
> > struct mem_cgroup *memcg)
> >{
> > ...
> >
> > folio = lru_to_folio(folio_list);
> >
> > ...
> >
> > references = folio_check_references(folio, sc);
> > switch (references) {
> > case FOLIOREF_ACTIVATE:
> > goto activate_locked;
> > case FOLIOREF_KEEP:
> > stat->nr_ref_keep += nr_pages;
> > goto keep_locked;
> > case FOLIOREF_RECLAIM:
> > case FOLIOREF_RECLAIM_CLEAN:
> > ; /* try to reclaim the folio below */
> > }
> >
> > ...
> >
> > split_folio_to_list(folio, folio_list);
> >}
> >
> >During memory reclaim above, only inactive folios are split. This also
> >implies that the folio is cold, meaning it hasn't been used recently, so
> >we do not expect to put the mm back onto the khugepaged scan list to
> >continue scanning/collapsing. khugepaged needs to scan hot folios with
> >priority as much as possible and collapse them to avoid wasting CPU.
> >
>
> So we will never put this process back onto the scan list, right?
No. If a page fault occurs or MADV_HUGEPAGE is called again, the task is
added back to the khugepaged scan list; we just don't actively put it
back onto the scan list after a split.
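To make the re-add path concrete, roughly (a paraphrased sketch of the
existing hook, not verbatim kernel code):

	/* Called on the THP fault path and from hugepage_madvise(). */
	void khugepaged_enter_vma(struct vm_area_struct *vma, vm_flags_t vm_flags)
	{
		/* Put the mm (back) on the scan list if it is not tracked yet. */
		if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
		    thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
			__khugepaged_enter(vma->vm_mm);
	}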
>
> >> * would this interfere with mTHP collapse?
> >
> >It has no impact on mTHP collapse; the mm is removed automatically only
> >when all memory is either SCAN_PMD_MAPPED or SCAN_PMD_NONE. In other
> >cases it will not be removed.
> >
> >Please let me know if I missed something, thanks!
> >
> >>
> >> >
> >> >Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> >> >---
> >> > mm/khugepaged.c | 35 +++++++++++++++++++++++++----------
> >> > 1 file changed, 25 insertions(+), 10 deletions(-)
> >> >
> >> >diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >> >index 0598a19a98cc..1ec1af5be3c8 100644
> >> >--- a/mm/khugepaged.c
> >> >+++ b/mm/khugepaged.c
> >> >@@ -115,6 +115,7 @@ struct khugepaged_scan {
> >> > struct list_head mm_head;
> >> > struct mm_slot *mm_slot;
> >> > unsigned long address;
> >> >+ bool maybe_collapse;
> >> > };
> >> >
> >> > static struct khugepaged_scan khugepaged_scan = {
> >> >@@ -1420,22 +1421,19 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> >> > return result;
> >> > }
> >> >
> >> >-static void collect_mm_slot(struct mm_slot *slot)
> >> >+static void collect_mm_slot(struct mm_slot *slot, bool maybe_collapse)
> >> > {
> >> > struct mm_struct *mm = slot->mm;
> >> >
> >> > lockdep_assert_held(&khugepaged_mm_lock);
> >> >
> >> >- if (hpage_collapse_test_exit(mm)) {
> >> >+ if (hpage_collapse_test_exit(mm) || !maybe_collapse) {
> >> > /* free mm_slot */
> >> > hash_del(&slot->hash);
> >> > list_del(&slot->mm_node);
> >> >
> >> >- /*
> >> >- * Not strictly needed because the mm exited already.
> >> >- *
> >> >- * mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> >> >- */
> >> >+ if (!maybe_collapse)
> >> >+ mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> >> >
> >> > /* khugepaged_mm_lock actually not necessary for the below */
> >> > mm_slot_free(mm_slot_cache, slot);
> >> >@@ -2397,6 +2395,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >> > struct mm_slot, mm_node);
> >> > khugepaged_scan.address = 0;
> >> > khugepaged_scan.mm_slot = slot;
> >> >+ khugepaged_scan.maybe_collapse = false;
> >> > }
> >> > spin_unlock(&khugepaged_mm_lock);
> >> >
> >> >@@ -2470,8 +2469,18 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >> > khugepaged_scan.address, &mmap_locked, cc);
> >> > }
> >> >
> >> >- if (*result == SCAN_SUCCEED)
> >> >+ switch (*result) {
> >> >+ case SCAN_PMD_NULL:
> >> >+ case SCAN_PMD_NONE:
> >> >+ case SCAN_PMD_MAPPED:
> >> >+ case SCAN_PTE_MAPPED_HUGEPAGE:
> >> >+ break;
> >> >+ case SCAN_SUCCEED:
> >> > ++khugepaged_pages_collapsed;
> >> >+ fallthrough;
> >>
> >> If the collapse succeeds, don't we need to set maybe_collapse to true?
> >
> >Above "fallthrough" explicitly tells the compiler that when the collapse is
> >successful, run below "khugepaged_scan.maybe_collapse = true" :)
> >
>
> Got it, thanks.
>
> >> >+ default:
> >> >+ khugepaged_scan.maybe_collapse = true;
> >> >+ }
> >> >
> >> > /* move to next address */
> >> > khugepaged_scan.address += HPAGE_PMD_SIZE;
> >> >@@ -2500,6 +2509,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >> > * if we scanned all vmas of this mm.
> >> > */
> >> > if (hpage_collapse_test_exit(mm) || !vma) {
> >> >+ bool maybe_collapse = khugepaged_scan.maybe_collapse;
> >> >+
> >> >+ if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
> >> >+ maybe_collapse = true;
> >> >+
> >> > /*
> >> > * Make sure that if mm_users is reaching zero while
> >> > * khugepaged runs here, khugepaged_exit will find
> >> >@@ -2508,12 +2522,13 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >> > if (!list_is_last(&slot->mm_node, &khugepaged_scan.mm_head)) {
> >> > khugepaged_scan.mm_slot = list_next_entry(slot, mm_node);
> >> > khugepaged_scan.address = 0;
> >> >+ khugepaged_scan.maybe_collapse = false;
> >> > } else {
> >> > khugepaged_scan.mm_slot = NULL;
> >> > khugepaged_full_scans++;
> >> > }
> >> >
> >> >- collect_mm_slot(slot);
> >> >+ collect_mm_slot(slot, maybe_collapse);
> >> > }
> >> >
> >> > trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
> >> >@@ -2616,7 +2631,7 @@ static int khugepaged(void *none)
> >> > slot = khugepaged_scan.mm_slot;
> >> > khugepaged_scan.mm_slot = NULL;
> >> > if (slot)
> >> >- collect_mm_slot(slot);
> >> >+ collect_mm_slot(slot, true);
> >> > spin_unlock(&khugepaged_mm_lock);
> >> > return 0;
> >> > }
> >> >--
> >> >2.51.0
> >> >
> >>
> >> --
> >> Wei Yang
> >> Help you, Help me
> >
> >--
> >Thanks,
> >Vernon
>
> --
> Wei Yang
> Help you, Help me
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 1/4] mm: khugepaged: add trace_mm_khugepaged_scan event
2025-12-15 9:04 ` [PATCH 1/4] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
@ 2025-12-18 9:24 ` David Hildenbrand (Red Hat)
2025-12-19 5:21 ` Vernon Yang
0 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18 9:24 UTC (permalink / raw)
To: Vernon Yang, akpm, lorenzo.stoakes
Cc: ziy, npache, baohua, lance.yang, linux-mm, linux-kernel, Vernon Yang
On 12/15/25 10:04, Vernon Yang wrote:
> Add an mm_khugepaged_scan event to track the total time of a full scan
> and the total number of pages scanned by khugepaged.
>
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> ---
> include/trace/events/huge_memory.h | 24 ++++++++++++++++++++++++
> mm/khugepaged.c | 2 ++
> 2 files changed, 26 insertions(+)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index dd94d14a2427..b2824c2f8238 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -237,5 +237,29 @@ TRACE_EVENT(mm_khugepaged_collapse_file,
> __print_symbolic(__entry->result, SCAN_STATUS))
> );
>
> +TRACE_EVENT(mm_khugepaged_scan,
> +
> + TP_PROTO(struct mm_struct *mm, int progress, bool full),
> +
> + TP_ARGS(mm, progress, full),
> +
> + TP_STRUCT__entry(
> + __field(struct mm_struct *, mm)
> + __field(int, progress)
> + __field(bool, full)
> + ),
> +
> + TP_fast_assign(
> + __entry->mm = mm;
> + __entry->progress = progress;
> + __entry->full = full;
> + ),
> +
> + TP_printk("mm=%p, progress=%d, full=%d",
> + __entry->mm,
> + __entry->progress,
> + __entry->full)
> +);
> +
> #endif /* __HUGE_MEMORY_H */
> #include <trace/define_trace.h>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index abe54f0043c7..0598a19a98cc 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2516,6 +2516,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> collect_mm_slot(slot);
> }
>
> + trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
> +
> return progress;
> }
>
Nothing jumped out at me, except that "full" could be called
"full_scan_finished" or something like that.
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
--
Cheers
David
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-15 9:04 ` [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed Vernon Yang
` (3 preceding siblings ...)
2025-12-17 3:31 ` Wei Yang
@ 2025-12-18 9:29 ` David Hildenbrand (Red Hat)
2025-12-19 5:24 ` Vernon Yang
2025-12-19 8:35 ` Vernon Yang
2025-12-22 19:00 ` kernel test robot
5 siblings, 2 replies; 42+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18 9:29 UTC (permalink / raw)
To: Vernon Yang, akpm, lorenzo.stoakes
Cc: ziy, npache, baohua, lance.yang, linux-mm, linux-kernel, Vernon Yang
On 12/15/25 10:04, Vernon Yang wrote:
> The following data is traced by bpftrace on a desktop system. After
> the system has been left idle for 10 minutes upon booting, a lot of
> SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
> khugepaged.
>
> @scan_pmd_status[1]: 1 ## SCAN_SUCCEED
> @scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
> @scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
> total progress size: 701 MB
> Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
>
> The khugepaged_scan list saves all tasks that support collapsing into
> hugepages; as long as a task is not destroyed, khugepaged will not
> remove it from the khugepaged_scan list. This leads to a phenomenon
> where a task has already collapsed all of its memory regions into
> hugepages, yet khugepaged continues to scan it, which wastes CPU time
> for no benefit; and because of khugepaged_scan_sleep_millisecs (default
> 10s), scanning a large number of such invalid tasks causes a long wait,
> so the really valid tasks are scanned later.
>
> After applying this patch, when all memory is either SCAN_PMD_MAPPED or
> SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
> list. If a page fault occurs or MADV_HUGEPAGE is called again, it is
> added back to khugepaged.
I don't like that, as it assumes that memory within such a process would
be rather static, which is easily not the case (e.g., allocators just
doing MADV_DONTNEED to free memory).
If most stuff is collapsed to PMDs already, can't we just skip over
these regions a bit faster?
--
Cheers
David
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-15 9:04 ` [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE Vernon Yang
` (2 preceding siblings ...)
2025-12-16 13:31 ` kernel test robot
@ 2025-12-18 9:31 ` David Hildenbrand (Red Hat)
2025-12-19 5:29 ` Vernon Yang
3 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18 9:31 UTC (permalink / raw)
To: Vernon Yang, akpm, lorenzo.stoakes
Cc: ziy, npache, baohua, lance.yang, linux-mm, linux-kernel, Vernon Yang
On 12/15/25 10:04, Vernon Yang wrote:
> For example, create three tasks: hot1 -> cold -> hot2. After all three
> tasks are created, each allocates 128MB of memory. The hot1/hot2 tasks
> continuously access their 128 MB of memory, while the cold task only
> accesses its memory briefly and then calls madvise(MADV_COLD). However,
> khugepaged still prioritizes scanning the cold task and only scans the
> hot2 task after completing the scan of the cold task.
>
> So if the user has explicitly informed us via MADV_COLD/FREE that this
> memory is cold or will be freed, it is appropriate for khugepaged to
> scan it only at the latest possible moment, thereby avoiding unnecessary
> scan and collapse operations to reduce CPU wastage.
>
> Here are the performance test results:
> (Throughput bigger is better, other smaller is better)
>
> Testing on x86_64 machine:
>
> | task hot2 | without patch | with patch | delta |
> |---------------------|---------------|---------------|---------|
> | total accesses time | 3.14 sec | 2.92 sec | -7.01% |
> | cycles per access | 4.91 | 2.07 | -57.84% |
> | Throughput | 104.38 M/sec | 112.12 M/sec | +7.42% |
> | dTLB-load-misses | 288966432 | 1292908 | -99.55% |
>
> Testing on qemu-system-x86_64 -enable-kvm:
>
> | task hot2 | without patch | with patch | delta |
> |---------------------|---------------|---------------|---------|
> | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
> | cycles per access | 7.23 | 2.12 | -70.68% |
> | Throughput | 97.88 M/sec | 110.76 M/sec | +13.16% |
> | dTLB-load-misses | 237406497 | 3189194 | -98.66% |
Again, I also don't like that, because you make assumptions about a full
process based on some part of its address space.
E.g., if a library issues a MADV_COLD on some part of the memory the
library manages, why should the remaining part of the process suffer as
well?
This seems to be a heuristic focused on some specific workloads, no?
--
Cheers
David
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 4/4] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
2025-12-15 9:04 ` [PATCH 4/4] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY Vernon Yang
@ 2025-12-18 9:33 ` David Hildenbrand (Red Hat)
2025-12-19 5:31 ` Vernon Yang
0 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18 9:33 UTC (permalink / raw)
To: Vernon Yang, akpm, lorenzo.stoakes
Cc: ziy, npache, baohua, lance.yang, linux-mm, linux-kernel, Vernon Yang
On 12/15/25 10:04, Vernon Yang wrote:
> When an mm with the MMF_DISABLE_THP_COMPLETELY flag is detected during
> scanning, directly set khugepaged_scan.mm_slot to the next mm_slot,
> reduce redundant operation.
That conceptually makes sense to me. How much does that save in
practice? Do you have some performance numbers for processes with a
rather large number of VMAs?
--
Cheers
David
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 1/4] mm: khugepaged: add trace_mm_khugepaged_scan event
2025-12-18 9:24 ` David Hildenbrand (Red Hat)
@ 2025-12-19 5:21 ` Vernon Yang
0 siblings, 0 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-19 5:21 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On Thu, Dec 18, 2025 at 10:24:05AM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/15/25 10:04, Vernon Yang wrote:
> > Add an mm_khugepaged_scan event to track the total time of a full scan
> > and the total number of pages scanned by khugepaged.
> >
> > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > ---
> > include/trace/events/huge_memory.h | 24 ++++++++++++++++++++++++
> > mm/khugepaged.c | 2 ++
> > 2 files changed, 26 insertions(+)
> >
> > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > index dd94d14a2427..b2824c2f8238 100644
> > --- a/include/trace/events/huge_memory.h
> > +++ b/include/trace/events/huge_memory.h
> > @@ -237,5 +237,29 @@ TRACE_EVENT(mm_khugepaged_collapse_file,
> > __print_symbolic(__entry->result, SCAN_STATUS))
> > );
> > +TRACE_EVENT(mm_khugepaged_scan,
> > +
> > + TP_PROTO(struct mm_struct *mm, int progress, bool full),
> > +
> > + TP_ARGS(mm, progress, full),
> > +
> > + TP_STRUCT__entry(
> > + __field(struct mm_struct *, mm)
> > + __field(int, progress)
> > + __field(bool, full)
> > + ),
> > +
> > + TP_fast_assign(
> > + __entry->mm = mm;
> > + __entry->progress = progress;
> > + __entry->full = full;
> > + ),
> > +
> > + TP_printk("mm=%p, progress=%d, full=%d",
> > + __entry->mm,
> > + __entry->progress,
> > + __entry->full)
> > +);
> > +
> > #endif /* __HUGE_MEMORY_H */
> > #include <trace/define_trace.h>
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index abe54f0043c7..0598a19a98cc 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2516,6 +2516,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > collect_mm_slot(slot);
> > }
> > + trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
> > +
> > return progress;
> > }
>
> Nothing jumped out at me, except that "full" could be called
> "full_scan_finished" or something like that.
Thanks for your review. full_scan_finished sounds good to me; I'll use
it in the next version.
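A sketch of what the rename might look like in the event definition
(assuming only the field and format-string names change):

	__field(bool, full_scan_finished)
	...
	__entry->full_scan_finished = full_scan_finished;
	...
	TP_printk("mm=%p, progress=%d, full_scan_finished=%d",
		  __entry->mm,
		  __entry->progress,
		  __entry->full_scan_finished)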
>
> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
>
> --
> Cheers
>
> David
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-18 9:29 ` David Hildenbrand (Red Hat)
@ 2025-12-19 5:24 ` Vernon Yang
2025-12-19 9:00 ` David Hildenbrand (Red Hat)
2025-12-19 8:35 ` Vernon Yang
1 sibling, 1 reply; 42+ messages in thread
From: Vernon Yang @ 2025-12-19 5:24 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On Thu, Dec 18, 2025 at 10:29:18AM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/15/25 10:04, Vernon Yang wrote:
> > The following data is traced by bpftrace on a desktop system. After
> > the system has been left idle for 10 minutes upon booting, a lot of
> > SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
> > khugepaged.
> >
> > @scan_pmd_status[1]: 1 ## SCAN_SUCCEED
> > @scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
> > @scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
> > total progress size: 701 MB
> > Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
> >
> > The khugepaged_scan list saves all tasks that support collapsing into
> > hugepages; as long as a task is not destroyed, khugepaged will not
> > remove it from the khugepaged_scan list. This leads to a phenomenon
> > where a task has already collapsed all of its memory regions into
> > hugepages, yet khugepaged continues to scan it, which wastes CPU time
> > for no benefit; and because of khugepaged_scan_sleep_millisecs (default
> > 10s), scanning a large number of such invalid tasks causes a long wait,
> > so the really valid tasks are scanned later.
> >
> > After applying this patch, when all memory is either SCAN_PMD_MAPPED or
> > SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
> > list. If a page fault occurs or MADV_HUGEPAGE is called again, it is
> > added back to khugepaged.
>
> I don't like that, as it assumes that memory within such a process would be
> rather static, which is easily not the case (e.g., allocators just doing
> MADV_DONTNEED to free memory).
>
> If most stuff is collapsed to PMDs already, can't we just skip over these
> regions a bit faster?
/* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 second */
static unsigned int khugepaged_pages_to_scan __read_mostly;
The observed phenomenon is that when scanning these regions, the loop is
broken upon reaching khugepaged_pages_to_scan, and khugepaged therefore
enters its 10s sleep. So if we just skip over these regions, it will
break the semantics of khugepaged_pages_to_scan.
I also think that approach is great because it is sufficiently simple.
If we can skip over these regions directly, that's excellent.
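For context, the budget works roughly like this (a paraphrased sketch of
khugepaged_do_scan(), not verbatim):

	unsigned int progress = 0;
	unsigned int pages = READ_ONCE(khugepaged_pages_to_scan);

	while (progress < pages) {
		/* Each PMD range visited accounts HPAGE_PMD_NR of progress,
		 * so already-collapsed ranges burn the budget just as fast
		 * as ranges that really get scanned. */
		progress += khugepaged_scan_mm_slot(pages - progress,
						    &result, cc);
	}
	/* then sleep khugepaged_scan_sleep_millisecs before the next batch */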
> --
> Cheers
>
> David
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-18 9:31 ` David Hildenbrand (Red Hat)
@ 2025-12-19 5:29 ` Vernon Yang
2025-12-19 8:58 ` David Hildenbrand (Red Hat)
0 siblings, 1 reply; 42+ messages in thread
From: Vernon Yang @ 2025-12-19 5:29 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On Thu, Dec 18, 2025 at 10:31:58AM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/15/25 10:04, Vernon Yang wrote:
> > For example, create three tasks: hot1 -> cold -> hot2. After all three
> > tasks are created, each allocates 128MB of memory. The hot1/hot2 tasks
> > continuously access their 128 MB of memory, while the cold task only
> > accesses its memory briefly and then calls madvise(MADV_COLD). However,
> > khugepaged still prioritizes scanning the cold task and only scans the
> > hot2 task after completing the scan of the cold task.
> >
> > So if the user has explicitly informed us via MADV_COLD/FREE that this
> > memory is cold or will be freed, it is appropriate for khugepaged to
> > scan it only at the latest possible moment, thereby avoiding unnecessary
> > scan and collapse operations to reduce CPU wastage.
> >
> > Here are the performance test results:
> > (Throughput bigger is better, other smaller is better)
> >
> > Testing on x86_64 machine:
> >
> > | task hot2 | without patch | with patch | delta |
> > |---------------------|---------------|---------------|---------|
> > | total accesses time | 3.14 sec | 2.92 sec | -7.01% |
> > | cycles per access | 4.91 | 2.07 | -57.84% |
> > | Throughput | 104.38 M/sec | 112.12 M/sec | +7.42% |
> > | dTLB-load-misses | 288966432 | 1292908 | -99.55% |
> >
> > Testing on qemu-system-x86_64 -enable-kvm:
> >
> > | task hot2 | without patch | with patch | delta |
> > |---------------------|---------------|---------------|---------|
> > | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
> > | cycles per access | 7.23 | 2.12 | -70.68% |
> > | Throughput | 97.88 M/sec | 110.76 M/sec | +13.16% |
> > | dTLB-load-misses | 237406497 | 3189194 | -98.66% |
>
> Again, I also don't like that, because you make assumptions about a full
> process based on some part of its address space.
>
> E.g., if a library issues a MADV_COLD on some part of the memory the library
> manages, why should the remaining part of the process suffer as well?
Yes, you make a good point, thanks!
> This seems to be a heuristic focused on some specific workloads, no?
Right.
Could we use the VM_NOHUGEPAGE flag to indicate that this region should
not be collapsed, so that khugepaged can simply skip this VMA during
scanning? This way, it won't affect the remaining part of the task's
memory regions.
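For reference, the per-VMA skip would then fall out of the existing check
in khugepaged_scan_mm_slot() (quoted earlier in this thread), since
thp_vma_allowable_order() returns no allowable orders for a
VM_NOHUGEPAGE region:

	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
		progress++;
		continue;	/* VM_NOHUGEPAGE (among others) ends up here */
	}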
> --
> Cheers
>
> David
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 4/4] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
2025-12-18 9:33 ` David Hildenbrand (Red Hat)
@ 2025-12-19 5:31 ` Vernon Yang
0 siblings, 0 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-19 5:31 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On Thu, Dec 18, 2025 at 10:33:16AM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/15/25 10:04, Vernon Yang wrote:
> > When an mm with the MMF_DISABLE_THP_COMPLETELY flag is detected during
> > scanning, directly set khugepaged_scan.mm_slot to the next mm_slot,
> > reduce redundant operation.
>
> That conceptually makes sense to me. How much does that save in practice? Do
> you have some performance numbers for processes with a rather large number of
> VMAs?
I only came to this possibility through theoretical analysis and haven't
done any separate performance tests for this patch yet.
If you have anything you'd like to test, please let me know, and I can
measure the performance benefits.
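If it helps, a hypothetical userspace sketch for such a test (untested,
and entirely my assumption about the setup): fault in memory first so
the mm lands on the khugepaged list, then disable THP for the process
and split the mapping into many VMAs:

	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/prctl.h>
	#include <unistd.h>

	#ifndef PR_SET_THP_DISABLE
	#define PR_SET_THP_DISABLE 41
	#endif

	int main(int argc, char **argv)
	{
		long nr = argc > 1 ? atol(argv[1]) : 10000;
		long psz = sysconf(_SC_PAGESIZE);
		char *p = mmap(NULL, 2 * nr * psz, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		/* Fault the range in so the mm is put on the khugepaged list. */
		memset(p, 1, 2 * nr * psz);
		/* Now disable THP for the whole process. */
		prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
		/* Split the mapping into ~2 * nr VMAs by alternating protections. */
		for (long i = 0; i < nr; i++)
			mprotect(p + 2 * i * psz, psz, PROT_READ);
		pause();	/* keep the VMAs alive while khugepaged runs */
		return 0;
	}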
> --
> Cheers
>
> David
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-18 9:29 ` David Hildenbrand (Red Hat)
2025-12-19 5:24 ` Vernon Yang
@ 2025-12-19 8:35 ` Vernon Yang
2025-12-19 8:55 ` David Hildenbrand (Red Hat)
2025-12-23 11:18 ` Dev Jain
1 sibling, 2 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-19 8:35 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On Thu, Dec 18, 2025 at 10:29:18AM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/15/25 10:04, Vernon Yang wrote:
> > The following data is traced by bpftrace on a desktop system. After
> > the system has been left idle for 10 minutes upon booting, a lot of
> > SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
> > khugepaged.
> >
> > @scan_pmd_status[1]: 1 ## SCAN_SUCCEED
> > @scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
> > @scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
> > total progress size: 701 MB
> > Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
> >
> > The khugepaged_scan list saves all tasks that support collapsing into
> > hugepages; as long as a task is not destroyed, khugepaged will not remove
> > it from the khugepaged_scan list. This leads to a phenomenon where a task
> > has already collapsed all of its memory regions into hugepages, but
> > khugepaged continues to scan it, which wastes CPU time for no benefit,
> > and because of khugepaged_scan_sleep_millisecs (default 10s) scanning a
> > large number of such stale tasks makes the scans that could actually
> > collapse something happen much later.
> >
> > After applying this patch, when all memory is either SCAN_PMD_MAPPED or
> > SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
> > list. If a page fault occurs or MADV_HUGEPAGE is called again, the mm is
> > added back to khugepaged.
>
> I don't like that, as it assumes that memory within such a process would be
> rather static, which is easily not the case (e.g., allocators just doing
> MADV_DONTNEED to free memory).
>
> If most stuff is collapsed to PMDs already, can't we just skip over these
> regions a bit faster?
I had a flash of inspiration and came up with an idea.
If these regions have already been collapsed into hugepages, rechecking
them is very fast. Since khugepaged_pages_to_scan can also
represent the number of VMAs to skip, we can extend its semantics as
follows:
/*
 * default: scan 8*HPAGE_PMD_NR ptes (or that many already-mapped PMDs,
 * missing PTE tables, or VMAs) every 10 seconds.
 */
static unsigned int khugepaged_pages_to_scan __read_mostly;

switch (*result) {
case SCAN_NO_PTE_TABLE:
case SCAN_PMD_MAPPED:
case SCAN_PTE_MAPPED_HUGEPAGE:
	/* already huge or nothing to scan: charge a single unit */
	progress++;
	break;
case SCAN_SUCCEED:
	++khugepaged_pages_collapsed;
	fallthrough;
default:
	/* a real PTE-table scan: charge a full PMD's worth of pages */
	progress += HPAGE_PMD_NR;
}
This way we can achieve our goal. David, do you like it?
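To put rough numbers on it (my arithmetic, assuming x86-64 with 2 MiB
huge pages, so HPAGE_PMD_NR = 512 and the default budget above is
8 * 512 = 4096 units): today an already-collapsed PMD still costs
HPAGE_PMD_NR units of progress, so one wakeup skips at most 8 such PMDs
(16 MiB) before the next sleep; with the accounting above it costs one
unit, so one wakeup can skip up to 4096 already-collapsed PMDs, about
8 GiB of address space.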
> --
> Cheers
>
> David
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-19 8:35 ` Vernon Yang
@ 2025-12-19 8:55 ` David Hildenbrand (Red Hat)
2025-12-23 11:18 ` Dev Jain
1 sibling, 0 replies; 42+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-19 8:55 UTC (permalink / raw)
To: Vernon Yang
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 12/19/25 09:35, Vernon Yang wrote:
> On Thu, Dec 18, 2025 at 10:29:18AM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/15/25 10:04, Vernon Yang wrote:
>>> The following data is traced by bpftrace on a desktop system. After
>>> the system has been left idle for 10 minutes upon booting, a lot of
>>> SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
>>> khugepaged.
>>>
>>> @scan_pmd_status[1]: 1 ## SCAN_SUCCEED
>>> @scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
>>> @scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
>>> total progress size: 701 MB
>>> Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
>>>
>>> The khugepaged_scan list save all task that support collapse into hugepage,
>>> as long as the take is not destroyed, khugepaged will not remove it from
>>> the khugepaged_scan list. This exist a phenomenon where task has already
>>> collapsed all memory regions into hugepage, but khugepaged continues to
>>> scan it, which wastes CPU time and invalid, and due to
>>> khugepaged_scan_sleep_millisecs (default 10s) causes a long wait for
>>> scanning a large number of invalid task, so scanning really valid task
>>> is later.
>>>
>>> After applying this patch, when all memory is either SCAN_PMD_MAPPED or
>>> SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
>>> list. If the page fault or MADV_HUGEPAGE again, it is added back to
>>> khugepaged.
>>
>> I don't like that, as it assumes that memory within such a process would be
>> rather static, which is easily not the case (e.g., allocators just doing
>> MADV_DONTNEED to free memory).
>>
>> If most stuff is collapsed to PMDs already, can't we just skip over these
>> regions a bit faster?
>
> I had a flash of inspiration and came up with an idea.
>
> If these regions have already been collapsed into hugepages, rechecking
> them is very fast. Since khugepaged_pages_to_scan can also
> represent the number of VMAs to skip, we can extend its semantics as
> follows:
>
> /*
> * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
> * every 10 seconds.
> */
> static unsigned int khugepaged_pages_to_scan __read_mostly;
>
> switch (*result) {
> case SCAN_NO_PTE_TABLE:
> case SCAN_PMD_MAPPED:
> case SCAN_PTE_MAPPED_HUGEPAGE:
> progress++; // here
> break;
> case SCAN_SUCCEED:
> ++khugepaged_pages_collapsed;
> fallthrough;
> default:
> progress += HPAGE_PMD_NR;
> }
>
> This way we can achieve our goal. David, do you like it?
I'd have to see the full patch, but IMHO we should rather focus on
"how many pte/pmd entries did we check" and not "how many PMD areas did
we check".
Maybe there is a history to this, but conceptually I think we wanted to
limit the work we do in one operation to something reasonable. Reading a
single PMD is obviously faster than 512 PTEs.
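(For scale: on x86-64 a PMD entry is a single 8-byte read, while a
fully populated PTE table behind it holds 512 entries, a whole 4 KiB
page of page-table data, so a budget counted in entries examined tracks
the real work much more closely than counting PMD areas does.)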
--
Cheers
David
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-19 5:29 ` Vernon Yang
@ 2025-12-19 8:58 ` David Hildenbrand (Red Hat)
2025-12-21 2:10 ` Wei Yang
0 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-19 8:58 UTC (permalink / raw)
To: Vernon Yang
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 12/19/25 06:29, Vernon Yang wrote:
> On Thu, Dec 18, 2025 at 10:31:58AM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/15/25 10:04, Vernon Yang wrote:
>>> For example, create three tasks: hot1 -> cold -> hot2. After all three
>>> tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
>>> continuously access their 128 MB of memory, while the cold task only
>>> accesses its memory briefly and then calls madvise(MADV_COLD). However,
>>> khugepaged still prioritizes scanning the cold task and only scans the
>>> hot2 task after it has finished scanning the cold task.
>>>
>>> So if the user has explicitly informed us via MADV_COLD/FREE that this
>>> memory is cold or will be freed, it is appropriate for khugepaged to
>>> scan it only at the latest possible moment, thereby avoiding unnecessary
>>> scan and collapse operations and reducing CPU waste.
>>>
>>> Here are the performance test results:
>>> (Throughput bigger is better, other smaller is better)
>>>
>>> Testing on x86_64 machine:
>>>
>>> | task hot2 | without patch | with patch | delta |
>>> |---------------------|---------------|---------------|---------|
>>> | total accesses time | 3.14 sec | 2.92 sec | -7.01% |
>>> | cycles per access | 4.91 | 2.07 | -57.84% |
>>> | Throughput | 104.38 M/sec | 112.12 M/sec | +7.42% |
>>> | dTLB-load-misses | 288966432 | 1292908 | -99.55% |
>>>
>>> Testing on qemu-system-x86_64 -enable-kvm:
>>>
>>> | task hot2 | without patch | with patch | delta |
>>> |---------------------|---------------|---------------|---------|
>>> | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
>>> | cycles per access | 7.23 | 2.12 | -70.68% |
>>> | Throughput | 97.88 M/sec | 110.76 M/sec | +13.16% |
>>> | dTLB-load-misses | 237406497 | 3189194 | -98.66% |
>>
>> Again, I also don't like that because you make assumptions on a full process
>> based on some part of its address space.
>>
>> E.g., if a library issues a MADV_COLD on some part of the memory the library
>> manages, why should the remaining part of the process suffer as well?
>
> Yes, you make a good point, thanks!
>
>> This seems to be a heuristic focused on some specific workloads, no?
>
> Right.
>
> Could we use the VM_NOHUGEPAGE flag to indicate that this region should
> not be collapsed, so that khugepaged can simply skip this VMA during
> scanning? This way, it won't affect the remaining part of the task's
> memory regions.
I thought we would skip these regions already properly in khugepaged, or
maybe I misunderstood your question.
--
Cheers
David
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-19 5:24 ` Vernon Yang
@ 2025-12-19 9:00 ` David Hildenbrand (Red Hat)
0 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-19 9:00 UTC (permalink / raw)
To: Vernon Yang
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 12/19/25 06:24, Vernon Yang wrote:
> On Thu, Dec 18, 2025 at 10:29:18AM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/15/25 10:04, Vernon Yang wrote:
>>> The following data is traced by bpftrace on a desktop system. After
>>> the system has been left idle for 10 minutes upon booting, a lot of
>>> SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
>>> khugepaged.
>>>
>>> @scan_pmd_status[1]: 1 ## SCAN_SUCCEED
>>> @scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
>>> @scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
>>> total progress size: 701 MB
>>> Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
>>>
>>> The khugepaged_scan list saves all tasks that support collapsing into
>>> hugepages; as long as a task is not destroyed, khugepaged will not remove
>>> it from the khugepaged_scan list. This leads to a phenomenon where a task
>>> has already collapsed all of its memory regions into hugepages, but
>>> khugepaged continues to scan it, which wastes CPU time for no benefit,
>>> and because of khugepaged_scan_sleep_millisecs (default 10s) scanning a
>>> large number of such stale tasks makes the scans that could actually
>>> collapse something happen much later.
>>>
>>> After applying this patch, when all memory is either SCAN_PMD_MAPPED or
>>> SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
>>> list. If a page fault occurs or MADV_HUGEPAGE is called again, the mm is
>>> added back to khugepaged.
>>
>> I don't like that, as it assumes that memory within such a process would be
>> rather static, which is easily not the case (e.g., allocators just doing
>> MADV_DONTNEED to free memory).
>>
>> If most stuff is collapsed to PMDs already, can't we just skip over these
>> regions a bit faster?
>
> /* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 seconds */
> static unsigned int khugepaged_pages_to_scan __read_mostly;
>
> The observed phenomenon is that when scanning these regions, the loop is
> broken upon reaching khugepaged_pages_to_scan, and therefore
> khugepaged enters its 10s sleep.
BTW, the 10s sleep is ridiculous :)
I wonder whether we were more careful in the past regarding scanning
overhead due to the mmap read lock. Nowadays page faults typically use
per-vma locks, so I wonder whether the scanning overhead is still a
problem. (I assume there is more to optimize long-term.)
--
Cheers
David
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-19 8:58 ` David Hildenbrand (Red Hat)
@ 2025-12-21 2:10 ` Wei Yang
2025-12-21 4:25 ` Vernon Yang
0 siblings, 1 reply; 42+ messages in thread
From: Wei Yang @ 2025-12-21 2:10 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: Vernon Yang, akpm, lorenzo.stoakes, ziy, baohua, lance.yang,
linux-mm, linux-kernel, Vernon Yang
On Fri, Dec 19, 2025 at 09:58:17AM +0100, David Hildenbrand (Red Hat) wrote:
>On 12/19/25 06:29, Vernon Yang wrote:
>> On Thu, Dec 18, 2025 at 10:31:58AM +0100, David Hildenbrand (Red Hat) wrote:
>> > On 12/15/25 10:04, Vernon Yang wrote:
>> > > For example, create three tasks: hot1 -> cold -> hot2. After all three
>> > > tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
>> > > continuously access their 128 MB of memory, while the cold task only
>> > > accesses its memory briefly and then calls madvise(MADV_COLD). However,
>> > > khugepaged still prioritizes scanning the cold task and only scans the
>> > > hot2 task after it has finished scanning the cold task.
>> > >
>> > > So if the user has explicitly informed us via MADV_COLD/FREE that this
>> > > memory is cold or will be freed, it is appropriate for khugepaged to
>> > > scan it only at the latest possible moment, thereby avoiding unnecessary
>> > > scan and collapse operations and reducing CPU waste.
>> > >
>> > > Here are the performance test results:
>> > > (Throughput bigger is better, other smaller is better)
>> > >
>> > > Testing on x86_64 machine:
>> > >
>> > > | task hot2 | without patch | with patch | delta |
>> > > |---------------------|---------------|---------------|---------|
>> > > | total accesses time | 3.14 sec | 2.92 sec | -7.01% |
>> > > | cycles per access | 4.91 | 2.07 | -57.84% |
>> > > | Throughput | 104.38 M/sec | 112.12 M/sec | +7.42% |
>> > > | dTLB-load-misses | 288966432 | 1292908 | -99.55% |
>> > >
>> > > Testing on qemu-system-x86_64 -enable-kvm:
>> > >
>> > > | task hot2 | without patch | with patch | delta |
>> > > |---------------------|---------------|---------------|---------|
>> > > | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
>> > > | cycles per access | 7.23 | 2.12 | -70.68% |
>> > > | Throughput | 97.88 M/sec | 110.76 M/sec | +13.16% |
>> > > | dTLB-load-misses | 237406497 | 3189194 | -98.66% |
>> >
>> > Again, I also don't like that because you make assumptions on a full process
>> > based on some part of its address space.
>> >
>> > E.g., if a library issues a MADV_COLD on some part of the memory the library
>> > manages, why should the remaining part of the process suffer as well?
>>
>> Yes, you make a good point, thanks!
>>
>> > This seems to be a heuristic focused on some specific workloads, no?
>>
>> Right.
>>
>> Could we use the VM_NOHUGEPAGE flag to indicate that this region should
>> not be collapsed, so that khugepaged can simply skip this VMA during
>> scanning? This way, it won't affect the remaining part of the task's
>> memory regions.
>
>I thought we would skip these regions already properly in khugepaged, or
>maybe I misunderstood your question.
>
I think we should, but it seems we didn't do this for anonymous memory during
khugepaged.
We check the vma with thp_vma_allowable_order() during scan.
* For anonymous memory during khugepaged, if we always enable 2M collapse,
we will scan this vma even if VM_NOHUGEPAGE is set.
* For other cases, it looks good since __thp_vma_allowable_order() will skip
this vma with vma_thp_disabled().
>--
>Cheers
>
>David
--
Wei Yang
Help you, Help me
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-21 2:10 ` Wei Yang
@ 2025-12-21 4:25 ` Vernon Yang
2025-12-21 9:24 ` David Hildenbrand (Red Hat)
2025-12-21 12:38 ` Wei Yang
0 siblings, 2 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-21 4:25 UTC (permalink / raw)
To: Wei Yang, David Hildenbrand (Red Hat)
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On Sun, Dec 21, 2025 at 02:10:44AM +0000, Wei Yang wrote:
> On Fri, Dec 19, 2025 at 09:58:17AM +0100, David Hildenbrand (Red Hat) wrote:
> >On 12/19/25 06:29, Vernon Yang wrote:
> >> On Thu, Dec 18, 2025 at 10:31:58AM +0100, David Hildenbrand (Red Hat) wrote:
> >> > On 12/15/25 10:04, Vernon Yang wrote:
> >> > > For example, create three tasks: hot1 -> cold -> hot2. After all three
> >> > > tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
> >> > > continuously access their 128 MB of memory, while the cold task only
> >> > > accesses its memory briefly and then calls madvise(MADV_COLD). However,
> >> > > khugepaged still prioritizes scanning the cold task and only scans the
> >> > > hot2 task after it has finished scanning the cold task.
> >> > >
> >> > > So if the user has explicitly informed us via MADV_COLD/FREE that this
> >> > > memory is cold or will be freed, it is appropriate for khugepaged to
> >> > > scan it only at the latest possible moment, thereby avoiding unnecessary
> >> > > scan and collapse operations and reducing CPU waste.
> >> > >
> >> > > Here are the performance test results:
> >> > > (Throughput bigger is better, other smaller is better)
> >> > >
> >> > > Testing on x86_64 machine:
> >> > >
> >> > > | task hot2 | without patch | with patch | delta |
> >> > > |---------------------|---------------|---------------|---------|
> >> > > | total accesses time | 3.14 sec | 2.92 sec | -7.01% |
> >> > > | cycles per access | 4.91 | 2.07 | -57.84% |
> >> > > | Throughput | 104.38 M/sec | 112.12 M/sec | +7.42% |
> >> > > | dTLB-load-misses | 288966432 | 1292908 | -99.55% |
> >> > >
> >> > > Testing on qemu-system-x86_64 -enable-kvm:
> >> > >
> >> > > | task hot2 | without patch | with patch | delta |
> >> > > |---------------------|---------------|---------------|---------|
> >> > > | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
> >> > > | cycles per access | 7.23 | 2.12 | -70.68% |
> >> > > | Throughput | 97.88 M/sec | 110.76 M/sec | +13.16% |
> >> > > | dTLB-load-misses | 237406497 | 3189194 | -98.66% |
> >> >
> >> > Again, I also don't like that because you make assumptions on a full process
> >> > based on some part of its address space.
> >> >
> >> > E.g., if a library issues a MADV_COLD on some part of the memory the library
> >> > manages, why should the remaining part of the process suffer as well?
> >>
> >> Yes, you make a good point, thanks!
> >>
> >> > This seems to be a heuristic focused on some specific workloads, no?
> >>
> >> Right.
> >>
> >> Could we use the VM_NOHUGEPAGE flag to indicate that this region should
> >> not be collapsed, so that khugepaged can simply skip this VMA during
> >> scanning? This way, it won't affect the remaining part of the task's
> >> memory regions.
> >
> >I thought we would skip these regions already properly in khugepaged, or
> >maybe I misunderstood your question.
> >
>
> I think we should, but it seems we didn't do this for anonymous memory during
> khugepaged.
>
> We check the vma with thp_vma_allowable_order() during scan.
>
> * For anonymous memory during khugepaged, if we always enable 2M collapse,
> we will scan this vma even if VM_NOHUGEPAGE is set.
>
> * For other cases, it looks good since __thp_vma_allowable_order() will skip
> this vma with vma_thp_disabled().
Hi David, Wei,
khugepaged already checks the VM_NOHUGEPAGE flag for anonymous
memory during the scan, as below:
khugepaged_scan_mm_slot()
thp_vma_allowable_order()
thp_vma_allowable_orders()
__thp_vma_allowable_orders()
vma_thp_disabled() {
if (vm_flags & VM_NOHUGEPAGE)
return true;
}
REAL ISSUE: madvise(MADV_COLD) does not set the VM_NOHUGEPAGE flag on the
vma, so khugepaged will continue to scan this vma.
I set the VM_NOHUGEPAGE flag on the vma in madvise(MADV_COLD), and the test
was successful. I will send it in the next version.
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-21 4:25 ` Vernon Yang
@ 2025-12-21 9:24 ` David Hildenbrand (Red Hat)
2025-12-21 12:34 ` Vernon Yang
2025-12-21 12:38 ` Wei Yang
1 sibling, 1 reply; 42+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-21 9:24 UTC (permalink / raw)
To: Vernon Yang, Wei Yang
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 12/21/25 05:25, Vernon Yang wrote:
> On Sun, Dec 21, 2025 at 02:10:44AM +0000, Wei Yang wrote:
>> On Fri, Dec 19, 2025 at 09:58:17AM +0100, David Hildenbrand (Red Hat) wrote:
>>> On 12/19/25 06:29, Vernon Yang wrote:
>>>> On Thu, Dec 18, 2025 at 10:31:58AM +0100, David Hildenbrand (Red Hat) wrote:
>>>>> On 12/15/25 10:04, Vernon Yang wrote:
>>>>>> For example, create three tasks: hot1 -> cold -> hot2. After all three
>>>>>> tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
>>>>>> continuously access their 128 MB of memory, while the cold task only
>>>>>> accesses its memory briefly and then calls madvise(MADV_COLD). However,
>>>>>> khugepaged still prioritizes scanning the cold task and only scans the
>>>>>> hot2 task after it has finished scanning the cold task.
>>>>>>
>>>>>> So if the user has explicitly informed us via MADV_COLD/FREE that this
>>>>>> memory is cold or will be freed, it is appropriate for khugepaged to
>>>>>> scan it only at the latest possible moment, thereby avoiding unnecessary
>>>>>> scan and collapse operations and reducing CPU waste.
>>>>>>
>>>>>> Here are the performance test results:
>>>>>> (Throughput bigger is better, other smaller is better)
>>>>>>
>>>>>> Testing on x86_64 machine:
>>>>>>
>>>>>> | task hot2 | without patch | with patch | delta |
>>>>>> |---------------------|---------------|---------------|---------|
>>>>>> | total accesses time | 3.14 sec | 2.92 sec | -7.01% |
>>>>>> | cycles per access | 4.91 | 2.07 | -57.84% |
>>>>>> | Throughput | 104.38 M/sec | 112.12 M/sec | +7.42% |
>>>>>> | dTLB-load-misses | 288966432 | 1292908 | -99.55% |
>>>>>>
>>>>>> Testing on qemu-system-x86_64 -enable-kvm:
>>>>>>
>>>>>> | task hot2 | without patch | with patch | delta |
>>>>>> |---------------------|---------------|---------------|---------|
>>>>>> | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
>>>>>> | cycles per access | 7.23 | 2.12 | -70.68% |
>>>>>> | Throughput | 97.88 M/sec | 110.76 M/sec | +13.16% |
>>>>>> | dTLB-load-misses | 237406497 | 3189194 | -98.66% |
>>>>>
>>>>> Again, I also don't like that because you make assumptions on a full process
>>>>> based on some part of its address space.
>>>>>
>>>>> E.g., if a library issues a MADV_COLD on some part of the memory the library
>>>>> manages, why should the remaining part of the process suffer as well?
>>>>
>>>> Yes, you make a good point, thanks!
>>>>
>>>>> This seems to be a heuristic focused on some specific workloads, no?
>>>>
>>>> Right.
>>>>
>>>> Could we use the VM_NOHUGEPAGE flag to indicate that this region should
>>>> not be collapsed, so that khugepaged can simply skip this VMA during
>>>> scanning? This way, it won't affect the remaining part of the task's
>>>> memory regions.
>>>
>>> I thought we would skip these regions already properly in khugepaged, or
>>> maybe I misunderstood your question.
>>>
>>
>> I think we should, but it seems we didn't do this for anonymous memory during
>> khugepaged.
>>
>> We check the vma with thp_vma_allowable_order() during scan.
>>
>> * For anonymous memory during khugepaged, if we always enable 2M collapse,
>> we will scan this vma even if VM_NOHUGEPAGE is set.
>>
>> * For other cases, it looks good since __thp_vma_allowable_order() will skip
>> this vma with vma_thp_disabled().
>
> Hi David, Wei,
>
> khugepaged already checks the VM_NOHUGEPAGE flag for anonymous
> memory during the scan, as below:
>
> khugepaged_scan_mm_slot()
> thp_vma_allowable_order()
> thp_vma_allowable_orders()
> __thp_vma_allowable_orders()
> vma_thp_disabled() {
> if (vm_flags & VM_NOHUGEPAGE)
> return true;
> }
>
> REAL ISSUE: madvise(MADV_COLD) does not set the VM_NOHUGEPAGE flag on the
> vma, so khugepaged will continue to scan this vma.
>
> I set the VM_NOHUGEPAGE flag on the vma in madvise(MADV_COLD), and the test
> was successful. I will send it in the next version.
No we must not do that. That's a user-space visible change. :/
--
Cheers
David
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-21 9:24 ` David Hildenbrand (Red Hat)
@ 2025-12-21 12:34 ` Vernon Yang
2025-12-23 9:59 ` David Hildenbrand (Red Hat)
0 siblings, 1 reply; 42+ messages in thread
From: Vernon Yang @ 2025-12-21 12:34 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: Wei Yang, akpm, lorenzo.stoakes, ziy, baohua, lance.yang,
linux-mm, linux-kernel, Vernon Yang
On Sun, Dec 21, 2025 at 10:24:11AM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/21/25 05:25, Vernon Yang wrote:
> > On Sun, Dec 21, 2025 at 02:10:44AM +0000, Wei Yang wrote:
> > > On Fri, Dec 19, 2025 at 09:58:17AM +0100, David Hildenbrand (Red Hat) wrote:
> > > > On 12/19/25 06:29, Vernon Yang wrote:
> > > > > On Thu, Dec 18, 2025 at 10:31:58AM +0100, David Hildenbrand (Red Hat) wrote:
> > > > > > On 12/15/25 10:04, Vernon Yang wrote:
> > > > > > > For example, create three tasks: hot1 -> cold -> hot2. After all three
> > > > > > > tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
> > > > > > > continuously access their 128 MB of memory, while the cold task only
> > > > > > > accesses its memory briefly and then calls madvise(MADV_COLD). However,
> > > > > > > khugepaged still prioritizes scanning the cold task and only scans the
> > > > > > > hot2 task after it has finished scanning the cold task.
> > > > > > >
> > > > > > > So if the user has explicitly informed us via MADV_COLD/FREE that this
> > > > > > > memory is cold or will be freed, it is appropriate for khugepaged to
> > > > > > > scan it only at the latest possible moment, thereby avoiding unnecessary
> > > > > > > scan and collapse operations and reducing CPU waste.
> > > > > > >
> > > > > > > Here are the performance test results:
> > > > > > > (Throughput bigger is better, other smaller is better)
> > > > > > >
> > > > > > > Testing on x86_64 machine:
> > > > > > >
> > > > > > > | task hot2 | without patch | with patch | delta |
> > > > > > > |---------------------|---------------|---------------|---------|
> > > > > > > | total accesses time | 3.14 sec | 2.92 sec | -7.01% |
> > > > > > > | cycles per access | 4.91 | 2.07 | -57.84% |
> > > > > > > | Throughput | 104.38 M/sec | 112.12 M/sec | +7.42% |
> > > > > > > | dTLB-load-misses | 288966432 | 1292908 | -99.55% |
> > > > > > >
> > > > > > > Testing on qemu-system-x86_64 -enable-kvm:
> > > > > > >
> > > > > > > | task hot2 | without patch | with patch | delta |
> > > > > > > |---------------------|---------------|---------------|---------|
> > > > > > > | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
> > > > > > > | cycles per access | 7.23 | 2.12 | -70.68% |
> > > > > > > | Throughput | 97.88 M/sec | 110.76 M/sec | +13.16% |
> > > > > > > | dTLB-load-misses | 237406497 | 3189194 | -98.66% |
> > > > > >
> > > > > > Again, I also don't like that because you make assumptions on a full process
> > > > > > based on some part of its address space.
> > > > > >
> > > > > > E.g., if a library issues a MADV_COLD on some part of the memory the library
> > > > > > manages, why should the remaining part of the process suffer as well?
> > > > >
> > > > > Yes, you make a good point, thanks!
> > > > >
> > > > > > This seems to be a heuristic focused on some specific workloads, no?
> > > > >
> > > > > Right.
> > > > >
> > > > > Could we use the VM_NOHUGEPAGE flag to indicate that this region should
> > > > > not be collapsed, so that khugepaged can simply skip this VMA during
> > > > > scanning? This way, it won't affect the remaining part of the task's
> > > > > memory regions.
> > > >
> > > > I thought we would skip these regions already properly in khugepaged, or
> > > > maybe I misunderstood your question.
> > > >
> > >
> > > I think we should, but it seems we didn't do this for anonymous memory during
> > > khugepaged.
> > >
> > > We check the vma with thp_vma_allowable_order() during scan.
> > >
> > > * For anonymous memory during khugepaged, if we always enable 2M collapse,
> > > we will scan this vma even if VM_NOHUGEPAGE is set.
> > >
> > > * For other cases, it looks good since __thp_vma_allowable_order() will skip
> > > this vma with vma_thp_disabled().
> >
> > Hi David, Wei,
> >
> > khugepaged already checks the VM_NOHUGEPAGE flag for anonymous
> > memory during the scan, as below:
> >
> > khugepaged_scan_mm_slot()
> > thp_vma_allowable_order()
> > thp_vma_allowable_orders()
> > __thp_vma_allowable_orders()
> > vma_thp_disabled() {
> > if (vm_flags & VM_NOHUGEPAGE)
> > return true;
> > }
> >
> > REAL ISSUE: madvise(MADV_COLD) does not set the VM_NOHUGEPAGE flag on the
> > vma, so khugepaged will continue to scan this vma.
> >
> > I set the VM_NOHUGEPAGE flag on the vma in madvise(MADV_COLD), and the test
> > was successful. I will send it in the next version.
>
> No we must not do that. That's a user-space visible change. :/
David, what ideas do you have for achieving this goal? Please let me
know, thanks!
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-21 4:25 ` Vernon Yang
2025-12-21 9:24 ` David Hildenbrand (Red Hat)
@ 2025-12-21 12:38 ` Wei Yang
1 sibling, 0 replies; 42+ messages in thread
From: Wei Yang @ 2025-12-21 12:38 UTC (permalink / raw)
To: Vernon Yang
Cc: Wei Yang, David Hildenbrand (Red Hat),
akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On Sun, Dec 21, 2025 at 12:25:44PM +0800, Vernon Yang wrote:
>On Sun, Dec 21, 2025 at 02:10:44AM +0000, Wei Yang wrote:
>> On Fri, Dec 19, 2025 at 09:58:17AM +0100, David Hildenbrand (Red Hat) wrote:
>> >On 12/19/25 06:29, Vernon Yang wrote:
>> >> On Thu, Dec 18, 2025 at 10:31:58AM +0100, David Hildenbrand (Red Hat) wrote:
>> >> > On 12/15/25 10:04, Vernon Yang wrote:
>> >> > > For example, create three tasks: hot1 -> cold -> hot2. After all three
>> >> > > tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
>> >> > > continuously access their 128 MB of memory, while the cold task only
>> >> > > accesses its memory briefly and then calls madvise(MADV_COLD). However,
>> >> > > khugepaged still prioritizes scanning the cold task and only scans the
>> >> > > hot2 task after it has finished scanning the cold task.
>> >> > >
>> >> > > So if the user has explicitly informed us via MADV_COLD/FREE that this
>> >> > > memory is cold or will be freed, it is appropriate for khugepaged to
>> >> > > scan it only at the latest possible moment, thereby avoiding unnecessary
>> >> > > scan and collapse operations and reducing CPU waste.
>> >> > >
>> >> > > Here are the performance test results:
>> >> > > (Throughput bigger is better, other smaller is better)
>> >> > >
>> >> > > Testing on x86_64 machine:
>> >> > >
>> >> > > | task hot2 | without patch | with patch | delta |
>> >> > > |---------------------|---------------|---------------|---------|
>> >> > > | total accesses time | 3.14 sec | 2.92 sec | -7.01% |
>> >> > > | cycles per access | 4.91 | 2.07 | -57.84% |
>> >> > > | Throughput | 104.38 M/sec | 112.12 M/sec | +7.42% |
>> >> > > | dTLB-load-misses | 288966432 | 1292908 | -99.55% |
>> >> > >
>> >> > > Testing on qemu-system-x86_64 -enable-kvm:
>> >> > >
>> >> > > | task hot2 | without patch | with patch | delta |
>> >> > > |---------------------|---------------|---------------|---------|
>> >> > > | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
>> >> > > | cycles per access | 7.23 | 2.12 | -70.68% |
>> >> > > | Throughput | 97.88 M/sec | 110.76 M/sec | +13.16% |
>> >> > > | dTLB-load-misses | 237406497 | 3189194 | -98.66% |
>> >> >
>> >> > Again, I also don't like that because you make assumptions on a full process
>> >> > based on some part of its address space.
>> >> >
>> >> > E.g., if a library issues a MADV_COLD on some part of the memory the library
>> >> > manages, why should the remaining part of the process suffer as well?
>> >>
>> >> Yes, you make a good point, thanks!
>> >>
>> >> > This seems to be a heuristic focused on some specific workloads, no?
>> >>
>> >> Right.
>> >>
>> >> Could we use the VM_NOHUGEPAGE flag to indicate that this region should
>> >> not be collapsed, so that khugepaged can simply skip this VMA during
>> >> scanning? This way, it won't affect the remaining part of the task's
>> >> memory regions.
>> >
>> >I thought we would skip these regions already properly in khugepaged, or
>> >maybe I misunderstood your question.
>> >
>>
>> I think we should, but it seems we didn't do this for anonymous memory during
>> khugepaged.
>>
>> We check the vma with thp_vma_allowable_order() during scan.
>>
>> * For anonymous memory during khugepaged, if we always enable 2M collapse,
>> we will scan this vma even if VM_NOHUGEPAGE is set.
>>
>> * For other cases, it looks good since __thp_vma_allowable_order() will skip
>> this vma with vma_thp_disabled().
>
>Hi David, Wei,
>
>khugepaged already checks the VM_NOHUGEPAGE flag for anonymous
>memory during the scan, as below:
>
>khugepaged_scan_mm_slot()
> thp_vma_allowable_order()
> thp_vma_allowable_orders()
Oops, you are right. It only bypasses __thp_vma_allowable_order() if
orders is 0.
> __thp_vma_allowable_orders()
> vma_thp_disabled() {
> if (vm_flags & VM_NOHUGEPAGE)
> return true;
> }
>
>REAL ISSUE: madvise(MADV_COLD) does not set the VM_NOHUGEPAGE flag on the
>vma, so khugepaged will continue to scan this vma.
>
>I set the VM_NOHUGEPAGE flag on the vma in madvise(MADV_COLD), and the test
>was successful. I will send it in the next version.
>
>--
>Thanks,
>Vernon
--
Wei Yang
Help you, Help me
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-15 9:04 ` [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed Vernon Yang
` (4 preceding siblings ...)
2025-12-18 9:29 ` David Hildenbrand (Red Hat)
@ 2025-12-22 19:00 ` kernel test robot
5 siblings, 0 replies; 42+ messages in thread
From: kernel test robot @ 2025-12-22 19:00 UTC (permalink / raw)
To: Vernon Yang, akpm, david, lorenzo.stoakes
Cc: oe-kbuild-all, ziy, npache, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
Hi Vernon,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.19-rc2 next-20251219]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Vernon-Yang/mm-khugepaged-add-trace_mm_khugepaged_scan-event/20251215-171046
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20251215090419.174418-3-yanglincheng%40kylinos.cn
patch subject: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
config: x86_64-rhel-9.4 (https://download.01.org/0day-ci/archive/20251222/202512221928.EnLvUgqT-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251222/202512221928.EnLvUgqT-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512221928.EnLvUgqT-lkp@intel.com/
All errors (new ones prefixed by >>):
mm/khugepaged.c: In function 'khugepaged_scan_mm_slot':
>> mm/khugepaged.c:2490:30: error: 'SCAN_PMD_NULL' undeclared (first use in this function); did you mean 'SCAN_VMA_NULL'?
2490 | case SCAN_PMD_NULL:
| ^~~~~~~~~~~~~
| SCAN_VMA_NULL
mm/khugepaged.c:2490:30: note: each undeclared identifier is reported only once for each function it appears in
>> mm/khugepaged.c:2491:30: error: 'SCAN_PMD_NONE' undeclared (first use in this function)
2491 | case SCAN_PMD_NONE:
| ^~~~~~~~~~~~~
vim +2490 mm/khugepaged.c
2392
2393 static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
2394 struct collapse_control *cc)
2395 __releases(&khugepaged_mm_lock)
2396 __acquires(&khugepaged_mm_lock)
2397 {
2398 struct vma_iterator vmi;
2399 struct mm_slot *slot;
2400 struct mm_struct *mm;
2401 struct vm_area_struct *vma;
2402 int progress = 0;
2403
2404 VM_BUG_ON(!pages);
2405 lockdep_assert_held(&khugepaged_mm_lock);
2406 *result = SCAN_FAIL;
2407
2408 if (khugepaged_scan.mm_slot) {
2409 slot = khugepaged_scan.mm_slot;
2410 } else {
2411 slot = list_first_entry(&khugepaged_scan.mm_head,
2412 struct mm_slot, mm_node);
2413 khugepaged_scan.address = 0;
2414 khugepaged_scan.mm_slot = slot;
2415 khugepaged_scan.maybe_collapse = false;
2416 }
2417 spin_unlock(&khugepaged_mm_lock);
2418
2419 mm = slot->mm;
2420 /*
2421 * Don't wait for semaphore (to avoid long wait times). Just move to
2422 * the next mm on the list.
2423 */
2424 vma = NULL;
2425 if (unlikely(!mmap_read_trylock(mm)))
2426 goto breakouterloop_mmap_lock;
2427
2428 progress++;
2429 if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
2430 goto breakouterloop;
2431
2432 vma_iter_init(&vmi, mm, khugepaged_scan.address);
2433 for_each_vma(vmi, vma) {
2434 unsigned long hstart, hend;
2435
2436 cond_resched();
2437 if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
2438 progress++;
2439 break;
2440 }
2441 if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
2442 skip:
2443 progress++;
2444 continue;
2445 }
2446 hstart = round_up(vma->vm_start, HPAGE_PMD_SIZE);
2447 hend = round_down(vma->vm_end, HPAGE_PMD_SIZE);
2448 if (khugepaged_scan.address > hend)
2449 goto skip;
2450 if (khugepaged_scan.address < hstart)
2451 khugepaged_scan.address = hstart;
2452 VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
2453
2454 while (khugepaged_scan.address < hend) {
2455 bool mmap_locked = true;
2456
2457 cond_resched();
2458 if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
2459 goto breakouterloop;
2460
2461 VM_BUG_ON(khugepaged_scan.address < hstart ||
2462 khugepaged_scan.address + HPAGE_PMD_SIZE >
2463 hend);
2464 if (!vma_is_anonymous(vma)) {
2465 struct file *file = get_file(vma->vm_file);
2466 pgoff_t pgoff = linear_page_index(vma,
2467 khugepaged_scan.address);
2468
2469 mmap_read_unlock(mm);
2470 mmap_locked = false;
2471 *result = hpage_collapse_scan_file(mm,
2472 khugepaged_scan.address, file, pgoff, cc);
2473 fput(file);
2474 if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
2475 mmap_read_lock(mm);
2476 if (hpage_collapse_test_exit_or_disable(mm))
2477 goto breakouterloop;
2478 *result = collapse_pte_mapped_thp(mm,
2479 khugepaged_scan.address, false);
2480 if (*result == SCAN_PMD_MAPPED)
2481 *result = SCAN_SUCCEED;
2482 mmap_read_unlock(mm);
2483 }
2484 } else {
2485 *result = hpage_collapse_scan_pmd(mm, vma,
2486 khugepaged_scan.address, &mmap_locked, cc);
2487 }
2488
2489 switch (*result) {
> 2490 case SCAN_PMD_NULL:
> 2491 case SCAN_PMD_NONE:
2492 case SCAN_PMD_MAPPED:
2493 case SCAN_PTE_MAPPED_HUGEPAGE:
2494 break;
2495 case SCAN_SUCCEED:
2496 ++khugepaged_pages_collapsed;
2497 fallthrough;
2498 default:
2499 khugepaged_scan.maybe_collapse = true;
2500 }
2501
2502 /* move to next address */
2503 khugepaged_scan.address += HPAGE_PMD_SIZE;
2504 progress += HPAGE_PMD_NR;
2505 if (!mmap_locked)
2506 /*
2507 * We released mmap_lock so break loop. Note
2508 * that we drop mmap_lock before all hugepage
2509 * allocations, so if allocation fails, we are
2510 * guaranteed to break here and report the
2511 * correct result back to caller.
2512 */
2513 goto breakouterloop_mmap_lock;
2514 if (progress >= pages)
2515 goto breakouterloop;
2516 }
2517 }
2518 breakouterloop:
2519 mmap_read_unlock(mm); /* exit_mmap will destroy ptes after this */
2520 breakouterloop_mmap_lock:
2521
2522 spin_lock(&khugepaged_mm_lock);
2523 VM_BUG_ON(khugepaged_scan.mm_slot != slot);
2524 /*
2525 * Release the current mm_slot if this mm is about to die, or
2526 * if we scanned all vmas of this mm.
2527 */
2528 if (hpage_collapse_test_exit(mm) || !vma) {
2529 bool maybe_collapse = khugepaged_scan.maybe_collapse;
2530
2531 if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
2532 maybe_collapse = true;
2533
2534 /*
2535 * Make sure that if mm_users is reaching zero while
2536 * khugepaged runs here, khugepaged_exit will find
2537 * mm_slot not pointing to the exiting mm.
2538 */
2539 if (!list_is_last(&slot->mm_node, &khugepaged_scan.mm_head)) {
2540 khugepaged_scan.mm_slot = list_next_entry(slot, mm_node);
2541 khugepaged_scan.address = 0;
2542 khugepaged_scan.maybe_collapse = false;
2543 } else {
2544 khugepaged_scan.mm_slot = NULL;
2545 khugepaged_full_scans++;
2546 }
2547
2548 collect_mm_slot(slot, maybe_collapse);
2549 }
2550
2551 trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
2552
2553 return progress;
2554 }
2555
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-21 12:34 ` Vernon Yang
@ 2025-12-23 9:59 ` David Hildenbrand (Red Hat)
2025-12-25 15:12 ` Vernon Yang
0 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-23 9:59 UTC (permalink / raw)
To: Vernon Yang
Cc: Wei Yang, akpm, lorenzo.stoakes, ziy, baohua, lance.yang,
linux-mm, linux-kernel, Vernon Yang
On 12/21/25 13:34, Vernon Yang wrote:
> On Sun, Dec 21, 2025 at 10:24:11AM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/21/25 05:25, Vernon Yang wrote:
>>> On Sun, Dec 21, 2025 at 02:10:44AM +0000, Wei Yang wrote:
>>>> On Fri, Dec 19, 2025 at 09:58:17AM +0100, David Hildenbrand (Red Hat) wrote:
>>>>> On 12/19/25 06:29, Vernon Yang wrote:
>>>>>> On Thu, Dec 18, 2025 at 10:31:58AM +0100, David Hildenbrand (Red Hat) wrote:
>>>>>>> On 12/15/25 10:04, Vernon Yang wrote:
>>>>>>>> For example, create three tasks: hot1 -> cold -> hot2. After all three
>>>>>>>> tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
>>>>>>>> continuously access their 128 MB of memory, while the cold task only
>>>>>>>> accesses its memory briefly and then calls madvise(MADV_COLD). However,
>>>>>>>> khugepaged still prioritizes scanning the cold task and only scans the
>>>>>>>> hot2 task after it has finished scanning the cold task.
>>>>>>>>
>>>>>>>> So if the user has explicitly informed us via MADV_COLD/FREE that this
>>>>>>>> memory is cold or will be freed, it is appropriate for khugepaged to
>>>>>>>> scan it only at the latest possible moment, thereby avoiding unnecessary
>>>>>>>> scan and collapse operations and reducing CPU waste.
>>>>>>>>
>>>>>>>> Here are the performance test results:
>>>>>>>> (Throughput bigger is better, other smaller is better)
>>>>>>>>
>>>>>>>> Testing on x86_64 machine:
>>>>>>>>
>>>>>>>> | task hot2 | without patch | with patch | delta |
>>>>>>>> |---------------------|---------------|---------------|---------|
>>>>>>>> | total accesses time | 3.14 sec | 2.92 sec | -7.01% |
>>>>>>>> | cycles per access | 4.91 | 2.07 | -57.84% |
>>>>>>>> | Throughput | 104.38 M/sec | 112.12 M/sec | +7.42% |
>>>>>>>> | dTLB-load-misses | 288966432 | 1292908 | -99.55% |
>>>>>>>>
>>>>>>>> Testing on qemu-system-x86_64 -enable-kvm:
>>>>>>>>
>>>>>>>> | task hot2 | without patch | with patch | delta |
>>>>>>>> |---------------------|---------------|---------------|---------|
>>>>>>>> | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
>>>>>>>> | cycles per access | 7.23 | 2.12 | -70.68% |
>>>>>>>> | Throughput | 97.88 M/sec | 110.76 M/sec | +13.16% |
>>>>>>>> | dTLB-load-misses | 237406497 | 3189194 | -98.66% |
>>>>>>>
>>>>>>> Again, I also don't like that because you make assumptions on a full process
>>>>>>> based on some part of its address space.
>>>>>>>
>>>>>>> E.g., if a library issues a MADV_COLD on some part of the memory the library
>>>>>>> manages, why should the remaining part of the process suffer as well?
>>>>>>
>>>>>> Yes, you make a good point, thanks!
>>>>>>
>>>>>>> This seems to be a heuristic focused on some specific workloads, no?
>>>>>>
>>>>>> Right.
>>>>>>
>>>>>> Could we use the VM_NOHUGEPAGE flag to indicate that this region should
>>>>>> not be collapsed, so that khugepaged can simply skip this VMA during
>>>>>> scanning? This way, it won't affect the remaining part of the task's
>>>>>> memory regions.
>>>>>
>>>>> I thought we would skip these regions already properly in khugepaged, or
>>>>> maybe I misunderstood your question.
>>>>>
>>>>
>>>> I think we should, but it seems we didn't do this for anonymous memory during
>>>> khugepaged.
>>>>
>>>> We check the vma with thp_vma_allowable_order() during scan.
>>>>
>>>> * For anonymous memory during khugepaged, if we always enable 2M collapse,
>>>> we will scan this vma even if VM_NOHUGEPAGE is set.
>>>>
>>>> * For other cases, it looks good since __thp_vma_allowable_order() will skip
>>>> this vma with vma_thp_disabled().
>>>
>>> Hi David, Wei,
>>>
>>> khugepaged already checks the VM_NOHUGEPAGE flag for anonymous
>>> memory during the scan, as below:
>>>
>>> khugepaged_scan_mm_slot()
>>> thp_vma_allowable_order()
>>> thp_vma_allowable_orders()
>>> __thp_vma_allowable_orders()
>>> vma_thp_disabled() {
>>> if (vm_flags & VM_NOHUGEPAGE)
>>> return true;
>>> }
>>>
>>> REAL ISSUE: madvise(MADV_COLD) does not set the VM_NOHUGEPAGE flag on the
>>> vma, so khugepaged will continue to scan this vma.
>>>
>>> I set the VM_NOHUGEPAGE flag on the vma in madvise(MADV_COLD), and the test
>>> was successful. I will send it in the next version.
>>
>> No we must not do that. That's a user-space visible change. :/
>
> David, what ideas do you have for achieving this goal? Please let me
> know, thanks!
Your idea would be to skip a VMA when we issue madvise(MADV_COLD).
That sounds like yet another heuristic that can easily be wrong? :/
In particular, imagine if the VMA is much larger than the madvise'd
region (other parts used for something else) or if the previously cold
memory area is used for something that is now hot.
With memory allocators that manage most of the memory in a single large
VMA, it's rather easy to see how such a heuristic would be bad, no?
--
Cheers
David
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-19 8:35 ` Vernon Yang
2025-12-19 8:55 ` David Hildenbrand (Red Hat)
@ 2025-12-23 11:18 ` Dev Jain
2025-12-25 16:07 ` Vernon Yang
2025-12-29 6:02 ` Vernon Yang
1 sibling, 2 replies; 42+ messages in thread
From: Dev Jain @ 2025-12-23 11:18 UTC (permalink / raw)
To: Vernon Yang, David Hildenbrand (Red Hat)
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 19/12/25 2:05 pm, Vernon Yang wrote:
> On Thu, Dec 18, 2025 at 10:29:18AM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/15/25 10:04, Vernon Yang wrote:
>>> The following data is traced by bpftrace on a desktop system. After
>>> the system has been left idle for 10 minutes upon booting, a lot of
>>> SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
>>> khugepaged.
>>>
>>> @scan_pmd_status[1]: 1 ## SCAN_SUCCEED
>>> @scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
>>> @scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
>>> total progress size: 701 MB
>>> Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
>>>
>>> The khugepaged_scan list saves all tasks that support collapsing into
>>> hugepages; as long as a task is not destroyed, khugepaged will not remove
>>> it from the khugepaged_scan list. This leads to a phenomenon where a task
>>> has already collapsed all of its memory regions into hugepages, but
>>> khugepaged continues to scan it, which wastes CPU time for no benefit,
>>> and because of khugepaged_scan_sleep_millisecs (default 10s) scanning a
>>> large number of such stale tasks makes the scans that could actually
>>> collapse something happen much later.
>>>
>>> After applying this patch, when all memory is either SCAN_PMD_MAPPED or
>>> SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
>>> list. If a page fault occurs or MADV_HUGEPAGE is called again, the mm is
>>> added back to khugepaged.
>> I don't like that, as it assumes that memory within such a process would be
>> rather static, which is easily not the case (e.g., allocators just doing
>> MADV_DONTNEED to free memory).
>>
>> If most stuff is collapsed to PMDs already, can't we just skip over these
>> regions a bit faster?
> I had a flash of inspiration and came up with an idea.
>
> If these regions have already been collapsed into hugepages, rechecking
> them is very fast. Since khugepaged_pages_to_scan can also
> represent the number of VMAs to skip, we can extend its semantics as
> follows:
>
> /*
> * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
> * every 10 seconds.
> */
> static unsigned int khugepaged_pages_to_scan __read_mostly;
>
> switch (*result) {
> case SCAN_NO_PTE_TABLE:
> case SCAN_PMD_MAPPED:
> case SCAN_PTE_MAPPED_HUGEPAGE:
> progress++; // here
> break;
> case SCAN_SUCCEED:
> ++khugepaged_pages_collapsed;
> fallthrough;
> default:
> progress += HPAGE_PMD_NR;
> }
>
> This way we can achieve our goal. David, do you like it?
This looks good. Can you formally test this and see whether it comes close
to the optimizations yielded by the current version of the patchset?
>
>> --
>> Cheers
>>
>> David
> --
> Thanks,
> Vernon
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE
2025-12-23 9:59 ` David Hildenbrand (Red Hat)
@ 2025-12-25 15:12 ` Vernon Yang
0 siblings, 0 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-25 15:12 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: Wei Yang, akpm, lorenzo.stoakes, ziy, baohua, lance.yang,
linux-mm, linux-kernel, Vernon Yang
On Tue, Dec 23, 2025 at 10:59:29AM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/21/25 13:34, Vernon Yang wrote:
> > On Sun, Dec 21, 2025 at 10:24:11AM +0100, David Hildenbrand (Red Hat) wrote:
> > > On 12/21/25 05:25, Vernon Yang wrote:
> > > > On Sun, Dec 21, 2025 at 02:10:44AM +0000, Wei Yang wrote:
> > > > > On Fri, Dec 19, 2025 at 09:58:17AM +0100, David Hildenbrand (Red Hat) wrote:
> > > > > > On 12/19/25 06:29, Vernon Yang wrote:
> > > > > > > On Thu, Dec 18, 2025 at 10:31:58AM +0100, David Hildenbrand (Red Hat) wrote:
> > > > > > > > On 12/15/25 10:04, Vernon Yang wrote:
> > > > > > > > > For example, create three tasks: hot1 -> cold -> hot2. After all three
> > > > > > > > > tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
> > > > > > > > > continuously access their 128 MB of memory, while the cold task only
> > > > > > > > > accesses its memory briefly and then calls madvise(MADV_COLD). However,
> > > > > > > > > khugepaged still prioritizes scanning the cold task and only scans the
> > > > > > > > > hot2 task after it has finished scanning the cold task.
> > > > > > > > >
> > > > > > > > > So if the user has explicitly informed us via MADV_COLD/FREE that this
> > > > > > > > > memory is cold or will be freed, it is appropriate for khugepaged to
> > > > > > > > > scan it only at the latest possible moment, thereby avoiding unnecessary
> > > > > > > > > scan and collapse operations and reducing CPU waste.
> > > > > > > > >
> > > > > > > > > Here are the performance test results:
> > > > > > > > > (Throughput bigger is better, other smaller is better)
> > > > > > > > >
> > > > > > > > > Testing on x86_64 machine:
> > > > > > > > >
> > > > > > > > > | task hot2 | without patch | with patch | delta |
> > > > > > > > > |---------------------|---------------|---------------|---------|
> > > > > > > > > | total accesses time | 3.14 sec | 2.92 sec | -7.01% |
> > > > > > > > > | cycles per access | 4.91 | 2.07 | -57.84% |
> > > > > > > > > | Throughput | 104.38 M/sec | 112.12 M/sec | +7.42% |
> > > > > > > > > | dTLB-load-misses | 288966432 | 1292908 | -99.55% |
> > > > > > > > >
> > > > > > > > > Testing on qemu-system-x86_64 -enable-kvm:
> > > > > > > > >
> > > > > > > > > | task hot2 | without patch | with patch | delta |
> > > > > > > > > |---------------------|---------------|---------------|---------|
> > > > > > > > > | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
> > > > > > > > > | cycles per access | 7.23 | 2.12 | -70.68% |
> > > > > > > > > | Throughput | 97.88 M/sec | 110.76 M/sec | +13.16% |
> > > > > > > > > | dTLB-load-misses | 237406497 | 3189194 | -98.66% |
> > > > > > > >
> > > > > > > > Again, I also don't like that because you make assumptions on a full process
> > > > > > > > based on some part of its address space.
> > > > > > > >
> > > > > > > > E.g., if a library issues a MADV_COLD on some part of the memory the library
> > > > > > > > manages, why should the remaining part of the process suffer as well?
> > > > > > >
> > > > > > > Yes, you make a good point, thanks!
> > > > > > >
> > > > > > > > This seems to be a heuristic focused on some specific workloads, no?
> > > > > > >
> > > > > > > Right.
> > > > > > >
> > > > > > > Could we use the VM_NOHUGEPAGE flag to indicate that this region should
> > > > > > > not be collapsed, so that khugepaged can simply skip this VMA during
> > > > > > > scanning? This way, it won't affect the remaining part of the task's
> > > > > > > memory regions.
> > > > > >
> > > > > > I thought we would skip these regions already properly in khugepaged, or
> > > > > > maybe I misunderstood your question.
> > > > > >
> > > > >
> > > > > I think we should, but it seems we didn't do this for anonymous memory during
> > > > > khugepaged.
> > > > >
> > > > > We check the vma with thp_vma_allowable_order() during scan.
> > > > >
> > > > > * For anonymous memory during khugepaged, if we always enable 2M collapse,
> > > > > we will scan this vma even if VM_NOHUGEPAGE is set.
> > > > >
> > > > > * For other cases, it looks good since __thp_vma_allowable_order() will skip
> > > > > this vma with vma_thp_disabled().
> > > >
> > > > Hi David, Wei,
> > > >
> > > > khugepaged already checks the VM_NOHUGEPAGE flag for anonymous
> > > > memory during the scan, as below:
> > > >
> > > > khugepaged_scan_mm_slot()
> > > > thp_vma_allowable_order()
> > > > thp_vma_allowable_orders()
> > > > __thp_vma_allowable_orders()
> > > > vma_thp_disabled() {
> > > > if (vm_flags & VM_NOHUGEPAGE)
> > > > return true;
> > > > }
> > > >
> > > > REAL ISSUE: madvise(MADV_COLD) does not set the VM_NOHUGEPAGE flag on the
> > > > vma, so khugepaged will continue to scan this vma.
> > > >
> > > > I set the VM_NOHUGEPAGE flag on the vma in madvise(MADV_COLD), and the test
> > > > was successful. I will send it in the next version.
> > >
> > > No we must not do that. That's a user-space visible change. :/
> >
> > David, what good ideas do you have to achieve this goal? let me know
> > please, thank!
>
> Your idea would be to skip a VMA when we issues madvise(MADV_COLD).
>
> That sounds like yet another heuristic that can easily be wrong? :/
>
> In particular, imagine if the VMA is much larger than the madvise'd region
> (other parts used for something else) or if the previously cold memory area
> is used for something that is now hot.
>
> With memory allocators that manage most of the memory in a single large VMA,
> it's rather easy to see how such a heuristic would be bad, no?
Thanks for your explanation, but my current approach is as follows; the
large VMA will be split in this case.
madvise_vma_behavior
madvise_cold
madvise_update_vma
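To illustrate (a hypothetical sketch only, not the posted patch, and
David's objection above about VM_NOHUGEPAGE being a user-visible flag
change still applies): the idea is to let MADV_COLD take the existing
flag-updating path in mm/madvise.c, so that madvise_update_vma() splits
the advised range out of a larger VMA and changes vm_flags for just
that piece:

	/* hypothetical: in madvise_vma_behavior(), mark only the
	 * advised range; madvise_update_vma() already performs the
	 * VMA split when start/end do not cover the whole VMA.
	 */
	case MADV_COLD:
		new_flags |= VM_NOHUGEPAGE;
		break;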
Maybe I'll send v2 first, and we'll discuss it more clearly :)
--
Merry Christmas,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-23 11:18 ` Dev Jain
@ 2025-12-25 16:07 ` Vernon Yang
2025-12-29 6:02 ` Vernon Yang
1 sibling, 0 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-25 16:07 UTC (permalink / raw)
To: Dev Jain
Cc: David Hildenbrand (Red Hat),
akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On Tue, Dec 23, 2025 at 04:48:57PM +0530, Dev Jain wrote:
>
> On 19/12/25 2:05 pm, Vernon Yang wrote:
> > On Thu, Dec 18, 2025 at 10:29:18AM +0100, David Hildenbrand (Red Hat) wrote:
> >> On 12/15/25 10:04, Vernon Yang wrote:
> >>> The following data is traced by bpftrace on a desktop system. After
> >>> the system has been left idle for 10 minutes upon booting, a lot of
> >>> SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
> >>> khugepaged.
> >>>
> >>> @scan_pmd_status[1]: 1 ## SCAN_SUCCEED
> >>> @scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
> >>> @scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
> >>> total progress size: 701 MB
> >>> Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
> >>>
> >>> The khugepaged_scan list saves all tasks that support collapsing into
> >>> hugepages; as long as a task is not destroyed, khugepaged will not remove
> >>> it from the khugepaged_scan list. This leads to a phenomenon where a task
> >>> has already collapsed all of its memory regions into hugepages, yet
> >>> khugepaged continues to scan it, which wastes CPU time for no benefit;
> >>> and because of khugepaged_scan_sleep_millisecs (default 10s), a large
> >>> number of such invalid tasks causes a long wait, so the really valid
> >>> tasks are scanned later.
> >>>
> >>> After applying this patch, when all memory is either SCAN_PMD_MAPPED or
> >>> SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
> >>> list. If a page fault occurs or MADV_HUGEPAGE is applied again, it is
> >>> added back to khugepaged.
> >> I don't like that, as it assumes that memory within such a process would be
> >> rather static, which is easily not the case (e.g., allocators just doing
> >> MADV_DONTNEED to free memory).
> >>
> >> If most stuff is collapsed to PMDs already, can't we just skip over these
> >> regions a bit faster?
> > I had a flash of inspiration and came up with an idea.
> >
> > If these regions have already been collapsed into hugepages, rechecking
> > them would be very fast. Since khugepaged_pages_to_scan can also
> > represent the number of VMAs to skip, we can extend its semantics as
> > follows:
> >
> > /*
> > * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
> > * every 10 seconds.
> > */
> > static unsigned int khugepaged_pages_to_scan __read_mostly;
> >
> > 	switch (*result) {
> > 	case SCAN_NO_PTE_TABLE:
> > 	case SCAN_PMD_MAPPED:
> > 	case SCAN_PTE_MAPPED_HUGEPAGE:
> > 		progress++;	/* here: charge 1 instead of HPAGE_PMD_NR */
> > 		break;
> > 	case SCAN_SUCCEED:
> > 		++khugepaged_pages_collapsed;
> > 		fallthrough;
> > 	default:
> > 		progress += HPAGE_PMD_NR;
> > 	}
> >
> > This way we can achieve our goal. David, do you like it?
>
> This looks good, can you formally test this and see if it comes close to the optimizations
> yielded by the current version of the patchset?
Both approaches achieve this, reducing the time of a full scan, as
previously tested.
As for the performance numbers, I will run a formal test.
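
For reference, a rough quota calculation (assuming x86_64 with 4K base
pages and 2M PMDs, so HPAGE_PMD_NR = 512): khugepaged_pages_to_scan
defaults to 8 * HPAGE_PMD_NR = 4096. Charging HPAGE_PMD_NR of progress
per already-collapsed PMD means one wakeup covers only 8 such PMDs
(16 MB) before sleeping again; charging 1 lets a single wakeup skip up
to 4096 collapsed PMDs (8 GB), so a mostly-collapsed mm is passed over
roughly 512x faster.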
--
Merry Christmas,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed
2025-12-23 11:18 ` Dev Jain
2025-12-25 16:07 ` Vernon Yang
@ 2025-12-29 6:02 ` Vernon Yang
1 sibling, 0 replies; 42+ messages in thread
From: Vernon Yang @ 2025-12-29 6:02 UTC (permalink / raw)
To: Dev Jain
Cc: David Hildenbrand (Red Hat),
akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On Tue, Dec 23, 2025 at 04:48:57PM +0530, Dev Jain wrote:
>
> On 19/12/25 2:05 pm, Vernon Yang wrote:
> > On Thu, Dec 18, 2025 at 10:29:18AM +0100, David Hildenbrand (Red Hat) wrote:
> >> On 12/15/25 10:04, Vernon Yang wrote:
> >>> The following data is traced by bpftrace on a desktop system. After
> >>> the system has been left idle for 10 minutes upon booting, a lot of
> >>> SCAN_PMD_MAPPED or SCAN_PMD_NONE are observed during a full scan by
> >>> khugepaged.
> >>>
> >>> @scan_pmd_status[1]: 1 ## SCAN_SUCCEED
> >>> @scan_pmd_status[4]: 158 ## SCAN_PMD_MAPPED
> >>> @scan_pmd_status[3]: 174 ## SCAN_PMD_NONE
> >>> total progress size: 701 MB
> >>> Total time : 440 seconds ## include khugepaged_scan_sleep_millisecs
> >>>
> >>> The khugepaged_scan list saves all tasks that support collapsing into
> >>> hugepages; as long as a task is not destroyed, khugepaged will not remove
> >>> it from the khugepaged_scan list. This leads to a phenomenon where a task
> >>> has already collapsed all of its memory regions into hugepages, yet
> >>> khugepaged continues to scan it, which wastes CPU time for no benefit;
> >>> and because of khugepaged_scan_sleep_millisecs (default 10s), a large
> >>> number of such invalid tasks causes a long wait, so the really valid
> >>> tasks are scanned later.
> >>>
> >>> After applying this patch, when all memory is either SCAN_PMD_MAPPED or
> >>> SCAN_PMD_NONE, the mm is automatically removed from khugepaged's scan
> >>> list. If a page fault occurs or MADV_HUGEPAGE is applied again, it is
> >>> added back to khugepaged.
> >> I don't like that, as it assumes that memory within such a process would be
> >> rather static, which is easily not the case (e.g., allocators just doing
> >> MADV_DONTNEED to free memory).
> >>
> >> If most stuff is collapsed to PMDs already, can't we just skip over these
> >> regions a bit faster?
> > I had a flash of inspiration and came up with an idea.
> >
> > If these regions have already been collapsed into hugepages, rechecking
> > them would be very fast. Since khugepaged_pages_to_scan can also
> > represent the number of VMAs to skip, we can extend its semantics as
> > follows:
> >
> > /*
> > * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
> > * every 10 seconds.
> > */
> > static unsigned int khugepaged_pages_to_scan __read_mostly;
> >
> > 	switch (*result) {
> > 	case SCAN_NO_PTE_TABLE:
> > 	case SCAN_PMD_MAPPED:
> > 	case SCAN_PTE_MAPPED_HUGEPAGE:
> > 		progress++;	/* here: charge 1 instead of HPAGE_PMD_NR */
> > 		break;
> > 	case SCAN_SUCCEED:
> > 		++khugepaged_pages_collapsed;
> > 		fallthrough;
> > 	default:
> > 		progress += HPAGE_PMD_NR;
> > 	}
> >
> > This way we can achieve our goal. David, do you like it?
>
> This looks good, can you formally test this and see if it comes close to the optimizations
> yielded by the current version of the patchset?
Both have the same performance. For detailed data, see v2[1].
[1] https://lore.kernel.org/linux-mm/20251229055151.54887-1-yanglincheng@kylinos.cn/
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 42+ messages in thread
end of thread, other threads:[~2025-12-29 6:02 UTC | newest]
Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-12-15 9:04 [PATCH 0/4] Improve khugepaged scan logic Vernon Yang
2025-12-15 9:04 ` [PATCH 1/4] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
2025-12-18 9:24 ` David Hildenbrand (Red Hat)
2025-12-19 5:21 ` Vernon Yang
2025-12-15 9:04 ` [PATCH 2/4] mm: khugepaged: remove mm when all memory has been collapsed Vernon Yang
2025-12-15 11:52 ` Lance Yang
2025-12-16 6:27 ` Vernon Yang
2025-12-15 21:45 ` kernel test robot
2025-12-16 6:30 ` Vernon Yang
2025-12-15 23:01 ` kernel test robot
2025-12-16 6:32 ` Vernon Yang
2025-12-17 3:31 ` Wei Yang
2025-12-18 3:27 ` Vernon Yang
2025-12-18 3:48 ` Wei Yang
2025-12-18 4:41 ` Vernon Yang
2025-12-18 9:29 ` David Hildenbrand (Red Hat)
2025-12-19 5:24 ` Vernon Yang
2025-12-19 9:00 ` David Hildenbrand (Red Hat)
2025-12-19 8:35 ` Vernon Yang
2025-12-19 8:55 ` David Hildenbrand (Red Hat)
2025-12-23 11:18 ` Dev Jain
2025-12-25 16:07 ` Vernon Yang
2025-12-29 6:02 ` Vernon Yang
2025-12-22 19:00 ` kernel test robot
2025-12-15 9:04 ` [PATCH 3/4] mm: khugepaged: move mm to list tail when MADV_COLD/MADV_FREE Vernon Yang
2025-12-15 21:12 ` kernel test robot
2025-12-16 7:00 ` Vernon Yang
2025-12-16 13:08 ` kernel test robot
2025-12-16 13:31 ` kernel test robot
2025-12-18 9:31 ` David Hildenbrand (Red Hat)
2025-12-19 5:29 ` Vernon Yang
2025-12-19 8:58 ` David Hildenbrand (Red Hat)
2025-12-21 2:10 ` Wei Yang
2025-12-21 4:25 ` Vernon Yang
2025-12-21 9:24 ` David Hildenbrand (Red Hat)
2025-12-21 12:34 ` Vernon Yang
2025-12-23 9:59 ` David Hildenbrand (Red Hat)
2025-12-25 15:12 ` Vernon Yang
2025-12-21 12:38 ` Wei Yang
2025-12-15 9:04 ` [PATCH 4/4] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY Vernon Yang
2025-12-18 9:33 ` David Hildenbrand (Red Hat)
2025-12-19 5:31 ` Vernon Yang
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox