* [PATCH v2 1/4] mm: khugepaged: add trace_mm_khugepaged_scan event
2025-12-29 5:51 [PATCH v2 0/4] Improve khugepaged scan logic Vernon Yang
@ 2025-12-29 5:51 ` Vernon Yang
2025-12-29 8:09 ` Barry Song
2025-12-29 5:51 ` [PATCH v2 2/4] mm: khugepaged: just skip when the memory has been collapsed Vernon Yang
` (3 subsequent siblings)
4 siblings, 1 reply; 18+ messages in thread
From: Vernon Yang @ 2025-12-29 5:51 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes
Cc: ziy, dev.jain, baohua, lance.yang, richard.weiyang, linux-mm,
linux-kernel, Vernon Yang
Add an mm_khugepaged_scan event to track the total time of a full scan
and the total number of pages scanned by khugepaged.
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
---
include/trace/events/huge_memory.h | 24 ++++++++++++++++++++++++
mm/khugepaged.c | 2 ++
2 files changed, 26 insertions(+)
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 4cde53b45a85..01225dd27ad5 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -236,5 +236,29 @@ TRACE_EVENT(mm_khugepaged_collapse_file,
__print_symbolic(__entry->result, SCAN_STATUS))
);
+TRACE_EVENT(mm_khugepaged_scan,
+
+ TP_PROTO(struct mm_struct *mm, int progress, bool full_scan_finished),
+
+ TP_ARGS(mm, progress, full_scan_finished),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct *, mm)
+ __field(int, progress)
+ __field(bool, full_scan_finished)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __entry->progress = progress;
+ __entry->full_scan_finished = full_scan_finished;
+ ),
+
+ TP_printk("mm=%p, progress=%d, full_scan_finished=%d",
+ __entry->mm,
+ __entry->progress,
+ __entry->full_scan_finished)
+);
+
#endif /* __HUGE_MEMORY_H */
#include <trace/define_trace.h>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 97d1b2824386..9f99f61689f8 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2533,6 +2533,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
collect_mm_slot(slot);
}
+ trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
+
return progress;
}
--
2.51.0
^ permalink raw reply [flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/4] mm: khugepaged: add trace_mm_khugepaged_scan event
2025-12-29 5:51 ` [PATCH v2 1/4] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
@ 2025-12-29 8:09 ` Barry Song
0 siblings, 0 replies; 18+ messages in thread
From: Barry Song @ 2025-12-29 8:09 UTC (permalink / raw)
To: Vernon Yang
Cc: akpm, david, lorenzo.stoakes, ziy, dev.jain, lance.yang,
richard.weiyang, linux-mm, linux-kernel, Vernon Yang
On Mon, Dec 29, 2025 at 6:52 PM Vernon Yang <vernon2gm@gmail.com> wrote:
>
> Add mm_khugepaged_scan event to track the total time for full scan
> and the total number of pages scanned of khugepaged.
>
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Thanks
Barry
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v2 2/4] mm: khugepaged: just skip when the memory has been collapsed
2025-12-29 5:51 [PATCH v2 0/4] Improve khugepaged scan logic Vernon Yang
2025-12-29 5:51 ` [PATCH v2 1/4] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
@ 2025-12-29 5:51 ` Vernon Yang
2025-12-30 15:46 ` Vernon Yang
2025-12-29 5:51 ` [PATCH v2 3/4] mm: khugepaged: set VM_NOHUGEPAGE flag when MADV_COLD/MADV_FREE Vernon Yang
` (2 subsequent siblings)
4 siblings, 1 reply; 18+ messages in thread
From: Vernon Yang @ 2025-12-29 5:51 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes
Cc: ziy, dev.jain, baohua, lance.yang, richard.weiyang, linux-mm,
linux-kernel, Vernon Yang
The following data was traced with bpftrace on a desktop system. After
the system had been left idle for 10 minutes after booting, a lot of
SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE results were observed during a full
scan by khugepaged.
@scan_pmd_status[1]: 1 ## SCAN_SUCCEED
@scan_pmd_status[6]: 2 ## SCAN_EXCEED_SHARED_PTE
@scan_pmd_status[3]: 142 ## SCAN_PMD_MAPPED
@scan_pmd_status[2]: 178 ## SCAN_NO_PTE_TABLE
total progress size: 674 MB
Total time : 419 seconds ## includes khugepaged_scan_sleep_millisecs
The khugepaged_scan list holds every task that is eligible for collapsing
into hugepages; as long as a task is not destroyed, khugepaged never
removes it from the list. As a result, a task may have already collapsed
all of its memory regions into hugepages, yet khugepaged keeps scanning
it, wasting CPU time to no effect. Worse, because of
khugepaged_scan_sleep_millisecs (default 10s), scanning a large number of
such stale tasks adds long waits, so genuinely useful tasks are scanned
much later.
After applying this patch, memory whose scan result is SCAN_PMD_MAPPED or
SCAN_NO_PTE_TABLE is simply skipped, as follows:
@scan_pmd_status[6]: 2
@scan_pmd_status[3]: 147
@scan_pmd_status[2]: 173
total progress size: 45 MB
Total time : 20 seconds
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
mm/khugepaged.c | 23 ++++++++++++++++++-----
1 file changed, 18 insertions(+), 5 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 9f99f61689f8..2b3685b195f5 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -66,7 +66,10 @@ enum scan_result {
static struct task_struct *khugepaged_thread __read_mostly;
static DEFINE_MUTEX(khugepaged_mutex);
-/* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 second */
+/*
+ * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
+ * every 10 second.
+ */
static unsigned int khugepaged_pages_to_scan __read_mostly;
static unsigned int khugepaged_pages_collapsed;
static unsigned int khugepaged_full_scans;
@@ -2487,12 +2490,22 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
khugepaged_scan.address, &mmap_locked, cc);
}
- if (*result == SCAN_SUCCEED)
- ++khugepaged_pages_collapsed;
-
/* move to next address */
khugepaged_scan.address += HPAGE_PMD_SIZE;
- progress += HPAGE_PMD_NR;
+
+ switch (*result) {
+ case SCAN_NO_PTE_TABLE:
+ case SCAN_PMD_MAPPED:
+ case SCAN_PTE_MAPPED_HUGEPAGE:
+ progress++;
+ break;
+ case SCAN_SUCCEED:
+ ++khugepaged_pages_collapsed;
+ fallthrough;
+ default:
+ progress += HPAGE_PMD_NR;
+ }
+
if (!mmap_locked)
/*
* We released mmap_lock so break loop. Note
--
2.51.0
^ permalink raw reply [flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/4] mm: khugepaged: just skip when the memory has been collapsed
2025-12-29 5:51 ` [PATCH v2 2/4] mm: khugepaged: just skip when the memory has been collapsed Vernon Yang
@ 2025-12-30 15:46 ` Vernon Yang
0 siblings, 0 replies; 18+ messages in thread
From: Vernon Yang @ 2025-12-30 15:46 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes
Cc: ziy, dev.jain, baohua, lance.yang, richard.weiyang, linux-mm,
linux-kernel, Vernon Yang
On Mon, Dec 29, 2025 at 1:52 PM Vernon Yang <vernon2gm@gmail.com> wrote:
>
> The following data is traced by bpftrace on a desktop system. After
> the system has been left idle for 10 minutes upon booting, a lot of
> SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE are observed during a full scan
> by khugepaged.
>
> @scan_pmd_status[1]: 1 ## SCAN_SUCCEED
> @scan_pmd_status[6]: 2 ## SCAN_EXCEED_SHARED_PTE
> @scan_pmd_status[3]: 142 ## SCAN_PMD_MAPPED
> @scan_pmd_status[2]: 178 ## SCAN_NO_PTE_TABLE
> total progress size: 674 MB
> Total time : 419 seconds ## include khugepaged_scan_sleep_millisecs
>
> The khugepaged_scan list save all task that support collapse into hugepage,
> as long as the task is not destroyed, khugepaged will not remove it from
> the khugepaged_scan list. This exist a phenomenon where task has already
> collapsed all memory regions into hugepage, but khugepaged continues to
> scan it, which wastes CPU time and invalid, and due to
> khugepaged_scan_sleep_millisecs (default 10s) causes a long wait for
> scanning a large number of invalid task, so scanning really valid task
> is later.
>
> After applying this patch, when the memory is either SCAN_PMD_MAPPED or
> SCAN_NO_PTE_TABLE, just skip it, as follow:
>
> @scan_pmd_status[6]: 2
> @scan_pmd_status[3]: 147
> @scan_pmd_status[2]: 173
> total progress size: 45 MB
> Total time : 20 seconds
>
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> ---
> mm/khugepaged.c | 23 ++++++++++++++++++-----
> 1 file changed, 18 insertions(+), 5 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 9f99f61689f8..2b3685b195f5 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -66,7 +66,10 @@ enum scan_result {
> static struct task_struct *khugepaged_thread __read_mostly;
> static DEFINE_MUTEX(khugepaged_mutex);
>
> -/* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 second */
> +/*
> + * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
> + * every 10 second.
> + */
> static unsigned int khugepaged_pages_to_scan __read_mostly;
> static unsigned int khugepaged_pages_collapsed;
> static unsigned int khugepaged_full_scans;
> @@ -2487,12 +2490,22 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> khugepaged_scan.address, &mmap_locked, cc);
> }
>
> - if (*result == SCAN_SUCCEED)
> - ++khugepaged_pages_collapsed;
> -
> /* move to next address */
> khugepaged_scan.address += HPAGE_PMD_SIZE;
> - progress += HPAGE_PMD_NR;
> +
> + switch (*result) {
> + case SCAN_NO_PTE_TABLE:
> + case SCAN_PMD_MAPPED:
> + case SCAN_PTE_MAPPED_HUGEPAGE:
> + progress++;
> + break;
> + case SCAN_SUCCEED:
> + ++khugepaged_pages_collapsed;
> + fallthrough;
> + default:
> + progress += HPAGE_PMD_NR;
> + }
> +
> if (!mmap_locked)
> /*
> * We released mmap_lock so break loop. Note
> --
> 2.51.0
>
Sorry, this code produces warnings around SCAN_PTE_MAPPED_HUGEPAGE, and it
will be refactored in the next version. But the core idea is unchanged.
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v2 3/4] mm: khugepaged: set VM_NOHUGEPAGE flag when MADV_COLD/MADV_FREE
2025-12-29 5:51 [PATCH v2 0/4] Improve khugepaged scan logic Vernon Yang
2025-12-29 5:51 ` [PATCH v2 1/4] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
2025-12-29 5:51 ` [PATCH v2 2/4] mm: khugepaged: just skip when the memory has been collapsed Vernon Yang
@ 2025-12-29 5:51 ` Vernon Yang
2025-12-29 8:20 ` Barry Song
2025-12-30 19:54 ` David Hildenbrand (Red Hat)
2025-12-29 5:51 ` [PATCH v2 4/4] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY Vernon Yang
2025-12-29 10:21 ` [syzbot ci] Re: Improve khugepaged scan logic syzbot ci
4 siblings, 2 replies; 18+ messages in thread
From: Vernon Yang @ 2025-12-29 5:51 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes
Cc: ziy, dev.jain, baohua, lance.yang, richard.weiyang, linux-mm,
linux-kernel, Vernon Yang
For example, create three tasks: hot1 -> cold -> hot2. After all three
tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
continuously access their 128 MB of memory, while the cold task only
accesses its memory briefly and then calls madvise(MADV_COLD). However,
khugepaged still prioritizes scanning the cold task and only scans the
hot2 task after completing the scan of the cold task.
So if the user has explicitly informed us via MADV_COLD/MADV_FREE that
this memory is cold or will be freed, it is appropriate for khugepaged to
simply skip it, thereby avoiding unnecessary scan and collapse operations
and reducing wasted CPU time.
Here are the performance test results:
(higher Throughput is better; for the other metrics, smaller is better)
Testing on x86_64 machine:
| task hot2 | without patch | with patch | delta |
|---------------------|---------------|---------------|---------|
| total accesses time | 3.14 sec | 2.93 sec | -6.69% |
| cycles per access | 4.96 | 2.21 | -55.44% |
| Throughput | 104.38 M/sec | 111.89 M/sec | +7.19% |
| dTLB-load-misses | 284814532 | 69597236 | -75.56% |
Testing on qemu-system-x86_64 -enable-kvm:
| task hot2 | without patch | with patch | delta |
|---------------------|---------------|---------------|---------|
| total accesses time | 3.35 sec | 2.96 sec | -11.64% |
| cycles per access | 7.29 | 2.07 | -71.60% |
| Throughput | 97.67 M/sec | 110.77 M/sec | +13.41% |
| dTLB-load-misses | 241600871 | 3216108 | -98.67% |
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
mm/madvise.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index b617b1be0f53..3a48d725a3fc 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1360,11 +1360,8 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
return madvise_remove(madv_behavior);
case MADV_WILLNEED:
return madvise_willneed(madv_behavior);
- case MADV_COLD:
- return madvise_cold(madv_behavior);
case MADV_PAGEOUT:
return madvise_pageout(madv_behavior);
- case MADV_FREE:
case MADV_DONTNEED:
case MADV_DONTNEED_LOCKED:
return madvise_dontneed_free(madv_behavior);
@@ -1378,6 +1375,18 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
/* The below behaviours update VMAs via madvise_update_vma(). */
+ case MADV_COLD:
+ error = madvise_cold(madv_behavior);
+ if (error)
+ goto out;
+ new_flags = (new_flags & ~VM_HUGEPAGE) | VM_NOHUGEPAGE;
+ break;
+ case MADV_FREE:
+ error = madvise_dontneed_free(madv_behavior);
+ if (error)
+ goto out;
+ new_flags = (new_flags & ~VM_HUGEPAGE) | VM_NOHUGEPAGE;
+ break;
case MADV_NORMAL:
new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
break;
@@ -1756,7 +1765,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
switch (madv_behavior->behavior) {
case MADV_REMOVE:
case MADV_WILLNEED:
- case MADV_COLD:
case MADV_PAGEOUT:
case MADV_POPULATE_READ:
case MADV_POPULATE_WRITE:
@@ -1766,7 +1774,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
case MADV_GUARD_REMOVE:
case MADV_DONTNEED:
case MADV_DONTNEED_LOCKED:
- case MADV_FREE:
return MADVISE_VMA_READ_LOCK;
default:
return MADVISE_MMAP_WRITE_LOCK;
--
2.51.0
^ permalink raw reply [flat|nested] 18+ messages in thread

* Re: [PATCH v2 3/4] mm: khugepaged: set VM_NOHUGEPAGE flag when MADV_COLD/MADV_FREE
2025-12-29 5:51 ` [PATCH v2 3/4] mm: khugepaged: set VM_NOHUGEPAGE flag when MADV_COLD/MADV_FREE Vernon Yang
@ 2025-12-29 8:20 ` Barry Song
2025-12-29 8:26 ` Dev Jain
2025-12-30 15:30 ` Vernon Yang
2025-12-30 19:54 ` David Hildenbrand (Red Hat)
1 sibling, 2 replies; 18+ messages in thread
From: Barry Song @ 2025-12-29 8:20 UTC (permalink / raw)
To: Vernon Yang
Cc: akpm, david, lorenzo.stoakes, ziy, dev.jain, lance.yang,
richard.weiyang, linux-mm, linux-kernel, Vernon Yang
On Mon, Dec 29, 2025 at 6:52 PM Vernon Yang <vernon2gm@gmail.com> wrote:
>
> For example, create three task: hot1 -> cold -> hot2. After all three
> task are created, each allocate memory 128MB. the hot1/hot2 task
> continuously access 128 MB memory, while the cold task only accesses
> its memory briefly and then call madvise(MADV_COLD). However, khugepaged
> still prioritizes scanning the cold task and only scans the hot2 task
> after completing the scan of the cold task.
>
> So if the user has explicitly informed us via MADV_COLD/FREE that this
> memory is cold or will be freed, it is appropriate for khugepaged to
> skip it only, thereby avoiding unnecessary scan and collapse operations
> to reducing CPU wastage.
>
> Here are the performance test results:
> (Throughput bigger is better, other smaller is better)
>
> Testing on x86_64 machine:
>
> | task hot2 | without patch | with patch | delta |
> |---------------------|---------------|---------------|---------|
> | total accesses time | 3.14 sec | 2.93 sec | -6.69% |
> | cycles per access | 4.96 | 2.21 | -55.44% |
> | Throughput | 104.38 M/sec | 111.89 M/sec | +7.19% |
> | dTLB-load-misses | 284814532 | 69597236 | -75.56% |
>
> Testing on qemu-system-x86_64 -enable-kvm:
>
> | task hot2 | without patch | with patch | delta |
> |---------------------|---------------|---------------|---------|
> | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
> | cycles per access | 7.29 | 2.07 | -71.60% |
> | Throughput | 97.67 M/sec | 110.77 M/sec | +13.41% |
> | dTLB-load-misses | 241600871 | 3216108 | -98.67% |
>
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> ---
> mm/madvise.c | 17 ++++++++++++-----
> 1 file changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index b617b1be0f53..3a48d725a3fc 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1360,11 +1360,8 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
> return madvise_remove(madv_behavior);
> case MADV_WILLNEED:
> return madvise_willneed(madv_behavior);
> - case MADV_COLD:
> - return madvise_cold(madv_behavior);
> case MADV_PAGEOUT:
> return madvise_pageout(madv_behavior);
> - case MADV_FREE:
> case MADV_DONTNEED:
> case MADV_DONTNEED_LOCKED:
> return madvise_dontneed_free(madv_behavior);
> @@ -1378,6 +1375,18 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
>
> /* The below behaviours update VMAs via madvise_update_vma(). */
>
> + case MADV_COLD:
> + error = madvise_cold(madv_behavior);
> + if (error)
> + goto out;
> + new_flags = (new_flags & ~VM_HUGEPAGE) | VM_NOHUGEPAGE;
> + break;
> + case MADV_FREE:
> + error = madvise_dontneed_free(madv_behavior);
> + if (error)
> + goto out;
> + new_flags = (new_flags & ~VM_HUGEPAGE) | VM_NOHUGEPAGE;
> + break;
I am not convinced this is the right patch for MADV_FREE. Userspace
heaps may call MADV_FREE on free(), which does not mean they no longer
want huge pages; it only indicates that the old contents are no longer
needed. New allocations may still occur in the same region.
The same concern applies to MADV_COLD. MADV_COLD may only indicate
that the VMA is cold at the moment and for the near future, but it
can become hot again. For example, MADV_COLD may be issued when an
app moves to the background, but the memory can become hot again
once the app returns to the foreground.
In short, MADV_FREE and MADV_COLD only indicate that the memory is cold
or may be freed for a period of time; they are not permanent states.
Changing the VMA flags implies that the VMA is permanently free or
cold, which is not true in either case.
Your patch also prevents potential per-VMA lock optimizations.
However, if the intent is to treat folios hinted by MADV_FREE or
MADV_COLD as candidates not to be collapsed, I agree that this makes sense.
For MADV_FREE, could we simply skip the lazy-free folios instead?
For MADV_COLD, I am not sure how we can determine which folios
have actually been madvised as cold.
Thanks
Barry
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 3/4] mm: khugepaged: set VM_NOHUGEPAGE flag when MADV_COLD/MADV_FREE
2025-12-29 8:20 ` Barry Song
@ 2025-12-29 8:26 ` Dev Jain
2025-12-30 15:30 ` Vernon Yang
1 sibling, 0 replies; 18+ messages in thread
From: Dev Jain @ 2025-12-29 8:26 UTC (permalink / raw)
To: Barry Song, Vernon Yang
Cc: akpm, david, lorenzo.stoakes, ziy, lance.yang, richard.weiyang,
linux-mm, linux-kernel, Vernon Yang
On 29/12/25 1:50 pm, Barry Song wrote:
> On Mon, Dec 29, 2025 at 6:52 PM Vernon Yang <vernon2gm@gmail.com> wrote:
>> For example, create three task: hot1 -> cold -> hot2. After all three
>> task are created, each allocate memory 128MB. the hot1/hot2 task
>> continuously access 128 MB memory, while the cold task only accesses
>> its memory briefly and then call madvise(MADV_COLD). However, khugepaged
>> still prioritizes scanning the cold task and only scans the hot2 task
>> after completing the scan of the cold task.
>>
>> So if the user has explicitly informed us via MADV_COLD/FREE that this
>> memory is cold or will be freed, it is appropriate for khugepaged to
>> skip it only, thereby avoiding unnecessary scan and collapse operations
>> to reducing CPU wastage.
>>
>> Here are the performance test results:
>> (Throughput bigger is better, other smaller is better)
>>
>> Testing on x86_64 machine:
>>
>> | task hot2 | without patch | with patch | delta |
>> |---------------------|---------------|---------------|---------|
>> | total accesses time | 3.14 sec | 2.93 sec | -6.69% |
>> | cycles per access | 4.96 | 2.21 | -55.44% |
>> | Throughput | 104.38 M/sec | 111.89 M/sec | +7.19% |
>> | dTLB-load-misses | 284814532 | 69597236 | -75.56% |
>>
>> Testing on qemu-system-x86_64 -enable-kvm:
>>
>> | task hot2 | without patch | with patch | delta |
>> |---------------------|---------------|---------------|---------|
>> | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
>> | cycles per access | 7.29 | 2.07 | -71.60% |
>> | Throughput | 97.67 M/sec | 110.77 M/sec | +13.41% |
>> | dTLB-load-misses | 241600871 | 3216108 | -98.67% |
>>
>> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
>> ---
>> mm/madvise.c | 17 ++++++++++++-----
>> 1 file changed, 12 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index b617b1be0f53..3a48d725a3fc 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -1360,11 +1360,8 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
>> return madvise_remove(madv_behavior);
>> case MADV_WILLNEED:
>> return madvise_willneed(madv_behavior);
>> - case MADV_COLD:
>> - return madvise_cold(madv_behavior);
>> case MADV_PAGEOUT:
>> return madvise_pageout(madv_behavior);
>> - case MADV_FREE:
>> case MADV_DONTNEED:
>> case MADV_DONTNEED_LOCKED:
>> return madvise_dontneed_free(madv_behavior);
>> @@ -1378,6 +1375,18 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
>>
>> /* The below behaviours update VMAs via madvise_update_vma(). */
>>
>> + case MADV_COLD:
>> + error = madvise_cold(madv_behavior);
>> + if (error)
>> + goto out;
>> + new_flags = (new_flags & ~VM_HUGEPAGE) | VM_NOHUGEPAGE;
>> + break;
>> + case MADV_FREE:
>> + error = madvise_dontneed_free(madv_behavior);
>> + if (error)
>> + goto out;
>> + new_flags = (new_flags & ~VM_HUGEPAGE) | VM_NOHUGEPAGE;
>> + break;
> I am not convinced this is the right patch for MADV_FREE. Userspace
> heaps may call MADV_FREE on free(), which does not mean they no longer
> want huge pages; it only indicates that the old contents are no longer
> needed. New allocations may still occur in the same region.
+1. Userspace allocators use MADV_DONTNEED/MADV_FREE to avoid the overhead
of actually unmapping the memory via munmap.
>
> The same concern applies to MADV_COLD. MADV_COLD may only indicate
> that the VMA is cold at the moment and for the near future, but it
> can become hot again. For example, MADV_COLD may be issued when an
> app moves to the background, but the memory can become hot again
> once the app returns to the foreground.
>
> In short, MADV_FREE and MADV_COLD only indicate that the memory is cold
> or may be freed for a period of time; they are not permanent states.
> Changing the VMA flags implies that the VMA is permanently free or
> cold, which is not true in either case.
>
> Your patch also prevents potential per-VMA lock optimizations.
>
> However, if the intent is to treat folios hinted by MADV_FREE or
> MADV_COLD as candidates not to be collapsed, I agree that this makes sense.
>
> For MADV_FREE, could we simply skip the lazy-free folios instead?
> For MADV_COLD, I am not sure how we can determine which folios
> have actually been madvised as cold.
>
> Thanks
> Barry
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 3/4] mm: khugepaged: set VM_NOHUGEPAGE flag when MADV_COLD/MADV_FREE
2025-12-29 8:20 ` Barry Song
2025-12-29 8:26 ` Dev Jain
@ 2025-12-30 15:30 ` Vernon Yang
1 sibling, 0 replies; 18+ messages in thread
From: Vernon Yang @ 2025-12-30 15:30 UTC (permalink / raw)
To: Barry Song
Cc: akpm, david, lorenzo.stoakes, ziy, dev.jain, lance.yang,
richard.weiyang, linux-mm, linux-kernel, Vernon Yang
On Mon, Dec 29, 2025 at 09:20:12PM +1300, Barry Song wrote:
> On Mon, Dec 29, 2025 at 6:52 PM Vernon Yang <vernon2gm@gmail.com> wrote:
> >
> > For example, create three task: hot1 -> cold -> hot2. After all three
> > task are created, each allocate memory 128MB. the hot1/hot2 task
> > continuously access 128 MB memory, while the cold task only accesses
> > its memory briefly and then call madvise(MADV_COLD). However, khugepaged
> > still prioritizes scanning the cold task and only scans the hot2 task
> > after completing the scan of the cold task.
> >
> > So if the user has explicitly informed us via MADV_COLD/FREE that this
> > memory is cold or will be freed, it is appropriate for khugepaged to
> > skip it only, thereby avoiding unnecessary scan and collapse operations
> > to reducing CPU wastage.
> >
> > Here are the performance test results:
> > (Throughput bigger is better, other smaller is better)
> >
> > Testing on x86_64 machine:
> >
> > | task hot2 | without patch | with patch | delta |
> > |---------------------|---------------|---------------|---------|
> > | total accesses time | 3.14 sec | 2.93 sec | -6.69% |
> > | cycles per access | 4.96 | 2.21 | -55.44% |
> > | Throughput | 104.38 M/sec | 111.89 M/sec | +7.19% |
> > | dTLB-load-misses | 284814532 | 69597236 | -75.56% |
> >
> > Testing on qemu-system-x86_64 -enable-kvm:
> >
> > | task hot2 | without patch | with patch | delta |
> > |---------------------|---------------|---------------|---------|
> > | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
> > | cycles per access | 7.29 | 2.07 | -71.60% |
> > | Throughput | 97.67 M/sec | 110.77 M/sec | +13.41% |
> > | dTLB-load-misses | 241600871 | 3216108 | -98.67% |
> >
> > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > ---
> > mm/madvise.c | 17 ++++++++++++-----
> > 1 file changed, 12 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index b617b1be0f53..3a48d725a3fc 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1360,11 +1360,8 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
> > return madvise_remove(madv_behavior);
> > case MADV_WILLNEED:
> > return madvise_willneed(madv_behavior);
> > - case MADV_COLD:
> > - return madvise_cold(madv_behavior);
> > case MADV_PAGEOUT:
> > return madvise_pageout(madv_behavior);
> > - case MADV_FREE:
> > case MADV_DONTNEED:
> > case MADV_DONTNEED_LOCKED:
> > return madvise_dontneed_free(madv_behavior);
> > @@ -1378,6 +1375,18 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
> >
> > /* The below behaviours update VMAs via madvise_update_vma(). */
> >
> > + case MADV_COLD:
> > + error = madvise_cold(madv_behavior);
> > + if (error)
> > + goto out;
> > + new_flags = (new_flags & ~VM_HUGEPAGE) | VM_NOHUGEPAGE;
> > + break;
> > + case MADV_FREE:
> > + error = madvise_dontneed_free(madv_behavior);
> > + if (error)
> > + goto out;
> > + new_flags = (new_flags & ~VM_HUGEPAGE) | VM_NOHUGEPAGE;
> > + break;
>
> I am not convinced this is the right patch for MADV_FREE. Userspace
> heaps may call MADV_FREE on free(), which does not mean they no longer
> want huge pages; it only indicates that the old contents are no longer
> needed. New allocations may still occur in the same region.
>
> The same concern applies to MADV_COLD. MADV_COLD may only indicate
> that the VMA is cold at the moment and for the near future, but it
> can become hot again. For example, MADV_COLD may be issued when an
> app moves to the background, but the memory can become hot again
> once the app returns to the foreground.
>
> In short, MADV_FREE and MADV_COLD only indicate that the memory is cold
> or may be freed for a period of time; they are not permanent states.
> Changing the VMA flags implies that the VMA is permanently free or
> cold, which is not true in either case.
>
> Your patch also prevents potential per-VMA lock optimizations.
Thank you for the review and explanation.
> However, if the intent is to treat folios hinted by MADV_FREE or
> MADV_COLD as candidates not to be collapsed, I agree that this makes sense.
>
> For MADV_FREE, could we simply skip the lazy-free folios instead?
Simply skipping the lazy-free folios works nicely; it gives the same
performance. Thanks for your suggestion, I will send it in the next
version.
> For MADV_COLD, I am not sure how we can determine which folios
> have actually been madvised as cold.
It is a tricky problem; I don't have a good solution for it at the moment :(
Does anyone have any good ideas? Please let me know, thanks!
If not, the MADV_COLD part might be removed in the next version.
--
Thanks,
Vernon
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 3/4] mm: khugepaged: set VM_NOHUGEPAGE flag when MADV_COLD/MADV_FREE
2025-12-29 5:51 ` [PATCH v2 3/4] mm: khugepaged: set VM_NOHUGEPAGE flag when MADV_COLD/MADV_FREE Vernon Yang
2025-12-29 8:20 ` Barry Song
@ 2025-12-30 19:54 ` David Hildenbrand (Red Hat)
2025-12-31 12:13 ` Vernon Yang
1 sibling, 1 reply; 18+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-30 19:54 UTC (permalink / raw)
To: Vernon Yang, akpm, lorenzo.stoakes
Cc: ziy, dev.jain, baohua, lance.yang, richard.weiyang, linux-mm,
linux-kernel, Vernon Yang
On 12/29/25 06:51, Vernon Yang wrote:
> For example, create three task: hot1 -> cold -> hot2. After all three
> task are created, each allocate memory 128MB. the hot1/hot2 task
> continuously access 128 MB memory, while the cold task only accesses
> its memory briefly and then call madvise(MADV_COLD). However, khugepaged
> still prioritizes scanning the cold task and only scans the hot2 task
> after completing the scan of the cold task.
>
> So if the user has explicitly informed us via MADV_COLD/FREE that this
> memory is cold or will be freed, it is appropriate for khugepaged to
> skip it only, thereby avoiding unnecessary scan and collapse operations
> to reducing CPU wastage.
>
> Here are the performance test results:
> (Throughput bigger is better, other smaller is better)
>
> Testing on x86_64 machine:
>
> | task hot2 | without patch | with patch | delta |
> |---------------------|---------------|---------------|---------|
> | total accesses time | 3.14 sec | 2.93 sec | -6.69% |
> | cycles per access | 4.96 | 2.21 | -55.44% |
> | Throughput | 104.38 M/sec | 111.89 M/sec | +7.19% |
> | dTLB-load-misses | 284814532 | 69597236 | -75.56% |
>
> Testing on qemu-system-x86_64 -enable-kvm:
>
> | task hot2 | without patch | with patch | delta |
> |---------------------|---------------|---------------|---------|
> | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
> | cycles per access | 7.29 | 2.07 | -71.60% |
> | Throughput | 97.67 M/sec | 110.77 M/sec | +13.41% |
> | dTLB-load-misses | 241600871 | 3216108 | -98.67% |
>
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> ---
As raised in v1, this is not the way to go. Just because something was
once indicated to be cold does not mean that it will stay like that
forever.
Also,
(1) You are turning this into an operation that will perform VMA
modifications and require the mmap lock in write mode, bad.
(2) You might now create many VMAs, possibly breaking user space, bad.
If user space knows that memory will stay cold, it can use madvise() to
indicate that these regions are not a good fit for THPs.
But are they really not a good fit? What about smaller-order THPs?
Nobody knows, but changing the behavior like you suggest is definetly
bad. :)
--
Cheers
David
^ permalink raw reply [flat|nested] 18+ messages in thread

* Re: [PATCH v2 3/4] mm: khugepaged: set VM_NOHUGEPAGE flag when MADV_COLD/MADV_FREE
2025-12-30 19:54 ` David Hildenbrand (Red Hat)
@ 2025-12-31 12:13 ` Vernon Yang
2025-12-31 12:19 ` David Hildenbrand (Red Hat)
0 siblings, 1 reply; 18+ messages in thread
From: Vernon Yang @ 2025-12-31 12:13 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: akpm, lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang,
richard.weiyang, linux-mm, linux-kernel, Vernon Yang
On Tue, Dec 30, 2025 at 08:54:33PM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/29/25 06:51, Vernon Yang wrote:
> > For example, create three tasks: hot1 -> cold -> hot2. After all three
> > tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
> > continuously access their 128 MB of memory, while the cold task only
> > accesses its memory briefly and then calls madvise(MADV_COLD). However,
> > khugepaged still prioritizes scanning the cold task and only scans the
> > hot2 task after completing the scan of the cold task.
> >
> > So if the user has explicitly informed us via MADV_COLD/FREE that this
> > memory is cold or will be freed, it is appropriate for khugepaged to
> > simply skip it, thereby avoiding unnecessary scan and collapse
> > operations and reducing CPU waste.
> >
> > Here are the performance test results:
> > (higher Throughput is better; for the others, lower is better)
> >
> > Testing on x86_64 machine:
> >
> > | task hot2 | without patch | with patch | delta |
> > |---------------------|---------------|---------------|---------|
> > | total accesses time | 3.14 sec | 2.93 sec | -6.69% |
> > | cycles per access | 4.96 | 2.21 | -55.44% |
> > | Throughput | 104.38 M/sec | 111.89 M/sec | +7.19% |
> > | dTLB-load-misses | 284814532 | 69597236 | -75.56% |
> >
> > Testing on qemu-system-x86_64 -enable-kvm:
> >
> > | task hot2 | without patch | with patch | delta |
> > |---------------------|---------------|---------------|---------|
> > | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
> > | cycles per access | 7.29 | 2.07 | -71.60% |
> > | Throughput | 97.67 M/sec | 110.77 M/sec | +13.41% |
> > | dTLB-load-misses | 241600871 | 3216108 | -98.67% |
> >
> > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > ---
>
> As raised in v1, this is not the way to go. Just because something was once
> indicated to be cold does not mean that it will stay like that forever.
>
> Also,
>
> (1) You are turning this into an operation that will perform VMA
> modifications and require the mmap lock in write mode, bad.
>
> (2) You might now create many VMAs, possibly breaking user space, bad.
>
> If user space knows that memory will stay cold, it can use madvise() to
> indicate that these regions are not a good fit for THPs.
>
> But are they really not a good fit? What about smaller-order THPs?
>
> Nobody knows, but changing the behavior like you suggest is definitely bad.
> :)
>
Thank you for the review and explanation. I got it.
For MADV_FREE, we will skip the lazy-free folios instead.
For MADV_COLD, it will be removed in the next version.
--
Thanks,
Vernon
* Re: [PATCH v2 3/4] mm: khugepaged: set VM_NOHUGEPAGE flag when MADV_COLD/MADV_FREE
2025-12-31 12:13 ` Vernon Yang
@ 2025-12-31 12:19 ` David Hildenbrand (Red Hat)
0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-31 12:19 UTC (permalink / raw)
To: Vernon Yang
Cc: akpm, lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang,
richard.weiyang, linux-mm, linux-kernel, Vernon Yang
On 12/31/25 13:13, Vernon Yang wrote:
> On Tue, Dec 30, 2025 at 08:54:33PM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/29/25 06:51, Vernon Yang wrote:
>>> For example, create three tasks: hot1 -> cold -> hot2. After all three
>>> tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
>>> continuously access their 128 MB of memory, while the cold task only
>>> accesses its memory briefly and then calls madvise(MADV_COLD). However,
>>> khugepaged still prioritizes scanning the cold task and only scans the
>>> hot2 task after completing the scan of the cold task.
>>>
>>> So if the user has explicitly informed us via MADV_COLD/FREE that this
>>> memory is cold or will be freed, it is appropriate for khugepaged to
>>> simply skip it, thereby avoiding unnecessary scan and collapse
>>> operations and reducing CPU waste.
>>>
>>> Here are the performance test results:
>>> (higher Throughput is better; for the others, lower is better)
>>>
>>> Testing on x86_64 machine:
>>>
>>> | task hot2 | without patch | with patch | delta |
>>> |---------------------|---------------|---------------|---------|
>>> | total accesses time | 3.14 sec | 2.93 sec | -6.69% |
>>> | cycles per access | 4.96 | 2.21 | -55.44% |
>>> | Throughput | 104.38 M/sec | 111.89 M/sec | +7.19% |
>>> | dTLB-load-misses | 284814532 | 69597236 | -75.56% |
>>>
>>> Testing on qemu-system-x86_64 -enable-kvm:
>>>
>>> | task hot2 | without patch | with patch | delta |
>>> |---------------------|---------------|---------------|---------|
>>> | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
>>> | cycles per access | 7.29 | 2.07 | -71.60% |
>>> | Throughput | 97.67 M/sec | 110.77 M/sec | +13.41% |
>>> | dTLB-load-misses | 241600871 | 3216108 | -98.67% |
>>>
>>> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
>>> ---
>>
>> As raised in v1, this is not the way to go. Just because something was once
>> indicated to be cold does not mean that it will stay like that forever.
>>
>> Also,
>>
>> (1) You are turning this into an operation that will perform VMA
>> modifications and require the mmap lock in write mode, bad.
>>
>> (2) You might now create many VMAs, possibly breaking user space, bad.
>>
>> If user space knows that memory will stay cold, it can use madvise() to
>> indicate that these regions are not a good fit for THPs.
>>
>> But are they really not a good fit? What about smaller-order THPs?
>>
>> Nobody knows, but changing the behavior like you suggest is definitely bad.
>> :)
>>
>
> Thank you for the review and explanation. I got it.
>
> For MADV_FREE, we will skip the lazy-free folios instead.
> For MADV_COLD, it will be removed in the next version.
Just to be clear, setting VM_NOHUGEPAGE should not be done from any of
these operations.
Treating lazyfree folios differently in khugepaged code could indeed
make sense.
--
Cheers
David
* [PATCH v2 4/4] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
2025-12-29 5:51 [PATCH v2 0/4] Improve khugepaged scan logic Vernon Yang
` (2 preceding siblings ...)
2025-12-29 5:51 ` [PATCH v2 3/4] mm: khugepaged: set VM_NOHUGEPAGE flag when MADV_COLD/MADV_FREE Vernon Yang
@ 2025-12-29 5:51 ` Vernon Yang
2025-12-30 20:03 ` David Hildenbrand (Red Hat)
2025-12-29 10:21 ` [syzbot ci] Re: Improve khugepaged scan logic syzbot ci
4 siblings, 1 reply; 18+ messages in thread
From: Vernon Yang @ 2025-12-29 5:51 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes
Cc: ziy, dev.jain, baohua, lance.yang, richard.weiyang, linux-mm,
linux-kernel, Vernon Yang
When an mm with the MMF_DISABLE_THP_COMPLETELY flag is detected during
scanning, directly set khugepaged_scan.mm_slot to the next mm_slot,
reducing a redundant operation.
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
mm/khugepaged.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2b3685b195f5..72be87ef384b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2439,6 +2439,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
cond_resched();
if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
+ vma = NULL;
progress++;
break;
}
@@ -2459,8 +2460,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
bool mmap_locked = true;
cond_resched();
- if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+ if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
+ vma = NULL;
goto breakouterloop;
+ }
VM_BUG_ON(khugepaged_scan.address < hstart ||
khugepaged_scan.address + HPAGE_PMD_SIZE >
@@ -2477,8 +2480,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
fput(file);
if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
mmap_read_lock(mm);
- if (hpage_collapse_test_exit_or_disable(mm))
+ if (hpage_collapse_test_exit_or_disable(mm)) {
+ vma = NULL;
goto breakouterloop;
+ }
*result = collapse_pte_mapped_thp(mm,
khugepaged_scan.address, false);
if (*result == SCAN_PMD_MAPPED)
--
2.51.0
* Re: [PATCH v2 4/4] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
2025-12-29 5:51 ` [PATCH v2 4/4] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY Vernon Yang
@ 2025-12-30 20:03 ` David Hildenbrand (Red Hat)
2025-12-31 2:51 ` Wei Yang
2025-12-31 10:57 ` Vernon Yang
0 siblings, 2 replies; 18+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-30 20:03 UTC (permalink / raw)
To: Vernon Yang, akpm, lorenzo.stoakes
Cc: ziy, dev.jain, baohua, lance.yang, richard.weiyang, linux-mm,
linux-kernel, Vernon Yang
On 12/29/25 06:51, Vernon Yang wrote:
> When an mm with the MMF_DISABLE_THP_COMPLETELY flag is detected during
> scanning, directly set khugepaged_scan.mm_slot to the next mm_slot,
> reducing a redundant operation.
>
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> ---
> mm/khugepaged.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2b3685b195f5..72be87ef384b 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2439,6 +2439,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>
> cond_resched();
> if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
> + vma = NULL;
> progress++;
> break;
> }
I don't understand why we need changes at all.
The code is
mm = slot->mm;
/*
* Don't wait for semaphore (to avoid long wait times). Just move to
* the next mm on the list.
*/
vma = NULL;
if (unlikely(!mmap_read_trylock(mm)))
goto breakouterloop_mmap_lock;
progress++;
if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
goto breakouterloop;
...
So we'll go straight to breakouterloop with vma=NULL.
Do you want to optimize for skipping the MM if the flag gets toggled
while we are scanning that MM?
Is that really something we should be worrying about?
Also, why can't we simply do a
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 97d1b2824386f..af8481d4b0f4e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2516,7 +2516,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
* Release the current mm_slot if this mm is about to die, or
* if we scanned all vmas of this mm.
*/
- if (hpage_collapse_test_exit(mm) || !vma) {
+ if (hpage_collapse_test_exit_or_disable(mm) || !vma) {
/*
* Make sure that if mm_users is reaching zero while
* khugepaged runs here, khugepaged_exit will find
--
Cheers
David
* Re: [PATCH v2 4/4] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
2025-12-30 20:03 ` David Hildenbrand (Red Hat)
@ 2025-12-31 2:51 ` Wei Yang
2025-12-31 12:21 ` David Hildenbrand (Red Hat)
2025-12-31 10:57 ` Vernon Yang
1 sibling, 1 reply; 18+ messages in thread
From: Wei Yang @ 2025-12-31 2:51 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: Vernon Yang, akpm, lorenzo.stoakes, ziy, dev.jain, baohua,
lance.yang, richard.weiyang, linux-mm, linux-kernel, Vernon Yang
On Tue, Dec 30, 2025 at 09:03:23PM +0100, David Hildenbrand (Red Hat) wrote:
>On 12/29/25 06:51, Vernon Yang wrote:
>> When an mm with the MMF_DISABLE_THP_COMPLETELY flag is detected during
>> scanning, directly set khugepaged_scan.mm_slot to the next mm_slot,
>> reducing a redundant operation.
>>
>> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
>> ---
>> mm/khugepaged.c | 9 +++++++--
>> 1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 2b3685b195f5..72be87ef384b 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -2439,6 +2439,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>> cond_resched();
>> if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
>> + vma = NULL;
>> progress++;
>> break;
>> }
>
>I don't understand why we need changes at all.
>
>The code is
>
> mm = slot->mm;
> /*
> * Don't wait for semaphore (to avoid long wait times). Just move to
> * the next mm on the list.
> */
> vma = NULL;
> if (unlikely(!mmap_read_trylock(mm)))
> goto breakouterloop_mmap_lock;
>
> progress++;
> if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> goto breakouterloop;
>
> ...
>
>So we'll go straight to breakouterloop with vma=NULL.
>
>Do you want to optimize for skipping the MM if the flag gets toggled
>while we are scanning that MM?
>
>Is that really something we should be worrying about?
>
>Also, why can't we simply do a
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index 97d1b2824386f..af8481d4b0f4e 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -2516,7 +2516,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> * Release the current mm_slot if this mm is about to die, or
> * if we scanned all vmas of this mm.
> */
>- if (hpage_collapse_test_exit(mm) || !vma) {
>+ if (hpage_collapse_test_exit_or_disable(mm) || !vma) {
> /*
> * Make sure that if mm_users is reaching zero while
> * khugepaged runs here, khugepaged_exit will find
>
This one looks better.
But the sad thing is we can't remove this mm from the scan list, since the
user may toggle this flag later.
>
>--
>Cheers
>
>David
--
Wei Yang
Help you, Help me
* Re: [PATCH v2 4/4] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
2025-12-31 2:51 ` Wei Yang
@ 2025-12-31 12:21 ` David Hildenbrand (Red Hat)
0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-31 12:21 UTC (permalink / raw)
To: Wei Yang
Cc: Vernon Yang, akpm, lorenzo.stoakes, ziy, dev.jain, baohua,
lance.yang, linux-mm, linux-kernel, Vernon Yang
On 12/31/25 03:51, Wei Yang wrote:
> On Tue, Dec 30, 2025 at 09:03:23PM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/29/25 06:51, Vernon Yang wrote:
>>> When an mm with the MMF_DISABLE_THP_COMPLETELY flag is detected during
>>> scanning, directly set khugepaged_scan.mm_slot to the next mm_slot,
>>> reducing a redundant operation.
>>>
>>> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
>>> ---
>>> mm/khugepaged.c | 9 +++++++--
>>> 1 file changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 2b3685b195f5..72be87ef384b 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -2439,6 +2439,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>>> cond_resched();
>>> if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
>>> + vma = NULL;
>>> progress++;
>>> break;
>>> }
>>
>> I don't understand why we need changes at all.
>>
>> The code is
>>
>> mm = slot->mm;
>> /*
>> * Don't wait for semaphore (to avoid long wait times). Just move to
>> * the next mm on the list.
>> */
>> vma = NULL;
>> if (unlikely(!mmap_read_trylock(mm)))
>> goto breakouterloop_mmap_lock;
>>
>> progress++;
>> if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
>> goto breakouterloop;
>>
>> ...
>>
>> So we'll go straight to breakouterloop with vma=NULL.
>>
>> Do you want to optimize for skipping the MM if the flag gets toggled
>> while we are scanning that MM?
>>
>> Is that really something we should be worrying about?
>>
>> Also, why can't we simply do a
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 97d1b2824386f..af8481d4b0f4e 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -2516,7 +2516,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>> * Release the current mm_slot if this mm is about to die, or
>> * if we scanned all vmas of this mm.
>> */
>> - if (hpage_collapse_test_exit(mm) || !vma) {
>> + if (hpage_collapse_test_exit_or_disable(mm) || !vma) {
>> /*
>> * Make sure that if mm_users is reaching zero while
>> * khugepaged runs here, khugepaged_exit will find
>>
>
> This one looks better.
>
> But the sad thing is we can't remove this mm from the scan list, since the
> user may toggle this flag later.
In theory we could re-add it to the list once the flag gets toggled.
In fact, we could remove it from the list once we set the flag. But not
sure if that ends up any cleaner (dealing with races? not sure).
--
Cheers
David
* Re: [PATCH v2 4/4] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
2025-12-30 20:03 ` David Hildenbrand (Red Hat)
2025-12-31 2:51 ` Wei Yang
@ 2025-12-31 10:57 ` Vernon Yang
1 sibling, 0 replies; 18+ messages in thread
From: Vernon Yang @ 2025-12-31 10:57 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: akpm, lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang,
richard.weiyang, linux-mm, linux-kernel, Vernon Yang
On Wed, Dec 31, 2025 at 4:03 AM David Hildenbrand (Red Hat)
<david@kernel.org> wrote:
>
> On 12/29/25 06:51, Vernon Yang wrote:
> > When an mm with the MMF_DISABLE_THP_COMPLETELY flag is detected during
> > scanning, directly set khugepaged_scan.mm_slot to the next mm_slot,
> > reducing a redundant operation.
> >
> > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > ---
> > mm/khugepaged.c | 9 +++++++--
> > 1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 2b3685b195f5..72be87ef384b 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2439,6 +2439,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >
> > cond_resched();
> > if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
> > + vma = NULL;
> > progress++;
> > break;
> > }
>
> I don't understand why we need changes at all.
>
> The code is
>
> mm = slot->mm;
> /*
> * Don't wait for semaphore (to avoid long wait times). Just move to
> * the next mm on the list.
> */
> vma = NULL;
> if (unlikely(!mmap_read_trylock(mm)))
> goto breakouterloop_mmap_lock;
>
> progress++;
> if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
> goto breakouterloop;
>
> ...
>
> So we'll go straight to breakouterloop with vma=NULL.
>
> Do you want to optimize for skipping the MM if the flag gets toggled
> while we are scanning that MM?
Yes
> Is that really something we should be worrying about?
It just reduces a redundant operation.
Before this optimization, we only set khugepaged_scan.mm_slot to the next
mm_slot on the next entry into khugepaged_scan_mm_slot(), where vma == NULL.
After it, we set khugepaged_scan.mm_slot to the next mm_slot directly.
> Also, why can't we simply do a
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 97d1b2824386f..af8481d4b0f4e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2516,7 +2516,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> * Release the current mm_slot if this mm is about to die, or
> * if we scanned all vmas of this mm.
> */
> - if (hpage_collapse_test_exit(mm) || !vma) {
> + if (hpage_collapse_test_exit_or_disable(mm) || !vma) {
> /*
> * Make sure that if mm_users is reaching zero while
> * khugepaged runs here, khugepaged_exit will find
>
Sounds good to me. Thank you for your review and suggestion; I will do it
in the next version.
--
Thanks,
Vernon
* [syzbot ci] Re: Improve khugepaged scan logic
2025-12-29 5:51 [PATCH v2 0/4] Improve khugepaged scan logic Vernon Yang
` (3 preceding siblings ...)
2025-12-29 5:51 ` [PATCH v2 4/4] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY Vernon Yang
@ 2025-12-29 10:21 ` syzbot ci
4 siblings, 0 replies; 18+ messages in thread
From: syzbot ci @ 2025-12-29 10:21 UTC (permalink / raw)
To: akpm, baohua, david, dev.jain, lance.yang, linux-kernel,
linux-mm, lorenzo.stoakes, richard.weiyang, vernon2gm,
yanglincheng, ziy
Cc: syzbot, syzkaller-bugs
syzbot ci has tested the following series
[v2] Improve khugepaged scan logic
https://lore.kernel.org/all/20251229055151.54887-1-yanglincheng@kylinos.cn
* [PATCH v2 1/4] mm: khugepaged: add trace_mm_khugepaged_scan event
* [PATCH v2 2/4] mm: khugepaged: just skip when the memory has been collapsed
* [PATCH v2 3/4] mm: khugepaged: set VM_NOHUGEPAGE flag when MADV_COLD/MADV_FREE
* [PATCH v2 4/4] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
and found the following issue:
WARNING in madvise_dontneed_free
Full report is available here:
https://ci.syzbot.org/series/f936dff1-2423-4f46-a59a-ea041c1d741a
***
WARNING in madvise_dontneed_free
tree: mm-new
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
base: 33b485bade996a9d0154cf0888b7a5c23723121e
arch: amd64
compiler: Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config: https://ci.syzbot.org/builds/81f62216-5094-4281-a942-238b7448a3be/config
C repro: https://ci.syzbot.org/findings/e308c3a0-c806-45c4-bc1c-24536a3c3ca3/c_repro
syz repro: https://ci.syzbot.org/findings/e308c3a0-c806-45c4-bc1c-24536a3c3ca3/syz_repro
------------[ cut here ]------------
WARNING: mm/madvise.c:795 at get_walk_lock mm/madvise.c:795 [inline], CPU#0: syz.0.17/5977
WARNING: mm/madvise.c:795 at madvise_free_single_vma mm/madvise.c:830 [inline], CPU#0: syz.0.17/5977
WARNING: mm/madvise.c:795 at madvise_dontneed_free+0xb52/0xe10 mm/madvise.c:960, CPU#0: syz.0.17/5977
Modules linked in:
CPU: 0 UID: 0 PID: 5977 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:get_walk_lock mm/madvise.c:795 [inline]
RIP: 0010:madvise_free_single_vma mm/madvise.c:830 [inline]
RIP: 0010:madvise_dontneed_free+0xb52/0xe10 mm/madvise.c:960
Code: c7 c6 b0 6e 25 8e e8 7d 4c a3 ff 48 83 fb 01 74 0c 83 fb 03 75 0e e8 ed 46 a3 ff eb 12 e8 e6 46 a3 ff eb 09 e8 df 46 a3 ff 90 <0f> 0b 90 31 db 89 9c 24 08 01 00 00 48 8b 74 24 68 48 8b 54 24 70
RSP: 0018:ffffc90004a17400 EFLAGS: 00010293
RAX: ffffffff821e7411 RBX: 0000000000000002 RCX: ffff888169b7d7c0
RDX: 0000000000000000 RSI: ffffffff8e256eb0 RDI: 0000000000000002
RBP: ffffc90004a175b0 R08: ffff888169b7d7c0 R09: 0000000000000002
R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000000
R13: dffffc0000000000 R14: 0000000000000100 R15: 1ffff92000942e88
FS: 0000555555761500(0000) GS:ffff88818e62f000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe49d72b600 CR3: 00000001b85b6000 CR4: 00000000000006f0
Call Trace:
<TASK>
madvise_vma_behavior+0xd57/0x3680 mm/madvise.c:1385
madvise_walk_vmas+0x575/0xaf0 mm/madvise.c:1730
madvise_do_behavior+0x38e/0x550 mm/madvise.c:1944
do_madvise+0x1bc/0x270 mm/madvise.c:2037
__do_sys_madvise mm/madvise.c:2046 [inline]
__se_sys_madvise mm/madvise.c:2044 [inline]
__x64_sys_madvise+0xa7/0xc0 mm/madvise.c:2044
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fe49d78f7c9
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fff39cea178 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 00007fe49d9e5fa0 RCX: 00007fe49d78f7c9
RDX: 0000000000000008 RSI: 0000000000600002 RDI: 0000200000000000
RBP: 00007fe49d7f297f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fe49d9e5fa0 R14: 00007fe49d9e5fa0 R15: 0000000000000003
</TASK>
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.