* [PATCH mm-new v6 1/5] mm: khugepaged: add trace_mm_khugepaged_scan event
2026-02-01 12:25 [PATCH mm-new v6 0/5] Improve khugepaged scan logic Vernon Yang
@ 2026-02-01 12:25 ` Vernon Yang
2026-02-01 12:25 ` [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number Vernon Yang
` (3 subsequent siblings)
4 siblings, 0 replies; 26+ messages in thread
From: Vernon Yang @ 2026-02-01 12:25 UTC (permalink / raw)
To: akpm, david
Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
From: Vernon Yang <yanglincheng@kylinos.cn>
Add the mm_khugepaged_scan event to track the total time of a full
scan and the total number of pages scanned by khugepaged.
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Dev Jain <dev.jain@arm.com>
---
include/trace/events/huge_memory.h | 25 +++++++++++++++++++++++++
mm/khugepaged.c | 2 ++
2 files changed, 27 insertions(+)
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 4e41bff31888..384e29f6bef0 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -237,5 +237,30 @@ TRACE_EVENT(mm_khugepaged_collapse_file,
__print_symbolic(__entry->result, SCAN_STATUS))
);
+TRACE_EVENT(mm_khugepaged_scan,
+
+ TP_PROTO(struct mm_struct *mm, unsigned int progress,
+ bool full_scan_finished),
+
+ TP_ARGS(mm, progress, full_scan_finished),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct *, mm)
+ __field(unsigned int, progress)
+ __field(bool, full_scan_finished)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __entry->progress = progress;
+ __entry->full_scan_finished = full_scan_finished;
+ ),
+
+ TP_printk("mm=%p, progress=%u, full_scan_finished=%d",
+ __entry->mm,
+ __entry->progress,
+ __entry->full_scan_finished)
+);
+
#endif /* __HUGE_MEMORY_H */
#include <trace/define_trace.h>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1d994b6c58c6..d94b34e10bdf 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2534,6 +2534,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
collect_mm_slot(slot);
}
+ trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
+
return progress;
}
--
2.51.0
^ permalink raw reply [flat|nested] 26+ messages in thread

* [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-01 12:25 [PATCH mm-new v6 0/5] Improve khugepaged scan logic Vernon Yang
2026-02-01 12:25 ` [PATCH mm-new v6 1/5] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
@ 2026-02-01 12:25 ` Vernon Yang
2026-02-04 21:35 ` David Hildenbrand (arm)
2026-02-01 12:25 ` [PATCH mm-new v6 3/5] mm: add folio_test_lazyfree helper Vernon Yang
` (2 subsequent siblings)
4 siblings, 1 reply; 26+ messages in thread
From: Vernon Yang @ 2026-02-01 12:25 UTC (permalink / raw)
To: akpm, david
Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
From: Vernon Yang <yanglincheng@kylinos.cn>
Currently, each scan always increases "progress" by HPAGE_PMD_NR,
even if only scanning a single PTE/PMD entry.
- When only scanning a single PTE entry, let me provide a detailed
example:
static int hpage_collapse_scan_pmd()
{
for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
_pte++, addr += PAGE_SIZE) {
pte_t pteval = ptep_get(_pte);
...
if (pte_uffd_wp(pteval)) { <-- first scan hit
result = SCAN_PTE_UFFD_WP;
goto out_unmap;
}
}
}
During the first scan, if pte_uffd_wp(pteval) is true, the loop exits
directly; in practice, only one PTE is scanned before termination.
Here, "progress += 1" reflects the actual number of PTEs scanned,
whereas previously "progress" was always increased by HPAGE_PMD_NR.
- When the memory has been collapsed to a PMD, let me provide a detailed
example:
The following data is traced by bpftrace on a desktop system. After
the system has been left idle for 10 minutes upon booting, a lot of
SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE results are observed during a full scan
by khugepaged.
From trace_mm_khugepaged_scan_pmd and trace_mm_khugepaged_scan_file, the
following statuses were observed, with frequency mentioned next to them:
SCAN_SUCCEED : 1
SCAN_EXCEED_SHARED_PTE: 2
SCAN_PMD_MAPPED : 142
SCAN_NO_PTE_TABLE : 178
total progress size : 674 MB
Total time : 419 seconds, including khugepaged_scan_sleep_millisecs
The khugepaged_scan list holds every task that supports collapsing into
hugepages; as long as a task is not destroyed, khugepaged never removes
it from the list. As a result, a task may have already collapsed all of
its memory regions into hugepages, yet khugepaged keeps scanning it,
which wastes CPU time to no effect. Combined with
khugepaged_scan_sleep_millisecs (default 10s), this means that scanning
a large number of such stale tasks delays the scan of tasks that still
actually need it.
After applying this patch, when the memory is either SCAN_PMD_MAPPED or
SCAN_NO_PTE_TABLE, it is simply skipped, as follows:
SCAN_EXCEED_SHARED_PTE: 2
SCAN_PMD_MAPPED : 147
SCAN_NO_PTE_TABLE : 173
total progress size : 45 MB
Total time : 20 seconds
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
mm/khugepaged.c | 41 +++++++++++++++++++++++++++++++----------
1 file changed, 31 insertions(+), 10 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d94b34e10bdf..df22b2274d92 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -68,7 +68,10 @@ enum scan_result {
static struct task_struct *khugepaged_thread __read_mostly;
static DEFINE_MUTEX(khugepaged_mutex);
-/* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 second */
+/*
+ * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
+ * every 10 second.
+ */
static unsigned int khugepaged_pages_to_scan __read_mostly;
static unsigned int khugepaged_pages_collapsed;
static unsigned int khugepaged_full_scans;
@@ -1240,7 +1243,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
}
static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long start_addr, bool *mmap_locked,
+ struct vm_area_struct *vma, unsigned long start_addr,
+ bool *mmap_locked, unsigned int *cur_progress,
struct collapse_control *cc)
{
pmd_t *pmd;
@@ -1256,13 +1260,18 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
- if (result != SCAN_SUCCEED)
+ if (result != SCAN_SUCCEED) {
+ if (cur_progress)
+ *cur_progress = 1;
goto out;
+ }
memset(cc->node_load, 0, sizeof(cc->node_load));
nodes_clear(cc->alloc_nmask);
pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
if (!pte) {
+ if (cur_progress)
+ *cur_progress = 1;
result = SCAN_NO_PTE_TABLE;
goto out;
}
@@ -1396,6 +1405,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
result = SCAN_SUCCEED;
}
out_unmap:
+ if (cur_progress) {
+ if (_pte >= pte + HPAGE_PMD_NR)
+ *cur_progress = HPAGE_PMD_NR;
+ else
+ *cur_progress = _pte - pte + 1;
+ }
pte_unmap_unlock(pte, ptl);
if (result == SCAN_SUCCEED) {
result = collapse_huge_page(mm, start_addr, referenced,
@@ -2286,8 +2301,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
return result;
}
-static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
- struct file *file, pgoff_t start, struct collapse_control *cc)
+static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
+ unsigned long addr, struct file *file, pgoff_t start,
+ unsigned int *cur_progress, struct collapse_control *cc)
{
struct folio *folio = NULL;
struct address_space *mapping = file->f_mapping;
@@ -2376,6 +2392,8 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
cond_resched_rcu();
}
}
+ if (cur_progress)
+ *cur_progress = max(xas.xa_index - start, 1UL);
rcu_read_unlock();
if (result == SCAN_SUCCEED) {
@@ -2455,6 +2473,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
while (khugepaged_scan.address < hend) {
bool mmap_locked = true;
+ unsigned int cur_progress = 0;
cond_resched();
if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
@@ -2471,7 +2490,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
mmap_read_unlock(mm);
mmap_locked = false;
*result = hpage_collapse_scan_file(mm,
- khugepaged_scan.address, file, pgoff, cc);
+ khugepaged_scan.address, file, pgoff,
+ &cur_progress, cc);
fput(file);
if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
mmap_read_lock(mm);
@@ -2485,7 +2505,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
}
} else {
*result = hpage_collapse_scan_pmd(mm, vma,
- khugepaged_scan.address, &mmap_locked, cc);
+ khugepaged_scan.address, &mmap_locked,
+ &cur_progress, cc);
}
if (*result == SCAN_SUCCEED)
@@ -2493,7 +2514,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
/* move to next address */
khugepaged_scan.address += HPAGE_PMD_SIZE;
- progress += HPAGE_PMD_NR;
+ progress += cur_progress;
if (!mmap_locked)
/*
* We released mmap_lock so break loop. Note
@@ -2816,7 +2837,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
mmap_locked = false;
*lock_dropped = true;
result = hpage_collapse_scan_file(mm, addr, file, pgoff,
- cc);
+ NULL, cc);
if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
mapping_can_writeback(file->f_mapping)) {
@@ -2831,7 +2852,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
fput(file);
} else {
result = hpage_collapse_scan_pmd(mm, vma, addr,
- &mmap_locked, cc);
+ &mmap_locked, NULL, cc);
}
if (!mmap_locked)
*lock_dropped = true;
--
2.51.0
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-01 12:25 ` [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number Vernon Yang
@ 2026-02-04 21:35 ` David Hildenbrand (arm)
2026-02-05 6:08 ` Vernon Yang
0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand (arm) @ 2026-02-04 21:35 UTC (permalink / raw)
To: Vernon Yang, akpm
Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
[...]
> + if (cur_progress) {
> + if (_pte >= pte + HPAGE_PMD_NR)
> + *cur_progress = HPAGE_PMD_NR;
> + else
> + *cur_progress = _pte - pte + 1;
*cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
?
It's still a bit nasty, though.
Can't we just add one at the beginning of the loop and let the compiler
optimize that? ;)
> + }
> pte_unmap_unlock(pte, ptl);
> if (result == SCAN_SUCCEED) {
> result = collapse_huge_page(mm, start_addr, referenced,
> @@ -2286,8 +2301,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> return result;
> }
>
> -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> - struct file *file, pgoff_t start, struct collapse_control *cc)
> +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
> + unsigned long addr, struct file *file, pgoff_t start,
> + unsigned int *cur_progress, struct collapse_control *cc)
> {
> struct folio *folio = NULL;
> struct address_space *mapping = file->f_mapping;
> @@ -2376,6 +2392,8 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
> cond_resched_rcu();
> }
> }
> + if (cur_progress)
> + *cur_progress = max(xas.xa_index - start, 1UL);
I would really just keep it simple here and do a
*cur_progress = HPAGE_PMD_NR;
This stuff is hard to reason about, so I would just leave the file case
essentially unchanged.
IIRC, it would not affect the numbers you report in the patch description?
--
Cheers,
David
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-04 21:35 ` David Hildenbrand (arm)
@ 2026-02-05 6:08 ` Vernon Yang
2026-02-05 12:07 ` Dev Jain
2026-02-05 12:11 ` David Hildenbrand (arm)
0 siblings, 2 replies; 26+ messages in thread
From: Vernon Yang @ 2026-02-05 6:08 UTC (permalink / raw)
To: David Hildenbrand (arm)
Cc: akpm, lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang,
linux-mm, linux-kernel, Vernon Yang
On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm) <david@kernel.org> wrote:
>
> [...]
>
> > + if (cur_progress) {
> > + if (_pte >= pte + HPAGE_PMD_NR)
> > + *cur_progress = HPAGE_PMD_NR;
> > + else
> > + *cur_progress = _pte - pte + 1;
>
> *cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
> ?
>
> It's still a bit nasty, though.
>
> Can't we just add one at the beginning of the loop and let the compiler
> optimize that? ;)
I'm also worried that the compiler can't optimize this since the body of
the loop is complex, as with Dev's opinion [1].
[1] https://lore.kernel.org/linux-mm/7c4b5933-7bbd-4ad7-baef-830304a09485@arm.com
If you have a strong recommendation for this, please let me know, Thanks!
> > + }
> > pte_unmap_unlock(pte, ptl);
> > if (result == SCAN_SUCCEED) {
> > result = collapse_huge_page(mm, start_addr, referenced,
> > @@ -2286,8 +2301,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> > return result;
> > }
> >
> > -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> > - struct file *file, pgoff_t start, struct collapse_control *cc)
> > +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
> > + unsigned long addr, struct file *file, pgoff_t start,
> > + unsigned int *cur_progress, struct collapse_control *cc)
> > {
> > struct folio *folio = NULL;
> > struct address_space *mapping = file->f_mapping;
> > @@ -2376,6 +2392,8 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
> > cond_resched_rcu();
> > }
> > }
> > + if (cur_progress)
> > + *cur_progress = max(xas.xa_index - start, 1UL);
> I would really just keep it simple here and do a
>
> *cur_progress = HPAGE_PMD_NR;
>
> This stuff is hard to reason about, so I would just leave the file case
> essentially unchanged.
>
> IIRC, it would not affect the numbers you report in the patch description?
Yes, let's keep it simple: always equal to HPAGE_PMD_NR in the file case.
--
Thanks,
Vernon
2026-02-05 6:08 ` Vernon Yang
@ 2026-02-05 12:07 ` Dev Jain
2026-02-05 12:28 ` David Hildenbrand (Arm)
2026-02-05 12:11 ` David Hildenbrand (arm)
1 sibling, 1 reply; 26+ messages in thread
From: Dev Jain @ 2026-02-05 12:07 UTC (permalink / raw)
To: Vernon Yang, David Hildenbrand (arm)
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 05/02/26 11:38 am, Vernon Yang wrote:
> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm) <david@kernel.org> wrote:
>> [...]
>>
>>> + if (cur_progress) {
>>> + if (_pte >= pte + HPAGE_PMD_NR)
>>> + *cur_progress = HPAGE_PMD_NR;
>>> + else
>>> + *cur_progress = _pte - pte + 1;
>> *cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>
>> ?
>>
>> It's still a bit nasty, though.
>>
>> Can't we just add one at the beginning of the loop and let the compiler
>> optimize that? ;)
> I'm also worried that the compiler can't optimize this since the body of
> the loop is complex, as with Dev's opinion [1].
>
> [1] https://lore.kernel.org/linux-mm/7c4b5933-7bbd-4ad7-baef-830304a09485@arm.com
>
> If you have a strong recommendation for this, please let me know, Thanks!
I haven't explicitly checked with assembly, but I am fairly sure this won't get optimized.
There are two cases where it could have been optimized:
1) Had the compiler inlined hpage_collapse_scan_pmd
2) Had the compiler done something like
if (p) -> foo(), where foo() contains the complete for loop, with the increment
else -> bar(), where bar() contains the complete for loop, without the increment
Both of which are highly unlikely because of the complexity of the function.
>
>>> + }
>>> pte_unmap_unlock(pte, ptl);
>>> if (result == SCAN_SUCCEED) {
>>> result = collapse_huge_page(mm, start_addr, referenced,
>>> @@ -2286,8 +2301,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>>> return result;
>>> }
>>>
>>> -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
>>> - struct file *file, pgoff_t start, struct collapse_control *cc)
>>> +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
>>> + unsigned long addr, struct file *file, pgoff_t start,
>>> + unsigned int *cur_progress, struct collapse_control *cc)
>>> {
>>> struct folio *folio = NULL;
>>> struct address_space *mapping = file->f_mapping;
>>> @@ -2376,6 +2392,8 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
>>> cond_resched_rcu();
>>> }
>>> }
>>> + if (cur_progress)
>>> + *cur_progress = max(xas.xa_index - start, 1UL);
>> I would really just keep it simple here and do a
>>
>> *cur_progress = HPAGE_PMD_NR;
>>
>> This stuff is hard to reason about, so I would just leave the file case
>> essentially unchanged.
>>
>> IIRC, it would not affect the numbers you report in the patch description?
> Yes, Let's keep it simple, always equal to HPAGE_PMD_NR in file case.
>
> --
> Thanks,
> Vernon
2026-02-05 12:07 ` Dev Jain
@ 2026-02-05 12:28 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-05 12:28 UTC (permalink / raw)
To: Dev Jain, Vernon Yang
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 2/5/26 13:07, Dev Jain wrote:
>
> On 05/02/26 11:38 am, Vernon Yang wrote:
>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm) <david@kernel.org> wrote:
>>> [...]
>>>
>>> *cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>>
>>> ?
>>>
>>> It's still a bit nasty, though.
>>>
>>> Can't we just add one at the beginning of the loop and let the compiler
>>> optimize that? ;)
>> I'm also worried that the compiler can't optimize this since the body of
>> the loop is complex, as with Dev's opinion [1].
>>
>> [1] https://lore.kernel.org/linux-mm/7c4b5933-7bbd-4ad7-baef-830304a09485@arm.com
>>
>> If you have a strong recommendation for this, please let me know, Thanks!
>
> I haven't explicitly checked with assembly, but I am fairly sure this won't get optimized.
> There are two cases where it could have been optimized:
>
> 1) Had the compiler inlined hpage_collapse_scan_pmd
Yeah, there are two callers so that likely does not happen.
> 2) Had the compiler done something like
> if (p) -> foo(), where foo() contains the complete for loop, with the increment
> else -> bar(), where bar() contains the complete for loop, without the increment
>
> Both of which are highly unlikely because of the complexity of the function.
Not sure if the compiler would be able to optimize this out in
non-inlined cases, either. In any case, I wonder if this must be
optimized at all ...
--
Cheers,
David
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-05 6:08 ` Vernon Yang
2026-02-05 12:07 ` Dev Jain
@ 2026-02-05 12:11 ` David Hildenbrand (arm)
2026-02-05 14:25 ` Dev Jain
1 sibling, 1 reply; 26+ messages in thread
From: David Hildenbrand (arm) @ 2026-02-05 12:11 UTC (permalink / raw)
To: Vernon Yang
Cc: akpm, lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang,
linux-mm, linux-kernel, Vernon Yang
On 2/5/26 07:08, Vernon Yang wrote:
> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm) <david@kernel.org> wrote:
>>
>> [...]
>>
>>> + if (cur_progress) {
>>> + if (_pte >= pte + HPAGE_PMD_NR)
>>> + *cur_progress = HPAGE_PMD_NR;
>>> + else
>>> + *cur_progress = _pte - pte + 1;
>>
>> *cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
>
> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
Yes!
>
>> ?
>>
>> It's still a bit nasty, though.
>>
>> Can't we just add one at the beginning of the loop and let the compiler
>> optimize that? ;)
>
> I'm also worried that the compiler can't optimize this since the body of
> the loop is complex, as with Dev's opinion [1].
Why do we even have to optimize this? :)
Premature ... ? :)
--
Cheers,
David
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-05 12:11 ` David Hildenbrand (arm)
@ 2026-02-05 14:25 ` Dev Jain
2026-02-05 14:30 ` Dev Jain
2026-02-06 9:02 ` David Hildenbrand (Arm)
0 siblings, 2 replies; 26+ messages in thread
From: Dev Jain @ 2026-02-05 14:25 UTC (permalink / raw)
To: David Hildenbrand (arm), Vernon Yang
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
> On 2/5/26 07:08, Vernon Yang wrote:
>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
>> <david@kernel.org> wrote:
>>>
>>> [...]
>>>
>>>> + if (cur_progress) {
>>>> + if (_pte >= pte + HPAGE_PMD_NR)
>>>> + *cur_progress = HPAGE_PMD_NR;
>>>> + else
>>>> + *cur_progress = _pte - pte + 1;
>>>
>>> *cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
>>
>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>
> Yes!
>
>>
>>> ?
>>>
>>> It's still a bit nasty, though.
>>>
>>> Can't we just add one at the beginning of the loop and let the compiler
>>> optimize that? ;)
>>
>> I'm also worried that the compiler can't optimize this since the body of
>> the loop is complex, as with Dev's opinion [1].
>
> Why do we even have to optimize this? :)
>
> Premature ... ? :)
I mean .... we don't, but the alternate is a one liner using max().
The objective is to compute the number of iterations of the for-loop.
It just seems weird to me to track that in the loop, when we have the
loop iterator, which *literally* does that only.
Anyhow, I won't shout in any case : ) If you deem incrementing in the
loop prettier, that's fine.
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-05 14:25 ` Dev Jain
@ 2026-02-05 14:30 ` Dev Jain
2026-02-06 9:03 ` David Hildenbrand (Arm)
2026-02-06 9:02 ` David Hildenbrand (Arm)
1 sibling, 1 reply; 26+ messages in thread
From: Dev Jain @ 2026-02-05 14:30 UTC (permalink / raw)
To: David Hildenbrand (arm), Vernon Yang
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 05/02/26 7:55 pm, Dev Jain wrote:
> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
>> On 2/5/26 07:08, Vernon Yang wrote:
>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
>>> <david@kernel.org> wrote:
>>>> [...]
>>>>
>>>>> + if (cur_progress) {
>>>>> + if (_pte >= pte + HPAGE_PMD_NR)
>>>>> + *cur_progress = HPAGE_PMD_NR;
>>>>> + else
>>>>> + *cur_progress = _pte - pte + 1;
>>>> *cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>> Yes!
>>
>>>> ?
>>>>
>>>> It's still a bit nasty, though.
>>>>
>>>> Can't we just add one at the beginning of the loop and let the compiler
>>>> optimize that? ;)
>>> I'm also worried that the compiler can't optimize this since the body of
>>> the loop is complex, as with Dev's opinion [1].
>> Why do we even have to optimize this? :)
>>
>> Premature ... ? :)
>
> I mean .... we don't, but the alternate is a one liner using max().
>
> The objective is to compute the number of iterations of the for-loop.
>
> It just seems weird to me to track that in the loop, when we have the
>
> loop iterator, which *literally* does that only.
I realize I shouldn't have bolded out the "literally" - below I wrote that
I won't shout, but the bold seems like shouting :)
>
>
>
> Anyhow, I won't shout in any case : ) If you deem incrementing in the
>
> loop prettier, that's fine.
>
>
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-05 14:30 ` Dev Jain
@ 2026-02-06 9:03 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-06 9:03 UTC (permalink / raw)
To: Dev Jain, Vernon Yang
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 2/5/26 15:30, Dev Jain wrote:
>
> On 05/02/26 7:55 pm, Dev Jain wrote:
>> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
>>> Yes!
>>>
>>> Why do we even have to optimize this? :)
>>>
>>> Premature ... ? :)
>>
>> I mean .... we don't, but the alternate is a one liner using max().
>>
>> The objective is to compute the number of iterations of the for-loop.
>>
>> It just seems weird to me to track that in the loop, when we have the
>>
>> loop iterator, which *literally* does that only.
>
> I realize I shouldn't have bolded out the "literally" - below I wrote that
> I won't shout, but the bold seems like shouting :)
Heh.
The thing is that the loop iterator does not quite what we want,
otherwise we wouldn't have to mess with max() etc.
--
Cheers,
David
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-05 14:25 ` Dev Jain
2026-02-05 14:30 ` Dev Jain
@ 2026-02-06 9:02 ` David Hildenbrand (Arm)
2026-02-06 10:00 ` Dev Jain
2026-02-06 11:12 ` Vernon Yang
1 sibling, 2 replies; 26+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-06 9:02 UTC (permalink / raw)
To: Dev Jain, Vernon Yang
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 2/5/26 15:25, Dev Jain wrote:
>
> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
>> On 2/5/26 07:08, Vernon Yang wrote:
>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
>>> <david@kernel.org> wrote:
>>>
>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>>
>> Yes!
>>
>>>
>>>
>>> I'm also worried that the compiler can't optimize this since the body of
>>> the loop is complex, as with Dev's opinion [1].
>>
>> Why do we even have to optimize this? :)
>>
>> Premature ... ? :)
>
>
> I mean .... we don't, but the alternate is a one liner using max().
I'm fine with the max(), but it still seems like adding complexity to
optimize something that is nowhere prove to really be a problem.
--
Cheers,
David
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-06 9:02 ` David Hildenbrand (Arm)
@ 2026-02-06 10:00 ` Dev Jain
2026-02-06 11:10 ` David Hildenbrand (Arm)
2026-02-06 11:12 ` Vernon Yang
1 sibling, 1 reply; 26+ messages in thread
From: Dev Jain @ 2026-02-06 10:00 UTC (permalink / raw)
To: David Hildenbrand (Arm), Vernon Yang
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 06/02/26 2:32 pm, David Hildenbrand (Arm) wrote:
> On 2/5/26 15:25, Dev Jain wrote:
>>
>> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
>>> On 2/5/26 07:08, Vernon Yang wrote:
>>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
>>>> <david@kernel.org> wrote:
>>>>
>>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>>>
>>> Yes!
>>>
>>>>
>>>>
>>>> I'm also worried that the compiler can't optimize this since the body of
>>>> the loop is complex, as with Dev's opinion [1].
>>>
>>> Why do we even have to optimize this? :)
>>>
>>> Premature ... ? :)
>>
>>
>> I mean .... we don't, but the alternate is a one liner using max().
>
> I'm fine with the max(), but it still seems like adding complexity to
> optimize something that is nowhere prove to really be a problem.
Agreed. Vernon, let us do the increment in the loop then.
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-06 10:00 ` Dev Jain
@ 2026-02-06 11:10 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-06 11:10 UTC (permalink / raw)
To: Dev Jain, Vernon Yang
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 2/6/26 11:00, Dev Jain wrote:
>
> On 06/02/26 2:32 pm, David Hildenbrand (Arm) wrote:
>> On 2/5/26 15:25, Dev Jain wrote:
>>>
>>>
>>>
>>> I mean .... we don't, but the alternate is a one liner using max().
>>
>> I'm fine with the max(), but it still seems like adding complexity to
>> optimize something that is nowhere prove to really be a problem.
>
> Agreed. Vernon, let us do the increment in the loop then.
I'm fine with the min(), so if you both think it's better, let's do that!
It makes it slightly harder to understand what's happening, but
fortunately, if we mess up slightly nobody will really notice :)
--
Cheers,
David
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-06 9:02 ` David Hildenbrand (Arm)
2026-02-06 10:00 ` Dev Jain
@ 2026-02-06 11:12 ` Vernon Yang
2026-02-06 13:52 ` Lance Yang
2026-02-08 9:05 ` Dev Jain
1 sibling, 2 replies; 26+ messages in thread
From: Vernon Yang @ 2026-02-06 11:12 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Dev Jain, akpm, lorenzo.stoakes, ziy, baohua, lance.yang,
linux-mm, linux-kernel, Vernon Yang
On Fri, Feb 06, 2026 at 10:02:48AM +0100, David Hildenbrand (Arm) wrote:
> On 2/5/26 15:25, Dev Jain wrote:
> >
> > On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
> > > On 2/5/26 07:08, Vernon Yang wrote:
> > > > On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
> > > > <david@kernel.org> wrote:
> > > >
> > > > I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
> > >
> > > Yes!
> > >
> > > >
> > > >
> > > > I'm also worried that the compiler can't optimize this since the body of
> > > > the loop is complex, as with Dev's opinion [1].
> > >
> > > Why do we even have to optimize this? :)
> > >
> > > Premature ... ? :)
> >
> >
> > I mean .... we don't, but the alternate is a one liner using max().
>
> I'm fine with the max(), but it still seems like adding complexity to
> optimize something that is nowhere prove to really be a problem.
Hi David, Dev,
I used "*cur_progress += 1" at the beginning of the loop, and the
compiler optimizes it. The assembly is as follows:
60c1: 4d 29 ca sub %r9,%r10 // r10 is _pte, r9 is pte, r10 = _pte - pte
60c4: b8 00 02 00 00 mov $0x200,%eax // eax = HPAGE_PMD_NR
60c9: 44 89 5c 24 10 mov %r11d,0x10(%rsp) //
60ce: 49 c1 fa 03 sar $0x3,%r10 //
60d2: 49 83 c2 01 add $0x1,%r10 // r10 += 1
60d6: 49 39 c2 cmp %rax,%r10 // r10 = min(r10, eax)
60d9: 4c 0f 4f d0 cmovg %rax,%r10 //
60dd: 44 89 55 00 mov %r10d,0x0(%rbp) // *cur_progress = r10
To make the code simpler, let us use "*cur_progress += 1".
--
Thanks,
Vernon
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-06 11:12 ` Vernon Yang
@ 2026-02-06 13:52 ` Lance Yang
2026-02-08 9:05 ` Dev Jain
1 sibling, 0 replies; 26+ messages in thread
From: Lance Yang @ 2026-02-06 13:52 UTC (permalink / raw)
To: Vernon Yang, David Hildenbrand (Arm), Dev Jain
Cc: akpm, lorenzo.stoakes, ziy, baohua, linux-mm, linux-kernel, Vernon Yang
On 2026/2/6 19:12, Vernon Yang wrote:
> On Fri, Feb 06, 2026 at 10:02:48AM +0100, David Hildenbrand (Arm) wrote:
>> On 2/5/26 15:25, Dev Jain wrote:
>>>
>>> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
>>>> On 2/5/26 07:08, Vernon Yang wrote:
>>>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
>>>>> <david@kernel.org> wrote:
>>>>>
>>>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>>>>
>>>> Yes!
>>>>
>>>>>
>>>>>
>>>>> I'm also worried that the compiler can't optimize this since the body of
>>>>> the loop is complex, as with Dev's opinion [1].
>>>>
>>>> Why do we even have to optimize this? :)
>>>>
>>>> Premature ... ? :)
>>>
>>>
>>> I mean .... we don't, but the alternate is a one liner using max().
>>
>> I'm fine with the max(), but it still seems like adding complexity to
>> optimize something that is nowhere prove to really be a problem.
>
> Hi David, Dev,
>
> I use "*cur_progress += 1" at the beginning of the loop, the compiler
> optimize that. Assembly as follows:
>
> 60c1: 4d 29 ca sub %r9,%r10 // r10 is _pte, r9 is pte, r10 = _pte - pte
> 60c4: b8 00 02 00 00 mov $0x200,%eax // eax = HPAGE_PMD_NR
> 60c9: 44 89 5c 24 10 mov %r11d,0x10(%rsp) //
> 60ce: 49 c1 fa 03 sar $0x3,%r10 //
> 60d2: 49 83 c2 01 add $0x1,%r10 // r10 += 1
> 60d6: 49 39 c2 cmp %rax,%r10 // r10 = min(r10, eax)
> 60d9: 4c 0f 4f d0 cmovg %rax,%r10 //
> 60dd: 44 89 55 00 mov %r10d,0x0(%rbp) // *cur_progress = r10
>
> To make the code simpler, Let us use "*cur_progress += 1".
Cool! The compiler did the right thing and the heavy lifting after all - we get
to keep it simple :p
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-06 11:12 ` Vernon Yang
2026-02-06 13:52 ` Lance Yang
@ 2026-02-08 9:05 ` Dev Jain
2026-02-08 9:32 ` Lance Yang
2026-02-08 13:23 ` Vernon Yang
1 sibling, 2 replies; 26+ messages in thread
From: Dev Jain @ 2026-02-08 9:05 UTC (permalink / raw)
To: Vernon Yang, David Hildenbrand (Arm)
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 06/02/26 4:42 pm, Vernon Yang wrote:
> On Fri, Feb 06, 2026 at 10:02:48AM +0100, David Hildenbrand (Arm) wrote:
>> On 2/5/26 15:25, Dev Jain wrote:
>>> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
>>>> On 2/5/26 07:08, Vernon Yang wrote:
>>>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
>>>>> <david@kernel.org> wrote:
>>>>>
>>>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>>>> Yes!
>>>>
>>>>>
>>>>> I'm also worried that the compiler can't optimize this since the body of
>>>>> the loop is complex, as with Dev's opinion [1].
>>>> Why do we even have to optimize this? :)
>>>>
>>>> Premature ... ? :)
>>>
>>> I mean .... we don't, but the alternate is a one liner using max().
>> I'm fine with the max(), but it still seems like adding complexity to
>> optimize something that is nowhere prove to really be a problem.
> Hi David, Dev,
>
> I use "*cur_progress += 1" at the beginning of the loop, the compiler
> optimize that. Assembly as follows:
>
> 60c1: 4d 29 ca sub %r9,%r10 // r10 is _pte, r9 is pte, r10 = _pte - pte
> 60c4: b8 00 02 00 00 mov $0x200,%eax // eax = HPAGE_PMD_NR
> 60c9: 44 89 5c 24 10 mov %r11d,0x10(%rsp) //
> 60ce: 49 c1 fa 03 sar $0x3,%r10 //
> 60d2: 49 83 c2 01 add $0x1,%r10 // r10 += 1
> 60d6: 49 39 c2 cmp %rax,%r10 // r10 = min(r10, eax)
> 60d9: 4c 0f 4f d0 cmovg %rax,%r10 //
> 60dd: 44 89 55 00 mov %r10d,0x0(%rbp) // *cur_progress = r10
>
> To make the code simpler, Let us use "*cur_progress += 1".
Wow! Wasn't expecting that. What's your gcc version? I checked with
gcc 11.4.0 (looks pretty old) with both x86 and arm64, and it couldn't
optimize.
> --
> Thanks,
> Vernon
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-08 9:05 ` Dev Jain
@ 2026-02-08 9:32 ` Lance Yang
2026-02-08 13:23 ` Vernon Yang
1 sibling, 0 replies; 26+ messages in thread
From: Lance Yang @ 2026-02-08 9:32 UTC (permalink / raw)
To: Dev Jain, Vernon Yang, David Hildenbrand (Arm)
Cc: akpm, lorenzo.stoakes, ziy, baohua, linux-mm, linux-kernel, Vernon Yang
On 2026/2/8 17:05, Dev Jain wrote:
>
> On 06/02/26 4:42 pm, Vernon Yang wrote:
>> On Fri, Feb 06, 2026 at 10:02:48AM +0100, David Hildenbrand (Arm) wrote:
>>> On 2/5/26 15:25, Dev Jain wrote:
>>>> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
>>>>> On 2/5/26 07:08, Vernon Yang wrote:
>>>>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
>>>>>> <david@kernel.org> wrote:
>>>>>>
>>>>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>>>>> Yes!
>>>>>
>>>>>>
>>>>>> I'm also worried that the compiler can't optimize this since the body of
>>>>>> the loop is complex, as with Dev's opinion [1].
>>>>> Why do we even have to optimize this? :)
>>>>>
>>>>> Premature ... ? :)
>>>>
>>>> I mean .... we don't, but the alternate is a one liner using max().
>>> I'm fine with the max(), but it still seems like adding complexity to
>>> optimize something that is nowhere prove to really be a problem.
>> Hi David, Dev,
>>
>> I use "*cur_progress += 1" at the beginning of the loop, the compiler
>> optimize that. Assembly as follows:
>>
>> 60c1: 4d 29 ca sub %r9,%r10 // r10 is _pte, r9 is pte, r10 = _pte - pte
>> 60c4: b8 00 02 00 00 mov $0x200,%eax // eax = HPAGE_PMD_NR
>> 60c9: 44 89 5c 24 10 mov %r11d,0x10(%rsp) //
>> 60ce: 49 c1 fa 03 sar $0x3,%r10 //
>> 60d2: 49 83 c2 01 add $0x1,%r10 // r10 += 1
>> 60d6: 49 39 c2 cmp %rax,%r10 // r10 = min(r10, eax)
>> 60d9: 4c 0f 4f d0 cmovg %rax,%r10 //
>> 60dd: 44 89 55 00 mov %r10d,0x0(%rbp) // *cur_progress = r10
>>
>> To make the code simpler, Let us use "*cur_progress += 1".
>
> Wow! Wasn't expecting that. What's your gcc version? I checked with
> gcc 11.4.0 (looks pretty old) with both x86 and arm64, and it couldn't
> optimize.
FWIW, 11.4.0 is newer than the minimum GCC version (8.1) required by the
kernel. See Documentation/process/changes.rst.
The optimization might just be version-dependent :)
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-08 9:05 ` Dev Jain
2026-02-08 9:32 ` Lance Yang
@ 2026-02-08 13:23 ` Vernon Yang
1 sibling, 0 replies; 26+ messages in thread
From: Vernon Yang @ 2026-02-08 13:23 UTC (permalink / raw)
To: Dev Jain
Cc: David Hildenbrand (Arm),
akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On Sun, Feb 8, 2026 at 5:05 PM Dev Jain <dev.jain@arm.com> wrote:
>
> On 06/02/26 4:42 pm, Vernon Yang wrote:
> > On Fri, Feb 06, 2026 at 10:02:48AM +0100, David Hildenbrand (Arm) wrote:
> >> On 2/5/26 15:25, Dev Jain wrote:
> >>> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
> >>>> On 2/5/26 07:08, Vernon Yang wrote:
> >>>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
> >>>>> <david@kernel.org> wrote:
> >>>>>
> >>>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
> >>>> Yes!
> >>>>
> >>>>>
> >>>>> I'm also worried that the compiler can't optimize this since the body of
> >>>>> the loop is complex, as with Dev's opinion [1].
> >>>> Why do we even have to optimize this? :)
> >>>>
> >>>> Premature ... ? :)
> >>>
> >>> I mean .... we don't, but the alternate is a one liner using max().
> >> I'm fine with the max(), but it still seems like adding complexity to
> >> optimize something that is nowhere prove to really be a problem.
> > Hi David, Dev,
> >
> > I use "*cur_progress += 1" at the beginning of the loop, the compiler
> > optimize that. Assembly as follows:
> >
> > 60c1: 4d 29 ca sub %r9,%r10 // r10 is _pte, r9 is pte, r10 = _pte - pte
> > 60c4: b8 00 02 00 00 mov $0x200,%eax // eax = HPAGE_PMD_NR
> > 60c9: 44 89 5c 24 10 mov %r11d,0x10(%rsp) //
> > 60ce: 49 c1 fa 03 sar $0x3,%r10 //
> > 60d2: 49 83 c2 01 add $0x1,%r10 // r10 += 1
> > 60d6: 49 39 c2 cmp %rax,%r10 // r10 = min(r10, eax)
> > 60d9: 4c 0f 4f d0 cmovg %rax,%r10 //
> > 60dd: 44 89 55 00 mov %r10d,0x0(%rbp) // *cur_progress = r10
> >
> > To make the code simpler, Let us use "*cur_progress += 1".
>
> Wow! Wasn't expecting that. What's your gcc version? I checked with
> gcc 11.4.0 (looks pretty old) with both x86 and arm64, and it couldn't
> optimize.
$ gcc --version
gcc (GCC) 15.2.1 20250808 (Red Hat 15.2.1-1)
The above is my gcc version. However, when I performed the disassembly again,
there was no such optimization :(
I suspect I messed up the environment earlier and failed to build the newly
modified code, so the disassembly was of the old khugepaged.o.
* [PATCH mm-new v6 3/5] mm: add folio_test_lazyfree helper
2026-02-01 12:25 [PATCH mm-new v6 0/5] Improve khugepaged scan logic Vernon Yang
2026-02-01 12:25 ` [PATCH mm-new v6 1/5] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
2026-02-01 12:25 ` [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number Vernon Yang
@ 2026-02-01 12:25 ` Vernon Yang
2026-02-01 12:25 ` [PATCH mm-new v6 4/5] mm: khugepaged: skip lazy-free folios Vernon Yang
2026-02-01 12:25 ` [PATCH mm-new v6 5/5] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY Vernon Yang
4 siblings, 0 replies; 26+ messages in thread
From: Vernon Yang @ 2026-02-01 12:25 UTC (permalink / raw)
To: akpm, david
Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
From: Vernon Yang <yanglincheng@kylinos.cn>
Add a folio_test_lazyfree() function to identify lazy-free folios and improve
code readability.
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
---
include/linux/page-flags.h | 5 +++++
mm/rmap.c | 2 +-
mm/vmscan.c | 5 ++---
3 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f7a0e4af0c73..415e9f2ef616 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -724,6 +724,11 @@ static __always_inline bool folio_test_anon(const struct folio *folio)
return ((unsigned long)folio->mapping & FOLIO_MAPPING_ANON) != 0;
}
+static __always_inline bool folio_test_lazyfree(const struct folio *folio)
+{
+ return folio_test_anon(folio) && !folio_test_swapbacked(folio);
+}
+
static __always_inline bool PageAnonNotKsm(const struct page *page)
{
unsigned long flags = (unsigned long)page_folio(page)->mapping;
diff --git a/mm/rmap.c b/mm/rmap.c
index 618df3385c8b..ea55e12b3e87 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2049,7 +2049,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
}
if (!pvmw.pte) {
- if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
+ if (folio_test_lazyfree(folio)) {
if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
goto walk_done;
/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 01d3364fe506..17d039bdbb53 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -963,8 +963,7 @@ static void folio_check_dirty_writeback(struct folio *folio,
* They could be mistakenly treated as file lru. So further anon
* test is needed.
*/
- if (!folio_is_file_lru(folio) ||
- (folio_test_anon(folio) && !folio_test_swapbacked(folio))) {
+ if (!folio_is_file_lru(folio) || folio_test_lazyfree(folio)) {
*dirty = false;
*writeback = false;
return;
@@ -1508,7 +1507,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
}
}
- if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
+ if (folio_test_lazyfree(folio)) {
/* follow __remove_mapping for reference */
if (!folio_ref_freeze(folio, 1))
goto keep_locked;
--
2.51.0
* [PATCH mm-new v6 4/5] mm: khugepaged: skip lazy-free folios
2026-02-01 12:25 [PATCH mm-new v6 0/5] Improve khugepaged scan logic Vernon Yang
` (2 preceding siblings ...)
2026-02-01 12:25 ` [PATCH mm-new v6 3/5] mm: add folio_test_lazyfree helper Vernon Yang
@ 2026-02-01 12:25 ` Vernon Yang
2026-02-03 11:23 ` Lance Yang
2026-02-04 21:23 ` David Hildenbrand (arm)
2026-02-01 12:25 ` [PATCH mm-new v6 5/5] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY Vernon Yang
4 siblings, 2 replies; 26+ messages in thread
From: Vernon Yang @ 2026-02-01 12:25 UTC (permalink / raw)
To: akpm, david
Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
From: Vernon Yang <yanglincheng@kylinos.cn>
For example, create three tasks: hot1 -> cold -> hot2. After all three
tasks are created, each allocates 128MB of memory. The hot1/hot2 tasks
continuously access their 128MB of memory, while the cold task only accesses
its memory briefly and then calls madvise(MADV_FREE). However, khugepaged
still prioritizes scanning the cold task and only scans the hot2 task
after completing the scan of the cold task.
Moreover, if we collapse a region containing a lazy-free page, its content
will never become none again and the deferred shrinker cannot reclaim it.
So if the user has explicitly informed us via MADV_FREE that this memory
will be freed, it is appropriate for khugepaged to simply skip it, thereby
avoiding unnecessary scan and collapse operations and reducing CPU
waste.
Here are the performance test results:
(Throughput bigger is better, other smaller is better)
Testing on x86_64 machine:
| task hot2 | without patch | with patch | delta |
|---------------------|---------------|---------------|---------|
| total accesses time | 3.14 sec | 2.93 sec | -6.69% |
| cycles per access | 4.96 | 2.21 | -55.44% |
| Throughput | 104.38 M/sec | 111.89 M/sec | +7.19% |
| dTLB-load-misses | 284814532 | 69597236 | -75.56% |
Testing on qemu-system-x86_64 -enable-kvm:
| task hot2 | without patch | with patch | delta |
|---------------------|---------------|---------------|---------|
| total accesses time | 3.35 sec | 2.96 sec | -11.64% |
| cycles per access | 7.29 | 2.07 | -71.60% |
| Throughput | 97.67 M/sec | 110.77 M/sec | +13.41% |
| dTLB-load-misses | 241600871 | 3216108 | -98.67% |
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
include/trace/events/huge_memory.h | 1 +
mm/khugepaged.c | 13 +++++++++++++
2 files changed, 14 insertions(+)
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 384e29f6bef0..bcdc57eea270 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -25,6 +25,7 @@
EM( SCAN_PAGE_LRU, "page_not_in_lru") \
EM( SCAN_PAGE_LOCK, "page_locked") \
EM( SCAN_PAGE_ANON, "page_not_anon") \
+ EM( SCAN_PAGE_LAZYFREE, "page_lazyfree") \
EM( SCAN_PAGE_COMPOUND, "page_compound") \
EM( SCAN_ANY_PROCESS, "no_process_for_page") \
EM( SCAN_VMA_NULL, "vma_null") \
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index df22b2274d92..b4def001ccd0 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -46,6 +46,7 @@ enum scan_result {
SCAN_PAGE_LRU,
SCAN_PAGE_LOCK,
SCAN_PAGE_ANON,
+ SCAN_PAGE_LAZYFREE,
SCAN_PAGE_COMPOUND,
SCAN_ANY_PROCESS,
SCAN_VMA_NULL,
@@ -583,6 +584,12 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
folio = page_folio(page);
VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
+ if (cc->is_khugepaged && !pte_dirty(pteval) &&
+ folio_test_lazyfree(folio)) {
+ result = SCAN_PAGE_LAZYFREE;
+ goto out;
+ }
+
/* See hpage_collapse_scan_pmd(). */
if (folio_maybe_mapped_shared(folio)) {
++shared;
@@ -1332,6 +1339,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
}
folio = page_folio(page);
+ if (cc->is_khugepaged && !pte_dirty(pteval) &&
+ folio_test_lazyfree(folio)) {
+ result = SCAN_PAGE_LAZYFREE;
+ goto out_unmap;
+ }
+
if (!folio_test_anon(folio)) {
result = SCAN_PAGE_ANON;
goto out_unmap;
--
2.51.0
* Re: [PATCH mm-new v6 4/5] mm: khugepaged: skip lazy-free folios
2026-02-01 12:25 ` [PATCH mm-new v6 4/5] mm: khugepaged: skip lazy-free folios Vernon Yang
@ 2026-02-03 11:23 ` Lance Yang
2026-02-05 6:01 ` Vernon Yang
2026-02-04 21:23 ` David Hildenbrand (arm)
1 sibling, 1 reply; 26+ messages in thread
From: Lance Yang @ 2026-02-03 11:23 UTC (permalink / raw)
To: Vernon Yang
Cc: lorenzo.stoakes, ziy, dev.jain, baohua, linux-mm, linux-kernel,
Vernon Yang, david, akpm
On 2026/2/1 20:25, Vernon Yang wrote:
> From: Vernon Yang <yanglincheng@kylinos.cn>
>
> For example, create three task: hot1 -> cold -> hot2. After all three
> task are created, each allocate memory 128MB. the hot1/hot2 task
> continuously access 128 MB memory, while the cold task only accesses
> its memory briefly andthen call madvise(MADV_FREE). However, khugepaged
s/andthen/and then/
> still prioritizes scanning the cold task and only scans the hot2 task
> after completing the scan of the cold task.
>
> And if we collapse with a lazyfree page, that content will never be none
> and the deferred shrinker cannot reclaim them.
>
> So if the user has explicitly informed us via MADV_FREE that this memory
> will be freed, it is appropriate for khugepaged to skip it only, thereby
> avoiding unnecessary scan and collapse operations to reducing CPU
> wastage.
>
> Here are the performance test results:
> (Throughput bigger is better, other smaller is better)
>
> Testing on x86_64 machine:
>
> | task hot2 | without patch | with patch | delta |
> |---------------------|---------------|---------------|---------|
> | total accesses time | 3.14 sec | 2.93 sec | -6.69% |
> | cycles per access | 4.96 | 2.21 | -55.44% |
> | Throughput | 104.38 M/sec | 111.89 M/sec | +7.19% |
> | dTLB-load-misses | 284814532 | 69597236 | -75.56% |
>
> Testing on qemu-system-x86_64 -enable-kvm:
>
> | task hot2 | without patch | with patch | delta |
> |---------------------|---------------|---------------|---------|
> | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
> | cycles per access | 7.29 | 2.07 | -71.60% |
> | Throughput | 97.67 M/sec | 110.77 M/sec | +13.41% |
> | dTLB-load-misses | 241600871 | 3216108 | -98.67% |
>
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> ---
> include/trace/events/huge_memory.h | 1 +
> mm/khugepaged.c | 13 +++++++++++++
> 2 files changed, 14 insertions(+)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 384e29f6bef0..bcdc57eea270 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -25,6 +25,7 @@
> EM( SCAN_PAGE_LRU, "page_not_in_lru") \
> EM( SCAN_PAGE_LOCK, "page_locked") \
> EM( SCAN_PAGE_ANON, "page_not_anon") \
> + EM( SCAN_PAGE_LAZYFREE, "page_lazyfree") \
> EM( SCAN_PAGE_COMPOUND, "page_compound") \
> EM( SCAN_ANY_PROCESS, "no_process_for_page") \
> EM( SCAN_VMA_NULL, "vma_null") \
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index df22b2274d92..b4def001ccd0 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -46,6 +46,7 @@ enum scan_result {
> SCAN_PAGE_LRU,
> SCAN_PAGE_LOCK,
> SCAN_PAGE_ANON,
> + SCAN_PAGE_LAZYFREE,
> SCAN_PAGE_COMPOUND,
> SCAN_ANY_PROCESS,
> SCAN_VMA_NULL,
> @@ -583,6 +584,12 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> folio = page_folio(page);
> VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>
> + if (cc->is_khugepaged && !pte_dirty(pteval) &&
> + folio_test_lazyfree(folio)) {
> + result = SCAN_PAGE_LAZYFREE;
> + goto out;
> + }
> +
> /* See hpage_collapse_scan_pmd(). */
> if (folio_maybe_mapped_shared(folio)) {
> ++shared;
> @@ -1332,6 +1339,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> }
> folio = page_folio(page);
>
> + if (cc->is_khugepaged && !pte_dirty(pteval) &&
> + folio_test_lazyfree(folio)) {
> + result = SCAN_PAGE_LAZYFREE;
> + goto out_unmap;
> + }
> +
> if (!folio_test_anon(folio)) {
> result = SCAN_PAGE_ANON;
> goto out_unmap;
Nothing else jumped at me, LGTM.
Reviewed-by: Lance Yang <lance.yang@linux.dev>
* Re: [PATCH mm-new v6 4/5] mm: khugepaged: skip lazy-free folios
2026-02-03 11:23 ` Lance Yang
@ 2026-02-05 6:01 ` Vernon Yang
0 siblings, 0 replies; 26+ messages in thread
From: Vernon Yang @ 2026-02-05 6:01 UTC (permalink / raw)
To: Lance Yang
Cc: lorenzo.stoakes, ziy, dev.jain, baohua, linux-mm, linux-kernel,
Vernon Yang, david, akpm
On Tue, Feb 3, 2026 at 7:23 PM Lance Yang <lance.yang@linux.dev> wrote:
>
> On 2026/2/1 20:25, Vernon Yang wrote:
> > From: Vernon Yang <yanglincheng@kylinos.cn>
> >
> > For example, create three task: hot1 -> cold -> hot2. After all three
> > task are created, each allocate memory 128MB. the hot1/hot2 task
> > continuously access 128 MB memory, while the cold task only accesses
> > its memory briefly andthen call madvise(MADV_FREE). However, khugepaged
>
> s/andthen/and then/
LGTM. Thank you for the review and suggestion; I will fix it in the next version.
> > still prioritizes scanning the cold task and only scans the hot2 task
> > after completing the scan of the cold task.
> >
> > And if we collapse with a lazyfree page, that content will never be none
> > and the deferred shrinker cannot reclaim them.
> >
> > So if the user has explicitly informed us via MADV_FREE that this memory
> > will be freed, it is appropriate for khugepaged to skip it only, thereby
> > avoiding unnecessary scan and collapse operations to reducing CPU
> > wastage.
> >
> > Here are the performance test results:
> > (Throughput bigger is better, other smaller is better)
> >
> > Testing on x86_64 machine:
> >
> > | task hot2 | without patch | with patch | delta |
> > |---------------------|---------------|---------------|---------|
> > | total accesses time | 3.14 sec | 2.93 sec | -6.69% |
> > | cycles per access | 4.96 | 2.21 | -55.44% |
> > | Throughput | 104.38 M/sec | 111.89 M/sec | +7.19% |
> > | dTLB-load-misses | 284814532 | 69597236 | -75.56% |
> >
> > Testing on qemu-system-x86_64 -enable-kvm:
> >
> > | task hot2 | without patch | with patch | delta |
> > |---------------------|---------------|---------------|---------|
> > | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
> > | cycles per access | 7.29 | 2.07 | -71.60% |
> > | Throughput | 97.67 M/sec | 110.77 M/sec | +13.41% |
> > | dTLB-load-misses | 241600871 | 3216108 | -98.67% |
> >
> > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > ---
> > include/trace/events/huge_memory.h | 1 +
> > mm/khugepaged.c | 13 +++++++++++++
> > 2 files changed, 14 insertions(+)
> >
> > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > index 384e29f6bef0..bcdc57eea270 100644
> > --- a/include/trace/events/huge_memory.h
> > +++ b/include/trace/events/huge_memory.h
> > @@ -25,6 +25,7 @@
> > EM( SCAN_PAGE_LRU, "page_not_in_lru") \
> > EM( SCAN_PAGE_LOCK, "page_locked") \
> > EM( SCAN_PAGE_ANON, "page_not_anon") \
> > + EM( SCAN_PAGE_LAZYFREE, "page_lazyfree") \
> > EM( SCAN_PAGE_COMPOUND, "page_compound") \
> > EM( SCAN_ANY_PROCESS, "no_process_for_page") \
> > EM( SCAN_VMA_NULL, "vma_null") \
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index df22b2274d92..b4def001ccd0 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -46,6 +46,7 @@ enum scan_result {
> > SCAN_PAGE_LRU,
> > SCAN_PAGE_LOCK,
> > SCAN_PAGE_ANON,
> > + SCAN_PAGE_LAZYFREE,
> > SCAN_PAGE_COMPOUND,
> > SCAN_ANY_PROCESS,
> > SCAN_VMA_NULL,
> > @@ -583,6 +584,12 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > folio = page_folio(page);
> > VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
> >
> > + if (cc->is_khugepaged && !pte_dirty(pteval) &&
> > + folio_test_lazyfree(folio)) {
> > + result = SCAN_PAGE_LAZYFREE;
> > + goto out;
> > + }
> > +
> > /* See hpage_collapse_scan_pmd(). */
> > if (folio_maybe_mapped_shared(folio)) {
> > ++shared;
> > @@ -1332,6 +1339,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> > }
> > folio = page_folio(page);
> >
> > + if (cc->is_khugepaged && !pte_dirty(pteval) &&
> > + folio_test_lazyfree(folio)) {
> > + result = SCAN_PAGE_LAZYFREE;
> > + goto out_unmap;
> > + }
> > +
> > if (!folio_test_anon(folio)) {
> > result = SCAN_PAGE_ANON;
> > goto out_unmap;
>
> Nothing else jumped at me, LGTM.
>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
* Re: [PATCH mm-new v6 4/5] mm: khugepaged: skip lazy-free folios
2026-02-01 12:25 ` [PATCH mm-new v6 4/5] mm: khugepaged: skip lazy-free folios Vernon Yang
2026-02-03 11:23 ` Lance Yang
@ 2026-02-04 21:23 ` David Hildenbrand (arm)
2026-02-05 6:05 ` Vernon Yang
1 sibling, 1 reply; 26+ messages in thread
From: David Hildenbrand (arm) @ 2026-02-04 21:23 UTC (permalink / raw)
To: Vernon Yang, akpm
Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 2/1/26 13:25, Vernon Yang wrote:
> From: Vernon Yang <yanglincheng@kylinos.cn>
>
> For example, create three task: hot1 -> cold -> hot2. After all three
> task are created, each allocate memory 128MB. the hot1/hot2 task
> continuously access 128 MB memory, while the cold task only accesses
> its memory briefly andthen call madvise(MADV_FREE). However, khugepaged
> still prioritizes scanning the cold task and only scans the hot2 task
> after completing the scan of the cold task.
>
> And if we collapse with a lazyfree page, that content will never be none
> and the deferred shrinker cannot reclaim them.
>
> So if the user has explicitly informed us via MADV_FREE that this memory
> will be freed, it is appropriate for khugepaged to skip it only, thereby
> avoiding unnecessary scan and collapse operations to reducing CPU
> wastage.
>
> Here are the performance test results:
> (Throughput bigger is better, other smaller is better)
>
> Testing on x86_64 machine:
>
> | task hot2 | without patch | with patch | delta |
> |---------------------|---------------|---------------|---------|
> | total accesses time | 3.14 sec | 2.93 sec | -6.69% |
> | cycles per access | 4.96 | 2.21 | -55.44% |
> | Throughput | 104.38 M/sec | 111.89 M/sec | +7.19% |
> | dTLB-load-misses | 284814532 | 69597236 | -75.56% |
>
> Testing on qemu-system-x86_64 -enable-kvm:
>
> | task hot2 | without patch | with patch | delta |
> |---------------------|---------------|---------------|---------|
> | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
> | cycles per access | 7.29 | 2.07 | -71.60% |
> | Throughput | 97.67 M/sec | 110.77 M/sec | +13.41% |
> | dTLB-load-misses | 241600871 | 3216108 | -98.67% |
>
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> ---
> include/trace/events/huge_memory.h | 1 +
> mm/khugepaged.c | 13 +++++++++++++
> 2 files changed, 14 insertions(+)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 384e29f6bef0..bcdc57eea270 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -25,6 +25,7 @@
> EM( SCAN_PAGE_LRU, "page_not_in_lru") \
> EM( SCAN_PAGE_LOCK, "page_locked") \
> EM( SCAN_PAGE_ANON, "page_not_anon") \
> + EM( SCAN_PAGE_LAZYFREE, "page_lazyfree") \
> EM( SCAN_PAGE_COMPOUND, "page_compound") \
> EM( SCAN_ANY_PROCESS, "no_process_for_page") \
> EM( SCAN_VMA_NULL, "vma_null") \
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index df22b2274d92..b4def001ccd0 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -46,6 +46,7 @@ enum scan_result {
> SCAN_PAGE_LRU,
> SCAN_PAGE_LOCK,
> SCAN_PAGE_ANON,
> + SCAN_PAGE_LAZYFREE,
> SCAN_PAGE_COMPOUND,
> SCAN_ANY_PROCESS,
> SCAN_VMA_NULL,
> @@ -583,6 +584,12 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> folio = page_folio(page);
> VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>
> + if (cc->is_khugepaged && !pte_dirty(pteval) &&
> + folio_test_lazyfree(folio)) {
Should be aligned as
if (cc->is_khugepaged && !pte_dirty(pteval) &&
folio_test_lazyfree(folio)) {
But you could just have it in a single line.
> + result = SCAN_PAGE_LAZYFREE;
> + goto out;
> + }
> +
> /* See hpage_collapse_scan_pmd(). */
> if (folio_maybe_mapped_shared(folio)) {
> ++shared;
> @@ -1332,6 +1339,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> }
> folio = page_folio(page);
>
> + if (cc->is_khugepaged && !pte_dirty(pteval) &&
> + folio_test_lazyfree(folio)) {
> + result = SCAN_PAGE_LAZYFREE;
> + goto out_unmap;
> + }
Dito.
> +
> if (!folio_test_anon(folio)) {
> result = SCAN_PAGE_ANON;
> goto out_unmap;
Surprised that there is no need to add checks for SCAN_PAGE_LAZYFREE
anywhere, but it's similar to SCAN_PAGE_LOCK, just that we can never
run into it for madvise.
Acked-by: David Hildenbrand (arm) <david@kernel.org>
--
Cheers,
David
* Re: [PATCH mm-new v6 4/5] mm: khugepaged: skip lazy-free folios
2026-02-04 21:23 ` David Hildenbrand (arm)
@ 2026-02-05 6:05 ` Vernon Yang
0 siblings, 0 replies; 26+ messages in thread
From: Vernon Yang @ 2026-02-05 6:05 UTC (permalink / raw)
To: David Hildenbrand (arm)
Cc: akpm, lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang,
linux-mm, linux-kernel, Vernon Yang
On Thu, Feb 5, 2026 at 5:24 AM David Hildenbrand (arm) <david@kernel.org> wrote:
>
> On 2/1/26 13:25, Vernon Yang wrote:
> > From: Vernon Yang <yanglincheng@kylinos.cn>
> >
> > For example, create three task: hot1 -> cold -> hot2. After all three
> > task are created, each allocate memory 128MB. the hot1/hot2 task
> > continuously access 128 MB memory, while the cold task only accesses
> > its memory briefly andthen call madvise(MADV_FREE). However, khugepaged
> > still prioritizes scanning the cold task and only scans the hot2 task
> > after completing the scan of the cold task.
> >
> > And if we collapse with a lazyfree page, that content will never be none
> > and the deferred shrinker cannot reclaim them.
> >
> > So if the user has explicitly informed us via MADV_FREE that this memory
> > will be freed, it is appropriate for khugepaged to skip it only, thereby
> > avoiding unnecessary scan and collapse operations to reducing CPU
> > wastage.
> >
> > Here are the performance test results:
> > (Throughput bigger is better, other smaller is better)
> >
> > Testing on x86_64 machine:
> >
> > | task hot2 | without patch | with patch | delta |
> > |---------------------|---------------|---------------|---------|
> > | total accesses time | 3.14 sec | 2.93 sec | -6.69% |
> > | cycles per access | 4.96 | 2.21 | -55.44% |
> > | Throughput | 104.38 M/sec | 111.89 M/sec | +7.19% |
> > | dTLB-load-misses | 284814532 | 69597236 | -75.56% |
> >
> > Testing on qemu-system-x86_64 -enable-kvm:
> >
> > | task hot2 | without patch | with patch | delta |
> > |---------------------|---------------|---------------|---------|
> > | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
> > | cycles per access | 7.29 | 2.07 | -71.60% |
> > | Throughput | 97.67 M/sec | 110.77 M/sec | +13.41% |
> > | dTLB-load-misses | 241600871 | 3216108 | -98.67% |
> >
> > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > ---
> > include/trace/events/huge_memory.h | 1 +
> > mm/khugepaged.c | 13 +++++++++++++
> > 2 files changed, 14 insertions(+)
> >
> > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > index 384e29f6bef0..bcdc57eea270 100644
> > --- a/include/trace/events/huge_memory.h
> > +++ b/include/trace/events/huge_memory.h
> > @@ -25,6 +25,7 @@
> > EM( SCAN_PAGE_LRU, "page_not_in_lru") \
> > EM( SCAN_PAGE_LOCK, "page_locked") \
> > EM( SCAN_PAGE_ANON, "page_not_anon") \
> > + EM( SCAN_PAGE_LAZYFREE, "page_lazyfree") \
> > EM( SCAN_PAGE_COMPOUND, "page_compound") \
> > EM( SCAN_ANY_PROCESS, "no_process_for_page") \
> > EM( SCAN_VMA_NULL, "vma_null") \
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index df22b2274d92..b4def001ccd0 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -46,6 +46,7 @@ enum scan_result {
> > SCAN_PAGE_LRU,
> > SCAN_PAGE_LOCK,
> > SCAN_PAGE_ANON,
> > + SCAN_PAGE_LAZYFREE,
> > SCAN_PAGE_COMPOUND,
> > SCAN_ANY_PROCESS,
> > SCAN_VMA_NULL,
> > @@ -583,6 +584,12 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > folio = page_folio(page);
> > VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
> >
> > + if (cc->is_khugepaged && !pte_dirty(pteval) &&
> > + folio_test_lazyfree(folio)) {
>
> Should be aligned as
>
> if (cc->is_khugepaged && !pte_dirty(pteval) &&
> folio_test_lazyfree(folio)) {
LGTM. Thank you for the review and suggestion, I will fix it in the next version.
> But you could just have it in a single line.
If it is placed on a single line, it will exceed 80 characters.
> > + result = SCAN_PAGE_LAZYFREE;
> > + goto out;
> > + }
> > +
> > /* See hpage_collapse_scan_pmd(). */
> > if (folio_maybe_mapped_shared(folio)) {
> > ++shared;
> > @@ -1332,6 +1339,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> > }
> > folio = page_folio(page);
> >
> > + if (cc->is_khugepaged && !pte_dirty(pteval) &&
> > + folio_test_lazyfree(folio)) {
> > + result = SCAN_PAGE_LAZYFREE;
> > + goto out_unmap;
> > + }
>
> Ditto.
>
> > +
> > if (!folio_test_anon(folio)) {
> > result = SCAN_PAGE_ANON;
> > goto out_unmap;
>
> Surprised that there is no need to add checks for SCAN_PAGE_LAZYFREE
> anywhere, but it's similar to SCAN_PAGE_LOCK just that we cannot ever
> run into it for madvise.
>
> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Thank you for the review and explanation.
> --
> Cheers,
>
> David
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH mm-new v6 5/5] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
2026-02-01 12:25 [PATCH mm-new v6 0/5] Improve khugepaged scan logic Vernon Yang
` (3 preceding siblings ...)
2026-02-01 12:25 ` [PATCH mm-new v6 4/5] mm: khugepaged: skip lazy-free folios Vernon Yang
@ 2026-02-01 12:25 ` Vernon Yang
4 siblings, 0 replies; 26+ messages in thread
From: Vernon Yang @ 2026-02-01 12:25 UTC (permalink / raw)
To: akpm, david
Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
From: Vernon Yang <yanglincheng@kylinos.cn>
When an mm with the MMF_DISABLE_THP_COMPLETELY flag is detected during
scanning, set khugepaged_scan.mm_slot directly to the next mm_slot,
avoiding a redundant pass.
Without this patch, khugepaged_scan.mm_slot is only advanced to the
next mm_slot on the next entry into khugepaged_scan_mm_slot(). With
this patch, it is advanced immediately.
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
---
mm/khugepaged.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b4def001ccd0..94cd064f79a5 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2549,9 +2549,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
VM_BUG_ON(khugepaged_scan.mm_slot != slot);
/*
* Release the current mm_slot if this mm is about to die, or
- * if we scanned all vmas of this mm.
+ * if we scanned all vmas of this mm, or THP got disabled.
*/
- if (hpage_collapse_test_exit(mm) || !vma) {
+ if (hpage_collapse_test_exit_or_disable(mm) || !vma) {
/*
* Make sure that if mm_users is reaching zero while
* khugepaged runs here, khugepaged_exit will find
--
2.51.0