* [PATCH mm-new v6 1/5] mm: khugepaged: add trace_mm_khugepaged_scan event
2026-02-01 12:25 [PATCH mm-new v6 0/5] Improve khugepaged scan logic Vernon Yang
@ 2026-02-01 12:25 ` Vernon Yang
2026-02-01 12:25 ` [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number Vernon Yang
` (3 subsequent siblings)
4 siblings, 0 replies; 26+ messages in thread
From: Vernon Yang @ 2026-02-01 12:25 UTC (permalink / raw)
To: akpm, david
Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
From: Vernon Yang <yanglincheng@kylinos.cn>
Add the mm_khugepaged_scan event to track the total time of a full
scan and the total number of pages scanned by khugepaged.
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Dev Jain <dev.jain@arm.com>
---
include/trace/events/huge_memory.h | 25 +++++++++++++++++++++++++
mm/khugepaged.c | 2 ++
2 files changed, 27 insertions(+)
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 4e41bff31888..384e29f6bef0 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -237,5 +237,30 @@ TRACE_EVENT(mm_khugepaged_collapse_file,
__print_symbolic(__entry->result, SCAN_STATUS))
);
+TRACE_EVENT(mm_khugepaged_scan,
+
+ TP_PROTO(struct mm_struct *mm, unsigned int progress,
+ bool full_scan_finished),
+
+ TP_ARGS(mm, progress, full_scan_finished),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct *, mm)
+ __field(unsigned int, progress)
+ __field(bool, full_scan_finished)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __entry->progress = progress;
+ __entry->full_scan_finished = full_scan_finished;
+ ),
+
+ TP_printk("mm=%p, progress=%u, full_scan_finished=%d",
+ __entry->mm,
+ __entry->progress,
+ __entry->full_scan_finished)
+);
+
#endif /* __HUGE_MEMORY_H */
#include <trace/define_trace.h>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1d994b6c58c6..d94b34e10bdf 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2534,6 +2534,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
collect_mm_slot(slot);
}
+ trace_mm_khugepaged_scan(mm, progress, khugepaged_scan.mm_slot == NULL);
+
return progress;
}
--
2.51.0
^ permalink raw reply [flat|nested] 26+ messages in thread

* [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-01 12:25 [PATCH mm-new v6 0/5] Improve khugepaged scan logic Vernon Yang
2026-02-01 12:25 ` [PATCH mm-new v6 1/5] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
@ 2026-02-01 12:25 ` Vernon Yang
2026-02-04 21:35 ` David Hildenbrand (arm)
2026-02-01 12:25 ` [PATCH mm-new v6 3/5] mm: add folio_test_lazyfree helper Vernon Yang
` (2 subsequent siblings)
4 siblings, 1 reply; 26+ messages in thread
From: Vernon Yang @ 2026-02-01 12:25 UTC (permalink / raw)
To: akpm, david
Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
From: Vernon Yang <yanglincheng@kylinos.cn>
Currently, each scan always increases "progress" by HPAGE_PMD_NR,
even if only scanning a single PTE/PMD entry.
- When only scanning a single PTE entry, let me provide a detailed
example:
static int hpage_collapse_scan_pmd()
{
for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
_pte++, addr += PAGE_SIZE) {
pte_t pteval = ptep_get(_pte);
...
if (pte_uffd_wp(pteval)) { <-- first scan hit
result = SCAN_PTE_UFFD_WP;
goto out_unmap;
}
}
}
During the first scan, if pte_uffd_wp(pteval) is true, the loop exits
directly; in practice, only one PTE is scanned before termination.
Here, "progress += 1" reflects the actual number of PTEs scanned,
whereas previously "progress" was always increased by HPAGE_PMD_NR.
- When the memory has been collapsed to a PMD, let me provide a detailed
example:
The following data is traced by bpftrace on a desktop system. After
the system has been left idle for 10 minutes upon booting, a lot of
SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE results are observed during a full scan
by khugepaged.
From trace_mm_khugepaged_scan_pmd and trace_mm_khugepaged_scan_file, the
following statuses were observed, with frequency mentioned next to them:
SCAN_SUCCEED : 1
SCAN_EXCEED_SHARED_PTE: 2
SCAN_PMD_MAPPED : 142
SCAN_NO_PTE_TABLE : 178
total progress size : 674 MB
Total time : 419 seconds, including khugepaged_scan_sleep_millisecs
The khugepaged_scan list holds every task that supports collapsing into
hugepages; as long as a task is not destroyed, khugepaged never removes
it from the list. As a result, a task may have already collapsed all of
its memory regions into hugepages, yet khugepaged keeps scanning it,
which wastes CPU time to no effect. Combined with
khugepaged_scan_sleep_millisecs (default 10s), this means that scanning
a large number of such stale tasks delays the scan of tasks that still
actually need it.
After applying this patch, when the memory is either SCAN_PMD_MAPPED or
SCAN_NO_PTE_TABLE, it is simply skipped, as follows:
SCAN_EXCEED_SHARED_PTE: 2
SCAN_PMD_MAPPED : 147
SCAN_NO_PTE_TABLE : 173
total progress size : 45 MB
Total time : 20 seconds
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
mm/khugepaged.c | 41 +++++++++++++++++++++++++++++++----------
1 file changed, 31 insertions(+), 10 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d94b34e10bdf..df22b2274d92 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -68,7 +68,10 @@ enum scan_result {
static struct task_struct *khugepaged_thread __read_mostly;
static DEFINE_MUTEX(khugepaged_mutex);
-/* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 second */
+/*
+ * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
+ * every 10 second.
+ */
static unsigned int khugepaged_pages_to_scan __read_mostly;
static unsigned int khugepaged_pages_collapsed;
static unsigned int khugepaged_full_scans;
@@ -1240,7 +1243,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
}
static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long start_addr, bool *mmap_locked,
+ struct vm_area_struct *vma, unsigned long start_addr,
+ bool *mmap_locked, unsigned int *cur_progress,
struct collapse_control *cc)
{
pmd_t *pmd;
@@ -1256,13 +1260,18 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
- if (result != SCAN_SUCCEED)
+ if (result != SCAN_SUCCEED) {
+ if (cur_progress)
+ *cur_progress = 1;
goto out;
+ }
memset(cc->node_load, 0, sizeof(cc->node_load));
nodes_clear(cc->alloc_nmask);
pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
if (!pte) {
+ if (cur_progress)
+ *cur_progress = 1;
result = SCAN_NO_PTE_TABLE;
goto out;
}
@@ -1396,6 +1405,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
result = SCAN_SUCCEED;
}
out_unmap:
+ if (cur_progress) {
+ if (_pte >= pte + HPAGE_PMD_NR)
+ *cur_progress = HPAGE_PMD_NR;
+ else
+ *cur_progress = _pte - pte + 1;
+ }
pte_unmap_unlock(pte, ptl);
if (result == SCAN_SUCCEED) {
result = collapse_huge_page(mm, start_addr, referenced,
@@ -2286,8 +2301,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
return result;
}
-static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
- struct file *file, pgoff_t start, struct collapse_control *cc)
+static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
+ unsigned long addr, struct file *file, pgoff_t start,
+ unsigned int *cur_progress, struct collapse_control *cc)
{
struct folio *folio = NULL;
struct address_space *mapping = file->f_mapping;
@@ -2376,6 +2392,8 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
cond_resched_rcu();
}
}
+ if (cur_progress)
+ *cur_progress = max(xas.xa_index - start, 1UL);
rcu_read_unlock();
if (result == SCAN_SUCCEED) {
@@ -2455,6 +2473,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
while (khugepaged_scan.address < hend) {
bool mmap_locked = true;
+ unsigned int cur_progress = 0;
cond_resched();
if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
@@ -2471,7 +2490,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
mmap_read_unlock(mm);
mmap_locked = false;
*result = hpage_collapse_scan_file(mm,
- khugepaged_scan.address, file, pgoff, cc);
+ khugepaged_scan.address, file, pgoff,
+ &cur_progress, cc);
fput(file);
if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
mmap_read_lock(mm);
@@ -2485,7 +2505,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
}
} else {
*result = hpage_collapse_scan_pmd(mm, vma,
- khugepaged_scan.address, &mmap_locked, cc);
+ khugepaged_scan.address, &mmap_locked,
+ &cur_progress, cc);
}
if (*result == SCAN_SUCCEED)
@@ -2493,7 +2514,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
/* move to next address */
khugepaged_scan.address += HPAGE_PMD_SIZE;
- progress += HPAGE_PMD_NR;
+ progress += cur_progress;
if (!mmap_locked)
/*
* We released mmap_lock so break loop. Note
@@ -2816,7 +2837,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
mmap_locked = false;
*lock_dropped = true;
result = hpage_collapse_scan_file(mm, addr, file, pgoff,
- cc);
+ NULL, cc);
if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
mapping_can_writeback(file->f_mapping)) {
@@ -2831,7 +2852,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
fput(file);
} else {
result = hpage_collapse_scan_pmd(mm, vma, addr,
- &mmap_locked, cc);
+ &mmap_locked, NULL, cc);
}
if (!mmap_locked)
*lock_dropped = true;
--
2.51.0
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-01 12:25 ` [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number Vernon Yang
@ 2026-02-04 21:35 ` David Hildenbrand (arm)
2026-02-05 6:08 ` Vernon Yang
0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand (arm) @ 2026-02-04 21:35 UTC (permalink / raw)
To: Vernon Yang, akpm
Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
[...]
> + if (cur_progress) {
> + if (_pte >= pte + HPAGE_PMD_NR)
> + *cur_progress = HPAGE_PMD_NR;
> + else
> + *cur_progress = _pte - pte + 1;
*cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
?
It's still a bit nasty, though.
Can't we just add one at the beginning of the loop and let the compiler
optimize that? ;)
> + }
> pte_unmap_unlock(pte, ptl);
> if (result == SCAN_SUCCEED) {
> result = collapse_huge_page(mm, start_addr, referenced,
> @@ -2286,8 +2301,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> return result;
> }
>
> -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> - struct file *file, pgoff_t start, struct collapse_control *cc)
> +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
> + unsigned long addr, struct file *file, pgoff_t start,
> + unsigned int *cur_progress, struct collapse_control *cc)
> {
> struct folio *folio = NULL;
> struct address_space *mapping = file->f_mapping;
> @@ -2376,6 +2392,8 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
> cond_resched_rcu();
> }
> }
> + if (cur_progress)
> + *cur_progress = max(xas.xa_index - start, 1UL);
I would really just keep it simple here and do a
*cur_progress = HPAGE_PMD_NR;
This stuff is hard to reason about, so I would just leave the file case
essentially unchanged.
IIRC, it would not affect the numbers you report in the patch description?
--
Cheers,
David
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-04 21:35 ` David Hildenbrand (arm)
@ 2026-02-05 6:08 ` Vernon Yang
2026-02-05 12:07 ` Dev Jain
2026-02-05 12:11 ` David Hildenbrand (arm)
0 siblings, 2 replies; 26+ messages in thread
From: Vernon Yang @ 2026-02-05 6:08 UTC (permalink / raw)
To: David Hildenbrand (arm)
Cc: akpm, lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang,
linux-mm, linux-kernel, Vernon Yang
On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm) <david@kernel.org> wrote:
>
> [...]
>
> > + if (cur_progress) {
> > + if (_pte >= pte + HPAGE_PMD_NR)
> > + *cur_progress = HPAGE_PMD_NR;
> > + else
> > + *cur_progress = _pte - pte + 1;
>
> *cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
> ?
>
> It's still a bit nasty, though.
>
> Can't we just add one at the beginning of the loop and let the compiler
> optimize that? ;)
I'm also worried that the compiler can't optimize this since the body of
the loop is complex, as with Dev's opinion [1].
[1] https://lore.kernel.org/linux-mm/7c4b5933-7bbd-4ad7-baef-830304a09485@arm.com
If you have a strong recommendation for this, please let me know, Thanks!
> > + }
> > pte_unmap_unlock(pte, ptl);
> > if (result == SCAN_SUCCEED) {
> > result = collapse_huge_page(mm, start_addr, referenced,
> > @@ -2286,8 +2301,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
> > return result;
> > }
> >
> > -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
> > - struct file *file, pgoff_t start, struct collapse_control *cc)
> > +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
> > + unsigned long addr, struct file *file, pgoff_t start,
> > + unsigned int *cur_progress, struct collapse_control *cc)
> > {
> > struct folio *folio = NULL;
> > struct address_space *mapping = file->f_mapping;
> > @@ -2376,6 +2392,8 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
> > cond_resched_rcu();
> > }
> > }
> > + if (cur_progress)
> > + *cur_progress = max(xas.xa_index - start, 1UL);
> I would really just keep it simple here and do a
>
> *cur_progress = HPAGE_PMD_NR;
>
> This stuff is hard to reason about, so I would just leave the file case
> essentially unchanged.
>
> IIRC, it would not affect the numbers you report in the patch description?
Yes, let's keep it simple: always equal to HPAGE_PMD_NR in the file case.
--
Thanks,
Vernon
2026-02-05 6:08 ` Vernon Yang
@ 2026-02-05 12:07 ` Dev Jain
2026-02-05 12:28 ` David Hildenbrand (Arm)
2026-02-05 12:11 ` David Hildenbrand (arm)
1 sibling, 1 reply; 26+ messages in thread
From: Dev Jain @ 2026-02-05 12:07 UTC (permalink / raw)
To: Vernon Yang, David Hildenbrand (arm)
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 05/02/26 11:38 am, Vernon Yang wrote:
> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm) <david@kernel.org> wrote:
>> [...]
>>
>>> + if (cur_progress) {
>>> + if (_pte >= pte + HPAGE_PMD_NR)
>>> + *cur_progress = HPAGE_PMD_NR;
>>> + else
>>> + *cur_progress = _pte - pte + 1;
>> *cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>
>> ?
>>
>> It's still a bit nasty, though.
>>
>> Can't we just add one at the beginning of the loop and let the compiler
>> optimize that? ;)
> I'm also worried that the compiler can't optimize this since the body of
> the loop is complex, as with Dev's opinion [1].
>
> [1] https://lore.kernel.org/linux-mm/7c4b5933-7bbd-4ad7-baef-830304a09485@arm.com
>
> If you have a strong recommendation for this, please let me know, Thanks!
I haven't explicitly checked with assembly, but I am fairly sure this won't get optimized.
There are two cases where it could have been optimized:
1) Had the compiler inlined hpage_collapse_scan_pmd
2) Had the compiler done something like
if (p) -> foo(), where foo() contains the complete for loop, with the increment
else -> bar(), where bar() contains the complete for loop, without the increment
Both of which are highly unlikely because of the complexity of the function.
>
>>> + }
>>> pte_unmap_unlock(pte, ptl);
>>> if (result == SCAN_SUCCEED) {
>>> result = collapse_huge_page(mm, start_addr, referenced,
>>> @@ -2286,8 +2301,9 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>>> return result;
>>> }
>>>
>>> -static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
>>> - struct file *file, pgoff_t start, struct collapse_control *cc)
>>> +static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
>>> + unsigned long addr, struct file *file, pgoff_t start,
>>> + unsigned int *cur_progress, struct collapse_control *cc)
>>> {
>>> struct folio *folio = NULL;
>>> struct address_space *mapping = file->f_mapping;
>>> @@ -2376,6 +2392,8 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm, unsigned
>>> cond_resched_rcu();
>>> }
>>> }
>>> + if (cur_progress)
>>> + *cur_progress = max(xas.xa_index - start, 1UL);
>> I would really just keep it simple here and do a
>>
>> *cur_progress = HPAGE_PMD_NR;
>>
>> This stuff is hard to reason about, so I would just leave the file case
>> essentially unchanged.
>>
>> IIRC, it would not affect the numbers you report in the patch description?
> Yes, Let's keep it simple, always equal to HPAGE_PMD_NR in file case.
>
> --
> Thanks,
> Vernon
2026-02-05 12:07 ` Dev Jain
@ 2026-02-05 12:28 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-05 12:28 UTC (permalink / raw)
To: Dev Jain, Vernon Yang
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 2/5/26 13:07, Dev Jain wrote:
>
> On 05/02/26 11:38 am, Vernon Yang wrote:
>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm) <david@kernel.org> wrote:
>>> [...]
>>>
>>> *cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>>
>>> ?
>>>
>>> It's still a bit nasty, though.
>>>
>>> Can't we just add one at the beginning of the loop and let the compiler
>>> optimize that? ;)
>> I'm also worried that the compiler can't optimize this since the body of
>> the loop is complex, as with Dev's opinion [1].
>>
>> [1] https://lore.kernel.org/linux-mm/7c4b5933-7bbd-4ad7-baef-830304a09485@arm.com
>>
>> If you have a strong recommendation for this, please let me know, Thanks!
>
> I haven't explicitly checked with assembly, but I am fairly sure this won't get optimized.
> There are two cases where it could have been optimized:
>
> 1) Had the compiler inlined hpage_collapse_scan_pmd
Yeah, there are two callers so that likely does not happen.
> 2) Had the compiler done something like
> if (p) -> foo(), where foo() contains the complete for loop, with the increment
> else -> bar(), where bar() contains the complete for loop, without the increment
>
> Both of which are highly unlikely because of the complexity of the function.
Not sure if the compiler would be able to optimize this out in
non-inlined cases, either. In any case, I wonder if this must be
optimized at all ...
--
Cheers,
David
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-05 6:08 ` Vernon Yang
2026-02-05 12:07 ` Dev Jain
@ 2026-02-05 12:11 ` David Hildenbrand (arm)
2026-02-05 14:25 ` Dev Jain
1 sibling, 1 reply; 26+ messages in thread
From: David Hildenbrand (arm) @ 2026-02-05 12:11 UTC (permalink / raw)
To: Vernon Yang
Cc: akpm, lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang,
linux-mm, linux-kernel, Vernon Yang
On 2/5/26 07:08, Vernon Yang wrote:
> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm) <david@kernel.org> wrote:
>>
>> [...]
>>
>>> + if (cur_progress) {
>>> + if (_pte >= pte + HPAGE_PMD_NR)
>>> + *cur_progress = HPAGE_PMD_NR;
>>> + else
>>> + *cur_progress = _pte - pte + 1;
>>
>> *cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
>
> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
Yes!
>
>> ?
>>
>> It's still a bit nasty, though.
>>
>> Can't we just add one at the beginning of the loop and let the compiler
>> optimize that? ;)
>
> I'm also worried that the compiler can't optimize this since the body of
> the loop is complex, as with Dev's opinion [1].
Why do we even have to optimize this? :)
Premature ... ? :)
--
Cheers,
David
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-05 12:11 ` David Hildenbrand (arm)
@ 2026-02-05 14:25 ` Dev Jain
2026-02-05 14:30 ` Dev Jain
2026-02-06 9:02 ` David Hildenbrand (Arm)
0 siblings, 2 replies; 26+ messages in thread
From: Dev Jain @ 2026-02-05 14:25 UTC (permalink / raw)
To: David Hildenbrand (arm), Vernon Yang
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
> On 2/5/26 07:08, Vernon Yang wrote:
>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
>> <david@kernel.org> wrote:
>>>
>>> [...]
>>>
>>>> + if (cur_progress) {
>>>> + if (_pte >= pte + HPAGE_PMD_NR)
>>>> + *cur_progress = HPAGE_PMD_NR;
>>>> + else
>>>> + *cur_progress = _pte - pte + 1;
>>>
>>> *cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
>>
>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>
> Yes!
>
>>
>>> ?
>>>
>>> It's still a bit nasty, though.
>>>
>>> Can't we just add one at the beginning of the loop and let the compiler
>>> optimize that? ;)
>>
>> I'm also worried that the compiler can't optimize this since the body of
>> the loop is complex, as with Dev's opinion [1].
>
> Why do we even have to optimize this? :)
>
> Premature ... ? :)
I mean .... we don't, but the alternate is a one liner using max().
The objective is to compute the number of iterations of the for-loop.
It just seems weird to me to track that in the loop, when we have the
loop iterator, which *literally* does that only.
Anyhow, I won't shout in any case : ) If you deem incrementing in the
loop prettier, that's fine.
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-05 14:25 ` Dev Jain
@ 2026-02-05 14:30 ` Dev Jain
2026-02-06 9:03 ` David Hildenbrand (Arm)
2026-02-06 9:02 ` David Hildenbrand (Arm)
1 sibling, 1 reply; 26+ messages in thread
From: Dev Jain @ 2026-02-05 14:30 UTC (permalink / raw)
To: David Hildenbrand (arm), Vernon Yang
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 05/02/26 7:55 pm, Dev Jain wrote:
> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
>> On 2/5/26 07:08, Vernon Yang wrote:
>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
>>> <david@kernel.org> wrote:
>>>> [...]
>>>>
>>>>> + if (cur_progress) {
>>>>> + if (_pte >= pte + HPAGE_PMD_NR)
>>>>> + *cur_progress = HPAGE_PMD_NR;
>>>>> + else
>>>>> + *cur_progress = _pte - pte + 1;
>>>> *cur_progress = max(_pte - pte + 1, HPAGE_PMD_NR);
>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>> Yes!
>>
>>>> ?
>>>>
>>>> It's still a bit nasty, though.
>>>>
>>>> Can't we just add one at the beginning of the loop and let the compiler
>>>> optimize that? ;)
>>> I'm also worried that the compiler can't optimize this since the body of
>>> the loop is complex, as with Dev's opinion [1].
>> Why do we even have to optimize this? :)
>>
>> Premature ... ? :)
>
> I mean .... we don't, but the alternate is a one liner using max().
>
> The objective is to compute the number of iterations of the for-loop.
>
> It just seems weird to me to track that in the loop, when we have the
>
> loop iterator, which *literally* does that only.
I realize I shouldn't have bolded out the "literally" - below I wrote that
I won't shout, but the bold seems like shouting :)
>
>
>
> Anyhow, I won't shout in any case : ) If you deem incrementing in the
>
> loop prettier, that's fine.
>
>
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-05 14:30 ` Dev Jain
@ 2026-02-06 9:03 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-06 9:03 UTC (permalink / raw)
To: Dev Jain, Vernon Yang
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 2/5/26 15:30, Dev Jain wrote:
>
> On 05/02/26 7:55 pm, Dev Jain wrote:
>> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
>>> Yes!
>>>
>>> Why do we even have to optimize this? :)
>>>
>>> Premature ... ? :)
>>
>> I mean .... we don't, but the alternate is a one liner using max().
>>
>> The objective is to compute the number of iterations of the for-loop.
>>
>> It just seems weird to me to track that in the loop, when we have the
>>
>> loop iterator, which *literally* does that only.
>
> I realize I shouldn't have bolded out the "literally" - below I wrote that
> I won't shout, but the bold seems like shouting :)
Heh.
The thing is that the loop iterator does not quite what we want,
otherwise we wouldn't have to mess with max() etc.
--
Cheers,
David
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-05 14:25 ` Dev Jain
2026-02-05 14:30 ` Dev Jain
@ 2026-02-06 9:02 ` David Hildenbrand (Arm)
2026-02-06 10:00 ` Dev Jain
2026-02-06 11:12 ` Vernon Yang
1 sibling, 2 replies; 26+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-06 9:02 UTC (permalink / raw)
To: Dev Jain, Vernon Yang
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 2/5/26 15:25, Dev Jain wrote:
>
> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
>> On 2/5/26 07:08, Vernon Yang wrote:
>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
>>> <david@kernel.org> wrote:
>>>
>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>>
>> Yes!
>>
>>>
>>>
>>> I'm also worried that the compiler can't optimize this since the body of
>>> the loop is complex, as with Dev's opinion [1].
>>
>> Why do we even have to optimize this? :)
>>
>> Premature ... ? :)
>
>
> I mean .... we don't, but the alternate is a one liner using max().
I'm fine with the max(), but it still seems like adding complexity to
optimize something that is nowhere prove to really be a problem.
--
Cheers,
David
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-06 9:02 ` David Hildenbrand (Arm)
@ 2026-02-06 10:00 ` Dev Jain
2026-02-06 11:10 ` David Hildenbrand (Arm)
2026-02-06 11:12 ` Vernon Yang
1 sibling, 1 reply; 26+ messages in thread
From: Dev Jain @ 2026-02-06 10:00 UTC (permalink / raw)
To: David Hildenbrand (Arm), Vernon Yang
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 06/02/26 2:32 pm, David Hildenbrand (Arm) wrote:
> On 2/5/26 15:25, Dev Jain wrote:
>>
>> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
>>> On 2/5/26 07:08, Vernon Yang wrote:
>>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
>>>> <david@kernel.org> wrote:
>>>>
>>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>>>
>>> Yes!
>>>
>>>>
>>>>
>>>> I'm also worried that the compiler can't optimize this since the body of
>>>> the loop is complex, as with Dev's opinion [1].
>>>
>>> Why do we even have to optimize this? :)
>>>
>>> Premature ... ? :)
>>
>>
>> I mean .... we don't, but the alternate is a one liner using max().
>
> I'm fine with the max(), but it still seems like adding complexity to
> optimize something that is nowhere prove to really be a problem.
Agreed. Vernon, let us do the increment in the loop then.
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-06 10:00 ` Dev Jain
@ 2026-02-06 11:10 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-06 11:10 UTC (permalink / raw)
To: Dev Jain, Vernon Yang
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 2/6/26 11:00, Dev Jain wrote:
>
> On 06/02/26 2:32 pm, David Hildenbrand (Arm) wrote:
>> On 2/5/26 15:25, Dev Jain wrote:
>>>
>>>
>>>
>>> I mean .... we don't, but the alternate is a one liner using max().
>>
>> I'm fine with the max(), but it still seems like adding complexity to
>> optimize something that is nowhere prove to really be a problem.
>
> Agreed. Vernon, let us do the increment in the loop then.
I'm fine with the min(), so if you both think it's better, let's do that!
It makes it slightly harder to understand what's happening, but
fortunately, if we mess up slightly nobody will really notice :)
--
Cheers,
David
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-06 9:02 ` David Hildenbrand (Arm)
2026-02-06 10:00 ` Dev Jain
@ 2026-02-06 11:12 ` Vernon Yang
2026-02-06 13:52 ` Lance Yang
2026-02-08 9:05 ` Dev Jain
1 sibling, 2 replies; 26+ messages in thread
From: Vernon Yang @ 2026-02-06 11:12 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Dev Jain, akpm, lorenzo.stoakes, ziy, baohua, lance.yang,
linux-mm, linux-kernel, Vernon Yang
On Fri, Feb 06, 2026 at 10:02:48AM +0100, David Hildenbrand (Arm) wrote:
> On 2/5/26 15:25, Dev Jain wrote:
> >
> > On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
> > > On 2/5/26 07:08, Vernon Yang wrote:
> > > > On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
> > > > <david@kernel.org> wrote:
> > > >
> > > > I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
> > >
> > > Yes!
> > >
> > > >
> > > >
> > > > I'm also worried that the compiler can't optimize this since the body of
> > > > the loop is complex, as with Dev's opinion [1].
> > >
> > > Why do we even have to optimize this? :)
> > >
> > > Premature ... ? :)
> >
> >
> > I mean .... we don't, but the alternate is a one liner using max().
>
> I'm fine with the max(), but it still seems like adding complexity to
> optimize something that is nowhere prove to really be a problem.
Hi David, Dev,
I used "*cur_progress += 1" at the beginning of the loop, and the
compiler optimizes it. The assembly is as follows:
60c1: 4d 29 ca sub %r9,%r10 // r10 is _pte, r9 is pte, r10 = _pte - pte
60c4: b8 00 02 00 00 mov $0x200,%eax // eax = HPAGE_PMD_NR
60c9: 44 89 5c 24 10 mov %r11d,0x10(%rsp) //
60ce: 49 c1 fa 03 sar $0x3,%r10 //
60d2: 49 83 c2 01 add $0x1,%r10 // r10 += 1
60d6: 49 39 c2 cmp %rax,%r10 // r10 = min(r10, eax)
60d9: 4c 0f 4f d0 cmovg %rax,%r10 //
60dd: 44 89 55 00 mov %r10d,0x0(%rbp) // *cur_progress = r10
To make the code simpler, let us use "*cur_progress += 1".
--
Thanks,
Vernon
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-06 11:12 ` Vernon Yang
@ 2026-02-06 13:52 ` Lance Yang
2026-02-08 9:05 ` Dev Jain
1 sibling, 0 replies; 26+ messages in thread
From: Lance Yang @ 2026-02-06 13:52 UTC (permalink / raw)
To: Vernon Yang, David Hildenbrand (Arm), Dev Jain
Cc: akpm, lorenzo.stoakes, ziy, baohua, linux-mm, linux-kernel, Vernon Yang
On 2026/2/6 19:12, Vernon Yang wrote:
> On Fri, Feb 06, 2026 at 10:02:48AM +0100, David Hildenbrand (Arm) wrote:
>> On 2/5/26 15:25, Dev Jain wrote:
>>>
>>> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
>>>> On 2/5/26 07:08, Vernon Yang wrote:
>>>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
>>>>> <david@kernel.org> wrote:
>>>>>
>>>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>>>>
>>>> Yes!
>>>>
>>>>>
>>>>>
>>>>> I'm also worried that the compiler can't optimize this since the body of
>>>>> the loop is complex, as with Dev's opinion [1].
>>>>
>>>> Why do we even have to optimize this? :)
>>>>
>>>> Premature ... ? :)
>>>
>>>
>>> I mean .... we don't, but the alternate is a one liner using max().
>>
>> I'm fine with the max(), but it still seems like adding complexity to
>> optimize something that is nowhere prove to really be a problem.
>
> Hi David, Dev,
>
> I use "*cur_progress += 1" at the beginning of the loop, the compiler
> optimize that. Assembly as follows:
>
> 60c1: 4d 29 ca sub %r9,%r10 // r10 is _pte, r9 is pte, r10 = _pte - pte
> 60c4: b8 00 02 00 00 mov $0x200,%eax // eax = HPAGE_PMD_NR
> 60c9: 44 89 5c 24 10 mov %r11d,0x10(%rsp) //
> 60ce: 49 c1 fa 03 sar $0x3,%r10 //
> 60d2: 49 83 c2 01 add $0x1,%r10 // r10 += 1
> 60d6: 49 39 c2 cmp %rax,%r10 // r10 = min(r10, eax)
> 60d9: 4c 0f 4f d0 cmovg %rax,%r10 //
> 60dd: 44 89 55 00 mov %r10d,0x0(%rbp) // *cur_progress = r10
>
> To make the code simpler, Let us use "*cur_progress += 1".
Cool! The compiler did the right thing and the heavy lifting after all - we get
to keep it simple :p
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-06 11:12 ` Vernon Yang
2026-02-06 13:52 ` Lance Yang
@ 2026-02-08 9:05 ` Dev Jain
2026-02-08 9:32 ` Lance Yang
2026-02-08 13:23 ` Vernon Yang
1 sibling, 2 replies; 26+ messages in thread
From: Dev Jain @ 2026-02-08 9:05 UTC (permalink / raw)
To: Vernon Yang, David Hildenbrand (Arm)
Cc: akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 06/02/26 4:42 pm, Vernon Yang wrote:
> On Fri, Feb 06, 2026 at 10:02:48AM +0100, David Hildenbrand (Arm) wrote:
>> On 2/5/26 15:25, Dev Jain wrote:
>>> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
>>>> On 2/5/26 07:08, Vernon Yang wrote:
>>>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
>>>>> <david@kernel.org> wrote:
>>>>>
>>>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>>>> Yes!
>>>>
>>>>>
>>>>> I'm also worried that the compiler can't optimize this since the body of
>>>>> the loop is complex, as with Dev's opinion [1].
>>>> Why do we even have to optimize this? :)
>>>>
>>>> Premature ... ? :)
>>>
>>> I mean .... we don't, but the alternate is a one liner using max().
>> I'm fine with the max(), but it still seems like adding complexity to
>> optimize something that is nowhere prove to really be a problem.
> Hi David, Dev,
>
> I use "*cur_progress += 1" at the beginning of the loop, the compiler
> optimize that. Assembly as follows:
>
> 60c1: 4d 29 ca sub %r9,%r10 // r10 is _pte, r9 is pte, r10 = _pte - pte
> 60c4: b8 00 02 00 00 mov $0x200,%eax // eax = HPAGE_PMD_NR
> 60c9: 44 89 5c 24 10 mov %r11d,0x10(%rsp) //
> 60ce: 49 c1 fa 03 sar $0x3,%r10 //
> 60d2: 49 83 c2 01 add $0x1,%r10 // r10 += 1
> 60d6: 49 39 c2 cmp %rax,%r10 // r10 = min(r10, eax)
> 60d9: 4c 0f 4f d0 cmovg %rax,%r10 //
> 60dd: 44 89 55 00 mov %r10d,0x0(%rbp) // *cur_progress = r10
>
> To make the code simpler, Let us use "*cur_progress += 1".
Wow! Wasn't expecting that. What's your gcc version? I checked with
gcc 11.4.0 (looks pretty old) with both x86 and arm64, and it couldn't
optimize.
> --
> Thanks,
> Vernon
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-08 9:05 ` Dev Jain
@ 2026-02-08 9:32 ` Lance Yang
2026-02-08 13:23 ` Vernon Yang
1 sibling, 0 replies; 26+ messages in thread
From: Lance Yang @ 2026-02-08 9:32 UTC (permalink / raw)
To: Dev Jain, Vernon Yang, David Hildenbrand (Arm)
Cc: akpm, lorenzo.stoakes, ziy, baohua, linux-mm, linux-kernel, Vernon Yang
On 2026/2/8 17:05, Dev Jain wrote:
>
> On 06/02/26 4:42 pm, Vernon Yang wrote:
>> On Fri, Feb 06, 2026 at 10:02:48AM +0100, David Hildenbrand (Arm) wrote:
>>> On 2/5/26 15:25, Dev Jain wrote:
>>>> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
>>>>> On 2/5/26 07:08, Vernon Yang wrote:
>>>>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
>>>>>> <david@kernel.org> wrote:
>>>>>>
>>>>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
>>>>> Yes!
>>>>>
>>>>>>
>>>>>> I'm also worried that the compiler can't optimize this since the body of
>>>>>> the loop is complex, as with Dev's opinion [1].
>>>>> Why do we even have to optimize this? :)
>>>>>
>>>>> Premature ... ? :)
>>>>
>>>> I mean .... we don't, but the alternate is a one liner using max().
>>> I'm fine with the max(), but it still seems like adding complexity to
>>> optimize something that is nowhere prove to really be a problem.
>> Hi David, Dev,
>>
>> I use "*cur_progress += 1" at the beginning of the loop, the compiler
>> optimize that. Assembly as follows:
>>
>> 60c1: 4d 29 ca sub %r9,%r10 // r10 is _pte, r9 is pte, r10 = _pte - pte
>> 60c4: b8 00 02 00 00 mov $0x200,%eax // eax = HPAGE_PMD_NR
>> 60c9: 44 89 5c 24 10 mov %r11d,0x10(%rsp) //
>> 60ce: 49 c1 fa 03 sar $0x3,%r10 //
>> 60d2: 49 83 c2 01 add $0x1,%r10 // r10 += 1
>> 60d6: 49 39 c2 cmp %rax,%r10 // r10 = min(r10, eax)
>> 60d9: 4c 0f 4f d0 cmovg %rax,%r10 //
>> 60dd: 44 89 55 00 mov %r10d,0x0(%rbp) // *cur_progress = r10
>>
>> To make the code simpler, Let us use "*cur_progress += 1".
>
> Wow! Wasn't expecting that. What's your gcc version? I checked with
> gcc 11.4.0 (looks pretty old) with both x86 and arm64, and it couldn't
> optimize.
FWIW, 11.4.0 is newer than the minimum GCC version (8.1) required by the
kernel. See Documentation/process/changes.rst.
The optimization might just be version-dependent :)
* Re: [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number
2026-02-08 9:05 ` Dev Jain
2026-02-08 9:32 ` Lance Yang
@ 2026-02-08 13:23 ` Vernon Yang
1 sibling, 0 replies; 26+ messages in thread
From: Vernon Yang @ 2026-02-08 13:23 UTC (permalink / raw)
To: Dev Jain
Cc: David Hildenbrand (Arm),
akpm, lorenzo.stoakes, ziy, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On Sun, Feb 8, 2026 at 5:05 PM Dev Jain <dev.jain@arm.com> wrote:
>
> On 06/02/26 4:42 pm, Vernon Yang wrote:
> > On Fri, Feb 06, 2026 at 10:02:48AM +0100, David Hildenbrand (Arm) wrote:
> >> On 2/5/26 15:25, Dev Jain wrote:
> >>> On 05/02/26 5:41 pm, David Hildenbrand (arm) wrote:
> >>>> On 2/5/26 07:08, Vernon Yang wrote:
> >>>>> On Thu, Feb 5, 2026 at 5:35 AM David Hildenbrand (arm)
> >>>>> <david@kernel.org> wrote:
> >>>>>
> >>>>> I guess, your meaning is "min(_pte - pte + 1, HPAGE_PMD_NR)", not max().
> >>>> Yes!
> >>>>
> >>>>>
> >>>>> I'm also worried that the compiler can't optimize this since the body of
> >>>>> the loop is complex, as with Dev's opinion [1].
> >>>> Why do we even have to optimize this? :)
> >>>>
> >>>> Premature ... ? :)
> >>>
> >>> I mean .... we don't, but the alternate is a one liner using max().
> >> I'm fine with the max(), but it still seems like adding complexity to
> >> optimize something that is nowhere prove to really be a problem.
> > Hi David, Dev,
> >
> > I use "*cur_progress += 1" at the beginning of the loop, the compiler
> > optimize that. Assembly as follows:
> >
> > 60c1: 4d 29 ca sub %r9,%r10 // r10 is _pte, r9 is pte, r10 = _pte - pte
> > 60c4: b8 00 02 00 00 mov $0x200,%eax // eax = HPAGE_PMD_NR
> > 60c9: 44 89 5c 24 10 mov %r11d,0x10(%rsp) //
> > 60ce: 49 c1 fa 03 sar $0x3,%r10 //
> > 60d2: 49 83 c2 01 add $0x1,%r10 // r10 += 1
> > 60d6: 49 39 c2 cmp %rax,%r10 // r10 = min(r10, eax)
> > 60d9: 4c 0f 4f d0 cmovg %rax,%r10 //
> > 60dd: 44 89 55 00 mov %r10d,0x0(%rbp) // *cur_progress = r10
> >
> > To make the code simpler, Let us use "*cur_progress += 1".
>
> Wow! Wasn't expecting that. What's your gcc version? I checked with
> gcc 11.4.0 (looks pretty old) with both x86 and arm64, and it couldn't
> optimize.
$ gcc --version
gcc (GCC) 15.2.1 20250808 (Red Hat 15.2.1-1)
The above is my gcc version. However, when I performed the disassembly again,
there was no such optimization :(
I suspect I messed up the environment earlier and failed to build the newly
modified code, so the disassembly was of the old khugepaged.o.
* [PATCH mm-new v6 3/5] mm: add folio_test_lazyfree helper
2026-02-01 12:25 [PATCH mm-new v6 0/5] Improve khugepaged scan logic Vernon Yang
2026-02-01 12:25 ` [PATCH mm-new v6 1/5] mm: khugepaged: add trace_mm_khugepaged_scan event Vernon Yang
2026-02-01 12:25 ` [PATCH mm-new v6 2/5] mm: khugepaged: refine scan progress number Vernon Yang
@ 2026-02-01 12:25 ` Vernon Yang
2026-02-01 12:25 ` [PATCH mm-new v6 4/5] mm: khugepaged: skip lazy-free folios Vernon Yang
2026-02-01 12:25 ` [PATCH mm-new v6 5/5] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY Vernon Yang
4 siblings, 0 replies; 26+ messages in thread
From: Vernon Yang @ 2026-02-01 12:25 UTC (permalink / raw)
To: akpm, david
Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
From: Vernon Yang <yanglincheng@kylinos.cn>
Add a folio_test_lazyfree() function to identify lazy-free folios and improve
code readability.
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
---
include/linux/page-flags.h | 5 +++++
mm/rmap.c | 2 +-
mm/vmscan.c | 5 ++---
3 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f7a0e4af0c73..415e9f2ef616 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -724,6 +724,11 @@ static __always_inline bool folio_test_anon(const struct folio *folio)
return ((unsigned long)folio->mapping & FOLIO_MAPPING_ANON) != 0;
}
+static __always_inline bool folio_test_lazyfree(const struct folio *folio)
+{
+ return folio_test_anon(folio) && !folio_test_swapbacked(folio);
+}
+
static __always_inline bool PageAnonNotKsm(const struct page *page)
{
unsigned long flags = (unsigned long)page_folio(page)->mapping;
diff --git a/mm/rmap.c b/mm/rmap.c
index 618df3385c8b..ea55e12b3e87 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2049,7 +2049,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
}
if (!pvmw.pte) {
- if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
+ if (folio_test_lazyfree(folio)) {
if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
goto walk_done;
/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 01d3364fe506..17d039bdbb53 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -963,8 +963,7 @@ static void folio_check_dirty_writeback(struct folio *folio,
* They could be mistakenly treated as file lru. So further anon
* test is needed.
*/
- if (!folio_is_file_lru(folio) ||
- (folio_test_anon(folio) && !folio_test_swapbacked(folio))) {
+ if (!folio_is_file_lru(folio) || folio_test_lazyfree(folio)) {
*dirty = false;
*writeback = false;
return;
@@ -1508,7 +1507,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
}
}
- if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
+ if (folio_test_lazyfree(folio)) {
/* follow __remove_mapping for reference */
if (!folio_ref_freeze(folio, 1))
goto keep_locked;
--
2.51.0
* [PATCH mm-new v6 4/5] mm: khugepaged: skip lazy-free folios
2026-02-01 12:25 [PATCH mm-new v6 0/5] Improve khugepaged scan logic Vernon Yang
` (2 preceding siblings ...)
2026-02-01 12:25 ` [PATCH mm-new v6 3/5] mm: add folio_test_lazyfree helper Vernon Yang
@ 2026-02-01 12:25 ` Vernon Yang
2026-02-03 11:23 ` Lance Yang
2026-02-04 21:23 ` David Hildenbrand (arm)
2026-02-01 12:25 ` [PATCH mm-new v6 5/5] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY Vernon Yang
4 siblings, 2 replies; 26+ messages in thread
From: Vernon Yang @ 2026-02-01 12:25 UTC (permalink / raw)
To: akpm, david
Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
From: Vernon Yang <yanglincheng@kylinos.cn>
For example, create three tasks: hot1 -> cold -> hot2. After all three
tasks are created, each allocates 128MB of memory. The hot1/hot2 tasks
continuously access their 128MB of memory, while the cold task only accesses
its memory briefly and then calls madvise(MADV_FREE). However, khugepaged
still prioritizes scanning the cold task and only scans the hot2 task
after completing the scan of the cold task.
Moreover, if we collapse a region containing a lazy-free page, its content
will never become none again and the deferred shrinker cannot reclaim it.
So if the user has explicitly informed us via MADV_FREE that this memory
will be freed, it is appropriate for khugepaged to simply skip it, thereby
avoiding unnecessary scan and collapse operations and reducing CPU
waste.
Here are the performance test results:
(Throughput bigger is better, other smaller is better)
Testing on x86_64 machine:
| task hot2 | without patch | with patch | delta |
|---------------------|---------------|---------------|---------|
| total accesses time | 3.14 sec | 2.93 sec | -6.69% |
| cycles per access | 4.96 | 2.21 | -55.44% |
| Throughput | 104.38 M/sec | 111.89 M/sec | +7.19% |
| dTLB-load-misses | 284814532 | 69597236 | -75.56% |
Testing on qemu-system-x86_64 -enable-kvm:
| task hot2 | without patch | with patch | delta |
|---------------------|---------------|---------------|---------|
| total accesses time | 3.35 sec | 2.96 sec | -11.64% |
| cycles per access | 7.29 | 2.07 | -71.60% |
| Throughput | 97.67 M/sec | 110.77 M/sec | +13.41% |
| dTLB-load-misses | 241600871 | 3216108 | -98.67% |
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
include/trace/events/huge_memory.h | 1 +
mm/khugepaged.c | 13 +++++++++++++
2 files changed, 14 insertions(+)
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 384e29f6bef0..bcdc57eea270 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -25,6 +25,7 @@
EM( SCAN_PAGE_LRU, "page_not_in_lru") \
EM( SCAN_PAGE_LOCK, "page_locked") \
EM( SCAN_PAGE_ANON, "page_not_anon") \
+ EM( SCAN_PAGE_LAZYFREE, "page_lazyfree") \
EM( SCAN_PAGE_COMPOUND, "page_compound") \
EM( SCAN_ANY_PROCESS, "no_process_for_page") \
EM( SCAN_VMA_NULL, "vma_null") \
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index df22b2274d92..b4def001ccd0 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -46,6 +46,7 @@ enum scan_result {
SCAN_PAGE_LRU,
SCAN_PAGE_LOCK,
SCAN_PAGE_ANON,
+ SCAN_PAGE_LAZYFREE,
SCAN_PAGE_COMPOUND,
SCAN_ANY_PROCESS,
SCAN_VMA_NULL,
@@ -583,6 +584,12 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
folio = page_folio(page);
VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
+ if (cc->is_khugepaged && !pte_dirty(pteval) &&
+ folio_test_lazyfree(folio)) {
+ result = SCAN_PAGE_LAZYFREE;
+ goto out;
+ }
+
/* See hpage_collapse_scan_pmd(). */
if (folio_maybe_mapped_shared(folio)) {
++shared;
@@ -1332,6 +1339,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
}
folio = page_folio(page);
+ if (cc->is_khugepaged && !pte_dirty(pteval) &&
+ folio_test_lazyfree(folio)) {
+ result = SCAN_PAGE_LAZYFREE;
+ goto out_unmap;
+ }
+
if (!folio_test_anon(folio)) {
result = SCAN_PAGE_ANON;
goto out_unmap;
--
2.51.0
* Re: [PATCH mm-new v6 4/5] mm: khugepaged: skip lazy-free folios
2026-02-01 12:25 ` [PATCH mm-new v6 4/5] mm: khugepaged: skip lazy-free folios Vernon Yang
@ 2026-02-03 11:23 ` Lance Yang
2026-02-05 6:01 ` Vernon Yang
2026-02-04 21:23 ` David Hildenbrand (arm)
1 sibling, 1 reply; 26+ messages in thread
From: Lance Yang @ 2026-02-03 11:23 UTC (permalink / raw)
To: Vernon Yang
Cc: lorenzo.stoakes, ziy, dev.jain, baohua, linux-mm, linux-kernel,
Vernon Yang, david, akpm
On 2026/2/1 20:25, Vernon Yang wrote:
> From: Vernon Yang <yanglincheng@kylinos.cn>
>
> For example, create three task: hot1 -> cold -> hot2. After all three
> task are created, each allocate memory 128MB. the hot1/hot2 task
> continuously access 128 MB memory, while the cold task only accesses
> its memory briefly andthen call madvise(MADV_FREE). However, khugepaged
s/andthen/and then/
> still prioritizes scanning the cold task and only scans the hot2 task
> after completing the scan of the cold task.
>
> And if we collapse with a lazyfree page, that content will never be none
> and the deferred shrinker cannot reclaim them.
>
> So if the user has explicitly informed us via MADV_FREE that this memory
> will be freed, it is appropriate for khugepaged to skip it only, thereby
> avoiding unnecessary scan and collapse operations to reducing CPU
> wastage.
>
> Here are the performance test results:
> (Throughput bigger is better, other smaller is better)
>
> Testing on x86_64 machine:
>
> | task hot2 | without patch | with patch | delta |
> |---------------------|---------------|---------------|---------|
> | total accesses time | 3.14 sec | 2.93 sec | -6.69% |
> | cycles per access | 4.96 | 2.21 | -55.44% |
> | Throughput | 104.38 M/sec | 111.89 M/sec | +7.19% |
> | dTLB-load-misses | 284814532 | 69597236 | -75.56% |
>
> Testing on qemu-system-x86_64 -enable-kvm:
>
> | task hot2 | without patch | with patch | delta |
> |---------------------|---------------|---------------|---------|
> | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
> | cycles per access | 7.29 | 2.07 | -71.60% |
> | Throughput | 97.67 M/sec | 110.77 M/sec | +13.41% |
> | dTLB-load-misses | 241600871 | 3216108 | -98.67% |
>
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> ---
> include/trace/events/huge_memory.h | 1 +
> mm/khugepaged.c | 13 +++++++++++++
> 2 files changed, 14 insertions(+)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 384e29f6bef0..bcdc57eea270 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -25,6 +25,7 @@
> EM( SCAN_PAGE_LRU, "page_not_in_lru") \
> EM( SCAN_PAGE_LOCK, "page_locked") \
> EM( SCAN_PAGE_ANON, "page_not_anon") \
> + EM( SCAN_PAGE_LAZYFREE, "page_lazyfree") \
> EM( SCAN_PAGE_COMPOUND, "page_compound") \
> EM( SCAN_ANY_PROCESS, "no_process_for_page") \
> EM( SCAN_VMA_NULL, "vma_null") \
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index df22b2274d92..b4def001ccd0 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -46,6 +46,7 @@ enum scan_result {
> SCAN_PAGE_LRU,
> SCAN_PAGE_LOCK,
> SCAN_PAGE_ANON,
> + SCAN_PAGE_LAZYFREE,
> SCAN_PAGE_COMPOUND,
> SCAN_ANY_PROCESS,
> SCAN_VMA_NULL,
> @@ -583,6 +584,12 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> folio = page_folio(page);
> VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>
> + if (cc->is_khugepaged && !pte_dirty(pteval) &&
> + folio_test_lazyfree(folio)) {
> + result = SCAN_PAGE_LAZYFREE;
> + goto out;
> + }
> +
> /* See hpage_collapse_scan_pmd(). */
> if (folio_maybe_mapped_shared(folio)) {
> ++shared;
> @@ -1332,6 +1339,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> }
> folio = page_folio(page);
>
> + if (cc->is_khugepaged && !pte_dirty(pteval) &&
> + folio_test_lazyfree(folio)) {
> + result = SCAN_PAGE_LAZYFREE;
> + goto out_unmap;
> + }
> +
> if (!folio_test_anon(folio)) {
> result = SCAN_PAGE_ANON;
> goto out_unmap;
Nothing else jumped at me, LGTM.
Reviewed-by: Lance Yang <lance.yang@linux.dev>
* Re: [PATCH mm-new v6 4/5] mm: khugepaged: skip lazy-free folios
2026-02-03 11:23 ` Lance Yang
@ 2026-02-05 6:01 ` Vernon Yang
0 siblings, 0 replies; 26+ messages in thread
From: Vernon Yang @ 2026-02-05 6:01 UTC (permalink / raw)
To: Lance Yang
Cc: lorenzo.stoakes, ziy, dev.jain, baohua, linux-mm, linux-kernel,
Vernon Yang, david, akpm
On Tue, Feb 3, 2026 at 7:23 PM Lance Yang <lance.yang@linux.dev> wrote:
>
> On 2026/2/1 20:25, Vernon Yang wrote:
> > From: Vernon Yang <yanglincheng@kylinos.cn>
> >
> > For example, create three task: hot1 -> cold -> hot2. After all three
> > task are created, each allocate memory 128MB. the hot1/hot2 task
> > continuously access 128 MB memory, while the cold task only accesses
> > its memory briefly andthen call madvise(MADV_FREE). However, khugepaged
>
> s/andthen/and then/
LGTM. Thank you for the review and suggestion; I will fix it in the next version.
> > still prioritizes scanning the cold task and only scans the hot2 task
> > after completing the scan of the cold task.
> >
> > And if we collapse with a lazyfree page, that content will never be none
> > and the deferred shrinker cannot reclaim them.
> >
> > So if the user has explicitly informed us via MADV_FREE that this memory
> > will be freed, it is appropriate for khugepaged to skip it only, thereby
> > avoiding unnecessary scan and collapse operations to reducing CPU
> > wastage.
> >
> > Here are the performance test results:
> > (Throughput bigger is better, other smaller is better)
> >
> > Testing on x86_64 machine:
> >
> > | task hot2 | without patch | with patch | delta |
> > |---------------------|---------------|---------------|---------|
> > | total accesses time | 3.14 sec | 2.93 sec | -6.69% |
> > | cycles per access | 4.96 | 2.21 | -55.44% |
> > | Throughput | 104.38 M/sec | 111.89 M/sec | +7.19% |
> > | dTLB-load-misses | 284814532 | 69597236 | -75.56% |
> >
> > Testing on qemu-system-x86_64 -enable-kvm:
> >
> > | task hot2 | without patch | with patch | delta |
> > |---------------------|---------------|---------------|---------|
> > | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
> > | cycles per access | 7.29 | 2.07 | -71.60% |
> > | Throughput | 97.67 M/sec | 110.77 M/sec | +13.41% |
> > | dTLB-load-misses | 241600871 | 3216108 | -98.67% |
> >
> > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > ---
> > include/trace/events/huge_memory.h | 1 +
> > mm/khugepaged.c | 13 +++++++++++++
> > 2 files changed, 14 insertions(+)
> >
> > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > index 384e29f6bef0..bcdc57eea270 100644
> > --- a/include/trace/events/huge_memory.h
> > +++ b/include/trace/events/huge_memory.h
> > @@ -25,6 +25,7 @@
> > EM( SCAN_PAGE_LRU, "page_not_in_lru") \
> > EM( SCAN_PAGE_LOCK, "page_locked") \
> > EM( SCAN_PAGE_ANON, "page_not_anon") \
> > + EM( SCAN_PAGE_LAZYFREE, "page_lazyfree") \
> > EM( SCAN_PAGE_COMPOUND, "page_compound") \
> > EM( SCAN_ANY_PROCESS, "no_process_for_page") \
> > EM( SCAN_VMA_NULL, "vma_null") \
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index df22b2274d92..b4def001ccd0 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -46,6 +46,7 @@ enum scan_result {
> > SCAN_PAGE_LRU,
> > SCAN_PAGE_LOCK,
> > SCAN_PAGE_ANON,
> > + SCAN_PAGE_LAZYFREE,
> > SCAN_PAGE_COMPOUND,
> > SCAN_ANY_PROCESS,
> > SCAN_VMA_NULL,
> > @@ -583,6 +584,12 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > folio = page_folio(page);
> > VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
> >
> > + if (cc->is_khugepaged && !pte_dirty(pteval) &&
> > + folio_test_lazyfree(folio)) {
> > + result = SCAN_PAGE_LAZYFREE;
> > + goto out;
> > + }
> > +
> > /* See hpage_collapse_scan_pmd(). */
> > if (folio_maybe_mapped_shared(folio)) {
> > ++shared;
> > @@ -1332,6 +1339,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> > }
> > folio = page_folio(page);
> >
> > + if (cc->is_khugepaged && !pte_dirty(pteval) &&
> > + folio_test_lazyfree(folio)) {
> > + result = SCAN_PAGE_LAZYFREE;
> > + goto out_unmap;
> > + }
> > +
> > if (!folio_test_anon(folio)) {
> > result = SCAN_PAGE_ANON;
> > goto out_unmap;
>
> Nothing else jumped at me, LGTM.
>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
* Re: [PATCH mm-new v6 4/5] mm: khugepaged: skip lazy-free folios
2026-02-01 12:25 ` [PATCH mm-new v6 4/5] mm: khugepaged: skip lazy-free folios Vernon Yang
2026-02-03 11:23 ` Lance Yang
@ 2026-02-04 21:23 ` David Hildenbrand (arm)
2026-02-05 6:05 ` Vernon Yang
1 sibling, 1 reply; 26+ messages in thread
From: David Hildenbrand (arm) @ 2026-02-04 21:23 UTC (permalink / raw)
To: Vernon Yang, akpm
Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
On 2/1/26 13:25, Vernon Yang wrote:
> From: Vernon Yang <yanglincheng@kylinos.cn>
>
> For example, create three task: hot1 -> cold -> hot2. After all three
> task are created, each allocate memory 128MB. the hot1/hot2 task
> continuously access 128 MB memory, while the cold task only accesses
> its memory briefly andthen call madvise(MADV_FREE). However, khugepaged
> still prioritizes scanning the cold task and only scans the hot2 task
> after completing the scan of the cold task.
>
> And if we collapse with a lazyfree page, that content will never be none
> and the deferred shrinker cannot reclaim them.
>
> So if the user has explicitly informed us via MADV_FREE that this memory
> will be freed, it is appropriate for khugepaged to skip it only, thereby
> avoiding unnecessary scan and collapse operations to reducing CPU
> wastage.
>
> Here are the performance test results:
> (Throughput bigger is better, other smaller is better)
>
> Testing on x86_64 machine:
>
> | task hot2 | without patch | with patch | delta |
> |---------------------|---------------|---------------|---------|
> | total accesses time | 3.14 sec | 2.93 sec | -6.69% |
> | cycles per access | 4.96 | 2.21 | -55.44% |
> | Throughput | 104.38 M/sec | 111.89 M/sec | +7.19% |
> | dTLB-load-misses | 284814532 | 69597236 | -75.56% |
>
> Testing on qemu-system-x86_64 -enable-kvm:
>
> | task hot2 | without patch | with patch | delta |
> |---------------------|---------------|---------------|---------|
> | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
> | cycles per access | 7.29 | 2.07 | -71.60% |
> | Throughput | 97.67 M/sec | 110.77 M/sec | +13.41% |
> | dTLB-load-misses | 241600871 | 3216108 | -98.67% |
>
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> ---
> include/trace/events/huge_memory.h | 1 +
> mm/khugepaged.c | 13 +++++++++++++
> 2 files changed, 14 insertions(+)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 384e29f6bef0..bcdc57eea270 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -25,6 +25,7 @@
> EM( SCAN_PAGE_LRU, "page_not_in_lru") \
> EM( SCAN_PAGE_LOCK, "page_locked") \
> EM( SCAN_PAGE_ANON, "page_not_anon") \
> + EM( SCAN_PAGE_LAZYFREE, "page_lazyfree") \
> EM( SCAN_PAGE_COMPOUND, "page_compound") \
> EM( SCAN_ANY_PROCESS, "no_process_for_page") \
> EM( SCAN_VMA_NULL, "vma_null") \
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index df22b2274d92..b4def001ccd0 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -46,6 +46,7 @@ enum scan_result {
> SCAN_PAGE_LRU,
> SCAN_PAGE_LOCK,
> SCAN_PAGE_ANON,
> + SCAN_PAGE_LAZYFREE,
> SCAN_PAGE_COMPOUND,
> SCAN_ANY_PROCESS,
> SCAN_VMA_NULL,
> @@ -583,6 +584,12 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> folio = page_folio(page);
> VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
>
> + if (cc->is_khugepaged && !pte_dirty(pteval) &&
> + folio_test_lazyfree(folio)) {
Should be aligned as
if (cc->is_khugepaged && !pte_dirty(pteval) &&
folio_test_lazyfree(folio)) {
But you could just have it in a single line.
> + result = SCAN_PAGE_LAZYFREE;
> + goto out;
> + }
> +
> /* See hpage_collapse_scan_pmd(). */
> if (folio_maybe_mapped_shared(folio)) {
> ++shared;
> @@ -1332,6 +1339,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> }
> folio = page_folio(page);
>
> + if (cc->is_khugepaged && !pte_dirty(pteval) &&
> + folio_test_lazyfree(folio)) {
> + result = SCAN_PAGE_LAZYFREE;
> + goto out_unmap;
> + }
Dito.
> +
> if (!folio_test_anon(folio)) {
> result = SCAN_PAGE_ANON;
> goto out_unmap;
Surprised that there is no need to add checks for SCAN_PAGE_LAZYFREE
anywhere, but it's similar to SCAN_PAGE_LOCK, just that we can never
run into it for madvise.
Acked-by: David Hildenbrand (arm) <david@kernel.org>
--
Cheers,
David
* Re: [PATCH mm-new v6 4/5] mm: khugepaged: skip lazy-free folios
2026-02-04 21:23 ` David Hildenbrand (arm)
@ 2026-02-05 6:05 ` Vernon Yang
0 siblings, 0 replies; 26+ messages in thread
From: Vernon Yang @ 2026-02-05 6:05 UTC (permalink / raw)
To: David Hildenbrand (arm)
Cc: akpm, lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang,
linux-mm, linux-kernel, Vernon Yang
On Thu, Feb 5, 2026 at 5:24 AM David Hildenbrand (arm) <david@kernel.org> wrote:
>
> On 2/1/26 13:25, Vernon Yang wrote:
> > From: Vernon Yang <yanglincheng@kylinos.cn>
> >
> > For example, create three task: hot1 -> cold -> hot2. After all three
> > task are created, each allocate memory 128MB. the hot1/hot2 task
> > continuously access 128 MB memory, while the cold task only accesses
> > its memory briefly andthen call madvise(MADV_FREE). However, khugepaged
> > still prioritizes scanning the cold task and only scans the hot2 task
> > after completing the scan of the cold task.
> >
> > And if we collapse with a lazyfree page, that content will never be none
> > and the deferred shrinker cannot reclaim them.
> >
> > So if the user has explicitly informed us via MADV_FREE that this memory
> > will be freed, it is appropriate for khugepaged to skip it only, thereby
> > avoiding unnecessary scan and collapse operations to reducing CPU
> > wastage.
> >
> > Here are the performance test results:
> > (Throughput bigger is better, other smaller is better)
> >
> > Testing on x86_64 machine:
> >
> > | task hot2 | without patch | with patch | delta |
> > |---------------------|---------------|---------------|---------|
> > | total accesses time | 3.14 sec | 2.93 sec | -6.69% |
> > | cycles per access | 4.96 | 2.21 | -55.44% |
> > | Throughput | 104.38 M/sec | 111.89 M/sec | +7.19% |
> > | dTLB-load-misses | 284814532 | 69597236 | -75.56% |
> >
> > Testing on qemu-system-x86_64 -enable-kvm:
> >
> > | task hot2 | without patch | with patch | delta |
> > |---------------------|---------------|---------------|---------|
> > | total accesses time | 3.35 sec | 2.96 sec | -11.64% |
> > | cycles per access | 7.29 | 2.07 | -71.60% |
> > | Throughput | 97.67 M/sec | 110.77 M/sec | +13.41% |
> > | dTLB-load-misses | 241600871 | 3216108 | -98.67% |
> >
> > Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> > ---
> > include/trace/events/huge_memory.h | 1 +
> > mm/khugepaged.c | 13 +++++++++++++
> > 2 files changed, 14 insertions(+)
> >
> > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > index 384e29f6bef0..bcdc57eea270 100644
> > --- a/include/trace/events/huge_memory.h
> > +++ b/include/trace/events/huge_memory.h
> > @@ -25,6 +25,7 @@
> > EM( SCAN_PAGE_LRU, "page_not_in_lru") \
> > EM( SCAN_PAGE_LOCK, "page_locked") \
> > EM( SCAN_PAGE_ANON, "page_not_anon") \
> > + EM( SCAN_PAGE_LAZYFREE, "page_lazyfree") \
> > EM( SCAN_PAGE_COMPOUND, "page_compound") \
> > EM( SCAN_ANY_PROCESS, "no_process_for_page") \
> > EM( SCAN_VMA_NULL, "vma_null") \
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index df22b2274d92..b4def001ccd0 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -46,6 +46,7 @@ enum scan_result {
> > SCAN_PAGE_LRU,
> > SCAN_PAGE_LOCK,
> > SCAN_PAGE_ANON,
> > + SCAN_PAGE_LAZYFREE,
> > SCAN_PAGE_COMPOUND,
> > SCAN_ANY_PROCESS,
> > SCAN_VMA_NULL,
> > @@ -583,6 +584,12 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > folio = page_folio(page);
> > VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
> >
> > + if (cc->is_khugepaged && !pte_dirty(pteval) &&
> > + folio_test_lazyfree(folio)) {
>
> Should be aligned as
>
> if (cc->is_khugepaged && !pte_dirty(pteval) &&
> folio_test_lazyfree(folio)) {
LGTM. Thank you for the review and suggestion, I will fix it in the next version.
> But you could just have it in a single line.
If it is placed on a single line, it will exceed 80 characters.
> > + result = SCAN_PAGE_LAZYFREE;
> > + goto out;
> > + }
> > +
> > /* See hpage_collapse_scan_pmd(). */
> > if (folio_maybe_mapped_shared(folio)) {
> > ++shared;
> > @@ -1332,6 +1339,12 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> > }
> > folio = page_folio(page);
> >
> > + if (cc->is_khugepaged && !pte_dirty(pteval) &&
> > + folio_test_lazyfree(folio)) {
> > + result = SCAN_PAGE_LAZYFREE;
> > + goto out_unmap;
> > + }
>
> Ditto.
>
> > +
> > if (!folio_test_anon(folio)) {
> > result = SCAN_PAGE_ANON;
> > goto out_unmap;
>
> Surprised that there is no need to add checks for SCAN_PAGE_LAZYFREE
> anywhere, but it's similar to SCAN_PAGE_LOCK just that we cannot ever
> run into it for madvise.
>
> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Thank you for the review and explanation.
> --
> Cheers,
>
> David
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH mm-new v6 5/5] mm: khugepaged: set to next mm direct when mm has MMF_DISABLE_THP_COMPLETELY
2026-02-01 12:25 [PATCH mm-new v6 0/5] Improve khugepaged scan logic Vernon Yang
` (3 preceding siblings ...)
2026-02-01 12:25 ` [PATCH mm-new v6 4/5] mm: khugepaged: skip lazy-free folios Vernon Yang
@ 2026-02-01 12:25 ` Vernon Yang
4 siblings, 0 replies; 26+ messages in thread
From: Vernon Yang @ 2026-02-01 12:25 UTC (permalink / raw)
To: akpm, david
Cc: lorenzo.stoakes, ziy, dev.jain, baohua, lance.yang, linux-mm,
linux-kernel, Vernon Yang
From: Vernon Yang <yanglincheng@kylinos.cn>
When an mm with the MMF_DISABLE_THP_COMPLETELY flag is detected during
scanning, set khugepaged_scan.mm_slot directly to the next mm_slot,
avoiding a redundant pass.
Without this patch, khugepaged_scan.mm_slot is only advanced to the
next mm_slot on the next entry into khugepaged_scan_mm_slot(). With
this patch, it is advanced immediately.
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
---
mm/khugepaged.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b4def001ccd0..94cd064f79a5 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2549,9 +2549,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, enum scan_result
VM_BUG_ON(khugepaged_scan.mm_slot != slot);
/*
* Release the current mm_slot if this mm is about to die, or
- * if we scanned all vmas of this mm.
+ * if we scanned all vmas of this mm, or THP got disabled.
*/
- if (hpage_collapse_test_exit(mm) || !vma) {
+ if (hpage_collapse_test_exit_or_disable(mm) || !vma) {
/*
* Make sure that if mm_users is reaching zero while
* khugepaged runs here, khugepaged_exit will find
--
2.51.0