[PATCH 1/2] ksm: Initialize the addr only once in rmap_walk_ksm

* [PATCH 1/2] ksm: Initialize the addr only once in rmap_walk_ksm
       [not found] <20260112215315996jocrkFSqeYfhABkZxqs4T@zte.com.cn>
@ 2026-01-12 13:59 ` xu.xin16
  2026-01-12 14:01 ` [PATCH 2/2] ksm: Optimize rmap_walk_ksm by passing a suitable address range xu.xin16
  1 sibling, 0 replies; 4+ messages in thread
From: xu.xin16 @ 2026-01-12 13:59 UTC (permalink / raw)
  To: akpm, david, chengming.zhou, hughd
  Cc: wang.yaxin, yang.yang29, xu.xin16, linux-mm, linux-kernel

From: xu xin <xu.xin16@zte.com.cn>

The addr variable depends only on rmap_item, so its value does not change
across iterations of the anon_vma_interval_tree_foreach loop. Initialize
it once before the loop instead of recomputing it on every iteration.

This is a minor performance optimization that matters most when the loop
runs many iterations.

Signed-off-by: xu xin <xu.xin16@zte.com.cn>
---
 mm/ksm.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index cfc182255c7b..335e7151e4a1 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -3171,6 +3171,7 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
 		struct anon_vma *anon_vma = rmap_item->anon_vma;
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
+		unsigned long addr;

 		cond_resched();
 		if (!anon_vma_trylock_read(anon_vma)) {
@@ -3180,16 +3181,16 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
 			}
 			anon_vma_lock_read(anon_vma);
 		}
+
+		/* Ignore the stable/unstable/sqnr flags */
+		addr = rmap_item->address & PAGE_MASK;
+
 		anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
 					       0, ULONG_MAX) {
-			unsigned long addr;

 			cond_resched();
 			vma = vmac->vma;

-			/* Ignore the stable/unstable/sqnr flags */
-			addr = rmap_item->address & PAGE_MASK;
-
 			if (addr < vma->vm_start || addr >= vma->vm_end)
 				continue;
 			/*
-- 
2.25.1
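
The hoisting described in this patch can be illustrated with a small user-space sketch. It is not the kernel code: the struct names, the 4 KiB page size, and the flag values are assumptions chosen for illustration, and the VMA list is a plain array rather than an interval tree. It only shows that the masked address depends on the rmap item alone, so it can be computed once before the loop.

```c
#include <assert.h>

/* Assumed for illustration: 4 KiB pages. The kernel defines these per arch. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Illustrative flag bits stored in the low bits of the address field. */
#define SEQNR_MASK    0x0ffUL
#define UNSTABLE_FLAG 0x100UL
#define STABLE_FLAG   0x200UL

/* Hypothetical stand-in for the rmap item: only the field used here. */
struct rmap_item_sketch {
	unsigned long address;	/* page-aligned address | flag bits */
};

/* Hypothetical stand-in for a VMA: half-open range [vm_start, vm_end). */
struct vma_sketch {
	unsigned long vm_start;
	unsigned long vm_end;
};

/* Strip the flag bits. The result depends only on the item, not on a VMA. */
static unsigned long item_addr(const struct rmap_item_sketch *item)
{
	return item->address & PAGE_MASK;
}

/* Count the VMAs that contain the item's address. */
static int count_matching(const struct rmap_item_sketch *item,
			  const struct vma_sketch *vmas, int n)
{
	/* Loop-invariant: computed once, before the loop. */
	const unsigned long addr = item_addr(item);
	int hits = 0;

	assert(n >= 0);
	for (int i = 0; i < n; i++) {
		if (addr < vmas[i].vm_start || addr >= vmas[i].vm_end)
			continue;
		hits++;
	}
	return hits;
}
```

With an item whose address field is 0x40001000 | STABLE_FLAG | 0x2a, item_addr() yields 0x40001000, and only a VMA covering that page is counted.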



* [PATCH 2/2] ksm: Optimize rmap_walk_ksm by passing a suitable address range
       [not found] <20260112215315996jocrkFSqeYfhABkZxqs4T@zte.com.cn>
  2026-01-12 13:59 ` [PATCH 1/2] ksm: Initialize the addr only once in rmap_walk_ksm xu.xin16
@ 2026-01-12 14:01 ` xu.xin16
  2026-01-12 17:47   ` Andrew Morton
  2026-01-12 19:25   ` David Hildenbrand (Red Hat)
  1 sibling, 2 replies; 4+ messages in thread
From: xu.xin16 @ 2026-01-12 14:01 UTC (permalink / raw)
  To: akpm, david, chengming.zhou, hughd
  Cc: david, wang.yaxin, yang.yang29, linux-mm, linux-kernel

From: xu xin <xu.xin16@zte.com.cn>

Problem
=======
When available memory is extremely tight, causing KSM pages to be swapped
out, or when there is significant memory fragmentation and THP triggers
memory compaction, the system will invoke the rmap_walk_ksm function to
perform reverse mapping. However, we observed that this function becomes
particularly time-consuming when a large number of VMAs (e.g., 20,000)
share the same anon_vma. Through debug trace analysis, we found that most
of the latency occurs within anon_vma_interval_tree_foreach, leading to an
excessively long hold time on the anon_vma lock (even reaching 500ms or
more), which in turn causes upper-layer applications (waiting for the
anon_vma lock) to be blocked for extended periods.

Root Cause
==========
Further investigation revealed that 99.9% of iterations inside the
anon_vma_interval_tree_foreach loop are skipped due to the first check
"if (addr < vma->vm_start || addr >= vma->vm_end)", indicating that a large
number of loop iterations are ineffective. This inefficiency arises because
the pgoff_start and pgoff_end parameters passed to
anon_vma_interval_tree_foreach span the entire address space from 0 to
ULONG_MAX, resulting in very poor loop efficiency.

Solution
========
In fact, we can significantly improve performance by passing a more precise
range based on the given addr. Since the original pages merged by KSM
correspond to anonymous VMAs, the page offset can be calculated as
pgoff = address >> PAGE_SHIFT. Therefore, we can optimize the call by
defining:

	pgoff_start = rmap_item->address >> PAGE_SHIFT;
	pgoff_end = pgoff_start + folio_nr_pages(folio) - 1;

Performance
===========
In our real embedded Linux environment, the measured metrics were as follows:

1) Time_ms: Maximum time the anon_vma lock was held in a single
            rmap_walk_ksm call.
2) Nr_iteration_total: Maximum number of iterations in a single
            anon_vma_interval_tree_foreach loop.
3) Skip_addr_out_of_range: Maximum number of iterations skipped by the
            first check (addr against vma->vm_start and vma->vm_end) in a
            single anon_vma_interval_tree_foreach loop.
4) Skip_mm_mismatch: Maximum number of iterations skipped by the second
            check (rmap_item->mm == vma->vm_mm) in a single
            anon_vma_interval_tree_foreach loop.

The results are as follows:

                 Time_ms      Nr_iteration_total    Skip_addr_out_of_range   Skip_mm_mismatch
Before patch:    228.65       22169                 22168                    0
After patch:     0.396        3                     0                        2

Co-developed-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
---
 mm/ksm.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 335e7151e4a1..0a074ad8e867 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -3172,6 +3172,7 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
 		unsigned long addr;
+		pgoff_t pgoff_start, pgoff_end;

 		cond_resched();
 		if (!anon_vma_trylock_read(anon_vma)) {
@@ -3185,8 +3186,11 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
 		/* Ignore the stable/unstable/sqnr flags */
 		addr = rmap_item->address & PAGE_MASK;

+		pgoff_start = rmap_item->address >> PAGE_SHIFT;
+		pgoff_end = pgoff_start + folio_nr_pages(folio) - 1;
+
 		anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
-					       0, ULONG_MAX) {
+					       pgoff_start, pgoff_end) {

 			cond_resched();
 			vma = vmac->vma;
-- 
2.25.
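
The effect of narrowing the query range can be sketched in user space. The following is a simplified model, not the kernel's interval tree: the struct and function names are hypothetical, 4 KiB pages are assumed, and a linear scan stands in for the tree lookup. It only models which nodes a query over [start, last] hands to the loop body, using the usual closed-interval overlap test.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define PAGE_SHIFT 12	/* assumed 4 KiB pages, for illustration only */

/*
 * Hypothetical stand-in for a VMA in the anon_vma interval tree, keyed by
 * the page offsets it covers: [pgoff_first, pgoff_last], both inclusive.
 */
struct vma_node_sketch {
	unsigned long pgoff_first;
	unsigned long pgoff_last;
};

/* Closed-interval overlap test applied by a range query. */
static int overlaps(const struct vma_node_sketch *node,
		    unsigned long start, unsigned long last)
{
	return node->pgoff_first <= last && node->pgoff_last >= start;
}

/*
 * Count the nodes a query over [start, last] would pass to the loop body.
 * A real interval tree prunes non-overlapping subtrees; this linear scan
 * only models the set of nodes that is returned.
 */
static size_t nodes_visited(const struct vma_node_sketch *nodes, size_t n,
			    unsigned long start, unsigned long last)
{
	size_t visited = 0;

	assert(start <= last);
	for (size_t i = 0; i < n; i++)
		if (overlaps(&nodes[i], start, last))
			visited++;
	return visited;
}

/* The old query: [0, ULONG_MAX] overlaps every node. */
static size_t nodes_visited_full_range(const struct vma_node_sketch *nodes,
				       size_t n)
{
	return nodes_visited(nodes, n, 0, ULONG_MAX);
}

/* Page offset of an address; low flag bits are discarded by the shift. */
static unsigned long addr_to_pgoff(unsigned long address)
{
	return address >> PAGE_SHIFT;
}
```

For four nodes covering page offsets 0x10-0x1f, 0x20-0x2f, 0x30-0x3f and 0x40-0x4f, the full-range query visits all four, while a single-page query at addr_to_pgoff(0x3522a), which is 0x35, visits one.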


* Re: [PATCH 2/2] ksm: Optimize rmap_walk_ksm by passing a suitable address range
  2026-01-12 14:01 ` [PATCH 2/2] ksm: Optimize rmap_walk_ksm by passing a suitable address range xu.xin16
@ 2026-01-12 17:47   ` Andrew Morton
  2026-01-12 19:25   ` David Hildenbrand (Red Hat)
  1 sibling, 0 replies; 4+ messages in thread
From: Andrew Morton @ 2026-01-12 17:47 UTC (permalink / raw)
  To: xu.xin16
  Cc: david, chengming.zhou, hughd, wang.yaxin, yang.yang29, linux-mm,
	linux-kernel

On Mon, 12 Jan 2026 22:01:43 +0800 (CST) <xu.xin16@zte.com.cn> wrote:

> From: xu xin <xu.xin16@zte.com.cn>
> 
> Problem
> =======
> When available memory is extremely tight, causing KSM pages to be swapped
> out, or when there is significant memory fragmentation and THP triggers
> memory compaction, the system will invoke the rmap_walk_ksm function to
> perform reverse mapping. However, we observed that this function becomes
> particularly time-consuming when a large number of VMAs (e.g., 20,000)
> share the same anon_vma. Through debug trace analysis, we found that most
> of the latency occurs within anon_vma_interval_tree_foreach, leading to an
> excessively long hold time on the anon_vma lock (even reaching 500ms or
> more), which in turn causes upper-layer applications (waiting for the
> anon_vma lock) to be blocked for extended periods.
> 
> Root Cause
> ==========
> Further investigation revealed that 99.9% of iterations inside the
> anon_vma_interval_tree_foreach loop are skipped due to the first check
> "if (addr < vma->vm_start || addr >= vma->vm_end)", indicating that a large
> number of loop iterations are ineffective. This inefficiency arises because
> the pgoff_start and pgoff_end parameters passed to
> anon_vma_interval_tree_foreach span the entire address space from 0 to
> ULONG_MAX, resulting in very poor loop efficiency.
> 
> Solution
> ========
> In fact, we can significantly improve performance by passing a more precise
> range based on the given addr. Since the original pages merged by KSM
> correspond to anonymous VMAs, the page offset can be calculated as
> pgoff = address >> PAGE_SHIFT. Therefore, we can optimize the call by
> defining:
> 
> 	pgoff_start = rmap_item->address >> PAGE_SHIFT;
> 	pgoff_end = pgoff_start + folio_nr_pages(folio) - 1;
> 
> Performance
> ===========
> In our real embedded Linux environment, the measured metrics were as follows:
> 
> 1) Time_ms: Maximum time the anon_vma lock was held in a single
>             rmap_walk_ksm call.
> 2) Nr_iteration_total: Maximum number of iterations in a single
>             anon_vma_interval_tree_foreach loop.
> 3) Skip_addr_out_of_range: Maximum number of iterations skipped by the
>             first check (addr against vma->vm_start and vma->vm_end) in a
>             single anon_vma_interval_tree_foreach loop.
> 4) Skip_mm_mismatch: Maximum number of iterations skipped by the second
>             check (rmap_item->mm == vma->vm_mm) in a single
>             anon_vma_interval_tree_foreach loop.
> 
> The results are as follows:
> 
>                  Time_ms      Nr_iteration_total    Skip_addr_out_of_range   Skip_mm_mismatch
> Before patch:    228.65       22169                 22168                    0
> After patch:     0.396        3                     0                        2

Wow.  This was not the best code we've ever delivered.  It's really old
code - over a decade?  Your workload seems a reasonable one and I
wonder why it took so long to find this.

> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -3172,6 +3172,7 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
>  		struct anon_vma_chain *vmac;
>  		struct vm_area_struct *vma;
>  		unsigned long addr;
> +		pgoff_t pgoff_start, pgoff_end;
> 
>  		cond_resched();
>  		if (!anon_vma_trylock_read(anon_vma)) {
> @@ -3185,8 +3186,11 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
>  		/* Ignore the stable/unstable/sqnr flags */
>  		addr = rmap_item->address & PAGE_MASK;
> 
> +		pgoff_start = rmap_item->address >> PAGE_SHIFT;
> +		pgoff_end = pgoff_start + folio_nr_pages(folio) - 1;
> +
>  		anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
> -					       0, ULONG_MAX) {
> +					       pgoff_start, pgoff_end) {
> 
>  			cond_resched();
>  			vma = vmac->vma;

Thanks, I'll queue this for testing - hopefully somehugh will find time
to check the change.



* Re: [PATCH 2/2] ksm: Optimize rmap_walk_ksm by passing a suitable address range
  2026-01-12 14:01 ` [PATCH 2/2] ksm: Optimize rmap_walk_ksm by passing a suitable address range xu.xin16
  2026-01-12 17:47   ` Andrew Morton
@ 2026-01-12 19:25   ` David Hildenbrand (Red Hat)
  1 sibling, 0 replies; 4+ messages in thread
From: David Hildenbrand (Red Hat) @ 2026-01-12 19:25 UTC (permalink / raw)
  To: xu.xin16, akpm, chengming.zhou, hughd
  Cc: wang.yaxin, yang.yang29, linux-mm, linux-kernel

On 1/12/26 15:01, xu.xin16@zte.com.cn wrote:
> From: xu xin <xu.xin16@zte.com.cn>
> 
> Problem
> =======
> When available memory is extremely tight, causing KSM pages to be swapped
> out, or when there is significant memory fragmentation and THP triggers
> memory compaction, the system will invoke the rmap_walk_ksm function to
> perform reverse mapping. However, we observed that this function becomes
> particularly time-consuming when a large number of VMAs (e.g., 20,000)
> share the same anon_vma. Through debug trace analysis, we found that most
> of the latency occurs within anon_vma_interval_tree_foreach, leading to an
> excessively long hold time on the anon_vma lock (even reaching 500ms or
> more), which in turn causes upper-layer applications (waiting for the
> anon_vma lock) to be blocked for extended periods.
> 
> Root Cause
> ==========
> Further investigation revealed that 99.9% of iterations inside the
> anon_vma_interval_tree_foreach loop are skipped due to the first check
> "if (addr < vma->vm_start || addr >= vma->vm_end)", indicating that a large
> number of loop iterations are ineffective. This inefficiency arises because
> the pgoff_start and pgoff_end parameters passed to
> anon_vma_interval_tree_foreach span the entire address space from 0 to
> ULONG_MAX, resulting in very poor loop efficiency.
> 
> Solution
> ========
> In fact, we can significantly improve performance by passing a more precise
> range based on the given addr. Since the original pages merged by KSM
> correspond to anonymous VMAs, the page offset can be calculated as
> pgoff = address >> PAGE_SHIFT. Therefore, we can optimize the call by
> defining:
> 
> 	pgoff_start = rmap_item->address >> PAGE_SHIFT;
> 	pgoff_end = pgoff_start + folio_nr_pages(folio) - 1;
> 
> Performance
> ===========
> In our real embedded Linux environment, the measured metrics were as follows:
> 
> 1) Time_ms: Maximum time the anon_vma lock was held in a single
>             rmap_walk_ksm call.
> 2) Nr_iteration_total: Maximum number of iterations in a single
>             anon_vma_interval_tree_foreach loop.
> 3) Skip_addr_out_of_range: Maximum number of iterations skipped by the
>             first check (addr against vma->vm_start and vma->vm_end) in a
>             single anon_vma_interval_tree_foreach loop.
> 4) Skip_mm_mismatch: Maximum number of iterations skipped by the second
>             check (rmap_item->mm == vma->vm_mm) in a single
>             anon_vma_interval_tree_foreach loop.
> 
> The results are as follows:
> 
>                  Time_ms      Nr_iteration_total    Skip_addr_out_of_range   Skip_mm_mismatch
> Before patch:    228.65       22169                 22168                    0
> After patch:     0.396        3                     0                        2

Nice improvement.

Can you make your reproducer available?

> 
> Co-developed-by: Wang Yaxin <wang.yaxin@zte.com.cn>
> Signed-off-by: xu xin <xu.xin16@zte.com.cn>
> ---
>   mm/ksm.c | 6 +++++-
>   1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 335e7151e4a1..0a074ad8e867 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -3172,6 +3172,7 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
>   		struct anon_vma_chain *vmac;
>   		struct vm_area_struct *vma;
>   		unsigned long addr;
> +		pgoff_t pgoff_start, pgoff_end;
> 
>   		cond_resched();
>   		if (!anon_vma_trylock_read(anon_vma)) {
> @@ -3185,8 +3186,11 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
>   		/* Ignore the stable/unstable/sqnr flags */
>   		addr = rmap_item->address & PAGE_MASK;
> 
> +		pgoff_start = rmap_item->address >> PAGE_SHIFT;
> +		pgoff_end = pgoff_start + folio_nr_pages(folio) - 1;

KSM folios are always order-0, so you can keep it simple and hard-code 
PAGE_SIZE here.

You can also initialize both values directly and make them const.

> +
>   		anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
> -					       0, ULONG_MAX) {
> +					       pgoff_start, pgoff_end) {

This is interesting. When we fork() with KSM pages we don't duplicate 
the rmap items. So we rely on this handling here to find all KSM pages 
even in child processes without distinct rmap items.

The important thing is that, whenever we mremap(), we break COW to 
unshare all KSM pages (see prep_move_vma).

So, indeed, I would expect that we only ever have to search at 
rmap->address even in child processes. So makes sense to me.

-- 
Cheers

David
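
One possible reading of the two review suggestions above (order-0 folios, values initialized directly and made const) is sketched below in user space. This is an assumption about what a follow-up might look like, not code from the thread: the helper and struct names are hypothetical and 4 KiB pages are assumed. With a single page, pgoff_start + 1 - 1 reduces to pgoff_start, so both ends of the range are the same page offset.

```c
#include <assert.h>

#define PAGE_SHIFT 12			/* assumed 4 KiB pages */
typedef unsigned long pgoff_t;		/* same underlying type as in the kernel */

/* Hypothetical container so the sketch can return both ends of the range. */
struct pgoff_range_sketch {
	pgoff_t start;
	pgoff_t end;
};

/* Compute the query range for an order-0 (single-page) KSM folio. */
static struct pgoff_range_sketch ksm_pgoff_range(unsigned long item_address)
{
	/* Both values are initialized at declaration and never modified. */
	const pgoff_t pgoff_start = item_address >> PAGE_SHIFT;
	const pgoff_t pgoff_end = pgoff_start;	/* one page: start + 1 - 1 */
	const struct pgoff_range_sketch range = {
		.start = pgoff_start,
		.end = pgoff_end,
	};

	assert(range.start <= range.end);
	return range;
}
```

For an address field of 0x3522a the sketch returns the range [0x35, 0x35], since the shift discards the low flag bits.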

