linux-mm.kvack.org archive mirror
* Re: [RFC PATCH 2/3] mm/mincore: Use can_pte_batch_count() in mincore_pte_range() for pte batch
@ 2025-10-28 11:32 zhangqilong
  0 siblings, 0 replies; 5+ messages in thread
From: zhangqilong @ 2025-10-28 11:32 UTC (permalink / raw)
  To: David Hildenbrand, akpm, lorenzo.stoakes, Liam.Howlett, vbabka,
	rppt, surenb, mhocko, jannh, pfalcato
  Cc: linux-mm, linux-kernel, Wangkefeng (OS Kernel Lab), Sunnanyong

> On 27.10.25 15:03, Zhang Qilong wrote:
> > In the current mincore_pte_range(), if pte_batch_hint() returns only one
> > pte, it is not efficient; instead, call the newly added can_pte_batch_count().
> >
> > On ARM64 QEMU with 8 CPUs and 32G of memory, a simple test demo does the following:
> > 1. mmap 1G anon memory
> > 2. write 1G data by 4k step
> > 3. mincore the mmaped 1G memory
> > 4. get the time consumed by mincore
> >
> > Tested the following cases:
> >   - 4k: all hugepage settings disabled.
> >   - 64k mTHP: only the 64k hugepage setting enabled.
> >
> > Before:
> >
> > Case status | Consumed time (us)  |
> > ----------------------------------|
> > 4k          | 7356                |
> > 64k mTHP    | 3670                |
> >
> > Patched:
> >
> > Case status | Consumed time (us)  |
> > ----------------------------------|
> > 4k          | 4419                |
> > 64k mTHP    | 3061                |
> >
> 
> I assume you're only lucky in that benchmark because you got consecutive 4k
> pages / 64k mTHP from the buddy, right?

Yeah, the demo case is relatively simple, which likely results in more
contiguous allocation of physical page addresses.
This case primarily aims to validate the optimization's effectiveness for
contiguous page addresses. We probably also need to watch for side effects
with non-contiguous page addresses.
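
For reference, a minimal user-space sketch of the demo (illustrative only;
the timing method and details are assumptions on my part, and error
handling is omitted):

#define _DEFAULT_SOURCE		/* for mincore() */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	const size_t size = 1UL << 30;			/* step 1: 1G */
	const long page = sysconf(_SC_PAGESIZE);	/* 4k base pages */
	unsigned char *vec = malloc(size / page);
	struct timespec t0, t1;
	char *buf;

	buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* step 2: write one byte per 4k page to fault everything in */
	for (size_t off = 0; off < size; off += page)
		buf[off] = 1;

	/* steps 3+4: time a single mincore() call over the whole range */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	mincore(buf, size, vec);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("mincore: %ld us\n",
	       (t1.tv_sec - t0.tv_sec) * 1000000L +
	       (t1.tv_nsec - t0.tv_nsec) / 1000L);
	return 0;
}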

> 
> So I suspect that this will mostly just make a micro-benchmark happy, because
> the reality, where we allocate randomly over time, for the PCP, etc., will
> look quite different.
> 
> --
> Cheers
> 
> David / dhildenb
> 


* Re: [RFC PATCH 2/3] mm/mincore: Use can_pte_batch_count() in mincore_pte_range() for pte batch
@ 2025-10-28 11:13 zhangqilong
  0 siblings, 0 replies; 5+ messages in thread
From: zhangqilong @ 2025-10-28 11:13 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, Liam.Howlett, vbabka, rppt, surenb, mhocko, jannh,
	pfalcato, linux-mm, linux-kernel, Wangkefeng (OS Kernel Lab),
	Sunnanyong

 On Mon, Oct 27, 2025 at 10:03:14PM +0800, Zhang Qilong wrote:
> > In the current mincore_pte_range(), if pte_batch_hint() returns only one
> > pte, it is not efficient; instead, call the newly added can_pte_batch_count().
> >
> > On ARM64 QEMU with 8 CPUs and 32G of memory, a simple test demo does the following:
> > 1. mmap 1G anon memory
> > 2. write 1G data by 4k step
> > 3. mincore the mmaped 1G memory
> > 4. get the time consumed by mincore
> >
> > Tested the following cases:
> >  - 4k: all hugepage settings disabled.
> >  - 64k mTHP: only the 64k hugepage setting enabled.
> >
> > Before:
> >
> > Case status | Consumed time (us)  |
> > ----------------------------------|
> > 4k          | 7356                |
> > 64k mTHP    | 3670                |
> >
> > Patched:
> >
> > Case status | Consumed time (us)  |
> > ----------------------------------|
> > 4k          | 4419                |
> > 64k mTHP    | 3061                |
> >
> > The result is evident and demonstrates a significant improvement from
> > pte batching. While verification within a single environment has
> > inherent randomness, there is a high probability of achieving positive
> > effects.
> 
> A recent PTE batching series seriously regressed non-ARM architectures, so I'm
> afraid we can't accept any series that doesn't show statistics for _other platforms_.
> 
> Please make sure you at least test x86-64.

OK, I will run a test on x86-64 as soon as possible; it may yield unexpected results.

> 
> This code is very sensitive and we're not going to accept a patch like this without
> _being sure_ it's ok.

Yeah, it's a hot path; we should be extremely cautious.

> 
> >
> > Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
> > ---
> >  mm/mincore.c | 10 +++-------
> >  1 file changed, 3 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/mincore.c b/mm/mincore.c
> > index 8ec4719370e1..2cc5d276d1cd 100644
> > --- a/mm/mincore.c
> > +++ b/mm/mincore.c
> > @@ -178,18 +178,14 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> >  		/* We need to do cache lookup too for pte markers */
> >  		if (pte_none_mostly(pte))
> >  			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
> >  						 vma, vec);
> >  		else if (pte_present(pte)) {
> > -			unsigned int batch = pte_batch_hint(ptep, pte);
> > -
> > -			if (batch > 1) {
> > -				unsigned int max_nr = (end - addr) >> PAGE_SHIFT;
> > -
> > -				step = min_t(unsigned int, batch, max_nr);
> > -			}
> > +			unsigned int max_nr = (end - addr) >> PAGE_SHIFT;
> >
> > +			step = can_pte_batch_count(vma, ptep, &pte,
> > +						   max_nr, 0);
> >  			for (i = 0; i < step; i++)
> >  				vec[i] = 1;
> >  		} else { /* pte is a swap entry */
> >  			*vec = mincore_swap(pte_to_swp_entry(pte), false);
> >  		}
> > --
> > 2.43.0
> >



* [RFC PATCH 0/3] mm: PTE batch optimization in mincore and mremap
@ 2025-10-27 14:03 Zhang Qilong
  2025-10-27 14:03 ` [RFC PATCH 2/3] mm/mincore: Use can_pte_batch_count() in mincore_pte_range() for pte batch Zhang Qilong
  0 siblings, 1 reply; 5+ messages in thread
From: Zhang Qilong @ 2025-10-27 14:03 UTC (permalink / raw)
  To: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
	mhocko, jannh, pfalcato
  Cc: linux-mm, linux-kernel, wangkefeng.wang, sunnanyong

The first patch extracts a new interface named can_pte_batch_count()
from folio_pte_batch_flags() for pte batching. The new interface avoids
folio access and counts more ptes, no longer limited to entries mapped
within a single folio. Callers need to pass a range within a single VMA
and a single page table, and it detects consecutive (present) PTEs that
map consecutive pages. The 2nd and 3rd patches use can_pte_batch_count()
to do pte batching.
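
For illustration, the core idea (counting consecutive present PTEs that
map consecutive pages) might be sketched as below. This is conceptual
pseudocode only, not the actual patch 1 implementation, which is
extracted from folio_pte_batch_flags() and handles more details:

/*
 * Conceptual sketch only: starting at *ptep, count how many
 * consecutive present PTEs map physically consecutive pages,
 * up to max_nr entries.
 */
static unsigned int count_contig_ptes(pte_t *ptep, pte_t pte,
				      unsigned int max_nr)
{
	unsigned long pfn = pte_pfn(pte);
	unsigned int nr = 1;

	while (nr < max_nr) {
		pte_t next = ptep_get(ptep + nr);

		if (!pte_present(next) || pte_pfn(next) != pfn + nr)
			break;
		nr++;
	}
	return nr;
}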

Zhang Qilong (3):
  mm: Introduce can_pte_batch_count() for PTE batch optimization.
  mm/mincore: Use can_pte_batch_count() in mincore_pte_range() for pte
    batch
  mm/mremap: Use can_pte_batch_count() instead of folio_pte_batch() for
    pte batch

 mm/internal.h | 76 +++++++++++++++++++++++++++++++++++++++------------
 mm/mincore.c  | 10 ++-----
 mm/mremap.c   | 16 ++---------
 3 files changed, 64 insertions(+), 38 deletions(-)

-- 
2.43.0





Thread overview: 5+ messages
2025-10-28 11:32 [RFC PATCH 2/3] mm/mincore: Use can_pte_batch_count() in mincore_pte_range() for pte batch zhangqilong
2025-10-28 11:13 zhangqilong
2025-10-27 14:03 [RFC PATCH 0/3] mm: PTE batch optimization in mincore and mremap Zhang Qilong
2025-10-27 14:03 ` [RFC PATCH 2/3] mm/mincore: Use can_pte_batch_count() in mincore_pte_range() for pte batch Zhang Qilong
2025-10-27 19:27   ` David Hildenbrand
2025-10-27 19:34   ` Lorenzo Stoakes
