* [RFC PATCH 0/3] mm: PTEs batch optimization in mincore and mremap
@ 2025-10-27 14:03 Zhang Qilong
2025-10-27 14:03 ` [RFC PATCH 1/3] mm: Introduce can_pte_batch_count() for PTEs batch optimization Zhang Qilong
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Zhang Qilong @ 2025-10-27 14:03 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, jannh, pfalcato
Cc: linux-mm, linux-kernel, wangkefeng.wang, sunnanyong
The first patch extracts a new interface named can_pte_batch_count()
from folio_pte_batch_flags() for PTE batching. The new interface avoids
folio access and counts more PTEs, no longer limited to entries mapped
within a single folio. The caller passes a range within a single VMA and
a single page table, and the function detects consecutive (present) PTEs
that map consecutive pages. The 2nd and 3rd patches use
can_pte_batch_count() to do the PTE batching; an illustrative call
pattern is sketched below.
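A minimal sketch of the intended call pattern (illustrative only, modelled on
what patch 2 does in mincore_pte_range(); vma, ptep, addr and end are assumed
to come from the caller's page table walk):
	pte_t pte = ptep_get(ptep);
	/* Cap the batch so it cannot leave this VMA or this page table. */
	unsigned int max_nr = (end - addr) >> PAGE_SHIFT;
	unsigned int nr = 1;
	if (pte_present(pte))
		/* No FPB_* flags: accessed/writable/dirty bits are ignored. */
		nr = can_pte_batch_count(vma, ptep, &pte, max_nr, 0);
	/* ... handle the next nr page table entries as one batch ... */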
Zhang Qilong (3):
mm: Introduce can_pte_batch_count() for PTEs batch optimization.
mm/mincore: Use can_pte_batch_count() in mincore_pte_range() for pte batch
mm/mremap: Use can_pte_batch_count() instead of folio_pte_batch() for
pte batch
mm/internal.h | 76 +++++++++++++++++++++++++++++++++++++++------------
mm/mincore.c | 10 ++-----
mm/mremap.c | 16 ++---------
3 files changed, 64 insertions(+), 38 deletions(-)
--
2.43.0
* [RFC PATCH 1/3] mm: Introduce can_pte_batch_count() for PTEs batch optimization.
2025-10-27 14:03 [RFC PATCH 0/3] mm: PTEs batch optimization in mincore and mremap Zhang Qilong
@ 2025-10-27 14:03 ` Zhang Qilong
2025-10-27 19:24 ` David Hildenbrand
2025-10-27 14:03 ` [RFC PATCH 2/3] mm/mincore: Use can_pte_batch_count() in mincore_pte_range() for pte batch Zhang Qilong
2025-10-27 14:03 ` [RFC PATCH 3/3] mm/mremap: Use can_pte_batch_count() instead of folio_pte_batch() for pte batch Zhang Qilong
2 siblings, 1 reply; 13+ messages in thread
From: Zhang Qilong @ 2025-10-27 14:03 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, jannh, pfalcato
Cc: linux-mm, linux-kernel, wangkefeng.wang, sunnanyong
Currently, PTE batching requires folio access, with the maximum batch
size limited to the PFNs contained within the folio. However, in certain
cases (such as mremap_folio_pte_batch() and mincore_pte_range()),
accessing the folio is unnecessary and expensive.
For scenarios that do not require folio access, this patch introduces
can_pte_batch_count(). Given contiguous physical addresses and identical
PTE attribute bits, we can now process more page table entries at once,
in a batch that is no longer limited to entries mapped within a single
folio, while also avoiding the folio access.
Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
---
mm/internal.h | 76 +++++++++++++++++++++++++++++++++++++++------------
1 file changed, 58 insertions(+), 18 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 1561fc2ff5b8..92034ca9092d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -233,61 +233,62 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
pte = pte_wrprotect(pte);
return pte_mkold(pte);
}
/**
- * folio_pte_batch_flags - detect a PTE batch for a large folio
- * @folio: The large folio to detect a PTE batch for.
+ * can_pte_batch_count - detect a PTE batch in the range [ptep, ptep + max_nr)
* @vma: The VMA. Only relevant with FPB_MERGE_WRITE, otherwise can be NULL.
* @ptep: Page table pointer for the first entry.
* @ptentp: Pointer to a COPY of the first page table entry whose flags this
* function updates based on @flags if appropriate.
* @max_nr: The maximum number of table entries to consider.
* @flags: Flags to modify the PTE batch semantics.
*
- * Detect a PTE batch: consecutive (present) PTEs that map consecutive
- * pages of the same large folio in a single VMA and a single page table.
+ * This interface is designed for cases that do not require folio access.
+ * If the folio must be taken into account, call folio_pte_batch_flags() instead.
+ *
+ * Detect a PTE batch: consecutive (present) PTEs that map consecutive pages
+ * in a single VMA and a single page table.
*
* All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
* the accessed bit, writable bit, dirty bit (unless FPB_RESPECT_DIRTY is set)
* and soft-dirty bit (unless FPB_RESPECT_SOFT_DIRTY is set).
*
- * @ptep must map any page of the folio. max_nr must be at least one and
+ * @ptep points to the first entry in the range. max_nr must be at least one and
* must be limited by the caller so scanning cannot exceed a single VMA and
* a single page table.
*
* Depending on the FPB_MERGE_* flags, the pte stored at @ptentp will
* be updated: it's crucial that a pointer to a COPY of the first
* page table entry, obtained through ptep_get(), is provided as @ptentp.
*
- * This function will be inlined to optimize based on the input parameters;
- * consider using folio_pte_batch() instead if applicable.
+ * Unlike folio_pte_batch_flags() below, which only deals with PTEs mapping a
+ * single folio, can_pte_batch_count() can also handle PTEs that map
+ * consecutive folios. If no FPB_RESPECT_* flag is set, the accessed, writable
+ * and dirty bits are ignored. If such a flag is set, the respected bit(s) are
+ * kept in the expected PTE and compared by pte_same(); if a respected bit
+ * differs in the next entry, pte_same() returns false and the batch is
+ * terminated. This keeps the handling of PTEs spanning multiple folios correct.
+ *
+ * This function will be inlined to optimize based on the input parameters.
*
* Return: the number of table entries in the batch.
*/
-static inline unsigned int folio_pte_batch_flags(struct folio *folio,
- struct vm_area_struct *vma, pte_t *ptep, pte_t *ptentp,
- unsigned int max_nr, fpb_t flags)
+static inline unsigned int can_pte_batch_count(struct vm_area_struct *vma,
+ pte_t *ptep, pte_t *ptentp, unsigned int max_nr, fpb_t flags)
{
bool any_writable = false, any_young = false, any_dirty = false;
pte_t expected_pte, pte = *ptentp;
unsigned int nr, cur_nr;
- VM_WARN_ON_FOLIO(!pte_present(pte), folio);
- VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
- VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
+ VM_WARN_ON(!pte_present(pte));
/*
* Ensure this is a pointer to a copy not a pointer into a page table.
* If this is a stack value, it won't be a valid virtual address, but
* that's fine because it also cannot be pointing into the page table.
*/
VM_WARN_ON(virt_addr_valid(ptentp) && PageTable(virt_to_page(ptentp)));
-
- /* Limit max_nr to the actual remaining PFNs in the folio we could batch. */
- max_nr = min_t(unsigned long, max_nr,
- folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte));
-
nr = pte_batch_hint(ptep, pte);
expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
ptep = ptep + nr;
while (nr < max_nr) {
@@ -317,10 +318,49 @@ static inline unsigned int folio_pte_batch_flags(struct folio *folio,
*ptentp = pte_mkdirty(*ptentp);
return min(nr, max_nr);
}
+/**
+ * folio_pte_batch_flags - detect a PTE batch for a large folio
+ * @folio: The large folio to detect a PTE batch for.
+ * @vma: The VMA. Only relevant with FPB_MERGE_WRITE, otherwise can be NULL.
+ * @ptep: Page table pointer for the first entry.
+ * @ptentp: Pointer to a COPY of the first page table entry whose flags this
+ * function updates based on @flags if appropriate.
+ * @max_nr: The maximum number of table entries to consider.
+ * @flags: Flags to modify the PTE batch semantics.
+ *
+ * Detect a PTE batch: consecutive (present) PTEs that map consecutive
+ * pages of the same large folio and have the same PTE bits set, excluding the
+ * PFN, the accessed bit, writable bit, dirty bit (unless FPB_RESPECT_DIRTY
+ * is set) and soft-dirty bit (unless FPB_RESPECT_SOFT_DIRTY is set).
+ *
+ * @ptep must map any page of the folio.
+ *
+ * This function will be inlined to optimize based on the input parameters;
+ * consider using folio_pte_batch() instead if applicable.
+ *
+ * Return: the number of table entries in the batch.
+ */
+static inline unsigned int folio_pte_batch_flags(struct folio *folio,
+ struct vm_area_struct *vma, pte_t *ptep, pte_t *ptentp,
+ unsigned int max_nr, fpb_t flags)
+{
+ pte_t pte = *ptentp;
+
+ VM_WARN_ON_FOLIO(!pte_present(pte), folio);
+ VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
+ VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
+
+ /* Limit max_nr to the actual remaining PFNs in the folio we could batch. */
+ max_nr = min_t(unsigned long, max_nr,
+ folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte));
+
+ return can_pte_batch_count(vma, ptep, ptentp, max_nr, flags);
+}
+
unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
unsigned int max_nr);
/**
* pte_move_swp_offset - Move the swap entry offset field of a swap pte
--
2.43.0
* [RFC PATCH 2/3] mm/mincore: Use can_pte_batch_count() in mincore_pte_range() for pte batch
2025-10-27 14:03 [RFC PATCH 0/3] mm: PTEs batch optimization in mincore and mremap Zhang Qilong
2025-10-27 14:03 ` [RFC PATCH 1/3] mm: Introduce can_pte_batch_count() for PTEs batch optimization Zhang Qilong
@ 2025-10-27 14:03 ` Zhang Qilong
2025-10-27 19:27 ` David Hildenbrand
2025-10-27 19:34 ` Lorenzo Stoakes
2025-10-27 14:03 ` [RFC PATCH 3/3] mm/mremap: Use can_pte_batch_count() instead of folio_pte_batch() for pte batch Zhang Qilong
2 siblings, 2 replies; 13+ messages in thread
From: Zhang Qilong @ 2025-10-27 14:03 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, jannh, pfalcato
Cc: linux-mm, linux-kernel, wangkefeng.wang, sunnanyong
In the current mincore_pte_range(), batching only happens when
pte_batch_hint() returns more than one pte, which is not efficient;
just call the newly added can_pte_batch_count() instead.
In an ARM64 qemu VM with 8 CPUs and 32G of memory, a simple test program
does the following (sketched below):
1. mmap 1G of anonymous memory
2. write 1G of data in 4k steps
3. mincore() the mmapped 1G range
4. measure the time consumed by mincore()
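The sketch below is illustrative only (it is not the exact program behind
the numbers, and error handling is omitted):
	#include <stdio.h>
	#include <stdlib.h>
	#include <time.h>
	#include <sys/mman.h>
	int main(void)
	{
		size_t len = 1UL << 30;		/* 1G of anon memory */
		size_t step = 4096;		/* touch it in 4k steps */
		unsigned char *buf, *vec;
		struct timespec t0, t1;
		size_t i;
		buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		vec = malloc(len / step);
		for (i = 0; i < len; i += step)	/* fault in every 4k page */
			buf[i] = 1;
		clock_gettime(CLOCK_MONOTONIC, &t0);
		mincore(buf, len, vec);
		clock_gettime(CLOCK_MONOTONIC, &t1);
		printf("mincore: %ld us\n",
		       (long)((t1.tv_sec - t0.tv_sec) * 1000000 +
			      (t1.tv_nsec - t0.tv_nsec) / 1000));
		return 0;
	}
Absolute numbers will of course vary with the environment.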
Tested the following cases:
- 4k, disabled all hugepage setting.
- 64k mTHP, only enable 64k hugepage setting.
Before:
Case status | Consumed time (us) |
----------------------------------|
4k | 7356 |
64k mTHP | 3670 |
Patched:
Case status | Consumed time (us) |
----------------------------------|
4k | 4419 |
64k mTHP | 3061 |
The result is clear and demonstrates a significant improvement from
PTE batching. While measurements in a single environment have some
inherent randomness, there is a high probability of achieving a
positive effect.
Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
---
mm/mincore.c | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)
diff --git a/mm/mincore.c b/mm/mincore.c
index 8ec4719370e1..2cc5d276d1cd 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -178,18 +178,14 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
/* We need to do cache lookup too for pte markers */
if (pte_none_mostly(pte))
__mincore_unmapped_range(addr, addr + PAGE_SIZE,
vma, vec);
else if (pte_present(pte)) {
- unsigned int batch = pte_batch_hint(ptep, pte);
-
- if (batch > 1) {
- unsigned int max_nr = (end - addr) >> PAGE_SHIFT;
-
- step = min_t(unsigned int, batch, max_nr);
- }
+ unsigned int max_nr = (end - addr) >> PAGE_SHIFT;
+ step = can_pte_batch_count(vma, ptep, &pte,
+ max_nr, 0);
for (i = 0; i < step; i++)
vec[i] = 1;
} else { /* pte is a swap entry */
*vec = mincore_swap(pte_to_swp_entry(pte), false);
}
--
2.43.0
* [RFC PATCH 3/3] mm/mremap: Use can_pte_batch_count() instead of folio_pte_batch() for pte batch
2025-10-27 14:03 [RFC PATCH 0/3] mm: PTEs batch optimization in mincore and mremap Zhang Qilong
2025-10-27 14:03 ` [RFC PATCH 1/3] mm: Introduce can_pte_batch_count() for PTEs batch optimization Zhang Qilong
2025-10-27 14:03 ` [RFC PATCH 2/3] mm/mincore: Use can_pte_batch_count() in mincore_pte_range() for pte batch Zhang Qilong
@ 2025-10-27 14:03 ` Zhang Qilong
2025-10-27 19:41 ` David Hildenbrand
2025-10-27 19:57 ` Lorenzo Stoakes
2 siblings, 2 replies; 13+ messages in thread
From: Zhang Qilong @ 2025-10-27 14:03 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, jannh, pfalcato
Cc: linux-mm, linux-kernel, wangkefeng.wang, sunnanyong
In the current mremap_folio_pte_batch(): 1) pte_batch_hint() always
returns one pte on non-ARM64 machines, which is not efficient; 2) it
needs to look up the folio in order to call folio_pte_batch().
With the newly added can_pte_batch_count(), we can simply call it
instead of folio_pte_batch(), and then rename mremap_folio_pte_batch()
to mremap_pte_batch().
Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
---
mm/mremap.c | 16 +++-------------
1 file changed, 3 insertions(+), 13 deletions(-)
diff --git a/mm/mremap.c b/mm/mremap.c
index bd7314898ec5..d11f93f1622f 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -169,27 +169,17 @@ static pte_t move_soft_dirty_pte(pte_t pte)
pte = pte_swp_mksoft_dirty(pte);
#endif
return pte;
}
-static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
+static int mremap_pte_batch(struct vm_area_struct *vma, unsigned long addr,
pte_t *ptep, pte_t pte, int max_nr)
{
- struct folio *folio;
-
if (max_nr == 1)
return 1;
- /* Avoid expensive folio lookup if we stand no chance of benefit. */
- if (pte_batch_hint(ptep, pte) == 1)
- return 1;
-
- folio = vm_normal_folio(vma, addr, pte);
- if (!folio || !folio_test_large(folio))
- return 1;
-
- return folio_pte_batch(folio, ptep, pte, max_nr);
+ return can_pte_batch_count(vma, ptep, &pte, max_nr, 0);
}
static int move_ptes(struct pagetable_move_control *pmc,
unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
{
@@ -278,11 +268,11 @@ static int move_ptes(struct pagetable_move_control *pmc,
* make sure the physical page stays valid until
* the TLB entry for the old mapping has been
* flushed.
*/
if (pte_present(old_pte)) {
- nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
+ nr_ptes = mremap_pte_batch(vma, old_addr, old_ptep,
old_pte, max_nr_ptes);
force_flush = true;
}
pte = get_and_clear_ptes(mm, old_addr, old_ptep, nr_ptes);
pte = move_pte(pte, old_addr, new_addr);
--
2.43.0
* Re: [RFC PATCH 1/3] mm: Introduce can_pte_batch_count() for PTEs batch optimization.
2025-10-27 14:03 ` [RFC PATCH 1/3] mm: Introduce can_pte_batch_count() for PTEs batch optimization Zhang Qilong
@ 2025-10-27 19:24 ` David Hildenbrand
2025-10-27 19:51 ` Lorenzo Stoakes
0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2025-10-27 19:24 UTC (permalink / raw)
To: Zhang Qilong, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
surenb, mhocko, jannh, pfalcato
Cc: linux-mm, linux-kernel, wangkefeng.wang, sunnanyong
On 27.10.25 15:03, Zhang Qilong wrote:
> Currently, the PTEs batch requires folio access, with the maximum
> quantity limited to the PFNs contained within the folio. However,
> in certain case (such as mremap_folio_pte_batch and mincore_pte_range),
> accessing the folio is unnecessary and expensive.
>
> For scenarios that do not require folio access, this patch introduces
> can_pte_batch_count(). With contiguous physical addresses and identical
> PTE attribut bits, we can now process more page table entries at once,
> in batch, not just limited to entries mapped within a single folio. On
> the other hand, it avoid the folio access.
>
> Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
> ---
> mm/internal.h | 76 +++++++++++++++++++++++++++++++++++++++------------
> 1 file changed, 58 insertions(+), 18 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 1561fc2ff5b8..92034ca9092d 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -233,61 +233,62 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> pte = pte_wrprotect(pte);
> return pte_mkold(pte);
> }
>
> /**
> - * folio_pte_batch_flags - detect a PTE batch for a large folio
> - * @folio: The large folio to detect a PTE batch for.
> + * can_pte_batch_count - detect a PTE batch in range [ptep, to ptep + max_nr)
I really don't like the name.
Maybe it's just pte_batch().
> * @vma: The VMA. Only relevant with FPB_MERGE_WRITE, otherwise can be NULL.
> * @ptep: Page table pointer for the first entry.
> * @ptentp: Pointer to a COPY of the first page table entry whose flags this
> * function updates based on @flags if appropriate.
> * @max_nr: The maximum number of table entries to consider.
> * @flags: Flags to modify the PTE batch semantics.
> *
> - * Detect a PTE batch: consecutive (present) PTEs that map consecutive
> - * pages of the same large folio in a single VMA and a single page table.
> + * This interface is designed for this case that do not require folio access.
> + * If folio consideration is needed, please call folio_pte_batch_flags instead.
> + *
> + * Detect a PTE batch: consecutive (present) PTEs that map consecutive pages
> + * in a single VMA and a single page table.
> *
> * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
> * the accessed bit, writable bit, dirty bit (unless FPB_RESPECT_DIRTY is set)
> * and soft-dirty bit (unless FPB_RESPECT_SOFT_DIRTY is set).
> *
> - * @ptep must map any page of the folio. max_nr must be at least one and
> + * @ptep point to the first entry in range, max_nr must be at least one and
> * must be limited by the caller so scanning cannot exceed a single VMA and
> * a single page table.
> *
> * Depending on the FPB_MERGE_* flags, the pte stored at @ptentp will
> * be updated: it's crucial that a pointer to a COPY of the first
> * page table entry, obtained through ptep_get(), is provided as @ptentp.
> *
> - * This function will be inlined to optimize based on the input parameters;
> - * consider using folio_pte_batch() instead if applicable.
> + * The following folio_pte_batch_flags() deal with PTEs that mapped in a
> + * single folio. However can_pte_batch_count has the capability to handle
> + * PTEs that mapped in consecutive folios. If flags is not set, it will ignore
> + * the accessed, writable and dirty bits. Once the flags is set, the respect
> + * bit(s) will be compared in pte_same(), if the advanced pte_batch_hint()
> + * respect pte bit is different, pte_same() will return false and break. This
> + * ensures the correctness of handling multiple folio PTEs.
> + *
> + * This function will be inlined to optimize based on the input parameters.
> *
> * Return: the number of table entries in the batch.
> */
I recall trouble if we try batching across folios:
commit 7b08b74f3d99f6b801250683c751d391128799ec (tag: mm-hotfixes-stable-2025-05-10-14-23)
Author: Petr Vaněk <arkamar@atlas.cz>
Date: Fri May 2 23:50:19 2025 +0200
mm: fix folio_pte_batch() on XEN PV
On XEN PV, folio_pte_batch() can incorrectly batch beyond the end of a
folio due to a corner case in pte_advance_pfn(). Specifically, when the
PFN following the folio maps to an invalidated MFN,
expected_pte = pte_advance_pfn(expected_pte, nr);
produces a pte_none(). If the actual next PTE in memory is also
pte_none(), the pte_same() succeeds,
if (!pte_same(pte, expected_pte))
break;
the loop is not broken, and batching continues into unrelated memory.
...
--
Cheers
David / dhildenb
* Re: [RFC PATCH 2/3] mm/mincore: Use can_pte_batch_count() in mincore_pte_range() for pte batch
2025-10-27 14:03 ` [RFC PATCH 2/3] mm/mincore: Use can_pte_batch_count() in mincore_pte_range() for pte batch Zhang Qilong
@ 2025-10-27 19:27 ` David Hildenbrand
2025-10-27 19:34 ` Lorenzo Stoakes
1 sibling, 0 replies; 13+ messages in thread
From: David Hildenbrand @ 2025-10-27 19:27 UTC (permalink / raw)
To: Zhang Qilong, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
surenb, mhocko, jannh, pfalcato
Cc: linux-mm, linux-kernel, wangkefeng.wang, sunnanyong
On 27.10.25 15:03, Zhang Qilong wrote:
> In current mincore_pte_range(), if pte_batch_hint() return one
> pte, it's not efficient, just call new added can_pte_batch_count().
>
> In ARM64 qemu, with 8 CPUs, 32G memory, a simple test demo like:
> 1. mmap 1G anon memory
> 2. write 1G data by 4k step
> 3. mincore the mmaped 1G memory
> 4. get the time consumed by mincore
>
> Tested the following cases:
> - 4k, disabled all hugepage setting.
> - 64k mTHP, only enable 64k hugepage setting.
>
> Before
>
> Case status | Consumed time (us) |
> ----------------------------------|
> 4k | 7356 |
> 64k mTHP | 3670 |
>
> Pathed:
>
> Case status | Consumed time (us) |
> ----------------------------------|
> 4k | 4419 |
> 64k mTHP | 3061 |
>
I assume you're only lucky in that benchmark because you got consecutive
4k pages / 64k mTHP from the buddy, right?
So I suspect that this will mostly just make a micro benchmark happy,
because the reality where we allocate randomly over time, for the PCP,
etc will look quite different.
--
Cheers
David / dhildenb
* Re: [RFC PATCH 2/3] mm/mincore: Use can_pte_batch_count() in mincore_pte_range() for pte batch
2025-10-27 14:03 ` [RFC PATCH 2/3] mm/mincore: Use can_pte_batch_count() in mincore_pte_range() for pte batch Zhang Qilong
2025-10-27 19:27 ` David Hildenbrand
@ 2025-10-27 19:34 ` Lorenzo Stoakes
1 sibling, 0 replies; 13+ messages in thread
From: Lorenzo Stoakes @ 2025-10-27 19:34 UTC (permalink / raw)
To: Zhang Qilong
Cc: akpm, david, Liam.Howlett, vbabka, rppt, surenb, mhocko, jannh,
pfalcato, linux-mm, linux-kernel, wangkefeng.wang, sunnanyong
On Mon, Oct 27, 2025 at 10:03:14PM +0800, Zhang Qilong wrote:
> In current mincore_pte_range(), if pte_batch_hint() return one
> pte, it's not efficient, just call new added can_pte_batch_count().
>
> In ARM64 qemu, with 8 CPUs, 32G memory, a simple test demo like:
> 1. mmap 1G anon memory
> 2. write 1G data by 4k step
> 3. mincore the mmaped 1G memory
> 4. get the time consumed by mincore
>
> Tested the following cases:
> - 4k, disabled all hugepage setting.
> - 64k mTHP, only enable 64k hugepage setting.
>
> Before
>
> Case status | Consumed time (us) |
> ----------------------------------|
> 4k | 7356 |
> 64k mTHP | 3670 |
>
> Pathed:
>
> Case status | Consumed time (us) |
> ----------------------------------|
> 4k | 4419 |
> 64k mTHP | 3061 |
>
> The result is evident and demonstrate a significant improvement in
> the pte batch. While verification within a single environment may
> have inherent randomness. there is a high probability of achieving
> positive effects.
Recent batch PTE series seriously regressed non-arm, so I'm afraid we can't
accept any series that doesn't show statistics for _other platforms_.
Please make sure you at least test x86-64.
This code is very sensitive and we're not going to accept a patch like this
without _being sure_ it's ok.
>
> Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
> ---
> mm/mincore.c | 10 +++-------
> 1 file changed, 3 insertions(+), 7 deletions(-)
>
> diff --git a/mm/mincore.c b/mm/mincore.c
> index 8ec4719370e1..2cc5d276d1cd 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -178,18 +178,14 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> /* We need to do cache lookup too for pte markers */
> if (pte_none_mostly(pte))
> __mincore_unmapped_range(addr, addr + PAGE_SIZE,
> vma, vec);
> else if (pte_present(pte)) {
> - unsigned int batch = pte_batch_hint(ptep, pte);
> -
> - if (batch > 1) {
> - unsigned int max_nr = (end - addr) >> PAGE_SHIFT;
> -
> - step = min_t(unsigned int, batch, max_nr);
> - }
> + unsigned int max_nr = (end - addr) >> PAGE_SHIFT;
>
> + step = can_pte_batch_count(vma, ptep, &pte,
> + max_nr, 0);
> for (i = 0; i < step; i++)
> vec[i] = 1;
> } else { /* pte is a swap entry */
> *vec = mincore_swap(pte_to_swp_entry(pte), false);
> }
> --
> 2.43.0
>
* Re: [RFC PATCH 3/3] mm/mremap: Use can_pte_batch_count() instead of folio_pte_batch() for pte batch
2025-10-27 14:03 ` [RFC PATCH 3/3] mm/mremap: Use can_pte_batch_count() instead of folio_pte_batch() for pte batch Zhang Qilong
@ 2025-10-27 19:41 ` David Hildenbrand
2025-10-27 19:57 ` Lorenzo Stoakes
1 sibling, 0 replies; 13+ messages in thread
From: David Hildenbrand @ 2025-10-27 19:41 UTC (permalink / raw)
To: Zhang Qilong, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
surenb, mhocko, jannh, pfalcato
Cc: linux-mm, linux-kernel, wangkefeng.wang, sunnanyong
On 27.10.25 15:03, Zhang Qilong wrote:
> In current mremap_folio_pte_batch(), 1) pte_batch_hint() always
> return one pte in non-ARM64 machine, it is not efficient. 2) Next,
> it need to acquire a folio to call the folio_pte_batch().
>
> Due to new added can_pte_batch_count(), we just call it instead of
> folio_pte_batch(). And then rename mremap_folio_pte_batch() to
> mremap_pte_batch().
>
> Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
> ---
> mm/mremap.c | 16 +++-------------
> 1 file changed, 3 insertions(+), 13 deletions(-)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index bd7314898ec5..d11f93f1622f 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -169,27 +169,17 @@ static pte_t move_soft_dirty_pte(pte_t pte)
> pte = pte_swp_mksoft_dirty(pte);
> #endif
> return pte;
> }
>
> -static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> +static int mremap_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> pte_t *ptep, pte_t pte, int max_nr)
> {
> - struct folio *folio;
> -
> if (max_nr == 1)
> return 1;
>
> - /* Avoid expensive folio lookup if we stand no chance of benefit. */
> - if (pte_batch_hint(ptep, pte) == 1)
> - return 1;
> -
> - folio = vm_normal_folio(vma, addr, pte);
> - if (!folio || !folio_test_large(folio))
> - return 1;
> -
> - return folio_pte_batch(folio, ptep, pte, max_nr);
> + return can_pte_batch_count(vma, ptep, &pte, max_nr, 0);
> }
>
> static int move_ptes(struct pagetable_move_control *pmc,
> unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
> {
> @@ -278,11 +268,11 @@ static int move_ptes(struct pagetable_move_control *pmc,
> * make sure the physical page stays valid until
> * the TLB entry for the old mapping has been
> * flushed.
> */
> if (pte_present(old_pte)) {
> - nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
> + nr_ptes = mremap_pte_batch(vma, old_addr, old_ptep,
> old_pte, max_nr_ptes);
> force_flush = true;
> }
> pte = get_and_clear_ptes(mm, old_addr, old_ptep, nr_ptes);
get_and_clear_ptes() documents: "Clear present PTEs that map consecutive
pages of the same folio, collecting dirty/accessed bits."
And as can_pte_batch_count() will merge access/dirty bits, you would
silently set ptes dirty/accessed that belong to other folios, which
sounds very wrong.
Staring at the code, I wonder if there is also a problem with the write
bit, have to dig into that.
--
Cheers
David / dhildenb
* Re: [RFC PATCH 1/3] mm: Introduce can_pte_batch_count() for PTEs batch optimization.
2025-10-27 19:24 ` David Hildenbrand
@ 2025-10-27 19:51 ` Lorenzo Stoakes
2025-10-27 20:21 ` Ryan Roberts
0 siblings, 1 reply; 13+ messages in thread
From: Lorenzo Stoakes @ 2025-10-27 19:51 UTC (permalink / raw)
To: David Hildenbrand
Cc: Zhang Qilong, akpm, Liam.Howlett, vbabka, rppt, surenb, mhocko,
jannh, pfalcato, linux-mm, linux-kernel, wangkefeng.wang,
sunnanyong, Dev Jain, Ryan Roberts
+Dev, Ryan
Please ensure to keep Dev + Ryan in the loop on all future iterations of this.
On Mon, Oct 27, 2025 at 08:24:40PM +0100, David Hildenbrand wrote:
> On 27.10.25 15:03, Zhang Qilong wrote:
> > Currently, the PTEs batch requires folio access, with the maximum
> > quantity limited to the PFNs contained within the folio. However,
> > in certain case (such as mremap_folio_pte_batch and mincore_pte_range),
> > accessing the folio is unnecessary and expensive.
> >
> > For scenarios that do not require folio access, this patch introduces
> > can_pte_batch_count(). With contiguous physical addresses and identical
> > PTE attribut bits, we can now process more page table entries at once,
> > in batch, not just limited to entries mapped within a single folio. On
> > the other hand, it avoid the folio access.
> >
> > Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
> > ---
> > mm/internal.h | 76 +++++++++++++++++++++++++++++++++++++++------------
> > 1 file changed, 58 insertions(+), 18 deletions(-)
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 1561fc2ff5b8..92034ca9092d 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -233,61 +233,62 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> > pte = pte_wrprotect(pte);
> > return pte_mkold(pte);
> > }
> > /**
> > - * folio_pte_batch_flags - detect a PTE batch for a large folio
> > - * @folio: The large folio to detect a PTE batch for.
> > + * can_pte_batch_count - detect a PTE batch in range [ptep, to ptep + max_nr)
>
> I really don't like the name.
>
> Maybe it's just pte_batch().
Yeah the name's terrible.
But I'm iffy about this series as a whole.
'can' implies boolean, it should be something like get pte batch or count pte
batch or something like this. It's silly to partially replace other functions
also.
But I'm doubtful as to whether any of this will work...
>
> > * @vma: The VMA. Only relevant with FPB_MERGE_WRITE, otherwise can be NULL.
> > * @ptep: Page table pointer for the first entry.
> > * @ptentp: Pointer to a COPY of the first page table entry whose flags this
> > * function updates based on @flags if appropriate.
> > * @max_nr: The maximum number of table entries to consider.
> > * @flags: Flags to modify the PTE batch semantics.
> > *
> > - * Detect a PTE batch: consecutive (present) PTEs that map consecutive
> > - * pages of the same large folio in a single VMA and a single page table.
> > + * This interface is designed for this case that do not require folio access.
> > + * If folio consideration is needed, please call folio_pte_batch_flags instead.
I'm pretty certain we need to make sure we do not cross folio boundaries, which
kills this series if so, does it not?
Ryan - can you confirm?
> > + *
> > + * Detect a PTE batch: consecutive (present) PTEs that map consecutive pages
> > + * in a single VMA and a single page table.
> > *
> > * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
> > * the accessed bit, writable bit, dirty bit (unless FPB_RESPECT_DIRTY is set)
> > * and soft-dirty bit (unless FPB_RESPECT_SOFT_DIRTY is set).
> > *
> > - * @ptep must map any page of the folio. max_nr must be at least one and
> > + * @ptep point to the first entry in range, max_nr must be at least one and
> > * must be limited by the caller so scanning cannot exceed a single VMA and
> > * a single page table.
> > *
> > * Depending on the FPB_MERGE_* flags, the pte stored at @ptentp will
> > * be updated: it's crucial that a pointer to a COPY of the first
> > * page table entry, obtained through ptep_get(), is provided as @ptentp.
> > *
> > - * This function will be inlined to optimize based on the input parameters;
> > - * consider using folio_pte_batch() instead if applicable.
> > + * The following folio_pte_batch_flags() deal with PTEs that mapped in a
> > + * single folio. However can_pte_batch_count has the capability to handle
> > + * PTEs that mapped in consecutive folios. If flags is not set, it will ignore
> > + * the accessed, writable and dirty bits. Once the flags is set, the respect
> > + * bit(s) will be compared in pte_same(), if the advanced pte_batch_hint()
> > + * respect pte bit is different, pte_same() will return false and break. This
> > + * ensures the correctness of handling multiple folio PTEs.
> > + *
> > + * This function will be inlined to optimize based on the input parameters.
> > *
> > * Return: the number of table entries in the batch.
> > */
>
> I recall trouble if we try batching across folios:
Yup pretty sure Ryan said we don't/can't in a previous thread. Now cc'd...
>
> commit 7b08b74f3d99f6b801250683c751d391128799ec (tag: mm-hotfixes-stable-2025-05-10-14-23)
> Author: Petr Vaněk <arkamar@atlas.cz>
> Date: Fri May 2 23:50:19 2025 +0200
>
> mm: fix folio_pte_batch() on XEN PV
> On XEN PV, folio_pte_batch() can incorrectly batch beyond the end of a
> folio due to a corner case in pte_advance_pfn(). Specifically, when the
> PFN following the folio maps to an invalidated MFN,
> expected_pte = pte_advance_pfn(expected_pte, nr);
> produces a pte_none(). If the actual next PTE in memory is also
> pte_none(), the pte_same() succeeds,
> if (!pte_same(pte, expected_pte))
> break;
> the loop is not broken, and batching continues into unrelated memory.
> ...
>
>
> --
> Cheers
>
> David / dhildenb
>
Thanks, Lorenzo
* Re: [RFC PATCH 3/3] mm/mremap: Use can_pte_batch_count() instead of folio_pte_batch() for pte batch
2025-10-27 14:03 ` [RFC PATCH 3/3] mm/mremap: Use can_pte_batch_count() instead of folio_pte_batch() for pte batch Zhang Qilong
2025-10-27 19:41 ` David Hildenbrand
@ 2025-10-27 19:57 ` Lorenzo Stoakes
1 sibling, 0 replies; 13+ messages in thread
From: Lorenzo Stoakes @ 2025-10-27 19:57 UTC (permalink / raw)
To: Zhang Qilong
Cc: akpm, david, Liam.Howlett, vbabka, rppt, surenb, mhocko, jannh,
pfalcato, linux-mm, linux-kernel, wangkefeng.wang, sunnanyong
On Mon, Oct 27, 2025 at 10:03:15PM +0800, Zhang Qilong wrote:
> In current mremap_folio_pte_batch(), 1) pte_batch_hint() always
> return one pte in non-ARM64 machine, it is not efficient. 2) Next,
Err... but there's basically no benefit for non-arm64 machines?
The key benefit is the mTHP side of things and making the underlying
arch-specific code more efficient right?
And again you need to get numbers to demonstrate you don't regress non-arm64.
> it need to acquire a folio to call the folio_pte_batch().
>
> Due to new added can_pte_batch_count(), we just call it instead of
> folio_pte_batch(). And then rename mremap_folio_pte_batch() to
> mremap_pte_batch().
>
> Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
> ---
> mm/mremap.c | 16 +++-------------
> 1 file changed, 3 insertions(+), 13 deletions(-)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index bd7314898ec5..d11f93f1622f 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -169,27 +169,17 @@ static pte_t move_soft_dirty_pte(pte_t pte)
> pte = pte_swp_mksoft_dirty(pte);
> #endif
> return pte;
> }
>
> -static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> +static int mremap_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> pte_t *ptep, pte_t pte, int max_nr)
> {
> - struct folio *folio;
> -
> if (max_nr == 1)
> return 1;
>
> - /* Avoid expensive folio lookup if we stand no chance of benefit. */
> - if (pte_batch_hint(ptep, pte) == 1)
> - return 1;
Why are we eliminating an easy exit here and instead always invoking the
more involved function?
Again this has to be tested against non-arm architectures.
> -
> - folio = vm_normal_folio(vma, addr, pte);
> - if (!folio || !folio_test_large(folio))
> - return 1;
> -
> - return folio_pte_batch(folio, ptep, pte, max_nr);
> + return can_pte_batch_count(vma, ptep, &pte, max_nr, 0);
This is very silly to have this function now just return another function + a
trivial check that your function should be doing...
> }
>
> static int move_ptes(struct pagetable_move_control *pmc,
> unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
> {
> @@ -278,11 +268,11 @@ static int move_ptes(struct pagetable_move_control *pmc,
> * make sure the physical page stays valid until
> * the TLB entry for the old mapping has been
> * flushed.
> */
> if (pte_present(old_pte)) {
> - nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
> + nr_ptes = mremap_pte_batch(vma, old_addr, old_ptep,
> old_pte, max_nr_ptes);
> force_flush = true;
> }
> pte = get_and_clear_ptes(mm, old_addr, old_ptep, nr_ptes);
> pte = move_pte(pte, old_addr, new_addr);
> --
> 2.43.0
>
* Re: [RFC PATCH 1/3] mm: Introduce can_pte_batch_count() for PTEs batch optimization.
2025-10-27 19:51 ` Lorenzo Stoakes
@ 2025-10-27 20:21 ` Ryan Roberts
0 siblings, 0 replies; 13+ messages in thread
From: Ryan Roberts @ 2025-10-27 20:21 UTC (permalink / raw)
To: Lorenzo Stoakes, David Hildenbrand
Cc: Zhang Qilong, akpm, Liam.Howlett, vbabka, rppt, surenb, mhocko,
jannh, pfalcato, linux-mm, linux-kernel, wangkefeng.wang,
sunnanyong, Dev Jain
On 27/10/2025 19:51, Lorenzo Stoakes wrote:
> +Dev, Ryan
>
> Please ensure to keep Dev + Ryan in the loop on all future iterations of this.
>
> On Mon, Oct 27, 2025 at 08:24:40PM +0100, David Hildenbrand wrote:
>> On 27.10.25 15:03, Zhang Qilong wrote:
>>> Currently, the PTEs batch requires folio access, with the maximum
>>> quantity limited to the PFNs contained within the folio. However,
>>> in certain case (such as mremap_folio_pte_batch and mincore_pte_range),
>>> accessing the folio is unnecessary and expensive.
>>>
>>> For scenarios that do not require folio access, this patch introduces
>>> can_pte_batch_count(). With contiguous physical addresses and identical
>>> PTE attribut bits, we can now process more page table entries at once,
>>> in batch, not just limited to entries mapped within a single folio. On
>>> the other hand, it avoid the folio access.
>>>
>>> Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
>>> ---
>>> mm/internal.h | 76 +++++++++++++++++++++++++++++++++++++++------------
>>> 1 file changed, 58 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index 1561fc2ff5b8..92034ca9092d 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -233,61 +233,62 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>> pte = pte_wrprotect(pte);
>>> return pte_mkold(pte);
>>> }
>>> /**
>>> - * folio_pte_batch_flags - detect a PTE batch for a large folio
>>> - * @folio: The large folio to detect a PTE batch for.
>>> + * can_pte_batch_count - detect a PTE batch in range [ptep, to ptep + max_nr)
>>
>> I really don't like the name.
>>
>> Maybe it's just pte_batch().
>
> Yeah the name's terrible.
>
> But I'm iffy about this series as a whole.
>
> 'can' implies boolean, it should be something like get pte batch or count pte
> batch or something like this. It's silly to partially replace other functions
> also.
>
> But I've doubtful as to whether any of this will work...
>
>>
>>> * @vma: The VMA. Only relevant with FPB_MERGE_WRITE, otherwise can be NULL.
>>> * @ptep: Page table pointer for the first entry.
>>> * @ptentp: Pointer to a COPY of the first page table entry whose flags this
>>> * function updates based on @flags if appropriate.
>>> * @max_nr: The maximum number of table entries to consider.
>>> * @flags: Flags to modify the PTE batch semantics.
>>> *
>>> - * Detect a PTE batch: consecutive (present) PTEs that map consecutive
>>> - * pages of the same large folio in a single VMA and a single page table.
>>> + * This interface is designed for this case that do not require folio access.
>>> + * If folio consideration is needed, please call folio_pte_batch_flags instead.
>
> I'm pretty certain we need to make sure we do not cross folio boundaries, which
> kills this series if so does it not?
>
> Ryan - can you confirm?
Whenever you call set_ptes() for more than 1 pte, you are signalling that the
architecture is permitted to track only a single access and dirty bit that
covers the whole batch. arm64 takes advantage of this to use its "contiguous
PTE" HW feature.
The core-mm considers access and dirty properties of a folio, so while fidelity
is lost in the pgtable due to contiguous mappings, the core-mm only cares at the
granularity of a folio, so as long as all the ptes set by a single call to
set_ptes() belong to the same folio, the system isn't losing any fidelity overall.
So to keep things simple, we document that all the pages must belong to the same
folio in set_ptes():
/**
* set_ptes - Map consecutive pages to a contiguous range of addresses.
* @mm: Address space to map the pages into.
* @addr: Address to map the first page at.
* @ptep: Page table pointer for the first entry.
* @pte: Page table entry for the first page.
* @nr: Number of pages to map.
*
* When nr==1, initial state of pte may be present or not present, and new state
* may be present or not present. When nr>1, initial state of all ptes must be
* not present, and new state must be present.
*
* May be overridden by the architecture, or the architecture can define
* set_pte() and PFN_PTE_SHIFT.
*
* Context: The caller holds the page table lock. The pages all belong
* to the same folio. The PTEs are all in the same PMD.
*/
So I guess this new function isn't problematic on its own, but if the plan is
to use it to decide how to call set_ptes() then it becomes problematic. The side
effect I foresee is writeback amplification because neighbouring folios get
erroneously dirtied due to a write to an adjacently mapped one that ends up
sharing a contpte mapping.
But there may be cases where we don't care about access/dirty tracking at all.
PFNMAP? In which case something like this may make sense.
Thanks,
Ryan
>
>>> + *
>>> + * Detect a PTE batch: consecutive (present) PTEs that map consecutive pages
>>> + * in a single VMA and a single page table.
>>> *
>>> * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
>>> * the accessed bit, writable bit, dirty bit (unless FPB_RESPECT_DIRTY is set)
>>> * and soft-dirty bit (unless FPB_RESPECT_SOFT_DIRTY is set).
>>> *
>>> - * @ptep must map any page of the folio. max_nr must be at least one and
>>> + * @ptep point to the first entry in range, max_nr must be at least one and
>>> * must be limited by the caller so scanning cannot exceed a single VMA and
>>> * a single page table.
>>> *
>>> * Depending on the FPB_MERGE_* flags, the pte stored at @ptentp will
>>> * be updated: it's crucial that a pointer to a COPY of the first
>>> * page table entry, obtained through ptep_get(), is provided as @ptentp.
>>> *
>>> - * This function will be inlined to optimize based on the input parameters;
>>> - * consider using folio_pte_batch() instead if applicable.
>>> + * The following folio_pte_batch_flags() deal with PTEs that mapped in a
>>> + * single folio. However can_pte_batch_count has the capability to handle
>>> + * PTEs that mapped in consecutive folios. If flags is not set, it will ignore
>>> + * the accessed, writable and dirty bits. Once the flags is set, the respect
>>> + * bit(s) will be compared in pte_same(), if the advanced pte_batch_hint()
>>> + * respect pte bit is different, pte_same() will return false and break. This
>>> + * ensures the correctness of handling multiple folio PTEs.
>>> + *
>>> + * This function will be inlined to optimize based on the input parameters.
>>> *
>>> * Return: the number of table entries in the batch.
>>> */
>>
>> I recall trouble if we try batching across folios:
>
> Yup pretty sure Ryan said we don't/can't in a previous thread. Now cc'd...
>
>>
>> commit 7b08b74f3d99f6b801250683c751d391128799ec (tag: mm-hotfixes-stable-2025-05-10-14-23)
>> Author: Petr Vaněk <arkamar@atlas.cz>
>> Date: Fri May 2 23:50:19 2025 +0200
>>
>> mm: fix folio_pte_batch() on XEN PV
>> On XEN PV, folio_pte_batch() can incorrectly batch beyond the end of a
>> folio due to a corner case in pte_advance_pfn(). Specifically, when the
>> PFN following the folio maps to an invalidated MFN,
>> expected_pte = pte_advance_pfn(expected_pte, nr);
>> produces a pte_none(). If the actual next PTE in memory is also
>> pte_none(), the pte_same() succeeds,
>> if (!pte_same(pte, expected_pte))
>> break;
>> the loop is not broken, and batching continues into unrelated memory.
>> ...
>>
>>
>> --
>> Cheers
>>
>> David / dhildenb
>>
>
> Thanks, Lorenzk
* Re: [RFC PATCH 2/3] mm/mincore: Use can_pte_batch_count() in mincore_pte_range() for pte batch
@ 2025-10-28 11:32 zhangqilong
0 siblings, 0 replies; 13+ messages in thread
From: zhangqilong @ 2025-10-28 11:32 UTC (permalink / raw)
To: David Hildenbrand, akpm, lorenzo.stoakes, Liam.Howlett, vbabka,
rppt, surenb, mhocko, jannh, pfalcato
Cc: linux-mm, linux-kernel, Wangkefeng (OS Kernel Lab), Sunnanyong
> On 27.10.25 15:03, Zhang Qilong wrote:
> > In current mincore_pte_range(), if pte_batch_hint() return one pte,
> > it's not efficient, just call new added can_pte_batch_count().
> >
> > In ARM64 qemu, with 8 CPUs, 32G memory, a simple test demo like:
> > 1. mmap 1G anon memory
> > 2. write 1G data by 4k step
> > 3. mincore the mmaped 1G memory
> > 4. get the time consumed by mincore
> >
> > Tested the following cases:
> > - 4k, disabled all hugepage setting.
> > - 64k mTHP, only enable 64k hugepage setting.
> >
> > Before
> >
> > Case status | Consumed time (us) |
> > ----------------------------------|
> > 4k | 7356 |
> > 64k mTHP | 3670 |
> >
> > Pathed:
> >
> > Case status | Consumed time (us) |
> > ----------------------------------|
> > 4k | 4419 |
> > 64k mTHP | 3061 |
> >
>
> I assume you're only lucky in that benchmark because you got consecutive 4k
> pages / 64k mTHP from the buddy, right?
Yeah, the demo case is relatively simple, which may result in stronger contiguity
of the allocated physical page addresses.
This case primarily aims to validate the optimization's effectiveness with
contiguous page addresses. Maybe we also need to watch for side effects with
non-contiguous page addresses.
>
> So I suspect that this will mostly just make a micro benchmark happy, because
> the reality where we allocate randomly over time, for the PCP, etc will look
> quite different.
>
> --
> Cheers
>
> David / dhildenb
>
* Re: [RFC PATCH 2/3] mm/mincore: Use can_pte_batch_count() in mincore_pte_range() for pte batch
@ 2025-10-28 11:13 zhangqilong
0 siblings, 0 replies; 13+ messages in thread
From: zhangqilong @ 2025-10-28 11:13 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, david, Liam.Howlett, vbabka, rppt, surenb, mhocko, jannh,
pfalcato, linux-mm, linux-kernel, Wangkefeng (OS Kernel Lab),
Sunnanyong
On Mon, Oct 27, 2025 at 10:03:14PM +0800, Zhang Qilong wrote:
> > In current mincore_pte_range(), if pte_batch_hint() return one pte,
> > it's not efficient, just call new added can_pte_batch_count().
> >
> > In ARM64 qemu, with 8 CPUs, 32G memory, a simple test demo like:
> > 1. mmap 1G anon memory
> > 2. write 1G data by 4k step
> > 3. mincore the mmaped 1G memory
> > 4. get the time consumed by mincore
> >
> > Tested the following cases:
> > - 4k, disabled all hugepage setting.
> > - 64k mTHP, only enable 64k hugepage setting.
> >
> > Before
> >
> > Case status | Consumed time (us) |
> > ----------------------------------|
> > 4k | 7356 |
> > 64k mTHP | 3670 |
> >
> > Pathed:
> >
> > Case status | Consumed time (us) |
> > ----------------------------------|
> > 4k | 4419 |
> > 64k mTHP | 3061 |
> >
> > The result is evident and demonstrate a significant improvement in the
> > pte batch. While verification within a single environment may have
> > inherent randomness. there is a high probability of achieving positive
> > effects.
>
> Recent batch PTE series seriously regressed non-arm, so I'm afraid we can't
> accept any series that doesn't show statistics for _other platforms_.
>
> Please make sure you at least test x86-64.
OK, I will run a test on x86-64 as soon as possible; it may yield unexpected results.
>
> This code is very sensitive and we're not going to accept a patch like this without
> _being sure_ it's ok.
Yeah, it's a hot path, we should be extremely cautious.
>
> >
> > Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
> > ---
> > mm/mincore.c | 10 +++-------
> > 1 file changed, 3 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/mincore.c b/mm/mincore.c index
> > 8ec4719370e1..2cc5d276d1cd 100644
> > --- a/mm/mincore.c
> > +++ b/mm/mincore.c
> > @@ -178,18 +178,14 @@ static int mincore_pte_range(pmd_t *pmd,
> unsigned long addr, unsigned long end,
> > /* We need to do cache lookup too for pte markers */
> > if (pte_none_mostly(pte))
> > __mincore_unmapped_range(addr, addr + PAGE_SIZE,
> > vma, vec);
> > else if (pte_present(pte)) {
> > - unsigned int batch = pte_batch_hint(ptep, pte);
> > -
> > - if (batch > 1) {
> > - unsigned int max_nr = (end - addr) >>
> PAGE_SHIFT;
> > -
> > - step = min_t(unsigned int, batch, max_nr);
> > - }
> > + unsigned int max_nr = (end - addr) >> PAGE_SHIFT;
> >
> > + step = can_pte_batch_count(vma, ptep, &pte,
> > + max_nr, 0);
> > for (i = 0; i < step; i++)
> > vec[i] = 1;
> > } else { /* pte is a swap entry */
> > *vec = mincore_swap(pte_to_swp_entry(pte), false);
> > }
> > --
> > 2.43.0
> >
Thread overview: 13+ messages
2025-10-27 14:03 [RFC PATCH 0/3] mm: PTEs batch optimization in mincore and mremap Zhang Qilong
2025-10-27 14:03 ` [RFC PATCH 1/3] mm: Introduce can_pte_batch_count() for PTEs batch optimization Zhang Qilong
2025-10-27 19:24 ` David Hildenbrand
2025-10-27 19:51 ` Lorenzo Stoakes
2025-10-27 20:21 ` Ryan Roberts
2025-10-27 14:03 ` [RFC PATCH 2/3] mm/mincore: Use can_pte_batch_count() in mincore_pte_range() for pte batch Zhang Qilong
2025-10-27 19:27 ` David Hildenbrand
2025-10-27 19:34 ` Lorenzo Stoakes
2025-10-27 14:03 ` [RFC PATCH 3/3] mm/mremap: Use can_pte_batch_count() instead of folio_pte_batch() for pte batch Zhang Qilong
2025-10-27 19:41 ` David Hildenbrand
2025-10-27 19:57 ` Lorenzo Stoakes
2025-10-28 11:13 [RFC PATCH 2/3] mm/mincore: Use can_pte_batch_count() in mincore_pte_range() for pte batch zhangqilong
2025-10-28 11:32 zhangqilong