linux-mm.kvack.org archive mirror
* [PATCH v2 0/2] Expand scope of khugepaged anonymous collapse
@ 2025-09-08  7:50 Dev Jain
  2025-09-08  7:50 ` [PATCH v2 1/2] mm: Enable khugepaged anonymous collapse on non-writable regions Dev Jain
  2025-09-08  7:50 ` [PATCH v2 2/2] mm: Drop all references of writable and SCAN_PAGE_RO Dev Jain
  0 siblings, 2 replies; 7+ messages in thread
From: Dev Jain @ 2025-09-08  7:50 UTC (permalink / raw)
  To: akpm, david, kas, willy, hughd
  Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, baohua, richard.weiyang, linux-mm, linux-kernel,
	Dev Jain

Currently khugepaged does not collapse an anonymous region which does not
have a single writable pte. This is wasteful since a region mapped with
non-writable ptes, for example, non-writable VMAs mapped by the
application, won't benefit from THP collapse.

An additional consequence of this constraint is that MADV_COLLAPSE does not
perform a collapse on a non-writable VMA, and this restriction is nowhere
to be found on the manpage - the restriction itself sounds wrong to me
since the user knows the protection of the memory it has mapped, so
collapsing read-only memory via madvise() should be a choice of the
user which shouldn't be overridden by the kernel.

Therefore, remove this constraint.
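
The MADV_COLLAPSE case described above can be exercised from userspace
with a sketch like the following. It is an illustration only, not part
of the series: the helper name collapse_readonly_range() is made up
here, and the 4 KiB base page and 2 MiB PMD sizes are assumptions that
do not hold on every configuration. The thread implies that a kernel
which still has the check refuses the request; the exact errno is not
stated in this thread, so the sketch just reports whatever madvise()
sets. If THP is enabled for page faults, the range may already be
backed by a huge page before the madvise() call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* Linux uapi value; fallback for older libc headers */
#endif

/* Assumptions for illustration: 4 KiB base pages and a 2 MiB PMD size. */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_PMD_SIZE  (2UL << 20)

/*
 * Returns 0 if madvise(MADV_COLLAPSE) succeeded on a read-only anonymous
 * range, a positive errno value if the kernel refused, or -1 if the
 * setup (mmap/mprotect) failed.
 */
static int collapse_readonly_range(void)
{
	size_t len = 2 * SKETCH_PMD_SIZE;	/* over-allocate so we can align */
	char *raw, *aligned;
	int ret = 0;

	raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return -1;

	/* Round up to the next PMD-sized boundary inside the mapping. */
	aligned = (char *)(((uintptr_t)raw + SKETCH_PMD_SIZE - 1) &
			   ~(uintptr_t)(SKETCH_PMD_SIZE - 1));
	assert(((uintptr_t)aligned & (SKETCH_PMD_SIZE - 1)) == 0);

	/* Fault in every base page while the range is still writable. */
	for (size_t off = 0; off < SKETCH_PMD_SIZE; off += SKETCH_PAGE_SIZE)
		aligned[off] = 1;

	/* Drop write permission so the range is mapped non-writable. */
	if (mprotect(aligned, SKETCH_PMD_SIZE, PROT_READ)) {
		munmap(raw, len);
		return -1;
	}

	/* Ask for a synchronous collapse of the read-only range. */
	if (madvise(aligned, SKETCH_PMD_SIZE, MADV_COLLAPSE))
		ret = errno;

	munmap(raw, len);
	return ret;
}
```

A caller can print the return value to compare kernels with and without
this series; a return of 0 only says that madvise() reported success.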

On an arm64 bare metal machine, comparing with vanilla 6.17-rc2, an
average of 5% improvement is seen on some mmtests benchmarks,
particularly hackbench, with a maximum improvement of 12%. In the
following table, (I) denotes statistically significant improvement,
(R) denotes statistically significant regression.

+-------------------------+--------------------------------+---------------+
| mmtests/hackbench       | process-pipes-1 (seconds)      |        -0.06% |
|                         | process-pipes-4 (seconds)      |        -0.27% |
|                         | process-pipes-7 (seconds)      |   (I) -12.13% |
|                         | process-pipes-12 (seconds)     |    (I) -5.32% |
|                         | process-pipes-21 (seconds)     |    (I) -2.87% |
|                         | process-pipes-30 (seconds)     |    (I) -3.39% |
|                         | process-pipes-48 (seconds)     |    (I) -5.65% |
|                         | process-pipes-79 (seconds)     |    (I) -6.74% |
|                         | process-pipes-110 (seconds)    |    (I) -6.26% |
|                         | process-pipes-141 (seconds)    |    (I) -4.99% |
|                         | process-pipes-172 (seconds)    |    (I) -4.45% |
|                         | process-pipes-203 (seconds)    |    (I) -3.65% |
|                         | process-pipes-234 (seconds)    |    (I) -3.45% |
|                         | process-pipes-256 (seconds)    |    (I) -3.47% |
|                         | process-sockets-1 (seconds)    |         2.13% |
|                         | process-sockets-4 (seconds)    |         1.02% |
|                         | process-sockets-7 (seconds)    |        -0.26% |
|                         | process-sockets-12 (seconds)   |        -1.24% |
|                         | process-sockets-21 (seconds)   |         0.01% |
|                         | process-sockets-30 (seconds)   |        -0.15% |
|                         | process-sockets-48 (seconds)   |         0.15% |
|                         | process-sockets-79 (seconds)   |         1.45% |
|                         | process-sockets-110 (seconds)  |        -1.64% |
|                         | process-sockets-141 (seconds)  |    (I) -4.27% |
|                         | process-sockets-172 (seconds)  |         0.30% |
|                         | process-sockets-203 (seconds)  |        -1.71% |
|                         | process-sockets-234 (seconds)  |        -1.94% |
|                         | process-sockets-256 (seconds)  |        -0.71% |
|                         | thread-pipes-1 (seconds)       |         0.66% |
|                         | thread-pipes-4 (seconds)       |         1.66% |
|                         | thread-pipes-7 (seconds)       |        -0.17% |
|                         | thread-pipes-12 (seconds)      |    (I) -4.12% |
|                         | thread-pipes-21 (seconds)      |    (I) -2.13% |
|                         | thread-pipes-30 (seconds)      |    (I) -3.78% |
|                         | thread-pipes-48 (seconds)      |    (I) -5.77% |
|                         | thread-pipes-79 (seconds)      |    (I) -5.31% |
|                         | thread-pipes-110 (seconds)     |    (I) -6.12% |
|                         | thread-pipes-141 (seconds)     |    (I) -4.00% |
|                         | thread-pipes-172 (seconds)     |    (I) -3.01% |
|                         | thread-pipes-203 (seconds)     |    (I) -2.62% |
|                         | thread-pipes-234 (seconds)     |    (I) -2.00% |
|                         | thread-pipes-256 (seconds)     |    (I) -2.30% |
|                         | thread-sockets-1 (seconds)     |     (R) 2.39% |
+-------------------------+--------------------------------+---------------+

+-------------------------+--------------------------------+---------------+
| mmtests/sysbench-mutex  | sysbenchmutex-1 (usec)         |        -0.02% |
|                         | sysbenchmutex-4 (usec)         |        -0.02% |
|                         | sysbenchmutex-7 (usec)         |         0.00% |
|                         | sysbenchmutex-12 (usec)        |         0.12% |
|                         | sysbenchmutex-21 (usec)        |        -0.40% |
|                         | sysbenchmutex-30 (usec)        |         0.08% |
|                         | sysbenchmutex-48 (usec)        |         2.59% |
|                         | sysbenchmutex-79 (usec)        |        -0.80% |
|                         | sysbenchmutex-110 (usec)       |        -3.87% |
|                         | sysbenchmutex-128 (usec)       |    (I) -4.46% |
+-------------------------+--------------------------------+---------------+

---
Based on today's mm-new.

v1->v2:
- Replace non-writable VMAs with non-writable PTEs to be more specific
- Add cover letter

RFC->v1:
- Drop writable references from tracepoints

RFC:
- https://lore.kernel.org/all/20250901074817.73012-1-dev.jain@arm.com/

Dev Jain (2):
  mm: Enable khugepaged anonymous collapse on non-writable regions
  mm: Drop all references of writable and SCAN_PAGE_RO

 include/trace/events/huge_memory.h | 19 ++++++-------------
 mm/khugepaged.c                    | 23 +++++------------------
 2 files changed, 11 insertions(+), 31 deletions(-)

-- 
2.30.2




* [PATCH v2 1/2] mm: Enable khugepaged anonymous collapse on non-writable regions
  2025-09-08  7:50 [PATCH v2 0/2] Expand scope of khugepaged anonymous collapse Dev Jain
@ 2025-09-08  7:50 ` Dev Jain
  2025-09-09 18:49   ` Zach O'Keefe
  2025-09-10  4:03   ` Anshuman Khandual
  2025-09-08  7:50 ` [PATCH v2 2/2] mm: Drop all references of writable and SCAN_PAGE_RO Dev Jain
  1 sibling, 2 replies; 7+ messages in thread
From: Dev Jain @ 2025-09-08  7:50 UTC (permalink / raw)
  To: akpm, david, kas, willy, hughd
  Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, baohua, richard.weiyang, linux-mm, linux-kernel,
	Dev Jain

Currently khugepaged does not collapse an anonymous region which does not
have a single writable pte. This is wasteful since a region mapped with
non-writable ptes, for example, non-writable VMAs mapped by the
application, won't benefit from THP collapse.

An additional consequence of this constraint is that MADV_COLLAPSE does not
perform a collapse on a non-writable VMA, and this restriction is nowhere
to be found on the manpage - the restriction itself sounds wrong to me
since the user knows the protection of the memory it has mapped, so
collapsing read-only memory via madvise() should be a choice of the
user which shouldn't be overridden by the kernel.

Therefore, remove this restriction by not honouring SCAN_PAGE_RO.

Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com> 
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/khugepaged.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4ec324a4c1fe..a0f1df2a7ae6 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -676,9 +676,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			writable = true;
 	}
 
-	if (unlikely(!writable)) {
-		result = SCAN_PAGE_RO;
-	} else if (unlikely(cc->is_khugepaged && !referenced)) {
+	if (unlikely(cc->is_khugepaged && !referenced)) {
 		result = SCAN_LACK_REFERENCED_PAGE;
 	} else {
 		result = SCAN_SUCCEED;
@@ -1421,9 +1419,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 		     mmu_notifier_test_young(vma->vm_mm, _address)))
 			referenced++;
 	}
-	if (!writable) {
-		result = SCAN_PAGE_RO;
-	} else if (cc->is_khugepaged &&
+	if (cc->is_khugepaged &&
 		   (!referenced ||
 		    (unmapped && referenced < HPAGE_PMD_NR / 2))) {
 		result = SCAN_LACK_REFERENCED_PAGE;
@@ -2830,7 +2826,6 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 		case SCAN_PMD_NULL:
 		case SCAN_PTE_NON_PRESENT:
 		case SCAN_PTE_UFFD_WP:
-		case SCAN_PAGE_RO:
 		case SCAN_LACK_REFERENCED_PAGE:
 		case SCAN_PAGE_NULL:
 		case SCAN_PAGE_COUNT:
-- 
2.30.2




* [PATCH v2 2/2] mm: Drop all references of writable and SCAN_PAGE_RO
  2025-09-08  7:50 [PATCH v2 0/2] Expand scope of khugepaged anonymous collapse Dev Jain
  2025-09-08  7:50 ` [PATCH v2 1/2] mm: Enable khugepaged anonymous collapse on non-writable regions Dev Jain
@ 2025-09-08  7:50 ` Dev Jain
  2025-09-09 18:51   ` Zach O'Keefe
  2025-09-10  4:06   ` Anshuman Khandual
  1 sibling, 2 replies; 7+ messages in thread
From: Dev Jain @ 2025-09-08  7:50 UTC (permalink / raw)
  To: akpm, david, kas, willy, hughd
  Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, baohua, richard.weiyang, linux-mm, linux-kernel,
	Dev Jain

Now that all actionable outcomes from checking pte_write() are gone,
drop the related references.
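
For anyone consuming these tracepoints from a script, the visible effect
is that the writable= field disappears from the printed event. The
sketch below is a hypothetical userspace helper, not part of this patch:
it parses the payload of mm_collapse_huge_page_isolate in either layout,
using only the two TP_printk format strings shown in the diff. The
struct and function names are invented for illustration, and the caller
is assumed to have already stripped the task/CPU/timestamp prefix that
the trace file puts in front of the payload.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one parsed event payload. */
struct isolate_event {
	unsigned long pfn;
	int none_or_zero;
	int referenced;
	int writable;		/* -1 when the field is absent (new layout) */
	char status[32];
};

/* Returns 0 on success, -1 if the payload matches neither layout. */
static int parse_isolate_event(const char *payload, struct isolate_event *ev)
{
	assert(payload && ev);
	memset(ev, 0, sizeof(*ev));

	/* Layout after this patch: no writable= field. */
	if (sscanf(payload,
		   "scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, status=%31s",
		   &ev->pfn, &ev->none_or_zero, &ev->referenced,
		   ev->status) == 4) {
		ev->writable = -1;
		return 0;
	}

	/* Layout before this patch: writable= sits before status=. */
	if (sscanf(payload,
		   "scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, writable=%d, status=%31s",
		   &ev->pfn, &ev->none_or_zero, &ev->referenced,
		   &ev->writable, ev->status) == 5)
		return 0;

	return -1;
}
```

Trying the new layout first and falling back to the old one keeps a
single tool usable across kernels on both sides of the change.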

Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 include/trace/events/huge_memory.h | 19 ++++++-------------
 mm/khugepaged.c                    | 14 +++-----------
 2 files changed, 9 insertions(+), 24 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 2305df6cb485..dd94d14a2427 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -19,7 +19,6 @@
 	EM( SCAN_PTE_NON_PRESENT,	"pte_non_present")		\
 	EM( SCAN_PTE_UFFD_WP,		"pte_uffd_wp")			\
 	EM( SCAN_PTE_MAPPED_HUGEPAGE,	"pte_mapped_hugepage")		\
-	EM( SCAN_PAGE_RO,		"no_writable_page")		\
 	EM( SCAN_LACK_REFERENCED_PAGE,	"lack_referenced_page")		\
 	EM( SCAN_PAGE_NULL,		"page_null")			\
 	EM( SCAN_SCAN_ABORT,		"scan_aborted")			\
@@ -55,15 +54,14 @@ SCAN_STATUS
 
 TRACE_EVENT(mm_khugepaged_scan_pmd,
 
-	TP_PROTO(struct mm_struct *mm, struct folio *folio, bool writable,
+	TP_PROTO(struct mm_struct *mm, struct folio *folio,
 		 int referenced, int none_or_zero, int status, int unmapped),
 
-	TP_ARGS(mm, folio, writable, referenced, none_or_zero, status, unmapped),
+	TP_ARGS(mm, folio, referenced, none_or_zero, status, unmapped),
 
 	TP_STRUCT__entry(
 		__field(struct mm_struct *, mm)
 		__field(unsigned long, pfn)
-		__field(bool, writable)
 		__field(int, referenced)
 		__field(int, none_or_zero)
 		__field(int, status)
@@ -73,17 +71,15 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
 	TP_fast_assign(
 		__entry->mm = mm;
 		__entry->pfn = folio ? folio_pfn(folio) : -1;
-		__entry->writable = writable;
 		__entry->referenced = referenced;
 		__entry->none_or_zero = none_or_zero;
 		__entry->status = status;
 		__entry->unmapped = unmapped;
 	),
 
-	TP_printk("mm=%p, scan_pfn=0x%lx, writable=%d, referenced=%d, none_or_zero=%d, status=%s, unmapped=%d",
+	TP_printk("mm=%p, scan_pfn=0x%lx, referenced=%d, none_or_zero=%d, status=%s, unmapped=%d",
 		__entry->mm,
 		__entry->pfn,
-		__entry->writable,
 		__entry->referenced,
 		__entry->none_or_zero,
 		__print_symbolic(__entry->status, SCAN_STATUS),
@@ -117,15 +113,14 @@ TRACE_EVENT(mm_collapse_huge_page,
 TRACE_EVENT(mm_collapse_huge_page_isolate,
 
 	TP_PROTO(struct folio *folio, int none_or_zero,
-		 int referenced, bool  writable, int status),
+		 int referenced, int status),
 
-	TP_ARGS(folio, none_or_zero, referenced, writable, status),
+	TP_ARGS(folio, none_or_zero, referenced, status),
 
 	TP_STRUCT__entry(
 		__field(unsigned long, pfn)
 		__field(int, none_or_zero)
 		__field(int, referenced)
-		__field(bool, writable)
 		__field(int, status)
 	),
 
@@ -133,15 +128,13 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
 		__entry->pfn = folio ? folio_pfn(folio) : -1;
 		__entry->none_or_zero = none_or_zero;
 		__entry->referenced = referenced;
-		__entry->writable = writable;
 		__entry->status = status;
 	),
 
-	TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, writable=%d, status=%s",
+	TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, status=%s",
 		__entry->pfn,
 		__entry->none_or_zero,
 		__entry->referenced,
-		__entry->writable,
 		__print_symbolic(__entry->status, SCAN_STATUS))
 );
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a0f1df2a7ae6..af5f5c80fe4e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -39,7 +39,6 @@ enum scan_result {
 	SCAN_PTE_NON_PRESENT,
 	SCAN_PTE_UFFD_WP,
 	SCAN_PTE_MAPPED_HUGEPAGE,
-	SCAN_PAGE_RO,
 	SCAN_LACK_REFERENCED_PAGE,
 	SCAN_PAGE_NULL,
 	SCAN_SCAN_ABORT,
@@ -557,7 +556,6 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	struct folio *folio = NULL;
 	pte_t *_pte;
 	int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
-	bool writable = false;
 
 	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
 	     _pte++, address += PAGE_SIZE) {
@@ -671,9 +669,6 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		     folio_test_referenced(folio) || mmu_notifier_test_young(vma->vm_mm,
 								     address)))
 			referenced++;
-
-		if (pte_write(pteval))
-			writable = true;
 	}
 
 	if (unlikely(cc->is_khugepaged && !referenced)) {
@@ -681,13 +676,13 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	} else {
 		result = SCAN_SUCCEED;
 		trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
-						    referenced, writable, result);
+						    referenced, result);
 		return result;
 	}
 out:
 	release_pte_pages(pte, _pte, compound_pagelist);
 	trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
-					    referenced, writable, result);
+					    referenced, result);
 	return result;
 }
 
@@ -1280,7 +1275,6 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 	unsigned long _address;
 	spinlock_t *ptl;
 	int node = NUMA_NO_NODE, unmapped = 0;
-	bool writable = false;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
@@ -1344,8 +1338,6 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 			result = SCAN_PTE_UFFD_WP;
 			goto out_unmap;
 		}
-		if (pte_write(pteval))
-			writable = true;
 
 		page = vm_normal_page(vma, _address, pteval);
 		if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
@@ -1435,7 +1427,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 		*mmap_locked = false;
 	}
 out:
-	trace_mm_khugepaged_scan_pmd(mm, folio, writable, referenced,
+	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
 				     none_or_zero, result, unmapped);
 	return result;
 }
-- 
2.30.2




* Re: [PATCH v2 1/2] mm: Enable khugepaged anonymous collapse on non-writable regions
  2025-09-08  7:50 ` [PATCH v2 1/2] mm: Enable khugepaged anonymous collapse on non-writable regions Dev Jain
@ 2025-09-09 18:49   ` Zach O'Keefe
  2025-09-10  4:03   ` Anshuman Khandual
  1 sibling, 0 replies; 7+ messages in thread
From: Zach O'Keefe @ 2025-09-09 18:49 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, kas, willy, hughd, ziy, baolin.wang,
	lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, baohua,
	richard.weiyang, linux-mm, linux-kernel

On Mon, Sep 8, 2025 at 12:51 AM Dev Jain <dev.jain@arm.com> wrote:
>
> Currently khugepaged does not collapse an anonymous region which does not
> have a single writable pte. This is wasteful since a region mapped with
> non-writable ptes, for example, non-writable VMAs mapped by the
> application, won't benefit from THP collapse.
>
> An additional consequence of this constraint is that MADV_COLLAPSE does not
> perform a collapse on a non-writable VMA, and this restriction is nowhere
> to be found on the manpage - the restriction itself sounds wrong to me
> since the user knows the protection of the memory it has mapped, so
> collapsing read-only memory via madvise() should be a choice of the
> user which shouldn't be overridden by the kernel.

Sorry, late to the party. Certainly agree wrt MADV_COLLAPSE.

Ditto for khugepaged as well. The check was added when support for
non-writable pages was added to khugepaged, while retaining the
heuristic that at least one pte should be writable; see 10359213d05a
("mm: incorporate read-only pages into transparent huge pages"), which
predates max_ptes_swap.
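
The effect of dropping that heuristic on the end-of-scan decision in
hpage_collapse_scan_pmd() can be modelled in plain C. This is a
standalone sketch of the hunk quoted below, not kernel code: every
sketch_* name is invented for illustration, and the value 512 for
HPAGE_PMD_NR is an assumption (4 KiB base pages with a 2 MiB PMD).

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_HPAGE_PMD_NR 512	/* assumed number of ptes per PMD */

enum sketch_result {
	SKETCH_SUCCEED,
	SKETCH_PAGE_RO,
	SKETCH_LACK_REFERENCED_PAGE,
};

/* What the scan loop has accumulated by the time it finishes. */
struct sketch_scan {
	bool is_khugepaged;	/* false models a MADV_COLLAPSE caller */
	bool writable;		/* at least one writable pte was seen */
	int referenced;		/* count of referenced ptes */
	int unmapped;		/* count of swapped-out ptes */
};

/* The reference check that both versions share. */
static bool sketch_lacks_references(const struct sketch_scan *s)
{
	return s->is_khugepaged &&
	       (!s->referenced ||
		(s->unmapped && s->referenced < SKETCH_HPAGE_PMD_NR / 2));
}

/* Decision before the patch: a range with no writable pte is rejected. */
static enum sketch_result sketch_decide_before(const struct sketch_scan *s)
{
	assert(s->referenced >= 0 && s->referenced <= SKETCH_HPAGE_PMD_NR);
	if (!s->writable)
		return SKETCH_PAGE_RO;
	if (sketch_lacks_references(s))
		return SKETCH_LACK_REFERENCED_PAGE;
	return SKETCH_SUCCEED;
}

/* Decision after the patch: writability no longer matters. */
static enum sketch_result sketch_decide_after(const struct sketch_scan *s)
{
	assert(s->referenced >= 0 && s->referenced <= SKETCH_HPAGE_PMD_NR);
	if (sketch_lacks_references(s))
		return SKETCH_LACK_REFERENCED_PAGE;
	return SKETCH_SUCCEED;
}
```

In this model a fully read-only range moves from SKETCH_PAGE_RO to
whatever the reference check decides, while ranges with at least one
writable pte get the same answer from both functions.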

> Therefore, remove this restriction by not honouring SCAN_PAGE_RO.
>
> Acked-by: David Hildenbrand <david@redhat.com>
> Acked-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
> Reviewed-by: Kiryl Shutsemau <kas@kernel.org>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

Reviewed-by: Zach O'Keefe <zokeefe@google.com>

> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/khugepaged.c | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 4ec324a4c1fe..a0f1df2a7ae6 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -676,9 +676,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                         writable = true;
>         }
>
> -       if (unlikely(!writable)) {
> -               result = SCAN_PAGE_RO;
> -       } else if (unlikely(cc->is_khugepaged && !referenced)) {
> +       if (unlikely(cc->is_khugepaged && !referenced)) {
>                 result = SCAN_LACK_REFERENCED_PAGE;
>         } else {
>                 result = SCAN_SUCCEED;
> @@ -1421,9 +1419,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>                      mmu_notifier_test_young(vma->vm_mm, _address)))
>                         referenced++;
>         }
> -       if (!writable) {
> -               result = SCAN_PAGE_RO;
> -       } else if (cc->is_khugepaged &&
> +       if (cc->is_khugepaged &&
>                    (!referenced ||
>                     (unmapped && referenced < HPAGE_PMD_NR / 2))) {
>                 result = SCAN_LACK_REFERENCED_PAGE;
> @@ -2830,7 +2826,6 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>                 case SCAN_PMD_NULL:
>                 case SCAN_PTE_NON_PRESENT:
>                 case SCAN_PTE_UFFD_WP:
> -               case SCAN_PAGE_RO:
>                 case SCAN_LACK_REFERENCED_PAGE:
>                 case SCAN_PAGE_NULL:
>                 case SCAN_PAGE_COUNT:
> --
> 2.30.2
>
>



* Re: [PATCH v2 2/2] mm: Drop all references of writable and SCAN_PAGE_RO
  2025-09-08  7:50 ` [PATCH v2 2/2] mm: Drop all references of writable and SCAN_PAGE_RO Dev Jain
@ 2025-09-09 18:51   ` Zach O'Keefe
  2025-09-10  4:06   ` Anshuman Khandual
  1 sibling, 0 replies; 7+ messages in thread
From: Zach O'Keefe @ 2025-09-09 18:51 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, kas, willy, hughd, ziy, baolin.wang,
	lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, baohua,
	richard.weiyang, linux-mm, linux-kernel

Thanks, Dev.

On Mon, Sep 8, 2025 at 12:51 AM Dev Jain <dev.jain@arm.com> wrote:
>
> Now that all actionable outcomes from checking pte_write() are gone,
> drop the related references.
>
> Acked-by: David Hildenbrand <david@redhat.com>
> Acked-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Kiryl Shutsemau <kas@kernel.org>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

Reviewed-by: Zach O'Keefe <zokeefe@google.com>


> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  include/trace/events/huge_memory.h | 19 ++++++-------------
>  mm/khugepaged.c                    | 14 +++-----------
>  2 files changed, 9 insertions(+), 24 deletions(-)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 2305df6cb485..dd94d14a2427 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -19,7 +19,6 @@
>         EM( SCAN_PTE_NON_PRESENT,       "pte_non_present")              \
>         EM( SCAN_PTE_UFFD_WP,           "pte_uffd_wp")                  \
>         EM( SCAN_PTE_MAPPED_HUGEPAGE,   "pte_mapped_hugepage")          \
> -       EM( SCAN_PAGE_RO,               "no_writable_page")             \
>         EM( SCAN_LACK_REFERENCED_PAGE,  "lack_referenced_page")         \
>         EM( SCAN_PAGE_NULL,             "page_null")                    \
>         EM( SCAN_SCAN_ABORT,            "scan_aborted")                 \
> @@ -55,15 +54,14 @@ SCAN_STATUS
>
>  TRACE_EVENT(mm_khugepaged_scan_pmd,
>
> -       TP_PROTO(struct mm_struct *mm, struct folio *folio, bool writable,
> +       TP_PROTO(struct mm_struct *mm, struct folio *folio,
>                  int referenced, int none_or_zero, int status, int unmapped),
>
> -       TP_ARGS(mm, folio, writable, referenced, none_or_zero, status, unmapped),
> +       TP_ARGS(mm, folio, referenced, none_or_zero, status, unmapped),
>
>         TP_STRUCT__entry(
>                 __field(struct mm_struct *, mm)
>                 __field(unsigned long, pfn)
> -               __field(bool, writable)
>                 __field(int, referenced)
>                 __field(int, none_or_zero)
>                 __field(int, status)
> @@ -73,17 +71,15 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
>         TP_fast_assign(
>                 __entry->mm = mm;
>                 __entry->pfn = folio ? folio_pfn(folio) : -1;
> -               __entry->writable = writable;
>                 __entry->referenced = referenced;
>                 __entry->none_or_zero = none_or_zero;
>                 __entry->status = status;
>                 __entry->unmapped = unmapped;
>         ),
>
> -       TP_printk("mm=%p, scan_pfn=0x%lx, writable=%d, referenced=%d, none_or_zero=%d, status=%s, unmapped=%d",
> +       TP_printk("mm=%p, scan_pfn=0x%lx, referenced=%d, none_or_zero=%d, status=%s, unmapped=%d",
>                 __entry->mm,
>                 __entry->pfn,
> -               __entry->writable,
>                 __entry->referenced,
>                 __entry->none_or_zero,
>                 __print_symbolic(__entry->status, SCAN_STATUS),
> @@ -117,15 +113,14 @@ TRACE_EVENT(mm_collapse_huge_page,
>  TRACE_EVENT(mm_collapse_huge_page_isolate,
>
>         TP_PROTO(struct folio *folio, int none_or_zero,
> -                int referenced, bool  writable, int status),
> +                int referenced, int status),
>
> -       TP_ARGS(folio, none_or_zero, referenced, writable, status),
> +       TP_ARGS(folio, none_or_zero, referenced, status),
>
>         TP_STRUCT__entry(
>                 __field(unsigned long, pfn)
>                 __field(int, none_or_zero)
>                 __field(int, referenced)
> -               __field(bool, writable)
>                 __field(int, status)
>         ),
>
> @@ -133,15 +128,13 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
>                 __entry->pfn = folio ? folio_pfn(folio) : -1;
>                 __entry->none_or_zero = none_or_zero;
>                 __entry->referenced = referenced;
> -               __entry->writable = writable;
>                 __entry->status = status;
>         ),
>
> -       TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, writable=%d, status=%s",
> +       TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, status=%s",
>                 __entry->pfn,
>                 __entry->none_or_zero,
>                 __entry->referenced,
> -               __entry->writable,
>                 __print_symbolic(__entry->status, SCAN_STATUS))
>  );
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index a0f1df2a7ae6..af5f5c80fe4e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -39,7 +39,6 @@ enum scan_result {
>         SCAN_PTE_NON_PRESENT,
>         SCAN_PTE_UFFD_WP,
>         SCAN_PTE_MAPPED_HUGEPAGE,
> -       SCAN_PAGE_RO,
>         SCAN_LACK_REFERENCED_PAGE,
>         SCAN_PAGE_NULL,
>         SCAN_SCAN_ABORT,
> @@ -557,7 +556,6 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>         struct folio *folio = NULL;
>         pte_t *_pte;
>         int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
> -       bool writable = false;
>
>         for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>              _pte++, address += PAGE_SIZE) {
> @@ -671,9 +669,6 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                      folio_test_referenced(folio) || mmu_notifier_test_young(vma->vm_mm,
>                                                                      address)))
>                         referenced++;
> -
> -               if (pte_write(pteval))
> -                       writable = true;
>         }
>
>         if (unlikely(cc->is_khugepaged && !referenced)) {
> @@ -681,13 +676,13 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>         } else {
>                 result = SCAN_SUCCEED;
>                 trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
> -                                                   referenced, writable, result);
> +                                                   referenced, result);
>                 return result;
>         }
>  out:
>         release_pte_pages(pte, _pte, compound_pagelist);
>         trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
> -                                           referenced, writable, result);
> +                                           referenced, result);
>         return result;
>  }
>
> @@ -1280,7 +1275,6 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>         unsigned long _address;
>         spinlock_t *ptl;
>         int node = NUMA_NO_NODE, unmapped = 0;
> -       bool writable = false;
>
>         VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> @@ -1344,8 +1338,6 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>                         result = SCAN_PTE_UFFD_WP;
>                         goto out_unmap;
>                 }
> -               if (pte_write(pteval))
> -                       writable = true;
>
>                 page = vm_normal_page(vma, _address, pteval);
>                 if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
> @@ -1435,7 +1427,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>                 *mmap_locked = false;
>         }
>  out:
> -       trace_mm_khugepaged_scan_pmd(mm, folio, writable, referenced,
> +       trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
>                                      none_or_zero, result, unmapped);
>         return result;
>  }
> --
> 2.30.2
>
>



* Re: [PATCH v2 1/2] mm: Enable khugepaged anonymous collapse on non-writable regions
  2025-09-08  7:50 ` [PATCH v2 1/2] mm: Enable khugepaged anonymous collapse on non-writable regions Dev Jain
  2025-09-09 18:49   ` Zach O'Keefe
@ 2025-09-10  4:03   ` Anshuman Khandual
  1 sibling, 0 replies; 7+ messages in thread
From: Anshuman Khandual @ 2025-09-10  4:03 UTC (permalink / raw)
  To: Dev Jain, akpm, david, kas, willy, hughd
  Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, baohua, richard.weiyang, linux-mm, linux-kernel

On 08/09/25 1:20 PM, Dev Jain wrote:
> Currently khugepaged does not collapse an anonymous region which does not
> have a single writable pte. This is wasteful since a region mapped with
> non-writable ptes, for example, non-writable VMAs mapped by the
> application, won't benefit from THP collapse.
> 
> An additional consequence of this constraint is that MADV_COLLAPSE does not
> perform a collapse on a non-writable VMA, and this restriction is nowhere
> to be found on the manpage - the restriction itself sounds wrong to me
> since the user knows the protection of the memory it has mapped, so
> collapsing read-only memory via madvise() should be a choice of the
> user which shouldn't be overridden by the kernel.

Agreed. Dropping this constraint makes sense for both MADV_COLLAPSE
(via madvise()) and khugepaged-based collapse.
> 
> Therefore, remove this restriction by not honouring SCAN_PAGE_RO.
> 
> Acked-by: David Hildenbrand <david@redhat.com>
> Acked-by: Zi Yan <ziy@nvidia.com> 
> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
> Reviewed-by: Kiryl Shutsemau <kas@kernel.org>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---

Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>

>  mm/khugepaged.c | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 4ec324a4c1fe..a0f1df2a7ae6 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -676,9 +676,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  			writable = true;
>  	}
>  
> -	if (unlikely(!writable)) {
> -		result = SCAN_PAGE_RO;
> -	} else if (unlikely(cc->is_khugepaged && !referenced)) {
> +	if (unlikely(cc->is_khugepaged && !referenced)) {
>  		result = SCAN_LACK_REFERENCED_PAGE;
>  	} else {
>  		result = SCAN_SUCCEED;
> @@ -1421,9 +1419,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>  		     mmu_notifier_test_young(vma->vm_mm, _address)))
>  			referenced++;
>  	}
> -	if (!writable) {
> -		result = SCAN_PAGE_RO;
> -	} else if (cc->is_khugepaged &&
> +	if (cc->is_khugepaged &&
>  		   (!referenced ||
>  		    (unmapped && referenced < HPAGE_PMD_NR / 2))) {
>  		result = SCAN_LACK_REFERENCED_PAGE;
> @@ -2830,7 +2826,6 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  		case SCAN_PMD_NULL:
>  		case SCAN_PTE_NON_PRESENT:
>  		case SCAN_PTE_UFFD_WP:
> -		case SCAN_PAGE_RO:
>  		case SCAN_LACK_REFERENCED_PAGE:
>  		case SCAN_PAGE_NULL:
>  		case SCAN_PAGE_COUNT:




* Re: [PATCH v2 2/2] mm: Drop all references of writable and SCAN_PAGE_RO
  2025-09-08  7:50 ` [PATCH v2 2/2] mm: Drop all references of writable and SCAN_PAGE_RO Dev Jain
  2025-09-09 18:51   ` Zach O'Keefe
@ 2025-09-10  4:06   ` Anshuman Khandual
  1 sibling, 0 replies; 7+ messages in thread
From: Anshuman Khandual @ 2025-09-10  4:06 UTC (permalink / raw)
  To: Dev Jain, akpm, david, kas, willy, hughd
  Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, baohua, richard.weiyang, linux-mm, linux-kernel



On 08/09/25 1:20 PM, Dev Jain wrote:
> Now that all actionable outcomes from checking pte_write() are gone,
> drop the related references.
> 
> Acked-by: David Hildenbrand <david@redhat.com>
> Acked-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Kiryl Shutsemau <kas@kernel.org>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>

Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>

> ---
>  include/trace/events/huge_memory.h | 19 ++++++-------------
>  mm/khugepaged.c                    | 14 +++-----------
>  2 files changed, 9 insertions(+), 24 deletions(-)
> 
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 2305df6cb485..dd94d14a2427 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -19,7 +19,6 @@
>  	EM( SCAN_PTE_NON_PRESENT,	"pte_non_present")		\
>  	EM( SCAN_PTE_UFFD_WP,		"pte_uffd_wp")			\
>  	EM( SCAN_PTE_MAPPED_HUGEPAGE,	"pte_mapped_hugepage")		\
> -	EM( SCAN_PAGE_RO,		"no_writable_page")		\
>  	EM( SCAN_LACK_REFERENCED_PAGE,	"lack_referenced_page")		\
>  	EM( SCAN_PAGE_NULL,		"page_null")			\
>  	EM( SCAN_SCAN_ABORT,		"scan_aborted")			\
> @@ -55,15 +54,14 @@ SCAN_STATUS
>  
>  TRACE_EVENT(mm_khugepaged_scan_pmd,
>  
> -	TP_PROTO(struct mm_struct *mm, struct folio *folio, bool writable,
> +	TP_PROTO(struct mm_struct *mm, struct folio *folio,
>  		 int referenced, int none_or_zero, int status, int unmapped),
>  
> -	TP_ARGS(mm, folio, writable, referenced, none_or_zero, status, unmapped),
> +	TP_ARGS(mm, folio, referenced, none_or_zero, status, unmapped),
>  
>  	TP_STRUCT__entry(
>  		__field(struct mm_struct *, mm)
>  		__field(unsigned long, pfn)
> -		__field(bool, writable)
>  		__field(int, referenced)
>  		__field(int, none_or_zero)
>  		__field(int, status)
> @@ -73,17 +71,15 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
>  	TP_fast_assign(
>  		__entry->mm = mm;
>  		__entry->pfn = folio ? folio_pfn(folio) : -1;
> -		__entry->writable = writable;
>  		__entry->referenced = referenced;
>  		__entry->none_or_zero = none_or_zero;
>  		__entry->status = status;
>  		__entry->unmapped = unmapped;
>  	),
>  
> -	TP_printk("mm=%p, scan_pfn=0x%lx, writable=%d, referenced=%d, none_or_zero=%d, status=%s, unmapped=%d",
> +	TP_printk("mm=%p, scan_pfn=0x%lx, referenced=%d, none_or_zero=%d, status=%s, unmapped=%d",
>  		__entry->mm,
>  		__entry->pfn,
> -		__entry->writable,
>  		__entry->referenced,
>  		__entry->none_or_zero,
>  		__print_symbolic(__entry->status, SCAN_STATUS),
> @@ -117,15 +113,14 @@ TRACE_EVENT(mm_collapse_huge_page,
>  TRACE_EVENT(mm_collapse_huge_page_isolate,
>  
>  	TP_PROTO(struct folio *folio, int none_or_zero,
> -		 int referenced, bool  writable, int status),
> +		 int referenced, int status),
>  
> -	TP_ARGS(folio, none_or_zero, referenced, writable, status),
> +	TP_ARGS(folio, none_or_zero, referenced, status),
>  
>  	TP_STRUCT__entry(
>  		__field(unsigned long, pfn)
>  		__field(int, none_or_zero)
>  		__field(int, referenced)
> -		__field(bool, writable)
>  		__field(int, status)
>  	),
>  
> @@ -133,15 +128,13 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
>  		__entry->pfn = folio ? folio_pfn(folio) : -1;
>  		__entry->none_or_zero = none_or_zero;
>  		__entry->referenced = referenced;
> -		__entry->writable = writable;
>  		__entry->status = status;
>  	),
>  
> -	TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, writable=%d, status=%s",
> +	TP_printk("scan_pfn=0x%lx, none_or_zero=%d, referenced=%d, status=%s",
>  		__entry->pfn,
>  		__entry->none_or_zero,
>  		__entry->referenced,
> -		__entry->writable,
>  		__print_symbolic(__entry->status, SCAN_STATUS))
>  );
>  
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index a0f1df2a7ae6..af5f5c80fe4e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -39,7 +39,6 @@ enum scan_result {
>  	SCAN_PTE_NON_PRESENT,
>  	SCAN_PTE_UFFD_WP,
>  	SCAN_PTE_MAPPED_HUGEPAGE,
> -	SCAN_PAGE_RO,
>  	SCAN_LACK_REFERENCED_PAGE,
>  	SCAN_PAGE_NULL,
>  	SCAN_SCAN_ABORT,
> @@ -557,7 +556,6 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  	struct folio *folio = NULL;
>  	pte_t *_pte;
>  	int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
> -	bool writable = false;
>  
>  	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>  	     _pte++, address += PAGE_SIZE) {
> @@ -671,9 +669,6 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		     folio_test_referenced(folio) || mmu_notifier_test_young(vma->vm_mm,
>  								     address)))
>  			referenced++;
> -
> -		if (pte_write(pteval))
> -			writable = true;
>  	}
>  
>  	if (unlikely(cc->is_khugepaged && !referenced)) {
> @@ -681,13 +676,13 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  	} else {
>  		result = SCAN_SUCCEED;
>  		trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
> -						    referenced, writable, result);
> +						    referenced, result);
>  		return result;
>  	}
>  out:
>  	release_pte_pages(pte, _pte, compound_pagelist);
>  	trace_mm_collapse_huge_page_isolate(folio, none_or_zero,
> -					    referenced, writable, result);
> +					    referenced, result);
>  	return result;
>  }
>  
> @@ -1280,7 +1275,6 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>  	unsigned long _address;
>  	spinlock_t *ptl;
>  	int node = NUMA_NO_NODE, unmapped = 0;
> -	bool writable = false;
>  
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>  
> @@ -1344,8 +1338,6 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>  			result = SCAN_PTE_UFFD_WP;
>  			goto out_unmap;
>  		}
> -		if (pte_write(pteval))
> -			writable = true;
>  
>  		page = vm_normal_page(vma, _address, pteval);
>  		if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
> @@ -1435,7 +1427,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>  		*mmap_locked = false;
>  	}
>  out:
> -	trace_mm_khugepaged_scan_pmd(mm, folio, writable, referenced,
> +	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
>  				     none_or_zero, result, unmapped);
>  	return result;
>  }
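
Since the TP_printk() format of mm_khugepaged_scan_pmd loses its writable=
field here, tooling that parses the text trace output may need to accept both
layouts. A minimal sketch of such a parser follows. It is not part of the
patch; the struct and function names are hypothetical, and it assumes the
line contains the event payload starting at "mm=" exactly as in the format
strings quoted above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one mm_khugepaged_scan_pmd event. */
struct scan_pmd_event {
	unsigned long pfn;
	int writable;		/* -1 when the field is absent (new format) */
	int referenced;
	int none_or_zero;
	int unmapped;
	char status[32];
};

/* Returns 0 on success, -1 if the line matches neither format. */
int parse_scan_pmd(const char *line, struct scan_pmd_event *ev)
{
	/* Skip any task/cpu/timestamp prefix the tracer puts in front. */
	const char *s = strstr(line, "mm=");

	if (!s)
		return -1;

	/* Format after this patch: no writable= field. */
	if (sscanf(s, "mm=%*[^,], scan_pfn=0x%lx, referenced=%d, "
		      "none_or_zero=%d, status=%31[^,], unmapped=%d",
		   &ev->pfn, &ev->referenced, &ev->none_or_zero,
		   ev->status, &ev->unmapped) == 5) {
		ev->writable = -1;
		return 0;
	}

	/* Format before this patch: writable= follows scan_pfn=. */
	if (sscanf(s, "mm=%*[^,], scan_pfn=0x%lx, writable=%d, referenced=%d, "
		      "none_or_zero=%d, status=%31[^,], unmapped=%d",
		   &ev->pfn, &ev->writable, &ev->referenced,
		   &ev->none_or_zero, ev->status, &ev->unmapped) == 6)
		return 0;

	return -1;
}
```

Trying the new layout first keeps the common case cheap on patched kernels.
The old layout is only attempted when the first sscanf() stops at the
writable= token.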




end of thread, other threads:[~2025-09-10  4:06 UTC | newest]

Thread overview: 7+ messages
2025-09-08  7:50 [PATCH v2 0/2] Expand scope of khugepaged anonymous collapse Dev Jain
2025-09-08  7:50 ` [PATCH v2 1/2] mm: Enable khugepaged anonymous collapse on non-writable regions Dev Jain
2025-09-09 18:49   ` Zach O'Keefe
2025-09-10  4:03   ` Anshuman Khandual
2025-09-08  7:50 ` [PATCH v2 2/2] mm: Drop all references of writable and SCAN_PAGE_RO Dev Jain
2025-09-09 18:51   ` Zach O'Keefe
2025-09-10  4:06   ` Anshuman Khandual
