* [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse
@ 2025-02-11 11:13 Dev Jain
2025-02-11 11:13 ` [PATCH v2 01/17] khugepaged: Generalize alloc_charge_folio() Dev Jain
` (18 more replies)
0 siblings, 19 replies; 22+ messages in thread
From: Dev Jain @ 2025-02-11 11:13 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: npache, ryan.roberts, anshuman.khandual, catalin.marinas, cl,
vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel,
Dev Jain
This patchset extends khugepaged from collapsing only PMD-sized THPs to
collapsing anonymous mTHPs.
mTHPs were introduced in the kernel to improve memory management by allocating
larger chunks of memory, thereby reducing the number of page faults and TLB misses
(thanks to TLB coalescing), shortening LRU lists, and so on. However, the mTHP
property is often lost due to CoW, swap-in/out, or simply because the kernel
cannot find enough physically contiguous memory at fault time. Hence, there is a
need to regain mTHPs in the system asynchronously. This work is an attempt in
that direction, starting with anonymous folios.
In the fault handler, we select the THP order in a greedy manner; the same
approach is used here, along with the same sysfs interface to control the order of
collapse. In contrast to PMD collapse, we (hopefully) get rid of the mmap_write_lock().
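To illustrate the greedy selection: it amounts to walking a bitmask of enabled
orders from the highest set bit downwards, falling back to the next lower enabled
order when a collapse attempt fails. A minimal userspace sketch of that walk
(highest_order()/next_order() below are stand-ins mirroring the kernel helpers of
the same names; the mask value is hypothetical):

#include <stdio.h>

/* Highest set bit = largest enabled order. */
static int highest_order(unsigned long orders)
{
        return 8 * sizeof(orders) - 1 - __builtin_clzl(orders);
}

/* Clear the current order and return the next lower enabled one, or -1. */
static int next_order(unsigned long *orders, int prev)
{
        *orders &= ~(1UL << prev);
        return *orders ? highest_order(*orders) : -1;
}

int main(void)
{
        /* Hypothetical mask: orders 4, 6 and 9 (PMD) enabled via sysfs. */
        unsigned long orders = (1UL << 9) | (1UL << 6) | (1UL << 4);

        for (int order = highest_order(orders); order >= 0;
             order = next_order(&orders, order))
                printf("try collapse at order %d\n", order);
        return 0;
}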
---------------------------------------------------------
Testing
---------------------------------------------------------
The series has been build-tested on x86_64.
For AArch64:
1. mm selftests: no regressions.
2. Using tools/mm/thpmaps to analyze various userspace programs that map large,
aligned VMAs, fault them in with basepages/mTHPs (according to the sysfs settings),
and then madvise() the VMAs, khugepaged is able to collapse 100% of the VMAs.
This patchset is rebased on mm-unstable (4637fa5d47a49c977116321cc575ea22215df22d).
v1->v2:
- Handle VMAs less than PMD size (patches 12-15)
- Do not add mTHP into deferred split queue
- Drop lock optimization and collapse mTHP under mmap_write_lock()
- Define policy on what to do when we encounter a folio order larger than
the order we are scanning for
- Prevent the creep problem by enforcing tunable simplification
- Update Documentation
- Drop patch 12 from v1 updating selftest w.r.t the creep problem
- Drop patch 1 from v1
v1:
https://lore.kernel.org/all/20241216165105.56185-1-dev.jain@arm.com/
Dev Jain (17):
khugepaged: Generalize alloc_charge_folio()
khugepaged: Generalize hugepage_vma_revalidate()
khugepaged: Generalize __collapse_huge_page_swapin()
khugepaged: Generalize __collapse_huge_page_isolate()
khugepaged: Generalize __collapse_huge_page_copy()
khugepaged: Abstract PMD-THP collapse
khugepaged: Scan PTEs order-wise
khugepaged: Introduce vma_collapse_anon_folio()
khugepaged: Define collapse policy if a larger folio is already mapped
khugepaged: Exit early on fully-mapped aligned mTHP
khugepaged: Enable sysfs to control order of collapse
khugepaged: Enable variable-sized VMA collapse
khugepaged: Lock all VMAs mapping the PTE table
khugepaged: Reset scan address to correct alignment
khugepaged: Delay cond_resched()
khugepaged: Implement strict policy for mTHP collapse
Documentation: transhuge: Define khugepaged mTHP collapse policy
Documentation/admin-guide/mm/transhuge.rst | 49 +-
include/linux/huge_mm.h | 2 +
mm/huge_memory.c | 4 +
mm/khugepaged.c | 603 ++++++++++++++++-----
4 files changed, 511 insertions(+), 147 deletions(-)
--
2.30.2
* [PATCH v2 01/17] khugepaged: Generalize alloc_charge_folio()
2025-02-11 11:13 [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse Dev Jain
@ 2025-02-11 11:13 ` Dev Jain
2025-02-11 11:13 ` [PATCH v2 02/17] khugepaged: Generalize hugepage_vma_revalidate() Dev Jain
` (17 subsequent siblings)
18 siblings, 0 replies; 22+ messages in thread
From: Dev Jain @ 2025-02-11 11:13 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: npache, ryan.roberts, anshuman.khandual, catalin.marinas, cl,
vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel,
Dev Jain
Pass order to alloc_charge_folio() and update mTHP statistics.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
include/linux/huge_mm.h | 2 ++
mm/huge_memory.c | 4 ++++
mm/khugepaged.c | 17 +++++++++++------
3 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 93e509b6c00e..ffe47785854a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -119,6 +119,8 @@ enum mthp_stat_item {
MTHP_STAT_ANON_FAULT_ALLOC,
MTHP_STAT_ANON_FAULT_FALLBACK,
MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
+ MTHP_STAT_COLLAPSE_ALLOC,
+ MTHP_STAT_COLLAPSE_ALLOC_FAILED,
MTHP_STAT_ZSWPOUT,
MTHP_STAT_SWPIN,
MTHP_STAT_SWPIN_FALLBACK,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3d3ebdc002d5..996e802543f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -615,6 +615,8 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+DEFINE_MTHP_STAT_ATTR(collapse_alloc, MTHP_STAT_COLLAPSE_ALLOC);
+DEFINE_MTHP_STAT_ATTR(collapse_alloc_failed, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
DEFINE_MTHP_STAT_ATTR(swpin, MTHP_STAT_SWPIN);
DEFINE_MTHP_STAT_ATTR(swpin_fallback, MTHP_STAT_SWPIN_FALLBACK);
@@ -680,6 +682,8 @@ static struct attribute *any_stats_attrs[] = {
#endif
&split_attr.attr,
&split_failed_attr.attr,
+ &collapse_alloc_attr.attr,
+ &collapse_alloc_failed_attr.attr,
NULL,
};
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5f0be134141e..4342003b1c33 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1074,21 +1074,26 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
}
static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
- struct collapse_control *cc)
+ int order, struct collapse_control *cc)
{
gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
GFP_TRANSHUGE);
int node = hpage_collapse_find_target_node(cc);
struct folio *folio;
- folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
+ folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
if (!folio) {
*foliop = NULL;
- count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+ if (order == HPAGE_PMD_ORDER)
+ count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+ count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC_FAILED);
return SCAN_ALLOC_HUGE_PAGE_FAIL;
}
- count_vm_event(THP_COLLAPSE_ALLOC);
+ if (order == HPAGE_PMD_ORDER)
+ count_vm_event(THP_COLLAPSE_ALLOC);
+ count_mthp_stat(order, MTHP_STAT_COLLAPSE_ALLOC);
+
if (unlikely(mem_cgroup_charge(folio, mm, gfp))) {
folio_put(folio);
*foliop = NULL;
@@ -1125,7 +1130,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
*/
mmap_read_unlock(mm);
- result = alloc_charge_folio(&folio, mm, cc);
+ result = alloc_charge_folio(&folio, mm, HPAGE_PMD_ORDER, cc);
if (result != SCAN_SUCCEED)
goto out_nolock;
@@ -1851,7 +1856,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
- result = alloc_charge_folio(&new_folio, mm, cc);
+ result = alloc_charge_folio(&new_folio, mm, HPAGE_PMD_ORDER, cc);
if (result != SCAN_SUCCEED)
goto out;
--
2.30.2
* [PATCH v2 02/17] khugepaged: Generalize hugepage_vma_revalidate()
2025-02-11 11:13 [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse Dev Jain
2025-02-11 11:13 ` [PATCH v2 01/17] khugepaged: Generalize alloc_charge_folio() Dev Jain
@ 2025-02-11 11:13 ` Dev Jain
2025-02-11 11:13 ` [PATCH v2 03/17] khugepaged: Generalize __collapse_huge_page_swapin() Dev Jain
` (16 subsequent siblings)
18 siblings, 0 replies; 22+ messages in thread
From: Dev Jain @ 2025-02-11 11:13 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: npache, ryan.roberts, anshuman.khandual, catalin.marinas, cl,
vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel,
Dev Jain
After retaking the lock, we must check that the VMA is still suitable for our
scan order. Hence, generalize hugepage_vma_revalidate() to take the order as a parameter.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/khugepaged.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4342003b1c33..3d105cacf855 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -919,7 +919,7 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
bool expect_anon,
- struct vm_area_struct **vmap,
+ struct vm_area_struct **vmap, int order,
struct collapse_control *cc)
{
struct vm_area_struct *vma;
@@ -932,9 +932,9 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
if (!vma)
return SCAN_VMA_NULL;
- if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
+ if (!thp_vma_suitable_order(vma, address, order))
return SCAN_ADDRESS_RANGE;
- if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER))
+ if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, order))
return SCAN_VMA_CHECK;
/*
* Anon VMA expected, the address may be unmapped then
@@ -1135,7 +1135,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
goto out_nolock;
mmap_read_lock(mm);
- result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+ result = hugepage_vma_revalidate(mm, address, true, &vma, HPAGE_PMD_ORDER, cc);
if (result != SCAN_SUCCEED) {
mmap_read_unlock(mm);
goto out_nolock;
@@ -1169,7 +1169,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
* mmap_lock.
*/
mmap_write_lock(mm);
- result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+ result = hugepage_vma_revalidate(mm, address, true, &vma, HPAGE_PMD_ORDER, cc);
if (result != SCAN_SUCCEED)
goto out_up_write;
/* check if the pmd is still valid */
@@ -2779,7 +2779,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
mmap_read_lock(mm);
mmap_locked = true;
result = hugepage_vma_revalidate(mm, addr, false, &vma,
- cc);
+ HPAGE_PMD_ORDER, cc);
if (result != SCAN_SUCCEED) {
last_fail = result;
goto out_nolock;
--
2.30.2
* [PATCH v2 03/17] khugepaged: Generalize __collapse_huge_page_swapin()
2025-02-11 11:13 [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse Dev Jain
2025-02-11 11:13 ` [PATCH v2 01/17] khugepaged: Generalize alloc_charge_folio() Dev Jain
2025-02-11 11:13 ` [PATCH v2 02/17] khugepaged: Generalize hugepage_vma_revalidate() Dev Jain
@ 2025-02-11 11:13 ` Dev Jain
2025-02-11 11:13 ` [PATCH v2 04/17] khugepaged: Generalize __collapse_huge_page_isolate() Dev Jain
` (15 subsequent siblings)
18 siblings, 0 replies; 22+ messages in thread
From: Dev Jain @ 2025-02-11 11:13 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: npache, ryan.roberts, anshuman.khandual, catalin.marinas, cl,
vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel,
Dev Jain
If any PTE in our scan range is a swap entry, use do_swap_page() to swap in
the corresponding folio.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/khugepaged.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3d105cacf855..221823c0d95f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -999,17 +999,17 @@ static int check_pmd_still_valid(struct mm_struct *mm,
*/
static int __collapse_huge_page_swapin(struct mm_struct *mm,
struct vm_area_struct *vma,
- unsigned long haddr, pmd_t *pmd,
- int referenced)
+ unsigned long addr, pmd_t *pmd,
+ int referenced, int order)
{
int swapped_in = 0;
vm_fault_t ret = 0;
- unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
+ unsigned long address, end = addr + (PAGE_SIZE << order);
int result;
pte_t *pte = NULL;
spinlock_t *ptl;
- for (address = haddr; address < end; address += PAGE_SIZE) {
+ for (address = addr; address < end; address += PAGE_SIZE) {
struct vm_fault vmf = {
.vma = vma,
.address = address,
@@ -1154,7 +1154,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
* that case. Continuing to collapse causes inconsistency.
*/
result = __collapse_huge_page_swapin(mm, vma, address, pmd,
- referenced);
+ referenced, HPAGE_PMD_ORDER);
if (result != SCAN_SUCCEED)
goto out_nolock;
}
--
2.30.2
* [PATCH v2 04/17] khugepaged: Generalize __collapse_huge_page_isolate()
2025-02-11 11:13 [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse Dev Jain
` (2 preceding siblings ...)
2025-02-11 11:13 ` [PATCH v2 03/17] khugepaged: Generalize __collapse_huge_page_swapin() Dev Jain
@ 2025-02-11 11:13 ` Dev Jain
2025-02-11 11:13 ` [PATCH v2 05/17] khugepaged: Generalize __collapse_huge_page_copy() Dev Jain
` (14 subsequent siblings)
18 siblings, 0 replies; 22+ messages in thread
From: Dev Jain @ 2025-02-11 11:13 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: npache, ryan.roberts, anshuman.khandual, catalin.marinas, cl,
vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel,
Dev Jain
Scale down the scan range and the sysfs tunables according to the scan order
(the tunable handling will be changed further by subsequent patches), and isolate the folios.
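The scaling is a plain right shift by the difference between the PMD order and
the scan order. A standalone sketch of the arithmetic, assuming the default
tunable values of a 4K-page/2M-PMD configuration (max_ptes_none = 511,
max_ptes_shared = 256):

#include <stdio.h>

#define HPAGE_PMD_ORDER 9       /* 4K base pages, 2M PMD */

int main(void)
{
        /* Default tunable values on a 4K/2M configuration. */
        unsigned int max_ptes_none = 511, max_ptes_shared = 256;
        int orders[] = { 9, 6, 4, 2 };

        for (unsigned int i = 0; i < sizeof(orders) / sizeof(orders[0]); i++) {
                int shift = HPAGE_PMD_ORDER - orders[i];

                printf("order %d: none_or_zero <= %u, shared <= %u\n",
                       orders[i], max_ptes_none >> shift, max_ptes_shared >> shift);
        }
        return 0;
}

For an order-4 (64K) scan this yields 511 >> 5 = 15 and 256 >> 5 = 8, i.e. the
same proportions as the PMD case.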
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/khugepaged.c | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 221823c0d95f..0ea99df115cb 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -565,15 +565,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
unsigned long address,
pte_t *pte,
struct collapse_control *cc,
- struct list_head *compound_pagelist)
+ struct list_head *compound_pagelist,
+ int order)
{
- struct page *page = NULL;
struct folio *folio = NULL;
pte_t *_pte;
int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
bool writable = false;
+ unsigned int max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
+ unsigned int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
- for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+ for (_pte = pte; _pte < pte + (1UL << order);
_pte++, address += PAGE_SIZE) {
pte_t pteval = ptep_get(_pte);
if (pte_none(pteval) || (pte_present(pteval) &&
@@ -581,7 +583,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
++none_or_zero;
if (!userfaultfd_armed(vma) &&
(!cc->is_khugepaged ||
- none_or_zero <= khugepaged_max_ptes_none)) {
+ none_or_zero <= max_ptes_none)) {
continue;
} else {
result = SCAN_EXCEED_NONE_PTE;
@@ -597,20 +599,19 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
result = SCAN_PTE_UFFD_WP;
goto out;
}
- page = vm_normal_page(vma, address, pteval);
- if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
+ folio = vm_normal_folio(vma, address, pteval);
+ if (unlikely(!folio) || unlikely(folio_is_zone_device(folio))) {
result = SCAN_PAGE_NULL;
goto out;
}
- folio = page_folio(page);
VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
/* See hpage_collapse_scan_pmd(). */
if (folio_likely_mapped_shared(folio)) {
++shared;
if (cc->is_khugepaged &&
- shared > khugepaged_max_ptes_shared) {
+ shared > max_ptes_shared) {
result = SCAN_EXCEED_SHARED_PTE;
count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
goto out;
@@ -1201,7 +1202,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
if (pte) {
result = __collapse_huge_page_isolate(vma, address, pte, cc,
- &compound_pagelist);
+ &compound_pagelist, HPAGE_PMD_ORDER);
spin_unlock(pte_ptl);
} else {
result = SCAN_PMD_NULL;
--
2.30.2
* [PATCH v2 05/17] khugepaged: Generalize __collapse_huge_page_copy()
2025-02-11 11:13 [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse Dev Jain
` (3 preceding siblings ...)
2025-02-11 11:13 ` [PATCH v2 04/17] khugepaged: Generalize __collapse_huge_page_isolate() Dev Jain
@ 2025-02-11 11:13 ` Dev Jain
2025-02-11 11:13 ` [PATCH v2 06/17] khugepaged: Abstract PMD-THP collapse Dev Jain
` (13 subsequent siblings)
18 siblings, 0 replies; 22+ messages in thread
From: Dev Jain @ 2025-02-11 11:13 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: npache, ryan.roberts, anshuman.khandual, catalin.marinas, cl,
vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel,
Dev Jain
Generalize folio copying, PTE clearing and the failure path.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/khugepaged.c | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 0ea99df115cb..99eb1f72a508 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -712,13 +712,14 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
struct vm_area_struct *vma,
unsigned long address,
spinlock_t *ptl,
- struct list_head *compound_pagelist)
+ struct list_head *compound_pagelist,
+ int order)
{
struct folio *src, *tmp;
pte_t *_pte;
pte_t pteval;
- for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+ for (_pte = pte; _pte < pte + (1UL << order);
_pte++, address += PAGE_SIZE) {
pteval = ptep_get(_pte);
if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
@@ -765,7 +766,8 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
pmd_t *pmd,
pmd_t orig_pmd,
struct vm_area_struct *vma,
- struct list_head *compound_pagelist)
+ struct list_head *compound_pagelist,
+ int order)
{
spinlock_t *pmd_ptl;
@@ -782,7 +784,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
* Release both raw and compound pages isolated
* in __collapse_huge_page_isolate.
*/
- release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
+ release_pte_pages(pte, pte + (1UL << order), compound_pagelist);
}
/*
@@ -803,7 +805,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
unsigned long address, spinlock_t *ptl,
- struct list_head *compound_pagelist)
+ struct list_head *compound_pagelist, int order)
{
unsigned int i;
int result = SCAN_SUCCEED;
@@ -811,7 +813,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
/*
* Copying pages' contents is subject to memory poison at any iteration.
*/
- for (i = 0; i < HPAGE_PMD_NR; i++) {
+ for (i = 0; i < (1 << order); i++) {
pte_t pteval = ptep_get(pte + i);
struct page *page = folio_page(folio, i);
unsigned long src_addr = address + i * PAGE_SIZE;
@@ -830,10 +832,10 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
if (likely(result == SCAN_SUCCEED))
__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
- compound_pagelist);
+ compound_pagelist, order);
else
__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
- compound_pagelist);
+ compound_pagelist, order);
return result;
}
@@ -1232,7 +1234,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
vma, address, pte_ptl,
- &compound_pagelist);
+ &compound_pagelist, HPAGE_PMD_ORDER);
pte_unmap(pte);
if (unlikely(result != SCAN_SUCCEED))
goto out_up_write;
--
2.30.2
* [PATCH v2 06/17] khugepaged: Abstract PMD-THP collapse
2025-02-11 11:13 [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse Dev Jain
` (4 preceding siblings ...)
2025-02-11 11:13 ` [PATCH v2 05/17] khugepaged: Generalize __collapse_huge_page_copy() Dev Jain
@ 2025-02-11 11:13 ` Dev Jain
2025-02-11 11:13 ` [PATCH v2 07/17] khugepaged: Scan PTEs order-wise Dev Jain
` (12 subsequent siblings)
18 siblings, 0 replies; 22+ messages in thread
From: Dev Jain @ 2025-02-11 11:13 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: npache, ryan.roberts, anshuman.khandual, catalin.marinas, cl,
vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel,
Dev Jain
Abstract away copying the page contents and setting the PMD into
vma_collapse_anon_folio_pmd().
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/khugepaged.c | 140 +++++++++++++++++++++++++++---------------------
1 file changed, 78 insertions(+), 62 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 99eb1f72a508..498cb5ad9ff1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1109,76 +1109,27 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
return SCAN_SUCCEED;
}
-static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
- int referenced, int unmapped,
- struct collapse_control *cc)
+static int vma_collapse_anon_folio_pmd(struct mm_struct *mm, unsigned long address,
+ struct vm_area_struct *vma, struct collapse_control *cc, pmd_t *pmd,
+ struct folio *folio)
{
LIST_HEAD(compound_pagelist);
- pmd_t *pmd, _pmd;
- pte_t *pte;
pgtable_t pgtable;
- struct folio *folio;
spinlock_t *pmd_ptl, *pte_ptl;
int result = SCAN_FAIL;
- struct vm_area_struct *vma;
struct mmu_notifier_range range;
+ pmd_t _pmd;
+ pte_t *pte;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
- /*
- * Before allocating the hugepage, release the mmap_lock read lock.
- * The allocation can take potentially a long time if it involves
- * sync compaction, and we do not need to hold the mmap_lock during
- * that. We will recheck the vma after taking it again in write mode.
- */
- mmap_read_unlock(mm);
-
- result = alloc_charge_folio(&folio, mm, HPAGE_PMD_ORDER, cc);
- if (result != SCAN_SUCCEED)
- goto out_nolock;
-
- mmap_read_lock(mm);
- result = hugepage_vma_revalidate(mm, address, true, &vma, HPAGE_PMD_ORDER, cc);
- if (result != SCAN_SUCCEED) {
- mmap_read_unlock(mm);
- goto out_nolock;
- }
-
- result = find_pmd_or_thp_or_none(mm, address, &pmd);
- if (result != SCAN_SUCCEED) {
- mmap_read_unlock(mm);
- goto out_nolock;
- }
-
- if (unmapped) {
- /*
- * __collapse_huge_page_swapin will return with mmap_lock
- * released when it fails. So we jump out_nolock directly in
- * that case. Continuing to collapse causes inconsistency.
- */
- result = __collapse_huge_page_swapin(mm, vma, address, pmd,
- referenced, HPAGE_PMD_ORDER);
- if (result != SCAN_SUCCEED)
- goto out_nolock;
- }
-
- mmap_read_unlock(mm);
- /*
- * Prevent all access to pagetables with the exception of
- * gup_fast later handled by the ptep_clear_flush and the VM
- * handled by the anon_vma lock + PG_lock.
- *
- * UFFDIO_MOVE is prevented to race as well thanks to the
- * mmap_lock.
- */
- mmap_write_lock(mm);
result = hugepage_vma_revalidate(mm, address, true, &vma, HPAGE_PMD_ORDER, cc);
if (result != SCAN_SUCCEED)
- goto out_up_write;
+ goto out;
/* check if the pmd is still valid */
result = check_pmd_still_valid(mm, address, pmd);
if (result != SCAN_SUCCEED)
- goto out_up_write;
+ goto out;
vma_start_write(vma);
anon_vma_lock_write(vma->anon_vma);
@@ -1223,7 +1174,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
pmd_populate(mm, pmd, pmd_pgtable(_pmd));
spin_unlock(pmd_ptl);
anon_vma_unlock_write(vma->anon_vma);
- goto out_up_write;
+ goto out;
}
/*
@@ -1237,7 +1188,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
&compound_pagelist, HPAGE_PMD_ORDER);
pte_unmap(pte);
if (unlikely(result != SCAN_SUCCEED))
- goto out_up_write;
+ goto out;
/*
* The smp_wmb() inside __folio_mark_uptodate() ensures the
@@ -1260,11 +1211,76 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
deferred_split_folio(folio, false);
spin_unlock(pmd_ptl);
- folio = NULL;
-
result = SCAN_SUCCEED;
-out_up_write:
+out:
+ return result;
+}
+
+static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
+ int referenced, int unmapped, int order,
+ struct collapse_control *cc)
+{
+ struct vm_area_struct *vma;
+ int result = SCAN_FAIL;
+ struct folio *folio;
+ pmd_t *pmd;
+
+ /*
+ * Before allocating the hugepage, release the mmap_lock read lock.
+ * The allocation can take potentially a long time if it involves
+ * sync compaction, and we do not need to hold the mmap_lock during
+ * that. We will recheck the vma after taking it again in write mode.
+ */
+ mmap_read_unlock(mm);
+
+ result = alloc_charge_folio(&folio, mm, order, cc);
+ if (result != SCAN_SUCCEED)
+ goto out_nolock;
+
+ mmap_read_lock(mm);
+ result = hugepage_vma_revalidate(mm, address, true, &vma, order, cc);
+ if (result != SCAN_SUCCEED) {
+ mmap_read_unlock(mm);
+ goto out_nolock;
+ }
+
+ result = find_pmd_or_thp_or_none(mm, address, &pmd);
+ if (result != SCAN_SUCCEED) {
+ mmap_read_unlock(mm);
+ goto out_nolock;
+ }
+
+ if (unmapped) {
+ /*
+ * __collapse_huge_page_swapin will return with mmap_lock
+ * released when it fails. So we jump out_nolock directly in
+ * that case. Continuing to collapse causes inconsistency.
+ */
+ result = __collapse_huge_page_swapin(mm, vma, address, pmd,
+ referenced, order);
+ if (result != SCAN_SUCCEED)
+ goto out_nolock;
+ }
+
+ mmap_read_unlock(mm);
+ /*
+ * Prevent all access to pagetables with the exception of
+ * gup_fast later handled by the ptep_clear_flush and the VM
+ * handled by the anon_vma lock + PG_lock.
+ *
+ * UFFDIO_MOVE is prevented to race as well thanks to the
+ * mmap_lock.
+ */
+ mmap_write_lock(mm);
+
+ if (order == HPAGE_PMD_ORDER)
+ result = vma_collapse_anon_folio_pmd(mm, address, vma, cc, pmd, folio);
+
mmap_write_unlock(mm);
+
+ if (result == SCAN_SUCCEED)
+ folio = NULL;
+
out_nolock:
if (folio)
folio_put(folio);
@@ -1440,7 +1456,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
pte_unmap_unlock(pte, ptl);
if (result == SCAN_SUCCEED) {
result = collapse_huge_page(mm, address, referenced,
- unmapped, cc);
+ unmapped, HPAGE_PMD_ORDER, cc);
/* collapse_huge_page will return with the mmap_lock released */
*mmap_locked = false;
}
--
2.30.2
* [PATCH v2 07/17] khugepaged: Scan PTEs order-wise
2025-02-11 11:13 [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse Dev Jain
` (5 preceding siblings ...)
2025-02-11 11:13 ` [PATCH v2 06/17] khugepaged: Abstract PMD-THP collapse Dev Jain
@ 2025-02-11 11:13 ` Dev Jain
2025-02-11 11:13 ` [PATCH v2 08/17] khugepaged: Introduce vma_collapse_anon_folio() Dev Jain
` (11 subsequent siblings)
18 siblings, 0 replies; 22+ messages in thread
From: Dev Jain @ 2025-02-11 11:13 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: npache, ryan.roberts, anshuman.khandual, catalin.marinas, cl,
vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel,
Dev Jain
Scan the PTEs order-wise, using the mask of orders suitable for this VMA,
derived in conjunction with the sysfs THP settings. Scale down the tunables
accordingly (to be changed further by subsequent patches); on collapse failure, we drop
down to the next enabled order. Otherwise, we jump to the highest order possible at the
new address and start a fresh scan. Note that madvise(MADV_COLLAPSE) has not been generalized.
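The jump to the highest possible order is an alignment computation: the number of
trailing zero bits of the address in page-frame granules bounds the largest
naturally aligned block that can start there, further clamped by the orders
allowed for the VMA. A small standalone sketch of that computation (the
allowed-orders mask and the addresses are hypothetical; this mirrors the
count_trailing_zeros()/next_order() usage in the patch):

#include <stdio.h>

#define PAGE_SHIFT 12

/* Largest naturally aligned order that can start at addr, clamped to the allowed mask. */
static int pick_order(unsigned long addr, unsigned long allowed_orders)
{
        int order = __builtin_ctzl(addr >> PAGE_SHIFT);
        unsigned long orders = allowed_orders & ((1UL << (order + 1)) - 1);

        return orders ? 8 * sizeof(orders) - 1 - __builtin_clzl(orders) : -1;
}

int main(void)
{
        unsigned long allowed = (1UL << 9) | (1UL << 4);        /* e.g. 2M and 64K enabled */

        printf("0x200000 -> order %d\n", pick_order(0x200000, allowed)); /* PMD-aligned: 9 */
        printf("0x210000 -> order %d\n", pick_order(0x210000, allowed)); /* 64K-aligned: 4 */
        printf("0x201000 -> order %d\n", pick_order(0x201000, allowed)); /* only 4K-aligned: -1 */
        return 0;
}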
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/khugepaged.c | 97 ++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 83 insertions(+), 14 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 498cb5ad9ff1..fbfd8a78ef51 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -21,6 +21,7 @@
#include <linux/shmem_fs.h>
#include <linux/dax.h>
#include <linux/ksm.h>
+#include <linux/count_zeros.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
@@ -1295,36 +1296,57 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
{
pmd_t *pmd;
pte_t *pte, *_pte;
- int result = SCAN_FAIL, referenced = 0;
- int none_or_zero = 0, shared = 0;
- struct page *page = NULL;
struct folio *folio = NULL;
- unsigned long _address;
+ int result = SCAN_FAIL;
spinlock_t *ptl;
- int node = NUMA_NO_NODE, unmapped = 0;
+ unsigned int max_ptes_shared, max_ptes_none, max_ptes_swap;
+ int referenced, shared, none_or_zero, unmapped;
+ unsigned long _address, orig_address = address;
+ int node = NUMA_NO_NODE;
bool writable = false;
+ unsigned long orders, orig_orders;
+ int order, prev_order;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+ orders = thp_vma_allowable_orders(vma, vma->vm_flags,
+ TVA_IN_PF | TVA_ENFORCE_SYSFS, THP_ORDERS_ALL_ANON);
+ orders = thp_vma_suitable_orders(vma, address, orders);
+ orig_orders = orders;
+ order = highest_order(orders);
+
+ /* MADV_COLLAPSE needs to work irrespective of sysfs setting */
+ if (!cc->is_khugepaged)
+ order = HPAGE_PMD_ORDER;
+
+scan_pte_range:
+
+ max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
+ max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
+ max_ptes_swap = khugepaged_max_ptes_swap >> (HPAGE_PMD_ORDER - order);
+ referenced = 0, shared = 0, none_or_zero = 0, unmapped = 0;
+
+ /* Check pmd after taking mmap lock */
result = find_pmd_or_thp_or_none(mm, address, &pmd);
if (result != SCAN_SUCCEED)
goto out;
memset(cc->node_load, 0, sizeof(cc->node_load));
nodes_clear(cc->alloc_nmask);
+
pte = pte_offset_map_lock(mm, pmd, address, &ptl);
if (!pte) {
result = SCAN_PMD_NULL;
goto out;
}
- for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
+ for (_address = address, _pte = pte; _pte < pte + (1UL << order);
_pte++, _address += PAGE_SIZE) {
pte_t pteval = ptep_get(_pte);
if (is_swap_pte(pteval)) {
++unmapped;
if (!cc->is_khugepaged ||
- unmapped <= khugepaged_max_ptes_swap) {
+ unmapped <= max_ptes_swap) {
/*
* Always be strict with uffd-wp
* enabled swap entries. Please see
@@ -1345,7 +1367,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
++none_or_zero;
if (!userfaultfd_armed(vma) &&
(!cc->is_khugepaged ||
- none_or_zero <= khugepaged_max_ptes_none)) {
+ none_or_zero <= max_ptes_none)) {
continue;
} else {
result = SCAN_EXCEED_NONE_PTE;
@@ -1369,12 +1391,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
if (pte_write(pteval))
writable = true;
- page = vm_normal_page(vma, _address, pteval);
- if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
+ folio = vm_normal_folio(vma, _address, pteval);
+ if (unlikely(!folio) || unlikely(folio_is_zone_device(folio))) {
result = SCAN_PAGE_NULL;
goto out_unmap;
}
- folio = page_folio(page);
if (!folio_test_anon(folio)) {
result = SCAN_PAGE_ANON;
@@ -1390,7 +1411,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
if (folio_likely_mapped_shared(folio)) {
++shared;
if (cc->is_khugepaged &&
- shared > khugepaged_max_ptes_shared) {
+ shared > max_ptes_shared) {
result = SCAN_EXCEED_SHARED_PTE;
count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
goto out_unmap;
@@ -1447,7 +1468,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
result = SCAN_PAGE_RO;
} else if (cc->is_khugepaged &&
(!referenced ||
- (unmapped && referenced < HPAGE_PMD_NR / 2))) {
+ (unmapped && referenced < (1UL << order) / 2))) {
result = SCAN_LACK_REFERENCED_PAGE;
} else {
result = SCAN_SUCCEED;
@@ -1456,10 +1477,58 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
pte_unmap_unlock(pte, ptl);
if (result == SCAN_SUCCEED) {
result = collapse_huge_page(mm, address, referenced,
- unmapped, HPAGE_PMD_ORDER, cc);
+ unmapped, order, cc);
/* collapse_huge_page will return with the mmap_lock released */
*mmap_locked = false;
+ /* Skip over this range and decide order */
+ if (result == SCAN_SUCCEED)
+ goto decide_order;
+ }
+ if (result != SCAN_SUCCEED) {
+
+ /* Go to the next order */
+ prev_order = order;
+ order = next_order(&orders, order);
+ if (order < 2) {
+ /* Skip over this range, and decide order */
+ _address = address + (PAGE_SIZE << prev_order);
+ _pte = pte + (1UL << prev_order);
+ goto decide_order;
+ }
+ goto maybe_mmap_lock;
}
+
+decide_order:
+ /* Immediately exit on exhaustion of range */
+ if (_address == orig_address + (PAGE_SIZE << HPAGE_PMD_ORDER))
+ goto out;
+
+ /* Get highest order possible starting from address */
+ order = count_trailing_zeros(_address >> PAGE_SHIFT);
+
+ orders = orig_orders & ((1UL << (order + 1)) - 1);
+ if (!(orders & (1UL << order)))
+ order = next_order(&orders, order);
+
+ /* This should never happen, since we are on an aligned address */
+ BUG_ON(cc->is_khugepaged && order < 2);
+
+ address = _address;
+ pte = _pte;
+
+maybe_mmap_lock:
+ if (!(*mmap_locked)) {
+ mmap_read_lock(mm);
+ *mmap_locked = true;
+ /* Validate VMA after retaking mmap_lock */
+ result = hugepage_vma_revalidate(mm, address, true, &vma,
+ order, cc);
+ if (result != SCAN_SUCCEED) {
+ mmap_read_unlock(mm);
+ goto out;
+ }
+ }
+ goto scan_pte_range;
out:
trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
none_or_zero, result, unmapped);
--
2.30.2
* [PATCH v2 08/17] khugepaged: Introduce vma_collapse_anon_folio()
2025-02-11 11:13 [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse Dev Jain
` (6 preceding siblings ...)
2025-02-11 11:13 ` [PATCH v2 07/17] khugepaged: Scan PTEs order-wise Dev Jain
@ 2025-02-11 11:13 ` Dev Jain
2025-02-11 11:13 ` [PATCH v2 09/17] khugepaged: Define collapse policy if a larger folio is already mapped Dev Jain
` (10 subsequent siblings)
18 siblings, 0 replies; 22+ messages in thread
From: Dev Jain @ 2025-02-11 11:13 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: npache, ryan.roberts, anshuman.khandual, catalin.marinas, cl,
vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel,
Dev Jain
Similar to PMD collapse, take the write locks to stop page table walking.
Copy the page contents, clear the PTEs, remove folio pins, and (try to) unmap the
old folios. Map the new folio by setting the PTEs with the set_ptes() API.
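set_ptes() installs nr consecutive entries derived from a single template PTE,
advancing the PFN for each slot, which is why one mk_pte() of the new folio is
enough for the whole range. A toy userspace model of that batching (a "PTE" here
is just an integer carrying pfn << PAGE_SHIFT plus flag bits; the head PFN is
made up for illustration):

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Toy set_ptes(): write nr entries, bumping the encoded PFN by one per slot. */
static void set_ptes_model(uint64_t *ptep, uint64_t pte, unsigned int nr)
{
        for (; nr; nr--, ptep++, pte += 1ULL << PAGE_SHIFT)
                *ptep = pte;
}

int main(void)
{
        uint64_t ptes[16] = { 0 };                              /* an order-4 (64K) range */
        uint64_t head = (0x80000ULL << PAGE_SHIFT) | 0x3;       /* hypothetical head PFN | prot bits */

        set_ptes_model(ptes, head, 16);
        printf("pte[0]=%#llx pte[15]=%#llx\n",
               (unsigned long long)ptes[0], (unsigned long long)ptes[15]);
        return 0;
}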
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/khugepaged.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 92 insertions(+)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index fbfd8a78ef51..a674014b6563 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1217,6 +1217,96 @@ static int vma_collapse_anon_folio_pmd(struct mm_struct *mm, unsigned long addre
return result;
}
+/* Similar to the PMD case except we have to batch set the PTEs */
+static int vma_collapse_anon_folio(struct mm_struct *mm, unsigned long address,
+ struct vm_area_struct *vma, struct collapse_control *cc, pmd_t *pmd,
+ struct folio *folio, int order)
+{
+ LIST_HEAD(compound_pagelist);
+ spinlock_t *pmd_ptl, *pte_ptl;
+ int result = SCAN_FAIL;
+ struct mmu_notifier_range range;
+ pmd_t _pmd;
+ pte_t *pte;
+ pte_t entry;
+ int nr_pages = folio_nr_pages(folio);
+ unsigned long haddress = address & HPAGE_PMD_MASK;
+
+ VM_BUG_ON(address & ((PAGE_SIZE << order) - 1));
+
+ result = hugepage_vma_revalidate(mm, address, true, &vma, order, cc);
+ if (result != SCAN_SUCCEED)
+ goto out;
+ result = check_pmd_still_valid(mm, address, pmd);
+ if (result != SCAN_SUCCEED)
+ goto out;
+
+ vma_start_write(vma);
+ anon_vma_lock_write(vma->anon_vma);
+
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, haddress,
+ haddress + HPAGE_PMD_SIZE);
+ mmu_notifier_invalidate_range_start(&range);
+
+ pmd_ptl = pmd_lock(mm, pmd);
+ _pmd = pmdp_collapse_flush(vma, haddress, pmd);
+ spin_unlock(pmd_ptl);
+ mmu_notifier_invalidate_range_end(&range);
+ tlb_remove_table_sync_one();
+
+ pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
+ if (pte) {
+ result = __collapse_huge_page_isolate(vma, address, pte, cc,
+ &compound_pagelist, order);
+ spin_unlock(pte_ptl);
+ } else {
+ result = SCAN_PMD_NULL;
+ }
+
+ if (unlikely(result != SCAN_SUCCEED)) {
+ if (pte)
+ pte_unmap(pte);
+ spin_lock(pmd_ptl);
+ BUG_ON(!pmd_none(*pmd));
+ pmd_populate(mm, pmd, pmd_pgtable(_pmd));
+ spin_unlock(pmd_ptl);
+ anon_vma_unlock_write(vma->anon_vma);
+ goto out;
+ }
+
+ anon_vma_unlock_write(vma->anon_vma);
+
+ result = __collapse_huge_page_copy(pte, folio, pmd, *pmd,
+ vma, address, pte_ptl,
+ &compound_pagelist, order);
+ pte_unmap(pte);
+ if (unlikely(result != SCAN_SUCCEED))
+ goto out;
+
+ __folio_mark_uptodate(folio);
+ entry = mk_pte(&folio->page, vma->vm_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+
+ spin_lock(pte_ptl);
+ folio_ref_add(folio, nr_pages - 1);
+ folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
+ folio_add_lru_vma(folio, vma);
+ set_ptes(mm, address, pte, entry, nr_pages);
+ spin_unlock(pte_ptl);
+ spin_lock(pmd_ptl);
+
+ /* See pmd_install() */
+ smp_wmb();
+ BUG_ON(!pmd_none(*pmd));
+ pmd_populate(mm, pmd, pmd_pgtable(_pmd));
+ update_mmu_cache_pmd(vma, haddress, pmd);
+ spin_unlock(pmd_ptl);
+
+ result = SCAN_SUCCEED;
+out:
+ return result;
+}
+
static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
int referenced, int unmapped, int order,
struct collapse_control *cc)
@@ -1276,6 +1366,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
if (order == HPAGE_PMD_ORDER)
result = vma_collapse_anon_folio_pmd(mm, address, vma, cc, pmd, folio);
+ else
+ result = vma_collapse_anon_folio(mm, address, vma, cc, pmd, folio, order);
mmap_write_unlock(mm);
--
2.30.2
* [PATCH v2 09/17] khugepaged: Define collapse policy if a larger folio is already mapped
2025-02-11 11:13 [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse Dev Jain
` (7 preceding siblings ...)
2025-02-11 11:13 ` [PATCH v2 08/17] khugepaged: Introduce vma_collapse_anon_folio() Dev Jain
@ 2025-02-11 11:13 ` Dev Jain
2025-02-11 11:13 ` [PATCH v2 10/17] khugepaged: Exit early on fully-mapped aligned mTHP Dev Jain
` (9 subsequent siblings)
18 siblings, 0 replies; 22+ messages in thread
From: Dev Jain @ 2025-02-11 11:13 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: npache, ryan.roberts, anshuman.khandual, catalin.marinas, cl,
vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel,
Dev Jain
As noted in [1], khugepaged's goal must be to collapse memory to the highest aligned
order possible. Suppose khugepaged is scanning for 64K and encounters a 128K folio
whose first 64K half is VA-PA aligned and fully mapped. In that case, it does not make
sense to break it down into two 64K folios. On the other hand, if the first half is
not aligned, or is only partially mapped, it does make sense for khugepaged to collapse
that portion into a VA-PA aligned, fully mapped 64K folio.
[1] https://lore.kernel.org/all/aa647830-cf55-48f0-98c2-8230796e35b3@arm.com/
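The check itself reduces to three properties of the scanned PTE range: no holes
(every PTE maps a page), consecutive PFNs, and a first PFN aligned to the scan
order (the additional same-folio test needs folio metadata and is left out here).
A minimal userspace sketch over an array of PFNs, with 0 standing in for a
non-present PTE:

#include <stdio.h>
#include <stdbool.h>

/* True if the 1 << order PFNs already look like an aligned, fully mapped mTHP. */
static bool range_is_aligned_mthp(const unsigned long *pfn, int order)
{
        unsigned long nr = 1UL << order;

        if (!pfn[0] || (pfn[0] & (nr - 1)))             /* hole or unaligned start */
                return false;
        for (unsigned long i = 1; i < nr; i++)
                if (!pfn[i] || pfn[i] != pfn[i - 1] + 1)        /* hole or discontiguity */
                        return false;
        return true;
}

int main(void)
{
        unsigned long good[4] = { 0x100, 0x101, 0x102, 0x103 }; /* aligned + contiguous */
        unsigned long bad[4]  = { 0x101, 0x102, 0x103, 0x104 }; /* contiguous but unaligned */

        printf("%d %d\n", range_is_aligned_mthp(good, 2), range_is_aligned_mthp(bad, 2));
        return 0;
}

When all of these hold (plus the same-folio check), the range is reported as
SCAN_PTE_MAPPED_THP and skipped rather than re-collapsed.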
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/khugepaged.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 65 insertions(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a674014b6563..0d0d8f415a2e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -34,6 +34,7 @@ enum scan_result {
SCAN_PMD_NULL,
SCAN_PMD_NONE,
SCAN_PMD_MAPPED,
+ SCAN_PTE_MAPPED_THP,
SCAN_EXCEED_NONE_PTE,
SCAN_EXCEED_SWAP_PTE,
SCAN_EXCEED_SHARED_PTE,
@@ -562,6 +563,14 @@ static bool is_refcount_suitable(struct folio *folio)
return folio_ref_count(folio) == expected_refcount;
}
+/* Assumes an embedded PFN */
+static bool is_same_folio(pte_t *first_pte, pte_t *last_pte)
+{
+ struct folio *folio1 = page_folio(pte_page(ptep_get(first_pte)));
+ struct folio *folio2 = page_folio(pte_page(ptep_get(last_pte)));
+ return folio1 == folio2;
+}
+
static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
unsigned long address,
pte_t *pte,
@@ -575,13 +584,22 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
bool writable = false;
unsigned int max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
unsigned int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
+ bool all_pfns_present = true;
+ bool all_pfns_contig = true;
+ bool first_pfn_aligned = true;
+ pte_t prev_pteval;
for (_pte = pte; _pte < pte + (1UL << order);
_pte++, address += PAGE_SIZE) {
pte_t pteval = ptep_get(_pte);
+ if (_pte == pte) {
+ if (!IS_ALIGNED(pte_pfn(pteval), (1UL << order)))
+ first_pfn_aligned = false;
+ }
if (pte_none(pteval) || (pte_present(pteval) &&
is_zero_pfn(pte_pfn(pteval)))) {
++none_or_zero;
+ all_pfns_present = false;
if (!userfaultfd_armed(vma) &&
(!cc->is_khugepaged ||
none_or_zero <= max_ptes_none)) {
@@ -660,6 +678,12 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
goto out;
}
+ if (all_pfns_contig && (pte != _pte) && !(all_pfns_present &&
+ (pte_pfn(pteval) == pte_pfn(prev_pteval) + 1)))
+ all_pfns_contig = false;
+
+ prev_pteval = pteval;
+
/*
* Isolate the page to avoid collapsing an hugepage
* currently in use by the VM.
@@ -696,6 +720,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
result = SCAN_PAGE_RO;
} else if (unlikely(cc->is_khugepaged && !referenced)) {
result = SCAN_LACK_REFERENCED_PAGE;
+ } else if ((result == SCAN_SUCCEED) && (order != HPAGE_PMD_ORDER) && all_pfns_present &&
+ all_pfns_contig && first_pfn_aligned &&
+ is_same_folio(pte, pte + (1UL << order) - 1)) {
+ result = SCAN_PTE_MAPPED_THP;
} else {
result = SCAN_SUCCEED;
trace_mm_collapse_huge_page_isolate(&folio->page, none_or_zero,
@@ -1398,6 +1426,8 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
bool writable = false;
unsigned long orders, orig_orders;
int order, prev_order;
+ bool all_pfns_present, all_pfns_contig, first_pfn_aligned;
+ pte_t prev_pteval;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
@@ -1417,6 +1447,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
max_ptes_swap = khugepaged_max_ptes_swap >> (HPAGE_PMD_ORDER - order);
referenced = 0, shared = 0, none_or_zero = 0, unmapped = 0;
+ all_pfns_present = true, all_pfns_contig = true, first_pfn_aligned = true;
/* Check pmd after taking mmap lock */
result = find_pmd_or_thp_or_none(mm, address, &pmd);
@@ -1435,8 +1466,14 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
for (_address = address, _pte = pte; _pte < pte + (1UL << order);
_pte++, _address += PAGE_SIZE) {
pte_t pteval = ptep_get(_pte);
+ if (_pte == pte) {
+ if (!IS_ALIGNED(pte_pfn(pteval), (1UL << order)))
+ first_pfn_aligned = false;
+ }
+
if (is_swap_pte(pteval)) {
++unmapped;
+ all_pfns_present = false;
if (!cc->is_khugepaged ||
unmapped <= max_ptes_swap) {
/*
@@ -1457,6 +1494,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
}
if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
++none_or_zero;
+ all_pfns_present = false;
if (!userfaultfd_armed(vma) &&
(!cc->is_khugepaged ||
none_or_zero <= max_ptes_none)) {
@@ -1546,6 +1584,17 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
goto out_unmap;
}
+
+ /*
+ * PFNs not contig, if either at least one PFN not present, or the previous
+ * and this PFN are not contig
+ */
+ if (all_pfns_contig && (pte != _pte) && !(all_pfns_present &&
+ (pte_pfn(pteval) == pte_pfn(prev_pteval) + 1)))
+ all_pfns_contig = false;
+
+ prev_pteval = pteval;
+
/*
* If collapse was initiated by khugepaged, check that there is
* enough young pte to justify collapsing the page
@@ -1567,15 +1616,30 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
}
out_unmap:
pte_unmap_unlock(pte, ptl);
+
+ /*
+ * We skip if the following conditions are true:
+ * 1) All PTEs point to consecutive PFNs
+ * 2) All PFNs belong to the same folio
+ * 3) The PFNs are PA-aligned to the order we are scanning for
+ */
+ if ((result == SCAN_SUCCEED) && (order != HPAGE_PMD_ORDER) && all_pfns_present &&
+ all_pfns_contig && first_pfn_aligned &&
+ is_same_folio(pte, pte + (1UL << order) - 1)) {
+ result = SCAN_PTE_MAPPED_THP;
+ goto decide_order;
+ }
+
if (result == SCAN_SUCCEED) {
result = collapse_huge_page(mm, address, referenced,
unmapped, order, cc);
/* collapse_huge_page will return with the mmap_lock released */
*mmap_locked = false;
/* Skip over this range and decide order */
- if (result == SCAN_SUCCEED)
+ if (result == SCAN_SUCCEED || result == SCAN_PTE_MAPPED_THP)
goto decide_order;
}
+
if (result != SCAN_SUCCEED) {
/* Go to the next order */
--
2.30.2
* [PATCH v2 10/17] khugepaged: Exit early on fully-mapped aligned mTHP
2025-02-11 11:13 [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse Dev Jain
` (8 preceding siblings ...)
2025-02-11 11:13 ` [PATCH v2 09/17] khugepaged: Define collapse policy if a larger folio is already mapped Dev Jain
@ 2025-02-11 11:13 ` Dev Jain
2025-02-11 11:13 ` [PATCH v2 11/17] khugepaged: Enable sysfs to control order of collapse Dev Jain
` (8 subsequent siblings)
18 siblings, 0 replies; 22+ messages in thread
From: Dev Jain @ 2025-02-11 11:13 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: npache, ryan.roberts, anshuman.khandual, catalin.marinas, cl,
vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel,
Dev Jain
Since the mTHP orders under consideration by khugepaged are also candidates for the
fault handler, a case we hit frequently is that khugepaged scans a region for order-x
while an order-x folio has already been installed there by the fault handler.
Therefore, exit early; this prevents a timeout in the khugepaged selftest. Previously
this was not a problem, because a PMD hugepage gets caught by find_pmd_or_thp_or_none(),
and the previous patch does not solve it either, since it performs the entire PTE scan
before exiting.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/khugepaged.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 0d0d8f415a2e..baa5b44968ac 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -626,6 +626,11 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
+ if (_pte == pte && (order != HPAGE_PMD_ORDER) && (folio_order(folio) == order) &&
+ test_bit(PG_head, &folio->page.flags) && !folio_test_partially_mapped(folio)) {
+ result = SCAN_PTE_MAPPED_THP;
+ goto out;
+ }
/* See hpage_collapse_scan_pmd(). */
if (folio_likely_mapped_shared(folio)) {
++shared;
@@ -1532,6 +1537,16 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
goto out_unmap;
}
+ /* Exit early: There is high chance of this due to faulting */
+ if (_pte == pte && (order != HPAGE_PMD_ORDER) && (folio_order(folio) == order) &&
+ test_bit(PG_head, &folio->page.flags) && !folio_test_partially_mapped(folio)) {
+ pte_unmap_unlock(pte, ptl);
+ _address = address + (PAGE_SIZE << order);
+ _pte = pte + (1UL << order);
+ result = SCAN_PTE_MAPPED_THP;
+ goto decide_order;
+ }
+
/*
* We treat a single page as shared if any part of the THP
* is shared. "False negatives" from
--
2.30.2
* [PATCH v2 11/17] khugepaged: Enable sysfs to control order of collapse
2025-02-11 11:13 [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse Dev Jain
` (9 preceding siblings ...)
2025-02-11 11:13 ` [PATCH v2 10/17] khugepaged: Exit early on fully-mapped aligned mTHP Dev Jain
@ 2025-02-11 11:13 ` Dev Jain
2025-02-11 11:13 ` [PATCH v2 12/17] khugepaged: Enable variable-sized VMA collapse Dev Jain
` (7 subsequent siblings)
18 siblings, 0 replies; 22+ messages in thread
From: Dev Jain @ 2025-02-11 11:13 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: npache, ryan.roberts, anshuman.khandual, catalin.marinas, cl,
vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel,
Dev Jain
Activate khugepaged for anonymous collapse if even a single mTHP order is enabled.
This condition will be refined by subsequent patches.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/khugepaged.c | 36 ++++++++++++++++++------------------
1 file changed, 18 insertions(+), 18 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index baa5b44968ac..37cfa7beba3d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -415,24 +415,20 @@ static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
test_bit(MMF_DISABLE_THP, &mm->flags);
}
-static bool hugepage_pmd_enabled(void)
+static bool thp_enabled(void)
{
/*
* We cover the anon, shmem and the file-backed case here; file-backed
* hugepages, when configured in, are determined by the global control.
- * Anon pmd-sized hugepages are determined by the pmd-size control.
+ * Anon mTHPs are determined by the per-size control.
* Shmem pmd-sized hugepages are also determined by its pmd-size control,
* except when the global shmem_huge is set to SHMEM_HUGE_DENY.
*/
if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
hugepage_global_enabled())
return true;
- if (test_bit(PMD_ORDER, &huge_anon_orders_always))
- return true;
- if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
- return true;
- if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
- hugepage_global_enabled())
+ if (huge_anon_orders_always || huge_anon_orders_madvise ||
+ (huge_anon_orders_inherit && hugepage_global_enabled()))
return true;
if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
return true;
@@ -475,9 +471,9 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
unsigned long vm_flags)
{
if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
- hugepage_pmd_enabled()) {
- if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
- PMD_ORDER))
+ thp_enabled()) {
+ if (thp_vma_allowable_orders(vma, vm_flags, TVA_ENFORCE_SYSFS,
+ THP_ORDERS_ALL_ANON))
__khugepaged_enter(vma->vm_mm);
}
}
@@ -2679,8 +2675,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
progress++;
break;
}
- if (!thp_vma_allowable_order(vma, vma->vm_flags,
- TVA_ENFORCE_SYSFS, PMD_ORDER)) {
+ if (!thp_vma_allowable_orders(vma, vma->vm_flags,
+ TVA_ENFORCE_SYSFS, THP_ORDERS_ALL_ANON)) {
skip:
progress++;
continue;
@@ -2704,6 +2700,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
khugepaged_scan.address + HPAGE_PMD_SIZE >
hend);
if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
+ if (!thp_vma_allowable_order(vma, vma->vm_flags,
+ TVA_ENFORCE_SYSFS, PMD_ORDER))
+ break;
+
struct file *file = get_file(vma->vm_file);
pgoff_t pgoff = linear_page_index(vma,
khugepaged_scan.address);
@@ -2782,7 +2782,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
static int khugepaged_has_work(void)
{
- return !list_empty(&khugepaged_scan.mm_head) && hugepage_pmd_enabled();
+ return !list_empty(&khugepaged_scan.mm_head) && thp_enabled();
}
static int khugepaged_wait_event(void)
@@ -2855,7 +2855,7 @@ static void khugepaged_wait_work(void)
return;
}
- if (hugepage_pmd_enabled())
+ if (thp_enabled())
wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
}
@@ -2886,7 +2886,7 @@ static void set_recommended_min_free_kbytes(void)
int nr_zones = 0;
unsigned long recommended_min;
- if (!hugepage_pmd_enabled()) {
+ if (!thp_enabled()) {
calculate_min_free_kbytes();
goto update_wmarks;
}
@@ -2936,7 +2936,7 @@ int start_stop_khugepaged(void)
int err = 0;
mutex_lock(&khugepaged_mutex);
- if (hugepage_pmd_enabled()) {
+ if (thp_enabled()) {
if (!khugepaged_thread)
khugepaged_thread = kthread_run(khugepaged, NULL,
"khugepaged");
@@ -2962,7 +2962,7 @@ int start_stop_khugepaged(void)
void khugepaged_min_free_kbytes_update(void)
{
mutex_lock(&khugepaged_mutex);
- if (hugepage_pmd_enabled() && khugepaged_thread)
+ if (thp_enabled() && khugepaged_thread)
set_recommended_min_free_kbytes();
mutex_unlock(&khugepaged_mutex);
}
--
2.30.2
* [PATCH v2 12/17] khugepaged: Enable variable-sized VMA collapse
2025-02-11 11:13 [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse Dev Jain
` (10 preceding siblings ...)
2025-02-11 11:13 ` [PATCH v2 11/17] khugepaged: Enable sysfs to control order of collapse Dev Jain
@ 2025-02-11 11:13 ` Dev Jain
2025-02-11 11:13 ` [PATCH v2 13/17] khugepaged: Lock all VMAs mapping the PTE table Dev Jain
` (6 subsequent siblings)
18 siblings, 0 replies; 22+ messages in thread
From: Dev Jain @ 2025-02-11 11:13 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: npache, ryan.roberts, anshuman.khandual, catalin.marinas, cl,
vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel,
Dev Jain
Applications in general may have many VMAs smaller than PMD size. Therefore,
it is essential that khugepaged be able to collapse these VMAs too.
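For a sub-PMD VMA, the scan order becomes the largest enabled order for which an
aligned window fits between the VMA bounds. A standalone sketch of that selection,
with hypothetical VMA bounds and an assumed set of enabled orders (the real code
walks the mask with highest_order()/next_order() and round_down() as in the hunk
below):

#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE (1UL << PAGE_SHIFT)

static int highest_order(unsigned long orders)
{
        return 8 * sizeof(orders) - 1 - __builtin_clzl(orders);
}

int main(void)
{
        /* Hypothetical anonymous VMA: 768K starting at 2M; orders 9, 6, 4 enabled. */
        unsigned long vm_start = 0x200000, vm_end = 0x2c0000;
        unsigned long scan_address = vm_start;
        unsigned long orders = (1UL << 9) | (1UL << 6) | (1UL << 4);
        int order = highest_order(orders);

        while (orders) {
                unsigned long win = PAGE_SIZE << order;
                unsigned long hstart = (vm_start + win - 1) & ~(win - 1);
                unsigned long hend = vm_end & ~(win - 1);

                if (hstart < hend && scan_address < hend)
                        break;                  /* at least one aligned window fits */
                orders &= ~(1UL << order);      /* otherwise drop to the next order */
                order = orders ? highest_order(orders) : -1;
        }
        printf("chosen order: %d (window %lu KiB)\n",
               order, order >= 0 ? (PAGE_SIZE << order) >> 10 : 0);
        return 0;
}

With the bounds above, the 2M order does not fit and the walk settles on order 6
(256K windows).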
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/khugepaged.c | 68 +++++++++++++++++++++++++++++--------------------
1 file changed, 41 insertions(+), 27 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 37cfa7beba3d..048f990d8507 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1413,7 +1413,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
static int hpage_collapse_scan_pmd(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address, bool *mmap_locked,
- struct collapse_control *cc)
+ unsigned long orders, struct collapse_control *cc)
{
pmd_t *pmd;
pte_t *pte, *_pte;
@@ -1425,22 +1425,14 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
unsigned long _address, orig_address = address;
int node = NUMA_NO_NODE;
bool writable = false;
- unsigned long orders, orig_orders;
+ unsigned long orig_orders;
int order, prev_order;
bool all_pfns_present, all_pfns_contig, first_pfn_aligned;
pte_t prev_pteval;
- VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-
- orders = thp_vma_allowable_orders(vma, vma->vm_flags,
- TVA_IN_PF | TVA_ENFORCE_SYSFS, THP_ORDERS_ALL_ANON);
- orders = thp_vma_suitable_orders(vma, address, orders);
orig_orders = orders;
order = highest_order(orders);
-
- /* MADV_COLLAPSE needs to work irrespective of sysfs setting */
- if (!cc->is_khugepaged)
- order = HPAGE_PMD_ORDER;
+ VM_BUG_ON(address & ((PAGE_SIZE << order) - 1));
scan_pte_range:
@@ -1667,7 +1659,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
decide_order:
/* Immediately exit on exhaustion of range */
- if (_address == orig_address + (PAGE_SIZE << HPAGE_PMD_ORDER))
+ if (_address == orig_address + (PAGE_SIZE << (highest_order(orig_orders))))
goto out;
/* Get highest order possible starting from address */
@@ -2636,6 +2628,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
struct mm_struct *mm;
struct vm_area_struct *vma;
int progress = 0;
+ unsigned long orders;
+ int order;
+ bool is_file_vma;
VM_BUG_ON(!pages);
lockdep_assert_held(&khugepaged_mm_lock);
@@ -2675,19 +2670,40 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
progress++;
break;
}
- if (!thp_vma_allowable_orders(vma, vma->vm_flags,
- TVA_ENFORCE_SYSFS, THP_ORDERS_ALL_ANON)) {
+ orders = thp_vma_allowable_orders(vma, vma->vm_flags,
+ TVA_ENFORCE_SYSFS, THP_ORDERS_ALL_ANON);
+ if (!orders) {
skip:
progress++;
continue;
}
- hstart = round_up(vma->vm_start, HPAGE_PMD_SIZE);
- hend = round_down(vma->vm_end, HPAGE_PMD_SIZE);
+
+ /* We can collapse anonymous VMAs less than PMD_SIZE */
+ is_file_vma = IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma);
+ if (is_file_vma) {
+ order = HPAGE_PMD_ORDER;
+ if (!(orders & (1UL << order)))
+ goto skip;
+ hend = round_down(vma->vm_end, PAGE_SIZE << order);
+ }
+ else {
+ /* select the highest possible order for the VMA */
+ order = highest_order(orders);
+ while (orders) {
+ hend = round_down(vma->vm_end, PAGE_SIZE << order);
+ if (khugepaged_scan.address <= hend)
+ break;
+ order = next_order(&orders, order);
+ }
+ }
+ if (!orders)
+ goto skip;
if (khugepaged_scan.address > hend)
goto skip;
+ hstart = round_up(vma->vm_start, PAGE_SIZE << order);
if (khugepaged_scan.address < hstart)
khugepaged_scan.address = hstart;
- VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
+ VM_BUG_ON(khugepaged_scan.address & ((PAGE_SIZE << order) - 1));
while (khugepaged_scan.address < hend) {
bool mmap_locked = true;
@@ -2697,13 +2713,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
goto breakouterloop;
VM_BUG_ON(khugepaged_scan.address < hstart ||
- khugepaged_scan.address + HPAGE_PMD_SIZE >
+ khugepaged_scan.address + (PAGE_SIZE << order) >
hend);
- if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
- if (!thp_vma_allowable_order(vma, vma->vm_flags,
- TVA_ENFORCE_SYSFS, PMD_ORDER))
- break;
-
+ if (is_file_vma) {
struct file *file = get_file(vma->vm_file);
pgoff_t pgoff = linear_page_index(vma,
khugepaged_scan.address);
@@ -2725,15 +2737,15 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
}
} else {
*result = hpage_collapse_scan_pmd(mm, vma,
- khugepaged_scan.address, &mmap_locked, cc);
+ khugepaged_scan.address, &mmap_locked, orders, cc);
}
if (*result == SCAN_SUCCEED)
++khugepaged_pages_collapsed;
/* move to next address */
- khugepaged_scan.address += HPAGE_PMD_SIZE;
- progress += HPAGE_PMD_NR;
+ khugepaged_scan.address += (PAGE_SIZE << order);
+ progress += (1UL << order);
if (!mmap_locked)
/*
* We released mmap_lock so break loop. Note
@@ -3060,7 +3072,9 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
fput(file);
} else {
result = hpage_collapse_scan_pmd(mm, vma, addr,
- &mmap_locked, cc);
+ &mmap_locked,
+ BIT(HPAGE_PMD_ORDER),
+ cc);
}
if (!mmap_locked)
*prev = NULL; /* Tell caller we dropped mmap_lock */
--
2.30.2
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v2 13/17] khugepaged: Lock all VMAs mapping the PTE table
2025-02-11 11:13 [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse Dev Jain
` (11 preceding siblings ...)
2025-02-11 11:13 ` [PATCH v2 12/17] khugepaged: Enable variable-sized VMA collapse Dev Jain
@ 2025-02-11 11:13 ` Dev Jain
2025-02-11 11:13 ` [PATCH v2 14/17] khugepaged: Reset scan address to correct alignment Dev Jain
` (5 subsequent siblings)
18 siblings, 0 replies; 22+ messages in thread
From: Dev Jain @ 2025-02-11 11:13 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: npache, ryan.roberts, anshuman.khandual, catalin.marinas, cl,
vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel,
Dev Jain
After enabling khugepaged to handle VMAs of any size, it may happen that the
process faults on a VMA other than the one under collapse while both VMAs span
the same PTE table. As a result, the fault handler will install a new PTE table
after khugepaged has isolated the old one. Therefore, scan the range covered by
the PTE table, retrieve all VMAs mapping it, and write-lock them. Note that rmap
can still reach the PTE table from folios not under collapse; this is fine, since
it interferes neither with the PTEs nor with the folios under collapse, nor can
rmap fill the PMD.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/khugepaged.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 048f990d8507..e1c2c5b89f6d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1139,6 +1139,23 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
return SCAN_SUCCEED;
}
+static void take_vma_locks_per_pte(struct mm_struct *mm, unsigned long haddress)
+{
+ struct vm_area_struct *vma;
+ unsigned long start = haddress;
+ unsigned long end = haddress + HPAGE_PMD_SIZE;
+
+ while (start < end) {
+ vma = vma_lookup(mm, start);
+ if (!vma) {
+ start += PAGE_SIZE;
+ continue;
+ }
+ vma_start_write(vma);
+ start = vma->vm_end;
+ }
+}
+
static int vma_collapse_anon_folio_pmd(struct mm_struct *mm, unsigned long address,
struct vm_area_struct *vma, struct collapse_control *cc, pmd_t *pmd,
struct folio *folio)
@@ -1270,7 +1287,9 @@ static int vma_collapse_anon_folio(struct mm_struct *mm, unsigned long address,
if (result != SCAN_SUCCEED)
goto out;
- vma_start_write(vma);
+ /* Faulting may fill the PMD after flush; lock all VMAs mapping this PTE */
+ take_vma_locks_per_pte(mm, haddress);
+
anon_vma_lock_write(vma->anon_vma);
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, haddress,
--
2.30.2
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v2 14/17] khugepaged: Reset scan address to correct alignment
2025-02-11 11:13 [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse Dev Jain
` (12 preceding siblings ...)
2025-02-11 11:13 ` [PATCH v2 13/17] khugepaged: Lock all VMAs mapping the PTE table Dev Jain
@ 2025-02-11 11:13 ` Dev Jain
2025-02-11 11:13 ` [PATCH v2 15/17] khugepaged: Delay cond_resched() Dev Jain
` (4 subsequent siblings)
18 siblings, 0 replies; 22+ messages in thread
From: Dev Jain @ 2025-02-11 11:13 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: npache, ryan.roberts, anshuman.khandual, catalin.marinas, cl,
vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel,
Dev Jain
There are two situations:
1) After retaking the mmap lock, the next VMA expands downwards.
2) After khugepaged sleeps and starts again, it picks up the starting address
from the global struct khugepaged_scan, and hence picks up the same VMA as
in the previous cycle.
In both cases, khugepaged_scan.address > hstart. Therefore, explicitly align the
address down to the order we are scanning for. Previously this was not a problem,
since the scan address was always PMD-aligned.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/khugepaged.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e1c2c5b89f6d..7c9a758f6817 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2722,6 +2722,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
hstart = round_up(vma->vm_start, PAGE_SIZE << order);
if (khugepaged_scan.address < hstart)
khugepaged_scan.address = hstart;
+ else
+ khugepaged_scan.address = round_down(khugepaged_scan.address, PAGE_SIZE << order);
+
VM_BUG_ON(khugepaged_scan.address & ((PAGE_SIZE << order) - 1));
while (khugepaged_scan.address < hend) {
--
2.30.2
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v2 15/17] khugepaged: Delay cond_resched()
2025-02-11 11:13 [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse Dev Jain
` (13 preceding siblings ...)
2025-02-11 11:13 ` [PATCH v2 14/17] khugepaged: Reset scan address to correct alignment Dev Jain
@ 2025-02-11 11:13 ` Dev Jain
2025-02-11 11:13 ` [PATCH v2 16/17] khugepaged: Implement strict policy for mTHP collapse Dev Jain
` (3 subsequent siblings)
18 siblings, 0 replies; 22+ messages in thread
From: Dev Jain @ 2025-02-11 11:13 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: npache, ryan.roberts, anshuman.khandual, catalin.marinas, cl,
vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel,
Dev Jain
After scanning VMAs smaller than PMD size, cond_resched() may get called as often
as once per 1 << order PTEs scanned, whereas earlier it was called once per
PMD's worth of PTEs. Therefore, manually enforce the previous behaviour; not
doing so causes the khugepaged selftest to time out.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/khugepaged.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 7c9a758f6817..d2bb008b95e7 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2650,6 +2650,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
unsigned long orders;
int order;
bool is_file_vma;
+ int prev_progress = 0;
VM_BUG_ON(!pages);
lockdep_assert_held(&khugepaged_mm_lock);
@@ -2730,7 +2731,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
while (khugepaged_scan.address < hend) {
bool mmap_locked = true;
- cond_resched();
+ if (progress - prev_progress >= HPAGE_PMD_NR) {
+ cond_resched();
+ prev_progress = progress;
+ }
if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
goto breakouterloop;
--
2.30.2
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v2 16/17] khugepaged: Implement strict policy for mTHP collapse
2025-02-11 11:13 [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse Dev Jain
` (14 preceding siblings ...)
2025-02-11 11:13 ` [PATCH v2 15/17] khugepaged: Delay cond_resched() Dev Jain
@ 2025-02-11 11:13 ` Dev Jain
2025-02-11 11:13 ` [PATCH v2 17/17] Documentation: transhuge: Define khugepaged mTHP collapse policy Dev Jain
` (2 subsequent siblings)
18 siblings, 0 replies; 22+ messages in thread
From: Dev Jain @ 2025-02-11 11:13 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: npache, ryan.roberts, anshuman.khandual, catalin.marinas, cl,
vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel,
Dev Jain
As noted in the discussion thread ending at [1], avoid the creep problem by
collapsing to mTHPs only if max_ptes_none is zero or 511. Along with this,
make mTHP collapse conditions stricter by removing scaling of max_ptes_shared
and max_ptes_swap, and consider collapse only if there are no shared or
swap PTEs in the range.
[1] https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
mm/khugepaged.c | 37 ++++++++++++++++++++++++++++++++-----
1 file changed, 32 insertions(+), 5 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d2bb008b95e7..b589f889bb5a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -417,6 +417,17 @@ static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
static bool thp_enabled(void)
{
+ bool anon_pmd_enabled = (test_bit(PMD_ORDER, &huge_anon_orders_always) ||
+ test_bit(PMD_ORDER, &huge_anon_orders_madvise) ||
+ (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
+ hugepage_global_enabled()));
+
+ /*
+ * If PMD_ORDER is ineligible for collapse, check if mTHP collapse policy is obeyed;
+ * see Documentation/admin-guide/transhuge.rst
+ */
+ bool anon_collapse_mthp = (khugepaged_max_ptes_none == 0 ||
+ khugepaged_max_ptes_none == HPAGE_PMD_NR - 1);
/*
* We cover the anon, shmem and the file-backed case here; file-backed
* hugepages, when configured in, are determined by the global control.
@@ -427,8 +438,9 @@ static bool thp_enabled(void)
if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
hugepage_global_enabled())
return true;
- if (huge_anon_orders_always || huge_anon_orders_madvise ||
- (huge_anon_orders_inherit && hugepage_global_enabled()))
+ if ((huge_anon_orders_always || huge_anon_orders_madvise ||
+ (huge_anon_orders_inherit && hugepage_global_enabled())) &&
+ (anon_pmd_enabled || anon_collapse_mthp))
return true;
if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
return true;
@@ -578,13 +590,16 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
pte_t *_pte;
int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
bool writable = false;
- unsigned int max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
+ unsigned int max_ptes_shared = khugepaged_max_ptes_shared;
unsigned int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
bool all_pfns_present = true;
bool all_pfns_contig = true;
bool first_pfn_aligned = true;
pte_t prev_pteval;
+ if (order != HPAGE_PMD_ORDER)
+ max_ptes_shared = 0;
+
for (_pte = pte; _pte < pte + (1UL << order);
_pte++, address += PAGE_SIZE) {
pte_t pteval = ptep_get(_pte);
@@ -1453,11 +1468,16 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
order = highest_order(orders);
VM_BUG_ON(address & ((PAGE_SIZE << order) - 1));
+ max_ptes_none = khugepaged_max_ptes_none;
+ max_ptes_shared = khugepaged_max_ptes_shared;
+ max_ptes_swap = khugepaged_max_ptes_swap;
+
scan_pte_range:
- max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
+ if (order != HPAGE_PMD_ORDER)
+ max_ptes_shared = max_ptes_swap = 0;
+
max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
- max_ptes_swap = khugepaged_max_ptes_swap >> (HPAGE_PMD_ORDER - order);
referenced = 0, shared = 0, none_or_zero = 0, unmapped = 0;
all_pfns_present = true, all_pfns_contig = true, first_pfn_aligned = true;
@@ -2651,6 +2671,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
int order;
bool is_file_vma;
int prev_progress = 0;
+ bool collapse_mthp = true;
+
+ /* Avoid the creep problem; see Documentation/admin-guide/transhuge.rst */
+ if (khugepaged_max_ptes_none && khugepaged_max_ptes_none != HPAGE_PMD_NR - 1)
+ collapse_mthp = false;
VM_BUG_ON(!pages);
lockdep_assert_held(&khugepaged_mm_lock);
@@ -2710,6 +2735,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
/* select the highest possible order for the VMA */
order = highest_order(orders);
while (orders) {
+ if (order != HPAGE_PMD_ORDER && !collapse_mthp)
+ goto skip;
hend = round_down(vma->vm_end, PAGE_SIZE << order);
if (khugepaged_scan.address <= hend)
break;
--
2.30.2
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v2 17/17] Documentation: transhuge: Define khugepaged mTHP collapse policy
2025-02-11 11:13 [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse Dev Jain
` (15 preceding siblings ...)
2025-02-11 11:13 ` [PATCH v2 16/17] khugepaged: Implement strict policy for mTHP collapse Dev Jain
@ 2025-02-11 11:13 ` Dev Jain
2025-02-11 23:23 ` [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse Andrew Morton
2025-02-15 1:47 ` Nico Pache
18 siblings, 0 replies; 22+ messages in thread
From: Dev Jain @ 2025-02-11 11:13 UTC (permalink / raw)
To: akpm, david, willy, kirill.shutemov
Cc: npache, ryan.roberts, anshuman.khandual, catalin.marinas, cl,
vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
zokeefe, zhengqi.arch, jhubbard, 21cnbao, linux-mm, linux-kernel,
Dev Jain
Update the documentation to reflect the mTHP-specific changes to khugepaged.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
Documentation/admin-guide/mm/transhuge.rst | 49 +++++++++++++++++-----
1 file changed, 38 insertions(+), 11 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index dff8d5985f0f..6a513fa81005 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -63,7 +63,7 @@ often.
THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside task's address space. Unless THP is completely
disabled, there is ``khugepaged`` daemon that scans memory and
-collapses sequences of basic pages into PMD-sized huge pages.
+collapses sequences of basic pages into huge pages.
The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
interface and using madvise(2) and prctl(2) system calls.
@@ -212,20 +212,16 @@ this behaviour by writing 0 to shrink_underused, and enable it by writing
echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused
echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused
-khugepaged will be automatically started when PMD-sized THP is enabled
+khugepaged will be automatically started when THP is enabled
(either of the per-size anon control or the top-level control are set
to "always" or "madvise"), and it'll be automatically shutdown when
-PMD-sized THP is disabled (when both the per-size anon control and the
-top-level control are "never")
+THP is disabled (when all of the per-size anon controls and the
+top-level control are "never"). mTHP collapse is supported only for
+private-anonymous memory.
Khugepaged controls
-------------------
-.. note::
- khugepaged currently only searches for opportunities to collapse to
- PMD-sized THP and no attempt is made to collapse to other THP
- sizes.
-
khugepaged runs usually at low frequency so while one may not want to
invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged. However it's
@@ -254,8 +250,9 @@ The khugepaged progress can be seen in the number of pages collapsed (note
that this counter may not be an exact count of the number of pages
collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping
being replaced by a PMD mapping, or (2) All 4K physical pages replaced by
-one 2M hugepage. Each may happen independently, or together, depending on
-the type of memory and the failures that occur. As such, this value should
+one 2M hugepage, or (3) A portion of the PTE-mapped 4K pages replaced by
+a mapping to an mTHP. Each may happen independently, or together, depending
+on the type of memory and the failures that occur. As such, this value should
be interpreted roughly as a sign of progress, and counters in /proc/vmstat
consulted for more accurate accounting)::
@@ -294,6 +291,36 @@ that THP is shared. Exceeding the number would block the collapse::
A higher value may increase memory footprint for some workloads.
+Khugepaged specifics for anon-mTHP collapse
+------------------------------------------
+
+The objective of khugepaged is to collapse memory to the highest aligned order
+possible. If it fails on PMD order, it will greedily try the lower orders.
+
+The tunables max_ptes_shared and max_ptes_swap are considered to be zero for
+mTHP collapse; i.e. the memory range must not have any shared or swap PTE
+for it to be eligible for mTHP collapse.
+
+The tunable max_ptes_none is scaled downwards, according to the order of
+the collapse. For example, if max_ptes_none = 511, and khugepaged tries to
+collapse to order 4, then the memory range under consideration will become
+a candidate for collapse only when the number of none PTEs (out of the 16 PTEs)
+does not exceed: 511 >> (9 - 4) = 15.
+
+mTHP collapse is supported only if max_ptes_none is either zero or 511 (one less
+than the number of entries in the PTE table). Any other value, given the scaling
+logic presented above, produces what we call the "creep" problem. Let the bitmask
+00110000 denote a memory range mapped by 8 consecutive page table entries, where 0
+denotes an empty PTE and 1 a PTE mapping a physical folio, and let max_ptes_none = 256
+(50%), which scales to 256 >> (9 - 3) = 4 none PTEs allowed at order-3 and
+256 >> (9 - 2) = 2 at order-2. If order-2 and order-3 are enabled, khugepaged scans
+the range for order-3, but since 6 of the 8 PTEs are none, it drops down to order-2.
+It successfully collapses the first 4 PTEs to order-2, and the memory range becomes:
+11110000
+Now, from the order-3 PoV, only 4 of the 8 PTEs are none, and the range has suddenly
+become eligible for order-3 collapse. So, we can creep into larger order collapses in
+a very inefficient manner.
+
Boot parameters
===============
--
2.30.2
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse
2025-02-11 11:13 [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse Dev Jain
` (16 preceding siblings ...)
2025-02-11 11:13 ` [PATCH v2 17/17] Documentation: transhuge: Define khugepaged mTHP collapse policy Dev Jain
@ 2025-02-11 23:23 ` Andrew Morton
2025-02-12 4:18 ` Dev Jain
2025-02-15 1:47 ` Nico Pache
18 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2025-02-11 23:23 UTC (permalink / raw)
To: Dev Jain
Cc: david, willy, kirill.shutemov, npache, ryan.roberts,
anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On Tue, 11 Feb 2025 16:43:09 +0530 Dev Jain <dev.jain@arm.com> wrote:
> This patchset extends khugepaged from collapsing only PMD-sized THPs to
> collapsing anonymous mTHPs.
>
> mTHPs were introduced in the kernel to improve memory management by allocating
> chunks of larger memory, so as to reduce number of page faults, TLB misses (due
> to TLB coalescing), reduce length of LRU lists, etc. However, the mTHP property
> is often lost due to CoW, swap-in/out, and when the kernel just cannot find
> enough physically contiguous memory to allocate on fault. Henceforth, there is a
> need to regain mTHPs in the system asynchronously. This work is an attempt in
> this direction, starting with anonymous folios.
>
> In the fault handler, we select the THP order in a greedy manner; the same has
> been used here, along with the same sysfs interface to control the order of
> collapse. In contrast to PMD-collapse, we (hopefully) get rid of the mmap_write_lock().
>
> ---------------------------------------------------------
> Testing
> ---------------------------------------------------------
>
> The set has been build tested on x86_64.
> For Aarch64,
> 1. mm-selftests: No regressions.
> 2. Analyzing with tools/mm/thpmaps on different userspace programs mapping
> aligned VMAs of a large size, faulting in basepages/mTHPs (according to sysfs),
> and then madvise()'ing the VMA, khugepaged is able to 100% collapse the VMAs.
It would be nice to provide some evidence that this patchset actually
makes Linux better for our users, and by how much.
Thanks, I think I'll skip v2 and shall await reviewer input.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse
2025-02-11 23:23 ` [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse Andrew Morton
@ 2025-02-12 4:18 ` Dev Jain
0 siblings, 0 replies; 22+ messages in thread
From: Dev Jain @ 2025-02-12 4:18 UTC (permalink / raw)
To: Andrew Morton
Cc: david, willy, kirill.shutemov, npache, ryan.roberts,
anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On 12/02/25 4:53 am, Andrew Morton wrote:
>
> It would be nice to provide some evidence that this patchset actually
> makes Linux better for our users, and by how much.
>
> Thanks, I think I'll skip v2 and shall await reviewer input.
Hi Andrew, thanks for your reply.
Although the introduction of mTHPs leads to the natural conclusion of
extending khugepaged to support mTHP collapse, I'll try to get some
performance statistics out.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse
2025-02-11 11:13 [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse Dev Jain
` (17 preceding siblings ...)
2025-02-11 23:23 ` [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse Andrew Morton
@ 2025-02-15 1:47 ` Nico Pache
2025-02-15 7:36 ` Dev Jain
18 siblings, 1 reply; 22+ messages in thread
From: Nico Pache @ 2025-02-15 1:47 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, david, willy, kirill.shutemov, ryan.roberts,
anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
Hi Dev,
I tried to run your kernel to get some performance numbers out of it,
but ran into the following issue while running my defer-mthp-test.sh
workload.
[ 297.393032] =====================================
[ 297.393618] WARNING: bad unlock balance detected!
[ 297.394201] 6.14.0-rc2mthpDEV #2 Not tainted
[ 297.394732] -------------------------------------
[ 297.395421] khugepaged/111 is trying to release lock (&mm->mmap_lock) at:
[ 297.396509] [<ffffffff947cb76a>] khugepaged+0x23a/0xb40
[ 297.397205] but there are no more locks to release!
[ 297.397865]
[ 297.397865] other info that might help us debug this:
[ 297.398684] no locks held by khugepaged/111.
[ 297.399155]
[ 297.399155] stack backtrace:
[ 297.399591] CPU: 10 UID: 0 PID: 111 Comm: khugepaged Not tainted
6.14.0-rc2mthpDEV #2
[ 297.399593] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS 1.16.3-2.fc40 04/01/2014
[ 297.399595] Call Trace:
[ 297.399599] <TASK>
[ 297.399602] dump_stack_lvl+0x6e/0xa0
[ 297.399607] ? khugepaged+0x23a/0xb40
[ 297.399610] print_unlock_imbalance_bug.part.0+0xfb/0x110
[ 297.399612] ? khugepaged+0x23a/0xb40
[ 297.399614] lock_release+0x283/0x3f0
[ 297.399620] up_read+0x1b/0x30
[ 297.399622] khugepaged+0x23a/0xb40
[ 297.399631] ? __pfx_khugepaged+0x10/0x10
[ 297.399633] kthread+0xf2/0x240
[ 297.399636] ? __pfx_kthread+0x10/0x10
[ 297.399638] ret_from_fork+0x34/0x50
[ 297.399640] ? __pfx_kthread+0x10/0x10
[ 297.399642] ret_from_fork_asm+0x1a/0x30
[ 297.399649] </TASK>
[ 297.505555] ------------[ cut here ]------------
[ 297.506044] DEBUG_RWSEMS_WARN_ON(tmp < 0): count =
0xffffffffffffff00, magic = 0xffff8c6e03bc1f88, owner = 0x1, curr
0xffff8c6e0eccb700, list empty
[ 297.507362] WARNING: CPU: 8 PID: 1946 at
kernel/locking/rwsem.c:1346 __up_read+0x1ba/0x220
[ 297.508220] Modules linked in: nft_fib_inet nft_fib_ipv4
nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6
nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6
nf_defrag_ipv4 rfkill nf_tables intel_rapl_msr intel_rapl_common
kvm_amd iTCO_wdt intel_pmc_bxt iTCO_vendor_support kvm i2c_i801
i2c_smbus lpc_ich virtio_net net_failover failover virtio_balloon
joydev fuse loop nfnetlink zram xfs polyval_clmulni polyval_generic
ghash_clmulni_intel sha512_ssse3 sha256_ssse3 virtio_console
virtio_blk sha1_ssse3 serio_raw qemu_fw_cfg
[ 297.513474] CPU: 8 UID: 0 PID: 1946 Comm: thp_test Not tainted
6.14.0-rc2mthpDEV #2
[ 297.514314] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS 1.16.3-2.fc40 04/01/2014
[ 297.515265] RIP: 0010:__up_read+0x1ba/0x220
[ 297.515756] Code: c6 78 8b e1 95 48 c7 c7 88 0e d3 95 48 39 c2 48
c7 c2 be 39 e4 95 48 c7 c0 29 8b e1 95 48 0f 44 c2 48 8b 13 50 e8 e6
44 f5 ff <0f> 0b 58 e9 20 ff ff ff 48 8b 57 60 48 8d 47 60 4c 8b 47 08
c6 05
[ 297.517659] RSP: 0018:ffffa8a943533ac8 EFLAGS: 00010282
[ 297.518209] RAX: 0000000000000000 RBX: ffff8c6e03bc1f88 RCX: 0000000000000000
[ 297.518884] RDX: ffff8c7366ff0980 RSI: ffff8c7366fe1a80 RDI: ffff8c7366fe1a80
[ 297.519577] RBP: ffffa8a943533b58 R08: 0000000000000000 R09: 0000000000000001
[ 297.520272] R10: 0000000000000000 R11: 0770076d07650720 R12: ffffa8a943533b10
[ 297.520949] R13: ffff8c6e03bc1f88 R14: ffffa8a943533b58 R15: ffffa8a943533b10
[ 297.521651] FS: 00007f24de01b740(0000) GS:ffff8c7366e00000(0000)
knlGS:0000000000000000
[ 297.522425] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 297.522990] CR2: 0000000a7ffef000 CR3: 000000010d9d6000 CR4: 0000000000750ef0
[ 297.523799] PKRU: 55555554
[ 297.524100] Call Trace:
[ 297.524367] <TASK>
[ 297.524597] ? __warn.cold+0xb7/0x151
[ 297.525072] ? __up_read+0x1ba/0x220
[ 297.525442] ? report_bug+0xff/0x140
[ 297.525804] ? console_unlock+0x9d/0x150
[ 297.526233] ? handle_bug+0x58/0x90
[ 297.526590] ? exc_invalid_op+0x17/0x70
[ 297.526993] ? asm_exc_invalid_op+0x1a/0x20
[ 297.527420] ? __up_read+0x1ba/0x220
[ 297.527783] ? __up_read+0x1ba/0x220
[ 297.528160] vms_complete_munmap_vmas+0x19c/0x1f0
[ 297.528628] do_vmi_align_munmap+0x20a/0x280
[ 297.529069] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.529552] do_vmi_munmap+0xd0/0x190
[ 297.529920] __vm_munmap+0xb1/0x1b0
[ 297.530293] __x64_sys_munmap+0x1b/0x30
[ 297.530677] do_syscall_64+0x95/0x180
[ 297.531058] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.531534] ? lockdep_hardirqs_on_prepare+0xdb/0x190
[ 297.532167] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.532640] ? syscall_exit_to_user_mode+0x97/0x290
[ 297.533226] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.533701] ? do_syscall_64+0xa1/0x180
[ 297.534097] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.534587] ? lockdep_hardirqs_on_prepare+0xdb/0x190
[ 297.535129] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.535603] ? syscall_exit_to_user_mode+0x97/0x290
[ 297.536092] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.536568] ? do_syscall_64+0xa1/0x180
[ 297.536954] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.537444] ? lockdep_hardirqs_on_prepare+0xdb/0x190
[ 297.537936] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.538524] ? syscall_exit_to_user_mode+0x97/0x290
[ 297.539044] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.539526] ? do_syscall_64+0xa1/0x180
[ 297.539931] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.540597] ? do_user_addr_fault+0x5a9/0x8a0
[ 297.541102] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.541580] ? trace_hardirqs_off+0x4b/0xc0
[ 297.542011] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.542488] ? lockdep_hardirqs_on_prepare+0xdb/0x190
[ 297.542991] ? srso_alias_return_thunk+0x5/0xfbef5
[ 297.543466] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 297.543960] RIP: 0033:0x7f24de1367eb
[ 297.544344] Code: 73 01 c3 48 8b 0d 2d f6 0c 00 f7 d8 64 89 01 48
83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 0b 00 00
00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fd f5 0c 00 f7 d8 64 89
01 48
[ 297.546074] RSP: 002b:00007ffc7bb2e2b8 EFLAGS: 00000206 ORIG_RAX:
000000000000000b
[ 297.546796] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f24de1367eb
[ 297.547488] RDX: 0000000080000000 RSI: 0000000080000000 RDI: 0000000480000000
[ 297.548182] RBP: 00007ffc7bb2e390 R08: 0000000000000064 R09: 00000000fffffffe
[ 297.548884] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000006
[ 297.549594] R13: 0000000000000000 R14: 00007f24de258000 R15: 0000000000403e00
[ 297.550292] </TASK>
[ 297.550530] irq event stamp: 64417291
[ 297.550903] hardirqs last enabled at (64417291):
[<ffffffff94749232>] seqcount_lockdep_reader_access+0x82/0x90
[ 297.551859] hardirqs last disabled at (64417290):
[<ffffffff947491fe>] seqcount_lockdep_reader_access+0x4e/0x90
[ 297.552810] softirqs last enabled at (64413640):
[<ffffffff943bf3c2>] __irq_exit_rcu+0xe2/0x100
[ 297.553654] softirqs last disabled at (64413627):
[<ffffffff943bf3c2>] __irq_exit_rcu+0xe2/0x100
[ 297.554504] ---[ end trace 0000000000000000 ]---
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 00/17] khugepaged: Asynchronous mTHP collapse
2025-02-15 1:47 ` Nico Pache
@ 2025-02-15 7:36 ` Dev Jain
0 siblings, 0 replies; 22+ messages in thread
From: Dev Jain @ 2025-02-15 7:36 UTC (permalink / raw)
To: Nico Pache
Cc: akpm, david, willy, kirill.shutemov, ryan.roberts,
anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
21cnbao, linux-mm, linux-kernel
On 15/02/25 7:17 am, Nico Pache wrote:
> Hi Dev,
>
> I tried to run your kernel to get some performance numbers out of it,
> but ran into the following issue while running my defer-mthp-test.sh
> workload.
>
Thanks for testing. Hmm... can you do this: drop patches 12-16, and instead of
patch 16, apply this:
commit 112f4fa8e92b2bb93051595b2a804b3546b3545a
Author: Dev Jain <dev.jain@arm.com>
Date: Fri Jan 24 10:52:15 2025 +0000
khugepaged: Implement strict policy for mTHP collapse
Signed-off-by: Dev Jain <dev.jain@arm.com>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 37cfa7beba3d..1caf9eb3bfd9 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -417,6 +417,17 @@ static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
static bool thp_enabled(void)
{
+ bool anon_pmd_enabled = (test_bit(PMD_ORDER, &huge_anon_orders_always) ||
+ test_bit(PMD_ORDER, &huge_anon_orders_madvise) ||
+ (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
+ hugepage_global_enabled()));
+
+ /*
+ * If PMD_ORDER is ineligible for collapse, check if mTHP collapse policy is obeyed;
+ * see Documentation/admin-guide/transhuge.rst
+ */
+ bool anon_collapse_mthp = (khugepaged_max_ptes_none == 0 ||
+ khugepaged_max_ptes_none == HPAGE_PMD_NR - 1);
/*
* We cover the anon, shmem and the file-backed case here; file-backed
* hugepages, when configured in, are determined by the global control.
@@ -427,8 +438,9 @@ static bool thp_enabled(void)
if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
hugepage_global_enabled())
return true;
- if (huge_anon_orders_always || huge_anon_orders_madvise ||
- (huge_anon_orders_inherit && hugepage_global_enabled()))
+ if ((huge_anon_orders_always || huge_anon_orders_madvise ||
+ (huge_anon_orders_inherit && hugepage_global_enabled())) &&
+ (anon_pmd_enabled || anon_collapse_mthp))
return true;
if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
return true;
@@ -578,13 +590,16 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
pte_t *_pte;
int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
bool writable = false;
- unsigned int max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
+ unsigned int max_ptes_shared = khugepaged_max_ptes_shared;
unsigned int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
bool all_pfns_present = true;
bool all_pfns_contig = true;
bool first_pfn_aligned = true;
pte_t prev_pteval;
+ if (order != HPAGE_PMD_ORDER)
+ max_ptes_shared = 0;
+
for (_pte = pte; _pte < pte + (1UL << order);
_pte++, address += PAGE_SIZE) {
pte_t pteval = ptep_get(_pte);
@@ -1442,11 +1457,16 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
if (!cc->is_khugepaged)
order = HPAGE_PMD_ORDER;
+ max_ptes_none = khugepaged_max_ptes_none;
+ max_ptes_shared = khugepaged_max_ptes_shared;
+ max_ptes_swap = khugepaged_max_ptes_swap;
+
scan_pte_range:
- max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
+ if (order != HPAGE_PMD_ORDER)
+ max_ptes_shared = max_ptes_swap = 0;
+
max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
- max_ptes_swap = khugepaged_max_ptes_swap >> (HPAGE_PMD_ORDER - order);
referenced = 0, shared = 0, none_or_zero = 0, unmapped = 0;
all_pfns_present = true, all_pfns_contig = true, first_pfn_aligned = true;
@@ -2636,6 +2656,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
struct mm_struct *mm;
struct vm_area_struct *vma;
int progress = 0;
+ bool collapse_mthp = true;
+
+ /* Avoid the creep problem; see Documentation/admin-guide/transhuge.rst */
+ if (khugepaged_max_ptes_none && khugepaged_max_ptes_none != HPAGE_PMD_NR - 1)
+ collapse_mthp = false;
VM_BUG_ON(!pages);
lockdep_assert_held(&khugepaged_mm_lock);
The dropped patches are the variable-sized VMA extension; implementing that was
quite a task and I ran into a lot of problems. Also, David notes that we may have
to take the rmap locks in patch 13 of my v2 after all; in any case, the
implementation can be brute-forced by implementing a function akin to
mm_take_all_locks().
Also, the policy I am implementing for skipping large folios is different from v1;
now I am not necessarily skipping when I see a large folio. This may increase the
latency of my method, so it may not be a fair comparison, although I don't think it
should cause a major difference.
^ permalink raw reply [flat|nested] 22+ messages in thread