* [RFC 00/11] khugepaged: mTHP support
@ 2025-01-08 23:31 Nico Pache
  2025-01-08 23:31 ` [RFC 01/11] introduce khugepaged_collapse_single_pmd to collapse a single pmd Nico Pache
                   ` (13 more replies)
  0 siblings, 14 replies; 53+ messages in thread
From: Nico Pache @ 2025-01-08 23:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm

The following series provides khugepaged and madvise collapse with the 
capability to collapse regions to mTHPs.

To achieve this we generalize the khugepaged functions to no longer depend
on PMD_ORDER. Then, during the PMD scan, we keep track of chunks of pages
(defined by MIN_MTHP_ORDER) that are fully utilized; this information is
tracked in a bitmap. After the PMD scan is done, we recurse over the bitmap
in a binary fashion to find the optimal mTHP sizes for the PMD range. The
restriction on max_ptes_none is removed during the scan to make sure we
account for the whole PMD range. max_ptes_none is mapped to a 0-100 scale to
determine how full an mTHP order needs to be before collapsing it.
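
To make the threshold concrete, here is a minimal sketch of the arithmetic
introduced in patch 8 (the standalone helper and its name are illustrative,
not code from the series):

/*
 * Sketch only: map max_ptes_none to a 0-100 "fullness" percentage and
 * derive how many bitmap chunks of an order-sized region must be set
 * before that order is collapsed.  hpage_pmd_nr is 512 with 4K pages.
 */
static int fullness_threshold_bits(int hpage_pmd_nr, int max_ptes_none,
				   int num_chunks)
{
	int max_percent = ((hpage_pmd_nr - max_ptes_none - 1) * 100) /
			  (hpage_pmd_nr - 1);

	return (max_percent * num_chunks) / 100;
}

For example, max_ptes_none = 255 gives max_percent = 50, so an order spanning
32 chunks collapses once at least 16 of them are fully utilized; the default
of 511 collapses any candidate, and 0 requires every chunk to be set.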

Some design choices to note:
 - Bitmap structures are allocated dynamically because on some architectures
    (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
    compile time, which leads to warnings.
 - The recursion is expressed iteratively through an explicit stack structure.
 - A MIN_MTHP_ORDER was added to compress the bitmap and to keep it at
    64 bits on x86, which provides some optimization for the bitmap
    operations. If other arches/configs with more than 512 PTEs per PMD want
    to compress their bitmap further, this value can be changed per arch
    (see the worked example below).
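
As a worked example of the sizing: with 4K pages HPAGE_PMD_ORDER is 9, so
MIN_MTHP_ORDER = 3 gives MTHP_BITMAP_SIZE = 1 << (9 - 3) = 64 bits, and the
whole PMD scan state fits in a single unsigned long on x86. A config with
more PTEs per PMD (e.g. arm64 with 64K pages, 8192 PTEs per PMD) would need
a 1024-bit bitmap at MIN_MTHP_ORDER = 3, which is where raising the per-arch
value would help.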

Patch 1-2:  Some refactoring to combine madvise_collapse and khugepaged
Patch 3:    A minor "fix"/optimization
Patch 4:    Refactor/rename hpage_collapse
Patch 5-7:  Generalize khugepaged functions for arbitrary orders
Patch 8-11: The mTHP patches

This series acts as an alternative to Dev Jain's approach [1]. The two
series differ in a few ways:
  - My approach uses a bitmap to store the state of the linear PMD scan and
    then determines potential mTHP batches from it. Dev incorporates his
    directly into the scan and tries each available order.
  - Dev is attempting to optimize the locking, while my approach keeps the
    locking changes to a minimum. I believe his changes are not safe for
    uffd.
  - Dev's changes only work for khugepaged, not madvise_collapse (although
    I think that was by choice and it could easily support madvise).
  - Dev scales all khugepaged sysfs tunables by order, while I'm removing
    the restriction of max_ptes_none and converting it to a scale that
    determines a (m)THP collapse threshold.
  - Dev turns on khugepaged if any order is available, while mine still
    only runs if PMDs are enabled. I like Dev's approach and will most
    likely do the same in my PATCH posting.
  - mTHPs need their refcount updated to 1 << order, which Dev is missing.

Patch 11 was inspired by one of Dev's changes.

[1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/

Nico Pache (11):
  introduce khugepaged_collapse_single_pmd to collapse a single pmd
  khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot
  khugepaged: Don't allocate khugepaged mm_slot early
  khugepaged: rename hpage_collapse_* to khugepaged_*
  khugepaged: generalize hugepage_vma_revalidate for mTHP support
  khugepaged: generalize alloc_charge_folio for mTHP support
  khugepaged: generalize __collapse_huge_page_* for mTHP support
  khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  khugepaged: add mTHP support
  khugepaged: remove max_ptes_none restriction on the pmd scan
  khugepaged: skip collapsing mTHP to smaller orders

 include/linux/khugepaged.h |   4 +-
 mm/huge_memory.c           |   3 +-
 mm/khugepaged.c            | 436 +++++++++++++++++++++++++------------
 3 files changed, 306 insertions(+), 137 deletions(-)

-- 
2.47.1




* [RFC 01/11] introduce khugepaged_collapse_single_pmd to collapse a single pmd
  2025-01-08 23:31 [RFC 00/11] khugepaged: mTHP support Nico Pache
@ 2025-01-08 23:31 ` Nico Pache
  2025-01-10  6:25   ` Dev Jain
  2025-01-08 23:31 ` [RFC 02/11] khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot Nico Pache
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 53+ messages in thread
From: Nico Pache @ 2025-01-08 23:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm

The khugepaged daemon and madvise_collapse have two different
implementations that do almost the same thing.

Create khugepaged_collapse_single_pmd to increase code reuse and to
provide an entry point for future khugepaged changes.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 653dbb1ff05c..4d932839ff1d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2348,6 +2348,52 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 }
 #endif
 
+/*
+ * Try to collapse a single PMD starting at a PMD aligned addr, and return
+ * the results.
+ */
+static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *mm,
+				   struct vm_area_struct *vma, bool *mmap_locked,
+				   struct collapse_control *cc)
+{
+	int result = SCAN_FAIL;
+	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
+
+	if (!*mmap_locked) {
+		mmap_read_lock(mm);
+		*mmap_locked = true;
+	}
+
+	if (thp_vma_allowable_order(vma, vma->vm_flags,
+					tva_flags, PMD_ORDER)) {
+		if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) {
+			struct file *file = get_file(vma->vm_file);
+			pgoff_t pgoff = linear_page_index(vma, addr);
+
+			mmap_read_unlock(mm);
+			*mmap_locked = false;
+			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
+							  cc);
+			fput(file);
+			if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
+				mmap_read_lock(mm);
+				if (hpage_collapse_test_exit_or_disable(mm))
+					goto end;
+				result = collapse_pte_mapped_thp(mm, addr,
+								 !cc->is_khugepaged);
+				mmap_read_unlock(mm);
+			}
+		} else {
+			result = hpage_collapse_scan_pmd(mm, vma, addr,
+							 mmap_locked, cc);
+		}
+		if (result == SCAN_SUCCEED || result == SCAN_PMD_MAPPED)
+			++khugepaged_pages_collapsed;
+	}
+end:
+	return result;
+}
+
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 					    struct collapse_control *cc)
 	__releases(&khugepaged_mm_lock)
-- 
2.47.1




* [RFC 02/11] khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot
  2025-01-08 23:31 [RFC 00/11] khugepaged: mTHP support Nico Pache
  2025-01-08 23:31 ` [RFC 01/11] introduce khugepaged_collapse_single_pmd to collapse a single pmd Nico Pache
@ 2025-01-08 23:31 ` Nico Pache
  2025-01-08 23:31 ` [RFC 03/11] khugepaged: Don't allocate khugepaged mm_slot early Nico Pache
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 53+ messages in thread
From: Nico Pache @ 2025-01-08 23:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm

Now that we have khugepaged_collapse_single_pmd, let's use it in
madvise_collapse and khugepaged_scan_mm_slot to create a single entry
point.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 50 ++++---------------------------------------------
 1 file changed, 4 insertions(+), 46 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4d932839ff1d..ba85a8fcee88 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2468,33 +2468,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 			VM_BUG_ON(khugepaged_scan.address < hstart ||
 				  khugepaged_scan.address + HPAGE_PMD_SIZE >
 				  hend);
-			if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) {
-				struct file *file = get_file(vma->vm_file);
-				pgoff_t pgoff = linear_page_index(vma,
-						khugepaged_scan.address);
 
-				mmap_read_unlock(mm);
-				mmap_locked = false;
-				*result = hpage_collapse_scan_file(mm,
-					khugepaged_scan.address, file, pgoff, cc);
-				fput(file);
-				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
-					mmap_read_lock(mm);
-					if (hpage_collapse_test_exit_or_disable(mm))
-						goto breakouterloop;
-					*result = collapse_pte_mapped_thp(mm,
-						khugepaged_scan.address, false);
-					if (*result == SCAN_PMD_MAPPED)
-						*result = SCAN_SUCCEED;
-					mmap_read_unlock(mm);
-				}
-			} else {
-				*result = hpage_collapse_scan_pmd(mm, vma,
-					khugepaged_scan.address, &mmap_locked, cc);
-			}
-
-			if (*result == SCAN_SUCCEED)
-				++khugepaged_pages_collapsed;
+			*result = khugepaged_collapse_single_pmd(khugepaged_scan.address,
+						mm, vma, &mmap_locked, cc);
 
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
@@ -2814,36 +2790,18 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		mmap_assert_locked(mm);
 		memset(cc->node_load, 0, sizeof(cc->node_load));
 		nodes_clear(cc->alloc_nmask);
-		if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) {
-			struct file *file = get_file(vma->vm_file);
-			pgoff_t pgoff = linear_page_index(vma, addr);
 
-			mmap_read_unlock(mm);
-			mmap_locked = false;
-			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
-							  cc);
-			fput(file);
-		} else {
-			result = hpage_collapse_scan_pmd(mm, vma, addr,
-							 &mmap_locked, cc);
-		}
+		result = khugepaged_collapse_single_pmd(addr, mm, vma, &mmap_locked, cc);
+
 		if (!mmap_locked)
 			*prev = NULL;  /* Tell caller we dropped mmap_lock */
 
-handle_result:
 		switch (result) {
 		case SCAN_SUCCEED:
 		case SCAN_PMD_MAPPED:
 			++thps;
 			break;
 		case SCAN_PTE_MAPPED_HUGEPAGE:
-			BUG_ON(mmap_locked);
-			BUG_ON(*prev);
-			mmap_read_lock(mm);
-			result = collapse_pte_mapped_thp(mm, addr, true);
-			mmap_read_unlock(mm);
-			goto handle_result;
-		/* Whitelisted set of results where continuing OK */
 		case SCAN_PMD_NULL:
 		case SCAN_PTE_NON_PRESENT:
 		case SCAN_PTE_UFFD_WP:
-- 
2.47.1




* [RFC 03/11] khugepaged: Don't allocate khugepaged mm_slot early
  2025-01-08 23:31 [RFC 00/11] khugepaged: mTHP support Nico Pache
  2025-01-08 23:31 ` [RFC 01/11] introduce khugepaged_collapse_single_pmd to collapse a single pmd Nico Pache
  2025-01-08 23:31 ` [RFC 02/11] khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot Nico Pache
@ 2025-01-08 23:31 ` Nico Pache
  2025-01-10  6:11   ` Dev Jain
  2025-01-08 23:31 ` [RFC 04/11] khugepaged: rename hpage_collapse_* to khugepaged_* Nico Pache
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 53+ messages in thread
From: Nico Pache @ 2025-01-08 23:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm

We should only "enter"/allocate the khugepaged mm_slot if we succeed at
allocating the PMD-sized folio. Move the khugepaged_enter_vma call to
after we know the vma_alloc_folio was successful.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/huge_memory.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e53d83b3e5cf..635c65e7ef63 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1323,7 +1323,6 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	ret = vmf_anon_prepare(vmf);
 	if (ret)
 		return ret;
-	khugepaged_enter_vma(vma, vma->vm_flags);
 
 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
 			!mm_forbids_zeropage(vma->vm_mm) &&
@@ -1365,7 +1364,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		}
 		return ret;
 	}
-
+	khugepaged_enter_vma(vma, vma->vm_flags);
 	return __do_huge_pmd_anonymous_page(vmf);
 }
 
-- 
2.47.1




* [RFC 04/11] khugepaged: rename hpage_collapse_* to khugepaged_*
  2025-01-08 23:31 [RFC 00/11] khugepaged: mTHP support Nico Pache
                   ` (2 preceding siblings ...)
  2025-01-08 23:31 ` [RFC 03/11] khugepaged: Don't allocate khugepaged mm_slot early Nico Pache
@ 2025-01-08 23:31 ` Nico Pache
  2025-01-08 23:31 ` [RFC 05/11] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 53+ messages in thread
From: Nico Pache @ 2025-01-08 23:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm

Functions in khugepaged.c use a mix of hpage_collapse and khugepaged
as the function prefix.

Rename all of them to use the khugepaged prefix to keep things consistent
and to slightly shorten the function names.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 52 ++++++++++++++++++++++++-------------------------
 1 file changed, 26 insertions(+), 26 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ba85a8fcee88..90de49d11a98 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -402,14 +402,14 @@ void __init khugepaged_destroy(void)
 	kmem_cache_destroy(mm_slot_cache);
 }
 
-static inline int hpage_collapse_test_exit(struct mm_struct *mm)
+static inline int khugepaged_test_exit(struct mm_struct *mm)
 {
 	return atomic_read(&mm->mm_users) == 0;
 }
 
-static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
+static inline int khugepaged_test_exit_or_disable(struct mm_struct *mm)
 {
-	return hpage_collapse_test_exit(mm) ||
+	return khugepaged_test_exit(mm) ||
 	       test_bit(MMF_DISABLE_THP, &mm->flags);
 }
 
@@ -444,7 +444,7 @@ void __khugepaged_enter(struct mm_struct *mm)
 	int wakeup;
 
 	/* __khugepaged_exit() must not run from under us */
-	VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
+	VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
 	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags)))
 		return;
 
@@ -503,7 +503,7 @@ void __khugepaged_exit(struct mm_struct *mm)
 	} else if (mm_slot) {
 		/*
 		 * This is required to serialize against
-		 * hpage_collapse_test_exit() (which is guaranteed to run
+		 * khugepaged_test_exit() (which is guaranteed to run
 		 * under mmap sem read mode). Stop here (after we return all
 		 * pagetables will be destroyed) until khugepaged has finished
 		 * working on the pagetables under the mmap_lock.
@@ -606,7 +606,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		folio = page_folio(page);
 		VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
 
-		/* See hpage_collapse_scan_pmd(). */
+		/* See khugepaged_scan_pmd(). */
 		if (folio_likely_mapped_shared(folio)) {
 			++shared;
 			if (cc->is_khugepaged &&
@@ -851,7 +851,7 @@ struct collapse_control khugepaged_collapse_control = {
 	.is_khugepaged = true,
 };
 
-static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
+static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
 {
 	int i;
 
@@ -886,7 +886,7 @@ static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 }
 
 #ifdef CONFIG_NUMA
-static int hpage_collapse_find_target_node(struct collapse_control *cc)
+static int khugepaged_find_target_node(struct collapse_control *cc)
 {
 	int nid, target_node = 0, max_value = 0;
 
@@ -905,7 +905,7 @@ static int hpage_collapse_find_target_node(struct collapse_control *cc)
 	return target_node;
 }
 #else
-static int hpage_collapse_find_target_node(struct collapse_control *cc)
+static int khugepaged_find_target_node(struct collapse_control *cc)
 {
 	return 0;
 }
@@ -925,7 +925,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	struct vm_area_struct *vma;
 	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
 
-	if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+	if (unlikely(khugepaged_test_exit_or_disable(mm)))
 		return SCAN_ANY_PROCESS;
 
 	*vmap = vma = find_vma(mm, address);
@@ -988,7 +988,7 @@ static int check_pmd_still_valid(struct mm_struct *mm,
 
 /*
  * Bring missing pages in from swap, to complete THP collapse.
- * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
+ * Only done if khugepaged_scan_pmd believes it is worthwhile.
  *
  * Called and returns without pte mapped or spinlocks held.
  * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
@@ -1074,7 +1074,7 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
 {
 	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
 		     GFP_TRANSHUGE);
-	int node = hpage_collapse_find_target_node(cc);
+	int node = khugepaged_find_target_node(cc);
 	struct folio *folio;
 
 	folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
@@ -1260,7 +1260,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	return result;
 }
 
-static int hpage_collapse_scan_pmd(struct mm_struct *mm,
+static int khugepaged_scan_pmd(struct mm_struct *mm,
 				   struct vm_area_struct *vma,
 				   unsigned long address, bool *mmap_locked,
 				   struct collapse_control *cc)
@@ -1376,7 +1376,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 		 * hit record.
 		 */
 		node = folio_nid(folio);
-		if (hpage_collapse_scan_abort(node, cc)) {
+		if (khugepaged_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			goto out_unmap;
 		}
@@ -1445,7 +1445,7 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
 
 	lockdep_assert_held(&khugepaged_mm_lock);
 
-	if (hpage_collapse_test_exit(mm)) {
+	if (khugepaged_test_exit(mm)) {
 		/* free mm_slot */
 		hash_del(&slot->hash);
 		list_del(&slot->mm_node);
@@ -1740,7 +1740,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
 			continue;
 
-		if (hpage_collapse_test_exit(mm))
+		if (khugepaged_test_exit(mm))
 			continue;
 		/*
 		 * When a vma is registered with uffd-wp, we cannot recycle
@@ -2249,7 +2249,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	return result;
 }
 
-static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
+static int khugepaged_scan_file(struct mm_struct *mm, unsigned long addr,
 				    struct file *file, pgoff_t start,
 				    struct collapse_control *cc)
 {
@@ -2294,7 +2294,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 		}
 
 		node = folio_nid(folio);
-		if (hpage_collapse_scan_abort(node, cc)) {
+		if (khugepaged_scan_abort(node, cc)) {
 			result = SCAN_SCAN_ABORT;
 			break;
 		}
@@ -2340,7 +2340,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 	return result;
 }
 #else
-static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
+static int khugepaged_scan_file(struct mm_struct *mm, unsigned long addr,
 				    struct file *file, pgoff_t start,
 				    struct collapse_control *cc)
 {
@@ -2372,19 +2372,19 @@ static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *
 
 			mmap_read_unlock(mm);
 			*mmap_locked = false;
-			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
+			result = khugepaged_scan_file(mm, addr, file, pgoff,
 							  cc);
 			fput(file);
 			if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
 				mmap_read_lock(mm);
-				if (hpage_collapse_test_exit_or_disable(mm))
+				if (khugepaged_test_exit_or_disable(mm))
 					goto end;
 				result = collapse_pte_mapped_thp(mm, addr,
 								 !cc->is_khugepaged);
 				mmap_read_unlock(mm);
 			}
 		} else {
-			result = hpage_collapse_scan_pmd(mm, vma, addr,
+			result = khugepaged_scan_pmd(mm, vma, addr,
 							 mmap_locked, cc);
 		}
 		if (result == SCAN_SUCCEED || result == SCAN_PMD_MAPPED)
@@ -2432,7 +2432,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 		goto breakouterloop_mmap_lock;
 
 	progress++;
-	if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+	if (unlikely(khugepaged_test_exit_or_disable(mm)))
 		goto breakouterloop;
 
 	vma_iter_init(&vmi, mm, khugepaged_scan.address);
@@ -2440,7 +2440,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 		unsigned long hstart, hend;
 
 		cond_resched();
-		if (unlikely(hpage_collapse_test_exit_or_disable(mm))) {
+		if (unlikely(khugepaged_test_exit_or_disable(mm))) {
 			progress++;
 			break;
 		}
@@ -2462,7 +2462,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 			bool mmap_locked = true;
 
 			cond_resched();
-			if (unlikely(hpage_collapse_test_exit_or_disable(mm)))
+			if (unlikely(khugepaged_test_exit_or_disable(mm)))
 				goto breakouterloop;
 
 			VM_BUG_ON(khugepaged_scan.address < hstart ||
@@ -2498,7 +2498,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 	 * Release the current mm_slot if this mm is about to die, or
 	 * if we scanned all vmas of this mm.
 	 */
-	if (hpage_collapse_test_exit(mm) || !vma) {
+	if (khugepaged_test_exit(mm) || !vma) {
 		/*
 		 * Make sure that if mm_users is reaching zero while
 		 * khugepaged runs here, khugepaged_exit will find
-- 
2.47.1




* [RFC 05/11] khugepaged: generalize hugepage_vma_revalidate for mTHP support
  2025-01-08 23:31 [RFC 00/11] khugepaged: mTHP support Nico Pache
                   ` (3 preceding siblings ...)
  2025-01-08 23:31 ` [RFC 04/11] khugepaged: rename hpage_collapse_* to khugepaged_* Nico Pache
@ 2025-01-08 23:31 ` Nico Pache
  2025-01-08 23:31 ` [RFC 06/11] khugepaged: generalize alloc_charge_folio " Nico Pache
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 53+ messages in thread
From: Nico Pache @ 2025-01-08 23:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm

For khugepaged to support different mTHP orders, we must generalize this
function for arbitrary orders.

No functional change in this patch.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 90de49d11a98..e2e6ca9265ab 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -920,7 +920,7 @@ static int khugepaged_find_target_node(struct collapse_control *cc)
 static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 				   bool expect_anon,
 				   struct vm_area_struct **vmap,
-				   struct collapse_control *cc)
+				   struct collapse_control *cc, int order)
 {
 	struct vm_area_struct *vma;
 	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
@@ -932,9 +932,9 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	if (!vma)
 		return SCAN_VMA_NULL;
 
-	if (!thp_vma_suitable_order(vma, address, PMD_ORDER))
+	if (!thp_vma_suitable_order(vma, address, order))
 		return SCAN_ADDRESS_RANGE;
-	if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, PMD_ORDER))
+	if (!thp_vma_allowable_order(vma, vma->vm_flags, tva_flags, order))
 		return SCAN_VMA_CHECK;
 	/*
 	 * Anon VMA expected, the address may be unmapped then
@@ -1126,7 +1126,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		goto out_nolock;
 
 	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
@@ -1160,7 +1160,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * mmap_lock.
 	 */
 	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
+	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
@@ -2779,7 +2779,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 			mmap_read_lock(mm);
 			mmap_locked = true;
 			result = hugepage_vma_revalidate(mm, addr, false, &vma,
-							 cc);
+							 cc, HPAGE_PMD_ORDER);
 			if (result  != SCAN_SUCCEED) {
 				last_fail = result;
 				goto out_nolock;
-- 
2.47.1




* [RFC 06/11] khugepaged: generalize alloc_charge_folio for mTHP support
  2025-01-08 23:31 [RFC 00/11] khugepaged: mTHP support Nico Pache
                   ` (4 preceding siblings ...)
  2025-01-08 23:31 ` [RFC 05/11] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
@ 2025-01-08 23:31 ` Nico Pache
  2025-01-10  6:23   ` Dev Jain
  2025-01-08 23:31 ` [RFC 07/11] khugepaged: generalize __collapse_huge_page_* " Nico Pache
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 53+ messages in thread
From: Nico Pache @ 2025-01-08 23:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm

alloc_charge_folio allocates the new folio for the khugepaged collapse.
Generalize the order of the folio allocations to support future mTHP
collapsing.

No functional changes in this patch.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e2e6ca9265ab..6daf3a943a1a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1070,14 +1070,14 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 }
 
 static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
-			      struct collapse_control *cc)
+			      struct collapse_control *cc, int order)
 {
 	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
 		     GFP_TRANSHUGE);
 	int node = khugepaged_find_target_node(cc);
 	struct folio *folio;
 
-	folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
+	folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
 	if (!folio) {
 		*foliop = NULL;
 		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
@@ -1121,7 +1121,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 */
 	mmap_read_unlock(mm);
 
-	result = alloc_charge_folio(&folio, mm, cc);
+	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
 		goto out_nolock;
 
@@ -1834,7 +1834,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
 	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
 
-	result = alloc_charge_folio(&new_folio, mm, cc);
+	result = alloc_charge_folio(&new_folio, mm, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
 		goto out;
 
-- 
2.47.1




* [RFC 07/11] khugepaged: generalize __collapse_huge_page_* for mTHP support
  2025-01-08 23:31 [RFC 00/11] khugepaged: mTHP support Nico Pache
                   ` (5 preceding siblings ...)
  2025-01-08 23:31 ` [RFC 06/11] khugepaged: generalize alloc_charge_folio " Nico Pache
@ 2025-01-08 23:31 ` Nico Pache
  2025-01-10  6:38   ` Dev Jain
  2025-01-08 23:31 ` [RFC 08/11] khugepaged: introduce khugepaged_scan_bitmap " Nico Pache
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 53+ messages in thread
From: Nico Pache @ 2025-01-08 23:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm

Generalize the order of the __collapse_huge_page_* functions
to support future mTHP collapse.

No functional changes in this patch.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 36 +++++++++++++++++++-----------------
 1 file changed, 19 insertions(+), 17 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6daf3a943a1a..9eb161b04ee4 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -565,7 +565,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 					unsigned long address,
 					pte_t *pte,
 					struct collapse_control *cc,
-					struct list_head *compound_pagelist)
+					struct list_head *compound_pagelist,
+					u8 order)
 {
 	struct page *page = NULL;
 	struct folio *folio = NULL;
@@ -573,7 +574,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
 	bool writable = false;
 
-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+	for (_pte = pte; _pte < pte + (1 << order);
 	     _pte++, address += PAGE_SIZE) {
 		pte_t pteval = ptep_get(_pte);
 		if (pte_none(pteval) || (pte_present(pteval) &&
@@ -711,14 +712,15 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 						struct vm_area_struct *vma,
 						unsigned long address,
 						spinlock_t *ptl,
-						struct list_head *compound_pagelist)
+						struct list_head *compound_pagelist,
+						u8 order)
 {
 	struct folio *src, *tmp;
 	pte_t *_pte;
 	pte_t pteval;
 
-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
-	     _pte++, address += PAGE_SIZE) {
+	for (_pte = pte; _pte < pte + (1 << order);
+		_pte++, address += PAGE_SIZE) {
 		pteval = ptep_get(_pte);
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
 			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
@@ -764,7 +766,8 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
 					     pmd_t *pmd,
 					     pmd_t orig_pmd,
 					     struct vm_area_struct *vma,
-					     struct list_head *compound_pagelist)
+					     struct list_head *compound_pagelist,
+					     u8 order)
 {
 	spinlock_t *pmd_ptl;
 
@@ -781,7 +784,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
 	 * Release both raw and compound pages isolated
 	 * in __collapse_huge_page_isolate.
 	 */
-	release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
+	release_pte_pages(pte, pte + (1 << order), compound_pagelist);
 }
 
 /*
@@ -802,7 +805,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
 static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
 		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
 		unsigned long address, spinlock_t *ptl,
-		struct list_head *compound_pagelist)
+		struct list_head *compound_pagelist, u8 order)
 {
 	unsigned int i;
 	int result = SCAN_SUCCEED;
@@ -810,7 +813,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
 	/*
 	 * Copying pages' contents is subject to memory poison at any iteration.
 	 */
-	for (i = 0; i < HPAGE_PMD_NR; i++) {
+	for (i = 0; i < (1 << order); i++) {
 		pte_t pteval = ptep_get(pte + i);
 		struct page *page = folio_page(folio, i);
 		unsigned long src_addr = address + i * PAGE_SIZE;
@@ -829,10 +832,10 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
 
 	if (likely(result == SCAN_SUCCEED))
 		__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
-						    compound_pagelist);
+						    compound_pagelist, order);
 	else
 		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
-						 compound_pagelist);
+						 compound_pagelist, order);
 
 	return result;
 }
@@ -996,11 +999,11 @@ static int check_pmd_still_valid(struct mm_struct *mm,
 static int __collapse_huge_page_swapin(struct mm_struct *mm,
 				       struct vm_area_struct *vma,
 				       unsigned long haddr, pmd_t *pmd,
-				       int referenced)
+				       int referenced, u8 order)
 {
 	int swapped_in = 0;
 	vm_fault_t ret = 0;
-	unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
+	unsigned long address, end = haddr + ((1 << order) * PAGE_SIZE);
 	int result;
 	pte_t *pte = NULL;
 	spinlock_t *ptl;
@@ -1110,7 +1113,6 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	int result = SCAN_FAIL;
 	struct vm_area_struct *vma;
 	struct mmu_notifier_range range;
-
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
 	/*
@@ -1145,7 +1147,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		 * that case.  Continuing to collapse causes inconsistency.
 		 */
 		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
-						     referenced);
+				referenced, HPAGE_PMD_ORDER);
 		if (result != SCAN_SUCCEED)
 			goto out_nolock;
 	}
@@ -1192,7 +1194,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
 	if (pte) {
 		result = __collapse_huge_page_isolate(vma, address, pte, cc,
-						      &compound_pagelist);
+					&compound_pagelist, HPAGE_PMD_ORDER);
 		spin_unlock(pte_ptl);
 	} else {
 		result = SCAN_PMD_NULL;
@@ -1222,7 +1224,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 
 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
 					   vma, address, pte_ptl,
-					   &compound_pagelist);
+					   &compound_pagelist, HPAGE_PMD_ORDER);
 	pte_unmap(pte);
 	if (unlikely(result != SCAN_SUCCEED))
 		goto out_up_write;
-- 
2.47.1




* [RFC 08/11] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  2025-01-08 23:31 [RFC 00/11] khugepaged: mTHP support Nico Pache
                   ` (6 preceding siblings ...)
  2025-01-08 23:31 ` [RFC 07/11] khugepaged: generalize __collapse_huge_page_* " Nico Pache
@ 2025-01-08 23:31 ` Nico Pache
  2025-01-10  9:05   ` Dev Jain
                     ` (2 more replies)
  2025-01-08 23:31 ` [RFC 09/11] khugepaged: add " Nico Pache
                   ` (5 subsequent siblings)
  13 siblings, 3 replies; 53+ messages in thread
From: Nico Pache @ 2025-01-08 23:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm

khugepaged scans PMD ranges for potential collapse to a hugepage. To add
mTHP support we use this scan to instead record chunks of fully utilized
sections of the PMD.

Create a bitmap that represents a PMD in chunks of order MIN_MTHP_ORDER.
By default we set this to order 3. The reasoning is that for 4K pages with
512 PTEs per PMD this results in a 64-bit bitmap, which allows for some
optimizations. For other arches, like arm64 with 64K pages, we can set a
larger order if needed.

khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
that represents chunks of fully utilized regions. We can then determine
which mTHP size fits best, and in the following patch we set this bitmap
while scanning the PMD.

max_ptes_none is used as a scale to determine how "full" an order must
be before being considered for collapse.
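
As a worked example of the traversal implemented below: on x86 with 4K pages
the bitmap has 64 chunks, so the stack starts with {order 6, offset 0}, i.e.
the whole PMD. If that region is not "full" enough (or its order is not
enabled), it is split into {order 5, offset 0} and {order 5, offset 32},
each of which is tested and possibly split again, down to order 0, which
corresponds to a single order-MIN_MTHP_ORDER mTHP.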

Signed-off-by: Nico Pache <npache@redhat.com>
---
 include/linux/khugepaged.h |   4 +-
 mm/khugepaged.c            | 129 +++++++++++++++++++++++++++++++++++--
 2 files changed, 126 insertions(+), 7 deletions(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 1f46046080f5..31cff8aeec4a 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -1,7 +1,9 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_KHUGEPAGED_H
 #define _LINUX_KHUGEPAGED_H
-
+#define MIN_MTHP_ORDER	3
+#define MIN_MTHP_NR	(1<<MIN_MTHP_ORDER)
+#define MTHP_BITMAP_SIZE  (1<<(HPAGE_PMD_ORDER - MIN_MTHP_ORDER))
 extern unsigned int khugepaged_max_ptes_none __read_mostly;
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern struct attribute_group khugepaged_attr_group;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 9eb161b04ee4..de1dc6ea3c71 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 
 static struct kmem_cache *mm_slot_cache __ro_after_init;
 
+struct scan_bit_state {
+	u8 order;
+	u8 offset;
+};
+
 struct collapse_control {
 	bool is_khugepaged;
 
@@ -102,6 +107,15 @@ struct collapse_control {
 
 	/* nodemask for allocation fallback */
 	nodemask_t alloc_nmask;
+
+	/* bitmap used to collapse mTHP sizes. 1bit = order MIN_MTHP_ORDER mTHP */
+	unsigned long *mthp_bitmap;
+	unsigned long *mthp_bitmap_temp;
+	struct scan_bit_state *mthp_bitmap_stack;
+};
+
+struct collapse_control khugepaged_collapse_control = {
+	.is_khugepaged = true,
 };
 
 /**
@@ -389,6 +403,25 @@ int __init khugepaged_init(void)
 	if (!mm_slot_cache)
 		return -ENOMEM;
 
+	/*
+	 * allocate the bitmaps dynamically since MTHP_BITMAP_SIZE is not known at
+	 * compile time for some architectures.
+	 */
+	khugepaged_collapse_control.mthp_bitmap = kmalloc_array(
+		BITS_TO_LONGS(MTHP_BITMAP_SIZE), sizeof(unsigned long), GFP_KERNEL);
+	if (!khugepaged_collapse_control.mthp_bitmap)
+		return -ENOMEM;
+
+	khugepaged_collapse_control.mthp_bitmap_temp = kmalloc_array(
+		BITS_TO_LONGS(MTHP_BITMAP_SIZE), sizeof(unsigned long), GFP_KERNEL);
+	if (!khugepaged_collapse_control.mthp_bitmap_temp)
+		return -ENOMEM;
+
+	khugepaged_collapse_control.mthp_bitmap_stack = kmalloc_array(
+		MTHP_BITMAP_SIZE, sizeof(struct scan_bit_state), GFP_KERNEL);
+	if (!khugepaged_collapse_control.mthp_bitmap_stack)
+		return -ENOMEM;
+
 	khugepaged_pages_to_scan = HPAGE_PMD_NR * 8;
 	khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;
 	khugepaged_max_ptes_swap = HPAGE_PMD_NR / 8;
@@ -400,6 +433,9 @@ int __init khugepaged_init(void)
 void __init khugepaged_destroy(void)
 {
 	kmem_cache_destroy(mm_slot_cache);
+	kfree(khugepaged_collapse_control.mthp_bitmap);
+	kfree(khugepaged_collapse_control.mthp_bitmap_temp);
+	kfree(khugepaged_collapse_control.mthp_bitmap_stack);
 }
 
 static inline int khugepaged_test_exit(struct mm_struct *mm)
@@ -850,10 +886,6 @@ static void khugepaged_alloc_sleep(void)
 	remove_wait_queue(&khugepaged_wait, &wait);
 }
 
-struct collapse_control khugepaged_collapse_control = {
-	.is_khugepaged = true,
-};
-
 static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
 {
 	int i;
@@ -1102,7 +1134,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
 
 static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 			      int referenced, int unmapped,
-			      struct collapse_control *cc)
+			      struct collapse_control *cc, bool *mmap_locked,
+				  int order, int offset)
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
@@ -1115,6 +1148,11 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	struct mmu_notifier_range range;
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
+	/* if collapsing mTHPs we may have already released the read_lock, and
+	 * need to reaquire it to keep the proper locking order.
+	 */
+	if (!*mmap_locked)
+		mmap_read_lock(mm);
 	/*
 	 * Before allocating the hugepage, release the mmap_lock read lock.
 	 * The allocation can take potentially a long time if it involves
@@ -1122,6 +1160,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * that. We will recheck the vma after taking it again in write mode.
 	 */
 	mmap_read_unlock(mm);
+	*mmap_locked = false;
 
 	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
 	if (result != SCAN_SUCCEED)
@@ -1256,12 +1295,71 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 out_up_write:
 	mmap_write_unlock(mm);
 out_nolock:
+	*mmap_locked = false;
 	if (folio)
 		folio_put(folio);
 	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
 	return result;
 }
 
+// Recursive function to consume the bitmap
+static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
+			int referenced, int unmapped, struct collapse_control *cc,
+			bool *mmap_locked, unsigned long enabled_orders)
+{
+	u8 order, offset;
+	int num_chunks;
+	int bits_set, max_percent, threshold_bits;
+	int next_order, mid_offset;
+	int top = -1;
+	int collapsed = 0;
+	int ret;
+	struct scan_bit_state state;
+
+	cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
+		{ HPAGE_PMD_ORDER - MIN_MTHP_ORDER, 0 };
+
+	while (top >= 0) {
+		state = cc->mthp_bitmap_stack[top--];
+		order = state.order;
+		offset = state.offset;
+		num_chunks = 1 << order;
+		// Skip mTHP orders that are not enabled
+		if (!(enabled_orders >> (order +  MIN_MTHP_ORDER)) & 1)
+			goto next;
+
+		// copy the relavant section to a new bitmap
+		bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
+				  MTHP_BITMAP_SIZE);
+
+		bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
+
+		// Check if the region is "almost full" based on the threshold
+		max_percent = ((HPAGE_PMD_NR - khugepaged_max_ptes_none - 1) * 100)
+						/ (HPAGE_PMD_NR - 1);
+		threshold_bits = (max_percent * num_chunks) / 100;
+
+		if (bits_set >= threshold_bits) {
+			ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
+					mmap_locked, order + MIN_MTHP_ORDER, offset * MIN_MTHP_NR);
+			if (ret == SCAN_SUCCEED)
+				collapsed += (1 << (order + MIN_MTHP_ORDER));
+			continue;
+		}
+
+next:
+		if (order > 0) {
+			next_order = order - 1;
+			mid_offset = offset + (num_chunks / 2);
+			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
+				{ next_order, mid_offset };
+			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
+				{ next_order, offset };
+			}
+	}
+	return collapsed;
+}
+
 static int khugepaged_scan_pmd(struct mm_struct *mm,
 				   struct vm_area_struct *vma,
 				   unsigned long address, bool *mmap_locked,
@@ -1430,7 +1528,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	pte_unmap_unlock(pte, ptl);
 	if (result == SCAN_SUCCEED) {
 		result = collapse_huge_page(mm, address, referenced,
-					    unmapped, cc);
+					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
 		/* collapse_huge_page will return with the mmap_lock released */
 		*mmap_locked = false;
 	}
@@ -2767,6 +2865,21 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		return -ENOMEM;
 	cc->is_khugepaged = false;
 
+	cc->mthp_bitmap = kmalloc_array(
+		BITS_TO_LONGS(MTHP_BITMAP_SIZE), sizeof(unsigned long), GFP_KERNEL);
+	if (!cc->mthp_bitmap)
+		return -ENOMEM;
+
+	cc->mthp_bitmap_temp = kmalloc_array(
+		BITS_TO_LONGS(MTHP_BITMAP_SIZE), sizeof(unsigned long), GFP_KERNEL);
+	if (!cc->mthp_bitmap_temp)
+		return -ENOMEM;
+
+	cc->mthp_bitmap_stack = kmalloc_array(
+		MTHP_BITMAP_SIZE, sizeof(struct scan_bit_state), GFP_KERNEL);
+	if (!cc->mthp_bitmap_stack)
+		return -ENOMEM;
+
 	mmgrab(mm);
 	lru_add_drain_all();
 
@@ -2831,8 +2944,12 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 out_nolock:
 	mmap_assert_locked(mm);
 	mmdrop(mm);
+	kfree(cc->mthp_bitmap);
+	kfree(cc->mthp_bitmap_temp);
+	kfree(cc->mthp_bitmap_stack);
 	kfree(cc);
 
+
 	return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
 			: madvise_collapse_errno(last_fail);
 }
-- 
2.47.1




* [RFC 09/11] khugepaged: add mTHP support
  2025-01-08 23:31 [RFC 00/11] khugepaged: mTHP support Nico Pache
                   ` (7 preceding siblings ...)
  2025-01-08 23:31 ` [RFC 08/11] khugepaged: introduce khugepaged_scan_bitmap " Nico Pache
@ 2025-01-08 23:31 ` Nico Pache
  2025-01-10  9:20   ` Dev Jain
  2025-01-10 13:36   ` Dev Jain
  2025-01-08 23:31 ` [RFC 10/11] khugepaged: remove max_ptes_none restriction on the pmd scan Nico Pache
                   ` (4 subsequent siblings)
  13 siblings, 2 replies; 53+ messages in thread
From: Nico Pache @ 2025-01-08 23:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm

Introduce the ability for khugepaged to collapse to different mTHP sizes.
While scanning a PMD range for potential hugepage collapse, track pages
in MIN_MTHP_ORDER chunks. Each bit represents a fully utilized region of
order MIN_MTHP_ORDER PTEs.

With this bitmap we can determine which mTHP sizes would be the most
efficient to collapse to if the PMD collapse is not suitable.
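
For example, with MIN_MTHP_ORDER = 3, bit 0 of the bitmap covers PTEs 0-7 of
the PMD range, bit 1 covers PTEs 8-15, and so on; a bit is only set when no
PTE in its chunk is pte_none or maps the zero page.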

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 111 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 77 insertions(+), 34 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index de1dc6ea3c71..4d3c560f20b4 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1139,13 +1139,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
-	pte_t *pte;
+	pte_t *pte, mthp_pte;
 	pgtable_t pgtable;
 	struct folio *folio;
 	spinlock_t *pmd_ptl, *pte_ptl;
 	int result = SCAN_FAIL;
 	struct vm_area_struct *vma;
 	struct mmu_notifier_range range;
+	unsigned long _address = address + offset * PAGE_SIZE;
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
 	/* if collapsing mTHPs we may have already released the read_lock, and
@@ -1162,12 +1163,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	mmap_read_unlock(mm);
 	*mmap_locked = false;
 
-	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
+	result = alloc_charge_folio(&folio, mm, cc, order);
 	if (result != SCAN_SUCCEED)
 		goto out_nolock;
 
 	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
+	*mmap_locked = true;
+	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
@@ -1185,13 +1187,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		 * released when it fails. So we jump out_nolock directly in
 		 * that case.  Continuing to collapse causes inconsistency.
 		 */
-		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
-				referenced, HPAGE_PMD_ORDER);
+		result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
+				referenced, order);
 		if (result != SCAN_SUCCEED)
 			goto out_nolock;
 	}
 
 	mmap_read_unlock(mm);
+	*mmap_locked = false;
 	/*
 	 * Prevent all access to pagetables with the exception of
 	 * gup_fast later handled by the ptep_clear_flush and the VM
@@ -1201,7 +1204,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * mmap_lock.
 	 */
 	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
+	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
@@ -1212,11 +1215,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	vma_start_write(vma);
 	anon_vma_lock_write(vma->anon_vma);
 
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
-				address + HPAGE_PMD_SIZE);
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
+				_address + (PAGE_SIZE << order));
 	mmu_notifier_invalidate_range_start(&range);
 
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
+
 	/*
 	 * This removes any huge TLB entry from the CPU so we won't allow
 	 * huge and small TLB entries for the same virtual address to
@@ -1230,10 +1234,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_remove_table_sync_one();
 
-	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
+	pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
 	if (pte) {
-		result = __collapse_huge_page_isolate(vma, address, pte, cc,
-					&compound_pagelist, HPAGE_PMD_ORDER);
+		result = __collapse_huge_page_isolate(vma, _address, pte, cc,
+					&compound_pagelist, order);
 		spin_unlock(pte_ptl);
 	} else {
 		result = SCAN_PMD_NULL;
@@ -1262,8 +1266,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	anon_vma_unlock_write(vma->anon_vma);
 
 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
-					   vma, address, pte_ptl,
-					   &compound_pagelist, HPAGE_PMD_ORDER);
+					   vma, _address, pte_ptl,
+					   &compound_pagelist, order);
 	pte_unmap(pte);
 	if (unlikely(result != SCAN_SUCCEED))
 		goto out_up_write;
@@ -1274,20 +1278,37 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * write.
 	 */
 	__folio_mark_uptodate(folio);
-	pgtable = pmd_pgtable(_pmd);
-
-	_pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
-	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
-
-	spin_lock(pmd_ptl);
-	BUG_ON(!pmd_none(*pmd));
-	folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
-	folio_add_lru_vma(folio, vma);
-	pgtable_trans_huge_deposit(mm, pmd, pgtable);
-	set_pmd_at(mm, address, pmd, _pmd);
-	update_mmu_cache_pmd(vma, address, pmd);
-	deferred_split_folio(folio, false);
-	spin_unlock(pmd_ptl);
+	if (order == HPAGE_PMD_ORDER) {
+		pgtable = pmd_pgtable(_pmd);
+		_pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
+		_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
+
+		spin_lock(pmd_ptl);
+		BUG_ON(!pmd_none(*pmd));
+		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
+		folio_add_lru_vma(folio, vma);
+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		set_pmd_at(mm, address, pmd, _pmd);
+		update_mmu_cache_pmd(vma, address, pmd);
+		deferred_split_folio(folio, false);
+		spin_unlock(pmd_ptl);
+	} else { //mTHP
+		mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
+		mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
+
+		spin_lock(pmd_ptl);
+		folio_ref_add(folio, (1 << order) - 1);
+		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
+		folio_add_lru_vma(folio, vma);
+		spin_lock(pte_ptl);
+		set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
+		update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
+		spin_unlock(pte_ptl);
+		smp_wmb(); /* make pte visible before pmd */
+		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
+		deferred_split_folio(folio, false);
+		spin_unlock(pmd_ptl);
+	}
 
 	folio = NULL;
 
@@ -1367,21 +1388,26 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
+	int i;
 	int result = SCAN_FAIL, referenced = 0;
 	int none_or_zero = 0, shared = 0;
 	struct page *page = NULL;
 	struct folio *folio = NULL;
 	unsigned long _address;
+	unsigned long enabled_orders;
 	spinlock_t *ptl;
 	int node = NUMA_NO_NODE, unmapped = 0;
 	bool writable = false;
-
+	bool all_valid = true;
+	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
 	result = find_pmd_or_thp_or_none(mm, address, &pmd);
 	if (result != SCAN_SUCCEED)
 		goto out;
 
+	bitmap_zero(cc->mthp_bitmap, 1 << (HPAGE_PMD_ORDER - MIN_MTHP_ORDER));
+	bitmap_zero(cc->mthp_bitmap_temp, 1 << (HPAGE_PMD_ORDER - MIN_MTHP_ORDER));
 	memset(cc->node_load, 0, sizeof(cc->node_load));
 	nodes_clear(cc->alloc_nmask);
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
@@ -1390,8 +1416,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		goto out;
 	}
 
-	for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
-	     _pte++, _address += PAGE_SIZE) {
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		if (i % MIN_MTHP_NR == 0)
+			all_valid = true;
+
+		_pte = pte + i;
+		_address = address + i * PAGE_SIZE;
 		pte_t pteval = ptep_get(_pte);
 		if (is_swap_pte(pteval)) {
 			++unmapped;
@@ -1414,6 +1444,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 			}
 		}
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
+			all_valid = false;
 			++none_or_zero;
 			if (!userfaultfd_armed(vma) &&
 			    (!cc->is_khugepaged ||
@@ -1514,7 +1545,15 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		     folio_test_referenced(folio) || mmu_notifier_test_young(vma->vm_mm,
 								     address)))
 			referenced++;
+
+		/*
+		 * we are reading in MIN_MTHP_NR page chunks. if there are no empty
+		 * pages keep track of it in the bitmap for mTHP collapsing.
+		 */
+		if (all_valid && (i + 1) % MIN_MTHP_NR == 0)
+			bitmap_set(cc->mthp_bitmap, i / MIN_MTHP_NR, 1);
 	}
+
 	if (!writable) {
 		result = SCAN_PAGE_RO;
 	} else if (cc->is_khugepaged &&
@@ -1527,10 +1566,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (result == SCAN_SUCCEED) {
-		result = collapse_huge_page(mm, address, referenced,
-					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
-		/* collapse_huge_page will return with the mmap_lock released */
-		*mmap_locked = false;
+		enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
+			tva_flags, THP_ORDERS_ALL_ANON);
+		result = khugepaged_scan_bitmap(mm, address, referenced, unmapped, cc,
+			       mmap_locked, enabled_orders);
+		if (result > 0)
+			result = SCAN_SUCCEED;
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
@@ -2477,11 +2518,13 @@ static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *
 			fput(file);
 			if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
 				mmap_read_lock(mm);
+				*mmap_locked = true;
 				if (khugepaged_test_exit_or_disable(mm))
 					goto end;
 				result = collapse_pte_mapped_thp(mm, addr,
 								 !cc->is_khugepaged);
 				mmap_read_unlock(mm);
+				*mmap_locked = false;
 			}
 		} else {
 			result = khugepaged_scan_pmd(mm, vma, addr,
-- 
2.47.1




* [RFC 10/11] khugepaged: remove max_ptes_none restriction on the pmd scan
  2025-01-08 23:31 [RFC 00/11] khugepaged: mTHP support Nico Pache
                   ` (8 preceding siblings ...)
  2025-01-08 23:31 ` [RFC 09/11] khugepaged: add " Nico Pache
@ 2025-01-08 23:31 ` Nico Pache
  2025-01-08 23:31 ` [RFC 11/11] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 53+ messages in thread
From: Nico Pache @ 2025-01-08 23:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm

Now that we have mTHP support, which uses max_ptes_none to determine how
"full" an mTHP size needs to be before collapsing, let's remove the
restriction during the scan phase so we don't bail out early and miss
potential mTHP candidates.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4d3c560f20b4..61a349eb3cf4 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1446,15 +1446,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
 			all_valid = false;
 			++none_or_zero;
-			if (!userfaultfd_armed(vma) &&
-			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
-				continue;
-			} else {
+			if (userfaultfd_armed(vma)) {
 				result = SCAN_EXCEED_NONE_PTE;
 				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 				goto out_unmap;
 			}
+			continue;
 		}
 		if (pte_uffd_wp(pteval)) {
 			/*
-- 
2.47.1



^ permalink raw reply	[flat|nested] 53+ messages in thread

* [RFC 11/11] khugepaged: skip collapsing mTHP to smaller orders
  2025-01-08 23:31 [RFC 00/11] khugepaged: mTHP support Nico Pache
                   ` (9 preceding siblings ...)
  2025-01-08 23:31 ` [RFC 10/11] khugepaged: remove max_ptes_none restriction on the pmd scan Nico Pache
@ 2025-01-08 23:31 ` Nico Pache
  2025-01-09  6:22 ` [RFC 00/11] khugepaged: mTHP support Dev Jain
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 53+ messages in thread
From: Nico Pache @ 2025-01-08 23:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm

khugepaged may try to collapse an mTHP to a smaller mTHP, resulting in
some pages being unmapped. Skip these cases until we have a way to check
whether it is OK to collapse to a smaller mTHP size (like in the case of
a partially mapped folio).

This patch is inspired by Dev Jain's work on khugepaged mTHP support [1].

[1] https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 61a349eb3cf4..046843a0d632 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -643,6 +643,11 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		folio = page_folio(page);
 		VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
 
+		if (order != HPAGE_PMD_ORDER && folio_order(folio) >= order) {
+			result = SCAN_PTE_MAPPED_HUGEPAGE;
+			goto out;
+		}
+
 		/* See khugepaged_scan_pmd(). */
 		if (folio_likely_mapped_shared(folio)) {
 			++shared;
-- 
2.47.1



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-08 23:31 [RFC 00/11] khugepaged: mTHP support Nico Pache
                   ` (10 preceding siblings ...)
  2025-01-08 23:31 ` [RFC 11/11] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
@ 2025-01-09  6:22 ` Dev Jain
  2025-01-10  2:27   ` Nico Pache
  2025-01-09  6:27 ` Dev Jain
  2025-01-16  9:47 ` Ryan Roberts
  13 siblings, 1 reply; 53+ messages in thread
From: Dev Jain @ 2025-01-09  6:22 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, sunnanyong, usamaarif642, audra, akpm


On 09/01/25 5:01 am, Nico Pache wrote:
> The following series provides khugepaged and madvise collapse with the
> capability to collapse regions to mTHPs.
>
> To achieve this we generalize the khugepaged functions to no longer depend
> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> (defined by MTHP_MIN_ORDER) that are fully utilized. This info is tracked
> using a bitmap. After the PMD scan is done, we do binary recursion on the
> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> on max_ptes_none is removed during the scan, to make sure we account for
> the whole PMD range. max_ptes_none is mapped to a 0-100 range to
> determine how full a mTHP order needs to be before collapsing it.
>
> Some design choices to note:
>   - bitmap structures are allocated dynamically because on some arch's
>      (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
>      compile time leading to warnings.
>   - The recursion is masked through a stack structure.
>   - A MTHP_MIN_ORDER was added to compress the bitmap, and ensure it was
>      64bit on x86. This provides some optimization on the bitmap operations.
>      if other arches/configs that have larger than 512 PTEs per PMD want to
>      compress their bitmap further we can change this value per arch.
>
> Patch 1-2:  Some refactoring to combine madvise_collapse and khugepaged
> Patch 3:    A minor "fix"/optimization
> Patch 4:    Refactor/rename hpage_collapse
> Patch 5-7:  Generalize khugepaged functions for arbitrary orders
> Patch 8-11: The mTHP patches
>
> This series acts as an alternative to Dev Jain's approach [1]. The two
> series differ in a few ways:
>    - My approach uses a bitmap to store the state of the linear scan_pmd to
>      then determine potential mTHP batches. Devs incorporates his directly
>      into the scan, and will try each available order.
>    - Dev is attempting to optimize the locking, while my approach keeps the
>      locking changes to a minimum. I believe his changes are not safe for
>      uffd.
>    - Dev's changes only work for khugepaged not madvise_collapse (although
>      i think that was by choice and it could easily support madvise)
>    - Dev scales all khugepaged sysfs tunables by order, while im removing
>      the restriction of max_ptes_none and converting it to a scale to
>      determine a (m)THP threshold.
>    - Dev turns on khugepaged if any order is available while mine still
>      only runs if PMDs are enabled. I like Dev's approach and will most
>      likely do the same in my PATCH posting.
>    - mTHPs need their ref count updated to 1<<order, which Dev is missing.
>
> Patch 11 was inspired by one of Dev's changes.
>
> [1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
>
> Nico Pache (11):
>    introduce khugepaged_collapse_single_pmd to collapse a single pmd
>    khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot
>    khugepaged: Don't allocate khugepaged mm_slot early
>    khugepaged: rename hpage_collapse_* to khugepaged_*
>    khugepaged: generalize hugepage_vma_revalidate for mTHP support
>    khugepaged: generalize alloc_charge_folio for mTHP support
>    khugepaged: generalize __collapse_huge_page_* for mTHP support
>    khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>    khugepaged: add mTHP support
>    khugepaged: remove max_ptes_none restriction on the pmd scan
>    khugepaged: skip collapsing mTHP to smaller orders
>
>   include/linux/khugepaged.h |   4 +-
>   mm/huge_memory.c           |   3 +-
>   mm/khugepaged.c            | 436 +++++++++++++++++++++++++------------
>   3 files changed, 306 insertions(+), 137 deletions(-)

Before I take a proper look at your series, can you please include any testing
you may have done?



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-08 23:31 [RFC 00/11] khugepaged: mTHP support Nico Pache
                   ` (11 preceding siblings ...)
  2025-01-09  6:22 ` [RFC 00/11] khugepaged: mTHP support Dev Jain
@ 2025-01-09  6:27 ` Dev Jain
  2025-01-10  1:28   ` Nico Pache
  2025-01-16  9:47 ` Ryan Roberts
  13 siblings, 1 reply; 53+ messages in thread
From: Dev Jain @ 2025-01-09  6:27 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, sunnanyong, usamaarif642, audra, akpm


On 09/01/25 5:01 am, Nico Pache wrote:
> The following series provides khugepaged and madvise collapse with the
> capability to collapse regions to mTHPs.
>
> To achieve this we generalize the khugepaged functions to no longer depend
> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> (defined by MTHP_MIN_ORDER) that are fully utilized. This info is tracked
> using a bitmap. After the PMD scan is done, we do binary recursion on the
> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> on max_ptes_none is removed during the scan, to make sure we account for
> the whole PMD range. max_ptes_none is mapped to a 0-100 range to
> determine how full a mTHP order needs to be before collapsing it.
>
> Some design choices to note:
>   - bitmap structures are allocated dynamically because on some arch's
>      (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
>      compile time leading to warnings.
>   - The recursion is masked through a stack structure.
>   - A MTHP_MIN_ORDER was added to compress the bitmap, and ensure it was
>      64bit on x86. This provides some optimization on the bitmap operations.
>      if other arches/configs that have larger than 512 PTEs per PMD want to
>      compress their bitmap further we can change this value per arch.
>
> Patch 1-2:  Some refactoring to combine madvise_collapse and khugepaged
> Patch 3:    A minor "fix"/optimization
> Patch 4:    Refactor/rename hpage_collapse
> Patch 5-7:  Generalize khugepaged functions for arbitrary orders
> Patch 8-11: The mTHP patches
>
> This series acts as an alternative to Dev Jain's approach [1]. The two
> series differ in a few ways:
>    - My approach uses a bitmap to store the state of the linear scan_pmd to
>      then determine potential mTHP batches. Devs incorporates his directly
>      into the scan, and will try each available order.
>    - Dev is attempting to optimize the locking, while my approach keeps the
>      locking changes to a minimum. I believe his changes are not safe for
>      uffd.
>    - Dev's changes only work for khugepaged not madvise_collapse (although
>      i think that was by choice and it could easily support madvise)
>    - Dev scales all khugepaged sysfs tunables by order, while im removing
>      the restriction of max_ptes_none and converting it to a scale to
>      determine a (m)THP threshold.
>    - Dev turns on khugepaged if any order is available while mine still
>      only runs if PMDs are enabled. I like Dev's approach and will most
>      likely do the same in my PATCH posting.
>    - mTHPs need their ref count updated to 1<<order, which Dev is missing.

Well, I did not miss it :)

int nr_pages = folio_nr_pages(folio);
folio_ref_add(folio, nr_pages - 1);

>
> Patch 11 was inspired by one of Dev's changes.
>
> [1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
>
> Nico Pache (11):
>    introduce khugepaged_collapse_single_pmd to collapse a single pmd
>    khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot
>    khugepaged: Don't allocate khugepaged mm_slot early
>    khugepaged: rename hpage_collapse_* to khugepaged_*
>    khugepaged: generalize hugepage_vma_revalidate for mTHP support
>    khugepaged: generalize alloc_charge_folio for mTHP support
>    khugepaged: generalize __collapse_huge_page_* for mTHP support
>    khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>    khugepaged: add mTHP support
>    khugepaged: remove max_ptes_none restriction on the pmd scan
>    khugepaged: skip collapsing mTHP to smaller orders
>
>   include/linux/khugepaged.h |   4 +-
>   mm/huge_memory.c           |   3 +-
>   mm/khugepaged.c            | 436 +++++++++++++++++++++++++------------
>   3 files changed, 306 insertions(+), 137 deletions(-)
>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-09  6:27 ` Dev Jain
@ 2025-01-10  1:28   ` Nico Pache
  0 siblings, 0 replies; 53+ messages in thread
From: Nico Pache @ 2025-01-10  1:28 UTC (permalink / raw)
  To: Dev Jain
  Cc: linux-kernel, linux-mm, ryan.roberts, anshuman.khandual,
	catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
	baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
	peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
	vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy,
	kirill.shutemov, david, aarcange, raquini, sunnanyong,
	usamaarif642, audra, akpm

On Wed, Jan 8, 2025 at 11:27 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
> On 09/01/25 5:01 am, Nico Pache wrote:
> > The following series provides khugepaged and madvise collapse with the
> > capability to collapse regions to mTHPs.
> >
> > To achieve this we generalize the khugepaged functions to no longer depend
> > on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> > (defined by MTHP_MIN_ORDER) that are fully utilized. This info is tracked
> > using a bitmap. After the PMD scan is done, we do binary recursion on the
> > bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> > on max_ptes_none is removed during the scan, to make sure we account for
> > the whole PMD range. max_ptes_none is mapped to a 0-100 range to
> > determine how full a mTHP order needs to be before collapsing it.
> >
> > Some design choices to note:
> >   - bitmap structures are allocated dynamically because on some arch's
> >      (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
> >      compile time leading to warnings.
> >   - The recursion is masked through a stack structure.
> >   - A MTHP_MIN_ORDER was added to compress the bitmap, and ensure it was
> >      64bit on x86. This provides some optimization on the bitmap operations.
> >      if other arches/configs that have larger than 512 PTEs per PMD want to
> >      compress their bitmap further we can change this value per arch.
> >
> > Patch 1-2:  Some refactoring to combine madvise_collapse and khugepaged
> > Patch 3:    A minor "fix"/optimization
> > Patch 4:    Refactor/rename hpage_collapse
> > Patch 5-7:  Generalize khugepaged functions for arbitrary orders
> > Patch 8-11: The mTHP patches
> >
> > This series acts as an alternative to Dev Jain's approach [1]. The two
> > series differ in a few ways:
> >    - My approach uses a bitmap to store the state of the linear scan_pmd to
> >      then determine potential mTHP batches. Devs incorporates his directly
> >      into the scan, and will try each available order.
> >    - Dev is attempting to optimize the locking, while my approach keeps the
> >      locking changes to a minimum. I believe his changes are not safe for
> >      uffd.
> >    - Dev's changes only work for khugepaged not madvise_collapse (although
> >      i think that was by choice and it could easily support madvise)
> >    - Dev scales all khugepaged sysfs tunables by order, while im removing
> >      the restriction of max_ptes_none and converting it to a scale to
> >      determine a (m)THP threshold.
> >    - Dev turns on khugepaged if any order is available while mine still
> >      only runs if PMDs are enabled. I like Dev's approach and will most
> >      likely do the same in my PATCH posting.
> >    - mTHPs need their ref count updated to 1<<order, which Dev is missing.
>
> Well, I did not miss it :)
Sorry! I missed that in my initial review of your code. Seeing that
would have saved me a few hours of debugging xD

>
> int nr_pages = folio_nr_pages(folio);
> folio_ref_add(folio, nr_pages - 1);

Once I found the fix I forgot to cross-reference it with your series.
Missing this ref update was causing the issue I alluded to in your RFC
thread. When you said you ran into some issues on the debug configs, I
figured it was the same one.


>
> >
> > Patch 11 was inspired by one of Dev's changes.
> >
> > [1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
> >
> > Nico Pache (11):
> >    introduce khugepaged_collapse_single_pmd to collapse a single pmd
> >    khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot
> >    khugepaged: Don't allocate khugepaged mm_slot early
> >    khugepaged: rename hpage_collapse_* to khugepaged_*
> >    khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >    khugepaged: generalize alloc_charge_folio for mTHP support
> >    khugepaged: generalize __collapse_huge_page_* for mTHP support
> >    khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >    khugepaged: add mTHP support
> >    khugepaged: remove max_ptes_none restriction on the pmd scan
> >    khugepaged: skip collapsing mTHP to smaller orders
> >
> >   include/linux/khugepaged.h |   4 +-
> >   mm/huge_memory.c           |   3 +-
> >   mm/khugepaged.c            | 436 +++++++++++++++++++++++++------------
> >   3 files changed, 306 insertions(+), 137 deletions(-)
> >
>



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-09  6:22 ` [RFC 00/11] khugepaged: mTHP support Dev Jain
@ 2025-01-10  2:27   ` Nico Pache
  2025-01-10  4:56     ` Dev Jain
  0 siblings, 1 reply; 53+ messages in thread
From: Nico Pache @ 2025-01-10  2:27 UTC (permalink / raw)
  To: Dev Jain
  Cc: linux-kernel, linux-mm, ryan.roberts, anshuman.khandual,
	catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
	baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
	peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
	vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy,
	kirill.shutemov, david, aarcange, raquini, sunnanyong,
	usamaarif642, audra, akpm

On Wed, Jan 8, 2025 at 11:22 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
> On 09/01/25 5:01 am, Nico Pache wrote:
> > The following series provides khugepaged and madvise collapse with the
> > capability to collapse regions to mTHPs.
> >
> > To achieve this we generalize the khugepaged functions to no longer depend
> > on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> > (defined by MTHP_MIN_ORDER) that are fully utilized. This info is tracked
> > using a bitmap. After the PMD scan is done, we do binary recursion on the
> > bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> > on max_ptes_none is removed during the scan, to make sure we account for
> > the whole PMD range. max_ptes_none is mapped to a 0-100 range to
> > determine how full a mTHP order needs to be before collapsing it.
> >
> > Some design choices to note:
> >   - bitmap structures are allocated dynamically because on some arch's
> >      (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
> >      compile time leading to warnings.
> >   - The recursion is masked through a stack structure.
> >   - A MTHP_MIN_ORDER was added to compress the bitmap, and ensure it was
> >      64bit on x86. This provides some optimization on the bitmap operations.
> >      if other arches/configs that have larger than 512 PTEs per PMD want to
> >      compress their bitmap further we can change this value per arch.
> >
> > Patch 1-2:  Some refactoring to combine madvise_collapse and khugepaged
> > Patch 3:    A minor "fix"/optimization
> > Patch 4:    Refactor/rename hpage_collapse
> > Patch 5-7:  Generalize khugepaged functions for arbitrary orders
> > Patch 8-11: The mTHP patches
> >
> > This series acts as an alternative to Dev Jain's approach [1]. The two
> > series differ in a few ways:
> >    - My approach uses a bitmap to store the state of the linear scan_pmd to
> >      then determine potential mTHP batches. Devs incorporates his directly
> >      into the scan, and will try each available order.
> >    - Dev is attempting to optimize the locking, while my approach keeps the
> >      locking changes to a minimum. I believe his changes are not safe for
> >      uffd.
> >    - Dev's changes only work for khugepaged not madvise_collapse (although
> >      i think that was by choice and it could easily support madvise)
> >    - Dev scales all khugepaged sysfs tunables by order, while im removing
> >      the restriction of max_ptes_none and converting it to a scale to
> >      determine a (m)THP threshold.
> >    - Dev turns on khugepaged if any order is available while mine still
> >      only runs if PMDs are enabled. I like Dev's approach and will most
> >      likely do the same in my PATCH posting.
> >    - mTHPs need their ref count updated to 1<<order, which Dev is missing.
> >
> > Patch 11 was inspired by one of Dev's changes.
> >
> > [1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
> >
> > Nico Pache (11):
> >    introduce khugepaged_collapse_single_pmd to collapse a single pmd
> >    khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot
> >    khugepaged: Don't allocate khugepaged mm_slot early
> >    khugepaged: rename hpage_collapse_* to khugepaged_*
> >    khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >    khugepaged: generalize alloc_charge_folio for mTHP support
> >    khugepaged: generalize __collapse_huge_page_* for mTHP support
> >    khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >    khugepaged: add mTHP support
> >    khugepaged: remove max_ptes_none restriction on the pmd scan
> >    khugepaged: skip collapsing mTHP to smaller orders
> >
> >   include/linux/khugepaged.h |   4 +-
> >   mm/huge_memory.c           |   3 +-
> >   mm/khugepaged.c            | 436 +++++++++++++++++++++++++------------
> >   3 files changed, 306 insertions(+), 137 deletions(-)
>
> Before I take a proper look at your series, can you please include any testing
> you may have done?

I built these changes for the following arches: x86_64, arm64,
arm64-64k, ppc64le, s390x.

x86 testing:
- Selftests mm
- some stress-ng tests
- compile kernel
- I did some tests with my defer [1] set on top. This pushes all the
work to khugepaged, which removes the noise of all the PF allocations.

I recently got an ARM64 machine and ran some simple sanity tests (on
both 4K and 64K page sizes): selftests, stress-ng, playing around with
the tunables, etc.

I will also be running all the builds through our CI and perf-testing
environments before posting.

[1] https://lore.kernel.org/lkml/20240729222727.64319-1-npache@redhat.com/

>



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-10  2:27   ` Nico Pache
@ 2025-01-10  4:56     ` Dev Jain
  2025-01-10 22:01       ` Nico Pache
  0 siblings, 1 reply; 53+ messages in thread
From: Dev Jain @ 2025-01-10  4:56 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-mm, ryan.roberts, anshuman.khandual,
	catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
	baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
	peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
	vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy,
	kirill.shutemov, david, aarcange, raquini, sunnanyong,
	usamaarif642, audra, akpm



On 10/01/25 7:57 am, Nico Pache wrote:
> On Wed, Jan 8, 2025 at 11:22 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>> On 09/01/25 5:01 am, Nico Pache wrote:
>>> The following series provides khugepaged and madvise collapse with the
>>> capability to collapse regions to mTHPs.
>>>
>>> To achieve this we generalize the khugepaged functions to no longer depend
>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
>>> (defined by MTHP_MIN_ORDER) that are fully utilized. This info is tracked
>>> using a bitmap. After the PMD scan is done, we do binary recursion on the
>>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
>>> on max_ptes_none is removed during the scan, to make sure we account for
>>> the whole PMD range. max_ptes_none is mapped to a 0-100 range to
>>> determine how full a mTHP order needs to be before collapsing it.
>>>
>>> Some design choices to note:
>>>    - bitmap structures are allocated dynamically because on some arch's
>>>       (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
>>>       compile time leading to warnings.
>>>    - The recursion is masked through a stack structure.
>>>    - A MTHP_MIN_ORDER was added to compress the bitmap, and ensure it was
>>>       64bit on x86. This provides some optimization on the bitmap operations.
>>>       if other arches/configs that have larger than 512 PTEs per PMD want to
>>>       compress their bitmap further we can change this value per arch.
>>>
>>> Patch 1-2:  Some refactoring to combine madvise_collapse and khugepaged
>>> Patch 3:    A minor "fix"/optimization
>>> Patch 4:    Refactor/rename hpage_collapse
>>> Patch 5-7:  Generalize khugepaged functions for arbitrary orders
>>> Patch 8-11: The mTHP patches
>>>
>>> This series acts as an alternative to Dev Jain's approach [1]. The two
>>> series differ in a few ways:
>>>     - My approach uses a bitmap to store the state of the linear scan_pmd to
>>>       then determine potential mTHP batches. Devs incorporates his directly
>>>       into the scan, and will try each available order.
>>>     - Dev is attempting to optimize the locking, while my approach keeps the
>>>       locking changes to a minimum. I believe his changes are not safe for
>>>       uffd.
>>>     - Dev's changes only work for khugepaged not madvise_collapse (although
>>>       i think that was by choice and it could easily support madvise)
>>>     - Dev scales all khugepaged sysfs tunables by order, while im removing
>>>       the restriction of max_ptes_none and converting it to a scale to
>>>       determine a (m)THP threshold.
>>>     - Dev turns on khugepaged if any order is available while mine still
>>>       only runs if PMDs are enabled. I like Dev's approach and will most
>>>       likely do the same in my PATCH posting.
>>>     - mTHPs need their ref count updated to 1<<order, which Dev is missing.
>>>
>>> Patch 11 was inspired by one of Dev's changes.
>>>
>>> [1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
>>>
>>> Nico Pache (11):
>>>     introduce khugepaged_collapse_single_pmd to collapse a single pmd
>>>     khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot
>>>     khugepaged: Don't allocate khugepaged mm_slot early
>>>     khugepaged: rename hpage_collapse_* to khugepaged_*
>>>     khugepaged: generalize hugepage_vma_revalidate for mTHP support
>>>     khugepaged: generalize alloc_charge_folio for mTHP support
>>>     khugepaged: generalize __collapse_huge_page_* for mTHP support
>>>     khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>>>     khugepaged: add mTHP support
>>>     khugepaged: remove max_ptes_none restriction on the pmd scan
>>>     khugepaged: skip collapsing mTHP to smaller orders
>>>
>>>    include/linux/khugepaged.h |   4 +-
>>>    mm/huge_memory.c           |   3 +-
>>>    mm/khugepaged.c            | 436 +++++++++++++++++++++++++------------
>>>    3 files changed, 306 insertions(+), 137 deletions(-)
>>
>> Before I take a proper look at your series, can you please include any testing
>> you may have done?
> 
> I Built these changes for the following arches: x86_64, arm64,
> arm64-64k, ppc64le, s390x
> 
> x86 testing:
> - Selftests mm
> - some stress-ng tests
> - compile kernel
> - I did some tests with my defer [1] set on top. This pushes all the
> work to khugepaged, which removes the noise of all the PF allocations.
> 
> I recently got an ARM64 machine and did some simple sanity tests (on
> both 4k and 64k) like selftests, stress-ng, and playing around with
> the tunables, etc.
> 
> I will also be running all the builds through our CI, and perf testing
> environments before posting.
> 
> [1] https://lore.kernel.org/lkml/20240729222727.64319-1-npache@redhat.com/
> 
>>
> 
I tested your series with the program I was using and it is not working;
can you please confirm?

diff --git a/mytests/mthp.c b/mytests/mthp.c
new file mode 100644
index 000000000000..e3029dbcf035
--- /dev/null
+++ b/mytests/mthp.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *
+ * Author: Dev Jain <dev.jain@arm.com>
+ *
+ * Program to test khugepaged mTHP collapse
+ */
+
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <string.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/mman.h>
+#include <sys/time.h>
+#include <sys/random.h>
+#include <assert.h>
+
+int main(int argc, char *argv[])
+{
+	char *ptr;
+	unsigned long mthp_size = (1UL << 16);
+	size_t chunk_size = (1UL << 25);
+
+	ptr = mmap((void *)(1UL << 30), chunk_size, PROT_READ | PROT_WRITE,
+		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (((unsigned long)ptr) != (1UL << 30)) {
+		printf("mmap did not work on required address\n");
+		return 1;
+	}
+
+	/* Fill first pte in every 64K interval */
+	for (int i = 0; i < chunk_size; i += mthp_size)
+		ptr[i] = i;
+
+	if (madvise(ptr, chunk_size, MADV_HUGEPAGE)) {
+		perror("madvise");
+		return 1;
+	}
+	sleep(100);
+	return 0;
+}
-- 
2.30.2

Set enabled = madvise, hugepages-2048k/enabled = hugepages-64k/enabled = 
inherit. Run the program in the background, then run tools/mm/thpmaps.
You will see PMD collapse correctly, but when you echo never into
hugepages-2048k/enabled and test this again, you won't see contpte 64K 
collapse. With my series, you will see something like

anon-cont-pte-aligned-64kB : 32768 kB (100%).
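
For reference, the steps above correspond roughly to the commands below
(sysfs paths assume the standard /sys/kernel/mm/transparent_hugepage
layout on a 4K base-page kernel; the exact directory names and the
thpmaps invocation may differ, see the script's --help):

  echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
  echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
  echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled

  ./mthp &                 # the program above; it sleeps so khugepaged can scan
  ./tools/mm/thpmaps       # expect the PMD (2048kB) collapse to show up

  echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
  ./mthp &
  ./tools/mm/thpmaps       # with mTHP collapse working, expect anon-cont-pte-aligned-64kB at ~100%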



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 03/11] khugepaged: Don't allocate khugepaged mm_slot early
  2025-01-08 23:31 ` [RFC 03/11] khugepaged: Don't allocate khugepaged mm_slot early Nico Pache
@ 2025-01-10  6:11   ` Dev Jain
  2025-01-10 19:37     ` Nico Pache
  0 siblings, 1 reply; 53+ messages in thread
From: Dev Jain @ 2025-01-10  6:11 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, sunnanyong, usamaarif642, audra, akpm



On 09/01/25 5:01 am, Nico Pache wrote:
> We should only "enter"/allocate the khugepaged mm_slot if we succeed at
> allocating the PMD sized folio. Move the khugepaged_enter_vma call until
> after we know the vma_alloc_folio was successful.

Why? We have the appropriate checks from thp_vma_allowable_orders() and 
friends, so the VMA should be registered with khugepaged irrespective of
whether during fault time we are able to allocate a PMD-THP or not. If 
we fail at fault time, it is the job of khugepaged to try to collapse it 
later.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>   mm/huge_memory.c | 3 +--
>   1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e53d83b3e5cf..635c65e7ef63 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1323,7 +1323,6 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>   	ret = vmf_anon_prepare(vmf);
>   	if (ret)
>   		return ret;
> -	khugepaged_enter_vma(vma, vma->vm_flags);
>   
>   	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
>   			!mm_forbids_zeropage(vma->vm_mm) &&
> @@ -1365,7 +1364,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>   		}
>   		return ret;
>   	}
> -
> +	khugepaged_enter_vma(vma, vma->vm_flags);
>   	return __do_huge_pmd_anonymous_page(vmf);
>   }
>   

In any case, you are not achieving what you described in the patch
description: you have moved khugepaged_enter_vma() after the read-fault
logic; what you want to do is move it after
vma_alloc_anon_folio_pmd() in __do_huge_pmd_anonymous_page().
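
i.e. roughly the following ordering (a sketch of the suggested placement
only, not the actual mm/huge_memory.c code; the surrounding function body
is elided):

	/* inside __do_huge_pmd_anonymous_page(), sketch only */
	folio = vma_alloc_anon_folio_pmd(vma, vmf->address);
	if (unlikely(!folio))
		return VM_FAULT_FALLBACK;

	/* register with khugepaged only once the PMD allocation succeeded */
	khugepaged_enter_vma(vma, vma->vm_flags);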



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 06/11] khugepaged: generalize alloc_charge_folio for mTHP support
  2025-01-08 23:31 ` [RFC 06/11] khugepaged: generalize alloc_charge_folio " Nico Pache
@ 2025-01-10  6:23   ` Dev Jain
  2025-01-10 19:41     ` Nico Pache
  0 siblings, 1 reply; 53+ messages in thread
From: Dev Jain @ 2025-01-10  6:23 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, sunnanyong, usamaarif642, audra, akpm



On 09/01/25 5:01 am, Nico Pache wrote:
> alloc_charge_folio allocates the new folio for the khugepaged collapse.
> Generalize the order of the folio allocations to support future mTHP
> collapsing.
> 
> No functional changes in this patch.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>   mm/khugepaged.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index e2e6ca9265ab..6daf3a943a1a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1070,14 +1070,14 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>   }
>   
>   static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> -			      struct collapse_control *cc)
> +			      struct collapse_control *cc, int order)
>   {
>   	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
>   		     GFP_TRANSHUGE);
>   	int node = khugepaged_find_target_node(cc);
>   	struct folio *folio;
>   
> -	folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
> +	folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
>   	if (!folio) {
>   		*foliop = NULL;
>   		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> @@ -1121,7 +1121,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   	 */
>   	mmap_read_unlock(mm);
>   
> -	result = alloc_charge_folio(&folio, mm, cc);
> +	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>   	if (result != SCAN_SUCCEED)
>   		goto out_nolock;
>   
> @@ -1834,7 +1834,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>   	VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
>   	VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>   
> -	result = alloc_charge_folio(&new_folio, mm, cc);
> +	result = alloc_charge_folio(&new_folio, mm, cc, HPAGE_PMD_ORDER);
>   	if (result != SCAN_SUCCEED)
>   		goto out;
>   

I guess we will need stat updates like I did in my patch.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 01/11] introduce khugepaged_collapse_single_pmd to collapse a single pmd
  2025-01-08 23:31 ` [RFC 01/11] introduce khugepaged_collapse_single_pmd to collapse a single pmd Nico Pache
@ 2025-01-10  6:25   ` Dev Jain
  0 siblings, 0 replies; 53+ messages in thread
From: Dev Jain @ 2025-01-10  6:25 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, sunnanyong, usamaarif642, audra, akpm



On 09/01/25 5:01 am, Nico Pache wrote:
> The khugepaged daemon and madvise_collapse have two different
> implementations that do almost the thing.
> 
> Create khugepaged_collapse_single_pmd to increase code
> reuse and create a entry point for future khugepaged changes.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>   mm/khugepaged.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 46 insertions(+)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 653dbb1ff05c..4d932839ff1d 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2348,6 +2348,52 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
>   }
>   #endif
>   
> +/*
> + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> + * the results.
> + */
> +static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *mm,
> +				   struct vm_area_struct *vma, bool *mmap_locked,
> +				   struct collapse_control *cc)
> +{
> +	int result = SCAN_FAIL;
> +	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
> +
> +	if (!*mmap_locked) {
> +		mmap_read_lock(mm);
> +		*mmap_locked = true;
> +	}
> +
> +	if (thp_vma_allowable_order(vma, vma->vm_flags,
> +					tva_flags, PMD_ORDER)) {
> +		if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) {
> +			struct file *file = get_file(vma->vm_file);
> +			pgoff_t pgoff = linear_page_index(vma, addr);
> +
> +			mmap_read_unlock(mm);
> +			*mmap_locked = false;
> +			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
> +							  cc);
> +			fput(file);
> +			if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
> +				mmap_read_lock(mm);
> +				if (hpage_collapse_test_exit_or_disable(mm))
> +					goto end;
> +				result = collapse_pte_mapped_thp(mm, addr,
> +								 !cc->is_khugepaged);
> +				mmap_read_unlock(mm);
> +			}
> +		} else {
> +			result = hpage_collapse_scan_pmd(mm, vma, addr,
> +							 mmap_locked, cc);
> +		}
> +		if (result == SCAN_SUCCEED || result == SCAN_PMD_MAPPED)
> +			++khugepaged_pages_collapsed;
> +	}
> +end:
> +	return result;
> +}
> +
>   static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>   					    struct collapse_control *cc)
>   	__releases(&khugepaged_mm_lock)

I would suggest squashing this with patch 2 to avoid an unused-function
warning when building the patches sequentially.



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 07/11] khugepaged: generalize __collapse_huge_page_* for mTHP support
  2025-01-08 23:31 ` [RFC 07/11] khugepaged: generalize __collapse_huge_page_* " Nico Pache
@ 2025-01-10  6:38   ` Dev Jain
  0 siblings, 0 replies; 53+ messages in thread
From: Dev Jain @ 2025-01-10  6:38 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, sunnanyong, usamaarif642, audra, akpm



On 09/01/25 5:01 am, Nico Pache wrote:
> generalize the order of the __collapse_huge_page_* functions
> to support future mTHP collapse.
> 
> No functional changes in this patch.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>   mm/khugepaged.c | 36 +++++++++++++++++++-----------------
>   1 file changed, 19 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 6daf3a943a1a..9eb161b04ee4 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -565,7 +565,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>   					unsigned long address,
>   					pte_t *pte,
>   					struct collapse_control *cc,
> -					struct list_head *compound_pagelist)
> +					struct list_head *compound_pagelist,
> +					u8 order)
>   {
>   	struct page *page = NULL;
>   	struct folio *folio = NULL;
> @@ -573,7 +574,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>   	int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
>   	bool writable = false;
>   
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> +	for (_pte = pte; _pte < pte + (1 << order);
>   	     _pte++, address += PAGE_SIZE) {
>   		pte_t pteval = ptep_get(_pte);
>   		if (pte_none(pteval) || (pte_present(pteval) &&
> @@ -711,14 +712,15 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>   						struct vm_area_struct *vma,
>   						unsigned long address,
>   						spinlock_t *ptl,
> -						struct list_head *compound_pagelist)
> +						struct list_head *compound_pagelist,
> +						u8 order)
>   {
>   	struct folio *src, *tmp;
>   	pte_t *_pte;
>   	pte_t pteval;
>   
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, address += PAGE_SIZE) {
> +	for (_pte = pte; _pte < pte + (1 << order);
> +		_pte++, address += PAGE_SIZE) {
>   		pteval = ptep_get(_pte);
>   		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>   			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> @@ -764,7 +766,8 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>   					     pmd_t *pmd,
>   					     pmd_t orig_pmd,
>   					     struct vm_area_struct *vma,
> -					     struct list_head *compound_pagelist)
> +					     struct list_head *compound_pagelist,
> +					     u8 order)
>   {
>   	spinlock_t *pmd_ptl;
>   
> @@ -781,7 +784,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>   	 * Release both raw and compound pages isolated
>   	 * in __collapse_huge_page_isolate.
>   	 */
> -	release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
> +	release_pte_pages(pte, pte + (1 << order), compound_pagelist);
>   }
>   
>   /*
> @@ -802,7 +805,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>   static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>   		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
>   		unsigned long address, spinlock_t *ptl,
> -		struct list_head *compound_pagelist)
> +		struct list_head *compound_pagelist, u8 order)
>   {
>   	unsigned int i;
>   	int result = SCAN_SUCCEED;
> @@ -810,7 +813,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>   	/*
>   	 * Copying pages' contents is subject to memory poison at any iteration.
>   	 */
> -	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +	for (i = 0; i < (1 << order); i++) {
>   		pte_t pteval = ptep_get(pte + i);
>   		struct page *page = folio_page(folio, i);
>   		unsigned long src_addr = address + i * PAGE_SIZE;
> @@ -829,10 +832,10 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>   
>   	if (likely(result == SCAN_SUCCEED))
>   		__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
> -						    compound_pagelist);
> +						    compound_pagelist, order);
>   	else
>   		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> -						 compound_pagelist);
> +						 compound_pagelist, order);
>   
>   	return result;
>   }
> @@ -996,11 +999,11 @@ static int check_pmd_still_valid(struct mm_struct *mm,
>   static int __collapse_huge_page_swapin(struct mm_struct *mm,
>   				       struct vm_area_struct *vma,
>   				       unsigned long haddr, pmd_t *pmd,
> -				       int referenced)
> +				       int referenced, u8 order)
>   {

I had dropped 'h' from haddr because the address won't be huge-aligned
after mTHP support.

>   	int swapped_in = 0;
>   	vm_fault_t ret = 0;
> -	unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
> +	unsigned long address, end = haddr + ((1 << order) * PAGE_SIZE);

Better to write PAGE_SIZE << order, as Matthew had noted in my patch :)




^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 08/11] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  2025-01-08 23:31 ` [RFC 08/11] khugepaged: introduce khugepaged_scan_bitmap " Nico Pache
@ 2025-01-10  9:05   ` Dev Jain
  2025-01-10 21:48     ` Nico Pache
  2025-01-10 14:54   ` Dev Jain
  2025-01-12 15:13   ` Dev Jain
  2 siblings, 1 reply; 53+ messages in thread
From: Dev Jain @ 2025-01-10  9:05 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, sunnanyong, usamaarif642, audra, akpm



On 09/01/25 5:01 am, Nico Pache wrote:
> khugepaged scans PMD ranges for potential collapse to a hugepage. To add
> mTHP support we use this scan to instead record chunks of fully utilized
> sections of the PMD.
> 
> create a bitmap to represent a PMD in order MTHP_MIN_ORDER chunks.
> by default we will set this to order 3. The reasoning is that for 4K 512
> PMD size this results in a 64 bit bitmap which has some optimizations.
> For other arches like ARM64 64K, we can set a larger order if needed.
> 
> khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
> that represents chunks of fully utilized regions. We can then determine
> what mTHP size fits best and in the following patch, we set this bitmap
> while scanning the PMD.
> 
> max_ptes_none is used as a scale to determine how "full" an order must
> be before being considered for collapse.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>   include/linux/khugepaged.h |   4 +-
>   mm/khugepaged.c            | 129 +++++++++++++++++++++++++++++++++++--
>   2 files changed, 126 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index 1f46046080f5..31cff8aeec4a 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -1,7 +1,9 @@
>   /* SPDX-License-Identifier: GPL-2.0 */
>   #ifndef _LINUX_KHUGEPAGED_H
>   #define _LINUX_KHUGEPAGED_H
> -

Nit: I don't think this line needs to be deleted.

> +#define MIN_MTHP_ORDER	3
> +#define MIN_MTHP_NR	(1<<MIN_MTHP_ORDER)

Nit: Insert a space: (1 << MIN_MTHP_ORDER)

> +#define MTHP_BITMAP_SIZE  (1<<(HPAGE_PMD_ORDER - MIN_MTHP_ORDER))
>   extern unsigned int khugepaged_max_ptes_none __read_mostly;
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>   extern struct attribute_group khugepaged_attr_group;
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 9eb161b04ee4..de1dc6ea3c71 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>   
>   static struct kmem_cache *mm_slot_cache __ro_after_init;
>   
> +struct scan_bit_state {
> +	u8 order;
> +	u8 offset;
> +};
> +
>   struct collapse_control {
>   	bool is_khugepaged;
>   
> @@ -102,6 +107,15 @@ struct collapse_control {
>   
>   	/* nodemask for allocation fallback */
>   	nodemask_t alloc_nmask;
> +
> +	/* bitmap used to collapse mTHP sizes. 1bit = order MIN_MTHP_ORDER mTHP */
> +	unsigned long *mthp_bitmap;
> +	unsigned long *mthp_bitmap_temp;
> +	struct scan_bit_state *mthp_bitmap_stack;
> +};
> +
> +struct collapse_control khugepaged_collapse_control = {
> +	.is_khugepaged = true,
>   };
>   
>   /**
> @@ -389,6 +403,25 @@ int __init khugepaged_init(void)
>   	if (!mm_slot_cache)
>   		return -ENOMEM;
>   
> +	/*
> +	 * allocate the bitmaps dynamically since MTHP_BITMAP_SIZE is not known at
> +	 * compile time for some architectures.
> +	 */
> +	khugepaged_collapse_control.mthp_bitmap = kmalloc_array(
> +		BITS_TO_LONGS(MTHP_BITMAP_SIZE), sizeof(unsigned long), GFP_KERNEL);
> +	if (!khugepaged_collapse_control.mthp_bitmap)
> +		return -ENOMEM;
> +
> +	khugepaged_collapse_control.mthp_bitmap_temp = kmalloc_array(
> +		BITS_TO_LONGS(MTHP_BITMAP_SIZE), sizeof(unsigned long), GFP_KERNEL);
> +	if (!khugepaged_collapse_control.mthp_bitmap_temp)
> +		return -ENOMEM;
> +
> +	khugepaged_collapse_control.mthp_bitmap_stack = kmalloc_array(
> +		MTHP_BITMAP_SIZE, sizeof(struct scan_bit_state), GFP_KERNEL);
> +	if (!khugepaged_collapse_control.mthp_bitmap_stack)
> +		return -ENOMEM;
> +
>   	khugepaged_pages_to_scan = HPAGE_PMD_NR * 8;
>   	khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;
>   	khugepaged_max_ptes_swap = HPAGE_PMD_NR / 8;
> @@ -400,6 +433,9 @@ int __init khugepaged_init(void)
>   void __init khugepaged_destroy(void)
>   {
>   	kmem_cache_destroy(mm_slot_cache);
> +	kfree(khugepaged_collapse_control.mthp_bitmap);
> +	kfree(khugepaged_collapse_control.mthp_bitmap_temp);
> +	kfree(khugepaged_collapse_control.mthp_bitmap_stack);
>   }
>   
>   static inline int khugepaged_test_exit(struct mm_struct *mm)
> @@ -850,10 +886,6 @@ static void khugepaged_alloc_sleep(void)
>   	remove_wait_queue(&khugepaged_wait, &wait);
>   }
>   
> -struct collapse_control khugepaged_collapse_control = {
> -	.is_khugepaged = true,
> -};
> -
>   static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
>   {
>   	int i;
> @@ -1102,7 +1134,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>   
>   static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   			      int referenced, int unmapped,
> -			      struct collapse_control *cc)
> +			      struct collapse_control *cc, bool *mmap_locked,
> +				  int order, int offset)
>   {
>   	LIST_HEAD(compound_pagelist);
>   	pmd_t *pmd, _pmd;
> @@ -1115,6 +1148,11 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   	struct mmu_notifier_range range;
>   	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>   
> +	/* if collapsing mTHPs we may have already released the read_lock, and
> +	 * need to reaquire it to keep the proper locking order.
> +	 */
> +	if (!*mmap_locked)
> +		mmap_read_lock(mm);

There is no need to take the read lock again, because we drop it just
after this.

>   	/*
>   	 * Before allocating the hugepage, release the mmap_lock read lock.
>   	 * The allocation can take potentially a long time if it involves
> @@ -1122,6 +1160,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   	 * that. We will recheck the vma after taking it again in write mode.
>   	 */
>   	mmap_read_unlock(mm);
> +	*mmap_locked = false;
>   
>   	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>   	if (result != SCAN_SUCCEED)
> @@ -1256,12 +1295,71 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   out_up_write:
>   	mmap_write_unlock(mm);
>   out_nolock:
> +	*mmap_locked = false;
>   	if (folio)
>   		folio_put(folio);
>   	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
>   	return result;
>   }
>   
> +// Recursive function to consume the bitmap
> +static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
> +			int referenced, int unmapped, struct collapse_control *cc,
> +			bool *mmap_locked, unsigned long enabled_orders)
> +{
> +	u8 order, offset;
> +	int num_chunks;
> +	int bits_set, max_percent, threshold_bits;
> +	int next_order, mid_offset;
> +	int top = -1;
> +	int collapsed = 0;
> +	int ret;
> +	struct scan_bit_state state;
> +
> +	cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +		{ HPAGE_PMD_ORDER - MIN_MTHP_ORDER, 0 };
> +
> +	while (top >= 0) {
> +		state = cc->mthp_bitmap_stack[top--];
> +		order = state.order;
> +		offset = state.offset;
> +		num_chunks = 1 << order;
> +		// Skip mTHP orders that are not enabled
> +		if (!(enabled_orders >> (order +  MIN_MTHP_ORDER)) & 1)
> +			goto next;
> +
> +		// copy the relavant section to a new bitmap
> +		bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
> +				  MTHP_BITMAP_SIZE);
> +
> +		bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
> +
> +		// Check if the region is "almost full" based on the threshold
> +		max_percent = ((HPAGE_PMD_NR - khugepaged_max_ptes_none - 1) * 100)
> +						/ (HPAGE_PMD_NR - 1);
> +		threshold_bits = (max_percent * num_chunks) / 100;
> +
> +		if (bits_set >= threshold_bits) {
> +			ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
> +					mmap_locked, order + MIN_MTHP_ORDER, offset * MIN_MTHP_NR);
> +			if (ret == SCAN_SUCCEED)
> +				collapsed += (1 << (order + MIN_MTHP_ORDER));
> +			continue;
> +		}
> +
> +next:
> +		if (order > 0) {
> +			next_order = order - 1;
> +			mid_offset = offset + (num_chunks / 2);
> +			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +				{ next_order, mid_offset };
> +			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +				{ next_order, offset };
> +			}
> +	}
> +	return collapsed;
> +}
> +
>   static int khugepaged_scan_pmd(struct mm_struct *mm,
>   				   struct vm_area_struct *vma,
>   				   unsigned long address, bool *mmap_locked,
> @@ -1430,7 +1528,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   	pte_unmap_unlock(pte, ptl);
>   	if (result == SCAN_SUCCEED) {
>   		result = collapse_huge_page(mm, address, referenced,
> -					    unmapped, cc);
> +					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
>   		/* collapse_huge_page will return with the mmap_lock released */
>   		*mmap_locked = false;
>   	}
> @@ -2767,6 +2865,21 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>   		return -ENOMEM;
>   	cc->is_khugepaged = false;
>   
> +	cc->mthp_bitmap = kmalloc_array(
> +		BITS_TO_LONGS(MTHP_BITMAP_SIZE), sizeof(unsigned long), GFP_KERNEL);
> +	if (!cc->mthp_bitmap)
> +		return -ENOMEM;
> +
> +	cc->mthp_bitmap_temp = kmalloc_array(
> +		BITS_TO_LONGS(MTHP_BITMAP_SIZE), sizeof(unsigned long), GFP_KERNEL);
> +	if (!cc->mthp_bitmap_temp)
> +		return -ENOMEM;
> +
> +	cc->mthp_bitmap_stack = kmalloc_array(
> +		MTHP_BITMAP_SIZE, sizeof(struct scan_bit_state), GFP_KERNEL);
> +	if (!cc->mthp_bitmap_stack)
> +		return -ENOMEM;
> +
>   	mmgrab(mm);
>   	lru_add_drain_all();
>   
> @@ -2831,8 +2944,12 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>   out_nolock:
>   	mmap_assert_locked(mm);
>   	mmdrop(mm);
> +	kfree(cc->mthp_bitmap);
> +	kfree(cc->mthp_bitmap_temp);
> +	kfree(cc->mthp_bitmap_stack);
>   	kfree(cc);
>   
> +
>   	return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
>   			: madvise_collapse_errno(last_fail);
>   }



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 09/11] khugepaged: add mTHP support
  2025-01-08 23:31 ` [RFC 09/11] khugepaged: add " Nico Pache
@ 2025-01-10  9:20   ` Dev Jain
  2025-01-10 13:36   ` Dev Jain
  1 sibling, 0 replies; 53+ messages in thread
From: Dev Jain @ 2025-01-10  9:20 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, sunnanyong, usamaarif642, audra, akpm



On 09/01/25 5:01 am, Nico Pache wrote:
> Introduce the ability for khugepaged to collapse to different mTHP sizes.
> While scanning a PMD range for potential hugepage collapse, track pages
> in MIN_MTHP_ORDER chunks. Each bit represents a fully utilized region of
> order MIN_MTHP_ORDER ptes.
> 
> With this bitmap we can determine which mTHP sizes would be the most
> efficient to collapse to if the PMD collapse is not suitible.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>   mm/khugepaged.c | 111 +++++++++++++++++++++++++++++++++---------------
>   1 file changed, 77 insertions(+), 34 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index de1dc6ea3c71..4d3c560f20b4 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1139,13 +1139,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   {
>   	LIST_HEAD(compound_pagelist);
>   	pmd_t *pmd, _pmd;
> -	pte_t *pte;
> +	pte_t *pte, mthp_pte;
>   	pgtable_t pgtable;
>   	struct folio *folio;
>   	spinlock_t *pmd_ptl, *pte_ptl;
>   	int result = SCAN_FAIL;
>   	struct vm_area_struct *vma;
>   	struct mmu_notifier_range range;
> +	unsigned long _address = address + offset * PAGE_SIZE;
>   	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>   
>   	/* if collapsing mTHPs we may have already released the read_lock, and
> @@ -1162,12 +1163,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   	mmap_read_unlock(mm);
>   	*mmap_locked = false;
>   
> -	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> +	result = alloc_charge_folio(&folio, mm, cc, order);
>   	if (result != SCAN_SUCCEED)
>   		goto out_nolock;
>   
>   	mmap_read_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> +	*mmap_locked = true;
> +	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
>   	if (result != SCAN_SUCCEED) {
>   		mmap_read_unlock(mm);
>   		goto out_nolock;
> @@ -1185,13 +1187,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   		 * released when it fails. So we jump out_nolock directly in
>   		 * that case.  Continuing to collapse causes inconsistency.
>   		 */
> -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> -				referenced, HPAGE_PMD_ORDER);
> +		result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
> +				referenced, order);
>   		if (result != SCAN_SUCCEED)
>   			goto out_nolock;
>   	}
>   
>   	mmap_read_unlock(mm);
> +	*mmap_locked = false;
>   	/*
>   	 * Prevent all access to pagetables with the exception of
>   	 * gup_fast later handled by the ptep_clear_flush and the VM
> @@ -1201,7 +1204,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   	 * mmap_lock.
>   	 */
>   	mmap_write_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> +	result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
>   	if (result != SCAN_SUCCEED)
>   		goto out_up_write;
>   	/* check if the pmd is still valid */
> @@ -1212,11 +1215,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   	vma_start_write(vma);
>   	anon_vma_lock_write(vma->anon_vma);
>   
> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> -				address + HPAGE_PMD_SIZE);
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> +				_address + (PAGE_SIZE << order));

Since we are nuking the PMD in both cases, we do not need to change this
for the order; the range should remain address + HPAGE_PMD_SIZE.
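I.e. roughly this (untested, just keeping the PMD-wide range; address is
already PMD-aligned given the VM_BUG_ON above):

	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
				address + HPAGE_PMD_SIZE);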

>   	mmu_notifier_invalidate_range_start(&range);
>   
>   	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> +
>   	/*
>   	 * This removes any huge TLB entry from the CPU so we won't allow
>   	 * huge and small TLB entries for the same virtual address to
> @@ -1230,10 +1234,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   	mmu_notifier_invalidate_range_end(&range);
>   	tlb_remove_table_sync_one();
>   
> -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> +	pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
>   	if (pte) {
> -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> -					&compound_pagelist, HPAGE_PMD_ORDER);
> +		result = __collapse_huge_page_isolate(vma, _address, pte, cc,
> +					&compound_pagelist, order);
>   		spin_unlock(pte_ptl);
>   	} else {
>   		result = SCAN_PMD_NULL;
> @@ -1262,8 +1266,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   	anon_vma_unlock_write(vma->anon_vma);
>   
>   	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> -					   vma, address, pte_ptl,
> -					   &compound_pagelist, HPAGE_PMD_ORDER);
> +					   vma, _address, pte_ptl,
> +					   &compound_pagelist, order);
>   	pte_unmap(pte);
>   	if (unlikely(result != SCAN_SUCCEED))
>   		goto out_up_write;
> @@ -1274,20 +1278,37 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   	 * write.
>   	 */
>   	__folio_mark_uptodate(folio);
> -	pgtable = pmd_pgtable(_pmd);
> -
> -	_pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
> -	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> -
> -	spin_lock(pmd_ptl);
> -	BUG_ON(!pmd_none(*pmd));
> -	folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> -	folio_add_lru_vma(folio, vma);
> -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
> -	set_pmd_at(mm, address, pmd, _pmd);
> -	update_mmu_cache_pmd(vma, address, pmd);
> -	deferred_split_folio(folio, false);
> -	spin_unlock(pmd_ptl);
> +	if (order == HPAGE_PMD_ORDER) {
> +		pgtable = pmd_pgtable(_pmd);
> +		_pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
> +		_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> +
> +		spin_lock(pmd_ptl);
> +		BUG_ON(!pmd_none(*pmd));
> +		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> +		folio_add_lru_vma(folio, vma);
> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> +		set_pmd_at(mm, address, pmd, _pmd);
> +		update_mmu_cache_pmd(vma, address, pmd);
> +		deferred_split_folio(folio, false);
> +		spin_unlock(pmd_ptl);
> +	} else { //mTHP
> +		mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
> +		mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
> +
> +		spin_lock(pmd_ptl);
> +		folio_ref_add(folio, (1 << order) - 1);
> +		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> +		folio_add_lru_vma(folio, vma);
> +		spin_lock(pte_ptl);
> +		set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
> +		update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
> +		spin_unlock(pte_ptl);
> +		smp_wmb(); /* make pte visible before pmd */
> +		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> +		deferred_split_folio(folio, false);
> +		spin_unlock(pmd_ptl);
> +	}

You have nested the locks here: lock(pmd_ptl) -> lock(pte_ptl) ->
unlock(pte_ptl) -> unlock(pmd_ptl). Anyway, you do not need to take
pmd_ptl while you are setting the ptes. I am almost done with my v2, and
in my view this function should look like this:

/* Similar to the PMD case except we have to batch set the PTEs */
static int vma_collapse_anon_folio(struct mm_struct *mm, unsigned long 
address,
		struct vm_area_struct *vma, struct collapse_control *cc, pmd_t *pmd,
		struct folio *folio, int order)
{
	LIST_HEAD(compound_pagelist);
	spinlock_t *pmd_ptl, *pte_ptl;
	int result = SCAN_FAIL;
	struct mmu_notifier_range range;
	pmd_t _pmd;
	pte_t *pte;
	pte_t entry;
	int nr_pages = folio_nr_pages(folio);
	unsigned long haddress = address & HPAGE_PMD_MASK;

	VM_BUG_ON(address & ((1UL << order) - 1));

	mmap_read_unlock(mm);

	mmap_write_lock(mm);
	result = hugepage_vma_revalidate(mm, address, true, &vma, order, cc);
	if (result != SCAN_SUCCEED)
		goto out_up_write;
	result = check_pmd_still_valid(mm, address, pmd);
	if (result != SCAN_SUCCEED)
		goto out_up_write;

	vma_start_write(vma);
	anon_vma_lock_write(vma->anon_vma);

	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, haddress,
				haddress + HPAGE_PMD_SIZE);
	mmu_notifier_invalidate_range_start(&range);

	pmd_ptl = pmd_lock(mm, pmd);
	_pmd = pmdp_collapse_flush(vma, haddress, pmd);
	spin_unlock(pmd_ptl);
	mmu_notifier_invalidate_range_end(&range);
	tlb_remove_table_sync_one();

	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
	if (pte) {
		result = __collapse_huge_page_isolate(vma, address, pte, cc,
						      &compound_pagelist, order);
		spin_unlock(pte_ptl);
	} else {
		result = SCAN_PMD_NULL;
	}

	if (unlikely(result != SCAN_SUCCEED)) {
		if (pte)
			pte_unmap(pte);
		spin_lock(pmd_ptl);
		BUG_ON(!pmd_none(*pmd));
		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
		spin_unlock(pmd_ptl);
		anon_vma_unlock_write(vma->anon_vma);
		goto out_up_write;
	}

	anon_vma_unlock_write(vma->anon_vma);

	__folio_mark_uptodate(folio);
	entry = mk_pte(&folio->page, vma->vm_page_prot);
	entry = maybe_mkwrite(pte_mkdirty(entry), vma);

	result = __collapse_huge_page_copy(pte, folio, pmd, *pmd,
					   vma, address, pte_ptl,
					   &compound_pagelist, order);
	pte_unmap(pte);
	if (unlikely(result != SCAN_SUCCEED))
		goto out_up_write;

	folio_ref_add(folio, nr_pages - 1);
	folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
	folio_add_lru_vma(folio, vma);
	spin_lock(pte_ptl);
	set_ptes(mm, address, pte, entry, nr_pages);
	spin_unlock(pte_ptl);
	spin_lock(pmd_ptl);

	/* See pmd_install() */
	smp_wmb();
	pmd_populate(mm, pmd, pmd_pgtable(_pmd));
	update_mmu_cache_pmd(vma, haddress, pmd);
	spin_unlock(pmd_ptl);

	result = SCAN_SUCCEED;
out_up_write:
	mmap_write_unlock(mm);
	return result;
}


The difference being, I take the pte_ptl, set the ptes, drop the 
pte_ptl, then take pmd_ptl, do pmd_populate(). Now, instead of 
update_mmu_cache_range() in the mTHP case, we still need to do 
update_mmu_cache_pmd() since we are repopulating the PMD. And, IIUC 
update_mmu_cache_pmd() is a superset of update_mmu_cache_range(), so we 
can drop the latter altogether.

>   
>   	folio = NULL;
>   
> @@ -1367,21 +1388,26 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   {
>   	pmd_t *pmd;
>   	pte_t *pte, *_pte;
> +	int i;
>   	int result = SCAN_FAIL, referenced = 0;
>   	int none_or_zero = 0, shared = 0;
>   	struct page *page = NULL;
>   	struct folio *folio = NULL;
>   	unsigned long _address;
> +	unsigned long enabled_orders;
>   	spinlock_t *ptl;
>   	int node = NUMA_NO_NODE, unmapped = 0;
>   	bool writable = false;
> -
> +	bool all_valid = true;
> +	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
>   	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>   
>   	result = find_pmd_or_thp_or_none(mm, address, &pmd);
>   	if (result != SCAN_SUCCEED)
>   		goto out;
>   
> +	bitmap_zero(cc->mthp_bitmap, 1 << (HPAGE_PMD_ORDER - MIN_MTHP_ORDER));
> +	bitmap_zero(cc->mthp_bitmap_temp, 1 << (HPAGE_PMD_ORDER - MIN_MTHP_ORDER));
>   	memset(cc->node_load, 0, sizeof(cc->node_load));
>   	nodes_clear(cc->alloc_nmask);
>   	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> @@ -1390,8 +1416,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   		goto out;
>   	}
>   
> -	for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, _address += PAGE_SIZE) {
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		if (i % MIN_MTHP_NR == 0)
> +			all_valid = true;
> +
> +		_pte = pte + i;
> +		_address = address + i * PAGE_SIZE;
>   		pte_t pteval = ptep_get(_pte);
>   		if (is_swap_pte(pteval)) {
>   			++unmapped;
> @@ -1414,6 +1444,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   			}
>   		}
>   		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> +			all_valid = false;
>   			++none_or_zero;
>   			if (!userfaultfd_armed(vma) &&
>   			    (!cc->is_khugepaged ||
> @@ -1514,7 +1545,15 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   		     folio_test_referenced(folio) || mmu_notifier_test_young(vma->vm_mm,
>   								     address)))
>   			referenced++;
> +
> +		/*
> +		 * we are reading in MIN_MTHP_NR page chunks. if there are no empty
> +		 * pages keep track of it in the bitmap for mTHP collapsing.
> +		 */
> +		if (all_valid && (i + 1) % MIN_MTHP_NR == 0)
> +			bitmap_set(cc->mthp_bitmap, i / MIN_MTHP_NR, 1);
>   	}
> +
>   	if (!writable) {
>   		result = SCAN_PAGE_RO;
>   	} else if (cc->is_khugepaged &&
> @@ -1527,10 +1566,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   out_unmap:
>   	pte_unmap_unlock(pte, ptl);
>   	if (result == SCAN_SUCCEED) {
> -		result = collapse_huge_page(mm, address, referenced,
> -					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
> -		/* collapse_huge_page will return with the mmap_lock released */
> -		*mmap_locked = false;
> +		enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> +			tva_flags, THP_ORDERS_ALL_ANON);
> +		result = khugepaged_scan_bitmap(mm, address, referenced, unmapped, cc,
> +			       mmap_locked, enabled_orders);
> +		if (result > 0)
> +			result = SCAN_SUCCEED;
>   	}
>   out:
>   	trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
> @@ -2477,11 +2518,13 @@ static int khugepaged_collapse_single_pmd(unsigned long addr, struct mm_struct *
>   			fput(file);
>   			if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
>   				mmap_read_lock(mm);
> +				*mmap_locked = true;
>   				if (khugepaged_test_exit_or_disable(mm))
>   					goto end;
>   				result = collapse_pte_mapped_thp(mm, addr,
>   								 !cc->is_khugepaged);
>   				mmap_read_unlock(mm);
> +				*mmap_locked = false;
>   			}
>   		} else {
>   			result = khugepaged_scan_pmd(mm, vma, addr,



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 09/11] khugepaged: add mTHP support
  2025-01-08 23:31 ` [RFC 09/11] khugepaged: add " Nico Pache
  2025-01-10  9:20   ` Dev Jain
@ 2025-01-10 13:36   ` Dev Jain
  1 sibling, 0 replies; 53+ messages in thread
From: Dev Jain @ 2025-01-10 13:36 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, sunnanyong, usamaarif642, audra, akpm



On 09/01/25 5:01 am, Nico Pache wrote:
> Introduce the ability for khugepaged to collapse to different mTHP sizes.
> While scanning a PMD range for potential hugepage collapse, track pages
> in MIN_MTHP_ORDER chunks. Each bit represents a fully utilized region of
> order MIN_MTHP_ORDER ptes.
> 
> With this bitmap we can determine which mTHP sizes would be the most
> efficient to collapse to if the PMD collapse is not suitable.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>

For the actual bitmap optimization: give me some time, I'll get back to you.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 08/11] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  2025-01-08 23:31 ` [RFC 08/11] khugepaged: introduce khugepaged_scan_bitmap " Nico Pache
  2025-01-10  9:05   ` Dev Jain
@ 2025-01-10 14:54   ` Dev Jain
  2025-01-10 21:48     ` Nico Pache
  2025-01-12 15:13   ` Dev Jain
  2 siblings, 1 reply; 53+ messages in thread
From: Dev Jain @ 2025-01-10 14:54 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, sunnanyong, usamaarif642, audra, akpm



On 09/01/25 5:01 am, Nico Pache wrote:
> khugepaged scans PMD ranges for potential collapse to a hugepage. To add
> mTHP support we use this scan to instead record chunks of fully utilized
> sections of the PMD.
> 
> create a bitmap to represent a PMD in order MTHP_MIN_ORDER chunks.
> by default we will set this to order 3. The reasoning is that for 4K 512
> PMD size this results in a 64 bit bitmap which has some optimizations.
> For other arches like ARM64 64K, we can set a larger order if needed.
> 
> khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
> that represents chunks of fully utilized regions. We can then determine
> what mTHP size fits best and in the following patch, we set this bitmap
> while scanning the PMD.
> 
> max_ptes_none is used as a scale to determine how "full" an order must
> be before being considered for collapse.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>   include/linux/khugepaged.h |   4 +-
>   mm/khugepaged.c            | 129 +++++++++++++++++++++++++++++++++++--
>   2 files changed, 126 insertions(+), 7 deletions(-)
> 

[--snip--]

>   
> +// Recursive function to consume the bitmap
> +static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
> +			int referenced, int unmapped, struct collapse_control *cc,
> +			bool *mmap_locked, unsigned long enabled_orders)
> +{
> +	u8 order, offset;
> +	int num_chunks;
> +	int bits_set, max_percent, threshold_bits;
> +	int next_order, mid_offset;
> +	int top = -1;
> +	int collapsed = 0;
> +	int ret;
> +	struct scan_bit_state state;
> +
> +	cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +		{ HPAGE_PMD_ORDER - MIN_MTHP_ORDER, 0 };
> +
> +	while (top >= 0) {
> +		state = cc->mthp_bitmap_stack[top--];
> +		order = state.order;
> +		offset = state.offset;
> +		num_chunks = 1 << order;
> +		// Skip mTHP orders that are not enabled
> +		if (!(enabled_orders >> (order +  MIN_MTHP_ORDER)) & 1)
> +			goto next;
> +
> +		// copy the relevant section to a new bitmap
> +		bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
> +				  MTHP_BITMAP_SIZE);
> +
> +		bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
> +
> +		// Check if the region is "almost full" based on the threshold
> +		max_percent = ((HPAGE_PMD_NR - khugepaged_max_ptes_none - 1) * 100)
> +						/ (HPAGE_PMD_NR - 1);
> +		threshold_bits = (max_percent * num_chunks) / 100;
> +
> +		if (bits_set >= threshold_bits) {
> +			ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
> +					mmap_locked, order + MIN_MTHP_ORDER, offset * MIN_MTHP_NR);
> +			if (ret == SCAN_SUCCEED)
> +				collapsed += (1 << (order + MIN_MTHP_ORDER));
> +			continue;
> +		}

We drop to the lower order when the current order is not in the allowed
mask of orders, or when we are below the threshold. But what happens
when neither of those conditions applies and it is collapse_huge_page()
itself that fails? For example, if you start with a PMD-order scan and
collapse_huge_page() fails, you hit "continue" and then exit the loop
because there is nothing else on the stack, so we return without ever
trying mTHPs.

> +
> +next:
> +		if (order > 0) {
> +			next_order = order - 1;
> +			mid_offset = offset + (num_chunks / 2);
> +			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +				{ next_order, mid_offset };
> +			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +				{ next_order, offset };
> +			}
> +	}
> +	return collapsed;
> +}
> +


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 03/11] khugepaged: Don't allocate khugepaged mm_slot early
  2025-01-10  6:11   ` Dev Jain
@ 2025-01-10 19:37     ` Nico Pache
  0 siblings, 0 replies; 53+ messages in thread
From: Nico Pache @ 2025-01-10 19:37 UTC (permalink / raw)
  To: Dev Jain
  Cc: linux-kernel, linux-mm, ryan.roberts, anshuman.khandual,
	catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
	baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
	peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
	vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy,
	kirill.shutemov, david, aarcange, raquini, sunnanyong,
	usamaarif642, audra, akpm

On Thu, Jan 9, 2025 at 11:11 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 09/01/25 5:01 am, Nico Pache wrote:
> > We should only "enter"/allocate the khugepaged mm_slot if we succeed at
> > allocating the PMD sized folio. Move the khugepaged_enter_vma call until
> > after we know the vma_alloc_folio was successful.
>
> Why? We have the appropriate checks from thp_vma_allowable_orders() and
> friends, so the VMA should be registered with khugepaged irrespective of
> whether during fault time we are able to allocate a PMD-THP or not. If
> we fail at fault time, it is the job of khugepaged to try to collapse it
> later.

That's a fair point. This was written a while back when I first
started looking into khugepaged. I believe the current scheme for
khugepaged_enter_vma is to only register an mm when there is a mapping
large enough for khugepaged to work on. I'd like to remove this
restriction in the future to simplify the entry points of khugepaged;
currently we need to sprinkle these khugepaged_enter_vma calls all over
the place, when ideally we would just register everything with
khugepaged.

Either way, you are correct: even if we return VM_FAULT_FALLBACK, the
mapping would still be eligible for promotion in the future.

I'll drop this patch. Thanks!

> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >   mm/huge_memory.c | 3 +--
> >   1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index e53d83b3e5cf..635c65e7ef63 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1323,7 +1323,6 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> >       ret = vmf_anon_prepare(vmf);
> >       if (ret)
> >               return ret;
> > -     khugepaged_enter_vma(vma, vma->vm_flags);
> >
> >       if (!(vmf->flags & FAULT_FLAG_WRITE) &&
> >                       !mm_forbids_zeropage(vma->vm_mm) &&
> > @@ -1365,7 +1364,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> >               }
> >               return ret;
> >       }
> > -
> > +     khugepaged_enter_vma(vma, vma->vm_flags);
> >       return __do_huge_pmd_anonymous_page(vmf);
> >   }
> >
>
> In any case, you are not achieving what you described in the patch
> description: you have moved khugepaged_enter_vma() after the read fault
> logic, what you want to do is to move it after
> vma_alloc_anon_folio_pmd() in __do_huge_pmd_anonymous_page().
Good catch! This was a byproduct of a rebase... back when I wrote this,
the vma_alloc_folio() call was still in do_huge_pmd_anonymous_page().
>



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 06/11] khugepaged: generalize alloc_charge_folio for mTHP support
  2025-01-10  6:23   ` Dev Jain
@ 2025-01-10 19:41     ` Nico Pache
  0 siblings, 0 replies; 53+ messages in thread
From: Nico Pache @ 2025-01-10 19:41 UTC (permalink / raw)
  To: Dev Jain
  Cc: linux-kernel, linux-mm, ryan.roberts, anshuman.khandual,
	catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
	baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
	peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
	vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy,
	kirill.shutemov, david, aarcange, raquini, sunnanyong,
	usamaarif642, audra, akpm

On Thu, Jan 9, 2025 at 11:24 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 09/01/25 5:01 am, Nico Pache wrote:
> > alloc_charge_folio allocates the new folio for the khugepaged collapse.
> > Generalize the order of the folio allocations to support future mTHP
> > collapsing.
> >
> > No functional changes in this patch.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >   mm/khugepaged.c | 8 ++++----
> >   1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index e2e6ca9265ab..6daf3a943a1a 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1070,14 +1070,14 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> >   }
> >
> >   static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> > -                           struct collapse_control *cc)
> > +                           struct collapse_control *cc, int order)
> >   {
> >       gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> >                    GFP_TRANSHUGE);
> >       int node = khugepaged_find_target_node(cc);
> >       struct folio *folio;
> >
> > -     folio = __folio_alloc(gfp, HPAGE_PMD_ORDER, node, &cc->alloc_nmask);
> > +     folio = __folio_alloc(gfp, order, node, &cc->alloc_nmask);
> >       if (!folio) {
> >               *foliop = NULL;
> >               count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> > @@ -1121,7 +1121,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >        */
> >       mmap_read_unlock(mm);
> >
> > -     result = alloc_charge_folio(&folio, mm, cc);
> > +     result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> >       if (result != SCAN_SUCCEED)
> >               goto out_nolock;
> >
> > @@ -1834,7 +1834,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >       VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
> >       VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
> >
> > -     result = alloc_charge_folio(&new_folio, mm, cc);
> > +     result = alloc_charge_folio(&new_folio, mm, cc, HPAGE_PMD_ORDER);
> >       if (result != SCAN_SUCCEED)
> >               goto out;
> >
>
> I guess we will need stat updates like I did in my patch.

Yeah, stats were on my TODO list, as well as cleaning up some of the
tracing. Those will be done before the PATCH posting.

>



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 08/11] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  2025-01-10 14:54   ` Dev Jain
@ 2025-01-10 21:48     ` Nico Pache
  0 siblings, 0 replies; 53+ messages in thread
From: Nico Pache @ 2025-01-10 21:48 UTC (permalink / raw)
  To: Dev Jain
  Cc: linux-kernel, linux-mm, ryan.roberts, anshuman.khandual,
	catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
	baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
	peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
	vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy,
	kirill.shutemov, david, aarcange, raquini, sunnanyong,
	usamaarif642, audra, akpm

On Fri, Jan 10, 2025 at 7:54 AM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 09/01/25 5:01 am, Nico Pache wrote:
> > khugepaged scans PMD ranges for potential collapse to a hugepage. To add
> > mTHP support we use this scan to instead record chunks of fully utilized
> > sections of the PMD.
> >
> > create a bitmap to represent a PMD in order MTHP_MIN_ORDER chunks.
> > by default we will set this to order 3. The reasoning is that for 4K 512
> > PMD size this results in a 64 bit bitmap which has some optimizations.
> > For other arches like ARM64 64K, we can set a larger order if needed.
> >
> > khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
> > that represents chunks of fully utilized regions. We can then determine
> > what mTHP size fits best and in the following patch, we set this bitmap
> > while scanning the PMD.
> >
> > max_ptes_none is used as a scale to determine how "full" an order must
> > be before being considered for collapse.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >   include/linux/khugepaged.h |   4 +-
> >   mm/khugepaged.c            | 129 +++++++++++++++++++++++++++++++++++--
> >   2 files changed, 126 insertions(+), 7 deletions(-)
> >
>
> [--snip--]
>
> >
> > +// Recursive function to consume the bitmap
> > +static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
> > +                     int referenced, int unmapped, struct collapse_control *cc,
> > +                     bool *mmap_locked, unsigned long enabled_orders)
> > +{
> > +     u8 order, offset;
> > +     int num_chunks;
> > +     int bits_set, max_percent, threshold_bits;
> > +     int next_order, mid_offset;
> > +     int top = -1;
> > +     int collapsed = 0;
> > +     int ret;
> > +     struct scan_bit_state state;
> > +
> > +     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > +             { HPAGE_PMD_ORDER - MIN_MTHP_ORDER, 0 };
> > +
> > +     while (top >= 0) {
> > +             state = cc->mthp_bitmap_stack[top--];
> > +             order = state.order;
> > +             offset = state.offset;
> > +             num_chunks = 1 << order;
> > +             // Skip mTHP orders that are not enabled
> > +             if (!(enabled_orders >> (order +  MIN_MTHP_ORDER)) & 1)
> > +                     goto next;
> > +
> > +             // copy the relevant section to a new bitmap
> > +             bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
> > +                               MTHP_BITMAP_SIZE);
> > +
> > +             bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
> > +
> > +             // Check if the region is "almost full" based on the threshold
> > +             max_percent = ((HPAGE_PMD_NR - khugepaged_max_ptes_none - 1) * 100)
> > +                                             / (HPAGE_PMD_NR - 1);
> > +             threshold_bits = (max_percent * num_chunks) / 100;
> > +
> > +             if (bits_set >= threshold_bits) {
> > +                     ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
> > +                                     mmap_locked, order + MIN_MTHP_ORDER, offset * MIN_MTHP_NR);
> > +                     if (ret == SCAN_SUCCEED)
> > +                             collapsed += (1 << (order + MIN_MTHP_ORDER));
> > +                     continue;
> > +             }
>
> We are going to the lower order when it is not in the allowed mask of
> orders, or when we are below the threshold. What to do when these
> conditions do not happen, and the reason for collapse failure is
> collapse_huge_page()? For example, if you start with a PMD order scan,
> and collapse_huge_page() fails, then you hit "continue", and then exit
> the loop because there is nothing else in the stack, so we exit without
> trying mTHPs.

Thanks for catching that; I introduced that bug when I went from the
recursive to the stack-based approach.
The loop should only "continue" on SCAN_SUCCEED; on failure it needs to
fall through to next: and try the lower orders.

I think I also need to handle the case where nothing succeeds in
khugepaged_scan_pmd().
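
For the first point, that would look roughly like this (untested sketch;
note that with the default max_ptes_none = HPAGE_PMD_NR - 1 the
threshold works out to 0 bits, so the PMD-order attempt is always made
first and its failure is exactly the case you describe):

		if (bits_set >= threshold_bits) {
			ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
					mmap_locked, order + MIN_MTHP_ORDER, offset * MIN_MTHP_NR);
			if (ret == SCAN_SUCCEED) {
				collapsed += (1 << (order + MIN_MTHP_ORDER));
				continue;
			}
			/* on failure, fall through to next: so the two
			 * lower-order halves still get pushed and tried */
		}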


>
> > +
> > +next:
> > +             if (order > 0) {
> > +                     next_order = order - 1;
> > +                     mid_offset = offset + (num_chunks / 2);
> > +                     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > +                             { next_order, mid_offset };
> > +                     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > +                             { next_order, offset };
> > +                     }
> > +     }
> > +     return collapsed;
> > +}
> > +
>



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 08/11] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  2025-01-10  9:05   ` Dev Jain
@ 2025-01-10 21:48     ` Nico Pache
  2025-01-12 11:23       ` Dev Jain
  0 siblings, 1 reply; 53+ messages in thread
From: Nico Pache @ 2025-01-10 21:48 UTC (permalink / raw)
  To: Dev Jain
  Cc: linux-kernel, linux-mm, ryan.roberts, anshuman.khandual,
	catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
	baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
	peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
	vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy,
	kirill.shutemov, david, aarcange, raquini, sunnanyong,
	usamaarif642, audra, akpm

On Fri, Jan 10, 2025 at 2:06 AM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 09/01/25 5:01 am, Nico Pache wrote:
> > khugepaged scans PMD ranges for potential collapse to a hugepage. To add
> > mTHP support we use this scan to instead record chunks of fully utilized
> > sections of the PMD.
> >
> > create a bitmap to represent a PMD in order MTHP_MIN_ORDER chunks.
> > by default we will set this to order 3. The reasoning is that for 4K 512
> > PMD size this results in a 64 bit bitmap which has some optimizations.
> > For other arches like ARM64 64K, we can set a larger order if needed.
> >
> > khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
> > that represents chunks of fully utilized regions. We can then determine
> > what mTHP size fits best and in the following patch, we set this bitmap
> > while scanning the PMD.
> >
> > max_ptes_none is used as a scale to determine how "full" an order must
> > be before being considered for collapse.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >   include/linux/khugepaged.h |   4 +-
> >   mm/khugepaged.c            | 129 +++++++++++++++++++++++++++++++++++--
> >   2 files changed, 126 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > index 1f46046080f5..31cff8aeec4a 100644
> > --- a/include/linux/khugepaged.h
> > +++ b/include/linux/khugepaged.h
> > @@ -1,7 +1,9 @@
> >   /* SPDX-License-Identifier: GPL-2.0 */
> >   #ifndef _LINUX_KHUGEPAGED_H
> >   #define _LINUX_KHUGEPAGED_H
> > -
>
> Nit: I don't think this line needs to be deleted.
>
> > +#define MIN_MTHP_ORDER       3
> > +#define MIN_MTHP_NR  (1<<MIN_MTHP_ORDER)
>
> Nit: Insert a space: (1 << MIN_MTHP_ORDER)
>
> > +#define MTHP_BITMAP_SIZE  (1<<(HPAGE_PMD_ORDER - MIN_MTHP_ORDER))
> >   extern unsigned int khugepaged_max_ptes_none __read_mostly;
> >   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >   extern struct attribute_group khugepaged_attr_group;
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 9eb161b04ee4..de1dc6ea3c71 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> >
> >   static struct kmem_cache *mm_slot_cache __ro_after_init;
> >
> > +struct scan_bit_state {
> > +     u8 order;
> > +     u8 offset;
> > +};
> > +
> >   struct collapse_control {
> >       bool is_khugepaged;
> >
> > @@ -102,6 +107,15 @@ struct collapse_control {
> >
> >       /* nodemask for allocation fallback */
> >       nodemask_t alloc_nmask;
> > +
> > +     /* bitmap used to collapse mTHP sizes. 1bit = order MIN_MTHP_ORDER mTHP */
> > +     unsigned long *mthp_bitmap;
> > +     unsigned long *mthp_bitmap_temp;
> > +     struct scan_bit_state *mthp_bitmap_stack;
> > +};
> > +
> > +struct collapse_control khugepaged_collapse_control = {
> > +     .is_khugepaged = true,
> >   };
> >
> >   /**
> > @@ -389,6 +403,25 @@ int __init khugepaged_init(void)
> >       if (!mm_slot_cache)
> >               return -ENOMEM;
> >
> > +     /*
> > +      * allocate the bitmaps dynamically since MTHP_BITMAP_SIZE is not known at
> > +      * compile time for some architectures.
> > +      */
> > +     khugepaged_collapse_control.mthp_bitmap = kmalloc_array(
> > +             BITS_TO_LONGS(MTHP_BITMAP_SIZE), sizeof(unsigned long), GFP_KERNEL);
> > +     if (!khugepaged_collapse_control.mthp_bitmap)
> > +             return -ENOMEM;
> > +
> > +     khugepaged_collapse_control.mthp_bitmap_temp = kmalloc_array(
> > +             BITS_TO_LONGS(MTHP_BITMAP_SIZE), sizeof(unsigned long), GFP_KERNEL);
> > +     if (!khugepaged_collapse_control.mthp_bitmap_temp)
> > +             return -ENOMEM;
> > +
> > +     khugepaged_collapse_control.mthp_bitmap_stack = kmalloc_array(
> > +             MTHP_BITMAP_SIZE, sizeof(struct scan_bit_state), GFP_KERNEL);
> > +     if (!khugepaged_collapse_control.mthp_bitmap_stack)
> > +             return -ENOMEM;
> > +
> >       khugepaged_pages_to_scan = HPAGE_PMD_NR * 8;
> >       khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;
> >       khugepaged_max_ptes_swap = HPAGE_PMD_NR / 8;
> > @@ -400,6 +433,9 @@ int __init khugepaged_init(void)
> >   void __init khugepaged_destroy(void)
> >   {
> >       kmem_cache_destroy(mm_slot_cache);
> > +     kfree(khugepaged_collapse_control.mthp_bitmap);
> > +     kfree(khugepaged_collapse_control.mthp_bitmap_temp);
> > +     kfree(khugepaged_collapse_control.mthp_bitmap_stack);
> >   }
> >
> >   static inline int khugepaged_test_exit(struct mm_struct *mm)
> > @@ -850,10 +886,6 @@ static void khugepaged_alloc_sleep(void)
> >       remove_wait_queue(&khugepaged_wait, &wait);
> >   }
> >
> > -struct collapse_control khugepaged_collapse_control = {
> > -     .is_khugepaged = true,
> > -};
> > -
> >   static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
> >   {
> >       int i;
> > @@ -1102,7 +1134,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> >
> >   static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >                             int referenced, int unmapped,
> > -                           struct collapse_control *cc)
> > +                           struct collapse_control *cc, bool *mmap_locked,
> > +                               int order, int offset)
> >   {
> >       LIST_HEAD(compound_pagelist);
> >       pmd_t *pmd, _pmd;
> > @@ -1115,6 +1148,11 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >       struct mmu_notifier_range range;
> >       VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> > +     /* if collapsing mTHPs we may have already released the read_lock, and
> > +      * need to reacquire it to keep the proper locking order.
> > +      */
> > +     if (!*mmap_locked)
> > +             mmap_read_lock(mm);
>
> There is no need to take the read lock again, because we drop it just
> after this.

collapse_huge_page expects the mmap_lock to already be taken, and it
returns with it unlocked. If we are collapsing multiple mTHPs under
the same PMD, then I think we need to reacquire the lock before
calling unlock on it.
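
In khugepaged_scan_bitmap() the intended flow across two collapses in
the same PMD is roughly (sketch only; the order/offset arguments stand
for whatever the bitmap walk picked):

	/* 1st mTHP: entered with *mmap_locked == true; collapse_huge_page()
	 * drops the read lock and returns with *mmap_locked == false */
	ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
				 mmap_locked, order, offset);
	/* 2nd mTHP: the new check re-takes the read lock on entry, so the
	 * existing unlock-before-allocation path sees its expected state */
	ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
				 mmap_locked, next_order, next_offset);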

>
> >       /*
> >        * Before allocating the hugepage, release the mmap_lock read lock.
> >        * The allocation can take potentially a long time if it involves
> > @@ -1122,6 +1160,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >        * that. We will recheck the vma after taking it again in write mode.
> >        */
> >       mmap_read_unlock(mm);
> > +     *mmap_locked = false;
> >
> >       result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> >       if (result != SCAN_SUCCEED)
> > @@ -1256,12 +1295,71 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >   out_up_write:
> >       mmap_write_unlock(mm);
> >   out_nolock:
> > +     *mmap_locked = false;
> >       if (folio)
> >               folio_put(folio);
> >       trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
> >       return result;
> >   }
> >
> > +// Recursive function to consume the bitmap
> > +static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
> > +                     int referenced, int unmapped, struct collapse_control *cc,
> > +                     bool *mmap_locked, unsigned long enabled_orders)
> > +{
> > +     u8 order, offset;
> > +     int num_chunks;
> > +     int bits_set, max_percent, threshold_bits;
> > +     int next_order, mid_offset;
> > +     int top = -1;
> > +     int collapsed = 0;
> > +     int ret;
> > +     struct scan_bit_state state;
> > +
> > +     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > +             { HPAGE_PMD_ORDER - MIN_MTHP_ORDER, 0 };
> > +
> > +     while (top >= 0) {
> > +             state = cc->mthp_bitmap_stack[top--];
> > +             order = state.order;
> > +             offset = state.offset;
> > +             num_chunks = 1 << order;
> > +             // Skip mTHP orders that are not enabled
> > +             if (!(enabled_orders >> (order +  MIN_MTHP_ORDER)) & 1)
> > +                     goto next;
> > +
> > +             // copy the relevant section to a new bitmap
> > +             bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
> > +                               MTHP_BITMAP_SIZE);
> > +
> > +             bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
> > +
> > +             // Check if the region is "almost full" based on the threshold
> > +             max_percent = ((HPAGE_PMD_NR - khugepaged_max_ptes_none - 1) * 100)
> > +                                             / (HPAGE_PMD_NR - 1);
> > +             threshold_bits = (max_percent * num_chunks) / 100;
> > +
> > +             if (bits_set >= threshold_bits) {
> > +                     ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
> > +                                     mmap_locked, order + MIN_MTHP_ORDER, offset * MIN_MTHP_NR);
> > +                     if (ret == SCAN_SUCCEED)
> > +                             collapsed += (1 << (order + MIN_MTHP_ORDER));
> > +                     continue;
> > +             }
> > +
> > +next:
> > +             if (order > 0) {
> > +                     next_order = order - 1;
> > +                     mid_offset = offset + (num_chunks / 2);
> > +                     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > +                             { next_order, mid_offset };
> > +                     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > +                             { next_order, offset };
> > +                     }
> > +     }
> > +     return collapsed;
> > +}
> > +
> >   static int khugepaged_scan_pmd(struct mm_struct *mm,
> >                                  struct vm_area_struct *vma,
> >                                  unsigned long address, bool *mmap_locked,
> > @@ -1430,7 +1528,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >       pte_unmap_unlock(pte, ptl);
> >       if (result == SCAN_SUCCEED) {
> >               result = collapse_huge_page(mm, address, referenced,
> > -                                         unmapped, cc);
> > +                                         unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
> >               /* collapse_huge_page will return with the mmap_lock released */
> >               *mmap_locked = false;
> >       }
> > @@ -2767,6 +2865,21 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >               return -ENOMEM;
> >       cc->is_khugepaged = false;
> >
> > +     cc->mthp_bitmap = kmalloc_array(
> > +             BITS_TO_LONGS(MTHP_BITMAP_SIZE), sizeof(unsigned long), GFP_KERNEL);
> > +     if (!cc->mthp_bitmap)
> > +             return -ENOMEM;
> > +
> > +     cc->mthp_bitmap_temp = kmalloc_array(
> > +             BITS_TO_LONGS(MTHP_BITMAP_SIZE), sizeof(unsigned long), GFP_KERNEL);
> > +     if (!cc->mthp_bitmap_temp)
> > +             return -ENOMEM;
> > +
> > +     cc->mthp_bitmap_stack = kmalloc_array(
> > +             MTHP_BITMAP_SIZE, sizeof(struct scan_bit_state), GFP_KERNEL);
> > +     if (!cc->mthp_bitmap_stack)
> > +             return -ENOMEM;
> > +
> >       mmgrab(mm);
> >       lru_add_drain_all();
> >
> > @@ -2831,8 +2944,12 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >   out_nolock:
> >       mmap_assert_locked(mm);
> >       mmdrop(mm);
> > +     kfree(cc->mthp_bitmap);
> > +     kfree(cc->mthp_bitmap_temp);
> > +     kfree(cc->mthp_bitmap_stack);
> >       kfree(cc);
> >
> > +
> >       return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
> >                       : madvise_collapse_errno(last_fail);
> >   }
>



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-10  4:56     ` Dev Jain
@ 2025-01-10 22:01       ` Nico Pache
  2025-01-12 14:11         ` Dev Jain
  0 siblings, 1 reply; 53+ messages in thread
From: Nico Pache @ 2025-01-10 22:01 UTC (permalink / raw)
  To: Dev Jain
  Cc: linux-kernel, linux-mm, ryan.roberts, anshuman.khandual,
	catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
	baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
	peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
	vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy,
	kirill.shutemov, david, aarcange, raquini, sunnanyong,
	usamaarif642, audra, akpm

On Thu, Jan 9, 2025 at 9:56 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 10/01/25 7:57 am, Nico Pache wrote:
> > On Wed, Jan 8, 2025 at 11:22 PM Dev Jain <dev.jain@arm.com> wrote:
> >>
> >>
> >> On 09/01/25 5:01 am, Nico Pache wrote:
> >>> The following series provides khugepaged and madvise collapse with the
> >>> capability to collapse regions to mTHPs.
> >>>
> >>> To achieve this we generalize the khugepaged functions to no longer depend
> >>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> >>> (defined by MTHP_MIN_ORDER) that are fully utilized. This info is tracked
> >>> using a bitmap. After the PMD scan is done, we do binary recursion on the
> >>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> >>> on max_ptes_none is removed during the scan, to make sure we account for
> >>> the whole PMD range. max_ptes_none is mapped to a 0-100 range to
> >>> determine how full a mTHP order needs to be before collapsing it.
> >>>
> >>> Some design choices to note:
> >>>    - bitmap structures are allocated dynamically because on some arch's
> >>>       (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
> >>>       compile time leading to warnings.
> >>>    - The recursion is masked through a stack structure.
> >>>    - A MTHP_MIN_ORDER was added to compress the bitmap, and ensure it was
> >>>       64bit on x86. This provides some optimization on the bitmap operations.
> >>>       if other arches/configs that have larger than 512 PTEs per PMD want to
> >>>       compress their bitmap further we can change this value per arch.
> >>>
> >>> Patch 1-2:  Some refactoring to combine madvise_collapse and khugepaged
> >>> Patch 3:    A minor "fix"/optimization
> >>> Patch 4:    Refactor/rename hpage_collapse
> >>> Patch 5-7:  Generalize khugepaged functions for arbitrary orders
> >>> Patch 8-11: The mTHP patches
> >>>
> >>> This series acts as an alternative to Dev Jain's approach [1]. The two
> >>> series differ in a few ways:
> >>>     - My approach uses a bitmap to store the state of the linear scan_pmd to
> >>>       then determine potential mTHP batches. Devs incorporates his directly
> >>>       into the scan, and will try each available order.
> >>>     - Dev is attempting to optimize the locking, while my approach keeps the
> >>>       locking changes to a minimum. I believe his changes are not safe for
> >>>       uffd.
> >>>     - Dev's changes only work for khugepaged not madvise_collapse (although
> >>>       i think that was by choice and it could easily support madvise)
> >>>     - Dev scales all khugepaged sysfs tunables by order, while im removing
> >>>       the restriction of max_ptes_none and converting it to a scale to
> >>>       determine a (m)THP threshold.
> >>>     - Dev turns on khugepaged if any order is available while mine still
> >>>       only runs if PMDs are enabled. I like Dev's approach and will most
> >>>       likely do the same in my PATCH posting.
> >>>     - mTHPs need their ref count updated to 1<<order, which Dev is missing.
> >>>
> >>> Patch 11 was inspired by one of Dev's changes.
> >>>
> >>> [1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
> >>>
> >>> Nico Pache (11):
> >>>     introduce khugepaged_collapse_single_pmd to collapse a single pmd
> >>>     khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot
> >>>     khugepaged: Don't allocate khugepaged mm_slot early
> >>>     khugepaged: rename hpage_collapse_* to khugepaged_*
> >>>     khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >>>     khugepaged: generalize alloc_charge_folio for mTHP support
> >>>     khugepaged: generalize __collapse_huge_page_* for mTHP support
> >>>     khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >>>     khugepaged: add mTHP support
> >>>     khugepaged: remove max_ptes_none restriction on the pmd scan
> >>>     khugepaged: skip collapsing mTHP to smaller orders
> >>>
> >>>    include/linux/khugepaged.h |   4 +-
> >>>    mm/huge_memory.c           |   3 +-
> >>>    mm/khugepaged.c            | 436 +++++++++++++++++++++++++------------
> >>>    3 files changed, 306 insertions(+), 137 deletions(-)
> >>
> >> Before I take a proper look at your series, can you please include any testing
> >> you may have done?
> >
> > I built these changes for the following arches: x86_64, arm64,
> > arm64-64k, ppc64le, s390x
> >
> > x86 testing:
> > - Selftests mm
> > - some stress-ng tests
> > - compile kernel
> > - I did some tests with my defer [1] set on top. This pushes all the
> > work to khugepaged, which removes the noise of all the PF allocations.
> >
> > I recently got an ARM64 machine and did some simple sanity tests (on
> > both 4k and 64k) like selftests, stress-ng, and playing around with
> > the tunables, etc.
> >
> > I will also be running all the builds through our CI, and perf testing
> > environments before posting.
> >
> > [1] https://lore.kernel.org/lkml/20240729222727.64319-1-npache@redhat.com/
> >
> >>
> >
> I tested your series with the program I was using and it is not working;
> can you please confirm it.

Yes, this is expected: your test touches only one page per 64K, so no
32K chunk (order MIN_MTHP_ORDER) is ever fully populated and no bit is
ever set in the bitmap.
I should probably add a threshold to the scan_pmd pass so a bit is set
once at least half of the chunk is populated, or scale it on
max_ptes_none like I do in scan_bitmap.
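
Something like this in the scan loop, for example (rough, untested
sketch; chunk_filled would be a new local counting the present ptes in
the current MIN_MTHP_NR chunk):

		if (!pte_none(pteval) && !is_zero_pfn(pte_pfn(pteval)))
			chunk_filled++;
		if ((i + 1) % MIN_MTHP_NR == 0) {
			/* half full for now; could scale on max_ptes_none instead */
			if (chunk_filled >= MIN_MTHP_NR / 2)
				bitmap_set(cc->mthp_bitmap, i / MIN_MTHP_NR, 1);
			chunk_filled = 0;
		}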

Thanks for your review-- Have a good weekend!
>
> diff --git a/mytests/mthp.c b/mytests/mthp.c
> new file mode 100644
> index 000000000000..e3029dbcf035
> --- /dev/null
> +++ b/mytests/mthp.c
> @@ -0,0 +1,45 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + *
> + * Author: Dev Jain <dev.jain@arm.com>
> + *
> + * Program to test khugepaged mTHP collapse
> + */
> +
> +#include <unistd.h>
> +#include <sys/ioctl.h>
> +#include <string.h>
> +#include <stdint.h>
> +#include <stdlib.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <sys/mman.h>
> +#include <sys/time.h>
> +#include <sys/random.h>
> +#include <assert.h>
> +
> +int main(int argc, char *argv[])
> +{
> +       char *ptr;
> +       unsigned long mthp_size = (1UL << 16);
> +       size_t chunk_size = (1UL << 25);
> +
> +       ptr = mmap((void *)(1UL << 30), chunk_size, PROT_READ | PROT_WRITE,
> +                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> +       if (((unsigned long)ptr) != (1UL << 30)) {
> +               printf("mmap did not work on required address\n");
> +               return 1;
> +       }
> +
> +       /* Fill first pte in every 64K interval */
> +       for (int i = 0; i < chunk_size; i += mthp_size)
> +               ptr[i] = i;
> +
> +       if (madvise(ptr, chunk_size, MADV_HUGEPAGE)) {
> +               perror("madvise");
> +               return 1;
> +       }
> +       sleep(100);
> +       return 0;
> +}
> --
> 2.30.2
>
> Set enabled = madvise, hugepages-2048k/enabled = hugepages-64k/enabled =
> inherit. Run the program in the background, then run tools/mm/thpmaps.
> You will see PMD collapse correctly, but when you echo never into
> hugepages-2048k/enabled and test this again, you won't see contpte 64K
> collapse. With my series, you will see something like
>
> anon-cont-pte-aligned-64kB : 32768 kB (100%).
>



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 08/11] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  2025-01-10 21:48     ` Nico Pache
@ 2025-01-12 11:23       ` Dev Jain
  2025-01-13 22:25         ` Nico Pache
  0 siblings, 1 reply; 53+ messages in thread
From: Dev Jain @ 2025-01-12 11:23 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-mm, ryan.roberts, anshuman.khandual,
	catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
	baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
	peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
	vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy,
	kirill.shutemov, david, aarcange, raquini, sunnanyong,
	usamaarif642, audra, akpm



On 11/01/25 3:18 am, Nico Pache wrote:
> On Fri, Jan 10, 2025 at 2:06 AM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>>
>> On 09/01/25 5:01 am, Nico Pache wrote:
>>> khugepaged scans PMD ranges for potential collapse to a hugepage. To add
>>> mTHP support we use this scan to instead record chunks of fully utilized
>>> sections of the PMD.
>>>
>>> create a bitmap to represent a PMD in order MTHP_MIN_ORDER chunks.
>>> by default we will set this to order 3. The reasoning is that for 4K 512
>>> PMD size this results in a 64 bit bitmap which has some optimizations.
>>> For other arches like ARM64 64K, we can set a larger order if needed.
>>>
>>> khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
>>> that represents chunks of fully utilized regions. We can then determine
>>> what mTHP size fits best and in the following patch, we set this bitmap
>>> while scanning the PMD.
>>>
>>> max_ptes_none is used as a scale to determine how "full" an order must
>>> be before being considered for collapse.
>>>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>> ---
>>>    include/linux/khugepaged.h |   4 +-
>>>    mm/khugepaged.c            | 129 +++++++++++++++++++++++++++++++++++--
>>>    2 files changed, 126 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
>>> index 1f46046080f5..31cff8aeec4a 100644
>>> --- a/include/linux/khugepaged.h
>>> +++ b/include/linux/khugepaged.h
>>> @@ -1,7 +1,9 @@
>>>    /* SPDX-License-Identifier: GPL-2.0 */
>>>    #ifndef _LINUX_KHUGEPAGED_H
>>>    #define _LINUX_KHUGEPAGED_H
>>> -
>>
>> Nit: I don't think this line needs to be deleted.
>>
>>> +#define MIN_MTHP_ORDER       3
>>> +#define MIN_MTHP_NR  (1<<MIN_MTHP_ORDER)
>>
>> Nit: Insert a space: (1 << MIN_MTHP_ORDER)
>>
>>> +#define MTHP_BITMAP_SIZE  (1<<(HPAGE_PMD_ORDER - MIN_MTHP_ORDER))
>>>    extern unsigned int khugepaged_max_ptes_none __read_mostly;
>>>    #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>    extern struct attribute_group khugepaged_attr_group;
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 9eb161b04ee4..de1dc6ea3c71 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>>>
>>>    static struct kmem_cache *mm_slot_cache __ro_after_init;
>>>
>>> +struct scan_bit_state {
>>> +     u8 order;
>>> +     u8 offset;
>>> +};
>>> +
>>>    struct collapse_control {
>>>        bool is_khugepaged;
>>>
>>> @@ -102,6 +107,15 @@ struct collapse_control {
>>>
>>>        /* nodemask for allocation fallback */
>>>        nodemask_t alloc_nmask;
>>> +
>>> +     /* bitmap used to collapse mTHP sizes. 1bit = order MIN_MTHP_ORDER mTHP */
>>> +     unsigned long *mthp_bitmap;
>>> +     unsigned long *mthp_bitmap_temp;
>>> +     struct scan_bit_state *mthp_bitmap_stack;
>>> +};
>>> +
>>> +struct collapse_control khugepaged_collapse_control = {
>>> +     .is_khugepaged = true,
>>>    };
>>>
>>>    /**
>>> @@ -389,6 +403,25 @@ int __init khugepaged_init(void)
>>>        if (!mm_slot_cache)
>>>                return -ENOMEM;
>>>
>>> +     /*
>>> +      * allocate the bitmaps dynamically since MTHP_BITMAP_SIZE is not known at
>>> +      * compile time for some architectures.
>>> +      */
>>> +     khugepaged_collapse_control.mthp_bitmap = kmalloc_array(
>>> +             BITS_TO_LONGS(MTHP_BITMAP_SIZE), sizeof(unsigned long), GFP_KERNEL);
>>> +     if (!khugepaged_collapse_control.mthp_bitmap)
>>> +             return -ENOMEM;
>>> +
>>> +     khugepaged_collapse_control.mthp_bitmap_temp = kmalloc_array(
>>> +             BITS_TO_LONGS(MTHP_BITMAP_SIZE), sizeof(unsigned long), GFP_KERNEL);
>>> +     if (!khugepaged_collapse_control.mthp_bitmap_temp)
>>> +             return -ENOMEM;
>>> +
>>> +     khugepaged_collapse_control.mthp_bitmap_stack = kmalloc_array(
>>> +             MTHP_BITMAP_SIZE, sizeof(struct scan_bit_state), GFP_KERNEL);
>>> +     if (!khugepaged_collapse_control.mthp_bitmap_stack)
>>> +             return -ENOMEM;
>>> +
>>>        khugepaged_pages_to_scan = HPAGE_PMD_NR * 8;
>>>        khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;
>>>        khugepaged_max_ptes_swap = HPAGE_PMD_NR / 8;
>>> @@ -400,6 +433,9 @@ int __init khugepaged_init(void)
>>>    void __init khugepaged_destroy(void)
>>>    {
>>>        kmem_cache_destroy(mm_slot_cache);
>>> +     kfree(khugepaged_collapse_control.mthp_bitmap);
>>> +     kfree(khugepaged_collapse_control.mthp_bitmap_temp);
>>> +     kfree(khugepaged_collapse_control.mthp_bitmap_stack);
>>>    }
>>>
>>>    static inline int khugepaged_test_exit(struct mm_struct *mm)
>>> @@ -850,10 +886,6 @@ static void khugepaged_alloc_sleep(void)
>>>        remove_wait_queue(&khugepaged_wait, &wait);
>>>    }
>>>
>>> -struct collapse_control khugepaged_collapse_control = {
>>> -     .is_khugepaged = true,
>>> -};
>>> -
>>>    static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
>>>    {
>>>        int i;
>>> @@ -1102,7 +1134,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>>>
>>>    static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>                              int referenced, int unmapped,
>>> -                           struct collapse_control *cc)
>>> +                           struct collapse_control *cc, bool *mmap_locked,
>>> +                               int order, int offset)
>>>    {
>>>        LIST_HEAD(compound_pagelist);
>>>        pmd_t *pmd, _pmd;
>>> @@ -1115,6 +1148,11 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>        struct mmu_notifier_range range;
>>>        VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>>>
>>> +     /* if collapsing mTHPs we may have already released the read_lock, and
>>> +      * need to reacquire it to keep the proper locking order.
>>> +      */
>>> +     if (!*mmap_locked)
>>> +             mmap_read_lock(mm);
>>
>> There is no need to take the read lock again, because we drop it just
>> after this.
> 
> collapse_huge_page expects the mmap_lock to already be taken, and it
> returns with it unlocked. If we are collapsing multiple mTHPs under
> the same PMD, then I think we need to reacquire the lock before
> calling unlock on it.

I cannot figure out a potential place where we drop the lock before 
entering collapse_huge_page(). In any case, wouldn't this be better:
if (*mmap_locked)
	mmap_read_unlock(mm);

Basically, instead of putting the if condition around the lock, you do 
it around the unlock?
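
For concreteness, a minimal sketch of what that would look like in
collapse_huge_page(), replacing both the conditional re-lock and the
unconditional unlock (illustrative only, not the posted code):

	/*
	 * Release the read lock only if the caller still holds it; for
	 * follow-on mTHP collapses in the same PMD it was already dropped.
	 */
	if (*mmap_locked)
		mmap_read_unlock(mm);
	*mmap_locked = false;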

> 
>>
>>>        /*
>>>         * Before allocating the hugepage, release the mmap_lock read lock.
>>>         * The allocation can take potentially a long time if it involves
>>> @@ -1122,6 +1160,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>         * that. We will recheck the vma after taking it again in write mode.
>>>         */
>>>        mmap_read_unlock(mm);
>>> +     *mmap_locked = false;
>>>
>>>        result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>>>        if (result != SCAN_SUCCEED)
>>> @@ -1256,12 +1295,71 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>    out_up_write:
>>>        mmap_write_unlock(mm);
>>>    out_nolock:
>>> +     *mmap_locked = false;
>>>        if (folio)
>>>                folio_put(folio);
>>>        trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
>>>        return result;
>>>    }
>>>
>>> +// Recursive function to consume the bitmap
>>> +static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
>>> +                     int referenced, int unmapped, struct collapse_control *cc,
>>> +                     bool *mmap_locked, unsigned long enabled_orders)
>>> +{
>>> +     u8 order, offset;
>>> +     int num_chunks;
>>> +     int bits_set, max_percent, threshold_bits;
>>> +     int next_order, mid_offset;
>>> +     int top = -1;
>>> +     int collapsed = 0;
>>> +     int ret;
>>> +     struct scan_bit_state state;
>>> +
>>> +     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
>>> +             { HPAGE_PMD_ORDER - MIN_MTHP_ORDER, 0 };
>>> +
>>> +     while (top >= 0) {
>>> +             state = cc->mthp_bitmap_stack[top--];
>>> +             order = state.order;
>>> +             offset = state.offset;
>>> +             num_chunks = 1 << order;
>>> +             // Skip mTHP orders that are not enabled
>>> +             if (!((enabled_orders >> (order + MIN_MTHP_ORDER)) & 1))
>>> +                     goto next;
>>> +
>>> +             // copy the relevant section to a new bitmap
>>> +             bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
>>> +                               MTHP_BITMAP_SIZE);
>>> +
>>> +             bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
>>> +
>>> +             // Check if the region is "almost full" based on the threshold
>>> +             max_percent = ((HPAGE_PMD_NR - khugepaged_max_ptes_none - 1) * 100)
>>> +                                             / (HPAGE_PMD_NR - 1);
>>> +             threshold_bits = (max_percent * num_chunks) / 100;
>>> +
>>> +             if (bits_set >= threshold_bits) {
>>> +                     ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
>>> +                                     mmap_locked, order + MIN_MTHP_ORDER, offset * MIN_MTHP_NR);
>>> +                     if (ret == SCAN_SUCCEED)
>>> +                             collapsed += (1 << (order + MIN_MTHP_ORDER));
>>> +                     continue;
>>> +             }
>>> +
>>> +next:
>>> +             if (order > 0) {
>>> +                     next_order = order - 1;
>>> +                     mid_offset = offset + (num_chunks / 2);
>>> +                     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
>>> +                             { next_order, mid_offset };
>>> +                     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
>>> +                             { next_order, offset };
>>> +                     }
>>> +     }
>>> +     return collapsed;
>>> +}
>>> +
>>>    static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>                                   struct vm_area_struct *vma,
>>>                                   unsigned long address, bool *mmap_locked,
>>> @@ -1430,7 +1528,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>        pte_unmap_unlock(pte, ptl);
>>>        if (result == SCAN_SUCCEED) {
>>>                result = collapse_huge_page(mm, address, referenced,
>>> -                                         unmapped, cc);
>>> +                                         unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
>>>                /* collapse_huge_page will return with the mmap_lock released */
>>>                *mmap_locked = false;
>>>        }
>>> @@ -2767,6 +2865,21 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>>>                return -ENOMEM;
>>>        cc->is_khugepaged = false;
>>>
>>> +     cc->mthp_bitmap = kmalloc_array(
>>> +             BITS_TO_LONGS(MTHP_BITMAP_SIZE), sizeof(unsigned long), GFP_KERNEL);
>>> +     if (!cc->mthp_bitmap)
>>> +             return -ENOMEM;
>>> +
>>> +     cc->mthp_bitmap_temp = kmalloc_array(
>>> +             BITS_TO_LONGS(MTHP_BITMAP_SIZE), sizeof(unsigned long), GFP_KERNEL);
>>> +     if (!cc->mthp_bitmap_temp)
>>> +             return -ENOMEM;
>>> +
>>> +     cc->mthp_bitmap_stack = kmalloc_array(
>>> +             MTHP_BITMAP_SIZE, sizeof(struct scan_bit_state), GFP_KERNEL);
>>> +     if (!cc->mthp_bitmap_stack)
>>> +             return -ENOMEM;
>>> +
>>>        mmgrab(mm);
>>>        lru_add_drain_all();
>>>
>>> @@ -2831,8 +2944,12 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>>>    out_nolock:
>>>        mmap_assert_locked(mm);
>>>        mmdrop(mm);
>>> +     kfree(cc->mthp_bitmap);
>>> +     kfree(cc->mthp_bitmap_temp);
>>> +     kfree(cc->mthp_bitmap_stack);
>>>        kfree(cc);
>>>
>>> +
>>>        return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
>>>                        : madvise_collapse_errno(last_fail);
>>>    }
>>
> 




* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-10 22:01       ` Nico Pache
@ 2025-01-12 14:11         ` Dev Jain
  2025-01-13 23:00           ` Nico Pache
  0 siblings, 1 reply; 53+ messages in thread
From: Dev Jain @ 2025-01-12 14:11 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-mm, ryan.roberts, anshuman.khandual,
	catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
	baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
	peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
	vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy,
	kirill.shutemov, david, aarcange, raquini, sunnanyong,
	usamaarif642, audra, akpm



On 11/01/25 3:31 am, Nico Pache wrote:
> On Thu, Jan 9, 2025 at 9:56 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>>
>> On 10/01/25 7:57 am, Nico Pache wrote:
>>> On Wed, Jan 8, 2025 at 11:22 PM Dev Jain <dev.jain@arm.com> wrote:
>>>>
>>>>
>>>> On 09/01/25 5:01 am, Nico Pache wrote:
>>>>> The following series provides khugepaged and madvise collapse with the
>>>>> capability to collapse regions to mTHPs.
>>>>>
>>>>> To achieve this we generalize the khugepaged functions to no longer depend
>>>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
>>>>> (defined by MTHP_MIN_ORDER) that are fully utilized. This info is tracked
>>>>> using a bitmap. After the PMD scan is done, we do binary recursion on the
>>>>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
>>>>> on max_ptes_none is removed during the scan, to make sure we account for
>>>>> the whole PMD range. max_ptes_none is mapped to a 0-100 range to
>>>>> determine how full a mTHP order needs to be before collapsing it.
>>>>>
>>>>> Some design choices to note:
>>>>>     - bitmap structures are allocated dynamically because on some arch's
>>>>>        (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
>>>>>        compile time leading to warnings.
>>>>>     - The recursion is masked through a stack structure.
>>>>>     - A MTHP_MIN_ORDER was added to compress the bitmap, and ensure it was
>>>>>        64bit on x86. This provides some optimization on the bitmap operations.
>>>>>        if other arches/configs that have larger than 512 PTEs per PMD want to
>>>>>        compress their bitmap further we can change this value per arch.
>>>>>
>>>>> Patch 1-2:  Some refactoring to combine madvise_collapse and khugepaged
>>>>> Patch 3:    A minor "fix"/optimization
>>>>> Patch 4:    Refactor/rename hpage_collapse
>>>>> Patch 5-7:  Generalize khugepaged functions for arbitrary orders
>>>>> Patch 8-11: The mTHP patches
>>>>>
>>>>> This series acts as an alternative to Dev Jain's approach [1]. The two
>>>>> series differ in a few ways:
>>>>>      - My approach uses a bitmap to store the state of the linear scan_pmd to
>>>>>        then determine potential mTHP batches. Devs incorporates his directly
>>>>>        into the scan, and will try each available order.
>>>>>      - Dev is attempting to optimize the locking, while my approach keeps the
>>>>>        locking changes to a minimum. I believe his changes are not safe for
>>>>>        uffd.
>>>>>      - Dev's changes only work for khugepaged not madvise_collapse (although
>>>>>        i think that was by choice and it could easily support madvise)
>>>>>      - Dev scales all khugepaged sysfs tunables by order, while im removing
>>>>>        the restriction of max_ptes_none and converting it to a scale to
>>>>>        determine a (m)THP threshold.
>>>>>      - Dev turns on khugepaged if any order is available while mine still
>>>>>        only runs if PMDs are enabled. I like Dev's approach and will most
>>>>>        likely do the same in my PATCH posting.
>>>>>      - mTHPs need their ref count updated to 1<<order, which Dev is missing.
>>>>>
>>>>> Patch 11 was inspired by one of Dev's changes.
>>>>>
>>>>> [1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
>>>>>
>>>>> Nico Pache (11):
>>>>>      introduce khugepaged_collapse_single_pmd to collapse a single pmd
>>>>>      khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot
>>>>>      khugepaged: Don't allocate khugepaged mm_slot early
>>>>>      khugepaged: rename hpage_collapse_* to khugepaged_*
>>>>>      khugepaged: generalize hugepage_vma_revalidate for mTHP support
>>>>>      khugepaged: generalize alloc_charge_folio for mTHP support
>>>>>      khugepaged: generalize __collapse_huge_page_* for mTHP support
>>>>>      khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>>>>>      khugepaged: add mTHP support
>>>>>      khugepaged: remove max_ptes_none restriction on the pmd scan
>>>>>      khugepaged: skip collapsing mTHP to smaller orders
>>>>>
>>>>>     include/linux/khugepaged.h |   4 +-
>>>>>     mm/huge_memory.c           |   3 +-
>>>>>     mm/khugepaged.c            | 436 +++++++++++++++++++++++++------------
>>>>>     3 files changed, 306 insertions(+), 137 deletions(-)
>>>>
>>>> Before I take a proper look at your series, can you please include any testing
>>>> you may have done?
>>>
>>> I Built these changes for the following arches: x86_64, arm64,
>>> arm64-64k, ppc64le, s390x
>>>
>>> x86 testing:
>>> - Selftests mm
>>> - some stress-ng tests
>>> - compile kernel
>>> - I did some tests with my defer [1] set on top. This pushes all the
>>> work to khugepaged, which removes the noise of all the PF allocations.
>>>
>>> I recently got an ARM64 machine and did some simple sanity tests (on
>>> both 4k and 64k) like selftests, stress-ng, and playing around with
>>> the tunables, etc.
>>>
>>> I will also be running all the builds through our CI, and perf testing
>>> environments before posting.
>>>
>>> [1] https://lore.kernel.org/lkml/20240729222727.64319-1-npache@redhat.com/
>>>
>>>>
>>>
>> I tested your series with the program I was using and it is not working;
>> can you please confirm it.
> 
> Yes, this is expected because you are not fully filling any 32K chunk
> (MIN_MTHP_ORDER) so no bit is ever set.

That is weird, because if this is the case, then PMD-collapse should 
have also failed, but that succeeded. Do you have some userspace program 
I can test with?



* Re: [RFC 08/11] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  2025-01-08 23:31 ` [RFC 08/11] khugepaged: introduce khugepaged_scan_bitmap " Nico Pache
  2025-01-10  9:05   ` Dev Jain
  2025-01-10 14:54   ` Dev Jain
@ 2025-01-12 15:13   ` Dev Jain
  2025-01-12 16:41     ` Dev Jain
  2 siblings, 1 reply; 53+ messages in thread
From: Dev Jain @ 2025-01-12 15:13 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, sunnanyong, usamaarif642, audra, akpm



On 09/01/25 5:01 am, Nico Pache wrote:
> khugepaged scans PMD ranges for potential collapse to a hugepage. To add
> mTHP support we use this scan to instead record chunks of fully utilized
> sections of the PMD.
> 
> create a bitmap to represent a PMD in order MTHP_MIN_ORDER chunks.
> by default we will set this to order 3. The reasoning is that for 4K 512
> PMD size this results in a 64 bit bitmap which has some optimizations.
> For other arches like ARM64 64K, we can set a larger order if needed.
> 
> khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
> that represents chunks of fully utilized regions. We can then determine
> what mTHP size fits best and in the following patch, we set this bitmap
> while scanning the PMD.
> 
> max_ptes_none is used as a scale to determine how "full" an order must
> be before being considered for collapse.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>

Here is my objective and subjective analysis.

--------------------- Mathematical Analysis ----------------------------

First off, in my series, I am missing one thing: When I fail to collapse 
a range as a result of exhausting all orders, I should jump to the next 
range starting with the alignment of the order at which I just failed (i.e., 
the minimum allowable order). Currently I am exiting, which is wasteful. 
This should be easy to extend.

Let's call Nico Pache's method NP, and Dev Jain's method DJ.

The only difference between NP and DJ is the remembrance of the state of 
the PTEs (I have already reverted to using write lock for my v2, see 
this reply: 
https://lore.kernel.org/all/71a2f471-3082-4ca2-ac48-2f664977282f@arm.com/). 
NP saves empty and filled PTEs in a bitmap, and then uses the optimized 
(let us assume them to be constant time operations, hopefully?) bitmap 
APIs, like bitmap_shift_right(), and bitmap_weight(). The latter is what 
determines whether for a particular order, the range has enough filled 
PTEs to justify calling collapse_huge_page(). DJ does this naively with 
a brute force iteration. Now, the edge NP has over DJ is just before 
calling collapse_huge_page(). Post calling that, everything remains the 
same; assuming that both DJ and NP derive the same collapsed ranges, 
then, collapse_huge_page() succeeds in NP if and only if it succeeds in 
DJ. NP knows quickly, when and when not to call collapse_huge_page().

So the question is, how many iterations of PTE scans does NP save over 
DJ? We prove a stronger result:

Let the PTE table consist of 2^x pte entries, where x >= 2 belongs to 
the set of natural numbers (x >= 2 because anon collapse is not 
supported for x < 2). Let f(x) = #iterations performed by DJ in the 
worst case. The worst case is, all orders are enabled, and we have some 
distribution of the PTEs.

Lemma: f(x) <= 2^x * (x-1).

Proof: We perform weak mathematical induction on x. Assume 
zero-indexing, and assume the worst case that all orders are enabled.

Base case: Let x = 4. We have 16 entries. NP does 16 iterations. In the 
worst case, this is what DJ may do: it will iterate all 16, and not 
collapse. Then it will iterate from 0-7 pte entries, and not collapse. 
Then, it will iterate from 0-3, and may or may not collapse. Here is the 
worst case (When I write l-r below, I mean the range l-r, both inclusive):

0-15 fail -> 0-7 fail -> 0-3 fail/success -> 4-7 fail/success -> 8-15 
fail -> 8-11 fail/success -> 12-15 fail/success

#iterations = 16+8+4+4+8+4+4 = 48 = 2^4 * (4-1).
Convince yourself that f(2) == 4 and f(3) <= 16.

Inductive hypothesis: Assume the lemma is true for some x >= 4.

We need to prove for x+1. Let X = 2^(x+1) - 1, and Y = 2^x - 1.
Let DJ start scanning from 0. If 0-X is success, we are done. So, assume 
0-X fails. Now, DJ looks at 0-Y. Note that, for any x s.t 0 <= x <= X, 
if DJ starts scanning from x, there is no way it will cross the scan 
into the next half, i.e. Y+1-X, since the scan length from x will be 
at most the highest power-of-two alignment of x. Given this, we scan 0-Y 
completely, and then start from Y+1. Having established the above 
argument, we can use the inductive hypothesis on 0-Y and Y+1-X to derive 
that f(x) <= 2^(x+1) + 2f(x) <= 2^(x+1) + 2(2^x * (x-1)) = 2^(x+1) + 
2^(x+1) * (x-1) = 2^(x+1) * (x). Q.E.D
(You can simulate the proof for x=9; what I mean to say is, we can 
divide 0-511 into 0-255 and 256-511).

So, for our case, NP performs 512 iterations, and DJ performs in the 
worst case, 512 * 8 = 4096 iterations. Hmm...
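
As a quick sanity check of that arithmetic (a throwaway userspace
snippet, not part of either series), the recurrence f(x+1) = 2^(x+1) +
2*f(x) with f(2) = 4 hits the bound exactly:

#include <stdio.h>

int main(void)
{
    unsigned long f = 4;    /* f(2) = 4 */

    for (int x = 3; x <= 9; x++) {
        f = (1UL << x) + 2 * f;
        printf("f(%d) = %lu, bound 2^x * (x - 1) = %lu\n",
               x, f, (1UL << x) * (unsigned long)(x - 1));
    }
    return 0;    /* prints f(9) = 4096, matching 512 * 8 above */
}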

----------------------- Subjective Analysis --------------------------

[1] The worst case is, well, the worst case; the practical case on arm64 
machines is that only 2048K and 64K are enabled. So, NP performs 512 
iterations, and DJ performs 512 + 16 * (number of 64K chunks) = 512 + 
16 * 32 = 512 + 512 = 1024 iterations. That is not much difference.

[2] Both implementations have the creep problem described here:
https://lore.kernel.org/all/20241216165105.56185-13-dev.jain@arm.com/

[3] The bitmaps are being created only for pte_none case, whereas we 
also have the shared and the swap case. In fact, for the none case, if 
we have PMD-order enabled, we will almost surely collapse to PMD size, 
given that the common case is khugepaged_max_ptes_none = 511: if we have 
one PTE filled, we will call collapse_huge_page(), and both DJ and NP 
will perform 512 iterations. Therefore, the bitmaps also need to be 
extended to the shared and the swap case so as to get any potential 
benefit from the idea in a practical scenario.

[4] NP does not handle scanning VMAs of size less than a PMD. Since NP 
introduces a single entry point of khugepaged_collapse_single_pmd(), and 
given that MTHP_BITMAP_SIZE is a compile-time constant, I am not sure how 
difficult it will be to extend the implementation. I have extended 
this in my v2 and it works.

[5] In NP, for a bit to be set, the chunk needs to be completely 
filled/shared/swapped out. This completely changes the meaning of the 
sysfs parameters max_ptes_*. It also makes the behaviour very hard to 
debug, since it may happen that distribution D1 has more PTEs filled but 
fewer bits set in the bitmap than distribution D2. DJ also changes the 
meaning of the parameters due to scaling errors, but that is only an 
off-by-one error, so the behaviour is easier to predict.

[6] In NP, we have: remember the state of the PTEs -> 
alloc_charge_folio() -> read_lock(), unlock() -> mmap_write_lock() -> 
anon_vma_lock_write() -> TLB flush for PMD. There is a significant time 
difference here, and the remembered PTEs may be vastly different from 
what we have now. Obviously I cannot pinpoint an exact number as to how 
bad this is or not for the accuracy of khugepaged. For DJ, since a 
particular PTE may come into the scan range multiple times, DJ gives the 
range a chance if the distribution changed recently.

[7] The last time I tried to save on #iterations of PTE entries, this 
happened:

https://lore.kernel.org/all/ZugxqJ-CjEi5lEW_@casper.infradead.org/

Matthew Wilcox pointed out a potential regression in a patch which was 
an "obvious optimization" to me on paper; I tested and it turned out he 
was correct:

https://lore.kernel.org/all/8700274f-b521-444e-8d17-c06039a1376c@arm.com/

We could argue whether it is worth having the bitmap memory 
initialization, copying, weight checking, and recursion overhead.

This is the most I can come up with by analyzing from a third person 
perspective :)




* Re: [RFC 08/11] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  2025-01-12 15:13   ` Dev Jain
@ 2025-01-12 16:41     ` Dev Jain
  0 siblings, 0 replies; 53+ messages in thread
From: Dev Jain @ 2025-01-12 16:41 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-mm
  Cc: ryan.roberts, anshuman.khandual, catalin.marinas, cl, vbabka,
	mhocko, apopple, dave.hansen, will, baohua, jack, srivatsa,
	haowenchao22, hughd, aneesh.kumar, yang, peterx, ioworker0,
	wangkefeng.wang, ziy, jglisse, surenb, vishal.moola, zokeefe,
	zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov, david,
	aarcange, raquini, sunnanyong, usamaarif642, audra, akpm



On 12/01/25 8:43 pm, Dev Jain wrote:
> 
> 
> On 09/01/25 5:01 am, Nico Pache wrote:
>> khugepaged scans PMD ranges for potential collapse to a hugepage. To add
>> mTHP support we use this scan to instead record chunks of fully utilized
>> sections of the PMD.
>>
>> create a bitmap to represent a PMD in order MTHP_MIN_ORDER chunks.
>> by default we will set this to order 3. The reasoning is that for 4K 512
>> PMD size this results in a 64 bit bitmap which has some optimizations.
>> For other arches like ARM64 64K, we can set a larger order if needed.
>>
>> khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
>> that represents chunks of fully utilized regions. We can then determine
>> what mTHP size fits best and in the following patch, we set this bitmap
>> while scanning the PMD.
>>
>> max_ptes_none is used as a scale to determine how "full" an order must
>> be before being considered for collapse.
>>
>> Signed-off-by: Nico Pache <npache@redhat.com>
> 
> Here is my objective and subjective analysis.
> 
> --------------------- Mathematical Analysis ----------------------------
> 
> First off, in my series, I am missing one thing: When I fail to collapse 
> a range as a result of exhausting all orders, I should jump to the next 
> range starting with the alignment of order at which I just failed (i.e, 
> the minimum allowable order). Currently I am exiting which is wasteful. 
> This should be easy to extend.
> 
> Let's call Nico Pache's method NP, and Dev Jain's method DJ.
> 
> The only difference between NP and DJ is the remembrance of the state of 
> the PTEs (I have already reverted to using write lock for my v2, see 
> this reply: https://lore.kernel.org/all/71a2f471-3082-4ca2- 
> ac48-2f664977282f@arm.com/). NP saves empty and filled PTEs in a bitmap, 
> and then uses the optimized (let us assume them to be constant time 
> operations, hopefully?) bitmap APIs, like bitmap_shift_right(), and 
> bitmap_weight(). The latter is what determines whether for a particular 
> order, the range has enough filled PTEs to justify calling 
> collapse_huge_page(). DJ does this naively with a brute force iteration. 
> Now, the edge NP has over DJ is just before calling 
> collapse_huge_page(). Post calling that, everything remains the same; 
> assuming that both DJ and NP derive the same collapsed ranges, then, 
> collapse_huge_page() succeeds in NP if and only if it succeeds in DJ. NP 
> knows quickly, when and when not to call collapse_huge_page().
> 
> So the question is, how many iterations of PTE scans does NP save over 
> DJ? We prove a stronger result:
> 
> Let the PTE table consist of 2^x pte entries, where x >= 2 belongs to 
> the set of natural numbers (x >= 2 because anon collapse is not 
> supported for x < 2). Let f(x) = #iterations performed by DJ in the 
> worst case. The worst case is, all orders are enabled, and we have some 
> distribution of the PTEs.
> 
> Lemma: f(x) <= 2^x * (x-1).
> 
> Proof: We perform weak mathematical induction on x. Assume zero- 
> indexing, and assume the worst case that all orders are enabled.
> 
> Base case: Let x = 4. We have 16 entries. NP does 16 iterations. In the 
> worst case, this is what DJ may do: it will iterate all 16, and not 
> collapse. Then it will iterate from 0-7 pte entries, and not collapse. 
> Then, it will iterate from 0-3, and may or may not collapse. Here is the 
> worst case (When I write l-r below, I mean the range l-r, both inclusive):
> 
> 0-15 fail -> 0-7 fail -> 0-3 fail/success -> 4-7 fail/success -> 8-15 
> fail -> 8-11 fail/success -> 12-15 fail/success
> 
> #iterations = 16+8+4+4+8+4+4 = 48 = 2^4 * (4-1).
> Convince yourself that f(2) == 4 and f(3) <= 16.
> 
> Inductive hypothesis: Assume the lemma is true for some x > 4.
> 
> We need to prove for x+1. Let X = 2^(x+1) - 1, and Y = 2^x - 1.
> Let DJ start scanning from 0. If 0-X is success, we are done. So, assume 
> 0-X fails. Now, DJ looks at 0-Y. Note that, for any x s.t 0 <= x <= X, 
> if DJ starts scanning from x, there is no way it will cross the scan 
> into the next half, i.e. Y+1-X, since the scan length from x will be 
> at most the highest power-of-two alignment of x. Given this, we scan 0-Y 
> completely, and then start from Y+1. Having established the above 
> argument, we can use the inductive hypothesis on 0-Y and Y+1-X to derive 
> that f(x) <= 2^(x+1) + 2f(x) <= 2^(x+1) + 2(2^x * (x-1)) = 2^(x+1) +

Typo: f(x+1) <= 2^(x+1) + 2f(x).

> 2^(x+1) * (x-1) = 2^(x+1) * (x). Q.E.D
> (You can simulate the proof for x=9; what I mean to say is, we can 
> divide 0-511 into 0-255 and 256-511).
> 
> So, for our case, NP performs 512 iterations, and DJ performs in the 
> worst case, 512 * 8 = 4096 iterations. Hmm...
> 
> ----------------------- Subjective Analysis --------------------------
> 
> [1] The worst case is, well, the worst case; the practical case on arm64 
> machines is, only 2048k and 64k is enabled. So, NP performs 512 
> iterations, and DJ performs 512 + 16 * (number of 64K chunks) = 512 + 
> 512 = 1024 iterations. That is not much difference.
> 
> [2] Both implementations have the creep problem described here:
> https://lore.kernel.org/all/20241216165105.56185-13-dev.jain@arm.com/
> 
> [3] The bitmaps are being created only for pte_none case, whereas we 
> also have the shared and the swap case. In fact, for the none case, if 
> we have PMD-order enabled, we will almost surely collapse to PMD size, 
> given that the common case is khugepaged_max_ptes_none = 511: if we have 
> one PTE filled, we will call collapse_huge_page(), and both DJ and NP 
> will perform 512 iterations. Therefore, the bitmaps also need to be 
> extended to the shared and the swap case so as to get any potential 
> benefit from the idea in a practical scenario.
> 
> [4] NP does not handle scanning VMAs of size less than PMD. Since NP 
> introduces a single entry point of khugepaged_collapse_single_pmd(), I 
> am not sure how difficult it will be to extend the implementation, and 
> given that, MTHP_BITMAP_SIZE is a compile time constant. I have extended 
> this in my v2 and it works.
> 
> [5] In NP, for a bit to be set, the chunk completely needs to be filled/ 
> shared/swapped out. This completely changes the meaning of the sysfs 
> parameters max_ptes_*. It also makes it very hard to debug since it may 
> happen that, distribution D1 has more PTEs filled but less bits in the 
> bitmap set than distribution D2. DJ also changes the meaning of the 
> parameters due to scaling errors, but that is only an off-by-one error, 
> therefore, the behaviour is easier to predict.
> 
> [6] In NP, we have: remember the state of the PTEs -> 
> alloc_charge_folio() -> read_lock(), unlock() -> mmap_write_lock() -> 
> anon_vma_lock_write() -> TLB flush for PMD. There is a significant time 
> difference here, and the remembered PTEs may be vastly different from 
> what we have now. Obviously I cannot pinpoint an exact number as to how 
> bad this is or not for the accuracy of khugepaged. For DJ, since a 
> particular PTE may come into the scan range multiple times, DJ gives the 
> range a chance if the distribution changed recently.
> 
> [7] The last time I tried to save on #iterations of PTE entries, this 
> happened:
> 
> https://lore.kernel.org/all/ZugxqJ-CjEi5lEW_@casper.infradead.org/
> 
> Matthew Wilcox pointed out a potential regression in a patch which was 
> an "obvious optimization" to me on paper; I tested and it turned out he 
> was correct:
> 
> https://lore.kernel.org/all/8700274f-b521-444e-8d17-c06039a1376c@arm.com/
> 
> We could argue whether it is worth to have the bitmap memory 
> initialization, copying, weight checking, and recursion overhead.
> 
> This is the most I can come up with by analyzing from a third person 
> perspective :)
> 
> 




* Re: [RFC 08/11] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  2025-01-12 11:23       ` Dev Jain
@ 2025-01-13 22:25         ` Nico Pache
  0 siblings, 0 replies; 53+ messages in thread
From: Nico Pache @ 2025-01-13 22:25 UTC (permalink / raw)
  To: Dev Jain
  Cc: linux-kernel, linux-mm, ryan.roberts, anshuman.khandual,
	catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
	baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
	peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
	vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy,
	kirill.shutemov, david, aarcange, raquini, sunnanyong,
	usamaarif642, audra, akpm

On Sun, Jan 12, 2025 at 4:23 AM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 11/01/25 3:18 am, Nico Pache wrote:
> > On Fri, Jan 10, 2025 at 2:06 AM Dev Jain <dev.jain@arm.com> wrote:
> >>
> >>
> >>
> >> On 09/01/25 5:01 am, Nico Pache wrote:
> >>> khugepaged scans PMD ranges for potential collapse to a hugepage. To add
> >>> mTHP support we use this scan to instead record chunks of fully utilized
> >>> sections of the PMD.
> >>>
> >>> create a bitmap to represent a PMD in order MTHP_MIN_ORDER chunks.
> >>> by default we will set this to order 3. The reasoning is that for 4K 512
> >>> PMD size this results in a 64 bit bitmap which has some optimizations.
> >>> For other arches like ARM64 64K, we can set a larger order if needed.
> >>>
> >>> khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
> >>> that represents chunks of fully utilized regions. We can then determine
> >>> what mTHP size fits best and in the following patch, we set this bitmap
> >>> while scanning the PMD.
> >>>
> >>> max_ptes_none is used as a scale to determine how "full" an order must
> >>> be before being considered for collapse.
> >>>
> >>> Signed-off-by: Nico Pache <npache@redhat.com>
> >>> ---
> >>>    include/linux/khugepaged.h |   4 +-
> >>>    mm/khugepaged.c            | 129 +++++++++++++++++++++++++++++++++++--
> >>>    2 files changed, 126 insertions(+), 7 deletions(-)
> >>>
> >>> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> >>> index 1f46046080f5..31cff8aeec4a 100644
> >>> --- a/include/linux/khugepaged.h
> >>> +++ b/include/linux/khugepaged.h
> >>> @@ -1,7 +1,9 @@
> >>>    /* SPDX-License-Identifier: GPL-2.0 */
> >>>    #ifndef _LINUX_KHUGEPAGED_H
> >>>    #define _LINUX_KHUGEPAGED_H
> >>> -
> >>
> >> Nit: I don't think this line needs to be deleted.
> >>
> >>> +#define MIN_MTHP_ORDER       3
> >>> +#define MIN_MTHP_NR  (1<<MIN_MTHP_ORDER)
> >>
> >> Nit: Insert a space: (1 << MIN_MTHP_ORDER)
> >>
> >>> +#define MTHP_BITMAP_SIZE  (1<<(HPAGE_PMD_ORDER - MIN_MTHP_ORDER))
> >>>    extern unsigned int khugepaged_max_ptes_none __read_mostly;
> >>>    #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>>    extern struct attribute_group khugepaged_attr_group;
> >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >>> index 9eb161b04ee4..de1dc6ea3c71 100644
> >>> --- a/mm/khugepaged.c
> >>> +++ b/mm/khugepaged.c
> >>> @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> >>>
> >>>    static struct kmem_cache *mm_slot_cache __ro_after_init;
> >>>
> >>> +struct scan_bit_state {
> >>> +     u8 order;
> >>> +     u8 offset;
> >>> +};
> >>> +
> >>>    struct collapse_control {
> >>>        bool is_khugepaged;
> >>>
> >>> @@ -102,6 +107,15 @@ struct collapse_control {
> >>>
> >>>        /* nodemask for allocation fallback */
> >>>        nodemask_t alloc_nmask;
> >>> +
> >>> +     /* bitmap used to collapse mTHP sizes. 1bit = order MIN_MTHP_ORDER mTHP */
> >>> +     unsigned long *mthp_bitmap;
> >>> +     unsigned long *mthp_bitmap_temp;
> >>> +     struct scan_bit_state *mthp_bitmap_stack;
> >>> +};
> >>> +
> >>> +struct collapse_control khugepaged_collapse_control = {
> >>> +     .is_khugepaged = true,
> >>>    };
> >>>
> >>>    /**
> >>> @@ -389,6 +403,25 @@ int __init khugepaged_init(void)
> >>>        if (!mm_slot_cache)
> >>>                return -ENOMEM;
> >>>
> >>> +     /*
> >>> +      * allocate the bitmaps dynamically since MTHP_BITMAP_SIZE is not known at
> >>> +      * compile time for some architectures.
> >>> +      */
> >>> +     khugepaged_collapse_control.mthp_bitmap = kmalloc_array(
> >>> +             BITS_TO_LONGS(MTHP_BITMAP_SIZE), sizeof(unsigned long), GFP_KERNEL);
> >>> +     if (!khugepaged_collapse_control.mthp_bitmap)
> >>> +             return -ENOMEM;
> >>> +
> >>> +     khugepaged_collapse_control.mthp_bitmap_temp = kmalloc_array(
> >>> +             BITS_TO_LONGS(MTHP_BITMAP_SIZE), sizeof(unsigned long), GFP_KERNEL);
> >>> +     if (!khugepaged_collapse_control.mthp_bitmap_temp)
> >>> +             return -ENOMEM;
> >>> +
> >>> +     khugepaged_collapse_control.mthp_bitmap_stack = kmalloc_array(
> >>> +             MTHP_BITMAP_SIZE, sizeof(struct scan_bit_state), GFP_KERNEL);
> >>> +     if (!khugepaged_collapse_control.mthp_bitmap_stack)
> >>> +             return -ENOMEM;
> >>> +
> >>>        khugepaged_pages_to_scan = HPAGE_PMD_NR * 8;
> >>>        khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;
> >>>        khugepaged_max_ptes_swap = HPAGE_PMD_NR / 8;
> >>> @@ -400,6 +433,9 @@ int __init khugepaged_init(void)
> >>>    void __init khugepaged_destroy(void)
> >>>    {
> >>>        kmem_cache_destroy(mm_slot_cache);
> >>> +     kfree(khugepaged_collapse_control.mthp_bitmap);
> >>> +     kfree(khugepaged_collapse_control.mthp_bitmap_temp);
> >>> +     kfree(khugepaged_collapse_control.mthp_bitmap_stack);
> >>>    }
> >>>
> >>>    static inline int khugepaged_test_exit(struct mm_struct *mm)
> >>> @@ -850,10 +886,6 @@ static void khugepaged_alloc_sleep(void)
> >>>        remove_wait_queue(&khugepaged_wait, &wait);
> >>>    }
> >>>
> >>> -struct collapse_control khugepaged_collapse_control = {
> >>> -     .is_khugepaged = true,
> >>> -};
> >>> -
> >>>    static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
> >>>    {
> >>>        int i;
> >>> @@ -1102,7 +1134,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> >>>
> >>>    static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >>>                              int referenced, int unmapped,
> >>> -                           struct collapse_control *cc)
> >>> +                           struct collapse_control *cc, bool *mmap_locked,
> >>> +                               int order, int offset)
> >>>    {
> >>>        LIST_HEAD(compound_pagelist);
> >>>        pmd_t *pmd, _pmd;
> >>> @@ -1115,6 +1148,11 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >>>        struct mmu_notifier_range range;
> >>>        VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >>>
> >>> +     /* if collapsing mTHPs we may have already released the read_lock, and
> >>> +      * need to reacquire it to keep the proper locking order.
> >>> +      */
> >>> +     if (!*mmap_locked)
> >>> +             mmap_read_lock(mm);
> >>
> >> There is no need to take the read lock again, because we drop it just
> >> after this.
> >
> > collapse_huge_page expects the mmap_lock to already be taken, and it
> > returns with it unlocked. If we are collapsing multiple mTHPs under
> > the same PMD, then I think we need to reacquire the lock before
> > calling unlock on it.
>
> I cannot figure out a potential place where we drop the lock before
> entering collapse_huge_page(). In any case, wouldn't this be better:

Let's say we are collapsing two 1024kB mTHPs in a single PMD region.
We call collapse_huge_page on the first mTHP and during the collapse
the lock is dropped.
When the second mTHP collapse is attempted the lock has already been dropped.

> if (*mmap_locked)
>         mmap_read_unlock(mm);
>
> Basically, instead of putting the if condition around the lock, you do
> it around the unlock?

Yeah, that seems much cleaner. I'll give it a try, thanks!

>
> >
> >>
> >>>        /*
> >>>         * Before allocating the hugepage, release the mmap_lock read lock.
> >>>         * The allocation can take potentially a long time if it involves
> >>> @@ -1122,6 +1160,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >>>         * that. We will recheck the vma after taking it again in write mode.
> >>>         */
> >>>        mmap_read_unlock(mm);
> >>> +     *mmap_locked = false;
> >>>
> >>>        result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> >>>        if (result != SCAN_SUCCEED)
> >>> @@ -1256,12 +1295,71 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >>>    out_up_write:
> >>>        mmap_write_unlock(mm);
> >>>    out_nolock:
> >>> +     *mmap_locked = false;
> >>>        if (folio)
> >>>                folio_put(folio);
> >>>        trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
> >>>        return result;
> >>>    }
> >>>
> >>> +// Recursive function to consume the bitmap
> >>> +static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
> >>> +                     int referenced, int unmapped, struct collapse_control *cc,
> >>> +                     bool *mmap_locked, unsigned long enabled_orders)
> >>> +{
> >>> +     u8 order, offset;
> >>> +     int num_chunks;
> >>> +     int bits_set, max_percent, threshold_bits;
> >>> +     int next_order, mid_offset;
> >>> +     int top = -1;
> >>> +     int collapsed = 0;
> >>> +     int ret;
> >>> +     struct scan_bit_state state;
> >>> +
> >>> +     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> >>> +             { HPAGE_PMD_ORDER - MIN_MTHP_ORDER, 0 };
> >>> +
> >>> +     while (top >= 0) {
> >>> +             state = cc->mthp_bitmap_stack[top--];
> >>> +             order = state.order;
> >>> +             offset = state.offset;
> >>> +             num_chunks = 1 << order;
> >>> +             // Skip mTHP orders that are not enabled
> >>> +             if (!((enabled_orders >> (order + MIN_MTHP_ORDER)) & 1))
> >>> +                     goto next;
> >>> +
> >>> +             // copy the relevant section to a new bitmap
> >>> +             bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
> >>> +                               MTHP_BITMAP_SIZE);
> >>> +
> >>> +             bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
> >>> +
> >>> +             // Check if the region is "almost full" based on the threshold
> >>> +             max_percent = ((HPAGE_PMD_NR - khugepaged_max_ptes_none - 1) * 100)
> >>> +                                             / (HPAGE_PMD_NR - 1);
> >>> +             threshold_bits = (max_percent * num_chunks) / 100;
> >>> +
> >>> +             if (bits_set >= threshold_bits) {
> >>> +                     ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
> >>> +                                     mmap_locked, order + MIN_MTHP_ORDER, offset * MIN_MTHP_NR);
> >>> +                     if (ret == SCAN_SUCCEED)
> >>> +                             collapsed += (1 << (order + MIN_MTHP_ORDER));
> >>> +                     continue;
> >>> +             }
> >>> +
> >>> +next:
> >>> +             if (order > 0) {
> >>> +                     next_order = order - 1;
> >>> +                     mid_offset = offset + (num_chunks / 2);
> >>> +                     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> >>> +                             { next_order, mid_offset };
> >>> +                     cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> >>> +                             { next_order, offset };
> >>> +                     }
> >>> +     }
> >>> +     return collapsed;
> >>> +}
> >>> +
> >>>    static int khugepaged_scan_pmd(struct mm_struct *mm,
> >>>                                   struct vm_area_struct *vma,
> >>>                                   unsigned long address, bool *mmap_locked,
> >>> @@ -1430,7 +1528,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >>>        pte_unmap_unlock(pte, ptl);
> >>>        if (result == SCAN_SUCCEED) {
> >>>                result = collapse_huge_page(mm, address, referenced,
> >>> -                                         unmapped, cc);
> >>> +                                         unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
> >>>                /* collapse_huge_page will return with the mmap_lock released */
> >>>                *mmap_locked = false;
> >>>        }
> >>> @@ -2767,6 +2865,21 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >>>                return -ENOMEM;
> >>>        cc->is_khugepaged = false;
> >>>
> >>> +     cc->mthp_bitmap = kmalloc_array(
> >>> +             BITS_TO_LONGS(MTHP_BITMAP_SIZE), sizeof(unsigned long), GFP_KERNEL);
> >>> +     if (!cc->mthp_bitmap)
> >>> +             return -ENOMEM;
> >>> +
> >>> +     cc->mthp_bitmap_temp = kmalloc_array(
> >>> +             BITS_TO_LONGS(MTHP_BITMAP_SIZE), sizeof(unsigned long), GFP_KERNEL);
> >>> +     if (!cc->mthp_bitmap_temp)
> >>> +             return -ENOMEM;
> >>> +
> >>> +     cc->mthp_bitmap_stack = kmalloc_array(
> >>> +             MTHP_BITMAP_SIZE, sizeof(struct scan_bit_state), GFP_KERNEL);
> >>> +     if (!cc->mthp_bitmap_stack)
> >>> +             return -ENOMEM;
> >>> +
> >>>        mmgrab(mm);
> >>>        lru_add_drain_all();
> >>>
> >>> @@ -2831,8 +2944,12 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >>>    out_nolock:
> >>>        mmap_assert_locked(mm);
> >>>        mmdrop(mm);
> >>> +     kfree(cc->mthp_bitmap);
> >>> +     kfree(cc->mthp_bitmap_temp);
> >>> +     kfree(cc->mthp_bitmap_stack);
> >>>        kfree(cc);
> >>>
> >>> +
> >>>        return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
> >>>                        : madvise_collapse_errno(last_fail);
> >>>    }
> >>
> >
>




* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-12 14:11         ` Dev Jain
@ 2025-01-13 23:00           ` Nico Pache
  0 siblings, 0 replies; 53+ messages in thread
From: Nico Pache @ 2025-01-13 23:00 UTC (permalink / raw)
  To: Dev Jain
  Cc: linux-kernel, linux-mm, ryan.roberts, anshuman.khandual,
	catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
	baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
	peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
	vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy,
	kirill.shutemov, david, aarcange, raquini, sunnanyong,
	usamaarif642, audra, akpm

On Sun, Jan 12, 2025 at 7:12 AM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 11/01/25 3:31 am, Nico Pache wrote:
> > On Thu, Jan 9, 2025 at 9:56 PM Dev Jain <dev.jain@arm.com> wrote:
> >>
> >>
> >>
> >> On 10/01/25 7:57 am, Nico Pache wrote:
> >>> On Wed, Jan 8, 2025 at 11:22 PM Dev Jain <dev.jain@arm.com> wrote:
> >>>>
> >>>>
> >>>> On 09/01/25 5:01 am, Nico Pache wrote:
> >>>>> The following series provides khugepaged and madvise collapse with the
> >>>>> capability to collapse regions to mTHPs.
> >>>>>
> >>>>> To achieve this we generalize the khugepaged functions to no longer depend
> >>>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> >>>>> (defined by MTHP_MIN_ORDER) that are fully utilized. This info is tracked
> >>>>> using a bitmap. After the PMD scan is done, we do binary recursion on the
> >>>>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> >>>>> on max_ptes_none is removed during the scan, to make sure we account for
> >>>>> the whole PMD range. max_ptes_none is mapped to a 0-100 range to
> >>>>> determine how full a mTHP order needs to be before collapsing it.
> >>>>>
> >>>>> Some design choices to note:
> >>>>>     - bitmap structures are allocated dynamically because on some arch's
> >>>>>        (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
> >>>>>        compile time leading to warnings.
> >>>>>     - The recursion is masked through a stack structure.
> >>>>>     - A MTHP_MIN_ORDER was added to compress the bitmap, and ensure it was
> >>>>>        64bit on x86. This provides some optimization on the bitmap operations.
> >>>>>        if other arches/configs that have larger than 512 PTEs per PMD want to
> >>>>>        compress their bitmap further we can change this value per arch.
> >>>>>
> >>>>> Patch 1-2:  Some refactoring to combine madvise_collapse and khugepaged
> >>>>> Patch 3:    A minor "fix"/optimization
> >>>>> Patch 4:    Refactor/rename hpage_collapse
> >>>>> Patch 5-7:  Generalize khugepaged functions for arbitrary orders
> >>>>> Patch 8-11: The mTHP patches
> >>>>>
> >>>>> This series acts as an alternative to Dev Jain's approach [1]. The two
> >>>>> series differ in a few ways:
> >>>>>      - My approach uses a bitmap to store the state of the linear scan_pmd to
> >>>>>        then determine potential mTHP batches. Devs incorporates his directly
> >>>>>        into the scan, and will try each available order.
> >>>>>      - Dev is attempting to optimize the locking, while my approach keeps the
> >>>>>        locking changes to a minimum. I believe his changes are not safe for
> >>>>>        uffd.
> >>>>>      - Dev's changes only work for khugepaged not madvise_collapse (although
> >>>>>        i think that was by choice and it could easily support madvise)
> >>>>>      - Dev scales all khugepaged sysfs tunables by order, while im removing
> >>>>>        the restriction of max_ptes_none and converting it to a scale to
> >>>>>        determine a (m)THP threshold.
> >>>>>      - Dev turns on khugepaged if any order is available while mine still
> >>>>>        only runs if PMDs are enabled. I like Dev's approach and will most
> >>>>>        likely do the same in my PATCH posting.
> >>>>>      - mTHPs need their ref count updated to 1<<order, which Dev is missing.
> >>>>>
> >>>>> Patch 11 was inspired by one of Dev's changes.
> >>>>>
> >>>>> [1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
> >>>>>
> >>>>> Nico Pache (11):
> >>>>>      introduce khugepaged_collapse_single_pmd to collapse a single pmd
> >>>>>      khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot
> >>>>>      khugepaged: Don't allocate khugepaged mm_slot early
> >>>>>      khugepaged: rename hpage_collapse_* to khugepaged_*
> >>>>>      khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >>>>>      khugepaged: generalize alloc_charge_folio for mTHP support
> >>>>>      khugepaged: generalize __collapse_huge_page_* for mTHP support
> >>>>>      khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >>>>>      khugepaged: add mTHP support
> >>>>>      khugepaged: remove max_ptes_none restriction on the pmd scan
> >>>>>      khugepaged: skip collapsing mTHP to smaller orders
> >>>>>
> >>>>>     include/linux/khugepaged.h |   4 +-
> >>>>>     mm/huge_memory.c           |   3 +-
> >>>>>     mm/khugepaged.c            | 436 +++++++++++++++++++++++++------------
> >>>>>     3 files changed, 306 insertions(+), 137 deletions(-)
> >>>>
> >>>> Before I take a proper look at your series, can you please include any testing
> >>>> you may have done?
> >>>
> >>> I Built these changes for the following arches: x86_64, arm64,
> >>> arm64-64k, ppc64le, s390x
> >>>
> >>> x86 testing:
> >>> - Selftests mm
> >>> - some stress-ng tests
> >>> - compile kernel
> >>> - I did some tests with my defer [1] set on top. This pushes all the
> >>> work to khugepaged, which removes the noise of all the PF allocations.
> >>>
> >>> I recently got an ARM64 machine and did some simple sanity tests (on
> >>> both 4k and 64k) like selftests, stress-ng, and playing around with
> >>> the tunables, etc.
> >>>
> >>> I will also be running all the builds through our CI, and perf testing
> >>> environments before posting.
> >>>
> >>> [1] https://lore.kernel.org/lkml/20240729222727.64319-1-npache@redhat.com/
> >>>
> >>>>
> >>>
> >> I tested your series with the program I was using and it is not working;
> >> can you please confirm it.
> >
> > Yes, this is expected because you are not fully filling any 32K chunk
> > (MIN_MTHP_ORDER) so no bit is ever set.
>
> That is weird, because if this is the case, then PMD-collapse should
> have also failed, but that succeeded. Do you have some userspace program
> I can test with?
Not exactly; if max_ptes_none is still 511, the old behavior is kept.

I modified your program to set the first 8 pages (32k chunk) in every
64k region.

#include <unistd.h>
#include <sys/ioctl.h>
#include <string.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/random.h>
#include <assert.h>

int main(int argc, char *argv[])
{
    char *ptr;
    unsigned long mthp_size = (1UL << 16); // 64 KB chunk size
    size_t chunk_size = (1UL << 25);       // 32 MB total size

    // mmap() to allocate memory at a specific address (1 GB address)
    ptr = mmap((void *)(1UL << 30), chunk_size, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (((unsigned long)ptr) != (1UL << 30)) {
        printf("mmap did not work on required address\n");
        return 1;
    }

    // Touch the first 8 pages in every 64 KB chunk
    for (size_t i = 0; i < chunk_size; i += mthp_size) {
        // Touch the first 8 pages within the 64 KB chunk (8 * 4 KB = 32 KB)
        for (int j = 0; j < 8; ++j) {
            // Touch the first byte of each page
            ptr[i + j * 4096] = i + j * 4096;
        }
    }

    // Use madvise() to advise the kernel to use huge pages for this memory
    if (madvise(ptr, chunk_size, MADV_HUGEPAGE)) {
        perror("madvise");
        return 1;
    }

    sleep(100); // Sleep to allow time for the kernel to process the advice
    return 0;
}

There are some rounding errors in how I compute the threshold_bits... I
think I will adopt how you do the max_ptes_none shifting for better
accuracy. Currently, if you run this with max_ptes_none=255 (or even
lower values like 200...) it will still collapse to a 64k chunk when
in reality it should only do 32k because only half the bitmap is set
for this order, and 255 < 50% of 512.
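
For reference, here is how the current rounding plays out with those
numbers (assuming 4K pages, HPAGE_PMD_NR == 512, MIN_MTHP_ORDER == 3;
a worked example, not output from the series):

    max_percent    = ((512 - 255 - 1) * 100) / (512 - 1) = 25600 / 511 = 50
    num_chunks     = 1 << (4 - 3) = 2      (a 64k region is two 32k chunks)
    threshold_bits = (50 * 2) / 100 = 1

so a 64k region with only one of its two 32k chunks populated still
passes the bits_set >= threshold_bits check and gets collapsed.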

I'm adding a threshold to the bitmap_set, and doing better scaling
like you do. My next version should handle the example code better.

>




* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-08 23:31 [RFC 00/11] khugepaged: mTHP support Nico Pache
                   ` (12 preceding siblings ...)
  2025-01-09  6:27 ` Dev Jain
@ 2025-01-16  9:47 ` Ryan Roberts
  2025-01-16 20:53   ` Nico Pache
  13 siblings, 1 reply; 53+ messages in thread
From: Ryan Roberts @ 2025-01-16  9:47 UTC (permalink / raw)
  To: Nico Pache, linux-kernel, linux-mm
  Cc: anshuman.khandual, catalin.marinas, cl, vbabka, mhocko, apopple,
	dave.hansen, will, baohua, jack, srivatsa, haowenchao22, hughd,
	aneesh.kumar, yang, peterx, ioworker0, wangkefeng.wang, ziy,
	jglisse, surenb, vishal.moola, zokeefe, zhengqi.arch, jhubbard,
	21cnbao, willy, kirill.shutemov, david, aarcange, raquini,
	dev.jain, sunnanyong, usamaarif642, audra, akpm

Hi Nico,

On 08/01/2025 23:31, Nico Pache wrote:
> The following series provides khugepaged and madvise collapse with the 
> capability to collapse regions to mTHPs.

It's great to see multiple solutions for this feature being posted; I guess that
leaves us with the luxurious problem of figuring out an uber-patchset that
incorporates the best of both? :)

I haven't had a chance to review your series in detail yet, but have a few
questions below that will help me understand the key differences between your
series and Dev's.

> 
> To achieve this we generalize the khugepaged functions to no longer depend
> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> (defined by MTHP_MIN_ORDER) that are fully utilized. This info is tracked
> using a bitmap. After the PMD scan is done, we do binary recursion on the
> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> on max_ptes_none is removed during the scan, to make sure we account for
> the whole PMD range. max_ptes_none is mapped to a 0-100 range to 
> determine how full a mTHP order needs to be before collapsing it.
> 
> Some design choices to note: 
>  - bitmap structures are allocated dynamically because on some arch's 
>     (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
>     compile time leading to warnings.

We have MAX_PTRS_PER_PTE and friends though, which are worst case and compile
time. Could these help avoid the dynamic allocation?

MAX_PMD_ORDER = ilog2(MAX_PTRS_PER_PTE * PAGE_SIZE)
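
Something like the below might work; a rough sketch assuming
MAX_PTRS_PER_PTE reflects the compile-time worst case (names are
illustrative, not from either series):

#define MAX_MTHP_BITMAP_SIZE	(MAX_PTRS_PER_PTE >> MIN_MTHP_ORDER)

struct collapse_control {
	bool is_khugepaged;
	/* ... existing members ... */

	/* 1 bit per MIN_MTHP_ORDER chunk, sized for the worst case */
	DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
	DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
	struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
};

which would avoid the kmalloc_array() calls in khugepaged_init() and
madvise_collapse() entirely.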

Although to be honest, it's not super clear to me what the benefit of the bitmap
is vs just iterating through the PTEs like Dev does; is there a significant cost
saving in practice? On the face of it, it seems like it might be unneeded complexity.

>  - The recursion is masked through a stack structure.
>  - A MTHP_MIN_ORDER was added to compress the bitmap, and ensure it was
>     64bit on x86. This provides some optimization on the bitmap operations.
>     if other arches/configs that have larger than 512 PTEs per PMD want to 
>     compress their bitmap further we can change this value per arch.

So 1 bit in the bitmap represents 8 pages? And it will only be set if all 8
pages are !pte_none()? I'm wondering what will happen if you have a pattern of 4
set PTEs followed by 4 none PTEs, followed by 4 set PTEs... If 16K mTHP is
enabled, you would want to collapse every other 16K block in this case, but I'm
guessing with your scheme, all the bits will be clear and no collapse will
occur? But for arm64 at least, collapsing to order-2 (16K) may be desired for HPA.

> 
> Patch 1-2:  Some refactoring to combine madvise_collapse and khugepaged
> Patch 3:    A minor "fix"/optimization
> Patch 4:    Refactor/rename hpage_collapse
> Patch 5-7:  Generalize khugepaged functions for arbitrary orders
> Patch 8-11: The mTHP patches
> 
> This series acts as an alternative to Dev Jain's approach [1]. The two 
> series differ in a few ways:
>   - My approach uses a bitmap to store the state of the linear scan_pmd to
>     then determine potential mTHP batches. Devs incorporates his directly
>     into the scan, and will try each available order.

So if I'm understanding, the benefit of the bitmap is to remove the need to
re-scan the "low" PTEs when moving to a lower order, which is what Dev's
approach does? Are there not some locking/consistency issues to manage if not
re-scanning?

>   - Dev is attempting to optimize the locking, while my approach keeps the
>     locking changes to a minimum. I believe his changes are not safe for
>     uffd.

I agree; let's keep the locking simple for the initial effort.

>   - Dev's changes only work for khugepaged not madvise_collapse (although
>     i think that was by choice and it could easily support madvise)

I agree supporting MADV_COLLAPSE is good; what exactly are the semantics for it
though? I think it ignores the sysfs settings (max_ptes_none and friends) so
presumably it will continue to be much more greedy about collapsing to the
highest possible order and only fall back to lower orders if the VMA boundaries
force it to or if the higher order allocation fails?

>   - Dev scales all khugepaged sysfs tunables by order, while im removing 
>     the restriction of max_ptes_none and converting it to a scale to 
>     determine a (m)THP threshold.

I don't really understand this statement. You say you are removing the
restriction of max_ptes_none. But then you say you scale it to determine a
threshold. So are you honouring it or not? And if you're honouring it, how is
your scaling method different to Dev's? What about the other tunables (shared
and swap)?

>   - Dev turns on khugepaged if any order is available while mine still 
>     only runs if PMDs are enabled. I like Dev's approach and will most
>     likely do the same in my PATCH posting.

Agreed. Also, we will want khugepaged to be able to scan VMAs (or parts of VMAs)
that cover only a partial PMD entry. I think neither of your implementations
currently do that. As I understand it, Dev's v2 will add that support. Is your
approach amenable to this?

>   - mTHPs need their ref count updated to 1<<order, which Dev is missing.
> 
> Patch 11 was inspired by one of Dev's changes.

I think the one problem that emerged during review of Dev's series, which we don't
have a proper solution to yet, is the issue of "creep", where regions can be
collapsed to progressively higher orders through iterative scans. At each
collapse, the required thresholds (e.g. max_ptes_none) are met, and the collapse
effectively adds more non-none ptes so the next scan will then collapse to even
higher order. Does your solution suffer from this (theoretical/edge case) issue?
If not, how did you solve it?
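
(To make the creep concrete with made-up numbers: below is a throwaway
user-space sketch assuming a per-order scaling of max_ptes_none, roughly as
discussed for Dev's series; it's illustration only, not code from either
series.)

#include <stdio.h>

#define PMD_ORDER     9    /* 512 PTEs per PMD, x86 with 4K pages */
#define MAX_PTES_NONE 511  /* the current default */

int main(void)
{
        /* Start with a single faulted-in page; each loop iteration stands
         * for one later khugepaged pass over the same PMD range. */
        int populated = 1;
        int order;

        for (order = 2; order <= PMD_ORDER; order++) {
                int nr = 1 << order;
                int allowed_none = MAX_PTES_NONE >> (PMD_ORDER - order);

                if (nr - populated <= allowed_none) {
                        /* after collapse, every PTE in the block is non-none,
                         * which then satisfies the next order's threshold */
                        populated = nr;
                        printf("collapsed to order %d (%d/%d populated)\n",
                               order, populated, nr);
                }
        }
        return 0;
}

With the default max_ptes_none=511 this creeps from one populated page all the
way up to a PMD-sized collapse; sufficiently low values stop the creep early.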

Thanks,
Ryan


> 
> [1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
> 
> Nico Pache (11):
>   introduce khugepaged_collapse_single_pmd to collapse a single pmd
>   khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot
>   khugepaged: Don't allocate khugepaged mm_slot early
>   khugepaged: rename hpage_collapse_* to khugepaged_*
>   khugepaged: generalize hugepage_vma_revalidate for mTHP support
>   khugepaged: generalize alloc_charge_folio for mTHP support
>   khugepaged: generalize __collapse_huge_page_* for mTHP support
>   khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>   khugepaged: add mTHP support
>   khugepaged: remove max_ptes_none restriction on the pmd scan
>   khugepaged: skip collapsing mTHP to smaller orders
> 
>  include/linux/khugepaged.h |   4 +-
>  mm/huge_memory.c           |   3 +-
>  mm/khugepaged.c            | 436 +++++++++++++++++++++++++------------
>  3 files changed, 306 insertions(+), 137 deletions(-)
> 



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-16  9:47 ` Ryan Roberts
@ 2025-01-16 20:53   ` Nico Pache
  2025-01-20  5:17     ` Dev Jain
                       ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Nico Pache @ 2025-01-16 20:53 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: linux-kernel, linux-mm, anshuman.khandual, catalin.marinas, cl,
	vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
	srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
	ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
	zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov,
	david, aarcange, raquini, dev.jain, sunnanyong, usamaarif642,
	audra, akpm

On Thu, Jan 16, 2025 at 2:47 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi Nico,
Hi Ryan!
>
> On 08/01/2025 23:31, Nico Pache wrote:
> > The following series provides khugepaged and madvise collapse with the
> > capability to collapse regions to mTHPs.
>
> It's great to see multiple solutions for this feature being posted; I guess that
> leaves us with the luxurious problem of figuring out an uber-patchset that
> incorporates the best of both? :)
I guess so! My motivation for developing this was inspired by my
'defer' RFC, which can't really work without khugepaged having mTHP
support (i.e. having 32k mTHP=always and global=defer doesn't make
sense).
>
> I haven't had a chance to review your series in detail yet, but have a few
> questions below that will help me understand the key differences between your
> series and Dev's.
>
> >
> > To achieve this we generalize the khugepaged functions to no longer depend
> > on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> > (defined by MTHP_MIN_ORDER) that are fully utilized. This info is tracked
> > using a bitmap. After the PMD scan is done, we do binary recursion on the
> > bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> > on max_ptes_none is removed during the scan, to make sure we account for
> > the whole PMD range. max_ptes_none is mapped to a 0-100 range to
> > determine how full a mTHP order needs to be before collapsing it.
> >
> > Some design choices to note:
> >  - bitmap structures are allocated dynamically because on some arch's
> >     (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
> >     compile time leading to warnings.
>
> We have MAX_PTRS_PER_PTE and friends though, which are worst case and compile
> time. Could these help avoid the dynamic allocation?
>
> MAX_PMD_ORDER = ilog2(MAX_PTRS_PER_PTE * PAGE_SIZE)
Is MAX_PMD_ORDER == PMD_ORDER? If not, this might introduce weird
edge cases where PMD_ORDER < MAX_PMD_ORDER.

>
> Althogh to be honest, it's not super clear to me what the benefit of the bitmap
> is vs just iterating through the PTEs like Dev does; is there a significant cost
> saving in practice? On the face of it, it seems like it might be uneeded complexity.
The bitmap was to encode the state of the PMD without rescanning
(or refactoring a lot of code). We keep the scan runtime constant at 512
(for x86). Dev did some good analysis of this here:
https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
This prevents needing to hold the read lock for longer, and prevents
needing to reacquire it too.
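
Roughly, the shape of it is something like this (a simplified user-space
sketch with made-up names and types; the real series works on pte_t under
the PTL, packs the bitmap via MTHP_MIN_ORDER, and my v2 will use a threshold
instead of requiring a chunk to be completely full):

#include <stdbool.h>
#include <stdint.h>

#define PTES_PER_PMD    512
#define CHUNK_PTES      8                       /* 1 bit per 8 PTEs */
#define BITMAP_BITS     (PTES_PER_PMD / CHUNK_PTES)

/* One linear pass over the 512 PTEs, done under the PTL. */
static uint64_t scan_pmd_once(const bool pte_present[PTES_PER_PMD])
{
        uint64_t bitmap = 0;

        for (int chunk = 0; chunk < BITMAP_BITS; chunk++) {
                int used = 0;

                for (int i = 0; i < CHUNK_PTES; i++)
                        used += pte_present[chunk * CHUNK_PTES + i];

                if (used == CHUNK_PTES)         /* chunk fully utilized */
                        bitmap |= 1ULL << chunk;
        }
        return bitmap;
}

/* Later, mTHP candidates are picked by looking only at the bitmap; e.g. an
 * aligned 64K (16-PTE) candidate is two consecutive set bits. */
static bool chunk_range_full(uint64_t bitmap, int first, int nr_chunks)
{
        uint64_t mask = (nr_chunks >= 64 ? ~0ULL
                                         : (1ULL << nr_chunks) - 1) << first;

        return (bitmap & mask) == mask;
}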
>
> >  - The recursion is masked through a stack structure.
> >  - A MTHP_MIN_ORDER was added to compress the bitmap, and ensure it was
> >     64bit on x86. This provides some optimization on the bitmap operations.
> >     if other arches/configs that have larger than 512 PTEs per PMD want to
> >     compress their bitmap further we can change this value per arch.
>
> So 1 bit in the bitmap represents 8 pages? And it will only be set if all 8
> pages are !pte_none()? I'm wondering what will happen if you have a pattern of 4
> set PTEs followed by 4 none PTEs, followed by 4 set PTEs... If 16K mTHP is
> enabled, you would want to collapse every other 16K block in this case, but I'm
> guessing with your scheme, all the bits will be clear and no collapse will
> occur? But for arm64 at least, collapsing to order-2 (16K) may be desired for HPA.

Yeah, in my v2 I've incorporated a threshold (like max_ptes_none) for
setting the bit. This will cover this case better (given a better
default max_ptes_none).
The way I see it, 511 max_ptes_none is just wrong... we should flip it
towards the lower end of the scale (e.g. 64), and the "always" THP
setting should ignore it (like madvise does).

>
> >
> > Patch 1-2:  Some refactoring to combine madvise_collapse and khugepaged
> > Patch 3:    A minor "fix"/optimization
> > Patch 4:    Refactor/rename hpage_collapse
> > Patch 5-7:  Generalize khugepaged functions for arbitrary orders
> > Patch 8-11: The mTHP patches
> >
> > This series acts as an alternative to Dev Jain's approach [1]. The two
> > series differ in a few ways:
> >   - My approach uses a bitmap to store the state of the linear scan_pmd to
> >     then determine potential mTHP batches. Devs incorporates his directly
> >     into the scan, and will try each available order.
>
> So if I'm understanding, the benefit of the bitmap is to remove the need to
> re-scan the "low" PTEs when moving to a lower order, which is what Dev's
> approach does? Are there not some locking/consistency issues to manage if not
> re-scanning?
Correct. So far I haven't found any issues (other than the bugs Dev
reported in his review); my fixed version of this RFC has been
running fine with no notable locking issues.
>
> >   - Dev is attempting to optimize the locking, while my approach keeps the
> >     locking changes to a minimum. I believe his changes are not safe for
> >     uffd.
>
> I agree; let's keep the locking simple for the initial effort.
>
> >   - Dev's changes only work for khugepaged not madvise_collapse (although
> >     i think that was by choice and it could easily support madvise)
>
> I agree supporting MADV_COLLAPSE is good; what exactly are the semantics for it
> though? I think it ignores the sysfs settings (max_ptes_none and friends) so
> presumably it will continue to be much more greedy about collapsing to the
> highest possible order and only fall back to lower orders if the VMA boundaries
> force it to or if the higher order allocation fails?
Kind of. Because I removed the max_ptes_none check during the scan
and reintroduced it in the bitmap scan (without a madvise
restriction), MADV_COLLAPSE and khugepaged will behave more similarly.
>
> >   - Dev scales all khugepaged sysfs tunables by order, while im removing
> >     the restriction of max_ptes_none and converting it to a scale to
> >     determine a (m)THP threshold.
>
> I don't really understand this statement. You say you are removing the
> restriction of max_ptes_none. But then you say you scale it to determine a
> threshold. So are you honoring it or not? And if you're honouring it, how is
> your scaling method different to Dev's? What about the other tunables (shared
> and swap)?
I removed the max_ptes_none restriction during the initial scan, so we
can account for the full PMD (which is what happens with
max_ptes_none=511 anyway). Then max_ptes_none can be used with the
bitmap to calculate a threshold (max_ptes_none=64 == ~90% full) for
finding the optimal mTHP size.

This RFC scales max_ptes_none to 0-100, but that has some really bad
rounding issues, so instead I've incorporated scaling (via bit-shifting)
like Dev did in his series. I've tested this and it's more accurate
now.
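
In other words, something like this (sketch only; names made up):

#define PMD_ORDER 9     /* 512 PTEs per PMD */

/* Scale max_ptes_none down with the collapse order by bit-shifting, so
 * the required "fullness" ratio stays the same at every order; e.g.
 * max_ptes_none=64 allows 64 none PTEs at order 9, 32 at order 8, ... */
static inline int scaled_max_ptes_none(int max_ptes_none, int order)
{
        return max_ptes_none >> (PMD_ORDER - order);
}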
>
> >   - Dev turns on khugepaged if any order is available while mine still
> >     only runs if PMDs are enabled. I like Dev's approach and will most
> >     likely do the same in my PATCH posting.
>
> Agreed. Also, we will want khugepaged to be able to scan VMAs (or parts of VMAs)
> that cover only a partial PMD entry. I think neither of your implementations
> currently do that. As I understand it, Dev's v2 will add that support. Is your
> approach ammeanable to this?

Yes, I believe so. I'm working on adding this too.

>
> >   - mTHPs need their ref count updated to 1<<order, which Dev is missing.
> >
> > Patch 11 was inspired by one of Dev's changes.
>
> I think the 1 problem that emerged during review of Dev's series, which we don't
> have a proper solution to yet, is the issue of "creep", where regions can be
> collapsed to progressively higher orders through iterative scans. At each
> collapse, the required thresholds (e.g. max_ptes_none) are met, and the collapse
> effectively adds more non-none ptes so the next scan will then collapse to even
> higher order. Does your solution suffer from this (theoretical/edge case) issue?
> If not, how did you solve?

Yes, sadly it suffers from the same issue. Bringing max_ptes_none much
lower as a default would "help".
I liked Zi Yan's solution of a per-VMA bit that gets set when
khugepaged collapses, and unset when the VMA changes (pf, realloc,
etc).
Then khugepaged can only operate on VMAs that don't have the bit set.
This way we only collapse once, unless the mapping was changed.

Could we map the new "non-none" pages to the zero page (rather than
actually zeroing the page), so they don't actually act as new "utilized
pages" and are still counted as none pages during the scan (until they
are written to)?

>
> Thanks,
> Ryan

Cheers!
-- Nico

>
>
> >
> > [1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
> >
> > Nico Pache (11):
> >   introduce khugepaged_collapse_single_pmd to collapse a single pmd
> >   khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot
> >   khugepaged: Don't allocate khugepaged mm_slot early
> >   khugepaged: rename hpage_collapse_* to khugepaged_*
> >   khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >   khugepaged: generalize alloc_charge_folio for mTHP support
> >   khugepaged: generalize __collapse_huge_page_* for mTHP support
> >   khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >   khugepaged: add mTHP support
> >   khugepaged: remove max_ptes_none restriction on the pmd scan
> >   khugepaged: skip collapsing mTHP to smaller orders
> >
> >  include/linux/khugepaged.h |   4 +-
> >  mm/huge_memory.c           |   3 +-
> >  mm/khugepaged.c            | 436 +++++++++++++++++++++++++------------
> >  3 files changed, 306 insertions(+), 137 deletions(-)
> >
>



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-16 20:53   ` Nico Pache
@ 2025-01-20  5:17     ` Dev Jain
  2025-01-23 20:24       ` Nico Pache
  2025-01-20 12:49     ` Ryan Roberts
  2025-01-20 12:54     ` David Hildenbrand
  2 siblings, 1 reply; 53+ messages in thread
From: Dev Jain @ 2025-01-20  5:17 UTC (permalink / raw)
  To: Nico Pache, Ryan Roberts
  Cc: linux-kernel, linux-mm, anshuman.khandual, catalin.marinas, cl,
	vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
	srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
	ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
	zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov,
	david, aarcange, raquini, sunnanyong, usamaarif642, audra, akpm



--- snip ---
>>
>> Althogh to be honest, it's not super clear to me what the benefit of the bitmap
>> is vs just iterating through the PTEs like Dev does; is there a significant cost
>> saving in practice? On the face of it, it seems like it might be uneeded complexity.
> The bitmap was to encode the state of PMD without needing rescanning
> (or refactor a lot of code). We keep the scan runtime constant at 512
> (for x86). Dev did some good analysis for this here
> https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/

I think I strayed and over-analyzed, and probably did not make my
main objection clear enough, so let us cut to the chase:
*Why* is it correct to remember the state of the PMD?

In __collapse_huge_page_isolate(), we check the PTEs against the sysfs
tunables again, since we dropped the lock. The bitmap thingy which you 
are doing, and in general, any algorithm which tries to remember the 
state of the PMD, violates the entire point of max_ptes_*. Take for 
example: Suppose the PTE table had a lot of shared ptes. After you drop 
the PTL, you do this: scan_bitmap() -> read_unlock() -> 
alloc_charge_folio() -> read_lock() -> read_unlock()....which is a lot 
of stuff. Now, you do write_lock(), which means that you need to wait 
for all faulting/forking/mremap/mmap etc to stop. Suppose this process 
forks and then a lot of PTEs become shared. The point of max_ptes_shared 
is to stop the collapse here, since we do not want memory bloat 
(collapse will grab more memory from the buddy and the old memory won't 
be freed because it has a reference from the parent/child).
Another example would be, a sysadmin does not want too much memory 
wastage from khugepaged, so we decide to set max_ptes_none low. When you 
scan the PTE table you justify the collapse. After you drop the PTL and 
the mmap_lock, a munmap() happens in the region, no longer justifying 
the collapse. If you have a lot of VMAs of size <= 2MB, then any 
munmap() on a VMA will happen on the single PTE table present.

So, IMHO before even jumping on analyzing the bitmap algorithm, we need 
to ask whether any algorithm remembering the state of the PMD is even 
conceptually right.

Then, you have the harder task of proving that your optimization is
actually an optimization, and that it is not rendered futile by its own
overhead. From a high-level mathematical PoV, you are saving
iterations. Any mathematical analysis has the underlying assumption that
every iteration is equal. But the list [pte, pte + 1, ....., pte + (1 <<
order)] is virtually and physically contiguous in memory, so prefetching
helps us. You are trying to save on PTE memory references, but then look
at the number of bitmap memory references you have created, not to
mention that you are doing a (costly?) division operation in there; you
have a while loop, a stack, new structs, and if conditions. I do not see
how this is any faster than a naive linear scan.

> This prevents needing to hold the read lock for longer, and prevents
> needing to reacquire it too.

My implementation does not hold the read lock for longer. What you mean 
to say is, I need to reacquire the lock, and this is by design, to 
ensure correctness, which boils down to what I wrote above.



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-16 20:53   ` Nico Pache
  2025-01-20  5:17     ` Dev Jain
@ 2025-01-20 12:49     ` Ryan Roberts
  2025-01-23 20:42       ` Nico Pache
  2025-01-20 12:54     ` David Hildenbrand
  2 siblings, 1 reply; 53+ messages in thread
From: Ryan Roberts @ 2025-01-20 12:49 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-kernel, linux-mm, anshuman.khandual, catalin.marinas, cl,
	vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
	srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
	ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
	zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov,
	david, aarcange, raquini, dev.jain, sunnanyong, usamaarif642,
	audra, akpm

On 16/01/2025 20:53, Nico Pache wrote:
> On Thu, Jan 16, 2025 at 2:47 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Hi Nico,
> Hi Ryan!
>>
>> On 08/01/2025 23:31, Nico Pache wrote:
>>> The following series provides khugepaged and madvise collapse with the
>>> capability to collapse regions to mTHPs.
>>
>> It's great to see multiple solutions for this feature being posted; I guess that
>> leaves us with the luxurious problem of figuring out an uber-patchset that
>> incorporates the best of both? :)
> I guess so! My motivation for developing this was inspired by my
> 'defer' RFC. Which can't really live without khugepaged having mTHP
> support (ie having 32k mTHP= always and global=defer doesnt make
> sense).

I'm not sure why that wouldn't make sense? Setting global=defer would only be
picked up for a given size that sets "inherit". So "32k=always, 2m=inherit,
global=defer" is the same as "32k=always, 2m=defer", which means you would try
to allocate 32K directly in the fault handler and defer 2M collapse to
khugepaged. I guess where it would get difficult is if you set a size less than
PMD-size to defer; at the moment khugepaged can't actually do that, it would
just end up collapsing to 2M? Anyway, I'm rambling... I get your point.

>>
>> I haven't had a chance to review your series in detail yet, but have a few
>> questions below that will help me understand the key differences between your
>> series and Dev's.
>>
>>>
>>> To achieve this we generalize the khugepaged functions to no longer depend
>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
>>> (defined by MTHP_MIN_ORDER) that are fully utilized. This info is tracked
>>> using a bitmap. After the PMD scan is done, we do binary recursion on the
>>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
>>> on max_ptes_none is removed during the scan, to make sure we account for
>>> the whole PMD range. max_ptes_none is mapped to a 0-100 range to
>>> determine how full a mTHP order needs to be before collapsing it.
>>>
>>> Some design choices to note:
>>>  - bitmap structures are allocated dynamically because on some arch's
>>>     (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
>>>     compile time leading to warnings.
>>
>> We have MAX_PTRS_PER_PTE and friends though, which are worst case and compile
>> time. Could these help avoid the dynamic allocation?
>>
>> MAX_PMD_ORDER = ilog2(MAX_PTRS_PER_PTE * PAGE_SIZE)
> is the MAX_PMD_ORDER = PMD_ORDER? if not this might introduce weird
> edge cases where PMD_ORDER < MAX_PMD_ORDER.

No, MAX_PMD_ORDER becomes the largest order that could be configured at boot.
PMD_ORDER is what is actually configured at boot. My understanding was that you
were dynamically allocating your bitmap based on the runtime value of PMD_ORDER?
I was just suggesting that you could allocate it statically (on stack or
whatever) based on MAX_PMD_ORDER, for the worst-case requirement and only
actually use the portion required by the runtime PMD_ORDER value. It avoids the
kmalloc call.
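
i.e. something along these lines (just a sketch; I'm assuming your
8-PTEs-per-bit packing and inventing the names):

#include <linux/bitmap.h>
#include <linux/pgtable.h>

#define KHUGEPAGED_CHUNK_PTES   8   /* 1 bit per 8 PTEs, per the cover letter */

struct collapse_scan_state {
        /* sized for the compile-time worst case ... */
        DECLARE_BITMAP(chunks, MAX_PTRS_PER_PTE / KHUGEPAGED_CHUNK_PTES);
};

/* ... but at runtime only bits [0, PTRS_PER_PTE / KHUGEPAGED_CHUNK_PTES)
 * are used, so no kmalloc() is needed even where PTRS_PER_PTE isn't a
 * compile-time constant (e.g. powerpc). */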

> 
>>
>> Althogh to be honest, it's not super clear to me what the benefit of the bitmap
>> is vs just iterating through the PTEs like Dev does; is there a significant cost
>> saving in practice? On the face of it, it seems like it might be uneeded complexity.
> The bitmap was to encode the state of PMD without needing rescanning
> (or refactor a lot of code). We keep the scan runtime constant at 512
> (for x86). Dev did some good analysis for this here
> https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
> This prevents needing to hold the read lock for longer, and prevents
> needing to reacquire it too.
>>>>>  - The recursion is masked through a stack structure.
>>>  - A MTHP_MIN_ORDER was added to compress the bitmap, and ensure it was
>>>     64bit on x86. This provides some optimization on the bitmap operations.
>>>     if other arches/configs that have larger than 512 PTEs per PMD want to
>>>     compress their bitmap further we can change this value per arch.
>>
>> So 1 bit in the bitmap represents 8 pages? And it will only be set if all 8
>> pages are !pte_none()? I'm wondering what will happen if you have a pattern of 4
>> set PTEs followed by 4 none PTEs, followed by 4 set PTEs... If 16K mTHP is
>> enabled, you would want to collapse every other 16K block in this case, but I'm
>> guessing with your scheme, all the bits will be clear and no collapse will
>> occur? But for arm64 at least, collapsing to order-2 (16K) may be desired for HPA.
> 
> Yeah on my V2 ive incorporated a threshold (like max_ptes_none) for
> setting the bit. This will covert this case better (given a better
> default max_ptes_none).
> The way i see it 511 max_ptes_none is just wrong... 

You mean it's a bad default?

> we should flip it
> towards the lower end of the scale (ie 64), and the "always" THP
> setting should ignore it (like madvise does).

But user space can already get that behaviour by modifying the tunable, right?
Isn't that just a user space policy choice?

One other thing that occurs to me regarding the bitmap; In the context of Dev's
series, we have discussed policy for what to do when the source PTEs are backed
by a large folio already. I'm guessing if you are making your
smaller-than-PMD-size collapse decisions based solely on the bitmap, you won't
be able to see when the PTEs are already collpsed for the target order? i.e.
let's say you already have a 64K folio fully mapped in an aligned way. You
wouldn't want to "re-collapse" it to 64K. Are you robust to this?

> 
>>
>>>
>>> Patch 1-2:  Some refactoring to combine madvise_collapse and khugepaged
>>> Patch 3:    A minor "fix"/optimization
>>> Patch 4:    Refactor/rename hpage_collapse
>>> Patch 5-7:  Generalize khugepaged functions for arbitrary orders
>>> Patch 8-11: The mTHP patches
>>>
>>> This series acts as an alternative to Dev Jain's approach [1]. The two
>>> series differ in a few ways:
>>>   - My approach uses a bitmap to store the state of the linear scan_pmd to
>>>     then determine potential mTHP batches. Devs incorporates his directly
>>>     into the scan, and will try each available order.
>>
>> So if I'm understanding, the benefit of the bitmap is to remove the need to
>> re-scan the "low" PTEs when moving to a lower order, which is what Dev's
>> approach does? Are there not some locking/consistency issues to manage if not
>> re-scanning?
> Correct, so far i haven't found any issues (other than the bugs Dev
> reported in his review)-- my fixed version of this RFC has been
> running fine with no notable locking issues.
>>
>>>   - Dev is attempting to optimize the locking, while my approach keeps the
>>>     locking changes to a minimum. I believe his changes are not safe for
>>>     uffd.
>>
>> I agree; let's keep the locking simple for the initial effort.
>>
>>>   - Dev's changes only work for khugepaged not madvise_collapse (although
>>>     i think that was by choice and it could easily support madvise)
>>
>> I agree supporting MADV_COLLAPSE is good; what exactly are the semantics for it
>> though? I think it ignores the sysfs settings (max_ptes_none and friends) so
>> presumably it will continue to be much more greedy about collapsing to the
>> highest possible order and only fall back to lower orders if the VMA boundaries
>> force it to or if the higher order allocation fails?
> Kind of, because I removed the max_ptes_none check during the scan,
> and reintroduced it in the bitmap scan (without a madvise
> restriction), MADV_COLLAPSE and khugepaged will work more similarly.
>>
>>>   - Dev scales all khugepaged sysfs tunables by order, while im removing
>>>     the restriction of max_ptes_none and converting it to a scale to
>>>     determine a (m)THP threshold.
>>
>> I don't really understand this statement. You say you are removing the
>> restriction of max_ptes_none. But then you say you scale it to determine a
>> threshold. So are you honoring it or not? And if you're honouring it, how is
>> your scaling method different to Dev's? What about the other tunables (shared
>> and swap)?
> I removed the max_ptes_none restriction during the initial scan, so we
> can account for the full PMD (which is what happens with
> max_ptes_none=511 anyways). Then max_ptes_none can be used with the
> bitmap to calculate a threshold (max_ptes_none=64 == ~90% full) for
> finding the optimal mTHP size.
> 
> This RFC scales max_ptes_none to 0-100, but it has some really bad
> rounding issues, so instead ive incorporated scaling (via bitshifting)
> like Dev did in his series. Ive tested this and it's more accurate
> now.
>>
>>>   - Dev turns on khugepaged if any order is available while mine still
>>>     only runs if PMDs are enabled. I like Dev's approach and will most
>>>     likely do the same in my PATCH posting.
>>
>> Agreed. Also, we will want khugepaged to be able to scan VMAs (or parts of VMAs)
>> that cover only a partial PMD entry. I think neither of your implementations
>> currently do that. As I understand it, Dev's v2 will add that support. Is your
>> approach ammeanable to this?
> 
> Yes, I believe so. I'm working on adding this too.
> 
>>
>>>   - mTHPs need their ref count updated to 1<<order, which Dev is missing.
>>>
>>> Patch 11 was inspired by one of Dev's changes.
>>
>> I think the 1 problem that emerged during review of Dev's series, which we don't
>> have a proper solution to yet, is the issue of "creep", where regions can be
>> collapsed to progressively higher orders through iterative scans. At each
>> collapse, the required thresholds (e.g. max_ptes_none) are met, and the collapse
>> effectively adds more non-none ptes so the next scan will then collapse to even
>> higher order. Does your solution suffer from this (theoretical/edge case) issue?
>> If not, how did you solve?
> 
> Yes sadly it suffers from the same issue. bringing max_ptes_none much
> lower as a default would "help".
> I liked Zi Yan's solution of a per-VMA bit that gets set when
> khugepaged collapses, and unset when the VMA changes (pf, realloc,
> etc).
> Then khugepaged can only operate on VMAs that dont have the bit set.
> This way we only collapse once, unless the mapping was changed.

Dev raised the issue, in discussion against his series, that currently khugepaged
doesn't scan the entire VMA; it scans up to the first PMD that it collapses and
then moves to another VMA. I guess that's a fairness thing. So a VMA flag won't
quite do the trick, assuming we want to continue with that behavior. Perhaps we
could keep a "cursor" in the VMA though, which describes the starting address of
the next scan. We can move it forwards as we scan, and move it backwards when
taking a fault. Still not perfect, but perhaps good enough?
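
Something like the below is what I'm picturing (purely illustrative; the
field and helpers don't exist today):

/* Hypothetical per-VMA scan cursor; names invented. */
struct vma_scan_cursor {
        unsigned long next_scan_addr;   /* where khugepaged resumes */
};

/* khugepaged: after finishing a range, remember where to resume. */
static inline void cursor_advance(struct vma_scan_cursor *c, unsigned long end)
{
        if (end > c->next_scan_addr)
                c->next_scan_addr = end;
}

/* fault/mremap/munmap below the cursor: rewind so the range is rescanned. */
static inline void cursor_rewind(struct vma_scan_cursor *c, unsigned long addr)
{
        if (addr < c->next_scan_addr)
                c->next_scan_addr = addr;
}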

> 
> Could we map the new "non-none" pages to the zero page (rather than
> actually zeroing the page), so they dont actually act as new "utilized
> pages" and are still counted as none pages during the scan (until they
> are written to)?

I think you are proposing to use the zero page as a PTE marker to say "this
region is scheduled for collapse"? In which case, why not just use a PTE
marker... But you still have to do the collapse at some point (which I guess you
are now deferring to the next page fault that hits one of those markers)? Once
you have collapsed, you're still back to the original issue. So I don't think
it's bought you anything except complexity and more latency :)

Thanks,
Ryan

> 
>>
>> Thanks,
>> Ryan
> 
> Cheers!
> -- Nico
> 
>>
>>
>>>
>>> [1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
>>>
>>> Nico Pache (11):
>>>   introduce khugepaged_collapse_single_pmd to collapse a single pmd
>>>   khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot
>>>   khugepaged: Don't allocate khugepaged mm_slot early
>>>   khugepaged: rename hpage_collapse_* to khugepaged_*
>>>   khugepaged: generalize hugepage_vma_revalidate for mTHP support
>>>   khugepaged: generalize alloc_charge_folio for mTHP support
>>>   khugepaged: generalize __collapse_huge_page_* for mTHP support
>>>   khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>>>   khugepaged: add mTHP support
>>>   khugepaged: remove max_ptes_none restriction on the pmd scan
>>>   khugepaged: skip collapsing mTHP to smaller orders
>>>
>>>  include/linux/khugepaged.h |   4 +-
>>>  mm/huge_memory.c           |   3 +-
>>>  mm/khugepaged.c            | 436 +++++++++++++++++++++++++------------
>>>  3 files changed, 306 insertions(+), 137 deletions(-)
>>>
>>
> 



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-16 20:53   ` Nico Pache
  2025-01-20  5:17     ` Dev Jain
  2025-01-20 12:49     ` Ryan Roberts
@ 2025-01-20 12:54     ` David Hildenbrand
  2025-01-20 13:37       ` Ryan Roberts
  2 siblings, 1 reply; 53+ messages in thread
From: David Hildenbrand @ 2025-01-20 12:54 UTC (permalink / raw)
  To: Nico Pache, Ryan Roberts
  Cc: linux-kernel, linux-mm, anshuman.khandual, catalin.marinas, cl,
	vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
	srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
	ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
	zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm

>> I think the 1 problem that emerged during review of Dev's series, which we don't
>> have a proper solution to yet, is the issue of "creep", where regions can be
>> collapsed to progressively higher orders through iterative scans. At each
>> collapse, the required thresholds (e.g. max_ptes_none) are met, and the collapse
>> effectively adds more non-none ptes so the next scan will then collapse to even
>> higher order. Does your solution suffer from this (theoretical/edge case) issue?
>> If not, how did you solve?
> 
> Yes sadly it suffers from the same issue. bringing max_ptes_none much
> lower as a default would "help".

Can we just keep it simple and only support max_ptes_none = 511 
("pagefault behavior" -- PMD_NR_PAGES - 1) or max_ptes_none = 0 
("deferred behavior") and document that the other weird configurations 
will make mTHP skip, because "weird and unexpected"? :)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-20 12:54     ` David Hildenbrand
@ 2025-01-20 13:37       ` Ryan Roberts
  2025-01-20 13:56         ` David Hildenbrand
  0 siblings, 1 reply; 53+ messages in thread
From: Ryan Roberts @ 2025-01-20 13:37 UTC (permalink / raw)
  To: David Hildenbrand, Nico Pache
  Cc: linux-kernel, linux-mm, anshuman.khandual, catalin.marinas, cl,
	vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
	srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
	ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
	zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm

On 20/01/2025 12:54, David Hildenbrand wrote:
>>> I think the 1 problem that emerged during review of Dev's series, which we don't
>>> have a proper solution to yet, is the issue of "creep", where regions can be
>>> collapsed to progressively higher orders through iterative scans. At each
>>> collapse, the required thresholds (e.g. max_ptes_none) are met, and the collapse
>>> effectively adds more non-none ptes so the next scan will then collapse to even
>>> higher order. Does your solution suffer from this (theoretical/edge case) issue?
>>> If not, how did you solve?
>>
>> Yes sadly it suffers from the same issue. bringing max_ptes_none much
>> lower as a default would "help".
> 
> Can we just keep it simple and only support max_ptes_none = 511 ("pagefault
> behavior" -- PMD_NR_PAGES - 1) or max_ptes_none = 0 ("deferred behavior") and
> document that the other weird configurations will make mTHP skip, because "weird
> and unexpetced" ? :)
> 

That sounds like a great simplification in principle! We would need to consider
the swap and shared tunables too though. Perhaps we can pull a similar trick
with those?


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-20 13:37       ` Ryan Roberts
@ 2025-01-20 13:56         ` David Hildenbrand
  2025-01-20 16:27           ` Ryan Roberts
  0 siblings, 1 reply; 53+ messages in thread
From: David Hildenbrand @ 2025-01-20 13:56 UTC (permalink / raw)
  To: Ryan Roberts, Nico Pache
  Cc: linux-kernel, linux-mm, anshuman.khandual, catalin.marinas, cl,
	vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
	srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
	ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
	zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm

On 20.01.25 14:37, Ryan Roberts wrote:
> On 20/01/2025 12:54, David Hildenbrand wrote:
>>>> I think the 1 problem that emerged during review of Dev's series, which we don't
>>>> have a proper solution to yet, is the issue of "creep", where regions can be
>>>> collapsed to progressively higher orders through iterative scans. At each
>>>> collapse, the required thresholds (e.g. max_ptes_none) are met, and the collapse
>>>> effectively adds more non-none ptes so the next scan will then collapse to even
>>>> higher order. Does your solution suffer from this (theoretical/edge case) issue?
>>>> If not, how did you solve?
>>>
>>> Yes sadly it suffers from the same issue. bringing max_ptes_none much
>>> lower as a default would "help".
>>
>> Can we just keep it simple and only support max_ptes_none = 511 ("pagefault
>> behavior" -- PMD_NR_PAGES - 1) or max_ptes_none = 0 ("deferred behavior") and
>> document that the other weird configurations will make mTHP skip, because "weird
>> and unexpetced" ? :)
>>
> 
> That sounds like a great simplification in principle!

And certainly a much easier to start with :)

If we ever get the request to support something else, maybe that's also 
where we can learn *why*, and what we would actually want to do with mTHP.

> We would need to consider
> the swap and shared tunables too though. Perhaps we can pull a similar trick
> with those?

Swapped and shared are a bit more challenging, because they are set to 
"/ 2" or "/ 8" heuristics.


One simple starting point here is of course to say "when collapsing 
mTHP, all have to be unshared and all have to be swapped in", so to 
essentially ignore both tunables (in a memory friendly way, as if they 
are set to 0) for mTHP collapse and worry about that later, when really 
required.

Two alternatives I discussed with Nico for these (not sure which is 
implemented here) is to calculate it proportionally to the folio order 
we are collapsing:

Assuming max_ptes_swap = 64 (PMD: 512 PTEs) and we are collapsing a 1 
MiB mTHP (256 PTEs), 32 PTEs would be allowed to be swapped out.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-20 13:56         ` David Hildenbrand
@ 2025-01-20 16:27           ` Ryan Roberts
  2025-01-20 18:39             ` David Hildenbrand
  0 siblings, 1 reply; 53+ messages in thread
From: Ryan Roberts @ 2025-01-20 16:27 UTC (permalink / raw)
  To: David Hildenbrand, Nico Pache
  Cc: linux-kernel, linux-mm, anshuman.khandual, catalin.marinas, cl,
	vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
	srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
	ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
	zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm

On 20/01/2025 13:56, David Hildenbrand wrote:
> On 20.01.25 14:37, Ryan Roberts wrote:
>> On 20/01/2025 12:54, David Hildenbrand wrote:
>>>>> I think the 1 problem that emerged during review of Dev's series, which we
>>>>> don't
>>>>> have a proper solution to yet, is the issue of "creep", where regions can be
>>>>> collapsed to progressively higher orders through iterative scans. At each
>>>>> collapse, the required thresholds (e.g. max_ptes_none) are met, and the
>>>>> collapse
>>>>> effectively adds more non-none ptes so the next scan will then collapse to
>>>>> even
>>>>> higher order. Does your solution suffer from this (theoretical/edge case)
>>>>> issue?
>>>>> If not, how did you solve?
>>>>
>>>> Yes sadly it suffers from the same issue. bringing max_ptes_none much
>>>> lower as a default would "help".
>>>
>>> Can we just keep it simple and only support max_ptes_none = 511 ("pagefault
>>> behavior" -- PMD_NR_PAGES - 1) or max_ptes_none = 0 ("deferred behavior") and
>>> document that the other weird configurations will make mTHP skip, because "weird
>>> and unexpetced" ? :)

nit: Rather than values of max_ptes_none other than 0 and max making mTHP skip,
perhaps it's better to say we round to closest of 0 and max?

>>>
>>
>> That sounds like a great simplification in principle!
> 
> And certainly a much easier to start with :)
> 
> If we ever get the request to support something else, maybe that's also where we
> can learn *why*, and what we would actually want to do with mTHP.
> 
>> We would need to consider
>> the swap and shared tunables too though. Perhaps we can pull a similar trick
>> with those?
> 
> Swapped and shared are a bit more challenging, because they are set to "/ 2" or
> "/ 8" heuristics.
> 
> 
> One simple starting point here is of course to say "when collapsing mTHP, all
> have to be unshared and all have to be swapped in", so to essentially ignore
> both tunables (in a memory friendly way, as if they are set to 0) for mTHP
> collapse and worry about that later, when really required.

For swap, if we assume we start with the whole VMA swapped out, I think setting
max_ptes_swap to 0 could still cause the "creep" problem if faulting pages back
in sequentially? I guess that's creep due to faulting pattern though, so at
least it's not due to collapse. Doesn't feel ideal though.

I'm not sure what the semantic of "shared" is? I'm guessing it's specifically
for private COWed pages, and khugepaged will trigger the COW on collapse? So
again depending on the pattern of writes we could still end up with creep in a
similar way to swap?

> 
> Two alternatives I discussed with Nico for these (not sure which is implemented
> here) is to calculate it proportionally to the folio order we are collapsing:

You're only listing one option here... what's the other one you discussed?

> 
> Assuming max_ptes_swap = 64 (PMD: 512 PTEs) and we are collapsing a 1 MiB mTHP
> (256 PTEs), 32 PTEs would be allowed to be swapped out.

Yeah this is exactly what Dev's version is doing at the moment. But that's the
behaviour that leads to the "creep" problem.

Thanks,
Ryan

> 



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-20 16:27           ` Ryan Roberts
@ 2025-01-20 18:39             ` David Hildenbrand
  2025-01-21  9:48               ` Ryan Roberts
  0 siblings, 1 reply; 53+ messages in thread
From: David Hildenbrand @ 2025-01-20 18:39 UTC (permalink / raw)
  To: Ryan Roberts, Nico Pache
  Cc: linux-kernel, linux-mm, anshuman.khandual, catalin.marinas, cl,
	vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
	srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
	ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
	zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm

On 20.01.25 17:27, Ryan Roberts wrote:
> On 20/01/2025 13:56, David Hildenbrand wrote:
>> On 20.01.25 14:37, Ryan Roberts wrote:
>>> On 20/01/2025 12:54, David Hildenbrand wrote:
>>>>>> I think the 1 problem that emerged during review of Dev's series, which we
>>>>>> don't
>>>>>> have a proper solution to yet, is the issue of "creep", where regions can be
>>>>>> collapsed to progressively higher orders through iterative scans. At each
>>>>>> collapse, the required thresholds (e.g. max_ptes_none) are met, and the
>>>>>> collapse
>>>>>> effectively adds more non-none ptes so the next scan will then collapse to
>>>>>> even
>>>>>> higher order. Does your solution suffer from this (theoretical/edge case)
>>>>>> issue?
>>>>>> If not, how did you solve?
>>>>>
>>>>> Yes sadly it suffers from the same issue. bringing max_ptes_none much
>>>>> lower as a default would "help".
>>>>
>>>> Can we just keep it simple and only support max_ptes_none = 511 ("pagefault
>>>> behavior" -- PMD_NR_PAGES - 1) or max_ptes_none = 0 ("deferred behavior") and
>>>> document that the other weird configurations will make mTHP skip, because "weird
>>>> and unexpetced" ? :)
> 
> nit: Rather than values of max_ptes_none other than 0 and max making mTHP skip,
> perhaps it's better to say we round to closest of 0 and max?

Maybe. Rounding down always implies doing something not necessarily desired.

In any case, I assume most setups just have the default values here ... :)

> 
>>>>
>>>
>>> That sounds like a great simplification in principle!
>>
>> And certainly a much easier to start with :)
>>
>> If we ever get the request to support something else, maybe that's also where we
>> can learn *why*, and what we would actually want to do with mTHP.
>>
>>> We would need to consider
>>> the swap and shared tunables too though. Perhaps we can pull a similar trick
>>> with those?
>>
>> Swapped and shared are a bit more challenging, because they are set to "/ 2" or
>> "/ 8" heuristics.
>>
>>
>> One simple starting point here is of course to say "when collapsing mTHP, all
>> have to be unshared and all have to be swapped in", so to essentially ignore
>> both tunables (in a memory friendly way, as if they are set to 0) for mTHP
>> collapse and worry about that later, when really required.
> 
> For swap, if we assume we start with the whole VMA swapped out, I think setting
> max_ptes_swap to 0 could still cause the "creep" problem if faulting pages back
> in sequentially? I guess that's creep due to faulting pattern though, so at
> least it's not due to collapse. Doesn't feel ideal though.
> least it's not due to collapse. Doesn't feel ideal though.
>
> I'm not sure what the semantic of "shared" is? I'm guessing it's specifically
> for private COWed pages, and khugepaged will trigger the COW on collapse?

Yes.

> So
> again depending on the pattern of writes we could still end up with creep in a
> similar way to swap?

I think in regard to both the answer is "yes", so it's a simple starting point
but not necessarily what we want long term. The creep is at least "not wasting
more memory", because we don't collapse where a PMD collapse wouldn't have
collapsed.

After all, right now we don't collapse mTHP at all; with this we would collapse
mTHP in many scenarios, so we don't have to be perfect initially.

Deriving stuff for small THP sizes when configured for PMD THP sizes is 
not easy to do right.

> 
>>
>> Two alternatives I discussed with Nico for these (not sure which is implemented
>> here) is to calculate it proportionally to the folio order we are collapsing:
> 
> You're only listing one option here... what's the other one you discussed?
>

Ah sorry, reshuffled it and then had to rush.

The other thing I had in mind is to scan the whole PMD range, and
skip the whole PMD range if it doesn't obey the max_ptes_*
stuff. Not perfect, but it will mean that we behave just like PMD collapse
would, unless I am missing something.
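
i.e. roughly this shape (sketch with made-up names and constants, not
actual kernel code):

#include <stdbool.h>

#define PTES_PER_PMD 512

struct pmd_stats {
        int nr_none, nr_swap, nr_shared;  /* from one scan of the whole PMD */
};

/* Apply the existing tunables to the whole PMD range, exactly as a PMD
 * collapse would ... */
static bool pmd_obeys_tunables(const struct pmd_stats *s, int max_ptes_none,
                               int max_ptes_swap, int max_ptes_shared)
{
        return s->nr_none <= max_ptes_none &&
               s->nr_swap <= max_ptes_swap &&
               s->nr_shared <= max_ptes_shared;
}

/* ... and only if that passes, collapse to the largest enabled order,
 * skipping sub-ranges that are entirely none. */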


>>
>> Assuming max_ptes_swap = 64 (PMD: 512 PTEs) and we are collapsing a 1 MiB mTHP
>> (256 PTEs), 32 PTEs would be allowed to be swapped out.
> 
> Yeah this is exactly what Dev's version is doing at the moment. But that's the
> behaviour that leads to the "creep" problem.

Right.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-20 18:39             ` David Hildenbrand
@ 2025-01-21  9:48               ` Ryan Roberts
  2025-01-21 10:19                 ` David Hildenbrand
  2025-01-22  5:18                 ` Dev Jain
  0 siblings, 2 replies; 53+ messages in thread
From: Ryan Roberts @ 2025-01-21  9:48 UTC (permalink / raw)
  To: David Hildenbrand, Nico Pache
  Cc: linux-kernel, linux-mm, anshuman.khandual, catalin.marinas, cl,
	vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
	srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
	ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
	zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm

On 20/01/2025 18:39, David Hildenbrand wrote:
> On 20.01.25 17:27, Ryan Roberts wrote:
>> On 20/01/2025 13:56, David Hildenbrand wrote:
>>> On 20.01.25 14:37, Ryan Roberts wrote:
>>>> On 20/01/2025 12:54, David Hildenbrand wrote:
>>>>>>> I think the 1 problem that emerged during review of Dev's series, which we
>>>>>>> don't
>>>>>>> have a proper solution to yet, is the issue of "creep", where regions can be
>>>>>>> collapsed to progressively higher orders through iterative scans. At each
>>>>>>> collapse, the required thresholds (e.g. max_ptes_none) are met, and the
>>>>>>> collapse
>>>>>>> effectively adds more non-none ptes so the next scan will then collapse to
>>>>>>> even
>>>>>>> higher order. Does your solution suffer from this (theoretical/edge case)
>>>>>>> issue?
>>>>>>> If not, how did you solve?
>>>>>>
>>>>>> Yes sadly it suffers from the same issue. bringing max_ptes_none much
>>>>>> lower as a default would "help".
>>>>>
>>>>> Can we just keep it simple and only support max_ptes_none = 511 ("pagefault
>>>>> behavior" -- PMD_NR_PAGES - 1) or max_ptes_none = 0 ("deferred behavior") and
>>>>> document that the other weird configurations will make mTHP skip, because
>>>>> "weird
>>>>> and unexpetced" ? :)
>>
>> nit: Rather than values of max_ptes_none other than 0 and max making mTHP skip,
>> perhaps it's better to say we round to closest of 0 and max?
> 
> Maybe. Rounding down always implies doing something not necessarily desired.
> 
> In any case, I assume most setups just have the default values here ... :)
> 
>>
>>>>>
>>>>
>>>> That sounds like a great simplification in principle!
>>>
>>> And certainly a much easier to start with :)
>>>
>>> If we ever get the request to support something else, maybe that's also where we
>>> can learn *why*, and what we would actually want to do with mTHP.
>>>
>>>> We would need to consider
>>>> the swap and shared tunables too though. Perhaps we can pull a similar trick
>>>> with those?
>>>
>>> Swapped and shared are a bit more challenging, because they are set to "/ 2" or
>>> "/ 8" heuristics.
>>>
>>>
>>> One simple starting point here is of course to say "when collapsing mTHP, all
>>> have to be unshared and all have to be swapped in", so to essentially ignore
>>> both tunables (in a memory friendly way, as if they are set to 0) for mTHP
>>> collapse and worry about that later, when really required.
>>
>> For swap, if we assume we start with the whole VMA swapped out, I think setting
>> max_ptes_swap to 0 could still cause the "creep" problem if faulting pages back
>> in sequentially? I guess that's creep due to faulting pattern though, so at
>> least it's not due to collapse. Doesn't feel ideal though.
>>
>> I'm not sure what the semantic of "shared" is? I'm guessing it's specifically
>> for private COWed pages, and khugepaged will trigger the COW on collapse?
> 
> Yes.
> 
>> So
>> again depending on the pattern of writes we could still end up with creep in a
>> similar way to swap?
> 
> I think in regards of both "yes", so a simple starting point but not necessarily
> what we want long term. The creep is at least "not wasting more memory", because
> we don't collapse where PMD wouldn't have collapsed.
> 
> After all, right now we don't collapse mTHP, now we would collapse mTHP in many
> scenarios, so we don't have to be perfect initially.
> 
> Deriving stuff for small THP sizes when configured for PMD THP sizes is not easy
> to do right.
> 
>>
>>>
>>> Two alternatives I discussed with Nico for these (not sure which is implemented
>>> here) is to calculate it proportionally to the folio order we are collapsing:
>>
>> You're only listing one option here... what's the other one you discussed?
>>
> 
> Ah sorry, reshuffled it and then had to rush.
> 
> The other thing I had in mind is to scan the whole PMD range, and discard skip
> the whole PMD range if it doesn't obey the max_ptes_* stuff. Not perfect, but
> will mean that we behave just like PMD collapse would, unless I am missing
> something.

Hmm, that's an interesting idea; if I've understood, we would effectively test
the PMD for collapse as if we were collapsing to PMD-size, but then do the
actual collapse to the "highest allowed order" (dictated by what's enabled +
MADV_HUGEPAGE config).

I'm not so sure this is a good way to go; there would be no way to support VMAs
(or parts of VMAs) that don't span a full PMD. And I can imagine we might see
memory bloat; imagine you have 2M=madvise, 64K=always, max_ptes_none=511, and
let's say we have a 2M (aligned portion of a) VMA that does NOT have
MADV_HUGEPAGE set and has a single page populated. It passes the PMD-size test,
but we opt to collapse to 64K (since 2M=madvise). So now we end up with 32x 64K
folios, 31 of which are all zeros. We have spent the same amount of memory as if
2M=always. Perhaps that's a detail that could be solved by ignoring fully none
64K blocks when collapsing to 64K...

Personally, I think your "enforce simplification of the tunables for mTHP
collapse" idea is the best we have so far.

But I'll just push against your pushback of the per-VMA cursor idea briefly. It
strikes me that this could be useful for khugepaged regardless of mTHP support.
Today, it starts scanning a VMA, collapses the first PMD it finds that meets the
requirements, then switches to scanning another VMA. When it eventually gets
back to scanning the first VMA, it starts from the beginning again. Wouldn't a
cursor help reduce the amount of scanning it has to do?

> 
> 
>>>
>>> Assuming max_ptes_swap = 64 (PMD: 512 PTEs) and we are collapsing a 1 MiB mTHP
>>> (256 PTEs), 32 PTEs would be allowed to be swapped out.
>>
>> Yeah this is exactly what Dev's version is doing at the moment. But that's the
>> behaviour that leads to the "creep" problem.
> 
> Right.
> 



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-21  9:48               ` Ryan Roberts
@ 2025-01-21 10:19                 ` David Hildenbrand
  2025-01-27  9:31                   ` Dev Jain
  2025-01-22  5:18                 ` Dev Jain
  1 sibling, 1 reply; 53+ messages in thread
From: David Hildenbrand @ 2025-01-21 10:19 UTC (permalink / raw)
  To: Ryan Roberts, Nico Pache
  Cc: linux-kernel, linux-mm, anshuman.khandual, catalin.marinas, cl,
	vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
	srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
	ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
	zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov,
	aarcange, raquini, dev.jain, sunnanyong, usamaarif642, audra,
	akpm

> Hmm that's an interesting idea; If I've understood, we would effectively test
> the PMD for collapse as if we were collapsing to PMD-size, but then do the
> actual collapse to the "highest allowed order" (dictated by what's enabled +
> MADV_HUGEPAGE config).
> 
> I'm not so sure this is a good way to go; there would be no way to support VMAs
> (or parts of VMAs) that don't span a full PMD. 


In Nico's approach to locking, we temporarily have to remove the PTE
table either way. While holding the mmap lock in write mode, the VMAs
cannot go away, so we could scan the whole PTE table to figure it out.

To just figure out "none" vs. "non-none" vs. "swap PTE", we probably
don't need the other VMA information. Figuring out "shared" is trickier,
because we have to obtain the folio and would have to walk the other VMAs.

It's a good question whether we would have to VMA-write-lock the other
affected VMAs as well in order to temporarily remove the PTE table that
crosses multiple VMAs, or whether we'd need something different (a collapse
PMD marker) so the page table walkers could handle that case properly --
keep retrying or fall back to the mmap lock.

> And I can imagine we might see
> memory bloat; imagine you have 2M=madvise, 64K=always, max_ptes_none=511, and
> let's say we have a 2M (aligned portion of a) VMA that does NOT have
> MADV_HUGEPAGE set and has a single page populated. It passes the PMD-size test,
> but we opt to collapse to 64K (since 2M=madvise). So now we end up with 32x 64K
> folios, 31 of which are all zeros. We have spent the same amount of memory as if
> 2M=always. Perhaps that's a detail that could be solved by ignoring fully none
> 64K blocks when collapsing to 64K...

Yes, that's what I had in mind. No need to collapse where there is 
nothing at all ...

> 
> Personally, I think your "enforce simplicifation of the tunables for mTHP
> collapse" idea is the best we have so far.

Right.

> 
> But I'll just push against your pushback of the per-VMA cursor idea briefly. It
> strikes me that this could be useful for khugepaged regardless of mTHP support.

Not a clear pushback; as you say, to me this is a different optimization,
and I am missing how it could really solve the problem at hand here.

Note that we're already fighting against growing VMAs (see the VMA
locking changes under review), but maybe we could still squeeze it in
there without requiring a bigger slab.

> Today, it starts scanning a VMA, collapses the first PMD it finds that meets the
> requirements, then switches to scanning another VMA. When it eventually gets
> back to scanning the first VMA, it starts from the beginning again. Wouldn't a
> cursor help reduce the amount of scanning it has to do?

Yes, that whole scanning approach sounds weird. I would have assumed that
it might nowadays be smarter to just scan the MM sequentially, and not
jump between VMAs.

If you only have a handful of large VMAs (like in a VMM), you'd end
up scanning the same handful of VMAs over and over again.

I think a lot of the khugepaged codebase is just full of historical
baggage that needs to be cleaned up and re-validated to see if it's still
required ...

-- 
Cheers,

David / dhildenb




* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-21  9:48               ` Ryan Roberts
  2025-01-21 10:19                 ` David Hildenbrand
@ 2025-01-22  5:18                 ` Dev Jain
  1 sibling, 0 replies; 53+ messages in thread
From: Dev Jain @ 2025-01-22  5:18 UTC (permalink / raw)
  To: Ryan Roberts, David Hildenbrand, Nico Pache
  Cc: linux-kernel, linux-mm, anshuman.khandual, catalin.marinas, cl,
	vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
	srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
	ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
	zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov,
	aarcange, raquini, sunnanyong, usamaarif642, audra, akpm



On 21/01/25 3:18 pm, Ryan Roberts wrote:
> On 20/01/2025 18:39, David Hildenbrand wrote:
>> On 20.01.25 17:27, Ryan Roberts wrote:
>>> On 20/01/2025 13:56, David Hildenbrand wrote:
>>>> On 20.01.25 14:37, Ryan Roberts wrote:
>>>>> On 20/01/2025 12:54, David Hildenbrand wrote:
>>>>>>>> I think the 1 problem that emerged during review of Dev's series, which we
>>>>>>>> don't
>>>>>>>> have a proper solution to yet, is the issue of "creep", where regions can be
>>>>>>>> collapsed to progressively higher orders through iterative scans. At each
>>>>>>>> collapse, the required thresholds (e.g. max_ptes_none) are met, and the
>>>>>>>> collapse
>>>>>>>> effectively adds more non-none ptes so the next scan will then collapse to
>>>>>>>> even
>>>>>>>> higher order. Does your solution suffer from this (theoretical/edge case)
>>>>>>>> issue?
>>>>>>>> If not, how did you solve?
>>>>>>>
>>>>>>> Yes sadly it suffers from the same issue. bringing max_ptes_none much
>>>>>>> lower as a default would "help".
>>>>>>
>>>>>> Can we just keep it simple and only support max_ptes_none = 511 ("pagefault
>>>>>> behavior" -- PMD_NR_PAGES - 1) or max_ptes_none = 0 ("deferred behavior") and
>>>>>> document that the other weird configurations will make mTHP skip, because
>>>>>> "weird
>>>>>> and unexpetced" ? :)
>>>
>>> nit: Rather than values of max_ptes_none other than 0 and max making mTHP skip,
>>> perhaps it's better to say we round to closest of 0 and max?
>>
>> Maybe. Rounding down always implies doing something not necessarily desired.
>>
>> In any case, I assume most setups just have the default values here ... :)
>>
>>>
>>>>>>
>>>>>
>>>>> That sounds like a great simplification in principle!
>>>>
>>>> And certainly a much easier to start with :)
>>>>
>>>> If we ever get the request to support something else, maybe that's also where we
>>>> can learn *why*, and what we would actually want to do with mTHP.
>>>>
>>>>> We would need to consider
>>>>> the swap and shared tunables too though. Perhaps we can pull a similar trick
>>>>> with those?
>>>>
>>>> Swapped and shared are a bit more challenging, because they are set to "/ 2" or
>>>> "/ 8" heuristics.
>>>>
>>>>
>>>> One simple starting point here is of course to say "when collapsing mTHP, all
>>>> have to be unshared and all have to be swapped in", so to essentially ignore
>>>> both tunables (in a memory friendly way, as if they are set to 0) for mTHP
>>>> collapse and worry about that later, when really required.
>>>
>>> For swap, if we assume we start with the whole VMA swapped out, I think setting
>>> max_ptes_swap to 0 could still cause the "creep" problem if faulting pages back
>>> in sequentially? I guess that's creep due to faulting pattern though, so at
>>> least it's not due to collapse. Doesn't feel ideal though.
>>>
>>> I'm not sure what the semantic of "shared" is? I'm guessing it's specifically
>>> for private COWed pages, and khugepaged will trigger the COW on collapse?
>>
>> Yes.
>>
>>> So
>>> again depending on the pattern of writes we could still end up with creep in a
>>> similar way to swap?
>>
>> I think in regards of both "yes", so a simple starting point but not necessarily
>> what we want long term. The creep is at least "not wasting more memory", because
>> we don't collapse where PMD wouldn't have collapsed.
>>
>> After all, right now we don't collapse mTHP, now we would collapse mTHP in many
>> scenarios, so we don't have to be perfect initially.
>>
>> Deriving stuff for small THP sizes when configured for PMD THP sizes is not easy
>> to do right.
>>
>>>
>>>>
>>>> Two alternatives I discussed with Nico for these (not sure which is implemented
>>>> here) is to calculate it proportionally to the folio order we are collapsing:
>>>
>>> You're only listing one option here... what's the other one you discussed?
>>>
>>
>> Ah sorry, reshuffled it and then had to rush.
>>
>> The other thing I had in mind is to scan the whole PMD range, and discard skip
>> the whole PMD range if it doesn't obey the max_ptes_* stuff. Not perfect, but
>> will mean that we behave just like PMD collapse would, unless I am missing
>> something.
> 
> Hmm that's an interesting idea; If I've understood, we would effectively test
> the PMD for collapse as if we were collapsing to PMD-size, but then do the
> actual collapse to the "highest allowed order" (dictated by what's enabled +
> MADV_HUGEPAGE config).
> 
> I'm not so sure this is a good way to go; there would be no way to support VMAs
> (or parts of VMAs) that don't span a full PMD. And I can imagine we might see
> memory bloat; imagine you have 2M=madvise, 64K=always, max_ptes_none=511, and
> let's say we have a 2M (aligned portion of a) VMA that does NOT have
> MADV_HUGEPAGE set and has a single page populated. It passes the PMD-size test,
> but we opt to collapse to 64K (since 2M=madvise). So now we end up with 32x 64K
> folios, 31 of which are all zeros. We have spent the same amount of memory as if
> 2M=always. Perhaps that's a detail that could be solved by ignoring fully none
> 64K blocks when collapsing to 64K...

There are two ways a collapse can fail.
(1) We exceed one of the max_ptes_* limits in the range.
(2) We fail in collapse_huge_page(), whose real bottleneck is 
alloc_charge_folio(); i.e., we fail to find physically contiguous memory 
for the corresponding scan order.

Now, we do not care whether we get a "creep" in case of (2), because 
khugepaged's end goal is to collapse to the highest order. We do care about 
a creep from (1), because a smaller collapse can bring us under the 
max_ptes_* constraint for a bigger order.

So I think what David is suggesting is to ignore (1) for smaller orders. 
I'll try to formalize the proposed policy as follows: khugepaged's goal 
should be to collapse to the highest order possible, and therefore we 
should bother with max_ptes_* only for the highest order, since that 
really is the end goal. So the algorithm is: check max_ptes_* for PMD -> 
if success, collapse_huge_page(PMD_ORDER) -> if this fails, we drop down 
the order and ignore the local distribution of none, swap and shared PTEs 
-> collapse_huge_page(say, 64K) -> success. 
Now, this won't give us any creep, since PMD collapse was eligible 
anyway. The only rule we should add is "collapse to a smaller order 
only when at least one PTE is filled in that range", since we do not want 
to collapse an empty range, as Ryan notes above. This should be easy to 
do; just maintain a local bitmap in hpage_collapse_scan_pmd(). The 
question we need to answer is "is the bitmap non-zero for a range?", 
which we can answer in O(1).
The issue with honouring max_ptes_* arises when we drop and reacquire 
locks, because the global distribution may change, so we will be trading 
away some accuracy...
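
A rough sketch of that policy (all helper names here -- check_pmd_tunables,
range_has_mapped_pte, collapse_range -- are invented for illustration and
are not functions in mm/khugepaged.c):

static void collapse_pmd_with_fallback(struct mm_struct *mm,
				       unsigned long pmd_start, int small_order)
{
	unsigned long addr;

	/* Honour max_ptes_none/swap/shared once, at PMD order only. */
	if (!check_pmd_tunables(mm, pmd_start))
		return;

	/* Try the end goal first. */
	if (collapse_range(mm, pmd_start, HPAGE_PMD_ORDER))
		return;

	/*
	 * Allocation failed, so drop the order.  Ignore the local
	 * distribution of none/swap/shared PTEs, but skip sub-ranges with
	 * no mapped PTEs at all so we don't manufacture empty folios.
	 */
	for (addr = pmd_start; addr < pmd_start + HPAGE_PMD_SIZE;
	     addr += PAGE_SIZE << small_order) {
		if (!range_has_mapped_pte(mm, addr, small_order))
			continue;
		collapse_range(mm, addr, small_order);
	}
}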

With regards to "there would be no way to support VMAs that don't span a 
full PMD", the policy should be "scale max_ptes_* to the highest order 
possible for the VMA". Basically, just get rid of the PMD-case and 
generalize this algorithm.

> 
> Personally, I think your "enforce simplicifation of the tunables for mTHP
> collapse" idea is the best we have so far.

What I understood: keep max_ptes_none = 511 or 0, document that any other 
value may cause the creep, and treat max_ptes_swap = max_ptes_shared = 0 
for mTHP. I will vote for this too... we won't have locking issues, since 
we will *have* to scan the ranges for smaller orders anyway to check that 
nothing is swapped out or shared. I have two justifications to support 
this policy:

(1) Systems in general, and Android in particular, may have a good 
percentage of memory in swap, so we really don't want khugepaged to say 
"please swap in memory for this range so I can collapse to 64K" when 
the real goal is to collapse to 2MB.

(2) The reason we dropped down to a lower order is that PMD collapse 
failed, and the reason for that is that we couldn't find 2MB of physically 
contiguous memory, so assume for the sake of argument that we are under 
memory pressure. We don't want khugepaged to say "please swap in memory 
and let me create even more memory pressure". The same argument applies 
to the shared case.

All in all, the essence of the policy is that mTHP collapse should be made 
stricter: the fact that we failed PMD collapse means we can't ask the 
system for a lot of memory just to do an mTHP collapse.
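
As a sketch, the per-range check under this simplified policy could look
like the following (a hypothetical helper, not code from the posted series;
the counts are assumed to come from a scan of the candidate range):

static bool mthp_range_eligible(int nr_none, int nr_swap, int nr_shared,
				int nr_ptes, int max_ptes_none)
{
	if (nr_swap || nr_shared)
		return false;		/* stricter than PMD collapse */
	if (nr_none == nr_ptes)
		return false;		/* nothing mapped, nothing to collapse */
	if (max_ptes_none == 0 && nr_none)
		return false;		/* "deferred" behaviour: fully populated only */
	/* max_ptes_none == 511 behaves like the fault path: any fill is fine */
	return true;
}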

With regards to max_ptes_none, David says:
"If we ever get the request to support something else, maybe that's also 
where we can learn *why*, and what we would actually want to do with mTHP. "

Which sounds very reasonable: we are solving a theoretical problem, and we 
don't have a real-use-case justification for why we should bother solving 
it :)





* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-20  5:17     ` Dev Jain
@ 2025-01-23 20:24       ` Nico Pache
  2025-01-24  7:13         ` Dev Jain
  0 siblings, 1 reply; 53+ messages in thread
From: Nico Pache @ 2025-01-23 20:24 UTC (permalink / raw)
  To: Dev Jain
  Cc: Ryan Roberts, linux-kernel, linux-mm, anshuman.khandual,
	catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
	baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
	peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
	vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy,
	kirill.shutemov, david, aarcange, raquini, sunnanyong,
	usamaarif642, audra, akpm

On Sun, Jan 19, 2025 at 10:18 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> --- snip ---
> >>
> >> Althogh to be honest, it's not super clear to me what the benefit of the bitmap
> >> is vs just iterating through the PTEs like Dev does; is there a significant cost
> >> saving in practice? On the face of it, it seems like it might be uneeded complexity.
> > The bitmap was to encode the state of PMD without needing rescanning
> > (or refactor a lot of code). We keep the scan runtime constant at 512
> > (for x86). Dev did some good analysis for this here
> > https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
>
> I think I swayed away and over-analyzed, and probably did not make my
> main objection clear enough, so let us cut to the chase.
> *Why* is it correct to remember the state of the PMD?
>
> In__collapse_huge_page_isolate(), we check the PTEs against the sysfs
> tunables again, since we dropped the lock. The bitmap thingy which you
> are doing, and in general, any algorithm which tries to remember the
> state of the PMD, violates the entire point of max_ptes_*. Take for
> example: Suppose the PTE table had a lot of shared ptes. After you drop
> the PTL, you do this: scan_bitmap() -> read_unlock() ->
> alloc_charge_folio() -> read_lock() -> read_unlock()....which is a lot
Per your recommendation I dropped the read_lock() -> read_unlock() and
made it a conditional unlock.
> of stuff. Now, you do write_lock(), which means that you need to wait
> for all faulting/forking/mremap/mmap etc to stop. Suppose this process
> forks and then a lot of PTEs become shared. The point of max_ptes_shared
> is to stop the collapse here, since we do not want memory bloat
> (collapse will grab more memory from the buddy and the old memory won't
> be freed because it has a reference from the parent/child).

That's a fair point, but given the other feedback, my current
implementation now requires mTHPs to have no shared/swap, and I've
improved the sysfs tunable interactions for the set_bitmap and the
max_ptes_none check in the _isolate function.

As for *why* remembering the state is correct: it just avoids the need
to rescan.

> Another example would be, a sysadmin does not want too much memory
> wastage from khugepaged, so we decide to set max_ptes_none low. When you
> scan the PTE table you justify the collapse. After you drop the PTL and
> the mmap_lock, a munmap() happens in the region, no longer justifying
> the collapse. If you have a lot of VMAs of size <= 2MB, then any
> munmap() on a VMA will happen on the single PTE table present.
>
> So, IMHO before even jumping on analyzing the bitmap algorithm, we need
> to ask whether any algorithm remembering the state of the PMD is even
> conceptually right.

Both of the issues you raised don't really have to do with the bitmap...
they are fair points, but they are more of a criticism of my sysfs tunable
handling. I've cleaned up the max_ptes_none interactions, and now that we
don't plan to initially support swap/shared, both of these problems are
'gone'.
>
> Then, you have the harder task of proving that your optimization is
> actually an optimization, that it is not turned into being futile
> because of overhead. From a high-level mathematical PoV, you are saving
> iterations. Any mathematical analysis has the underlying assumption that
> every iteration is equal. But the list [pte, pte + 1, ....., pte + (1 <<
> order)] is virtually and physically contiguous in memory so prefetching
> helps us. You are trying to save on pte memory references, but then look
> at the number of bitmap memory references you have created, not to
> mention that you are doing a (costly?) division operation in there, you
> have a while loop, a stack, new structs, and if conditions. I do not see
> how this is any faster than a naive linear scan.

Yeah it's hard to say without real performance testing. I hope to
include some performance results with my next post.

>
> > This prevents needing to hold the read lock for longer, and prevents
> > needing to reacquire it too.
>
> My implementation does not hold the read lock for longer. What you mean
> to say is, I need to reacquire the lock, and this is by design, to
Yes, sorry.
> ensure correctness, which boils down to what I wrote above.
The write lock is what ensures correctness, not the read lock. The read
lock is there to gain insight into potential collapse candidates while
avoiding the cost of the write lock.
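
To make the division of labour concrete, a rough outline of the locking
pattern being discussed (simplified pseudocode, not the literal upstream
flow in mm/khugepaged.c):

static int collapse_outline(struct mm_struct *mm, unsigned long addr)
{
	mmap_read_lock(mm);
	/* cheap scan under the read lock: find a collapse candidate */
	mmap_read_unlock(mm);

	/* allocate and charge the destination folio with no mm lock held */

	mmap_write_lock(mm);
	/*
	 * Revalidate the VMA and re-check the PTEs: this is what actually
	 * guarantees correctness, since everything may have changed while
	 * the locks were dropped.
	 */
	/* ... isolate, copy, install the huge mapping ... */
	mmap_write_unlock(mm);
	return 0;
}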

Cheers!
-- Nico
>




* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-20 12:49     ` Ryan Roberts
@ 2025-01-23 20:42       ` Nico Pache
  0 siblings, 0 replies; 53+ messages in thread
From: Nico Pache @ 2025-01-23 20:42 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: linux-kernel, linux-mm, anshuman.khandual, catalin.marinas, cl,
	vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
	srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
	ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
	zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov,
	david, aarcange, raquini, dev.jain, sunnanyong, usamaarif642,
	audra, akpm

On Mon, Jan 20, 2025 at 5:49 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 16/01/2025 20:53, Nico Pache wrote:
> > On Thu, Jan 16, 2025 at 2:47 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Hi Nico,
> > Hi Ryan!
> >>
> >> On 08/01/2025 23:31, Nico Pache wrote:
> >>> The following series provides khugepaged and madvise collapse with the
> >>> capability to collapse regions to mTHPs.
> >>
> >> It's great to see multiple solutions for this feature being posted; I guess that
> >> leaves us with the luxurious problem of figuring out an uber-patchset that
> >> incorporates the best of both? :)
> > I guess so! My motivation for developing this was inspired by my
> > 'defer' RFC. Which can't really live without khugepaged having mTHP
> > support (ie having 32k mTHP= always and global=defer doesnt make
> > sense).
>
> I'm not sure why that wouldn't make sense? setting global=defer would only be
> picked up for a given size that sets "inherit". So "32k=always, 2m=inherit,
> global=defer" is the same as "32k=always, 2m=defer"; which means you would try
> to allocate 32K directly in the fault handler and defer collapse to 2m to
> khugepaged. I guess where it would get difficult is if you set a size less than
> PMD-size to defer; at the moment khugepaged can't actually do that; it would
> just end up collapsing to 2M? Anyway, I'm rambling... I get your point.

Yeah, looks like you found one of the issues. So defer means no
page-fault-time (m)THPs. The mTHP sysfs controls would need a "defer"
entry; what does it mean to defer globally and have an mTHP size set to
always or inherit? I assume that for global=defer and
mTHP=always/inherit/defer we defer at fault time and khugepaged can
collapse the mTHP, and that for global=always and mTHP=defer we always
allocate a THP at fault time and khugepaged can then scan for (m)THP
collapse.

>
> >>
> >> I haven't had a chance to review your series in detail yet, but have a few
> >> questions below that will help me understand the key differences between your
> >> series and Dev's.
> >>
> >>>
> >>> To achieve this we generalize the khugepaged functions to no longer depend
> >>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> >>> (defined by MTHP_MIN_ORDER) that are fully utilized. This info is tracked
> >>> using a bitmap. After the PMD scan is done, we do binary recursion on the
> >>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> >>> on max_ptes_none is removed during the scan, to make sure we account for
> >>> the whole PMD range. max_ptes_none is mapped to a 0-100 range to
> >>> determine how full a mTHP order needs to be before collapsing it.
> >>>
> >>> Some design choices to note:
> >>>  - bitmap structures are allocated dynamically because on some arch's
> >>>     (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be computed at
> >>>     compile time leading to warnings.
> >>
> >> We have MAX_PTRS_PER_PTE and friends though, which are worst case and compile
> >> time. Could these help avoid the dynamic allocation?
> >>
> >> MAX_PMD_ORDER = ilog2(MAX_PTRS_PER_PTE * PAGE_SIZE)
> > is the MAX_PMD_ORDER = PMD_ORDER? if not this might introduce weird
> > edge cases where PMD_ORDER < MAX_PMD_ORDER.
>
> No, MAX_PMD_ORDER becomes the largest order that could be configured at boot.
> PMD_ORDER is what is actually configured at boot. My understanding was that you
> were dynamically allocating your bitmap based on the runtime value of PMD_ORDER?
> I was just suggesting that you could allocate it statically (on stack or
> whatever) based on MAX_PMD_ORDER, for the worst-case requirement and only
> actually use the portion required by the runtime PMD_ORDER value. It avoids the
> kmalloc call.

I originally had this on the stack, but PMD_ORDER gave me trouble on
ppc. I'll try this approach to get it back on the stack! Thanks!
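
A minimal sketch of the statically sized variant being suggested (the
MTHP_MIN_ORDER value and the MAX_MTHP_BITMAP_BITS name are assumptions for
illustration; only the first PTRS_PER_PTE >> MTHP_MIN_ORDER bits would be
used at runtime):

#define MTHP_MIN_ORDER		3	/* one bit per 8 PTEs, 64 bits on x86 */
#define MAX_MTHP_BITMAP_BITS	(MAX_PTRS_PER_PTE >> MTHP_MIN_ORDER)

struct collapse_scan_state {
	/* compile-time worst case, so it can live on the stack, no kmalloc */
	DECLARE_BITMAP(filled, MAX_MTHP_BITMAP_BITS);
};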

>
> >
> >>
> >> Althogh to be honest, it's not super clear to me what the benefit of the bitmap
> >> is vs just iterating through the PTEs like Dev does; is there a significant cost
> >> saving in practice? On the face of it, it seems like it might be uneeded complexity.
> > The bitmap was to encode the state of PMD without needing rescanning
> > (or refactor a lot of code). We keep the scan runtime constant at 512
> > (for x86). Dev did some good analysis for this here
> > https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
> > This prevents needing to hold the read lock for longer, and prevents
> > needing to reacquire it too.
> >>>>>  - The recursion is masked through a stack structure.
> >>>  - A MTHP_MIN_ORDER was added to compress the bitmap, and ensure it was
> >>>     64bit on x86. This provides some optimization on the bitmap operations.
> >>>     if other arches/configs that have larger than 512 PTEs per PMD want to
> >>>     compress their bitmap further we can change this value per arch.
> >>
> >> So 1 bit in the bitmap represents 8 pages? And it will only be set if all 8
> >> pages are !pte_none()? I'm wondering what will happen if you have a pattern of 4
> >> set PTEs followed by 4 none PTEs, followed by 4 set PTEs... If 16K mTHP is
> >> enabled, you would want to collapse every other 16K block in this case, but I'm
> >> guessing with your scheme, all the bits will be clear and no collapse will
> >> occur? But for arm64 at least, collapsing to order-2 (16K) may be desired for HPA.
> >
> > Yeah on my V2 ive incorporated a threshold (like max_ptes_none) for
> > setting the bit. This will covert this case better (given a better
> > default max_ptes_none).
> > The way i see it 511 max_ptes_none is just wrong...
>
> You mean it's a bad default?
Yeah that's the better phrasing.
>
> > we should flip it
> > towards the lower end of the scale (ie 64), and the "always" THP
> > setting should ignore it (like madvise does).
>
> But user space can already get that behaviour by modifying the tunable, right?
> Isn't that just a user space policy choice?
Technically yes, but shouldn't defaults reflect sane behavior? Of course
this is my opinion; some might think 511 is not a bad default at all. My
perspective comes from the memory-waste issue: 511 can be really good for
performance if you benefit from PMDs, which is why I was also suggesting
that "always" ignore max_ptes_none.
>
> One other thing that occurs to me regarding the bitmap; In the context of Dev's
> series, we have discussed policy for what to do when the source PTEs are backed
> by a large folio already. I'm guessing if you are making your
> smaller-than-PMD-size collapse decisions based solely on the bitmap, you won't
> be able to see when the PTEs are already collpsed for the target order? i.e.
> let's say you already have a 64K folio fully mapped in an aligned way. You
> wouldn't want to "re-collapse" it to 64K. Are you robust to this?

Yes, I am also skipping the order <= folio_order case.

>
> >
> >>
> >>>
> >>> Patch 1-2:  Some refactoring to combine madvise_collapse and khugepaged
> >>> Patch 3:    A minor "fix"/optimization
> >>> Patch 4:    Refactor/rename hpage_collapse
> >>> Patch 5-7:  Generalize khugepaged functions for arbitrary orders
> >>> Patch 8-11: The mTHP patches
> >>>
> >>> This series acts as an alternative to Dev Jain's approach [1]. The two
> >>> series differ in a few ways:
> >>>   - My approach uses a bitmap to store the state of the linear scan_pmd to
> >>>     then determine potential mTHP batches. Devs incorporates his directly
> >>>     into the scan, and will try each available order.
> >>
> >> So if I'm understanding, the benefit of the bitmap is to remove the need to
> >> re-scan the "low" PTEs when moving to a lower order, which is what Dev's
> >> approach does? Are there not some locking/consistency issues to manage if not
> >> re-scanning?
> > Correct, so far i haven't found any issues (other than the bugs Dev
> > reported in his review)-- my fixed version of this RFC has been
> > running fine with no notable locking issues.
> >>
> >>>   - Dev is attempting to optimize the locking, while my approach keeps the
> >>>     locking changes to a minimum. I believe his changes are not safe for
> >>>     uffd.
> >>
> >> I agree; let's keep the locking simple for the initial effort.
> >>
> >>>   - Dev's changes only work for khugepaged not madvise_collapse (although
> >>>     i think that was by choice and it could easily support madvise)
> >>
> >> I agree supporting MADV_COLLAPSE is good; what exactly are the semantics for it
> >> though? I think it ignores the sysfs settings (max_ptes_none and friends) so
> >> presumably it will continue to be much more greedy about collapsing to the
> >> highest possible order and only fall back to lower orders if the VMA boundaries
> >> force it to or if the higher order allocation fails?
> > Kind of, because I removed the max_ptes_none check during the scan,
> > and reintroduced it in the bitmap scan (without a madvise
> > restriction), MADV_COLLAPSE and khugepaged will work more similarly.
> >>
> >>>   - Dev scales all khugepaged sysfs tunables by order, while im removing
> >>>     the restriction of max_ptes_none and converting it to a scale to
> >>>     determine a (m)THP threshold.
> >>
> >> I don't really understand this statement. You say you are removing the
> >> restriction of max_ptes_none. But then you say you scale it to determine a
> >> threshold. So are you honoring it or not? And if you're honouring it, how is
> >> your scaling method different to Dev's? What about the other tunables (shared
> >> and swap)?
> > I removed the max_ptes_none restriction during the initial scan, so we
> > can account for the full PMD (which is what happens with
> > max_ptes_none=511 anyways). Then max_ptes_none can be used with the
> > bitmap to calculate a threshold (max_ptes_none=64 == ~90% full) for
> > finding the optimal mTHP size.
> >
> > This RFC scales max_ptes_none to 0-100, but it has some really bad
> > rounding issues, so instead ive incorporated scaling (via bitshifting)
> > like Dev did in his series. Ive tested this and it's more accurate
> > now.
> >>
> >>>   - Dev turns on khugepaged if any order is available while mine still
> >>>     only runs if PMDs are enabled. I like Dev's approach and will most
> >>>     likely do the same in my PATCH posting.
> >>
> >> Agreed. Also, we will want khugepaged to be able to scan VMAs (or parts of VMAs)
> >> that cover only a partial PMD entry. I think neither of your implementations
> >> currently do that. As I understand it, Dev's v2 will add that support. Is your
> >> approach ammeanable to this?
> >
> > Yes, I believe so. I'm working on adding this too.
> >
> >>
> >>>   - mTHPs need their ref count updated to 1<<order, which Dev is missing.
> >>>
> >>> Patch 11 was inspired by one of Dev's changes.
> >>
> >> I think the 1 problem that emerged during review of Dev's series, which we don't
> >> have a proper solution to yet, is the issue of "creep", where regions can be
> >> collapsed to progressively higher orders through iterative scans. At each
> >> collapse, the required thresholds (e.g. max_ptes_none) are met, and the collapse
> >> effectively adds more non-none ptes so the next scan will then collapse to even
> >> higher order. Does your solution suffer from this (theoretical/edge case) issue?
> >> If not, how did you solve?
> >
> > Yes sadly it suffers from the same issue. bringing max_ptes_none much
> > lower as a default would "help".
> > I liked Zi Yan's solution of a per-VMA bit that gets set when
> > khugepaged collapses, and unset when the VMA changes (pf, realloc,
> > etc).
> > Then khugepaged can only operate on VMAs that dont have the bit set.
> > This way we only collapse once, unless the mapping was changed.
>
> Dev raised the issue in discussion against his series, that currently khugepaged
> doesn't scan the entire VMA, it scans to the first PMD that it collapses then
> moves to another VMA. I guess that's a fairness thing. So a VMA flag won't quite
> do the trick assuming we want to continue with that behavior. Perhaps we could
> keep a "cursor" in the VMA though, which describes the starting address of the
> next scan. We can move it forwards as we scan. And move it backwards when taking
> a fault. Still not perfect, but perhaps good enough?

I started playing around with some of these changes and it seems to work,
but David raised the issue that we can't grow vm_area_struct, so I need to
find a different solution.

>
> >
> > Could we map the new "non-none" pages to the zero page (rather than
> > actually zeroing the page), so they dont actually act as new "utilized
> > pages" and are still counted as none pages during the scan (until they
> > are written to)?
>
> I think you are propsing to use the zero page as a PTE marker to say "this
> region is scheduled for collapse"? In which case, why not just use a PTE
> marker... But you still have to do the collapse at some point (which I guess you
> are now deferring to the next page fault that hits one of those markers)? Once
> you have collapsed, you're still back to the original issue. So I don't think
> it's bought you anything except complexity and more latency :)

Ah OK, I see! Thanks for clarifying.
>
> Thanks,
> Ryan
>
> >
> >>
> >> Thanks,
> >> Ryan
> >
> > Cheers!
> > -- Nico
> >
> >>
> >>
> >>>
> >>> [1] https://lore.kernel.org/lkml/20241216165105.56185-1-dev.jain@arm.com/
> >>>
> >>> Nico Pache (11):
> >>>   introduce khugepaged_collapse_single_pmd to collapse a single pmd
> >>>   khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot
> >>>   khugepaged: Don't allocate khugepaged mm_slot early
> >>>   khugepaged: rename hpage_collapse_* to khugepaged_*
> >>>   khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >>>   khugepaged: generalize alloc_charge_folio for mTHP support
> >>>   khugepaged: generalize __collapse_huge_page_* for mTHP support
> >>>   khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >>>   khugepaged: add mTHP support
> >>>   khugepaged: remove max_ptes_none restriction on the pmd scan
> >>>   khugepaged: skip collapsing mTHP to smaller orders
> >>>
> >>>  include/linux/khugepaged.h |   4 +-
> >>>  mm/huge_memory.c           |   3 +-
> >>>  mm/khugepaged.c            | 436 +++++++++++++++++++++++++------------
> >>>  3 files changed, 306 insertions(+), 137 deletions(-)
> >>>
> >>
> >
>




* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-23 20:24       ` Nico Pache
@ 2025-01-24  7:13         ` Dev Jain
  2025-01-24  7:38           ` Dev Jain
  0 siblings, 1 reply; 53+ messages in thread
From: Dev Jain @ 2025-01-24  7:13 UTC (permalink / raw)
  To: Nico Pache
  Cc: Ryan Roberts, linux-kernel, linux-mm, anshuman.khandual,
	catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
	baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
	peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
	vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy,
	kirill.shutemov, david, aarcange, raquini, sunnanyong,
	usamaarif642, audra, akpm



On 24/01/25 1:54 am, Nico Pache wrote:
> On Sun, Jan 19, 2025 at 10:18 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>>
>> --- snip ---
>>>>
>>>> Althogh to be honest, it's not super clear to me what the benefit of the bitmap
>>>> is vs just iterating through the PTEs like Dev does; is there a significant cost
>>>> saving in practice? On the face of it, it seems like it might be uneeded complexity.
>>> The bitmap was to encode the state of PMD without needing rescanning
>>> (or refactor a lot of code). We keep the scan runtime constant at 512
>>> (for x86). Dev did some good analysis for this here
>>> https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
>>
>> I think I swayed away and over-analyzed, and probably did not make my
>> main objection clear enough, so let us cut to the chase.
>> *Why* is it correct to remember the state of the PMD?
>>
>> In__collapse_huge_page_isolate(), we check the PTEs against the sysfs
>> tunables again, since we dropped the lock. The bitmap thingy which you
>> are doing, and in general, any algorithm which tries to remember the
>> state of the PMD, violates the entire point of max_ptes_*. Take for
>> example: Suppose the PTE table had a lot of shared ptes. After you drop
>> the PTL, you do this: scan_bitmap() -> read_unlock() ->
>> alloc_charge_folio() -> read_lock() -> read_unlock()....which is a lot
> per your recommendation I dropped the read_lock() -> read_unlock() and
> made it a conditional unlock

That's not the one I was talking about here...

>> of stuff. Now, you do write_lock(), which means that you need to wait
>> for all faulting/forking/mremap/mmap etc to stop. Suppose this process
>> forks and then a lot of PTEs become shared. The point of max_ptes_shared
>> is to stop the collapse here, since we do not want memory bloat
>> (collapse will grab more memory from the buddy and the old memory won't
>> be freed because it has a reference from the parent/child).
> 
> That's a fair point, but given the other feedback, my current
> implementation now requires mTHPs to have no shared/swap, and ive
> improved the sysctl interactions for the set_bitmap and the
> max_ptes_none check in the _isolate function.

I am guessing you are following the policy of letting the creep happen 
for none PTEs, and requiring shared and swap to be zero.

> 
> As for *why* remembering the state is correct. It just prevents
> needing to rescan.

That is what I am saying... if collapse_huge_page() fails, then you have 
dropped the mmap write lock, so the state of the PTEs may have changed, 
and you must rescan...

> 
>> Another example would be, a sysadmin does not want too much memory
>> wastage from khugepaged, so we decide to set max_ptes_none low. When you
>> scan the PTE table you justify the collapse. After you drop the PTL and
>> the mmap_lock, a munmap() happens in the region, no longer justifying
>> the collapse. If you have a lot of VMAs of size <= 2MB, then any
>> munmap() on a VMA will happen on the single PTE table present.
>>
>> So, IMHO before even jumping on analyzing the bitmap algorithm, we need
>> to ask whether any algorithm remembering the state of the PMD is even
>> conceptually right.
> 
> Both the issues you raised dont really have to do with the bitmap...

Correct, my issue is with any general algorithm remembering PTE state.

> they are fair points, but they are more of a criticism of my sysctl
> handling. Ive cleaned up the max_ptes_none interactions, and now that
> we dont plan to initially support swap/shared both these problems are
> 'gone'.
>>
>> Then, you have the harder task of proving that your optimization is
>> actually an optimization, that it is not turned into being futile
>> because of overhead. From a high-level mathematical PoV, you are saving
>> iterations. Any mathematical analysis has the underlying assumption that
>> every iteration is equal. But the list [pte, pte + 1, ....., pte + (1 <<
>> order)] is virtually and physically contiguous in memory so prefetching
>> helps us. You are trying to save on pte memory references, but then look
>> at the number of bitmap memory references you have created, not to
>> mention that you are doing a (costly?) division operation in there, you
>> have a while loop, a stack, new structs, and if conditions. I do not see
>> how this is any faster than a naive linear scan.
> 
> Yeah it's hard to say without real performance testing. I hope to
> include some performance results with my next post.
> 
>>
>>> This prevents needing to hold the read lock for longer, and prevents
>>> needing to reacquire it too.
>>
>> My implementation does not hold the read lock for longer. What you mean
>> to say is, I need to reacquire the lock, and this is by design, to
> yes sorry.
>> ensure correctness, which boils down to what I wrote above.
> The write lock is what ensures correctness, not the read lock. The
> read lock is to gain insight of potential collapse candidates while
> avoiding the cost of the write lock.
> 
> Cheers!
> -- Nico
>>
> 




* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-24  7:13         ` Dev Jain
@ 2025-01-24  7:38           ` Dev Jain
  0 siblings, 0 replies; 53+ messages in thread
From: Dev Jain @ 2025-01-24  7:38 UTC (permalink / raw)
  To: Nico Pache
  Cc: Ryan Roberts, linux-kernel, linux-mm, anshuman.khandual,
	catalin.marinas, cl, vbabka, mhocko, apopple, dave.hansen, will,
	baohua, jack, srivatsa, haowenchao22, hughd, aneesh.kumar, yang,
	peterx, ioworker0, wangkefeng.wang, ziy, jglisse, surenb,
	vishal.moola, zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy,
	kirill.shutemov, david, aarcange, raquini, sunnanyong,
	usamaarif642, audra, akpm



On 24/01/25 12:43 pm, Dev Jain wrote:
> 
> 
> On 24/01/25 1:54 am, Nico Pache wrote:
>> On Sun, Jan 19, 2025 at 10:18 PM Dev Jain <dev.jain@arm.com> wrote:
>>>
>>>
>>>
>>> --- snip ---
>>>>>
>>>>> Althogh to be honest, it's not super clear to me what the benefit 
>>>>> of the bitmap
>>>>> is vs just iterating through the PTEs like Dev does; is there a 
>>>>> significant cost
>>>>> saving in practice? On the face of it, it seems like it might be 
>>>>> uneeded complexity.
>>>> The bitmap was to encode the state of PMD without needing rescanning
>>>> (or refactor a lot of code). We keep the scan runtime constant at 512
>>>> (for x86). Dev did some good analysis for this here
>>>> https://lore.kernel.org/lkml/23023f48-95c6-4a24-ac8b- 
>>>> aba4b1a441b4@arm.com/
>>>
>>> I think I swayed away and over-analyzed, and probably did not make my
>>> main objection clear enough, so let us cut to the chase.
>>> *Why* is it correct to remember the state of the PMD?
>>>
>>> In__collapse_huge_page_isolate(), we check the PTEs against the sysfs
>>> tunables again, since we dropped the lock. The bitmap thingy which you
>>> are doing, and in general, any algorithm which tries to remember the
>>> state of the PMD, violates the entire point of max_ptes_*. Take for
>>> example: Suppose the PTE table had a lot of shared ptes. After you drop
>>> the PTL, you do this: scan_bitmap() -> read_unlock() ->
>>> alloc_charge_folio() -> read_lock() -> read_unlock()....which is a lot
>> per your recommendation I dropped the read_lock() -> read_unlock() and
>> made it a conditional unlock
> 
> That's not the one I was talking about here...
> 
>>> of stuff. Now, you do write_lock(), which means that you need to wait
>>> for all faulting/forking/mremap/mmap etc to stop. Suppose this process
>>> forks and then a lot of PTEs become shared. The point of max_ptes_shared
>>> is to stop the collapse here, since we do not want memory bloat
>>> (collapse will grab more memory from the buddy and the old memory won't
>>> be freed because it has a reference from the parent/child).
>>
>> That's a fair point, but given the other feedback, my current
>> implementation now requires mTHPs to have no shared/swap, and ive
>> improved the sysctl interactions for the set_bitmap and the
>> max_ptes_none check in the _isolate function.
> 
> I am guessing you are following the policy of letting the creep happen 
> for none ptes, and assuming shared and swap to be zero.

Ah sorry, I read the thread again and it seems we decided on skipping 
mTHP collapse if max_ptes_none is neither 0 nor 511. In any case, we need 
to scan the range to check whether we have at least one filled (or all 
filled) PTEs, and that none of them are shared or swapped out.

> 
>>
>> As for *why* remembering the state is correct. It just prevents
>> needing to rescan.
> 
> That is what I am saying...if collapse_huge_page() fails, then you have 
> dropped the mmap write lock, so now the state of the PTEs may have 
> changed, so you must rescan...
> 
>>
>>> Another example would be, a sysadmin does not want too much memory
>>> wastage from khugepaged, so we decide to set max_ptes_none low. When you
>>> scan the PTE table you justify the collapse. After you drop the PTL and
>>> the mmap_lock, a munmap() happens in the region, no longer justifying
>>> the collapse. If you have a lot of VMAs of size <= 2MB, then any
>>> munmap() on a VMA will happen on the single PTE table present.
>>>
>>> So, IMHO before even jumping on analyzing the bitmap algorithm, we need
>>> to ask whether any algorithm remembering the state of the PMD is even
>>> conceptually right.
>>
>> Both the issues you raised dont really have to do with the bitmap...
> 
> Correct, my issue is with any general algorithm remembering PTE state.
> 
>> they are fair points, but they are more of a criticism of my sysctl
>> handling. Ive cleaned up the max_ptes_none interactions, and now that
>> we dont plan to initially support swap/shared both these problems are
>> 'gone'.
>>>
>>> Then, you have the harder task of proving that your optimization is
>>> actually an optimization, that it is not turned into being futile
>>> because of overhead. From a high-level mathematical PoV, you are saving
>>> iterations. Any mathematical analysis has the underlying assumption that
>>> every iteration is equal. But the list [pte, pte + 1, ....., pte + (1 <<
>>> order)] is virtually and physically contiguous in memory so prefetching
>>> helps us. You are trying to save on pte memory references, but then look
>>> at the number of bitmap memory references you have created, not to
>>> mention that you are doing a (costly?) division operation in there, you
>>> have a while loop, a stack, new structs, and if conditions. I do not see
>>> how this is any faster than a naive linear scan.
>>
>> Yeah it's hard to say without real performance testing. I hope to
>> include some performance results with my next post.
>>
>>>
>>>> This prevents needing to hold the read lock for longer, and prevents
>>>> needing to reacquire it too.
>>>
>>> My implementation does not hold the read lock for longer. What you mean
>>> to say is, I need to reacquire the lock, and this is by design, to
>> yes sorry.
>>> ensure correctness, which boils down to what I wrote above.
>> The write lock is what ensures correctness, not the read lock. The
>> read lock is to gain insight of potential collapse candidates while
>> avoiding the cost of the write lock.
>>
>> Cheers!
>> -- Nico
>>>
>>
> 
> 




* Re: [RFC 00/11] khugepaged: mTHP support
  2025-01-21 10:19                 ` David Hildenbrand
@ 2025-01-27  9:31                   ` Dev Jain
  0 siblings, 0 replies; 53+ messages in thread
From: Dev Jain @ 2025-01-27  9:31 UTC (permalink / raw)
  To: David Hildenbrand, Ryan Roberts, Nico Pache
  Cc: linux-kernel, linux-mm, anshuman.khandual, catalin.marinas, cl,
	vbabka, mhocko, apopple, dave.hansen, will, baohua, jack,
	srivatsa, haowenchao22, hughd, aneesh.kumar, yang, peterx,
	ioworker0, wangkefeng.wang, ziy, jglisse, surenb, vishal.moola,
	zokeefe, zhengqi.arch, jhubbard, 21cnbao, willy, kirill.shutemov,
	aarcange, raquini, sunnanyong, usamaarif642, audra, akpm



On 21/01/25 3:49 pm, David Hildenbrand wrote:
>> Hmm that's an interesting idea; If I've understood, we would 
>> effectively test
>> the PMD for collapse as if we were collapsing to PMD-size, but then do 
>> the
>> actual collapse to the "highest allowed order" (dictated by what's 
>> enabled +
>> MADV_HUGEPAGE config).
>>
>> I'm not so sure this is a good way to go; there would be no way to 
>> support VMAs
>> (or parts of VMAs) that don't span a full PMD. 
> 
> 
> In Nicos approach to locking, we temporarily have to remove the PTE 
> table either way. While holding the mmap lock in write mode, the VMAs 
> cannot go away, so we could scan the whole PTE table to figure it out.
> 
> To just figure out "none" vs. "non-none" vs. "swap PTE", we'd probably 
> don't need the other VMA information. Figuring out "shared" is trickier, 
> because we have to obtain the folio and would have to walk the other VMAs.
> 
> It's a good question if we would have to VMA-write-lock the other 
> affected VMAs as well in order to temporarily remove the PTE table that 
> crosses multiple VMAs, or if we'd need something different (collapse PMD 
> marker) so the page table walkers could handle that case properly -- 
> keep retrying or fallback to the mmap lock.

I missed this reply; it could have saved me some trouble :) When 
collapsing for VMAs < PMD, we *will* have to write-lock the VMAs, 
write-lock the anon_vmas, and write-lock vma->vm_file->f_mapping for file 
VMAs, otherwise someone may fault on another VMA mapping the same PTE 
table. I was trying to implement this, but I cannot find a clean way: we 
will have to implement it like mm_take_all_locks(), with a similar bit to 
AS_MM_ALL_LOCKS, because if we need to lock all anon_vmas, two VMAs may 
share the same anon_vma, and we cannot get away with the following check:

lock only if !rwsem_is_locked(&vma->anon_vma->root->rwsem)

since I need to skip the lock only when it is khugepaged itself that has 
taken it.
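
For illustration only, a sketch of the kind of dedup that would be needed
instead of the rwsem_is_locked() check -- remember which anon_vma roots
khugepaged itself locked (the helper and the fixed bound are invented; a
real implementation would need something like the mm_take_all_locks() bit
trick):

#define MAX_COLLAPSE_VMAS	4	/* assumed bound, for the sketch only */

static bool lock_unique_anon_vma_roots(struct vm_area_struct **vmas, int n)
{
	struct anon_vma *locked[MAX_COLLAPSE_VMAS];
	int nlocked = 0, i, j;

	for (i = 0; i < n; i++) {
		struct anon_vma *root;
		bool seen = false;

		if (!vmas[i]->anon_vma)
			continue;
		root = vmas[i]->anon_vma->root;
		/* skip only roots *we* already locked, not anyone else's */
		for (j = 0; j < nlocked; j++)
			if (locked[j] == root)
				seen = true;
		if (seen)
			continue;
		if (nlocked == MAX_COLLAPSE_VMAS)
			return false;
		anon_vma_lock_write(vmas[i]->anon_vma);	/* takes root->rwsem */
		locked[nlocked++] = root;
	}
	return true;
}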

I guess the way to go about this, then, is the PMD-marker approach, which 
I am not very familiar with.

> 
>> And I can imagine we might see
>> memory bloat; imagine you have 2M=madvise, 64K=always, 
>> max_ptes_none=511, and
>> let's say we have a 2M (aligned portion of a) VMA that does NOT have
>> MADV_HUGEPAGE set and has a single page populated. It passes the PMD- 
>> size test,
>> but we opt to collapse to 64K (since 2M=madvise). So now we end up 
>> with 32x 64K
>> folios, 31 of which are all zeros. We have spent the same amount of 
>> memory as if
>> 2M=always. Perhaps that's a detail that could be solved by ignoring 
>> fully none
>> 64K blocks when collapsing to 64K...
> 
> Yes, that's what I had in mind. No need to collapse where there is 
> nothing at all ...
> 
>>
>> Personally, I think your "enforce simplicifation of the tunables for mTHP
>> collapse" idea is the best we have so far.
> 
> Right.
> 
>>
>> But I'll just push against your pushback of the per-VMA cursor idea 
>> briefly. It
>> strikes me that this could be useful for khugepaged regardless of mTHP 
>> support.
> 
> Not a clear pushback, as you say to me this is a different optimization 
> and I am missing how it could really solve the problem at hand here.
> 
> Note that we're already fighting with not growing VMAs (see the VMA 
> locking changes under review), but maybe we could still squeeze it in 
> there without requiring a bigger slab.
> 
>> Today, it starts scanning a VMA, collapses the first PMD it finds that 
>> meets the
>> requirements, then switches to scanning another VMA. When it 
>> eventually gets
>> back to scanning the first VMA, it starts from the beginning again. 
>> Wouldn't a
>> cursor help reduce the amount of scanning it has to do?
> 
> Yes, that whole scanning approach sound weird. I would have assumed that 
> it might nowdays be smarter to just scan the MM sequentially, and not 
> jump between VMAs.
> 
> Assume you only have a handfull of large VMAs (like in a VMM), you'd end 
> up scanning the same handful of VMAs over and over again.
> 
> I think a lot of the khugepaged codebase is just full with historical 
> baggage that must be cleaned up and re-validated if it still required ...
> 




end of thread

Thread overview: 53+ messages
2025-01-08 23:31 [RFC 00/11] khugepaged: mTHP support Nico Pache
2025-01-08 23:31 ` [RFC 01/11] introduce khugepaged_collapse_single_pmd to collapse a single pmd Nico Pache
2025-01-10  6:25   ` Dev Jain
2025-01-08 23:31 ` [RFC 02/11] khugepaged: refactor madvise_collapse and khugepaged_scan_mm_slot Nico Pache
2025-01-08 23:31 ` [RFC 03/11] khugepaged: Don't allocate khugepaged mm_slot early Nico Pache
2025-01-10  6:11   ` Dev Jain
2025-01-10 19:37     ` Nico Pache
2025-01-08 23:31 ` [RFC 04/11] khugepaged: rename hpage_collapse_* to khugepaged_* Nico Pache
2025-01-08 23:31 ` [RFC 05/11] khugepaged: generalize hugepage_vma_revalidate for mTHP support Nico Pache
2025-01-08 23:31 ` [RFC 06/11] khugepaged: generalize alloc_charge_folio " Nico Pache
2025-01-10  6:23   ` Dev Jain
2025-01-10 19:41     ` Nico Pache
2025-01-08 23:31 ` [RFC 07/11] khugepaged: generalize __collapse_huge_page_* " Nico Pache
2025-01-10  6:38   ` Dev Jain
2025-01-08 23:31 ` [RFC 08/11] khugepaged: introduce khugepaged_scan_bitmap " Nico Pache
2025-01-10  9:05   ` Dev Jain
2025-01-10 21:48     ` Nico Pache
2025-01-12 11:23       ` Dev Jain
2025-01-13 22:25         ` Nico Pache
2025-01-10 14:54   ` Dev Jain
2025-01-10 21:48     ` Nico Pache
2025-01-12 15:13   ` Dev Jain
2025-01-12 16:41     ` Dev Jain
2025-01-08 23:31 ` [RFC 09/11] khugepaged: add " Nico Pache
2025-01-10  9:20   ` Dev Jain
2025-01-10 13:36   ` Dev Jain
2025-01-08 23:31 ` [RFC 10/11] khugepaged: remove max_ptes_none restriction on the pmd scan Nico Pache
2025-01-08 23:31 ` [RFC 11/11] khugepaged: skip collapsing mTHP to smaller orders Nico Pache
2025-01-09  6:22 ` [RFC 00/11] khugepaged: mTHP support Dev Jain
2025-01-10  2:27   ` Nico Pache
2025-01-10  4:56     ` Dev Jain
2025-01-10 22:01       ` Nico Pache
2025-01-12 14:11         ` Dev Jain
2025-01-13 23:00           ` Nico Pache
2025-01-09  6:27 ` Dev Jain
2025-01-10  1:28   ` Nico Pache
2025-01-16  9:47 ` Ryan Roberts
2025-01-16 20:53   ` Nico Pache
2025-01-20  5:17     ` Dev Jain
2025-01-23 20:24       ` Nico Pache
2025-01-24  7:13         ` Dev Jain
2025-01-24  7:38           ` Dev Jain
2025-01-20 12:49     ` Ryan Roberts
2025-01-23 20:42       ` Nico Pache
2025-01-20 12:54     ` David Hildenbrand
2025-01-20 13:37       ` Ryan Roberts
2025-01-20 13:56         ` David Hildenbrand
2025-01-20 16:27           ` Ryan Roberts
2025-01-20 18:39             ` David Hildenbrand
2025-01-21  9:48               ` Ryan Roberts
2025-01-21 10:19                 ` David Hildenbrand
2025-01-27  9:31                   ` Dev Jain
2025-01-22  5:18                 ` Dev Jain
