* [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
@ 2025-01-03 17:24 Zi Yan
2025-01-03 17:24 ` [RFC PATCH 1/5] mm: separate move/undo doing on folio list from migrate_pages_batch() Zi Yan
` (8 more replies)
0 siblings, 9 replies; 30+ messages in thread
From: Zi Yan @ 2025-01-03 17:24 UTC (permalink / raw)
To: linux-mm
Cc: David Rientjes, Shivank Garg, Aneesh Kumar, David Hildenbrand,
John Hubbard, Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
Lei Chen, Shukla, Santosh, Grimm, Jon, sj, shy828301,
Liam Howlett, Gregory Price, Huang, Ying, Zi Yan
Hi all,
This patchset accelerates page migration by batching folio copy operations and
using multiple CPU threads. It is based on Shivank's Enhancements to Page
Migration with Batch Offloading via DMA patchset[1] and my original accelerate
page migration patchset[2], and applies on top of mm-everything-2025-01-03-05-59.
The last patch is for testing purposes only and should not be considered for
inclusion.
The motivations are:
1. Batching folio copy increases copy throughput. Especially for base page
migrations, folio copy throughput is low, since kernel activities like moving
folio metadata and updating page table entries sit between any two folio
copies. And base page sizes are relatively small: 4KB on x86_64 and either
4KB or 64KB on ARM64.
2. A single CPU thread has limited copy throughput. Using multiple threads is
a natural extension to speed up folio copy when a DMA engine is NOT
available in a system.
Design
===
This patchset is based on Shivank's patchset and revises MIGRATE_SYNC_NO_COPY
(renamed to MIGRATE_NO_COPY) to skip the folio copy operation inside
migrate_folio_move() and perform it in one shot afterwards. A
copy_page_lists_mt() function is added to copy folios from the src list to
the dst list using multiple threads.
Changes compared to Shivank's patchset (mainly a rewrite of the batched
folio copy code)
===
1. mig_info is removed, so no memory allocation is needed during batched
folio copies. src->private is used to store the old page state and anon_vma
after the folio metadata has been copied from src to dst.
2. move_to_new_folio() and migrate_folio_move() are refactored to remove
redundant code in migrate_folios_batch_move().
3. folio_mc_copy() is used in the single-threaded copy path to keep the
original kernel behavior.
Performance
===
I benchmarked move_pages() throughput on a two-socket NUMA system with two
NVIDIA Grace CPUs. The base page size is 64KB. Both 64KB page migration and
2MB mTHP migration are measured.
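For reference, the measurement boils down to timing a single move_pages() call
over a batch of pages. A minimal userspace sketch (not the exact harness used
for the numbers below; the destination node and page count are assumptions) is
shown here and builds with -lnuma:

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	const int nr_pages = 1024;		/* pages moved per call */
	const long page_size = sysconf(_SC_PAGESIZE);
	void **pages = calloc(nr_pages, sizeof(void *));
	int *nodes = calloc(nr_pages, sizeof(int));
	int *status = calloc(nr_pages, sizeof(int));
	char *buf = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct timespec t0, t1;
	long rc;
	int i;

	if (buf == MAP_FAILED)
		return 1;

	for (i = 0; i < nr_pages; i++) {
		pages[i] = buf + i * page_size;
		nodes[i] = 1;			/* assumed destination node */
		memset(pages[i], i, page_size);	/* fault the page in locally */
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	rc = move_pages(0, nr_pages, pages, nodes, status, MPOL_MF_MOVE);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	if (rc < 0)
		perror("move_pages");

	double secs = (t1.tv_sec - t0.tv_sec) +
		      (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.2f GB/s\n", nr_pages * page_size / secs / 1e9);
	return 0;
}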
The tables below show move_pages() throughput with different configurations
and different numbers of copied pages. The leftmost column is the number of
64KB pages or 2MB mTHPs moved per call; the remaining columns are the
configurations, from the vanilla Linux kernel to 1, 2, 4, 8, 16, and 32 copy
threads with this patchset applied. The unit is GB/s.
The 32-thread copy throughput can be up to 10x that of single-threaded serial
folio copy. Batching folio copy benefits not only huge pages but also base
pages.
64KB (GB/s):
vanilla mt_1 mt_2 mt_4 mt_8 mt_16 mt_32
32 5.43 4.90 5.65 7.31 7.60 8.61 6.43
256 6.95 6.89 9.28 14.67 22.41 23.39 23.93
512 7.88 7.26 10.15 17.53 27.82 27.88 33.93
768 7.65 7.42 10.46 18.59 28.65 29.67 30.76
1024 7.46 8.01 10.90 17.77 27.04 32.18 38.80
2MB mTHP (GB/s):
vanilla mt_1 mt_2 mt_4 mt_8 mt_16 mt_32
1 5.94 2.90 6.90 8.56 11.16 8.76 6.41
2 7.67 5.57 7.11 12.48 17.37 15.68 14.10
4 8.01 6.04 10.25 20.14 22.52 27.79 25.28
8 8.42 7.00 11.41 24.73 33.96 32.62 39.55
16 9.41 6.91 12.23 27.51 43.95 49.15 51.38
32 10.23 7.15 13.03 29.52 49.49 69.98 71.51
64 9.40 7.37 13.88 30.38 52.00 76.89 79.41
128 8.59 7.23 14.20 28.39 49.98 78.27 90.18
256 8.43 7.16 14.59 28.14 48.78 76.88 92.28
512 8.31 7.78 14.40 26.20 43.31 63.91 75.21
768 8.30 7.86 14.83 27.41 46.25 69.85 81.31
1024 8.31 7.90 14.96 27.62 46.75 71.76 83.84
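For example, for 128 2MB mTHPs the 32-thread configuration reaches 90.18 GB/s
versus 8.59 GB/s on the vanilla kernel, roughly a 10x speedup, while for 1024
64KB pages the throughput goes from 7.46 GB/s to 38.80 GB/s, about 5x.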
TODOs
===
1. The multi-threaded folio copy routine needs to look at the CPU scheduler
and only use idle CPUs to avoid interfering with userspace workloads. Of
course, more complicated policies can be used based on the priority of the
thread issuing the migration.
2. Eliminate memory allocation during the multi-threaded folio copy routine
if possible.
3. A runtime check to decide when to use multi-threaded folio copy, something
like the cache hotness issue mentioned by Matthew[3].
4. Use non-temporal CPU instructions to avoid cache pollution issues.
5. Explicitly make multi-threaded folio copy available only to !HIGHMEM
configurations, since kmap_local_page() would be needed in each kernel folio
copy work thread and is expensive.
6. A better interface than copy_page_lists_mt() to allow DMA data copy to be
used as well; one possible shape is sketched below.
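For item 6, one possible shape (purely hypothetical; nothing like it exists in
this series, and the names are made up) is a pluggable copy backend that both
the CPU work-queue path and a DMA engine driver could implement:

#include <linux/list.h>

/*
 * Hypothetical interface sketch for TODO 6: the batched move path would
 * call the registered backend instead of copy_page_lists_mt() directly,
 * so a DMA engine driver could supply its own copy_folio_list().
 */
struct folio_copy_backend {
	const char *name;
	int (*copy_folio_list)(struct list_head *dst_folios,
			       struct list_head *src_folios, int nr_folios);
};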
Let me know your thoughts. Thanks.
[1] https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@amd.com/
[2] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
[3] https://lore.kernel.org/linux-mm/Zm0SWZKcRrngCUUW@casper.infradead.org/
Byungchul Park (1):
mm: separate move/undo doing on folio list from migrate_pages_batch()
Zi Yan (4):
mm/migrate: factor out code in move_to_new_folio() and
migrate_folio_move()
mm/migrate: add migrate_folios_batch_move to batch the folio move
operations
mm/migrate: introduce multi-threaded page copy routine
test: add sysctl for folio copy tests and adjust
NR_MAX_BATCHED_MIGRATION
include/linux/migrate.h | 3 +
include/linux/migrate_mode.h | 2 +
include/linux/mm.h | 4 +
include/linux/sysctl.h | 1 +
kernel/sysctl.c | 29 ++-
mm/Makefile | 2 +-
mm/copy_pages.c | 190 +++++++++++++++
mm/migrate.c | 443 +++++++++++++++++++++++++++--------
8 files changed, 577 insertions(+), 97 deletions(-)
create mode 100644 mm/copy_pages.c
--
2.45.2
^ permalink raw reply [flat|nested] 30+ messages in thread
* [RFC PATCH 1/5] mm: separate move/undo doing on folio list from migrate_pages_batch()
2025-01-03 17:24 [RFC PATCH 0/5] Accelerate page migration with batching and multi threads Zi Yan
@ 2025-01-03 17:24 ` Zi Yan
2025-01-03 17:24 ` [RFC PATCH 2/5] mm/migrate: factor out code in move_to_new_folio() and migrate_folio_move() Zi Yan
` (7 subsequent siblings)
8 siblings, 0 replies; 30+ messages in thread
From: Zi Yan @ 2025-01-03 17:24 UTC (permalink / raw)
To: linux-mm
Cc: David Rientjes, Shivank Garg, Aneesh Kumar, David Hildenbrand,
John Hubbard, Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
Lei Chen, Shukla, Santosh, Grimm, Jon, sj, shy828301,
Liam Howlett, Gregory Price, Huang, Ying, Byungchul Park
From: Byungchul Park <byungchul@sk.com>
Functionally, no change. This is a preparatory patch picked from the luf
(lazy unmap flush) patch series. It improves code organization and
readability for the steps involving migrate_folio_move().
Refactor migrate_pages_batch() by separating out the move and undo parts
that operate on the folio lists.
Signed-off-by: Byungchul Park <byungchul@sk.com>
Signed-off-by: Shivank Garg <shivankg@amd.com>
---
mm/migrate.c | 134 +++++++++++++++++++++++++++++++--------------------
1 file changed, 83 insertions(+), 51 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index caadbe393aa2..df1b615c8114 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1690,6 +1690,81 @@ static int migrate_hugetlbs(struct list_head *from, new_folio_t get_new_folio,
return nr_failed;
}
+static void migrate_folios_move(struct list_head *src_folios,
+ struct list_head *dst_folios,
+ free_folio_t put_new_folio, unsigned long private,
+ enum migrate_mode mode, int reason,
+ struct list_head *ret_folios,
+ struct migrate_pages_stats *stats,
+ int *retry, int *thp_retry, int *nr_failed,
+ int *nr_retry_pages)
+{
+ struct folio *folio, *folio2, *dst, *dst2;
+ bool is_thp;
+ int nr_pages;
+ int rc;
+
+ dst = list_first_entry(dst_folios, struct folio, lru);
+ dst2 = list_next_entry(dst, lru);
+ list_for_each_entry_safe(folio, folio2, src_folios, lru) {
+ is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
+ nr_pages = folio_nr_pages(folio);
+
+ cond_resched();
+
+ rc = migrate_folio_move(put_new_folio, private,
+ folio, dst, mode,
+ reason, ret_folios);
+ /*
+ * The rules are:
+ * Success: folio will be freed
+ * -EAGAIN: stay on the unmap_folios list
+ * Other errno: put on ret_folios list
+ */
+ switch (rc) {
+ case -EAGAIN:
+ *retry += 1;
+ *thp_retry += is_thp;
+ *nr_retry_pages += nr_pages;
+ break;
+ case MIGRATEPAGE_SUCCESS:
+ stats->nr_succeeded += nr_pages;
+ stats->nr_thp_succeeded += is_thp;
+ break;
+ default:
+ *nr_failed += 1;
+ stats->nr_thp_failed += is_thp;
+ stats->nr_failed_pages += nr_pages;
+ break;
+ }
+ dst = dst2;
+ dst2 = list_next_entry(dst, lru);
+ }
+}
+
+static void migrate_folios_undo(struct list_head *src_folios,
+ struct list_head *dst_folios,
+ free_folio_t put_new_folio, unsigned long private,
+ struct list_head *ret_folios)
+{
+ struct folio *folio, *folio2, *dst, *dst2;
+
+ dst = list_first_entry(dst_folios, struct folio, lru);
+ dst2 = list_next_entry(dst, lru);
+ list_for_each_entry_safe(folio, folio2, src_folios, lru) {
+ int old_page_state = 0;
+ struct anon_vma *anon_vma = NULL;
+
+ __migrate_folio_extract(dst, &old_page_state, &anon_vma);
+ migrate_folio_undo_src(folio, old_page_state & PAGE_WAS_MAPPED,
+ anon_vma, true, ret_folios);
+ list_del(&dst->lru);
+ migrate_folio_undo_dst(dst, true, put_new_folio, private);
+ dst = dst2;
+ dst2 = list_next_entry(dst, lru);
+ }
+}
+
/*
* migrate_pages_batch() first unmaps folios in the from list as many as
* possible, then move the unmapped folios.
@@ -1712,7 +1787,7 @@ static int migrate_pages_batch(struct list_head *from,
int pass = 0;
bool is_thp = false;
bool is_large = false;
- struct folio *folio, *folio2, *dst = NULL, *dst2;
+ struct folio *folio, *folio2, *dst = NULL;
int rc, rc_saved = 0, nr_pages;
LIST_HEAD(unmap_folios);
LIST_HEAD(dst_folios);
@@ -1883,42 +1958,11 @@ static int migrate_pages_batch(struct list_head *from,
thp_retry = 0;
nr_retry_pages = 0;
- dst = list_first_entry(&dst_folios, struct folio, lru);
- dst2 = list_next_entry(dst, lru);
- list_for_each_entry_safe(folio, folio2, &unmap_folios, lru) {
- is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
- nr_pages = folio_nr_pages(folio);
-
- cond_resched();
-
- rc = migrate_folio_move(put_new_folio, private,
- folio, dst, mode,
- reason, ret_folios);
- /*
- * The rules are:
- * Success: folio will be freed
- * -EAGAIN: stay on the unmap_folios list
- * Other errno: put on ret_folios list
- */
- switch(rc) {
- case -EAGAIN:
- retry++;
- thp_retry += is_thp;
- nr_retry_pages += nr_pages;
- break;
- case MIGRATEPAGE_SUCCESS:
- stats->nr_succeeded += nr_pages;
- stats->nr_thp_succeeded += is_thp;
- break;
- default:
- nr_failed++;
- stats->nr_thp_failed += is_thp;
- stats->nr_failed_pages += nr_pages;
- break;
- }
- dst = dst2;
- dst2 = list_next_entry(dst, lru);
- }
+ /* Move the unmapped folios */
+ migrate_folios_move(&unmap_folios, &dst_folios,
+ put_new_folio, private, mode, reason,
+ ret_folios, stats, &retry, &thp_retry,
+ &nr_failed, &nr_retry_pages);
}
nr_failed += retry;
stats->nr_thp_failed += thp_retry;
@@ -1927,20 +1971,8 @@ static int migrate_pages_batch(struct list_head *from,
rc = rc_saved ? : nr_failed;
out:
/* Cleanup remaining folios */
- dst = list_first_entry(&dst_folios, struct folio, lru);
- dst2 = list_next_entry(dst, lru);
- list_for_each_entry_safe(folio, folio2, &unmap_folios, lru) {
- int old_page_state = 0;
- struct anon_vma *anon_vma = NULL;
-
- __migrate_folio_extract(dst, &old_page_state, &anon_vma);
- migrate_folio_undo_src(folio, old_page_state & PAGE_WAS_MAPPED,
- anon_vma, true, ret_folios);
- list_del(&dst->lru);
- migrate_folio_undo_dst(dst, true, put_new_folio, private);
- dst = dst2;
- dst2 = list_next_entry(dst, lru);
- }
+ migrate_folios_undo(&unmap_folios, &dst_folios,
+ put_new_folio, private, ret_folios);
return rc;
}
--
2.45.2
^ permalink raw reply [flat|nested] 30+ messages in thread
* [RFC PATCH 2/5] mm/migrate: factor out code in move_to_new_folio() and migrate_folio_move()
2025-01-03 17:24 [RFC PATCH 0/5] Accelerate page migration with batching and multi threads Zi Yan
2025-01-03 17:24 ` [RFC PATCH 1/5] mm: separate move/undo doing on folio list from migrate_pages_batch() Zi Yan
@ 2025-01-03 17:24 ` Zi Yan
2025-01-03 17:24 ` [RFC PATCH 3/5] mm/migrate: add migrate_folios_batch_move to batch the folio move operations Zi Yan
` (6 subsequent siblings)
8 siblings, 0 replies; 30+ messages in thread
From: Zi Yan @ 2025-01-03 17:24 UTC (permalink / raw)
To: linux-mm
Cc: David Rientjes, Shivank Garg, Aneesh Kumar, David Hildenbrand,
John Hubbard, Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
Lei Chen, Shukla, Santosh, Grimm, Jon, sj, shy828301,
Liam Howlett, Gregory Price, Huang, Ying, Zi Yan
No functional change is intended. The factored-out code will be reused in
an upcoming batched folio move function.
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
mm/migrate.c | 101 +++++++++++++++++++++++++++++++++------------------
1 file changed, 65 insertions(+), 36 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index df1b615c8114..a83508f94c57 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1014,18 +1014,7 @@ static int fallback_migrate_folio(struct address_space *mapping,
return migrate_folio(mapping, dst, src, mode);
}
-/*
- * Move a page to a newly allocated page
- * The page is locked and all ptes have been successfully removed.
- *
- * The new page will have replaced the old page if this function
- * is successful.
- *
- * Return value:
- * < 0 - error code
- * MIGRATEPAGE_SUCCESS - success
- */
-static int move_to_new_folio(struct folio *dst, struct folio *src,
+static int _move_to_new_folio_prep(struct folio *dst, struct folio *src,
enum migrate_mode mode)
{
int rc = -EAGAIN;
@@ -1072,7 +1061,13 @@ static int move_to_new_folio(struct folio *dst, struct folio *src,
WARN_ON_ONCE(rc == MIGRATEPAGE_SUCCESS &&
!folio_test_isolated(src));
}
+out:
+ return rc;
+}
+static void _move_to_new_folio_finalize(struct folio *dst, struct folio *src,
+ int rc)
+{
/*
* When successful, old pagecache src->mapping must be cleared before
* src is freed; but stats require that PageAnon be left as PageAnon.
@@ -1099,7 +1094,29 @@ static int move_to_new_folio(struct folio *dst, struct folio *src,
if (likely(!folio_is_zone_device(dst)))
flush_dcache_folio(dst);
}
-out:
+}
+
+
+/*
+ * Move a page to a newly allocated page
+ * The page is locked and all ptes have been successfully removed.
+ *
+ * The new page will have replaced the old page if this function
+ * is successful.
+ *
+ * Return value:
+ * < 0 - error code
+ * MIGRATEPAGE_SUCCESS - success
+ */
+static int move_to_new_folio(struct folio *dst, struct folio *src,
+ enum migrate_mode mode)
+{
+ int rc;
+
+ rc = _move_to_new_folio_prep(dst, src, mode);
+
+ _move_to_new_folio_finalize(dst, src, rc);
+
return rc;
}
@@ -1344,29 +1361,9 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
return rc;
}
-/* Migrate the folio to the newly allocated folio in dst. */
-static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
- struct folio *src, struct folio *dst,
- enum migrate_mode mode, enum migrate_reason reason,
- struct list_head *ret)
+static void _migrate_folio_move_finalize1(struct folio *src, struct folio *dst,
+ int old_page_state)
{
- int rc;
- int old_page_state = 0;
- struct anon_vma *anon_vma = NULL;
- bool is_lru = !__folio_test_movable(src);
- struct list_head *prev;
-
- __migrate_folio_extract(dst, &old_page_state, &anon_vma);
- prev = dst->lru.prev;
- list_del(&dst->lru);
-
- rc = move_to_new_folio(dst, src, mode);
- if (rc)
- goto out;
-
- if (unlikely(!is_lru))
- goto out_unlock_both;
-
/*
* When successful, push dst to LRU immediately: so that if it
* turns out to be an mlocked page, remove_migration_ptes() will
@@ -1382,8 +1379,12 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
if (old_page_state & PAGE_WAS_MAPPED)
remove_migration_ptes(src, dst, 0);
+}
-out_unlock_both:
+static void _migrate_folio_move_finalize2(struct folio *src, struct folio *dst,
+ enum migrate_reason reason,
+ struct anon_vma *anon_vma)
+{
folio_unlock(dst);
set_page_owner_migrate_reason(&dst->page, reason);
/*
@@ -1403,6 +1404,34 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
put_anon_vma(anon_vma);
folio_unlock(src);
migrate_folio_done(src, reason);
+}
+
+/* Migrate the folio to the newly allocated folio in dst. */
+static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
+ struct folio *src, struct folio *dst,
+ enum migrate_mode mode, enum migrate_reason reason,
+ struct list_head *ret)
+{
+ int rc;
+ int old_page_state = 0;
+ struct anon_vma *anon_vma = NULL;
+ bool is_lru = !__folio_test_movable(src);
+ struct list_head *prev;
+
+ __migrate_folio_extract(dst, &old_page_state, &anon_vma);
+ prev = dst->lru.prev;
+ list_del(&dst->lru);
+
+ rc = move_to_new_folio(dst, src, mode);
+ if (rc)
+ goto out;
+
+ if (unlikely(!is_lru))
+ goto out_unlock_both;
+
+ _migrate_folio_move_finalize1(src, dst, old_page_state);
+out_unlock_both:
+ _migrate_folio_move_finalize2(src, dst, reason, anon_vma);
return rc;
out:
--
2.45.2
^ permalink raw reply [flat|nested] 30+ messages in thread
* [RFC PATCH 3/5] mm/migrate: add migrate_folios_batch_move to batch the folio move operations
2025-01-03 17:24 [RFC PATCH 0/5] Accelerate page migration with batching and multi threads Zi Yan
2025-01-03 17:24 ` [RFC PATCH 1/5] mm: separate move/undo doing on folio list from migrate_pages_batch() Zi Yan
2025-01-03 17:24 ` [RFC PATCH 2/5] mm/migrate: factor out code in move_to_new_folio() and migrate_folio_move() Zi Yan
@ 2025-01-03 17:24 ` Zi Yan
2025-01-09 11:47 ` Shivank Garg
2025-01-03 17:24 ` [RFC PATCH 4/5] mm/migrate: introduce multi-threaded page copy routine Zi Yan
` (5 subsequent siblings)
8 siblings, 1 reply; 30+ messages in thread
From: Zi Yan @ 2025-01-03 17:24 UTC (permalink / raw)
To: linux-mm
Cc: David Rientjes, Shivank Garg, Aneesh Kumar, David Hildenbrand,
John Hubbard, Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
Lei Chen, Shukla, Santosh, Grimm, Jon, sj, shy828301,
Liam Howlett, Gregory Price, Huang, Ying, Zi Yan
This is a preparatory patch that enables batch copying for folios
undergoing migration. By batch copying the folio contents, we can
efficiently utilize the capabilities of DMA hardware or multi-threaded
folio copy. It also adds MIGRATE_NO_COPY back to migrate_mode, so that
the folio copy is skipped during the metadata copy process and performed
later in a batch.
Currently, the folio move operation is performed individually for each
folio, in a sequential manner:
for_each_folio() {
	Copy folio metadata like flags and mappings
	Copy the folio content from src to dst
	Update page tables with dst folio
}
With this patch, we transition to a batch processing approach as shown
below:
for_each_folio() {
	Copy folio metadata like flags and mappings
}
Batch copy all src folios to dst
for_each_folio() {
	Update page tables with dst folios
}
dst->private is used to store the page state and a possible anon_vma value,
and thus needs to be cleared during the metadata copy process. To avoid
additional memory allocation to hold this data during the batch copy process,
src->private is used to store it after the metadata copy, since src is no
longer used at that point.
Originally-by: Shivank Garg <shivankg@amd.com>
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
include/linux/migrate_mode.h | 2 +
mm/migrate.c | 207 +++++++++++++++++++++++++++++++++--
2 files changed, 201 insertions(+), 8 deletions(-)
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index 265c4328b36a..9af6c949a057 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -7,11 +7,13 @@
* on most operations but not ->writepage as the potential stall time
* is too significant
* MIGRATE_SYNC will block when migrating pages
+ * MIGRATE_NO_COPY will not copy page content
*/
enum migrate_mode {
MIGRATE_ASYNC,
MIGRATE_SYNC_LIGHT,
MIGRATE_SYNC,
+ MIGRATE_NO_COPY,
};
enum migrate_reason {
diff --git a/mm/migrate.c b/mm/migrate.c
index a83508f94c57..95c4cc4a7823 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -51,6 +51,7 @@
#include "internal.h"
+
bool isolate_movable_page(struct page *page, isolate_mode_t mode)
{
struct folio *folio = folio_get_nontail_page(page);
@@ -752,14 +753,19 @@ static int __migrate_folio(struct address_space *mapping, struct folio *dst,
enum migrate_mode mode)
{
int rc, expected_count = folio_expected_refs(mapping, src);
+ unsigned long dst_private = (unsigned long)dst->private;
/* Check whether src does not have extra refs before we do more work */
if (folio_ref_count(src) != expected_count)
return -EAGAIN;
- rc = folio_mc_copy(dst, src);
- if (unlikely(rc))
- return rc;
+ if (mode == MIGRATE_NO_COPY)
+ dst->private = NULL;
+ else {
+ rc = folio_mc_copy(dst, src);
+ if (unlikely(rc))
+ return rc;
+ }
rc = __folio_migrate_mapping(mapping, dst, src, expected_count);
if (rc != MIGRATEPAGE_SUCCESS)
@@ -769,6 +775,10 @@ static int __migrate_folio(struct address_space *mapping, struct folio *dst,
folio_attach_private(dst, folio_detach_private(src));
folio_migrate_flags(dst, src);
+
+ if (mode == MIGRATE_NO_COPY)
+ src->private = (void *)dst_private;
+
return MIGRATEPAGE_SUCCESS;
}
@@ -1042,7 +1052,7 @@ static int _move_to_new_folio_prep(struct folio *dst, struct folio *src,
mode);
else
rc = fallback_migrate_folio(mapping, dst, src, mode);
- } else {
+ } else if (mode != MIGRATE_NO_COPY) {
const struct movable_operations *mops;
/*
@@ -1060,7 +1070,8 @@ static int _move_to_new_folio_prep(struct folio *dst, struct folio *src,
rc = mops->migrate_page(&dst->page, &src->page, mode);
WARN_ON_ONCE(rc == MIGRATEPAGE_SUCCESS &&
!folio_test_isolated(src));
- }
+ } else
+ rc = -EAGAIN;
out:
return rc;
}
@@ -1138,7 +1149,7 @@ static void __migrate_folio_record(struct folio *dst,
dst->private = (void *)anon_vma + old_page_state;
}
-static void __migrate_folio_extract(struct folio *dst,
+static void __migrate_folio_read(struct folio *dst,
int *old_page_state,
struct anon_vma **anon_vmap)
{
@@ -1146,6 +1157,13 @@ static void __migrate_folio_extract(struct folio *dst,
*anon_vmap = (struct anon_vma *)(private & ~PAGE_OLD_STATES);
*old_page_state = private & PAGE_OLD_STATES;
+}
+
+static void __migrate_folio_extract(struct folio *dst,
+ int *old_page_state,
+ struct anon_vma **anon_vmap)
+{
+ __migrate_folio_read(dst, old_page_state, anon_vmap);
dst->private = NULL;
}
@@ -1771,6 +1789,174 @@ static void migrate_folios_move(struct list_head *src_folios,
}
}
+static void migrate_folios_batch_move(struct list_head *src_folios,
+ struct list_head *dst_folios,
+ free_folio_t put_new_folio, unsigned long private,
+ enum migrate_mode mode, int reason,
+ struct list_head *ret_folios,
+ struct migrate_pages_stats *stats,
+ int *retry, int *thp_retry, int *nr_failed,
+ int *nr_retry_pages)
+{
+ struct folio *folio, *folio2, *dst, *dst2;
+ int rc, nr_pages = 0, nr_mig_folios = 0;
+ int old_page_state = 0;
+ struct anon_vma *anon_vma = NULL;
+ bool is_lru;
+ int is_thp = 0;
+ LIST_HEAD(err_src);
+ LIST_HEAD(err_dst);
+
+ if (mode != MIGRATE_ASYNC) {
+ *retry += 1;
+ return;
+ }
+
+ /*
+ * Iterate over the list of locked src/dst folios to copy the metadata
+ */
+ dst = list_first_entry(dst_folios, struct folio, lru);
+ dst2 = list_next_entry(dst, lru);
+ list_for_each_entry_safe(folio, folio2, src_folios, lru) {
+ is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
+ nr_pages = folio_nr_pages(folio);
+ is_lru = !__folio_test_movable(folio);
+
+ /*
+ * dst->private is not cleared here. It is cleared and moved to
+ * src->private in __migrate_folio().
+ */
+ __migrate_folio_read(dst, &old_page_state, &anon_vma);
+
+ /*
+ * Use MIGRATE_NO_COPY mode in migrate_folio family functions
+ * to copy the flags, mapping and some other ancillary information.
+ * This does everything except the page copy. The actual page copy
+ * is handled later in a batch manner.
+ */
+ rc = _move_to_new_folio_prep(dst, folio, MIGRATE_NO_COPY);
+
+ /*
+ * -EAGAIN: Move src/dst folios to tmp lists for retry
+ * Other Errno: Put src folio on ret_folios list, remove the dst folio
+ * Success: Copy the folio bytes, restoring working pte, unlock and
+ * decrement refcounter
+ */
+ if (rc == -EAGAIN) {
+ *retry += 1;
+ *thp_retry += is_thp;
+ *nr_retry_pages += nr_pages;
+
+ list_move_tail(&folio->lru, &err_src);
+ list_move_tail(&dst->lru, &err_dst);
+ __migrate_folio_record(dst, old_page_state, anon_vma);
+ } else if (rc != MIGRATEPAGE_SUCCESS) {
+ *nr_failed += 1;
+ stats->nr_thp_failed += is_thp;
+ stats->nr_failed_pages += nr_pages;
+
+ list_del(&dst->lru);
+ migrate_folio_undo_src(folio, old_page_state & PAGE_WAS_MAPPED,
+ anon_vma, true, ret_folios);
+ migrate_folio_undo_dst(dst, true, put_new_folio, private);
+ } else /* MIGRATEPAGE_SUCCESS */
+ nr_mig_folios++;
+
+ dst = dst2;
+ dst2 = list_next_entry(dst, lru);
+ }
+
+ /* Exit if folio list for batch migration is empty */
+ if (!nr_mig_folios)
+ goto out;
+
+ /* Batch copy the folios */
+ {
+ dst = list_first_entry(dst_folios, struct folio, lru);
+ dst2 = list_next_entry(dst, lru);
+ list_for_each_entry_safe(folio, folio2, src_folios, lru) {
+ is_thp = folio_test_large(folio) &&
+ folio_test_pmd_mappable(folio);
+ nr_pages = folio_nr_pages(folio);
+ rc = folio_mc_copy(dst, folio);
+
+ if (rc) {
+ int old_page_state = 0;
+ struct anon_vma *anon_vma = NULL;
+
+ /*
+ * dst->private is moved to src->private in
+ * __migrate_folio(), so page state and anon_vma
+ * values can be extracted from (src) folio.
+ */
+ __migrate_folio_extract(folio, &old_page_state,
+ &anon_vma);
+ migrate_folio_undo_src(folio,
+ old_page_state & PAGE_WAS_MAPPED,
+ anon_vma, true, ret_folios);
+ list_del(&dst->lru);
+ migrate_folio_undo_dst(dst, true, put_new_folio,
+ private);
+ }
+
+ switch (rc) {
+ case MIGRATEPAGE_SUCCESS:
+ stats->nr_succeeded += nr_pages;
+ stats->nr_thp_succeeded += is_thp;
+ break;
+ default:
+ *nr_failed += 1;
+ stats->nr_thp_failed += is_thp;
+ stats->nr_failed_pages += nr_pages;
+ break;
+ }
+
+ dst = dst2;
+ dst2 = list_next_entry(dst, lru);
+ }
+ }
+
+ /*
+ * Iterate the folio lists to remove migration pte and restore them
+ * as working pte. Unlock the folios, add/remove them to LRU lists (if
+ * applicable) and release the src folios.
+ */
+ dst = list_first_entry(dst_folios, struct folio, lru);
+ dst2 = list_next_entry(dst, lru);
+ list_for_each_entry_safe(folio, folio2, src_folios, lru) {
+ is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
+ nr_pages = folio_nr_pages(folio);
+ /*
+ * dst->private is moved to src->private in __migrate_folio(),
+ * so page state and anon_vma values can be extracted from
+ * (src) folio.
+ */
+ __migrate_folio_extract(folio, &old_page_state, &anon_vma);
+ list_del(&dst->lru);
+
+ _move_to_new_folio_finalize(dst, folio, MIGRATEPAGE_SUCCESS);
+
+ /*
+ * Below few steps are only applicable for lru pages which is
+ * ensured as we have removed the non-lru pages from our list.
+ */
+ _migrate_folio_move_finalize1(folio, dst, old_page_state);
+
+ _migrate_folio_move_finalize2(folio, dst, reason, anon_vma);
+
+ /* Page migration successful, increase stat counter */
+ stats->nr_succeeded += nr_pages;
+ stats->nr_thp_succeeded += is_thp;
+
+ dst = dst2;
+ dst2 = list_next_entry(dst, lru);
+ }
+out:
+ /* Add tmp folios back to the list to let CPU re-attempt migration. */
+ list_splice(&err_src, src_folios);
+ list_splice(&err_dst, dst_folios);
+}
+
static void migrate_folios_undo(struct list_head *src_folios,
struct list_head *dst_folios,
free_folio_t put_new_folio, unsigned long private,
@@ -1981,13 +2167,18 @@ static int migrate_pages_batch(struct list_head *from,
/* Flush TLBs for all unmapped folios */
try_to_unmap_flush();
- retry = 1;
+ retry = 0;
+ /* Batch move the unmapped folios */
+ migrate_folios_batch_move(&unmap_folios, &dst_folios, put_new_folio,
+ private, mode, reason, ret_folios, stats, &retry,
+ &thp_retry, &nr_failed, &nr_retry_pages);
+
for (pass = 0; pass < nr_pass && retry; pass++) {
retry = 0;
thp_retry = 0;
nr_retry_pages = 0;
- /* Move the unmapped folios */
+ /* Move the remaining unmapped folios */
migrate_folios_move(&unmap_folios, &dst_folios,
put_new_folio, private, mode, reason,
ret_folios, stats, &retry, &thp_retry,
--
2.45.2
^ permalink raw reply [flat|nested] 30+ messages in thread
* [RFC PATCH 4/5] mm/migrate: introduce multi-threaded page copy routine
2025-01-03 17:24 [RFC PATCH 0/5] Accelerate page migration with batching and multi threads Zi Yan
` (2 preceding siblings ...)
2025-01-03 17:24 ` [RFC PATCH 3/5] mm/migrate: add migrate_folios_batch_move to batch the folio move operations Zi Yan
@ 2025-01-03 17:24 ` Zi Yan
2025-01-06 1:18 ` Hyeonggon Yoo
2025-02-13 12:44 ` Byungchul Park
2025-01-03 17:24 ` [RFC PATCH 5/5] test: add sysctl for folio copy tests and adjust NR_MAX_BATCHED_MIGRATION Zi Yan
` (4 subsequent siblings)
8 siblings, 2 replies; 30+ messages in thread
From: Zi Yan @ 2025-01-03 17:24 UTC (permalink / raw)
To: linux-mm
Cc: David Rientjes, Shivank Garg, Aneesh Kumar, David Hildenbrand,
John Hubbard, Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
Lei Chen, Shukla, Santosh, Grimm, Jon, sj, shy828301,
Liam Howlett, Gregory Price, Huang, Ying, Zi Yan
Now that page copies are batched, multi-threaded page copy can be used to
increase page copy throughput. Add copy_page_lists_mt() to copy pages in a
multi-threaded manner. Empirical data show that more than 32 base pages are
needed to see the benefit of multi-threaded page copy, so use 32 as the
threshold.
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
include/linux/migrate.h | 3 +
mm/Makefile | 2 +-
mm/copy_pages.c | 186 ++++++++++++++++++++++++++++++++++++++++
mm/migrate.c | 19 ++--
4 files changed, 199 insertions(+), 11 deletions(-)
create mode 100644 mm/copy_pages.c
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 29919faea2f1..a0124f4893b0 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -80,6 +80,9 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio);
int folio_migrate_mapping(struct address_space *mapping,
struct folio *newfolio, struct folio *folio, int extra_count);
+int copy_page_lists_mt(struct list_head *dst_folios,
+ struct list_head *src_folios, int nr_items);
+
#else
static inline void putback_movable_pages(struct list_head *l) {}
diff --git a/mm/Makefile b/mm/Makefile
index 850386a67b3e..f8c7f6b4cebb 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -92,7 +92,7 @@ obj-$(CONFIG_KMSAN) += kmsan/
obj-$(CONFIG_FAILSLAB) += failslab.o
obj-$(CONFIG_FAIL_PAGE_ALLOC) += fail_page_alloc.o
obj-$(CONFIG_MEMTEST) += memtest.o
-obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_MIGRATION) += migrate.o copy_pages.o
obj-$(CONFIG_NUMA) += memory-tiers.o
obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
diff --git a/mm/copy_pages.c b/mm/copy_pages.c
new file mode 100644
index 000000000000..0e2231199f66
--- /dev/null
+++ b/mm/copy_pages.c
@@ -0,0 +1,186 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Parallel page copy routine.
+ */
+
+#include <linux/sysctl.h>
+#include <linux/highmem.h>
+#include <linux/workqueue.h>
+#include <linux/slab.h>
+#include <linux/migrate.h>
+
+
+unsigned int limit_mt_num = 4;
+
+struct copy_item {
+ char *to;
+ char *from;
+ unsigned long chunk_size;
+};
+
+struct copy_page_info {
+ struct work_struct copy_page_work;
+ unsigned long num_items;
+ struct copy_item item_list[];
+};
+
+static void copy_page_routine(char *vto, char *vfrom,
+ unsigned long chunk_size)
+{
+ memcpy(vto, vfrom, chunk_size);
+}
+
+static void copy_page_work_queue_thread(struct work_struct *work)
+{
+ struct copy_page_info *my_work = (struct copy_page_info *)work;
+ int i;
+
+ for (i = 0; i < my_work->num_items; ++i)
+ copy_page_routine(my_work->item_list[i].to,
+ my_work->item_list[i].from,
+ my_work->item_list[i].chunk_size);
+}
+
+int copy_page_lists_mt(struct list_head *dst_folios,
+ struct list_head *src_folios, int nr_items)
+{
+ int err = 0;
+ unsigned int total_mt_num = limit_mt_num;
+ int to_node = folio_nid(list_first_entry(dst_folios, struct folio, lru));
+ int i;
+ struct copy_page_info *work_items[32] = {0};
+ const struct cpumask *per_node_cpumask = cpumask_of_node(to_node);
+ int cpu_id_list[32] = {0};
+ int cpu;
+ int max_items_per_thread;
+ int item_idx;
+ struct folio *src, *src2, *dst, *dst2;
+
+ total_mt_num = min_t(unsigned int, total_mt_num,
+ cpumask_weight(per_node_cpumask));
+
+ if (total_mt_num > 32)
+ total_mt_num = 32;
+
+	/* Each thread gets part of each page, if nr_items < total_mt_num */
+ if (nr_items < total_mt_num)
+ max_items_per_thread = nr_items;
+ else
+ max_items_per_thread = (nr_items / total_mt_num) +
+ ((nr_items % total_mt_num) ? 1 : 0);
+
+
+ for (cpu = 0; cpu < total_mt_num; ++cpu) {
+ work_items[cpu] = kzalloc(sizeof(struct copy_page_info) +
+ sizeof(struct copy_item) * max_items_per_thread,
+ GFP_NOWAIT);
+ if (!work_items[cpu]) {
+ err = -ENOMEM;
+ goto free_work_items;
+ }
+ }
+
+ i = 0;
+ /* TODO: need a better cpu selection method */
+ for_each_cpu(cpu, per_node_cpumask) {
+ if (i >= total_mt_num)
+ break;
+ cpu_id_list[i] = cpu;
+ ++i;
+ }
+
+ if (nr_items < total_mt_num) {
+ for (cpu = 0; cpu < total_mt_num; ++cpu) {
+ INIT_WORK((struct work_struct *)work_items[cpu],
+ copy_page_work_queue_thread);
+ work_items[cpu]->num_items = max_items_per_thread;
+ }
+
+ item_idx = 0;
+ dst = list_first_entry(dst_folios, struct folio, lru);
+ dst2 = list_next_entry(dst, lru);
+ list_for_each_entry_safe(src, src2, src_folios, lru) {
+ unsigned long chunk_size = PAGE_SIZE * folio_nr_pages(src) / total_mt_num;
+ /* XXX: not working in HIGHMEM */
+ char *vfrom = page_address(&src->page);
+ char *vto = page_address(&dst->page);
+
+ VM_WARN_ON(PAGE_SIZE * folio_nr_pages(src) % total_mt_num);
+ VM_WARN_ON(folio_nr_pages(dst) != folio_nr_pages(src));
+
+ for (cpu = 0; cpu < total_mt_num; ++cpu) {
+ work_items[cpu]->item_list[item_idx].to =
+ vto + chunk_size * cpu;
+ work_items[cpu]->item_list[item_idx].from =
+ vfrom + chunk_size * cpu;
+ work_items[cpu]->item_list[item_idx].chunk_size =
+ chunk_size;
+ }
+
+ item_idx++;
+ dst = dst2;
+ dst2 = list_next_entry(dst, lru);
+ }
+
+ for (cpu = 0; cpu < total_mt_num; ++cpu)
+ queue_work_on(cpu_id_list[cpu],
+ system_unbound_wq,
+ (struct work_struct *)work_items[cpu]);
+ } else {
+ int num_xfer_per_thread = nr_items / total_mt_num;
+ int per_cpu_item_idx;
+
+
+ for (cpu = 0; cpu < total_mt_num; ++cpu) {
+ INIT_WORK((struct work_struct *)work_items[cpu],
+ copy_page_work_queue_thread);
+
+ work_items[cpu]->num_items = num_xfer_per_thread +
+ (cpu < (nr_items % total_mt_num));
+ }
+
+ cpu = 0;
+ per_cpu_item_idx = 0;
+ item_idx = 0;
+ dst = list_first_entry(dst_folios, struct folio, lru);
+ dst2 = list_next_entry(dst, lru);
+ list_for_each_entry_safe(src, src2, src_folios, lru) {
+ /* XXX: not working in HIGHMEM */
+ work_items[cpu]->item_list[per_cpu_item_idx].to =
+ page_address(&dst->page);
+ work_items[cpu]->item_list[per_cpu_item_idx].from =
+ page_address(&src->page);
+ work_items[cpu]->item_list[per_cpu_item_idx].chunk_size =
+ PAGE_SIZE * folio_nr_pages(src);
+
+ VM_WARN_ON(folio_nr_pages(dst) !=
+ folio_nr_pages(src));
+
+ per_cpu_item_idx++;
+ item_idx++;
+ dst = dst2;
+ dst2 = list_next_entry(dst, lru);
+
+ if (per_cpu_item_idx == work_items[cpu]->num_items) {
+ queue_work_on(cpu_id_list[cpu],
+ system_unbound_wq,
+ (struct work_struct *)work_items[cpu]);
+ per_cpu_item_idx = 0;
+ cpu++;
+ }
+ }
+ if (item_idx != nr_items)
+ pr_warn("%s: only %d out of %d pages are transferred\n",
+ __func__, item_idx - 1, nr_items);
+ }
+
+ /* Wait until it finishes */
+ for (i = 0; i < total_mt_num; ++i)
+ flush_work((struct work_struct *)work_items[i]);
+
+free_work_items:
+ for (cpu = 0; cpu < total_mt_num; ++cpu)
+ kfree(work_items[cpu]);
+
+ return err;
+}
diff --git a/mm/migrate.c b/mm/migrate.c
index 95c4cc4a7823..18440180d747 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1799,7 +1799,7 @@ static void migrate_folios_batch_move(struct list_head *src_folios,
int *nr_retry_pages)
{
struct folio *folio, *folio2, *dst, *dst2;
- int rc, nr_pages = 0, nr_mig_folios = 0;
+ int rc, nr_pages = 0, total_nr_pages = 0, total_nr_folios = 0;
int old_page_state = 0;
struct anon_vma *anon_vma = NULL;
bool is_lru;
@@ -1807,11 +1807,6 @@ static void migrate_folios_batch_move(struct list_head *src_folios,
LIST_HEAD(err_src);
LIST_HEAD(err_dst);
- if (mode != MIGRATE_ASYNC) {
- *retry += 1;
- return;
- }
-
/*
* Iterate over the list of locked src/dst folios to copy the metadata
*/
@@ -1859,19 +1854,23 @@ static void migrate_folios_batch_move(struct list_head *src_folios,
migrate_folio_undo_src(folio, old_page_state & PAGE_WAS_MAPPED,
anon_vma, true, ret_folios);
migrate_folio_undo_dst(dst, true, put_new_folio, private);
- } else /* MIGRATEPAGE_SUCCESS */
- nr_mig_folios++;
+ } else { /* MIGRATEPAGE_SUCCESS */
+ total_nr_pages += nr_pages;
+ total_nr_folios++;
+ }
dst = dst2;
dst2 = list_next_entry(dst, lru);
}
/* Exit if folio list for batch migration is empty */
- if (!nr_mig_folios)
+ if (!total_nr_pages)
goto out;
/* Batch copy the folios */
- {
+ if (total_nr_pages > 32) {
+ copy_page_lists_mt(dst_folios, src_folios, total_nr_folios);
+ } else {
dst = list_first_entry(dst_folios, struct folio, lru);
dst2 = list_next_entry(dst, lru);
list_for_each_entry_safe(folio, folio2, src_folios, lru) {
--
2.45.2
^ permalink raw reply [flat|nested] 30+ messages in thread
* [RFC PATCH 5/5] test: add sysctl for folio copy tests and adjust NR_MAX_BATCHED_MIGRATION
2025-01-03 17:24 [RFC PATCH 0/5] Accelerate page migration with batching and multi threads Zi Yan
` (3 preceding siblings ...)
2025-01-03 17:24 ` [RFC PATCH 4/5] mm/migrate: introduce multi-threaded page copy routine Zi Yan
@ 2025-01-03 17:24 ` Zi Yan
2025-01-03 22:21 ` Gregory Price
2025-01-03 19:17 ` [RFC PATCH 0/5] Accelerate page migration with batching and multi threads Gregory Price
` (3 subsequent siblings)
8 siblings, 1 reply; 30+ messages in thread
From: Zi Yan @ 2025-01-03 17:24 UTC (permalink / raw)
To: linux-mm
Cc: David Rientjes, Shivank Garg, Aneesh Kumar, David Hildenbrand,
John Hubbard, Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
Lei Chen, Shukla, Santosh, Grimm, Jon, sj, shy828301,
Liam Howlett, Gregory Price, Huang, Ying, Zi Yan
1. enable multi-threaded copy
2. specify how many CPU threads to use
3. push from local CPUs or pull from remote CPUs
4. change NR_MAX_BATCHED_MIGRATION to HPAGE_PUD_NR to allow batching THP
copies.
These are for testing purposes only.
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
include/linux/mm.h | 4 ++++
include/linux/sysctl.h | 1 +
kernel/sysctl.c | 29 ++++++++++++++++++++++++++++-
mm/copy_pages.c | 10 +++++++---
mm/migrate.c | 6 ++++--
5 files changed, 44 insertions(+), 6 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1a11f9df5c2d..277b12b9ef0d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -42,6 +42,10 @@ struct pt_regs;
struct folio_batch;
extern int sysctl_page_lock_unfairness;
+extern int sysctl_use_mt_copy;
+extern unsigned int sysctl_limit_mt_num;
+extern unsigned int sysctl_push_0_pull_1;
+
void mm_core_init(void);
void init_mm_internals(void);
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 40a6ac6c9713..f33dafea2533 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -52,6 +52,7 @@ struct ctl_dir;
/* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
#define SYSCTL_MAXOLDUID ((void *)&sysctl_vals[10])
#define SYSCTL_NEG_ONE ((void *)&sysctl_vals[11])
+#define SYSCTL_32 ((void *)&sysctl_vals[12])
extern const int sysctl_vals[];
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 5c9202cb8f59..f9ba48cd6e09 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -82,7 +82,7 @@
#endif
/* shared constants to be used in various sysctls */
-const int sysctl_vals[] = { 0, 1, 2, 3, 4, 100, 200, 1000, 3000, INT_MAX, 65535, -1 };
+const int sysctl_vals[] = { 0, 1, 2, 3, 4, 100, 200, 1000, 3000, INT_MAX, 65535, -1, 32 };
EXPORT_SYMBOL(sysctl_vals);
const unsigned long sysctl_long_vals[] = { 0, 1, LONG_MAX };
@@ -2091,6 +2091,33 @@ static struct ctl_table vm_table[] = {
.extra2 = SYSCTL_ONE,
},
#endif
+ {
+ .procname = "use_mt_copy",
+ .data = &use_mt_copy,
+ .maxlen = sizeof(use_mt_copy),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
+ {
+ .procname = "limit_mt_num",
+ .data = &limit_mt_num,
+ .maxlen = sizeof(limit_mt_num),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ONE,
+ .extra2 = SYSCTL_32,
+ },
+ {
+ .procname = "push_0_pull_1",
+ .data = &push_0_pull_1,
+ .maxlen = sizeof(push_0_pull_1),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
{
.procname = "drop_caches",
.data = &sysctl_drop_caches,
diff --git a/mm/copy_pages.c b/mm/copy_pages.c
index 0e2231199f66..257034550c86 100644
--- a/mm/copy_pages.c
+++ b/mm/copy_pages.c
@@ -10,7 +10,9 @@
#include <linux/migrate.h>
-unsigned int limit_mt_num = 4;
+unsigned int sysctl_limit_mt_num = 4;
+/* push by default */
+unsigned int sysctl_push_0_pull_1;
struct copy_item {
char *to;
@@ -45,11 +47,13 @@ int copy_page_lists_mt(struct list_head *dst_folios,
struct list_head *src_folios, int nr_items)
{
int err = 0;
- unsigned int total_mt_num = limit_mt_num;
+ unsigned int total_mt_num = sysctl_limit_mt_num;
int to_node = folio_nid(list_first_entry(dst_folios, struct folio, lru));
+ int from_node = folio_nid(list_first_entry(src_folios, struct folio, lru));
int i;
struct copy_page_info *work_items[32] = {0};
- const struct cpumask *per_node_cpumask = cpumask_of_node(to_node);
+ const struct cpumask *per_node_cpumask =
+ cpumask_of_node(sysctl_push_0_pull_1 ? to_node : from_node);
int cpu_id_list[32] = {0};
int cpu;
int max_items_per_thread;
diff --git a/mm/migrate.c b/mm/migrate.c
index 18440180d747..0f7a4b09acda 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -51,6 +51,7 @@
#include "internal.h"
+int sysctl_use_mt_copy;
bool isolate_movable_page(struct page *page, isolate_mode_t mode)
{
@@ -1621,7 +1622,7 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
}
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define NR_MAX_BATCHED_MIGRATION HPAGE_PMD_NR
+#define NR_MAX_BATCHED_MIGRATION HPAGE_PUD_NR
#else
#define NR_MAX_BATCHED_MIGRATION 512
#endif
@@ -1868,7 +1869,8 @@ static void migrate_folios_batch_move(struct list_head *src_folios,
goto out;
/* Batch copy the folios */
- if (total_nr_pages > 32) {
+ /* if (total_nr_pages > 32) { */
+ if (sysctl_use_mt_copy) {
copy_page_lists_mt(dst_folios, src_folios, total_nr_folios);
} else {
dst = list_first_entry(dst_folios, struct folio, lru);
--
2.45.2
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
2025-01-03 17:24 [RFC PATCH 0/5] Accelerate page migration with batching and multi threads Zi Yan
` (4 preceding siblings ...)
2025-01-03 17:24 ` [RFC PATCH 5/5] test: add sysctl for folio copy tests and adjust NR_MAX_BATCHED_MIGRATION Zi Yan
@ 2025-01-03 19:17 ` Gregory Price
2025-01-03 19:32 ` Zi Yan
2025-01-03 22:09 ` Yang Shi
` (2 subsequent siblings)
8 siblings, 1 reply; 30+ messages in thread
From: Gregory Price @ 2025-01-03 19:17 UTC (permalink / raw)
To: Zi Yan
Cc: linux-mm, David Rientjes, Shivank Garg, Aneesh Kumar,
David Hildenbrand, John Hubbard, Kirill Shutemov, Matthew Wilcox,
Mel Gorman, Rao, Bharata Bhasker, Rik van Riel, RaghavendraKT,
Wei Xu, Suyeon Lee, Lei Chen, Shukla, Santosh, Grimm, Jon, sj,
shy828301, Liam Howlett, Gregory Price, Huang, Ying
On Fri, Jan 03, 2025 at 12:24:14PM -0500, Zi Yan wrote:
> Hi all,
>
> This patchset accelerates page migration by batching folio copy operations and
> using multiple CPU threads and is based on Shivank's Enhancements to Page
> Migration with Batch Offloading via DMA patchset[1] and my original accelerate
> page migration patchset[2]. It is on top of mm-everything-2025-01-03-05-59.
> The last patch is for testing purpose and should not be considered.
>
This is well timed as I've been testing a batch-migration variant of
migrate_misplaced_folio for my pagecache promotion work (attached).
I will add this to my pagecache branch and give it a test at some point.
Quick question: is the multi-threaded movement supported in the context
of task_work? i.e. in which context is the multi-threaded path
safe/unsafe? (inline in a syscall, async only, etc).
~Gregory
---
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 9438cc7c2aeb..17baf63964c0 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -146,6 +146,9 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
struct vm_area_struct *vma, int node);
int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
int node);
+int migrate_misplaced_folio_batch(struct list_head *foliolist,
+ struct vm_area_struct *vma,
+ int node);
#else
static inline int migrate_misplaced_folio_prepare(struct folio *folio,
struct vm_area_struct *vma, int node)
@@ -157,6 +160,12 @@ static inline int migrate_misplaced_folio(struct folio *folio,
{
return -EAGAIN; /* can't migrate now */
}
+int migrate_misplaced_folio_batch(struct list_head *foliolist,
+ struct vm_area_struct *vma,
+ int node)
+{
+ return -EAGAIN; /* can't migrate now */
+}
#endif /* CONFIG_NUMA_BALANCING */
#ifdef CONFIG_MIGRATION
diff --git a/mm/migrate.c b/mm/migrate.c
index 459f396f7bc1..454fd93c4cc7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2608,5 +2608,27 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
BUG_ON(!list_empty(&migratepages));
return nr_remaining ? -EAGAIN : 0;
}
+
+int migrate_misplaced_folio_batch(struct list_head *folio_list,
+ struct vm_area_struct *vma,
+ int node)
+{
+ pg_data_t *pgdat = NODE_DATA(node);
+ unsigned int nr_succeeded;
+ int nr_remaining;
+
+ nr_remaining = migrate_pages(folio_list, alloc_misplaced_dst_folio,
+ NULL, node, MIGRATE_ASYNC,
+ MR_NUMA_MISPLACED, &nr_succeeded);
+ if (nr_remaining)
+ putback_movable_pages(folio_list);
+
+ if (nr_succeeded) {
+ count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
+ mod_node_page_state(pgdat, PGPROMOTE_SUCCESS, nr_succeeded);
+ }
+ BUG_ON(!list_empty(folio_list));
+ return nr_remaining ? -EAGAIN : 0;
+}
#endif /* CONFIG_NUMA_BALANCING */
#endif /* CONFIG_NUMA */
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
2025-01-03 19:17 ` [RFC PATCH 0/5] Accelerate page migration with batching and multi threads Gregory Price
@ 2025-01-03 19:32 ` Zi Yan
0 siblings, 0 replies; 30+ messages in thread
From: Zi Yan @ 2025-01-03 19:32 UTC (permalink / raw)
To: Gregory Price
Cc: linux-mm, David Rientjes, Shivank Garg, Aneesh Kumar,
David Hildenbrand, John Hubbard, Kirill Shutemov, Matthew Wilcox,
Mel Gorman, Rao, Bharata Bhasker, Rik van Riel, RaghavendraKT,
Wei Xu, Suyeon Lee, Lei Chen, Shukla, Santosh, Grimm, Jon, sj,
shy828301, Liam Howlett, Gregory Price, Huang, Ying
On 3 Jan 2025, at 14:17, Gregory Price wrote:
> On Fri, Jan 03, 2025 at 12:24:14PM -0500, Zi Yan wrote:
>> Hi all,
>>
>> This patchset accelerates page migration by batching folio copy operations and
>> using multiple CPU threads and is based on Shivank's Enhancements to Page
>> Migration with Batch Offloading via DMA patchset[1] and my original accelerate
>> page migration patchset[2]. It is on top of mm-everything-2025-01-03-05-59.
>> The last patch is for testing purpose and should not be considered.
>>
>
> This is well timed as I've been testing a batch-migration variant of
> migrate_misplaced_folio for my pagecache promotion work (attached).
>
> I will add this to my pagecache branch and give it a test at some point.
Great. Thanks.
>
> Quick question: is the multi-threaded movement supported in the context
> of task_work? i.e. in which context is the multi-threaded path
> safe/unsafe? (inline in a syscall, async only, etc).
It should work in any context, like syscalls, memory compaction, and so on,
since it just distributes the memcpy work to different CPUs using a workqueue.
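A minimal sketch of that pattern (simplified; not the exact code in patch 4,
which splits folios into per-CPU chunks and items) is just per-CPU work items
queued on system_unbound_wq:

#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/workqueue.h>

struct copy_work {
	struct work_struct work;
	void *to;
	const void *from;
	size_t len;
};

static void copy_work_fn(struct work_struct *work)
{
	struct copy_work *cw = container_of(work, struct copy_work, work);

	memcpy(cw->to, cw->from, cw->len);
}

/* Queue one chunk per selected CPU, then wait for all of them to finish. */
static void copy_chunks_mt(struct copy_work *cw, int nr, const int *cpus)
{
	int i;

	for (i = 0; i < nr; i++) {
		INIT_WORK(&cw[i].work, copy_work_fn);
		queue_work_on(cpus[i], system_unbound_wq, &cw[i].work);
	}
	for (i = 0; i < nr; i++)
		flush_work(&cw[i].work);
}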
>
> ~Gregory
>
> ---
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 9438cc7c2aeb..17baf63964c0 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -146,6 +146,9 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
> struct vm_area_struct *vma, int node);
> int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
> int node);
> +int migrate_misplaced_folio_batch(struct list_head *foliolist,
> + struct vm_area_struct *vma,
> + int node);
> #else
> static inline int migrate_misplaced_folio_prepare(struct folio *folio,
> struct vm_area_struct *vma, int node)
> @@ -157,6 +160,12 @@ static inline int migrate_misplaced_folio(struct folio *folio,
> {
> return -EAGAIN; /* can't migrate now */
> }
> +int migrate_misplaced_folio_batch(struct list_head *foliolist,
> + struct vm_area_struct *vma,
> + int node)
> +{
> + return -EAGAIN; /* can't migrate now */
> +}
> #endif /* CONFIG_NUMA_BALANCING */
>
> #ifdef CONFIG_MIGRATION
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 459f396f7bc1..454fd93c4cc7 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2608,5 +2608,27 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
> BUG_ON(!list_empty(&migratepages));
> return nr_remaining ? -EAGAIN : 0;
> }
> +
> +int migrate_misplaced_folio_batch(struct list_head *folio_list,
> + struct vm_area_struct *vma,
> + int node)
> +{
> + pg_data_t *pgdat = NODE_DATA(node);
> + unsigned int nr_succeeded;
> + int nr_remaining;
> +
> + nr_remaining = migrate_pages(folio_list, alloc_misplaced_dst_folio,
> + NULL, node, MIGRATE_ASYNC,
> + MR_NUMA_MISPLACED, &nr_succeeded);
> + if (nr_remaining)
> + putback_movable_pages(folio_list);
> +
> + if (nr_succeeded) {
> + count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
> + mod_node_page_state(pgdat, PGPROMOTE_SUCCESS, nr_succeeded);
> + }
> + BUG_ON(!list_empty(folio_list));
> + return nr_remaining ? -EAGAIN : 0;
> +}
> #endif /* CONFIG_NUMA_BALANCING */
> #endif /* CONFIG_NUMA */
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
2025-01-03 17:24 [RFC PATCH 0/5] Accelerate page migration with batching and multi threads Zi Yan
` (5 preceding siblings ...)
2025-01-03 19:17 ` [RFC PATCH 0/5] Accelerate page migration with batching and multi threads Gregory Price
@ 2025-01-03 22:09 ` Yang Shi
2025-01-06 2:33 ` Zi Yan
2025-01-09 11:47 ` Shivank Garg
2025-02-13 8:17 ` Byungchul Park
8 siblings, 1 reply; 30+ messages in thread
From: Yang Shi @ 2025-01-03 22:09 UTC (permalink / raw)
To: Zi Yan
Cc: linux-mm, David Rientjes, Shivank Garg, Aneesh Kumar,
David Hildenbrand, John Hubbard, Kirill Shutemov, Matthew Wilcox,
Mel Gorman, Rao, Bharata Bhasker, Rik van Riel, RaghavendraKT,
Wei Xu, Suyeon Lee, Lei Chen, Shukla, Santosh, Grimm, Jon, sj,
Liam Howlett, Gregory Price, Huang, Ying
On Fri, Jan 3, 2025 at 9:24 AM Zi Yan <ziy@nvidia.com> wrote:
>
> Hi all,
>
> This patchset accelerates page migration by batching folio copy operations and
> using multiple CPU threads and is based on Shivank's Enhancements to Page
> Migration with Batch Offloading via DMA patchset[1] and my original accelerate
> page migration patchset[2]. It is on top of mm-everything-2025-01-03-05-59.
> The last patch is for testing purpose and should not be considered.
>
> The motivations are:
>
> 1. Batching folio copy increases copy throughput. Especially for base page
> migrations, folio copy throughput is low since there are kernel activities like
> moving folio metadata and updating page table entries sit between two folio
> copies. And base page sizes are relatively small, 4KB on x86_64, ARM64
> and 64KB on ARM64.
>
> 2. Single CPU thread has limited copy throughput. Using multi threads is
> a natural extension to speed up folio copy, when DMA engine is NOT
> available in a system.
>
>
> Design
> ===
>
> It is based on Shivank's patchset and revise MIGRATE_SYNC_NO_COPY
> (renamed to MIGRATE_NO_COPY) to avoid folio copy operation inside
> migrate_folio_move() and perform them in one shot afterwards. A
> copy_page_lists_mt() function is added to use multi threads to copy
> folios from src list to dst list.
>
> Changes compared to Shivank's patchset (mainly rewrote batching folio
> copy code)
> ===
>
> 1. mig_info is removed, so no memory allocation is needed during
> batching folio copies. src->private is used to store old page state and
> anon_vma after folio metadata is copied from src to dst.
>
> 2. move_to_new_folio() and migrate_folio_move() are refactored to remove
> redundant code in migrate_folios_batch_move().
>
> 3. folio_mc_copy() is used for the single threaded copy code to keep the
> original kernel behavior.
>
>
> Performance
> ===
>
> I benchmarked move_pages() throughput on a two socket NUMA system with two
> NVIDIA Grace CPUs. The base page size is 64KB. Both 64KB page migration and 2MB
> mTHP page migration are measured.
>
> The tables below show move_pages() throughput with different
> configurations and different numbers of copied pages. The x-axis is the
> configurations, from vanilla Linux kernel to using 1, 2, 4, 8, 16, 32
> threads with this patchset applied. And the unit is GB/s.
>
> The 32-thread copy throughput can be up to 10x of single thread serial folio
> copy. Batching folio copy not only benefits huge page but also base
> page.
>
> 64KB (GB/s):
>
> vanilla mt_1 mt_2 mt_4 mt_8 mt_16 mt_32
> 32 5.43 4.90 5.65 7.31 7.60 8.61 6.43
> 256 6.95 6.89 9.28 14.67 22.41 23.39 23.93
> 512 7.88 7.26 10.15 17.53 27.82 27.88 33.93
> 768 7.65 7.42 10.46 18.59 28.65 29.67 30.76
> 1024 7.46 8.01 10.90 17.77 27.04 32.18 38.80
>
> 2MB mTHP (GB/s):
>
> vanilla mt_1 mt_2 mt_4 mt_8 mt_16 mt_32
> 1 5.94 2.90 6.90 8.56 11.16 8.76 6.41
> 2 7.67 5.57 7.11 12.48 17.37 15.68 14.10
> 4 8.01 6.04 10.25 20.14 22.52 27.79 25.28
> 8 8.42 7.00 11.41 24.73 33.96 32.62 39.55
> 16 9.41 6.91 12.23 27.51 43.95 49.15 51.38
> 32 10.23 7.15 13.03 29.52 49.49 69.98 71.51
> 64 9.40 7.37 13.88 30.38 52.00 76.89 79.41
> 128 8.59 7.23 14.20 28.39 49.98 78.27 90.18
> 256 8.43 7.16 14.59 28.14 48.78 76.88 92.28
> 512 8.31 7.78 14.40 26.20 43.31 63.91 75.21
> 768 8.30 7.86 14.83 27.41 46.25 69.85 81.31
> 1024 8.31 7.90 14.96 27.62 46.75 71.76 83.84
Is this done on an idle system or a busy system? For real production
workloads, all the CPUs are likely busy. It would be great to have the
performance data collected from a busy system too.
>
>
> TODOs
> ===
> 1. Multi-threaded folio copy routine needs to look at CPU scheduler and
> only use idle CPUs to avoid interfering userspace workloads. Of course
> more complicated policies can be used based on migration issuing thread
> priority.
The other potential problem is that it is hard to attribute the cpu time
consumed by the migration work threads to cpu cgroups. In a
multi-tenant environment this may result in unfair cpu time accounting.
However, properly accounting cpu time for kernel threads is a chronic
problem; I'm not sure whether it has been solved or not.
>
> 2. Eliminate memory allocation during multi-threaded folio copy routine
> if possible.
>
> 3. A runtime check to decide when use multi-threaded folio copy.
> Something like cache hotness issue mentioned by Matthew[3].
>
> 4. Use non-temporal CPU instructions to avoid cache pollution issues.
AFAICT, arm64 already uses non-temporal instructions for copy page.
>
> 5. Explicitly make multi-threaded folio copy only available to
> !HIGHMEM, since kmap_local_page() would be needed for each kernel
> folio copy work threads and expensive.
>
> 6. A better interface than copy_page_lists_mt() to allow DMA data copy
> to be used as well.
>
> Let me know your thoughts. Thanks.
>
>
> [1] https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@amd.com/
> [2] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
> [3] https://lore.kernel.org/linux-mm/Zm0SWZKcRrngCUUW@casper.infradead.org/
>
> Byungchul Park (1):
> mm: separate move/undo doing on folio list from migrate_pages_batch()
>
> Zi Yan (4):
> mm/migrate: factor out code in move_to_new_folio() and
> migrate_folio_move()
> mm/migrate: add migrate_folios_batch_move to batch the folio move
> operations
> mm/migrate: introduce multi-threaded page copy routine
> test: add sysctl for folio copy tests and adjust
> NR_MAX_BATCHED_MIGRATION
>
> include/linux/migrate.h | 3 +
> include/linux/migrate_mode.h | 2 +
> include/linux/mm.h | 4 +
> include/linux/sysctl.h | 1 +
> kernel/sysctl.c | 29 ++-
> mm/Makefile | 2 +-
> mm/copy_pages.c | 190 +++++++++++++++
> mm/migrate.c | 443 +++++++++++++++++++++++++++--------
> 8 files changed, 577 insertions(+), 97 deletions(-)
> create mode 100644 mm/copy_pages.c
>
> --
> 2.45.2
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 5/5] test: add sysctl for folio copy tests and adjust NR_MAX_BATCHED_MIGRATION
2025-01-03 17:24 ` [RFC PATCH 5/5] test: add sysctl for folio copy tests and adjust NR_MAX_BATCHED_MIGRATION Zi Yan
@ 2025-01-03 22:21 ` Gregory Price
2025-01-03 22:56 ` Zi Yan
0 siblings, 1 reply; 30+ messages in thread
From: Gregory Price @ 2025-01-03 22:21 UTC (permalink / raw)
To: Zi Yan
Cc: linux-mm, David Rientjes, Shivank Garg, Aneesh Kumar,
David Hildenbrand, John Hubbard, Kirill Shutemov, Matthew Wilcox,
Mel Gorman, Rao, Bharata Bhasker, Rik van Riel, RaghavendraKT,
Wei Xu, Suyeon Lee, Lei Chen, Shukla, Santosh, Grimm, Jon, sj,
shy828301, Liam Howlett, Gregory Price, Huang, Ying
On Fri, Jan 03, 2025 at 12:24:19PM -0500, Zi Yan wrote:
... snip ...
> + {
> + .procname = "use_mt_copy",
> + .data = &use_mt_copy,
> + .maxlen = sizeof(use_mt_copy),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = SYSCTL_ZERO,
> + .extra2 = SYSCTL_ONE,
> + },
> + {
> + .procname = "limit_mt_num",
> + .data = &limit_mt_num,
> + .maxlen = sizeof(limit_mt_num),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = SYSCTL_ONE,
> + .extra2 = SYSCTL_32,
> + },
> + {
> + .procname = "push_0_pull_1",
> + .data = &push_0_pull_1,
> + .maxlen = sizeof(push_0_pull_1),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = SYSCTL_ZERO,
> + .extra2 = SYSCTL_ONE,
> + },
> {
> .procname = "drop_caches",
> .data = &sysctl_drop_caches,
Build errors here
~Gregory
---
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index f9ba48cd6e09..bca82e6132b3 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2093,8 +2093,8 @@ static struct ctl_table vm_table[] = {
#endif
{
.procname = "use_mt_copy",
- .data = &use_mt_copy,
- .maxlen = sizeof(use_mt_copy),
+ .data = &sysctl_use_mt_copy,
+ .maxlen = sizeof(sysctl_use_mt_copy),
.mode = 0644,
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_ZERO,
@@ -2102,8 +2102,8 @@ static struct ctl_table vm_table[] = {
},
{
.procname = "limit_mt_num",
- .data = &limit_mt_num,
- .maxlen = sizeof(limit_mt_num),
+ .data = &sysctl_limit_mt_num,
+ .maxlen = sizeof(sysctl_limit_mt_num),
.mode = 0644,
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_ONE,
@@ -2111,8 +2111,8 @@ static struct ctl_table vm_table[] = {
},
{
.procname = "push_0_pull_1",
- .data = &push_0_pull_1,
- .maxlen = sizeof(push_0_pull_1),
+ .data = &sysctl_push_0_pull_1,
+ .maxlen = sizeof(sysctl_push_0_pull_1),
.mode = 0644,
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_ZERO,
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 5/5] test: add sysctl for folio copy tests and adjust NR_MAX_BATCHED_MIGRATION
2025-01-03 22:21 ` Gregory Price
@ 2025-01-03 22:56 ` Zi Yan
0 siblings, 0 replies; 30+ messages in thread
From: Zi Yan @ 2025-01-03 22:56 UTC (permalink / raw)
To: Gregory Price
Cc: linux-mm, David Rientjes, Shivank Garg, Aneesh Kumar,
David Hildenbrand, John Hubbard, Kirill Shutemov, Matthew Wilcox,
Mel Gorman, Rao, Bharata Bhasker, Rik van Riel, RaghavendraKT,
Wei Xu, Suyeon Lee, Lei Chen, Shukla, Santosh, Grimm, Jon, sj,
shy828301, Liam Howlett, Gregory Price, Huang, Ying
On 3 Jan 2025, at 17:21, Gregory Price wrote:
> On Fri, Jan 03, 2025 at 12:24:19PM -0500, Zi Yan wrote:
> ... snip ...
>> + {
>> + .procname = "use_mt_copy",
>> + .data = &use_mt_copy,
>> + .maxlen = sizeof(use_mt_copy),
>> + .mode = 0644,
>> + .proc_handler = proc_dointvec_minmax,
>> + .extra1 = SYSCTL_ZERO,
>> + .extra2 = SYSCTL_ONE,
>> + },
>> + {
>> + .procname = "limit_mt_num",
>> + .data = &limit_mt_num,
>> + .maxlen = sizeof(limit_mt_num),
>> + .mode = 0644,
>> + .proc_handler = proc_dointvec_minmax,
>> + .extra1 = SYSCTL_ONE,
>> + .extra2 = SYSCTL_32,
>> + },
>> + {
>> + .procname = "push_0_pull_1",
>> + .data = &push_0_pull_1,
>> + .maxlen = sizeof(push_0_pull_1),
>> + .mode = 0644,
>> + .proc_handler = proc_dointvec_minmax,
>> + .extra1 = SYSCTL_ZERO,
>> + .extra2 = SYSCTL_ONE,
>> + },
>> {
>> .procname = "drop_caches",
>> .data = &sysctl_drop_caches,
>
> Build errors here
Thanks, these changes must have been lost during my patch cleanup.
>
> ~Gregory
>
> ---
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index f9ba48cd6e09..bca82e6132b3 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -2093,8 +2093,8 @@ static struct ctl_table vm_table[] = {
> #endif
> {
> .procname = "use_mt_copy",
> - .data = &use_mt_copy,
> - .maxlen = sizeof(use_mt_copy),
> + .data = &sysctl_use_mt_copy,
> + .maxlen = sizeof(sysctl_use_mt_copy),
> .mode = 0644,
> .proc_handler = proc_dointvec_minmax,
> .extra1 = SYSCTL_ZERO,
> @@ -2102,8 +2102,8 @@ static struct ctl_table vm_table[] = {
> },
> {
> .procname = "limit_mt_num",
> - .data = &limit_mt_num,
> - .maxlen = sizeof(limit_mt_num),
> + .data = &sysctl_limit_mt_num,
> + .maxlen = sizeof(sysctl_limit_mt_num),
> .mode = 0644,
> .proc_handler = proc_dointvec_minmax,
> .extra1 = SYSCTL_ONE,
> @@ -2111,8 +2111,8 @@ static struct ctl_table vm_table[] = {
> },
> {
> .procname = "push_0_pull_1",
> - .data = &push_0_pull_1,
> - .maxlen = sizeof(push_0_pull_1),
> + .data = &sysctl_push_0_pull_1,
> + .maxlen = sizeof(sysctl_push_0_pull_1),
> .mode = 0644,
> .proc_handler = proc_dointvec_minmax,
> .extra1 = SYSCTL_ZERO,
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 4/5] mm/migrate: introduce multi-threaded page copy routine
2025-01-03 17:24 ` [RFC PATCH 4/5] mm/migrate: introduce multi-threaded page copy routine Zi Yan
@ 2025-01-06 1:18 ` Hyeonggon Yoo
2025-01-06 2:01 ` Zi Yan
2025-02-13 12:44 ` Byungchul Park
1 sibling, 1 reply; 30+ messages in thread
From: Hyeonggon Yoo @ 2025-01-06 1:18 UTC (permalink / raw)
To: Zi Yan, linux-mm
Cc: kernel_team, 42.hyeyoo, David Rientjes, Shivank Garg,
Aneesh Kumar, David Hildenbrand, John Hubbard, Kirill Shutemov,
Matthew Wilcox, Mel Gorman, Rao, Bharata Bhasker, Rik van Riel,
RaghavendraKT, Wei Xu, Suyeon Lee, Lei Chen, Shukla, Santosh,
Grimm, Jon, sj, shy828301, Liam Howlett, Gregory Price, Huang,
Ying
On 2025-01-04 2:24 AM, Zi Yan wrote:
> Now page copies are batched, multi-threaded page copy can be used to
> increase page copy throughput. Add copy_page_lists_mt() to copy pages in
> multi-threaded manners. Empirical data show more than 32 base pages are
> needed to show the benefit of using multi-threaded page copy, so use 32 as
> the threshold.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
> include/linux/migrate.h | 3 +
> mm/Makefile | 2 +-
> mm/copy_pages.c | 186 ++++++++++++++++++++++++++++++++++++++++
> mm/migrate.c | 19 ++--
> 4 files changed, 199 insertions(+), 11 deletions(-)
> create mode 100644 mm/copy_pages.c
>
[...snip...]
> +++ b/mm/copy_pages.c
> @@ -0,0 +1,186 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Parallel page copy routine.
> + */
> +
> +#include <linux/sysctl.h>
> +#include <linux/highmem.h>
> +#include <linux/workqueue.h>
> +#include <linux/slab.h>
> +#include <linux/migrate.h>
> +
> +
> +unsigned int limit_mt_num = 4;
> +
> +struct copy_item {
> + char *to;
> + char *from;
> + unsigned long chunk_size;
> +};
> +
> +struct copy_page_info {
> + struct work_struct copy_page_work;
> + unsigned long num_items;
> + struct copy_item item_list[];
> +};
> +
> +static void copy_page_routine(char *vto, char *vfrom,
> + unsigned long chunk_size)
> +{
> + memcpy(vto, vfrom, chunk_size);
> +}
> +
> +static void copy_page_work_queue_thread(struct work_struct *work)
> +{
> + struct copy_page_info *my_work = (struct copy_page_info *)work;
> + int i;
> +
> + for (i = 0; i < my_work->num_items; ++i)
> + copy_page_routine(my_work->item_list[i].to,
> + my_work->item_list[i].from,
> + my_work->item_list[i].chunk_size);
> +}
> +
> +int copy_page_lists_mt(struct list_head *dst_folios,
> + struct list_head *src_folios, int nr_items)
> +{
> + int err = 0;
> + unsigned int total_mt_num = limit_mt_num;
> + int to_node = folio_nid(list_first_entry(dst_folios, struct folio, lru));
> + int i;
> + struct copy_page_info *work_items[32] = {0};
> + const struct cpumask *per_node_cpumask = cpumask_of_node(to_node);
What happens here if to_node is a NUMA node without CPUs? (e.g. CXL
node).
And even with a NUMA node with CPUs, I think offloading copies to CPUs
of either the "from node" or the "to node" will end up with a CPU touching two
pages in two different NUMA nodes anyway: one page in the local node
and the other page in the remote node.
In that sense, I don't understand when push_0_pull_1 (introduced in
patch 5) should be 0 or 1. Am I missing something?
> + int cpu_id_list[32] = {0};
> + int cpu;
> + int max_items_per_thread;
> + int item_idx;
> + struct folio *src, *src2, *dst, *dst2;
> +
> + total_mt_num = min_t(unsigned int, total_mt_num,
> + cpumask_weight(per_node_cpumask));
> +
> + if (total_mt_num > 32)
> + total_mt_num = 32;
> +
> + /* Each threads get part of each page, if nr_items < totla_mt_num */
> + if (nr_items < total_mt_num)
> + max_items_per_thread = nr_items;
> + else
> + max_items_per_thread = (nr_items / total_mt_num) +
> + ((nr_items % total_mt_num) ? 1 : 0);
> +
> +
> + for (cpu = 0; cpu < total_mt_num; ++cpu) {
> + work_items[cpu] = kzalloc(sizeof(struct copy_page_info) +
> + sizeof(struct copy_item) * max_items_per_thread,
> + GFP_NOWAIT);
> +
> + if (!work_items[cpu]) {
> + err = -ENOMEM;
> + goto free_work_items;
> + }
> + }
[...snip...]
> +
> + /* Wait until it finishes */
> + for (i = 0; i < total_mt_num; ++i)
> + flush_work((struct work_struct *)work_items[i]);
> +
> +free_work_items:
> + for (cpu = 0; cpu < total_mt_num; ++cpu)
> + kfree(work_items[cpu]);
> +
> + return err;
Should the kernel re-try migration without multi-threading if it failed
to allocate memory?
---
Hyeonggon
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 4/5] mm/migrate: introduce multi-threaded page copy routine
2025-01-06 1:18 ` Hyeonggon Yoo
@ 2025-01-06 2:01 ` Zi Yan
0 siblings, 0 replies; 30+ messages in thread
From: Zi Yan @ 2025-01-06 2:01 UTC (permalink / raw)
To: Hyeonggon Yoo
Cc: linux-mm, kernel_team, 42.hyeyoo, David Rientjes, Shivank Garg,
Aneesh Kumar, David Hildenbrand, John Hubbard, Kirill Shutemov,
Matthew Wilcox, Mel Gorman, Rao, Bharata Bhasker, Rik van Riel,
RaghavendraKT, Wei Xu, Suyeon Lee, Lei Chen, Shukla, Santosh,
Grimm, Jon, sj, shy828301, Liam Howlett, Gregory Price, Huang,
Ying
On 5 Jan 2025, at 20:18, Hyeonggon Yoo wrote:
> On 2025-01-04 2:24 AM, Zi Yan wrote:
>> Now page copies are batched, multi-threaded page copy can be used to
>> increase page copy throughput. Add copy_page_lists_mt() to copy pages in
>> multi-threaded manners. Empirical data show more than 32 base pages are
>> needed to show the benefit of using multi-threaded page copy, so use 32 as
>> the threshold.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>> include/linux/migrate.h | 3 +
>> mm/Makefile | 2 +-
>> mm/copy_pages.c | 186 ++++++++++++++++++++++++++++++++++++++++
>> mm/migrate.c | 19 ++--
>> 4 files changed, 199 insertions(+), 11 deletions(-)
>> create mode 100644 mm/copy_pages.c
>>
>
> [...snip...]
>
>> +++ b/mm/copy_pages.c
>> @@ -0,0 +1,186 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Parallel page copy routine.
>> + */
>> +
>> +#include <linux/sysctl.h>
>> +#include <linux/highmem.h>
>> +#include <linux/workqueue.h>
>> +#include <linux/slab.h>
>> +#include <linux/migrate.h>
>> +
>> +
>> +unsigned int limit_mt_num = 4;
>> +
>> +struct copy_item {
>> + char *to;
>> + char *from;
>> + unsigned long chunk_size;
>> +};
>> +
>> +struct copy_page_info {
>> + struct work_struct copy_page_work;
>> + unsigned long num_items;
>> + struct copy_item item_list[];
>> +};
>> +
>> +static void copy_page_routine(char *vto, char *vfrom,
>> + unsigned long chunk_size)
>> +{
>> + memcpy(vto, vfrom, chunk_size);
>> +}
>> +
>> +static void copy_page_work_queue_thread(struct work_struct *work)
>> +{
>> + struct copy_page_info *my_work = (struct copy_page_info *)work;
>> + int i;
>> +
>> + for (i = 0; i < my_work->num_items; ++i)
>> + copy_page_routine(my_work->item_list[i].to,
>> + my_work->item_list[i].from,
>> + my_work->item_list[i].chunk_size);
>> +}
>> +
>> +int copy_page_lists_mt(struct list_head *dst_folios,
>> + struct list_head *src_folios, int nr_items)
>> +{
>> + int err = 0;
>> + unsigned int total_mt_num = limit_mt_num;
>> + int to_node = folio_nid(list_first_entry(dst_folios, struct folio, lru));
>> + int i;
>> + struct copy_page_info *work_items[32] = {0};
>> + const struct cpumask *per_node_cpumask = cpumask_of_node(to_node);
>
> What happens here if to_node is a NUMA node without CPUs? (e.g. CXL
> node).
I did not think about that case. In that case, from_node will be used.
If both from_node and to_node are CPU-less nodes, maybe the node of the
executing CPU should be used to select the cpumask here.
>
> And even with a NUMA node with CPUs, I think offloading copies to CPUs
> of either the "from node" or the "to node" will end up with a CPU touching two
> pages in two different NUMA nodes anyway: one page in the local node
> and the other page in the remote node.
>
> In that sense, I don't understand when push_0_pull_1 (introduced in
> patch 5) should be 0 or 1. Am I missing something?
From my experiments, copy throughput differs between pushing data with the
source node's CPUs and pulling data with the destination node's CPUs. On NVIDIA
Grace CPUs, pushing data has higher throughput. Back in 2019, when I tested it
on Intel Xeon Broadwell, pulling data had higher throughput. In the final
version, a boot-time benchmark might be needed to decide whether to push or
pull data.
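For reference, a minimal sketch of the node selection I have in mind, as it
would sit next to copy_page_lists_mt() in mm/copy_pages.c (the helper name and
the CPU-less-node fallback are not in the posted patches):

/*
 * Hedged sketch, not the posted code: pick which node's CPUs run the
 * copy work items, falling back to the executing CPU's node when the
 * preferred node has no CPUs (e.g. a CXL node).
 */
static const struct cpumask *folio_copy_cpumask(struct list_head *src_folios,
						struct list_head *dst_folios,
						bool pull)
{
	int src_node = folio_nid(list_first_entry(src_folios, struct folio, lru));
	int dst_node = folio_nid(list_first_entry(dst_folios, struct folio, lru));
	/* push (pull == false): source node CPUs; pull: destination node CPUs */
	int node = pull ? dst_node : src_node;

	if (cpumask_empty(cpumask_of_node(node)))
		node = pull ? src_node : dst_node;
	if (cpumask_empty(cpumask_of_node(node)))
		node = numa_node_id();	/* both nodes are CPU-less */

	return cpumask_of_node(node);
}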
>> + int cpu_id_list[32] = {0};
>> + int cpu;
>> + int max_items_per_thread;
>> + int item_idx;
>> + struct folio *src, *src2, *dst, *dst2;
>> +
>> + total_mt_num = min_t(unsigned int, total_mt_num,
>> + cpumask_weight(per_node_cpumask));
>> +
>> + if (total_mt_num > 32)
>> + total_mt_num = 32;
>> +
>> + /* Each threads get part of each page, if nr_items < totla_mt_num */
>> + if (nr_items < total_mt_num)
>> + max_items_per_thread = nr_items;
>> + else
>> + max_items_per_thread = (nr_items / total_mt_num) +
>> + ((nr_items % total_mt_num) ? 1 : 0);
>> +
>> +
>> + for (cpu = 0; cpu < total_mt_num; ++cpu) {
>> + work_items[cpu] = kzalloc(sizeof(struct copy_page_info) +
>> + sizeof(struct copy_item) * max_items_per_thread,
>> + GFP_NOWAIT);
>> +
>> + if (!work_items[cpu]) {
>> + err = -ENOMEM;
>> + goto free_work_items;
>> + }
>> + }
>
> [...snip...]
>
>> +
>> + /* Wait until it finishes */
>> + for (i = 0; i < total_mt_num; ++i)
>> + flush_work((struct work_struct *)work_items[i]);
>> +
>> +free_work_items:
>> + for (cpu = 0; cpu < total_mt_num; ++cpu)
>> + kfree(work_items[cpu]);
>> +
>> + return err;
>
> Should the kernel re-try migration without multi-threading if it failed
> to allocate memory?
Sure. Will add it in the next version.
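Something along these lines (a rough, untested sketch; the wrapper name is
made up, and it assumes the caller sits in mm/migrate.c next to the batch move
code):

static int copy_page_lists_with_fallback(struct list_head *dst_folios,
					 struct list_head *src_folios,
					 int nr_items)
{
	int rc = copy_page_lists_mt(dst_folios, src_folios, nr_items);

	/* Work item allocation failed: fall back to the serial per-folio copy. */
	if (rc == -ENOMEM) {
		struct folio *src, *dst;

		dst = list_first_entry(dst_folios, struct folio, lru);
		list_for_each_entry(src, src_folios, lru) {
			rc = folio_mc_copy(dst, src);
			if (rc)
				break;
			dst = list_next_entry(dst, lru);
		}
	}
	return rc;
}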
Thank you for the reviews.
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
2025-01-03 22:09 ` Yang Shi
@ 2025-01-06 2:33 ` Zi Yan
0 siblings, 0 replies; 30+ messages in thread
From: Zi Yan @ 2025-01-06 2:33 UTC (permalink / raw)
To: Yang Shi
Cc: linux-mm, David Rientjes, Shivank Garg, Aneesh Kumar,
David Hildenbrand, John Hubbard, Kirill Shutemov, Matthew Wilcox,
Mel Gorman, Rao, Bharata Bhasker, Rik van Riel, RaghavendraKT,
Wei Xu, Suyeon Lee, Lei Chen, Shukla, Santosh, Grimm, Jon, sj,
Liam Howlett, Gregory Price, Huang, Ying
On 3 Jan 2025, at 17:09, Yang Shi wrote:
> On Fri, Jan 3, 2025 at 9:24 AM Zi Yan <ziy@nvidia.com> wrote:
>>
>> Hi all,
>>
>> This patchset accelerates page migration by batching folio copy operations and
>> using multiple CPU threads and is based on Shivank's Enhancements to Page
>> Migration with Batch Offloading via DMA patchset[1] and my original accelerate
>> page migration patchset[2]. It is on top of mm-everything-2025-01-03-05-59.
>> The last patch is for testing purpose and should not be considered.
>>
>> The motivations are:
>>
>> 1. Batching folio copy increases copy throughput. Especially for base page
>> migrations, folio copy throughput is low since there are kernel activities like
>> moving folio metadata and updating page table entries sit between two folio
>> copies. And base page sizes are relatively small, 4KB on x86_64, ARM64
>> and 64KB on ARM64.
>>
>> 2. Single CPU thread has limited copy throughput. Using multi threads is
>> a natural extension to speed up folio copy, when DMA engine is NOT
>> available in a system.
>>
>>
>> Design
>> ===
>>
>> It is based on Shivank's patchset and revise MIGRATE_SYNC_NO_COPY
>> (renamed to MIGRATE_NO_COPY) to avoid folio copy operation inside
>> migrate_folio_move() and perform them in one shot afterwards. A
>> copy_page_lists_mt() function is added to use multi threads to copy
>> folios from src list to dst list.
>>
>> Changes compared to Shivank's patchset (mainly rewrote batching folio
>> copy code)
>> ===
>>
>> 1. mig_info is removed, so no memory allocation is needed during
>> batching folio copies. src->private is used to store old page state and
>> anon_vma after folio metadata is copied from src to dst.
>>
>> 2. move_to_new_folio() and migrate_folio_move() are refactored to remove
>> redundant code in migrate_folios_batch_move().
>>
>> 3. folio_mc_copy() is used for the single threaded copy code to keep the
>> original kernel behavior.
>>
>>
>> Performance
>> ===
>>
>> I benchmarked move_pages() throughput on a two socket NUMA system with two
>> NVIDIA Grace CPUs. The base page size is 64KB. Both 64KB page migration and 2MB
>> mTHP page migration are measured.
>>
>> The tables below show move_pages() throughput with different
>> configurations and different numbers of copied pages. The x-axis is the
>> configurations, from vanilla Linux kernel to using 1, 2, 4, 8, 16, 32
>> threads with this patchset applied. And the unit is GB/s.
>>
>> The 32-thread copy throughput can be up to 10x of single thread serial folio
>> copy. Batching folio copy not only benefits huge page but also base
>> page.
>>
>> 64KB (GB/s):
>>
>> vanilla mt_1 mt_2 mt_4 mt_8 mt_16 mt_32
>> 32 5.43 4.90 5.65 7.31 7.60 8.61 6.43
>> 256 6.95 6.89 9.28 14.67 22.41 23.39 23.93
>> 512 7.88 7.26 10.15 17.53 27.82 27.88 33.93
>> 768 7.65 7.42 10.46 18.59 28.65 29.67 30.76
>> 1024 7.46 8.01 10.90 17.77 27.04 32.18 38.80
>>
>> 2MB mTHP (GB/s):
>>
>> vanilla mt_1 mt_2 mt_4 mt_8 mt_16 mt_32
>> 1 5.94 2.90 6.90 8.56 11.16 8.76 6.41
>> 2 7.67 5.57 7.11 12.48 17.37 15.68 14.10
>> 4 8.01 6.04 10.25 20.14 22.52 27.79 25.28
>> 8 8.42 7.00 11.41 24.73 33.96 32.62 39.55
>> 16 9.41 6.91 12.23 27.51 43.95 49.15 51.38
>> 32 10.23 7.15 13.03 29.52 49.49 69.98 71.51
>> 64 9.40 7.37 13.88 30.38 52.00 76.89 79.41
>> 128 8.59 7.23 14.20 28.39 49.98 78.27 90.18
>> 256 8.43 7.16 14.59 28.14 48.78 76.88 92.28
>> 512 8.31 7.78 14.40 26.20 43.31 63.91 75.21
>> 768 8.30 7.86 14.83 27.41 46.25 69.85 81.31
>> 1024 8.31 7.90 14.96 27.62 46.75 71.76 83.84
>
> Is this done on an idle system or a busy system? For real production
> workloads, all the CPUs are likely busy. It would be great to have the
> performance data collected from a busy system too.
Yes, it was done on an idle system.
I redid the experiments on a busy system by running stress on all CPU
cores and the results are not as good, since all CPUs are occupied.
Then I switched to system_highpri_wq and the throughput got better,
almost on par with the results on an idle machine. The numbers are
below.
It becomes a trade-off between page migration throughput and user
application performance on _a busy system_. If a page migration is badly
needed, system_highpri_wq can be used to retain high copy throughput.
Otherwise, multiple threads should not be used.
64KB with system_unbound_wq on a busy system (GB/s):
| ---- | -------- | ---- | ---- | ---- | ---- | ----- | ----- |
| | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 | mt_32 |
| ---- | -------- | ---- | ---- | ---- | ---- | ----- | ----- |
| 32 | 4.05 | 1.51 | 1.32 | 1.20 | 4.31 | 1.05 | 0.02 |
| 256 | 6.91 | 3.93 | 4.61 | 0.08 | 4.46 | 4.30 | 3.89 |
| 512 | 7.28 | 4.87 | 1.81 | 6.18 | 4.38 | 5.58 | 6.10 |
| 768 | 4.57 | 5.72 | 5.35 | 5.24 | 5.94 | 5.66 | 0.20 |
| 1024 | 7.88 | 5.73 | 5.81 | 6.52 | 7.29 | 6.06 | 5.62 |
2MB with system_unbound_wq on a busy system (GB/s):
| ---- | ------- | ---- | ---- | ---- | ----- | ----- | ----- |
| | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 | mt_32 |
| ---- | ------- | ---- | ---- | ---- | ----- | ----- | ----- |
| 1 | 1.38 | 0.59 | 1.45 | 1.99 | 1.59 | 2.18 | 1.48 |
| 2 | 1.13 | 3.08 | 3.11 | 1.85 | 0.32 | 1.46 | 2.53 |
| 4 | 8.31 | 4.02 | 5.68 | 3.22 | 2.96 | 5.77 | 2.91 |
| 8 | 8.16 | 5.09 | 1.19 | 4.96 | 4.50 | 3.36 | 4.99 |
| 16 | 3.47 | 5.13 | 5.72 | 7.06 | 5.90 | 6.49 | 5.34 |
| 32 | 8.42 | 6.97 | 0.13 | 6.77 | 7.69 | 7.56 | 2.87 |
| 64 | 7.45 | 8.06 | 7.22 | 8.60 | 8.07 | 7.16 | 0.57 |
| 128 | 7.77 | 7.93 | 7.29 | 8.31 | 7.77 | 9.05 | 0.92 |
| 256 | 6.91 | 7.20 | 6.80 | 8.56 | 7.81 | 10.13 | 11.21 |
| 512 | 6.72 | 7.22 | 7.77 | 9.71 | 10.68 | 10.35 | 10.40 |
| 768 | 6.87 | 7.18 | 7.98 | 9.28 | 10.85 | 10.83 | 14.17 |
| 1024 | 6.95 | 7.23 | 8.03 | 9.59 | 10.88 | 10.22 | 20.27 |
64KB with system_highpri_wq on a busy system (GB/s):
| ---- | ------- | ---- | ---- | ----- | ----- | ----- | ----- |
| | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 | mt_32 |
| ---- | ------- | ---- | ---- | ----- | ----- | ----- | ----- |
| 32 | 4.05 | 2.63 | 1.62 | 1.90 | 3.34 | 3.71 | 3.40 |
| 256 | 6.91 | 5.16 | 4.33 | 8.07 | 6.81 | 10.31 | 13.51 |
| 512 | 7.28 | 4.89 | 6.43 | 15.72 | 11.31 | 18.03 | 32.69 |
| 768 | 4.57 | 6.27 | 6.42 | 11.06 | 8.56 | 14.91 | 9.24 |
| 1024 | 7.88 | 6.73 | 0.49 | 17.09 | 19.34 | 23.60 | 18.12 |
2MB with system_highpri_wq on a busy system (GB/s):
| ---- | ------- | ---- | ----- | ----- | ----- | ----- | ----- |
| | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 | mt_32 |
| ---- | ------- | ---- | ----- | ----- | ----- | ----- | ----- |
| 1 | 1.38 | 1.18 | 1.17 | 5.00 | 1.68 | 3.86 | 2.46 |
| 2 | 1.13 | 1.78 | 1.05 | 0.01 | 3.52 | 1.84 | 1.80 |
| 4 | 8.31 | 3.91 | 5.24 | 4.30 | 4.12 | 2.93 | 3.44 |
| 8 | 8.16 | 6.09 | 3.67 | 7.81 | 11.10 | 8.47 | 15.21 |
| 16 | 3.47 | 6.02 | 8.44 | 11.80 | 9.56 | 12.84 | 9.81 |
| 32 | 8.42 | 7.34 | 10.10 | 13.79 | 23.03 | 26.68 | 45.24 |
| 64 | 7.45 | 7.90 | 12.27 | 19.99 | 36.08 | 35.11 | 60.26 |
| 128 | 7.77 | 7.57 | 13.35 | 24.67 | 35.03 | 41.40 | 51.68 |
| 256 | 6.91 | 7.40 | 14.13 | 25.37 | 38.83 | 62.18 | 51.37 |
| 512 | 6.72 | 7.26 | 14.72 | 27.37 | 43.99 | 66.84 | 69.63 |
| 768 | 6.87 | 7.29 | 14.84 | 26.34 | 47.21 | 67.51 | 80.32 |
| 1024 | 6.95 | 7.26 | 14.88 | 26.98 | 47.75 | 74.99 | 85.00 |
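The only change between the two sets of runs above is which workqueue the copy
work items are queued on; roughly like this (a sketch in mm/copy_pages.c terms;
the use_highpri_wq knob is hypothetical and not in the posted patches):

static void queue_copy_work(struct copy_page_info *work, int cpu,
			    bool use_highpri_wq)
{
	/* High-priority kworkers compete better with busy userspace threads. */
	struct workqueue_struct *wq = use_highpri_wq ? system_highpri_wq
						     : system_unbound_wq;

	INIT_WORK(&work->copy_page_work, copy_page_work_queue_thread);
	queue_work_on(cpu, wq, &work->copy_page_work);
}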
>
>>
>>
>> TODOs
>> ===
>> 1. Multi-threaded folio copy routine needs to look at CPU scheduler and
>> only use idle CPUs to avoid interfering userspace workloads. Of course
>> more complicated policies can be used based on migration issuing thread
>> priority.
>
> The other potential problem is it is hard to attribute cpu time
> consumed by the migration work threads to cpu cgroups. In a
> multi-tenant environment this may result in unfair cpu time counting.
> However, it is a chronic problem to properly count cpu time for kernel
> threads. I'm not sure whether it has been solved or not.
>
>>
>> 2. Eliminate memory allocation during multi-threaded folio copy routine
>> if possible.
>>
>> 3. A runtime check to decide when use multi-threaded folio copy.
>> Something like cache hotness issue mentioned by Matthew[3].
>>
>> 4. Use non-temporal CPU instructions to avoid cache pollution issues.
>
> AFAICT, arm64 already uses non-temporal instructions for copy page.
Right. My current implementation uses memcpy(), which does not use non-temporal
stores on ARM64, since a huge page can be split and copied by multiple threads,
each copying only a chunk rather than a whole page. A non-temporal memcpy can
be added for this use case.
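For illustration, the idea in userspace C on x86 (not kernel code; arm64 would
use STNP instead, and in the kernel the SIMD context would have to be handled):

#include <emmintrin.h>
#include <stddef.h>

/*
 * Copy a 16-byte-aligned chunk with SSE2 streaming stores so the
 * destination lines bypass the cache. Chunks produced by splitting a
 * page among threads are multiples of 16 bytes.
 */
static void copy_chunk_nocache(void *dst, const void *src, size_t chunk_size)
{
	__m128i *d = dst;
	const __m128i *s = src;
	size_t i;

	for (i = 0; i < chunk_size / sizeof(__m128i); i++)
		_mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
	_mm_sfence();	/* make the streaming stores globally visible */
}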
Thank you for the inputs.
>
>>
>> 5. Explicitly make multi-threaded folio copy only available to
>> !HIGHMEM, since kmap_local_page() would be needed for each kernel
>> folio copy work threads and expensive.
>>
>> 6. A better interface than copy_page_lists_mt() to allow DMA data copy
>> to be used as well.
>>
>> Let me know your thoughts. Thanks.
>>
>>
>> [1] https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@amd.com/
>> [2] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
>> [3] https://lore.kernel.org/linux-mm/Zm0SWZKcRrngCUUW@casper.infradead.org/
>>
>> Byungchul Park (1):
>> mm: separate move/undo doing on folio list from migrate_pages_batch()
>>
>> Zi Yan (4):
>> mm/migrate: factor out code in move_to_new_folio() and
>> migrate_folio_move()
>> mm/migrate: add migrate_folios_batch_move to batch the folio move
>> operations
>> mm/migrate: introduce multi-threaded page copy routine
>> test: add sysctl for folio copy tests and adjust
>> NR_MAX_BATCHED_MIGRATION
>>
>> include/linux/migrate.h | 3 +
>> include/linux/migrate_mode.h | 2 +
>> include/linux/mm.h | 4 +
>> include/linux/sysctl.h | 1 +
>> kernel/sysctl.c | 29 ++-
>> mm/Makefile | 2 +-
>> mm/copy_pages.c | 190 +++++++++++++++
>> mm/migrate.c | 443 +++++++++++++++++++++++++++--------
>> 8 files changed, 577 insertions(+), 97 deletions(-)
>> create mode 100644 mm/copy_pages.c
>>
>> --
>> 2.45.2
>>
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
2025-01-03 17:24 [RFC PATCH 0/5] Accelerate page migration with batching and multi threads Zi Yan
` (6 preceding siblings ...)
2025-01-03 22:09 ` Yang Shi
@ 2025-01-09 11:47 ` Shivank Garg
2025-01-09 15:04 ` Zi Yan
2025-02-13 8:17 ` Byungchul Park
8 siblings, 1 reply; 30+ messages in thread
From: Shivank Garg @ 2025-01-09 11:47 UTC (permalink / raw)
To: Zi Yan, linux-mm
Cc: David Rientjes, Aneesh Kumar, David Hildenbrand, John Hubbard,
Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
Lei Chen, Shukla, Santosh, Grimm, Jon, sj, shy828301,
Liam Howlett, Gregory Price, Huang, Ying
On 1/3/2025 10:54 PM, Zi Yan wrote:
Hi Zi,
It's interesting to see my batch page migration patchset evolving with
multi-threading support. Thanks for sharing this.
> Hi all,
>
> This patchset accelerates page migration by batching folio copy operations and
> using multiple CPU threads and is based on Shivank's Enhancements to Page
> Migration with Batch Offloading via DMA patchset[1] and my original accelerate
> page migration patchset[2]. It is on top of mm-everything-2025-01-03-05-59.
> The last patch is for testing purpose and should not be considered.
>
> The motivations are:
>
> 1. Batching folio copy increases copy throughput. Especially for base page
> migrations, folio copy throughput is low since there are kernel activities like
> moving folio metadata and updating page table entries sit between two folio
> copies. And base page sizes are relatively small, 4KB on x86_64, ARM64
> and 64KB on ARM64.
>
> 2. Single CPU thread has limited copy throughput. Using multi threads is
> a natural extension to speed up folio copy, when DMA engine is NOT
> available in a system.
>
>
> Design
> ===
>
> It is based on Shivank's patchset and revise MIGRATE_SYNC_NO_COPY
> (renamed to MIGRATE_NO_COPY) to avoid folio copy operation inside
> migrate_folio_move() and perform them in one shot afterwards. A
> copy_page_lists_mt() function is added to use multi threads to copy
> folios from src list to dst list.
>
> Changes compared to Shivank's patchset (mainly rewrote batching folio
> copy code)
> ===
>
> 1. mig_info is removed, so no memory allocation is needed during
> batching folio copies. src->private is used to store old page state and
> anon_vma after folio metadata is copied from src to dst.
>
> 2. move_to_new_folio() and migrate_folio_move() are refactored to remove
> redundant code in migrate_folios_batch_move().
>
> 3. folio_mc_copy() is used for the single threaded copy code to keep the
> original kernel behavior.
>
>
>
> TODOs
> ===
> 1. Multi-threaded folio copy routine needs to look at CPU scheduler and
> only use idle CPUs to avoid interfering userspace workloads. Of course
> more complicated policies can be used based on migration issuing thread
> priority.
>
> 2. Eliminate memory allocation during multi-threaded folio copy routine
> if possible.
>
> 3. A runtime check to decide when use multi-threaded folio copy.
> Something like cache hotness issue mentioned by Matthew[3].
>
> 4. Use non-temporal CPU instructions to avoid cache pollution issues.
>
> 5. Explicitly make multi-threaded folio copy only available to
> !HIGHMEM, since kmap_local_page() would be needed for each kernel
> folio copy work threads and expensive.
>
> 6. A better interface than copy_page_lists_mt() to allow DMA data copy
> to be used as well.
I think Static Calls can be a better option for this.
They would give a flexible copy interface that supports both CPU and various
DMA-based folio copies. A DMA-capable driver can override the default CPU copy
path without any additional runtime overhead.
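Roughly like this (a sketch; folio_copy_backend, folios_copy_default and
dma_copy_page_lists are made-up names, only copy_page_lists_mt() comes from the
patchset):

#include <linux/migrate.h>
#include <linux/static_call.h>

static int folios_copy_default(struct list_head *dst_folios,
			       struct list_head *src_folios, int nr_items)
{
	return copy_page_lists_mt(dst_folios, src_folios, nr_items);
}

DEFINE_STATIC_CALL(folio_copy_backend, folios_copy_default);

/* Call site, e.g. in migrate_folios_batch_move():
 *	rc = static_call(folio_copy_backend)(dst_folios, src_folios, nr_items);
 *
 * A DMA-capable driver would switch the backend once at probe time:
 *	static_call_update(folio_copy_backend, &dma_copy_page_lists);
 */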
> Performance
> ===
>
> I benchmarked move_pages() throughput on a two socket NUMA system with two
> NVIDIA Grace CPUs. The base page size is 64KB. Both 64KB page migration and 2MB
> mTHP page migration are measured.
>
> The tables below show move_pages() throughput with different
> configurations and different numbers of copied pages. The x-axis is the
> configurations, from vanilla Linux kernel to using 1, 2, 4, 8, 16, 32
> threads with this patchset applied. And the unit is GB/s.
>
> The 32-thread copy throughput can be up to 10x of single thread serial folio
> copy. Batching folio copy not only benefits huge page but also base
> page.
>
> 64KB (GB/s):
>
> vanilla mt_1 mt_2 mt_4 mt_8 mt_16 mt_32
> 32 5.43 4.90 5.65 7.31 7.60 8.61 6.43
> 256 6.95 6.89 9.28 14.67 22.41 23.39 23.93
> 512 7.88 7.26 10.15 17.53 27.82 27.88 33.93
> 768 7.65 7.42 10.46 18.59 28.65 29.67 30.76
> 1024 7.46 8.01 10.90 17.77 27.04 32.18 38.80
>
> 2MB mTHP (GB/s):
>
> vanilla mt_1 mt_2 mt_4 mt_8 mt_16 mt_32
> 1 5.94 2.90 6.90 8.56 11.16 8.76 6.41
> 2 7.67 5.57 7.11 12.48 17.37 15.68 14.10
> 4 8.01 6.04 10.25 20.14 22.52 27.79 25.28
> 8 8.42 7.00 11.41 24.73 33.96 32.62 39.55
> 16 9.41 6.91 12.23 27.51 43.95 49.15 51.38
> 32 10.23 7.15 13.03 29.52 49.49 69.98 71.51
> 64 9.40 7.37 13.88 30.38 52.00 76.89 79.41
> 128 8.59 7.23 14.20 28.39 49.98 78.27 90.18
> 256 8.43 7.16 14.59 28.14 48.78 76.88 92.28
> 512 8.31 7.78 14.40 26.20 43.31 63.91 75.21
> 768 8.30 7.86 14.83 27.41 46.25 69.85 81.31
> 1024 8.31 7.90 14.96 27.62 46.75 71.76 83.84
I'm measuring the throughput (in GB/s) on our AMD EPYC Zen 5 system
(2-socket, 64 cores per socket with SMT enabled, 2 NUMA nodes) with a 4KB
base page size, using mm-everything-2025-01-04-04-41 as the base kernel.
Method:
======
main() {
...
// code snippet to measure throughput
clock_gettime(CLOCK_MONOTONIC, &t1);
retcode = move_pages(getpid(), num_pages, pages, nodesArray , statusArray, MPOL_MF_MOVE);
clock_gettime(CLOCK_MONOTONIC, &t2);
// tput = num_pages*PAGE_SIZE/(t2-t1)
...
}
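A self-contained version of the same measurement (a sketch: minimal error
handling, the destination node is assumed to be 1, and the pages are
first-touched on whichever node the process runs on, e.g. pinned with
numactl --membind=0; link with -lnuma):

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned long num_pages = argc > 1 ? strtoul(argv[1], NULL, 0) : 512;
	int dst_node = 1;	/* assumed destination node */
	void **pages = calloc(num_pages, sizeof(*pages));
	int *nodes = calloc(num_pages, sizeof(*nodes));
	int *status = calloc(num_pages, sizeof(*status));
	char *buf = aligned_alloc(page_size, num_pages * page_size);
	struct timespec t1, t2;
	unsigned long i;
	double secs;
	long ret;

	/* Fault the pages in on the source node before migrating them. */
	memset(buf, 1, num_pages * page_size);
	for (i = 0; i < num_pages; i++) {
		pages[i] = buf + i * page_size;
		nodes[i] = dst_node;
	}

	clock_gettime(CLOCK_MONOTONIC, &t1);
	ret = move_pages(getpid(), num_pages, pages, nodes, status,
			 MPOL_MF_MOVE);
	clock_gettime(CLOCK_MONOTONIC, &t2);
	if (ret < 0) {
		perror("move_pages");
		return 1;
	}

	secs = (t2.tv_sec - t1.tv_sec) + (t2.tv_nsec - t1.tv_nsec) / 1e9;
	printf("%.2f GB/s\n", num_pages * page_size / secs / 1e9);
	return 0;
}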
Measurements:
============
vanilla: base kernel without patchset
mt:0 = MT kernel with use_mt_copy=0
mt:1..mt:32 = MT kernel with use_mt_copy=1 and thread cnt = 1,2,...,32
Measured for both configurations, push_0_pull_1=0 and push_0_pull_1=1, and
for both 4KB and THP migration.
--------------------
#1 push_0_pull_1 = 0 (src node CPUs are used)
#1.1 THP=Never, 4KB (GB/s):
nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
512 1.28 1.28 1.92 1.80 2.24 2.35 2.22 2.17
4096 2.40 2.40 2.51 2.58 2.83 2.72 2.99 3.25
8192 3.18 2.88 2.83 2.69 3.49 3.46 3.57 3.80
16348 3.17 2.94 2.96 3.17 3.63 3.68 4.06 4.15
#1.2 THP=Always, 2MB (GB/s):
nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
512 4.31 5.02 3.39 3.40 3.33 3.51 3.91 4.03
1024 7.13 4.49 3.58 3.56 3.91 3.87 4.39 4.57
2048 5.26 6.47 3.91 4.00 3.71 3.85 4.97 6.83
4096 9.93 7.77 4.58 3.79 3.93 3.53 6.41 4.77
8192 6.47 6.33 4.37 4.67 4.52 4.39 5.30 5.37
16348 7.66 8.00 5.20 5.22 5.24 5.28 6.41 7.02
32768 8.56 8.62 6.34 6.20 6.20 6.19 7.18 8.10
65536 9.41 9.40 7.14 7.15 7.15 7.19 7.96 8.89
262144 10.17 10.19 7.26 7.90 7.98 8.05 9.46 10.30
524288 10.40 9.95 7.25 7.93 8.02 8.76 9.55 10.30
--------------------
#2 push_0_pull_1 = 1 (dst node CPUs are used):
#2.1 THP=Never 4KB (GB/s):
nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
512 1.28 1.36 2.01 2.74 2.33 2.31 2.53 2.96
4096 2.40 2.84 2.94 3.04 3.40 3.23 3.31 4.16
8192 3.18 3.27 3.34 3.94 3.77 3.68 4.23 4.76
16348 3.17 3.42 3.66 3.21 3.82 4.40 4.76 4.89
#2.2 THP=Always 2MB (GB/s):
nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
512 4.31 5.91 4.03 3.73 4.26 4.13 4.78 3.44
1024 7.13 6.83 4.60 5.13 5.03 5.19 5.94 7.25
2048 5.26 7.09 5.20 5.69 5.83 5.73 6.85 8.13
4096 9.93 9.31 4.90 4.82 4.82 5.26 8.46 8.52
8192 6.47 7.63 5.66 5.85 5.75 6.14 7.45 8.63
16348 7.66 10.00 6.35 6.54 6.66 6.99 8.18 10.21
32768 8.56 9.78 7.06 7.41 7.76 9.02 9.55 11.92
65536 9.41 10.00 8.19 9.20 9.32 8.68 11.00 13.31
262144 10.17 11.17 9.01 9.96 9.99 10.00 11.70 14.27
524288 10.40 11.38 9.07 9.98 10.01 10.09 11.95 14.48
Note:
1. For THP=Never: I'm using 16x as many pages to keep the total size the same
as your experiment with a 64KB page size.
2. For THP=Always: nr_pages is the number of 4KB pages moved
(nr_pages=512 => 512 4KB pages => one 2MB page).
I'm seeing little (1.5X in some cases) to no benefit. The performance scaling is
relatively flat across thread counts.
Is it possible I'm missing something in my testing?
Could the base page size difference (4KB vs 64KB) be playing a role in
the scaling behavior? How does the performance vary with 4KB pages on your system?
I'd be happy to work with you on investigating these differences.
Let me know if you'd like any additional test data or if there are specific
configurations I should try.
>
> Let me know your thoughts. Thanks.
>
>
> [1] https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@amd.com/
> [2] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
> [3] https://lore.kernel.org/linux-mm/Zm0SWZKcRrngCUUW@casper.infradead.org/
>
> Byungchul Park (1):
> mm: separate move/undo doing on folio list from migrate_pages_batch()
>
> Zi Yan (4):
> mm/migrate: factor out code in move_to_new_folio() and
> migrate_folio_move()
> mm/migrate: add migrate_folios_batch_move to batch the folio move
> operations
> mm/migrate: introduce multi-threaded page copy routine
> test: add sysctl for folio copy tests and adjust
> NR_MAX_BATCHED_MIGRATION
>
> include/linux/migrate.h | 3 +
> include/linux/migrate_mode.h | 2 +
> include/linux/mm.h | 4 +
> include/linux/sysctl.h | 1 +
> kernel/sysctl.c | 29 ++-
> mm/Makefile | 2 +-
> mm/copy_pages.c | 190 +++++++++++++++
> mm/migrate.c | 443 +++++++++++++++++++++++++++--------
> 8 files changed, 577 insertions(+), 97 deletions(-)
> create mode 100644 mm/copy_pages.c
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 3/5] mm/migrate: add migrate_folios_batch_move to batch the folio move operations
2025-01-03 17:24 ` [RFC PATCH 3/5] mm/migrate: add migrate_folios_batch_move to batch the folio move operations Zi Yan
@ 2025-01-09 11:47 ` Shivank Garg
2025-01-09 14:08 ` Zi Yan
0 siblings, 1 reply; 30+ messages in thread
From: Shivank Garg @ 2025-01-09 11:47 UTC (permalink / raw)
To: Zi Yan, linux-mm
Cc: David Rientjes, Aneesh Kumar, David Hildenbrand, John Hubbard,
Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
Lei Chen, Shukla, Santosh, Grimm, Jon, sj, shy828301,
Liam Howlett, Gregory Price, Huang, Ying
On 1/3/2025 10:54 PM, Zi Yan wrote:
> This is a preparatory patch that enables batch copying for folios
> undergoing migration. By enabling batch copying the folio content, we can
> efficiently utilize the capabilities of DMA hardware or multi-threaded
> folio copy. It also adds MIGRATE_NO_COPY back to migrate_mode, so that
> folio copy will be skipped during metadata copy process and performed
> in a batch later.
>
> Currently, the folio move operation is performed individually for each
> folio in sequential manner:
> for_each_folio() {
> Copy folio metadata like flags and mappings
> Copy the folio content from src to dst
> Update page tables with dst folio
> }
>
> With this patch, we transition to a batch processing approach as shown
> below:
> for_each_folio() {
> Copy folio metadata like flags and mappings
> }
> Batch copy all src folios to dst
> for_each_folio() {
> Update page tables with dst folios
> }
>
> dst->private is used to store page states and possible anon_vma value,
> thus needs to be cleared during metadata copy process. To avoid additional
> memory allocation to store the data during batch copy process, src->private
> is used to store the data after metadata copy process, since src is no
> longer used.
>
> Originally-by: Shivank Garg <shivankg@amd.com>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
Hi Zi,
Please retain my Signed-off-by for future postings of the batch page migration
patchset.
I think we can separate out the MIGRATE_NO_COPY support into a separate patch.
Thanks,
Shivank
> include/linux/migrate_mode.h | 2 +
> mm/migrate.c | 207 +++++++++++++++++++++++++++++++++--
> 2 files changed, 201 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
> index 265c4328b36a..9af6c949a057 100644
> --- a/include/linux/migrate_mode.h
> +++ b/include/linux/migrate_mode.h
> @@ -7,11 +7,13 @@
> * on most operations but not ->writepage as the potential stall time
> * is too significant
> * MIGRATE_SYNC will block when migrating pages
> + * MIGRATE_NO_COPY will not copy page content
> */
> enum migrate_mode {
> MIGRATE_ASYNC,
> MIGRATE_SYNC_LIGHT,
> MIGRATE_SYNC,
> + MIGRATE_NO_COPY,
> };
>
> enum migrate_reason {
> diff --git a/mm/migrate.c b/mm/migrate.c
> index a83508f94c57..95c4cc4a7823 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -51,6 +51,7 @@
>
> #include "internal.h"
>
> +
> bool isolate_movable_page(struct page *page, isolate_mode_t mode)
> {
> struct folio *folio = folio_get_nontail_page(page);
> @@ -752,14 +753,19 @@ static int __migrate_folio(struct address_space *mapping, struct folio *dst,
> enum migrate_mode mode)
> {
> int rc, expected_count = folio_expected_refs(mapping, src);
> + unsigned long dst_private = (unsigned long)dst->private;
>
> /* Check whether src does not have extra refs before we do more work */
> if (folio_ref_count(src) != expected_count)
> return -EAGAIN;
>
> - rc = folio_mc_copy(dst, src);
> - if (unlikely(rc))
> - return rc;
> + if (mode == MIGRATE_NO_COPY)
> + dst->private = NULL;
> + else {
> + rc = folio_mc_copy(dst, src);
> + if (unlikely(rc))
> + return rc;
> + }
>
> rc = __folio_migrate_mapping(mapping, dst, src, expected_count);
> if (rc != MIGRATEPAGE_SUCCESS)
> @@ -769,6 +775,10 @@ static int __migrate_folio(struct address_space *mapping, struct folio *dst,
> folio_attach_private(dst, folio_detach_private(src));
>
> folio_migrate_flags(dst, src);
> +
> + if (mode == MIGRATE_NO_COPY)
> + src->private = (void *)dst_private;
> +
> return MIGRATEPAGE_SUCCESS;
> }
>
> @@ -1042,7 +1052,7 @@ static int _move_to_new_folio_prep(struct folio *dst, struct folio *src,
> mode);
> else
> rc = fallback_migrate_folio(mapping, dst, src, mode);
> - } else {
> + } else if (mode != MIGRATE_NO_COPY) {
> const struct movable_operations *mops;
>
> /*
> @@ -1060,7 +1070,8 @@ static int _move_to_new_folio_prep(struct folio *dst, struct folio *src,
> rc = mops->migrate_page(&dst->page, &src->page, mode);
> WARN_ON_ONCE(rc == MIGRATEPAGE_SUCCESS &&
> !folio_test_isolated(src));
> - }
> + } else
> + rc = -EAGAIN;
> out:
> return rc;
> }
> @@ -1138,7 +1149,7 @@ static void __migrate_folio_record(struct folio *dst,
> dst->private = (void *)anon_vma + old_page_state;
> }
>
> -static void __migrate_folio_extract(struct folio *dst,
> +static void __migrate_folio_read(struct folio *dst,
> int *old_page_state,
> struct anon_vma **anon_vmap)
> {
> @@ -1146,6 +1157,13 @@ static void __migrate_folio_extract(struct folio *dst,
>
> *anon_vmap = (struct anon_vma *)(private & ~PAGE_OLD_STATES);
> *old_page_state = private & PAGE_OLD_STATES;
> +}
> +
> +static void __migrate_folio_extract(struct folio *dst,
> + int *old_page_state,
> + struct anon_vma **anon_vmap)
> +{
> + __migrate_folio_read(dst, old_page_state, anon_vmap);
> dst->private = NULL;
> }
>
> @@ -1771,6 +1789,174 @@ static void migrate_folios_move(struct list_head *src_folios,
> }
> }
>
> +static void migrate_folios_batch_move(struct list_head *src_folios,
> + struct list_head *dst_folios,
> + free_folio_t put_new_folio, unsigned long private,
> + enum migrate_mode mode, int reason,
> + struct list_head *ret_folios,
> + struct migrate_pages_stats *stats,
> + int *retry, int *thp_retry, int *nr_failed,
> + int *nr_retry_pages)
> +{
> + struct folio *folio, *folio2, *dst, *dst2;
> + int rc, nr_pages = 0, nr_mig_folios = 0;
> + int old_page_state = 0;
> + struct anon_vma *anon_vma = NULL;
> + bool is_lru;
> + int is_thp = 0;
> + LIST_HEAD(err_src);
> + LIST_HEAD(err_dst);
> +
> + if (mode != MIGRATE_ASYNC) {
> + *retry += 1;
> + return;
> + }
> +
> + /*
> + * Iterate over the list of locked src/dst folios to copy the metadata
> + */
> + dst = list_first_entry(dst_folios, struct folio, lru);
> + dst2 = list_next_entry(dst, lru);
> + list_for_each_entry_safe(folio, folio2, src_folios, lru) {
> + is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
> + nr_pages = folio_nr_pages(folio);
> + is_lru = !__folio_test_movable(folio);
> +
> + /*
> + * dst->private is not cleared here. It is cleared and moved to
> + * src->private in __migrate_folio().
> + */
> + __migrate_folio_read(dst, &old_page_state, &anon_vma);
> +
> + /*
> + * Use MIGRATE_NO_COPY mode in migrate_folio family functions
> + * to copy the flags, mapping and some other ancillary information.
> + * This does everything except the page copy. The actual page copy
> + * is handled later in a batch manner.
> + */
> + rc = _move_to_new_folio_prep(dst, folio, MIGRATE_NO_COPY);
> +
> + /*
> + * -EAGAIN: Move src/dst folios to tmp lists for retry
> + * Other Errno: Put src folio on ret_folios list, remove the dst folio
> + * Success: Copy the folio bytes, restoring working pte, unlock and
> + * decrement refcounter
> + */
> + if (rc == -EAGAIN) {
> + *retry += 1;
> + *thp_retry += is_thp;
> + *nr_retry_pages += nr_pages;
> +
> + list_move_tail(&folio->lru, &err_src);
> + list_move_tail(&dst->lru, &err_dst);
> + __migrate_folio_record(dst, old_page_state, anon_vma);
> + } else if (rc != MIGRATEPAGE_SUCCESS) {
> + *nr_failed += 1;
> + stats->nr_thp_failed += is_thp;
> + stats->nr_failed_pages += nr_pages;
> +
> + list_del(&dst->lru);
> + migrate_folio_undo_src(folio, old_page_state & PAGE_WAS_MAPPED,
> + anon_vma, true, ret_folios);
> + migrate_folio_undo_dst(dst, true, put_new_folio, private);
> + } else /* MIGRATEPAGE_SUCCESS */
> + nr_mig_folios++;
> +
> + dst = dst2;
> + dst2 = list_next_entry(dst, lru);
> + }
> +
> + /* Exit if folio list for batch migration is empty */
> + if (!nr_mig_folios)
> + goto out;
> +
> + /* Batch copy the folios */
> + {
> + dst = list_first_entry(dst_folios, struct folio, lru);
> + dst2 = list_next_entry(dst, lru);
> + list_for_each_entry_safe(folio, folio2, src_folios, lru) {
> + is_thp = folio_test_large(folio) &&
> + folio_test_pmd_mappable(folio);
> + nr_pages = folio_nr_pages(folio);
> + rc = folio_mc_copy(dst, folio);
> +
> + if (rc) {
> + int old_page_state = 0;
> + struct anon_vma *anon_vma = NULL;
> +
> + /*
> + * dst->private is moved to src->private in
> + * __migrate_folio(), so page state and anon_vma
> + * values can be extracted from (src) folio.
> + */
> + __migrate_folio_extract(folio, &old_page_state,
> + &anon_vma);
> + migrate_folio_undo_src(folio,
> + old_page_state & PAGE_WAS_MAPPED,
> + anon_vma, true, ret_folios);
> + list_del(&dst->lru);
> + migrate_folio_undo_dst(dst, true, put_new_folio,
> + private);
> + }
> +
> + switch (rc) {
> + case MIGRATEPAGE_SUCCESS:
> + stats->nr_succeeded += nr_pages;
> + stats->nr_thp_succeeded += is_thp;
> + break;
> + default:
> + *nr_failed += 1;
> + stats->nr_thp_failed += is_thp;
> + stats->nr_failed_pages += nr_pages;
> + break;
> + }
> +
> + dst = dst2;
> + dst2 = list_next_entry(dst, lru);
> + }
> + }
> +
> + /*
> + * Iterate the folio lists to remove migration pte and restore them
> + * as working pte. Unlock the folios, add/remove them to LRU lists (if
> + * applicable) and release the src folios.
> + */
> + dst = list_first_entry(dst_folios, struct folio, lru);
> + dst2 = list_next_entry(dst, lru);
> + list_for_each_entry_safe(folio, folio2, src_folios, lru) {
> + is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
> + nr_pages = folio_nr_pages(folio);
> + /*
> + * dst->private is moved to src->private in __migrate_folio(),
> + * so page state and anon_vma values can be extracted from
> + * (src) folio.
> + */
> + __migrate_folio_extract(folio, &old_page_state, &anon_vma);
> + list_del(&dst->lru);
> +
> + _move_to_new_folio_finalize(dst, folio, MIGRATEPAGE_SUCCESS);
> +
> + /*
> + * Below few steps are only applicable for lru pages which is
> + * ensured as we have removed the non-lru pages from our list.
> + */
> + _migrate_folio_move_finalize1(folio, dst, old_page_state);
> +
> + _migrate_folio_move_finalize2(folio, dst, reason, anon_vma);
> +
> + /* Page migration successful, increase stat counter */
> + stats->nr_succeeded += nr_pages;
> + stats->nr_thp_succeeded += is_thp;
> +
> + dst = dst2;
> + dst2 = list_next_entry(dst, lru);
> + }
> +out:
> + /* Add tmp folios back to the list to let CPU re-attempt migration. */
> + list_splice(&err_src, src_folios);
> + list_splice(&err_dst, dst_folios);
> +}
> +
> static void migrate_folios_undo(struct list_head *src_folios,
> struct list_head *dst_folios,
> free_folio_t put_new_folio, unsigned long private,
> @@ -1981,13 +2167,18 @@ static int migrate_pages_batch(struct list_head *from,
> /* Flush TLBs for all unmapped folios */
> try_to_unmap_flush();
>
> - retry = 1;
> + retry = 0;
> + /* Batch move the unmapped folios */
> + migrate_folios_batch_move(&unmap_folios, &dst_folios, put_new_folio,
> + private, mode, reason, ret_folios, stats, &retry,
> + &thp_retry, &nr_failed, &nr_retry_pages);
> +
> for (pass = 0; pass < nr_pass && retry; pass++) {
> retry = 0;
> thp_retry = 0;
> nr_retry_pages = 0;
>
> - /* Move the unmapped folios */
> + /* Move the remaining unmapped folios */
> migrate_folios_move(&unmap_folios, &dst_folios,
> put_new_folio, private, mode, reason,
> ret_folios, stats, &retry, &thp_retry,
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 3/5] mm/migrate: add migrate_folios_batch_move to batch the folio move operations
2025-01-09 11:47 ` Shivank Garg
@ 2025-01-09 14:08 ` Zi Yan
0 siblings, 0 replies; 30+ messages in thread
From: Zi Yan @ 2025-01-09 14:08 UTC (permalink / raw)
To: Shivank Garg
Cc: linux-mm, David Rientjes, Aneesh Kumar, David Hildenbrand,
John Hubbard, Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
Lei Chen, Shukla, Santosh, Grimm, Jon, sj, shy828301,
Liam Howlett, Gregory Price, Huang, Ying
On 9 Jan 2025, at 6:47, Shivank Garg wrote:
> On 1/3/2025 10:54 PM, Zi Yan wrote:
>> This is a preparatory patch that enables batch copying for folios
>> undergoing migration. By enabling batch copying the folio content, we can
>> efficiently utilize the capabilities of DMA hardware or multi-threaded
>> folio copy. It also adds MIGRATE_NO_COPY back to migrate_mode, so that
>> folio copy will be skipped during metadata copy process and performed
>> in a batch later.
>>
>> Currently, the folio move operation is performed individually for each
>> folio in sequential manner:
>> for_each_folio() {
>> Copy folio metadata like flags and mappings
>> Copy the folio content from src to dst
>> Update page tables with dst folio
>> }
>>
>> With this patch, we transition to a batch processing approach as shown
>> below:
>> for_each_folio() {
>> Copy folio metadata like flags and mappings
>> }
>> Batch copy all src folios to dst
>> for_each_folio() {
>> Update page tables with dst folios
>> }
>>
>> dst->private is used to store page states and possible anon_vma value,
>> thus needs to be cleared during metadata copy process. To avoid additional
>> memory allocation to store the data during batch copy process, src->private
>> is used to store the data after metadata copy process, since src is no
>> longer used.
>>
>> Originally-by: Shivank Garg <shivankg@amd.com>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>
> Hi Zi,
>
> Please retain my Signed-off-by for future postings of the batch page migration
> patchset.
>
> I think we can separate out the MIGRATE_NO_COPY support into a separate patch.
Sure. Will change them in the next version.
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
2025-01-09 11:47 ` Shivank Garg
@ 2025-01-09 15:04 ` Zi Yan
2025-01-09 18:03 ` Shivank Garg
0 siblings, 1 reply; 30+ messages in thread
From: Zi Yan @ 2025-01-09 15:04 UTC (permalink / raw)
To: Shivank Garg
Cc: linux-mm, David Rientjes, Aneesh Kumar, David Hildenbrand,
John Hubbard, Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
Lei Chen, Shukla, Santosh, Grimm, Jon, sj, shy828301,
Liam Howlett, Gregory Price, Huang, Ying
On 9 Jan 2025, at 6:47, Shivank Garg wrote:
> On 1/3/2025 10:54 PM, Zi Yan wrote:
>
> Hi Zi,
>
> It's interesting to see my batch page migration patchset evolving with
> multi-threading support. Thanks for sharing this.
>
>> Hi all,
>>
>> This patchset accelerates page migration by batching folio copy operations and
>> using multiple CPU threads and is based on Shivank's Enhancements to Page
>> Migration with Batch Offloading via DMA patchset[1] and my original accelerate
>> page migration patchset[2]. It is on top of mm-everything-2025-01-03-05-59.
>> The last patch is for testing purpose and should not be considered.
>>
>> The motivations are:
>>
>> 1. Batching folio copy increases copy throughput. Especially for base page
>> migrations, folio copy throughput is low since there are kernel activities like
>> moving folio metadata and updating page table entries sit between two folio
>> copies. And base page sizes are relatively small, 4KB on x86_64, ARM64
>> and 64KB on ARM64.
>>
>> 2. Single CPU thread has limited copy throughput. Using multi threads is
>> a natural extension to speed up folio copy, when DMA engine is NOT
>> available in a system.
>>
>>
>> Design
>> ===
>>
>> It is based on Shivank's patchset and revise MIGRATE_SYNC_NO_COPY
>> (renamed to MIGRATE_NO_COPY) to avoid folio copy operation inside
>> migrate_folio_move() and perform them in one shot afterwards. A
>> copy_page_lists_mt() function is added to use multi threads to copy
>> folios from src list to dst list.
>>
>> Changes compared to Shivank's patchset (mainly rewrote batching folio
>> copy code)
>> ===
>>
>> 1. mig_info is removed, so no memory allocation is needed during
>> batching folio copies. src->private is used to store old page state and
>> anon_vma after folio metadata is copied from src to dst.
>>
>> 2. move_to_new_folio() and migrate_folio_move() are refactored to remove
>> redundant code in migrate_folios_batch_move().
>>
>> 3. folio_mc_copy() is used for the single threaded copy code to keep the
>> original kernel behavior.
>>
>>
>
>
>>
>> TODOs
>> ===
>> 1. Multi-threaded folio copy routine needs to look at CPU scheduler and
>> only use idle CPUs to avoid interfering userspace workloads. Of course
>> more complicated policies can be used based on migration issuing thread
>> priority.
>>
>> 2. Eliminate memory allocation during multi-threaded folio copy routine
>> if possible.
>>
>> 3. A runtime check to decide when use multi-threaded folio copy.
>> Something like cache hotness issue mentioned by Matthew[3].
>>
>> 4. Use non-temporal CPU instructions to avoid cache pollution issues.
>
>>
>> 5. Explicitly make multi-threaded folio copy only available to
>> !HIGHMEM, since kmap_local_page() would be needed for each kernel
>> folio copy work threads and expensive.
>>
>> 6. A better interface than copy_page_lists_mt() to allow DMA data copy
>> to be used as well.
>
> I think Static Calls can be a better option for this.
This is the first time I have heard about it. Based on the info I found, I agree
it is a great mechanism for switching between two methods globally.
>
> They would give a flexible copy interface that supports both CPU and various
> DMA-based folio copies. A DMA-capable driver can override the default CPU copy
> path without any additional runtime overhead.
Yes, supporting DMA-based folio copy is my intention too. I am happy to
work with you on that. Things to note are:
1. The DMA engine should have higher copy throughput than a single CPU thread,
otherwise the scatter-gather setup overheads will eliminate the benefit of
using the DMA engine.
2. Unless the DMA engine is really beefy and can handle all possible page
migration requests, CPU-based migration (single or multiple threads) should be
a fallback.
In terms of 2, I wonder how much overhead Static Calls have when switching
between functions. Also, a lock might be needed, since falling back to the CPU
might be per migrate_pages(). Considering these two, Static Calls might not
work as you intended if switching between CPU and DMA is needed.
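What I have in mind is closer to a per-call fallback, e.g. (a sketch; the
static key and dma_copy_page_lists() are hypothetical, only copy_page_lists_mt()
comes from this patchset):

#include <linux/jump_label.h>
#include <linux/migrate.h>

DEFINE_STATIC_KEY_FALSE(dma_folio_copy_enabled);

/* Hypothetical DMA entry point, enabled by a driver at probe time. */
int dma_copy_page_lists(struct list_head *dst_folios,
			struct list_head *src_folios, int nr_items);

static int folios_copy(struct list_head *dst_folios,
		       struct list_head *src_folios, int nr_items)
{
	int rc = -ENODEV;

	if (static_branch_likely(&dma_folio_copy_enabled))
		rc = dma_copy_page_lists(dst_folios, src_folios, nr_items);
	if (rc)	/* DMA unavailable or failed: fall back to CPU threads */
		rc = copy_page_lists_mt(dst_folios, src_folios, nr_items);
	return rc;
}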
>
>
>> Performance
>> ===
>>
>> I benchmarked move_pages() throughput on a two socket NUMA system with two
>> NVIDIA Grace CPUs. The base page size is 64KB. Both 64KB page migration and 2MB
>> mTHP page migration are measured.
>>
>> The tables below show move_pages() throughput with different
>> configurations and different numbers of copied pages. The x-axis is the
>> configurations, from vanilla Linux kernel to using 1, 2, 4, 8, 16, 32
>> threads with this patchset applied. And the unit is GB/s.
>>
>> The 32-thread copy throughput can be up to 10x of single thread serial folio
>> copy. Batching folio copy not only benefits huge page but also base
>> page.
>>
>> 64KB (GB/s):
>>
>> vanilla mt_1 mt_2 mt_4 mt_8 mt_16 mt_32
>> 32 5.43 4.90 5.65 7.31 7.60 8.61 6.43
>> 256 6.95 6.89 9.28 14.67 22.41 23.39 23.93
>> 512 7.88 7.26 10.15 17.53 27.82 27.88 33.93
>> 768 7.65 7.42 10.46 18.59 28.65 29.67 30.76
>> 1024 7.46 8.01 10.90 17.77 27.04 32.18 38.80
>>
>> 2MB mTHP (GB/s):
>>
>> vanilla mt_1 mt_2 mt_4 mt_8 mt_16 mt_32
>> 1 5.94 2.90 6.90 8.56 11.16 8.76 6.41
>> 2 7.67 5.57 7.11 12.48 17.37 15.68 14.10
>> 4 8.01 6.04 10.25 20.14 22.52 27.79 25.28
>> 8 8.42 7.00 11.41 24.73 33.96 32.62 39.55
>> 16 9.41 6.91 12.23 27.51 43.95 49.15 51.38
>> 32 10.23 7.15 13.03 29.52 49.49 69.98 71.51
>> 64 9.40 7.37 13.88 30.38 52.00 76.89 79.41
>> 128 8.59 7.23 14.20 28.39 49.98 78.27 90.18
>> 256 8.43 7.16 14.59 28.14 48.78 76.88 92.28
>> 512 8.31 7.78 14.40 26.20 43.31 63.91 75.21
>> 768 8.30 7.86 14.83 27.41 46.25 69.85 81.31
>> 1024 8.31 7.90 14.96 27.62 46.75 71.76 83.84
>
> I'm measuring the throughput (in GB/s) on our AMD EPYC Zen 5 system
> (2-socket, 64 cores per socket with SMT enabled, 2 NUMA nodes) with a 4KB
> base page size, using mm-everything-2025-01-04-04-41 as the base kernel.
>
> Method:
> ======
> main() {
> ...
>
> // code snippet to measure throughput
> clock_gettime(CLOCK_MONOTONIC, &t1);
> retcode = move_pages(getpid(), num_pages, pages, nodesArray , statusArray, MPOL_MF_MOVE);
> clock_gettime(CLOCK_MONOTONIC, &t2);
>
> // tput = num_pages*PAGE_SIZE/(t2-t1)
>
> ...
> }
>
>
> Measurements:
> ============
> vanilla: base kernel without patchset
> mt:0 = MT kernel with use_mt_copy=0
> mt:1..mt:32 = MT kernel with use_mt_copy=1 and thread cnt = 1,2,...,32
>
> Measured for both configurations, push_0_pull_1=0 and push_0_pull_1=1, and
> for both 4KB and THP migration.
>
> --------------------
> #1 push_0_pull_1 = 0 (src node CPUs are used)
>
> #1.1 THP=Never, 4KB (GB/s):
> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
> 512 1.28 1.28 1.92 1.80 2.24 2.35 2.22 2.17
> 4096 2.40 2.40 2.51 2.58 2.83 2.72 2.99 3.25
> 8192 3.18 2.88 2.83 2.69 3.49 3.46 3.57 3.80
> 16348 3.17 2.94 2.96 3.17 3.63 3.68 4.06 4.15
>
> #1.2 THP=Always, 2MB (GB/s):
> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
> 512 4.31 5.02 3.39 3.40 3.33 3.51 3.91 4.03
> 1024 7.13 4.49 3.58 3.56 3.91 3.87 4.39 4.57
> 2048 5.26 6.47 3.91 4.00 3.71 3.85 4.97 6.83
> 4096 9.93 7.77 4.58 3.79 3.93 3.53 6.41 4.77
> 8192 6.47 6.33 4.37 4.67 4.52 4.39 5.30 5.37
> 16348 7.66 8.00 5.20 5.22 5.24 5.28 6.41 7.02
> 32768 8.56 8.62 6.34 6.20 6.20 6.19 7.18 8.10
> 65536 9.41 9.40 7.14 7.15 7.15 7.19 7.96 8.89
> 262144 10.17 10.19 7.26 7.90 7.98 8.05 9.46 10.30
> 524288 10.40 9.95 7.25 7.93 8.02 8.76 9.55 10.30
>
> --------------------
> #2 push_0_pull_1 = 1 (dst node CPUs are used):
>
> #2.1 THP=Never 4KB (GB/s):
> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
> 512 1.28 1.36 2.01 2.74 2.33 2.31 2.53 2.96
> 4096 2.40 2.84 2.94 3.04 3.40 3.23 3.31 4.16
> 8192 3.18 3.27 3.34 3.94 3.77 3.68 4.23 4.76
> 16348 3.17 3.42 3.66 3.21 3.82 4.40 4.76 4.89
>
> #2.2 THP=Always 2MB (GB/s):
> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
> 512 4.31 5.91 4.03 3.73 4.26 4.13 4.78 3.44
> 1024 7.13 6.83 4.60 5.13 5.03 5.19 5.94 7.25
> 2048 5.26 7.09 5.20 5.69 5.83 5.73 6.85 8.13
> 4096 9.93 9.31 4.90 4.82 4.82 5.26 8.46 8.52
> 8192 6.47 7.63 5.66 5.85 5.75 6.14 7.45 8.63
> 16348 7.66 10.00 6.35 6.54 6.66 6.99 8.18 10.21
> 32768 8.56 9.78 7.06 7.41 7.76 9.02 9.55 11.92
> 65536 9.41 10.00 8.19 9.20 9.32 8.68 11.00 13.31
> 262144 10.17 11.17 9.01 9.96 9.99 10.00 11.70 14.27
> 524288 10.40 11.38 9.07 9.98 10.01 10.09 11.95 14.48
>
> Note:
> 1. For THP = Never: I'm doing for 16X pages to keep total size same for your
> experiment with 64KB pagesize)
> 2. For THP = Always: nr_pages = Number of 4KB pages moved.
> nr_pages=512 => 512 4KB pages => 1 2MB page)
>
>
> I'm seeing little (1.5X in some cases) to no benefits. The performance scaling is
> relatively flat across thread counts.
>
> Is it possible I'm missing something in my testing?
>
> Could the base page size difference (4KB vs 64KB) be playing a role in
> the scaling behavior? How the performance varies with 4KB pages on your system?
>
> I'd be happy to work with you on investigating this differences.
> Let me know if you'd like any additional test data or if there are specific
> configurations I should try.
The results surprise me, since I was able to achieve ~9GB/s when migrating
16 2MB THPs with 16 threads on a two-socket system with Xeon E5-2650 v3 @ 2.30GHz
(a 19.2GB/s bandwidth QPI link between the two sockets) back in 2019[1].
These are 10-year-old Haswell CPUs, yet your results above show that EPYC Zen 5 can
only achieve ~4GB/s when migrating 512 2MB THPs with 16 threads. It just does
not make sense.
One thing you might want to try is to set init_on_alloc=0 in your boot
parameters, so that folio_zero_user() is used instead of GFP_ZERO to zero pages. That
might reduce the time spent zeroing pages.
I am also going to rerun the experiments locally on x86_64 boxes to see if your
results can be replicated.
Thank you for the review and for running these experiments. I really appreciate it.
[1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
2025-01-09 15:04 ` Zi Yan
@ 2025-01-09 18:03 ` Shivank Garg
2025-01-09 19:32 ` Zi Yan
0 siblings, 1 reply; 30+ messages in thread
From: Shivank Garg @ 2025-01-09 18:03 UTC (permalink / raw)
To: Zi Yan
Cc: linux-mm, David Rientjes, Aneesh Kumar, David Hildenbrand,
John Hubbard, Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
Lei Chen, Shukla, Santosh, Grimm, Jon, sj, shy828301,
Liam Howlett, Gregory Price, Huang, Ying
On 1/9/2025 8:34 PM, Zi Yan wrote:
> On 9 Jan 2025, at 6:47, Shivank Garg wrote:
>
>> On 1/3/2025 10:54 PM, Zi Yan wrote:
>>
>>>
>>> 6. A better interface than copy_page_lists_mt() to allow DMA data copy
>>> to be used as well.
>>
>> I think Static Calls can be better option for this.
>
> This is the first time I hear about it. Based on the info I find, I agree
> it is a great mechanism to switch between two methods globally.
>>
>> This will give a flexible copy interface to support both CPU and various DMA-based
>> folio copy. DMA-capable driver can override the default CPU copy path without any
>> additional runtime overheads.
>
> Yes, supporting DMA-based folio copy is also my intention too. I am happy to
> with you on that. Things to note are:
> 1. DMA engine should have more copy throughput as a single CPU thread, otherwise
> the scatter-gather setup overheads will eliminate the benefit of using DMA engine.
I agree on this.
> 2. Unless the DMA engine is really beef and can handle all possible page migration
> requests, CPU-based migration (single or multi threads) should be a fallback.
>
> In terms of 2, I wonder how much overheads does Static Calls have when switching
> between functions. Also, a lock might be needed since falling back to CPU might
> be per migrate_pages(). Considering these two, Static Calls might not work
> as you intended if switching between CPU and DMA is needed.
You can check patches 4/5 and 5/5 for the static call implementation using the DMA driver:
https://lore.kernel.org/linux-mm/20240614221525.19170-5-shivankg@amd.com
There is no run-time overhead in this static call approach, as the update happens only
during DMA driver registration/un-registration - dma_update_migrator().
The SRCU synchronization ensures safety during updates.
It'll use static_call(_folios_copy)() for the copy path. A wrapper inside the DMA driver can
ensure it falls back to folios_copy().
Does this address your concern regarding point 2?
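Roughly, the wiring looks like this. This is a simplified sketch only, not the exact
patch code; the folios_copy() signature and the SRCU usage shown here are assumptions:

#include <linux/static_call.h>
#include <linux/srcu.h>
#include <linux/list.h>

/* CPU copy path from the patchset; signature assumed for this sketch. */
void folios_copy(struct list_head *dst_list, struct list_head *src_list);

DEFINE_STATIC_CALL(_folios_copy, folios_copy);
DEFINE_STATIC_SRCU(migrator_srcu);

/* Called on DMA driver registration (copy_fn != NULL) or un-registration (NULL). */
void dma_update_migrator(void (*copy_fn)(struct list_head *, struct list_head *))
{
        static_call_update(_folios_copy, copy_fn ? copy_fn : folios_copy);
        synchronize_srcu(&migrator_srcu);       /* wait for in-flight copy calls */
}

/* Copy path used during batched migration (illustrative caller name). */
static void migrate_folios_copy(struct list_head *dst, struct list_head *src)
{
        int idx = srcu_read_lock(&migrator_srcu);

        static_call(_folios_copy)(dst, src);
        srcu_read_unlock(&migrator_srcu, idx);
}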
>> main() {
>> ...
>>
>> // code snippet to measure throughput
>> clock_gettime(CLOCK_MONOTONIC, &t1);
>> retcode = move_pages(getpid(), num_pages, pages, nodesArray , statusArray, MPOL_MF_MOVE);
>> clock_gettime(CLOCK_MONOTONIC, &t2);
>>
>> // tput = num_pages*PAGE_SIZE/(t2-t1)
>>
>> ...
>> }
>>
>>
>> Measurements:
>> ============
>> vanilla: base kernel without patchset
>> mt:0 = MT kernel with use_mt_copy=0
>> mt:1..mt:32 = MT kernel with use_mt_copy=1 and thread cnt = 1,2,...,32
>>
>> Measured for both configuration push_0_pull_1=0 and push_0_pull_1=1 and
>> for 4KB migration and THP migration.
>>
>> --------------------
>> #1 push_0_pull_1 = 0 (src node CPUs are used)
>>
>> #1.1 THP=Never, 4KB (GB/s):
>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>> 512 1.28 1.28 1.92 1.80 2.24 2.35 2.22 2.17
>> 4096 2.40 2.40 2.51 2.58 2.83 2.72 2.99 3.25
>> 8192 3.18 2.88 2.83 2.69 3.49 3.46 3.57 3.80
>> 16348 3.17 2.94 2.96 3.17 3.63 3.68 4.06 4.15
>>
>> #1.2 THP=Always, 2MB (GB/s):
>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>> 512 4.31 5.02 3.39 3.40 3.33 3.51 3.91 4.03
>> 1024 7.13 4.49 3.58 3.56 3.91 3.87 4.39 4.57
>> 2048 5.26 6.47 3.91 4.00 3.71 3.85 4.97 6.83
>> 4096 9.93 7.77 4.58 3.79 3.93 3.53 6.41 4.77
>> 8192 6.47 6.33 4.37 4.67 4.52 4.39 5.30 5.37
>> 16348 7.66 8.00 5.20 5.22 5.24 5.28 6.41 7.02
>> 32768 8.56 8.62 6.34 6.20 6.20 6.19 7.18 8.10
>> 65536 9.41 9.40 7.14 7.15 7.15 7.19 7.96 8.89
>> 262144 10.17 10.19 7.26 7.90 7.98 8.05 9.46 10.30
>> 524288 10.40 9.95 7.25 7.93 8.02 8.76 9.55 10.30
>>
>> --------------------
>> #2 push_0_pull_1 = 1 (dst node CPUs are used):
>>
>> #2.1 THP=Never 4KB (GB/s):
>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>> 512 1.28 1.36 2.01 2.74 2.33 2.31 2.53 2.96
>> 4096 2.40 2.84 2.94 3.04 3.40 3.23 3.31 4.16
>> 8192 3.18 3.27 3.34 3.94 3.77 3.68 4.23 4.76
>> 16348 3.17 3.42 3.66 3.21 3.82 4.40 4.76 4.89
>>
>> #2.2 THP=Always 2MB (GB/s):
>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>> 512 4.31 5.91 4.03 3.73 4.26 4.13 4.78 3.44
>> 1024 7.13 6.83 4.60 5.13 5.03 5.19 5.94 7.25
>> 2048 5.26 7.09 5.20 5.69 5.83 5.73 6.85 8.13
>> 4096 9.93 9.31 4.90 4.82 4.82 5.26 8.46 8.52
>> 8192 6.47 7.63 5.66 5.85 5.75 6.14 7.45 8.63
>> 16348 7.66 10.00 6.35 6.54 6.66 6.99 8.18 10.21
>> 32768 8.56 9.78 7.06 7.41 7.76 9.02 9.55 11.92
>> 65536 9.41 10.00 8.19 9.20 9.32 8.68 11.00 13.31
>> 262144 10.17 11.17 9.01 9.96 9.99 10.00 11.70 14.27
>> 524288 10.40 11.38 9.07 9.98 10.01 10.09 11.95 14.48
>>
>> Note:
>> 1. For THP = Never: I'm doing for 16X pages to keep total size same for your
>> experiment with 64KB pagesize)
>> 2. For THP = Always: nr_pages = Number of 4KB pages moved.
>> nr_pages=512 => 512 4KB pages => 1 2MB page)
>>
>>
>> I'm seeing little (1.5X in some cases) to no benefits. The performance scaling is
>> relatively flat across thread counts.
>>
>> Is it possible I'm missing something in my testing?
>>
>> Could the base page size difference (4KB vs 64KB) be playing a role in
>> the scaling behavior? How the performance varies with 4KB pages on your system?
>>
>> I'd be happy to work with you on investigating this differences.
>> Let me know if you'd like any additional test data or if there are specific
>> configurations I should try.
>
> The results surprises me, since I was able to achieve ~9GB/s when migrating
> 16 2MB THPs with 16 threads on a two socket system with Xeon E5-2650 v3 @ 2.30GHz
> (a 19.2GB/s bandwidth QPI link between two sockets) back in 2019[1].
> These are 10-year-old Haswell CPUs. And your results above show that EPYC 5 can
> only achieve ~4GB/s when migrating 512 2MB THPs with 16 threads. It just does
> not make sense.
>
> One thing you might want to try is to set init_on_alloc=0 in your boot
> parameters to use folio_zero_user() instead of GFP_ZERO to zero pages. That
> might reduce the time spent on page zeros.
>
> I am also going to rerun the experiments locally on x86_64 boxes to see if your
> results can be replicated.
>
> Thank you for the review and running these experiments. I really appreciate
> it.>
>
> [1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
>
Using init_on_alloc=0 gave a significant performance gain over the last experiment,
but I'm still missing the performance scaling you observed.
THP Never, 4KB (GB/s)
nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
512 1.40 1.43 2.79 3.48 3.63 3.73 3.63 3.57
4096 2.54 3.32 3.18 4.65 4.83 5.11 5.39 5.78
8192 3.35 4.40 4.39 4.71 3.63 5.04 5.33 6.00
16348 3.76 4.50 4.44 5.33 5.41 5.41 6.47 6.41
THP Always, 2MB (GB/s)
nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
512 5.21 5.47 5.77 6.92 3.71 2.75 7.54 7.44
1024 6.10 7.65 8.12 8.41 8.87 8.55 9.13 11.36
2048 6.39 6.66 9.58 8.92 10.75 12.99 13.33 12.23
4096 7.33 10.85 8.22 13.57 11.43 10.93 12.53 16.86
8192 7.26 7.46 8.88 11.82 10.55 10.94 13.27 14.11
16348 9.07 8.53 11.82 14.89 12.97 13.22 16.14 18.10
32768 10.45 10.55 11.79 19.19 16.85 17.56 20.58 26.57
65536 11.00 11.12 13.25 18.27 16.18 16.11 19.61 27.73
262144 12.37 12.40 15.65 20.00 19.25 19.38 22.60 31.95
524288 12.44 12.33 15.66 19.78 19.06 18.96 23.31 32.29
Thanks,
Shivank
> Best Regards,
> Yan, Zi
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
2025-01-09 18:03 ` Shivank Garg
@ 2025-01-09 19:32 ` Zi Yan
2025-01-10 17:05 ` Zi Yan
0 siblings, 1 reply; 30+ messages in thread
From: Zi Yan @ 2025-01-09 19:32 UTC (permalink / raw)
To: Shivank Garg
Cc: linux-mm, David Rientjes, Aneesh Kumar, David Hildenbrand,
John Hubbard, Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
Lei Chen, Shukla, Santosh, Grimm, Jon, sj, shy828301,
Liam Howlett, Gregory Price, Huang, Ying
On 9 Jan 2025, at 13:03, Shivank Garg wrote:
> On 1/9/2025 8:34 PM, Zi Yan wrote:
>> On 9 Jan 2025, at 6:47, Shivank Garg wrote:
>>
>>> On 1/3/2025 10:54 PM, Zi Yan wrote:
>>>
>
>
>>>>
>>>> 6. A better interface than copy_page_lists_mt() to allow DMA data copy
>>>> to be used as well.
>>>
>>> I think Static Calls can be better option for this.
>>
>> This is the first time I hear about it. Based on the info I find, I agree
>> it is a great mechanism to switch between two methods globally.
>>>
>>> This will give a flexible copy interface to support both CPU and various DMA-based
>>> folio copy. DMA-capable driver can override the default CPU copy path without any
>>> additional runtime overheads.
>>
>> Yes, supporting DMA-based folio copy is also my intention too. I am happy to
>> with you on that. Things to note are:
>> 1. DMA engine should have more copy throughput as a single CPU thread, otherwise
>> the scatter-gather setup overheads will eliminate the benefit of using DMA engine.
>
> I agree on this.
>
>> 2. Unless the DMA engine is really beef and can handle all possible page migration
>> requests, CPU-based migration (single or multi threads) should be a fallback.
>>
>> In terms of 2, I wonder how much overheads does Static Calls have when switching
>> between functions. Also, a lock might be needed since falling back to CPU might
>> be per migrate_pages(). Considering these two, Static Calls might not work
>> as you intended if switching between CPU and DMA is needed.
>
> You can check Patch 4/5 and 5/5 for static call implementation for using DMA Driver
> https://lore.kernel.org/linux-mm/20240614221525.19170-5-shivankg@amd.com
>
> There are no run-time overheads of this Static call approach as update happens only
> during DMA driver registration/un-registration - dma_update_migrator()
> The SRCU synchronization will ensure the safety during updates.
I understand this part.
>
> It'll use static_call(_folios_copy)() for the copy path. A wrapper inside the DMA can
> ensure it fallback to folios_copy().
>
> Does this address your concern regarding the 2?
The DMA driver will need to fall back to folios_copy() (using the CPU to copy folios)
when it thinks the DMA engine is overloaded. In my mind, the kernel should decide
whether to use a single CPU, multiple CPUs, or the DMA engine based on
CPU usage and DMA usage. As I am writing this, I realize that might be an overhead
we want to avoid, since gathering CPU usage and DMA usage information takes time
and should not be on the critical path of page migration. A better approach might
be for the CPU scheduler and the DMA engine to call dma_update_migrator() to change
_folios_copy in the static call, based on CPU usage and DMA usage.
Yes, I think Static Calls should be able to help us choose the right folio copy
method (single CPU, multiple CPUs, or DMA engine) to perform folio copies.
BTW, I notice that you call dmaengine_get_dma_device() in folios_copy_dma(),
which would incur a huge overhead, based on my past experience using a DMA engine
for page copy. I know it is needed to make sure the DMA engine is still present, but
its cost needs to be minimized to make DMA folio copy usable. Otherwise,
the 768MB/s DMA copy throughput from your cover letter cannot convince people
to use it for page migration, since a single CPU can achieve more than that,
as you showed in the table below.
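One way to minimize that cost is to resolve the device once at DMA driver registration
time and reuse the cached pointer in folios_copy_dma(). A rough sketch; the function
and variable names here are hypothetical, not taken from your patchset:

#include <linux/dmaengine.h>

static struct dma_chan *migrate_dma_chan;
static struct device *migrate_dma_dev;  /* cached backing device of the channel */

/* Hypothetical registration hook: look up the DMA device once, not per copy. */
int migrate_dma_register(struct dma_chan *chan)
{
        migrate_dma_chan = chan;
        migrate_dma_dev = dmaengine_get_dma_device(chan);
        return migrate_dma_dev ? 0 : -ENODEV;
}
/* folios_copy_dma() would then use migrate_dma_dev directly for mappings. */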
>
>
>>> main() {
>>> ...
>>>
>>> // code snippet to measure throughput
>>> clock_gettime(CLOCK_MONOTONIC, &t1);
>>> retcode = move_pages(getpid(), num_pages, pages, nodesArray , statusArray, MPOL_MF_MOVE);
>>> clock_gettime(CLOCK_MONOTONIC, &t2);
>>>
>>> // tput = num_pages*PAGE_SIZE/(t2-t1)
>>>
>>> ...
>>> }
>>>
>>>
>>> Measurements:
>>> ============
>>> vanilla: base kernel without patchset
>>> mt:0 = MT kernel with use_mt_copy=0
>>> mt:1..mt:32 = MT kernel with use_mt_copy=1 and thread cnt = 1,2,...,32
>>>
>>> Measured for both configuration push_0_pull_1=0 and push_0_pull_1=1 and
>>> for 4KB migration and THP migration.
>>>
>>> --------------------
>>> #1 push_0_pull_1 = 0 (src node CPUs are used)
>>>
>>> #1.1 THP=Never, 4KB (GB/s):
>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>> 512 1.28 1.28 1.92 1.80 2.24 2.35 2.22 2.17
>>> 4096 2.40 2.40 2.51 2.58 2.83 2.72 2.99 3.25
>>> 8192 3.18 2.88 2.83 2.69 3.49 3.46 3.57 3.80
>>> 16348 3.17 2.94 2.96 3.17 3.63 3.68 4.06 4.15
>>>
>>> #1.2 THP=Always, 2MB (GB/s):
>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>> 512 4.31 5.02 3.39 3.40 3.33 3.51 3.91 4.03
>>> 1024 7.13 4.49 3.58 3.56 3.91 3.87 4.39 4.57
>>> 2048 5.26 6.47 3.91 4.00 3.71 3.85 4.97 6.83
>>> 4096 9.93 7.77 4.58 3.79 3.93 3.53 6.41 4.77
>>> 8192 6.47 6.33 4.37 4.67 4.52 4.39 5.30 5.37
>>> 16348 7.66 8.00 5.20 5.22 5.24 5.28 6.41 7.02
>>> 32768 8.56 8.62 6.34 6.20 6.20 6.19 7.18 8.10
>>> 65536 9.41 9.40 7.14 7.15 7.15 7.19 7.96 8.89
>>> 262144 10.17 10.19 7.26 7.90 7.98 8.05 9.46 10.30
>>> 524288 10.40 9.95 7.25 7.93 8.02 8.76 9.55 10.30
>>>
>>> --------------------
>>> #2 push_0_pull_1 = 1 (dst node CPUs are used):
>>>
>>> #2.1 THP=Never 4KB (GB/s):
>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>> 512 1.28 1.36 2.01 2.74 2.33 2.31 2.53 2.96
>>> 4096 2.40 2.84 2.94 3.04 3.40 3.23 3.31 4.16
>>> 8192 3.18 3.27 3.34 3.94 3.77 3.68 4.23 4.76
>>> 16348 3.17 3.42 3.66 3.21 3.82 4.40 4.76 4.89
>>>
>>> #2.2 THP=Always 2MB (GB/s):
>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>> 512 4.31 5.91 4.03 3.73 4.26 4.13 4.78 3.44
>>> 1024 7.13 6.83 4.60 5.13 5.03 5.19 5.94 7.25
>>> 2048 5.26 7.09 5.20 5.69 5.83 5.73 6.85 8.13
>>> 4096 9.93 9.31 4.90 4.82 4.82 5.26 8.46 8.52
>>> 8192 6.47 7.63 5.66 5.85 5.75 6.14 7.45 8.63
>>> 16348 7.66 10.00 6.35 6.54 6.66 6.99 8.18 10.21
>>> 32768 8.56 9.78 7.06 7.41 7.76 9.02 9.55 11.92
>>> 65536 9.41 10.00 8.19 9.20 9.32 8.68 11.00 13.31
>>> 262144 10.17 11.17 9.01 9.96 9.99 10.00 11.70 14.27
>>> 524288 10.40 11.38 9.07 9.98 10.01 10.09 11.95 14.48
>>>
>>> Note:
>>> 1. For THP = Never: I'm doing for 16X pages to keep total size same for your
>>> experiment with 64KB pagesize)
>>> 2. For THP = Always: nr_pages = Number of 4KB pages moved.
>>> nr_pages=512 => 512 4KB pages => 1 2MB page)
>>>
>>>
>>> I'm seeing little (1.5X in some cases) to no benefits. The performance scaling is
>>> relatively flat across thread counts.
>>>
>>> Is it possible I'm missing something in my testing?
>>>
>>> Could the base page size difference (4KB vs 64KB) be playing a role in
>>> the scaling behavior? How the performance varies with 4KB pages on your system?
>>>
>>> I'd be happy to work with you on investigating this differences.
>>> Let me know if you'd like any additional test data or if there are specific
>>> configurations I should try.
>>
>> The results surprises me, since I was able to achieve ~9GB/s when migrating
>> 16 2MB THPs with 16 threads on a two socket system with Xeon E5-2650 v3 @ 2.30GHz
>> (a 19.2GB/s bandwidth QPI link between two sockets) back in 2019[1].
>> These are 10-year-old Haswell CPUs. And your results above show that EPYC 5 can
>> only achieve ~4GB/s when migrating 512 2MB THPs with 16 threads. It just does
>> not make sense.
>>
>> One thing you might want to try is to set init_on_alloc=0 in your boot
>> parameters to use folio_zero_user() instead of GFP_ZERO to zero pages. That
>> might reduce the time spent on page zeros.
>>
>> I am also going to rerun the experiments locally on x86_64 boxes to see if your
>> results can be replicated.
>>
>> Thank you for the review and running these experiments. I really appreciate
>> it.>
>>
>> [1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
>>
>
> Using init_on_alloc=0 gave significant performance gain over the last experiment
> but I'm still missing the performance scaling you observed.
It might be the difference between x86 and ARM64, but I am not 100% sure.
Based on your data below, 2 or 4 threads seem to be the sweet spot for
the multi-threaded method on AMD CPUs. BTW, what is the bandwidth between the
two sockets in your system? From Figure 10 in [1], I see the InfiniBand bandwidth
between two AMD EPYC 7601 @ 2.2GHz was measured at ~12GB/s unidirectional,
~25GB/s bidirectional. I wonder if your results below are cross-socket
link bandwidth limited.
From my results, the NVIDIA Grace CPU can achieve high copy throughput
with more threads between two sockets; maybe part of the reason is that
its cross-socket link theoretical bandwidth is 900GB/s bidirectional.
>
> THP Never
> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
> 512 1.40 1.43 2.79 3.48 3.63 3.73 3.63 3.57
> 4096 2.54 3.32 3.18 4.65 4.83 5.11 5.39 5.78
> 8192 3.35 4.40 4.39 4.71 3.63 5.04 5.33 6.00
> 16348 3.76 4.50 4.44 5.33 5.41 5.41 6.47 6.41
>
> THP Always
> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
> 512 5.21 5.47 5.77 6.92 3.71 2.75 7.54 7.44
> 1024 6.10 7.65 8.12 8.41 8.87 8.55 9.13 11.36
> 2048 6.39 6.66 9.58 8.92 10.75 12.99 13.33 12.23
> 4096 7.33 10.85 8.22 13.57 11.43 10.93 12.53 16.86
> 8192 7.26 7.46 8.88 11.82 10.55 10.94 13.27 14.11
> 16348 9.07 8.53 11.82 14.89 12.97 13.22 16.14 18.10
> 32768 10.45 10.55 11.79 19.19 16.85 17.56 20.58 26.57
> 65536 11.00 11.12 13.25 18.27 16.18 16.11 19.61 27.73
> 262144 12.37 12.40 15.65 20.00 19.25 19.38 22.60 31.95
> 524288 12.44 12.33 15.66 19.78 19.06 18.96 23.31 32.29
[1] https://www.dell.com/support/kbdoc/en-us/000143393/amd-epyc-stream-hpl-infiniband-and-wrf-performance-study
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
2025-01-09 19:32 ` Zi Yan
@ 2025-01-10 17:05 ` Zi Yan
2025-01-10 19:51 ` Zi Yan
0 siblings, 1 reply; 30+ messages in thread
From: Zi Yan @ 2025-01-10 17:05 UTC (permalink / raw)
To: Shivank Garg
Cc: linux-mm, David Rientjes, Aneesh Kumar, David Hildenbrand,
John Hubbard, Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
Lei Chen, Shukla, Santosh, Grimm, Jon, sj, shy828301,
Liam Howlett, Gregory Price, Huang, Ying
<snip>
>>
>>>> main() {
>>>> ...
>>>>
>>>> // code snippet to measure throughput
>>>> clock_gettime(CLOCK_MONOTONIC, &t1);
>>>> retcode = move_pages(getpid(), num_pages, pages, nodesArray , statusArray, MPOL_MF_MOVE);
>>>> clock_gettime(CLOCK_MONOTONIC, &t2);
>>>>
>>>> // tput = num_pages*PAGE_SIZE/(t2-t1)
>>>>
>>>> ...
>>>> }
>>>>
>>>>
>>>> Measurements:
>>>> ============
>>>> vanilla: base kernel without patchset
>>>> mt:0 = MT kernel with use_mt_copy=0
>>>> mt:1..mt:32 = MT kernel with use_mt_copy=1 and thread cnt = 1,2,...,32
>>>>
>>>> Measured for both configuration push_0_pull_1=0 and push_0_pull_1=1 and
>>>> for 4KB migration and THP migration.
>>>>
>>>> --------------------
>>>> #1 push_0_pull_1 = 0 (src node CPUs are used)
>>>>
>>>> #1.1 THP=Never, 4KB (GB/s):
>>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>>> 512 1.28 1.28 1.92 1.80 2.24 2.35 2.22 2.17
>>>> 4096 2.40 2.40 2.51 2.58 2.83 2.72 2.99 3.25
>>>> 8192 3.18 2.88 2.83 2.69 3.49 3.46 3.57 3.80
>>>> 16348 3.17 2.94 2.96 3.17 3.63 3.68 4.06 4.15
>>>>
>>>> #1.2 THP=Always, 2MB (GB/s):
>>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>>> 512 4.31 5.02 3.39 3.40 3.33 3.51 3.91 4.03
>>>> 1024 7.13 4.49 3.58 3.56 3.91 3.87 4.39 4.57
>>>> 2048 5.26 6.47 3.91 4.00 3.71 3.85 4.97 6.83
>>>> 4096 9.93 7.77 4.58 3.79 3.93 3.53 6.41 4.77
>>>> 8192 6.47 6.33 4.37 4.67 4.52 4.39 5.30 5.37
>>>> 16348 7.66 8.00 5.20 5.22 5.24 5.28 6.41 7.02
>>>> 32768 8.56 8.62 6.34 6.20 6.20 6.19 7.18 8.10
>>>> 65536 9.41 9.40 7.14 7.15 7.15 7.19 7.96 8.89
>>>> 262144 10.17 10.19 7.26 7.90 7.98 8.05 9.46 10.30
>>>> 524288 10.40 9.95 7.25 7.93 8.02 8.76 9.55 10.30
>>>>
>>>> --------------------
>>>> #2 push_0_pull_1 = 1 (dst node CPUs are used):
>>>>
>>>> #2.1 THP=Never 4KB (GB/s):
>>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>>> 512 1.28 1.36 2.01 2.74 2.33 2.31 2.53 2.96
>>>> 4096 2.40 2.84 2.94 3.04 3.40 3.23 3.31 4.16
>>>> 8192 3.18 3.27 3.34 3.94 3.77 3.68 4.23 4.76
>>>> 16348 3.17 3.42 3.66 3.21 3.82 4.40 4.76 4.89
>>>>
>>>> #2.2 THP=Always 2MB (GB/s):
>>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>>> 512 4.31 5.91 4.03 3.73 4.26 4.13 4.78 3.44
>>>> 1024 7.13 6.83 4.60 5.13 5.03 5.19 5.94 7.25
>>>> 2048 5.26 7.09 5.20 5.69 5.83 5.73 6.85 8.13
>>>> 4096 9.93 9.31 4.90 4.82 4.82 5.26 8.46 8.52
>>>> 8192 6.47 7.63 5.66 5.85 5.75 6.14 7.45 8.63
>>>> 16348 7.66 10.00 6.35 6.54 6.66 6.99 8.18 10.21
>>>> 32768 8.56 9.78 7.06 7.41 7.76 9.02 9.55 11.92
>>>> 65536 9.41 10.00 8.19 9.20 9.32 8.68 11.00 13.31
>>>> 262144 10.17 11.17 9.01 9.96 9.99 10.00 11.70 14.27
>>>> 524288 10.40 11.38 9.07 9.98 10.01 10.09 11.95 14.48
>>>>
>>>> Note:
>>>> 1. For THP = Never: I'm doing for 16X pages to keep total size same for your
>>>> experiment with 64KB pagesize)
>>>> 2. For THP = Always: nr_pages = Number of 4KB pages moved.
>>>> nr_pages=512 => 512 4KB pages => 1 2MB page)
>>>>
>>>>
>>>> I'm seeing little (1.5X in some cases) to no benefits. The performance scaling is
>>>> relatively flat across thread counts.
>>>>
>>>> Is it possible I'm missing something in my testing?
>>>>
>>>> Could the base page size difference (4KB vs 64KB) be playing a role in
>>>> the scaling behavior? How the performance varies with 4KB pages on your system?
>>>>
>>>> I'd be happy to work with you on investigating this differences.
>>>> Let me know if you'd like any additional test data or if there are specific
>>>> configurations I should try.
>>>
>>> The results surprises me, since I was able to achieve ~9GB/s when migrating
>>> 16 2MB THPs with 16 threads on a two socket system with Xeon E5-2650 v3 @ 2.30GHz
>>> (a 19.2GB/s bandwidth QPI link between two sockets) back in 2019[1].
>>> These are 10-year-old Haswell CPUs. And your results above show that EPYC 5 can
>>> only achieve ~4GB/s when migrating 512 2MB THPs with 16 threads. It just does
>>> not make sense.
>>>
>>> One thing you might want to try is to set init_on_alloc=0 in your boot
>>> parameters to use folio_zero_user() instead of GFP_ZERO to zero pages. That
>>> might reduce the time spent on page zeros.
>>>
>>> I am also going to rerun the experiments locally on x86_64 boxes to see if your
>>> results can be replicated.
>>>
>>> Thank you for the review and running these experiments. I really appreciate
>>> it.>
>>>
>>> [1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
>>>
>>
>> Using init_on_alloc=0 gave significant performance gain over the last experiment
>> but I'm still missing the performance scaling you observed.
>
> It might be the difference between x86 and ARM64, but I am not 100% sure.
> Based on your data below, 2 or 4 threads seem to the sweep spot for
> the multi-threaded method on AMD CPUs. BTW, what is the bandwidth between
> two sockets in your system? From Figure 10 in [1], I see the InfiniteBand
> between two AMD EPYC 7601 @ 2.2GHz was measured at ~12GB/s unidirectional,
> ~25GB/s bidirectional. I wonder if your results below are cross-socket
> link bandwidth limited.
>
> From my results, NVIDIA Grace CPU can achieve high copy throughput
> with more threads between two sockets, maybe part of the reason is that
> its cross-socket link theoretical bandwidth is 900GB/s bidirectional.
I talked to my colleague about this and he mentioned the CCD architecture
on AMD CPUs. IIUC, one or two cores from one CCD can already saturate
the CCD’s outgoing bandwidth, and all CPUs are enumerated from one CCD to
the next. This means my naive scheduling algorithm, which uses CPUs 0 through
N-1 for N threads, exhausts all cores from one CCD first, then moves to another
CCD. It is not able to saturate the cross-socket bandwidth. Does that make
sense to you?
If yes, can you please change my CPU selection code in mm/copy_pages.c:
+ /* TODO: need a better cpu selection method */
+ for_each_cpu(cpu, per_node_cpumask) {
+ if (i >= total_mt_num)
+ break;
+ cpu_id_list[i] = cpu;
+ ++i;
+ }
to select CPUs from as many CCDs as possible and rerun the tests.
That might boost the page migration throughput on AMD CPUs more.
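For example, something along these lines (a rough, untested sketch; it assumes CPUs
within a node are enumerated CCD by CCD, so striding through the node cpumask lands
the worker threads on different CCDs):

+       /* Sketch: stride through the node's cpumask instead of taking the
+        * first total_mt_num CPUs, so the selected CPUs spread across CCDs.
+        * A real version would also wrap around to fill any remaining slots.
+        */
+       int nr_cpus = cpumask_weight(per_node_cpumask);
+       int stride = max(1, nr_cpus / total_mt_num);
+       int cpu = cpumask_first(per_node_cpumask);
+       int i = 0, j;
+
+       while (i < total_mt_num && cpu < nr_cpu_ids) {
+               cpu_id_list[i++] = cpu;
+               /* advance by stride CPUs to hop to (hopefully) another CCD */
+               for (j = 0; j < stride && cpu < nr_cpu_ids; j++)
+                       cpu = cpumask_next(cpu, per_node_cpumask);
+       }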
Thanks.
>>
>> THP Never
>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>> 512 1.40 1.43 2.79 3.48 3.63 3.73 3.63 3.57
>> 4096 2.54 3.32 3.18 4.65 4.83 5.11 5.39 5.78
>> 8192 3.35 4.40 4.39 4.71 3.63 5.04 5.33 6.00
>> 16348 3.76 4.50 4.44 5.33 5.41 5.41 6.47 6.41
>>
>> THP Always
>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>> 512 5.21 5.47 5.77 6.92 3.71 2.75 7.54 7.44
>> 1024 6.10 7.65 8.12 8.41 8.87 8.55 9.13 11.36
>> 2048 6.39 6.66 9.58 8.92 10.75 12.99 13.33 12.23
>> 4096 7.33 10.85 8.22 13.57 11.43 10.93 12.53 16.86
>> 8192 7.26 7.46 8.88 11.82 10.55 10.94 13.27 14.11
>> 16348 9.07 8.53 11.82 14.89 12.97 13.22 16.14 18.10
>> 32768 10.45 10.55 11.79 19.19 16.85 17.56 20.58 26.57
>> 65536 11.00 11.12 13.25 18.27 16.18 16.11 19.61 27.73
>> 262144 12.37 12.40 15.65 20.00 19.25 19.38 22.60 31.95
>> 524288 12.44 12.33 15.66 19.78 19.06 18.96 23.31 32.29
>
> [1] https://www.dell.com/support/kbdoc/en-us/000143393/amd-epyc-stream-hpl-infiniband-and-wrf-performance-study
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
2025-01-10 17:05 ` Zi Yan
@ 2025-01-10 19:51 ` Zi Yan
2025-01-16 4:57 ` Shivank Garg
0 siblings, 1 reply; 30+ messages in thread
From: Zi Yan @ 2025-01-10 19:51 UTC (permalink / raw)
To: Shivank Garg
Cc: linux-mm, David Rientjes, Aneesh Kumar, David Hildenbrand,
John Hubbard, Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
Lei Chen, Shukla, Santosh, Grimm, Jon, sj, shy828301,
Liam Howlett, Gregory Price, Huang, Ying
On 10 Jan 2025, at 12:05, Zi Yan wrote:
> <snip>
>>>
>>>>> main() {
>>>>> ...
>>>>>
>>>>> // code snippet to measure throughput
>>>>> clock_gettime(CLOCK_MONOTONIC, &t1);
>>>>> retcode = move_pages(getpid(), num_pages, pages, nodesArray , statusArray, MPOL_MF_MOVE);
>>>>> clock_gettime(CLOCK_MONOTONIC, &t2);
>>>>>
>>>>> // tput = num_pages*PAGE_SIZE/(t2-t1)
>>>>>
>>>>> ...
>>>>> }
>>>>>
>>>>>
>>>>> Measurements:
>>>>> ============
>>>>> vanilla: base kernel without patchset
>>>>> mt:0 = MT kernel with use_mt_copy=0
>>>>> mt:1..mt:32 = MT kernel with use_mt_copy=1 and thread cnt = 1,2,...,32
>>>>>
>>>>> Measured for both configuration push_0_pull_1=0 and push_0_pull_1=1 and
>>>>> for 4KB migration and THP migration.
>>>>>
>>>>> --------------------
>>>>> #1 push_0_pull_1 = 0 (src node CPUs are used)
>>>>>
>>>>> #1.1 THP=Never, 4KB (GB/s):
>>>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>>>> 512 1.28 1.28 1.92 1.80 2.24 2.35 2.22 2.17
>>>>> 4096 2.40 2.40 2.51 2.58 2.83 2.72 2.99 3.25
>>>>> 8192 3.18 2.88 2.83 2.69 3.49 3.46 3.57 3.80
>>>>> 16348 3.17 2.94 2.96 3.17 3.63 3.68 4.06 4.15
>>>>>
>>>>> #1.2 THP=Always, 2MB (GB/s):
>>>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>>>> 512 4.31 5.02 3.39 3.40 3.33 3.51 3.91 4.03
>>>>> 1024 7.13 4.49 3.58 3.56 3.91 3.87 4.39 4.57
>>>>> 2048 5.26 6.47 3.91 4.00 3.71 3.85 4.97 6.83
>>>>> 4096 9.93 7.77 4.58 3.79 3.93 3.53 6.41 4.77
>>>>> 8192 6.47 6.33 4.37 4.67 4.52 4.39 5.30 5.37
>>>>> 16348 7.66 8.00 5.20 5.22 5.24 5.28 6.41 7.02
>>>>> 32768 8.56 8.62 6.34 6.20 6.20 6.19 7.18 8.10
>>>>> 65536 9.41 9.40 7.14 7.15 7.15 7.19 7.96 8.89
>>>>> 262144 10.17 10.19 7.26 7.90 7.98 8.05 9.46 10.30
>>>>> 524288 10.40 9.95 7.25 7.93 8.02 8.76 9.55 10.30
>>>>>
>>>>> --------------------
>>>>> #2 push_0_pull_1 = 1 (dst node CPUs are used):
>>>>>
>>>>> #2.1 THP=Never 4KB (GB/s):
>>>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>>>> 512 1.28 1.36 2.01 2.74 2.33 2.31 2.53 2.96
>>>>> 4096 2.40 2.84 2.94 3.04 3.40 3.23 3.31 4.16
>>>>> 8192 3.18 3.27 3.34 3.94 3.77 3.68 4.23 4.76
>>>>> 16348 3.17 3.42 3.66 3.21 3.82 4.40 4.76 4.89
>>>>>
>>>>> #2.2 THP=Always 2MB (GB/s):
>>>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>>>> 512 4.31 5.91 4.03 3.73 4.26 4.13 4.78 3.44
>>>>> 1024 7.13 6.83 4.60 5.13 5.03 5.19 5.94 7.25
>>>>> 2048 5.26 7.09 5.20 5.69 5.83 5.73 6.85 8.13
>>>>> 4096 9.93 9.31 4.90 4.82 4.82 5.26 8.46 8.52
>>>>> 8192 6.47 7.63 5.66 5.85 5.75 6.14 7.45 8.63
>>>>> 16348 7.66 10.00 6.35 6.54 6.66 6.99 8.18 10.21
>>>>> 32768 8.56 9.78 7.06 7.41 7.76 9.02 9.55 11.92
>>>>> 65536 9.41 10.00 8.19 9.20 9.32 8.68 11.00 13.31
>>>>> 262144 10.17 11.17 9.01 9.96 9.99 10.00 11.70 14.27
>>>>> 524288 10.40 11.38 9.07 9.98 10.01 10.09 11.95 14.48
>>>>>
>>>>> Note:
>>>>> 1. For THP = Never: I'm doing for 16X pages to keep total size same for your
>>>>> experiment with 64KB pagesize)
>>>>> 2. For THP = Always: nr_pages = Number of 4KB pages moved.
>>>>> nr_pages=512 => 512 4KB pages => 1 2MB page)
>>>>>
>>>>>
>>>>> I'm seeing little (1.5X in some cases) to no benefits. The performance scaling is
>>>>> relatively flat across thread counts.
>>>>>
>>>>> Is it possible I'm missing something in my testing?
>>>>>
>>>>> Could the base page size difference (4KB vs 64KB) be playing a role in
>>>>> the scaling behavior? How the performance varies with 4KB pages on your system?
>>>>>
>>>>> I'd be happy to work with you on investigating this differences.
>>>>> Let me know if you'd like any additional test data or if there are specific
>>>>> configurations I should try.
>>>>
>>>> The results surprises me, since I was able to achieve ~9GB/s when migrating
>>>> 16 2MB THPs with 16 threads on a two socket system with Xeon E5-2650 v3 @ 2.30GHz
>>>> (a 19.2GB/s bandwidth QPI link between two sockets) back in 2019[1].
>>>> These are 10-year-old Haswell CPUs. And your results above show that EPYC 5 can
>>>> only achieve ~4GB/s when migrating 512 2MB THPs with 16 threads. It just does
>>>> not make sense.
>>>>
>>>> One thing you might want to try is to set init_on_alloc=0 in your boot
>>>> parameters to use folio_zero_user() instead of GFP_ZERO to zero pages. That
>>>> might reduce the time spent on page zeros.
>>>>
>>>> I am also going to rerun the experiments locally on x86_64 boxes to see if your
>>>> results can be replicated.
>>>>
>>>> Thank you for the review and running these experiments. I really appreciate
>>>> it.>
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
>>>>
>>>
>>> Using init_on_alloc=0 gave significant performance gain over the last experiment
>>> but I'm still missing the performance scaling you observed.
>>
>> It might be the difference between x86 and ARM64, but I am not 100% sure.
>> Based on your data below, 2 or 4 threads seem to the sweep spot for
>> the multi-threaded method on AMD CPUs. BTW, what is the bandwidth between
>> two sockets in your system? From Figure 10 in [1], I see the InfiniteBand
>> between two AMD EPYC 7601 @ 2.2GHz was measured at ~12GB/s unidirectional,
>> ~25GB/s bidirectional. I wonder if your results below are cross-socket
>> link bandwidth limited.
>>
>> From my results, NVIDIA Grace CPU can achieve high copy throughput
>> with more threads between two sockets, maybe part of the reason is that
>> its cross-socket link theoretical bandwidth is 900GB/s bidirectional.
>
> I talked to my colleague about this and he mentioned about CCD architecture
> on AMD CPUs. IIUC, one or two cores from one CCD can already saturate
> the CCD’s outgoing bandwidth and all CPUs are enumerated from one CCD to
> another. This means my naive scheduling algorithm, which use CPUs from
> 0 to N threads, uses all cores from one CDD first, then move to another
> CCD. It is not able to saturate the cross-socket bandwidth. Does it make
> sense to you?
>
> If yes, can you please change the my cpu selection code in mm/copy_pages.c:
>
> + /* TODO: need a better cpu selection method */
> + for_each_cpu(cpu, per_node_cpumask) {
> + if (i >= total_mt_num)
> + break;
> + cpu_id_list[i] = cpu;
> + ++i;
> + }
>
> to select CPUs from as many CCDs as possible and rerun the tests.
> That might boost the page migration throughput on AMD CPUs more.
>
> Thanks.
>
>>>
>>> THP Never
>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>> 512 1.40 1.43 2.79 3.48 3.63 3.73 3.63 3.57
>>> 4096 2.54 3.32 3.18 4.65 4.83 5.11 5.39 5.78
>>> 8192 3.35 4.40 4.39 4.71 3.63 5.04 5.33 6.00
>>> 16348 3.76 4.50 4.44 5.33 5.41 5.41 6.47 6.41
>>>
>>> THP Always
>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>> 512 5.21 5.47 5.77 6.92 3.71 2.75 7.54 7.44
>>> 1024 6.10 7.65 8.12 8.41 8.87 8.55 9.13 11.36
>>> 2048 6.39 6.66 9.58 8.92 10.75 12.99 13.33 12.23
>>> 4096 7.33 10.85 8.22 13.57 11.43 10.93 12.53 16.86
>>> 8192 7.26 7.46 8.88 11.82 10.55 10.94 13.27 14.11
>>> 16348 9.07 8.53 11.82 14.89 12.97 13.22 16.14 18.10
>>> 32768 10.45 10.55 11.79 19.19 16.85 17.56 20.58 26.57
>>> 65536 11.00 11.12 13.25 18.27 16.18 16.11 19.61 27.73
>>> 262144 12.37 12.40 15.65 20.00 19.25 19.38 22.60 31.95
>>> 524288 12.44 12.33 15.66 19.78 19.06 18.96 23.31 32.29
>>
>> [1] https://www.dell.com/support/kbdoc/en-us/000143393/amd-epyc-stream-hpl-infiniband-and-wrf-performance-study
BTW, I reran the experiments on a two-socket Xeon E5-2650 v4 @ 2.20GHz system with the pull method.
The 4KB results are not very impressive, at most 60% more throughput, but 2MB can reach ~6.5x the
vanilla kernel throughput using 8 or 16 threads.
4KB (GB/s)
| ---- | ------- | ---- | ---- | ---- | ---- | ----- |
| | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
| ---- | ------- | ---- | ---- | ---- | ---- | ----- |
| 512 | 1.12 | 1.19 | 1.20 | 1.26 | 1.27 | 1.35 |
| 768 | 1.29 | 1.14 | 1.28 | 1.40 | 1.39 | 1.46 |
| 1024 | 1.19 | 1.25 | 1.34 | 1.51 | 1.52 | 1.53 |
| 2048 | 1.14 | 1.12 | 1.44 | 1.61 | 1.73 | 1.71 |
| 4096 | 1.09 | 1.14 | 1.46 | 1.64 | 1.81 | 1.78 |
2MB (GB/s)
| ---- | ------- | ---- | ---- | ----- | ----- | ----- |
| | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
| ---- | ------- | ---- | ---- | ----- | ----- | ----- |
| 1 | 2.03 | 2.21 | 2.69 | 2.93 | 3.17 | 3.14 |
| 2 | 2.28 | 2.13 | 3.54 | 4.50 | 4.72 | 4.72 |
| 4 | 2.92 | 2.93 | 4.44 | 6.50 | 7.24 | 7.06 |
| 8 | 2.29 | 2.37 | 3.21 | 6.86 | 8.83 | 8.44 |
| 16 | 2.10 | 2.09 | 4.57 | 8.06 | 8.32 | 9.70 |
| 32 | 2.22 | 2.21 | 4.43 | 8.96 | 9.37 | 11.54 |
| 64 | 2.35 | 2.35 | 3.15 | 7.77 | 10.77 | 13.61 |
| 128 | 2.48 | 2.53 | 5.12 | 8.18 | 11.01 | 15.62 |
| 256 | 2.55 | 2.53 | 5.44 | 8.25 | 12.73 | 16.49 |
| 512 | 2.61 | 2.52 | 5.73 | 11.26 | 17.18 | 16.97 |
| 768 | 2.55 | 2.53 | 5.90 | 11.41 | 14.86 | 17.15 |
| 1024 | 2.56 | 2.52 | 5.99 | 11.46 | 16.77 | 17.25 |
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
2025-01-10 19:51 ` Zi Yan
@ 2025-01-16 4:57 ` Shivank Garg
2025-01-21 6:15 ` Shivank Garg
0 siblings, 1 reply; 30+ messages in thread
From: Shivank Garg @ 2025-01-16 4:57 UTC (permalink / raw)
To: Zi Yan
Cc: linux-mm, David Rientjes, Aneesh Kumar, David Hildenbrand,
John Hubbard, Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
Lei Chen, Shukla, Santosh, Grimm, Jon, sj, shy828301,
Liam Howlett, Gregory Price, Huang, Ying
On 1/11/2025 1:21 AM, Zi Yan wrote:
<snip>
>>> BTW, I notice that you called dmaengine_get_dma_device() in folios_copy_dma(),
>>> which would incur a huge overhead, based on my past experience using DMA engine
>>> for page copy. I know it is needed to make sure DMA is still present, but
>>> its cost needs to be minimized to make DMA folio copy usable. Otherwise,
>>> the 768MB/s DMA copy throughput from your cover letter cannot convince people
>>> to use it for page migration, since single CPU can achieve more than that,
>>> as you showed in the table below.
Thank you for pointing this out.
I'm learning about DMAEngine and will look more into the DMA driver part.
>>>> Using init_on_alloc=0 gave significant performance gain over the last experiment
>>>> but I'm still missing the performance scaling you observed.
>>>
>>> It might be the difference between x86 and ARM64, but I am not 100% sure.
>>> Based on your data below, 2 or 4 threads seem to the sweep spot for
>>> the multi-threaded method on AMD CPUs. BTW, what is the bandwidth between
>>> two sockets in your system? From Figure 10 in [1], I see the InfiniteBand
>>> between two AMD EPYC 7601 @ 2.2GHz was measured at ~12GB/s unidirectional,
>>> ~25GB/s bidirectional. I wonder if your results below are cross-socket
>>> link bandwidth limited.
I tested the cross-socket bandwidth on my EPYC Zen 5 system and am easily getting >10x
that bandwidth. I don't think bandwidth is an issue here.
>>>
>>> From my results, NVIDIA Grace CPU can achieve high copy throughput
>>> with more threads between two sockets, maybe part of the reason is that
>>> its cross-socket link theoretical bandwidth is 900GB/s bidirectional.
>>
>> I talked to my colleague about this and he mentioned about CCD architecture
>> on AMD CPUs. IIUC, one or two cores from one CCD can already saturate
>> the CCD’s outgoing bandwidth and all CPUs are enumerated from one CCD to
>> another. This means my naive scheduling algorithm, which use CPUs from
>> 0 to N threads, uses all cores from one CDD first, then move to another
>> CCD. It is not able to saturate the cross-socket bandwidth. Does it make
>> sense to you?
>>
>> If yes, can you please change the my cpu selection code in mm/copy_pages.c:
This makes sense.
I first tried distributing the worker threads across different CCDs, which yielded
better results.
Also, I switched my system to the NPS-2 config (2 NUMA nodes per socket). This was done
to eliminate cross-socket links and related variables by focusing on intra-socket
page migrations.
Cross-Socket (Node 0 -> Node 2)
THP Always (2 MB pages, GB/s)
nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
262144 12.37 12.52 15.72 24.94 30.40 33.23 34.68 29.67
524288 12.44 12.19 15.70 24.96 32.72 33.40 35.40 29.18
Intra-Socket (Node 0 -> Node 1)
nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
262144 12.37 17.10 18.65 26.05 35.56 37.80 33.73 29.29
524288 12.44 16.73 18.87 24.34 35.63 37.49 33.79 29.76
I have temporarily hardcoded the CPU assignments and will work on improving the
CPU selection code.
>>
>> + /* TODO: need a better cpu selection method */
>> + for_each_cpu(cpu, per_node_cpumask) {
>> + if (i >= total_mt_num)
>> + break;
>> + cpu_id_list[i] = cpu;
>> + ++i;
>> + }
>>
>> to select CPUs from as many CCDs as possible and rerun the tests.
>> That might boost the page migration throughput on AMD CPUs more.
>>
>> Thanks.
>>
>>>>
>>>> THP Never
>>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>>> 512 1.40 1.43 2.79 3.48 3.63 3.73 3.63 3.57
>>>> 4096 2.54 3.32 3.18 4.65 4.83 5.11 5.39 5.78
>>>> 8192 3.35 4.40 4.39 4.71 3.63 5.04 5.33 6.00
>>>> 16348 3.76 4.50 4.44 5.33 5.41 5.41 6.47 6.41
>>>>
>>>> THP Always
>>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>>> 512 5.21 5.47 5.77 6.92 3.71 2.75 7.54 7.44
>>>> 1024 6.10 7.65 8.12 8.41 8.87 8.55 9.13 11.36
>>>> 2048 6.39 6.66 9.58 8.92 10.75 12.99 13.33 12.23
>>>> 4096 7.33 10.85 8.22 13.57 11.43 10.93 12.53 16.86
>>>> 8192 7.26 7.46 8.88 11.82 10.55 10.94 13.27 14.11
>>>> 16348 9.07 8.53 11.82 14.89 12.97 13.22 16.14 18.10
>>>> 32768 10.45 10.55 11.79 19.19 16.85 17.56 20.58 26.57
>>>> 65536 11.00 11.12 13.25 18.27 16.18 16.11 19.61 27.73
>>>> 262144 12.37 12.40 15.65 20.00 19.25 19.38 22.60 31.95
>>>> 524288 12.44 12.33 15.66 19.78 19.06 18.96 23.31 32.29
>>>
>>> [1] https://www.dell.com/support/kbdoc/en-us/000143393/amd-epyc-stream-hpl-infiniband-and-wrf-performance-study
>
>
> BTW, I rerun the experiments on a two socket Xeon E5-2650 v4 @ 2.20GHz system with pull method.
> The 4KB is not very impressive, at most 60% more throughput, but 2MB can get ~6.5x of
> vanilla kernel throughput using 8 or 16 threads.
>
>
> 4KB (GB/s)
>
> | ---- | ------- | ---- | ---- | ---- | ---- | ----- |
> | | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
> | ---- | ------- | ---- | ---- | ---- | ---- | ----- |
> | 512 | 1.12 | 1.19 | 1.20 | 1.26 | 1.27 | 1.35 |
> | 768 | 1.29 | 1.14 | 1.28 | 1.40 | 1.39 | 1.46 |
> | 1024 | 1.19 | 1.25 | 1.34 | 1.51 | 1.52 | 1.53 |
> | 2048 | 1.14 | 1.12 | 1.44 | 1.61 | 1.73 | 1.71 |
> | 4096 | 1.09 | 1.14 | 1.46 | 1.64 | 1.81 | 1.78 |
>
>
>
> 2MB (GB/s)
> | ---- | ------- | ---- | ---- | ----- | ----- | ----- |
> | | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
> | ---- | ------- | ---- | ---- | ----- | ----- | ----- |
> | 1 | 2.03 | 2.21 | 2.69 | 2.93 | 3.17 | 3.14 |
> | 2 | 2.28 | 2.13 | 3.54 | 4.50 | 4.72 | 4.72 |
> | 4 | 2.92 | 2.93 | 4.44 | 6.50 | 7.24 | 7.06 |
> | 8 | 2.29 | 2.37 | 3.21 | 6.86 | 8.83 | 8.44 |
> | 16 | 2.10 | 2.09 | 4.57 | 8.06 | 8.32 | 9.70 |
> | 32 | 2.22 | 2.21 | 4.43 | 8.96 | 9.37 | 11.54 |
> | 64 | 2.35 | 2.35 | 3.15 | 7.77 | 10.77 | 13.61 |
> | 128 | 2.48 | 2.53 | 5.12 | 8.18 | 11.01 | 15.62 |
> | 256 | 2.55 | 2.53 | 5.44 | 8.25 | 12.73 | 16.49 |
> | 512 | 2.61 | 2.52 | 5.73 | 11.26 | 17.18 | 16.97 |
> | 768 | 2.55 | 2.53 | 5.90 | 11.41 | 14.86 | 17.15 |
> | 1024 | 2.56 | 2.52 | 5.99 | 11.46 | 16.77 | 17.25 |
>
I see, thank you for checking.
Meanwhile, I'll continue to explore performance optimization avenues.
Best Regards,
Shivank
>
>
> Best Regards,
> Yan, Zi
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
2025-01-16 4:57 ` Shivank Garg
@ 2025-01-21 6:15 ` Shivank Garg
0 siblings, 0 replies; 30+ messages in thread
From: Shivank Garg @ 2025-01-21 6:15 UTC (permalink / raw)
To: Zi Yan
Cc: linux-mm, David Rientjes, Aneesh Kumar, David Hildenbrand,
John Hubbard, Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
Lei Chen, Shukla, Santosh, Grimm, Jon, sj, shy828301,
Liam Howlett, Gregory Price, Huang, Ying
On 1/16/2025 10:27 AM, Shivank Garg wrote:
> On 1/11/2025 1:21 AM, Zi Yan wrote:
> <snip>
>
>
>>>> BTW, I notice that you called dmaengine_get_dma_device() in folios_copy_dma(),
>>>> which would incur a huge overhead, based on my past experience using DMA engine
>>>> for page copy. I know it is needed to make sure DMA is still present, but
>>>> its cost needs to be minimized to make DMA folio copy usable. Otherwise,
>>>> the 768MB/s DMA copy throughput from your cover letter cannot convince people
>>>> to use it for page migration, since single CPU can achieve more than that,
>>>> as you showed in the table below.
>
> Thank you for pointing this.
> I'm learning about DMAEngine and will look more into DMA driver part.
>
>>>>> Using init_on_alloc=0 gave significant performance gain over the last experiment
>>>>> but I'm still missing the performance scaling you observed.
>>>>
>>>> It might be the difference between x86 and ARM64, but I am not 100% sure.
>>>> Based on your data below, 2 or 4 threads seem to the sweep spot for
>>>> the multi-threaded method on AMD CPUs. BTW, what is the bandwidth between
>>>> two sockets in your system? From Figure 10 in [1], I see the InfiniteBand
>>>> between two AMD EPYC 7601 @ 2.2GHz was measured at ~12GB/s unidirectional,
>>>> ~25GB/s bidirectional. I wonder if your results below are cross-socket
>>>> link bandwidth limited.
>
> I tested the cross-socket bandwidth on my EPYC Zen 5 system and easily getting >10X
> bandwidth as this. I don't think BW is a issue here.
>
>
>>>>
>>>> From my results, NVIDIA Grace CPU can achieve high copy throughput
>>>> with more threads between two sockets, maybe part of the reason is that
>>>> its cross-socket link theoretical bandwidth is 900GB/s bidirectional.
>>>
>>> I talked to my colleague about this and he mentioned about CCD architecture
>>> on AMD CPUs. IIUC, one or two cores from one CCD can already saturate
>>> the CCD’s outgoing bandwidth and all CPUs are enumerated from one CCD to
>>> another. This means my naive scheduling algorithm, which use CPUs from
>>> 0 to N threads, uses all cores from one CDD first, then move to another
>>> CCD. It is not able to saturate the cross-socket bandwidth. Does it make
>>> sense to you?
>>>
>>> If yes, can you please change the my cpu selection code in mm/copy_pages.c:
>
> This is making sense.
>
> I first tried distributing work threads across different CCDs, which yielded
> better results.
>
> Also, I switched my system to NPS-2 config (2 Nodes per socket). This was done
> to eliminate cross-socket connections and variables by focusing on intra-socket
> page migrations.
>
> Cross-Socket (Node 0 -> Node 2)
> THP Always (2 MB pages)
>
> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
> 262144 12.37 12.52 15.72 24.94 30.40 33.23 34.68 29.67
> 524288 12.44 12.19 15.70 24.96 32.72 33.40 35.40 29.18
>
> Intra-Socket (Node 0 -> Node 1)
> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
> 262144 12.37 17.10 18.65 26.05 35.56 37.80 33.73 29.29
> 524288 12.44 16.73 18.87 24.34 35.63 37.49 33.79 29.76
>
> I have temporarily hardcoded the CPU assignments and will work on improving the
> CPU selection code.
>>>
>>> + /* TODO: need a better cpu selection method */
>>> + for_each_cpu(cpu, per_node_cpumask) {
>>> + if (i >= total_mt_num)
>>> + break;
>>> + cpu_id_list[i] = cpu;
>>> + ++i;
>>> + }
>>>
>>> to select CPUs from as many CCDs as possible and rerun the tests.
>>> That might boost the page migration throughput on AMD CPUs more.
>>>
>>> Thanks.
>>>
>>>>>
>>>>> THP Never
>>>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>>>> 512 1.40 1.43 2.79 3.48 3.63 3.73 3.63 3.57
>>>>> 4096 2.54 3.32 3.18 4.65 4.83 5.11 5.39 5.78
>>>>> 8192 3.35 4.40 4.39 4.71 3.63 5.04 5.33 6.00
>>>>> 16348 3.76 4.50 4.44 5.33 5.41 5.41 6.47 6.41
>>>>>
>>>>> THP Always
>>>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>>>> 512 5.21 5.47 5.77 6.92 3.71 2.75 7.54 7.44
>>>>> 1024 6.10 7.65 8.12 8.41 8.87 8.55 9.13 11.36
>>>>> 2048 6.39 6.66 9.58 8.92 10.75 12.99 13.33 12.23
>>>>> 4096 7.33 10.85 8.22 13.57 11.43 10.93 12.53 16.86
>>>>> 8192 7.26 7.46 8.88 11.82 10.55 10.94 13.27 14.11
>>>>> 16348 9.07 8.53 11.82 14.89 12.97 13.22 16.14 18.10
>>>>> 32768 10.45 10.55 11.79 19.19 16.85 17.56 20.58 26.57
>>>>> 65536 11.00 11.12 13.25 18.27 16.18 16.11 19.61 27.73
>>>>> 262144 12.37 12.40 15.65 20.00 19.25 19.38 22.60 31.95
>>>>> 524288 12.44 12.33 15.66 19.78 19.06 18.96 23.31 32.29
>>>>
>>>> [1] https://www.dell.com/support/kbdoc/en-us/000143393/amd-epyc-stream-hpl-infiniband-and-wrf-performance-study
>>
>>
>> BTW, I rerun the experiments on a two socket Xeon E5-2650 v4 @ 2.20GHz system with pull method.
>> The 4KB is not very impressive, at most 60% more throughput, but 2MB can get ~6.5x of
>> vanilla kernel throughput using 8 or 16 threads.
>>
>>
>> 4KB (GB/s)
>>
>> | ---- | ------- | ---- | ---- | ---- | ---- | ----- |
>> | | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
>> | ---- | ------- | ---- | ---- | ---- | ---- | ----- |
>> | 512 | 1.12 | 1.19 | 1.20 | 1.26 | 1.27 | 1.35 |
>> | 768 | 1.29 | 1.14 | 1.28 | 1.40 | 1.39 | 1.46 |
>> | 1024 | 1.19 | 1.25 | 1.34 | 1.51 | 1.52 | 1.53 |
>> | 2048 | 1.14 | 1.12 | 1.44 | 1.61 | 1.73 | 1.71 |
>> | 4096 | 1.09 | 1.14 | 1.46 | 1.64 | 1.81 | 1.78 |
>>
>>
>>
>> 2MB (GB/s)
>> | ---- | ------- | ---- | ---- | ----- | ----- | ----- |
>> | | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
>> | ---- | ------- | ---- | ---- | ----- | ----- | ----- |
>> | 1 | 2.03 | 2.21 | 2.69 | 2.93 | 3.17 | 3.14 |
>> | 2 | 2.28 | 2.13 | 3.54 | 4.50 | 4.72 | 4.72 |
>> | 4 | 2.92 | 2.93 | 4.44 | 6.50 | 7.24 | 7.06 |
>> | 8 | 2.29 | 2.37 | 3.21 | 6.86 | 8.83 | 8.44 |
>> | 16 | 2.10 | 2.09 | 4.57 | 8.06 | 8.32 | 9.70 |
>> | 32 | 2.22 | 2.21 | 4.43 | 8.96 | 9.37 | 11.54 |
>> | 64 | 2.35 | 2.35 | 3.15 | 7.77 | 10.77 | 13.61 |
>> | 128 | 2.48 | 2.53 | 5.12 | 8.18 | 11.01 | 15.62 |
>> | 256 | 2.55 | 2.53 | 5.44 | 8.25 | 12.73 | 16.49 |
>> | 512 | 2.61 | 2.52 | 5.73 | 11.26 | 17.18 | 16.97 |
>> | 768 | 2.55 | 2.53 | 5.90 | 11.41 | 14.86 | 17.15 |
>> | 1024 | 2.56 | 2.52 | 5.99 | 11.46 | 16.77 | 17.25 |
>>
>
> I see, thank you for checking.
>
> Meanwhile, I'll continue to explore for performance optimization
> avenues.
Hi Zi,
I experimented with your test case[1] and got a 2-2.5x throughput gain over my previous
experiment. Multi-threading scaling for 32 threads is ~4x (slightly higher than in my
previous experiment).
The main difference between our move_pages() throughput benchmarks is that you
explicitly request THP using aligned_alloc() and MADV_HUGEPAGE, whereas I rely on
the system THP policy and operate on 4KB boundaries.
While both methods use THP and I expected similar performance, throughput is lower
in my case because, in my test code, the kernel migrates all 512 4KB folios within a
2MB region on the first migration attempt. For the subsequent folios,
__add_folio_for_migration() returns early with folio_nid(folio) == node, as the pages
are already on the target node.
This adds the extra overhead of vma_lookup() and folio_walk_start() in my experiment.
2MB pages (GB/s):
nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
1 10.74 11.04 4.68 8.17 6.47 6.09 3.97 6.20
2 12.44 4.90 11.19 14.10 15.33 8.45 10.09 9.97
4 14.82 9.80 11.93 18.35 21.82 17.09 10.53 7.51
8 16.13 9.91 15.26 11.85 26.53 13.09 12.71 13.75
16 15.99 8.81 13.84 22.43 33.89 11.91 12.30 13.26
32 14.03 11.37 17.54 23.96 57.07 18.78 19.51 21.29
64 15.79 9.55 22.19 33.17 57.18 65.51 55.39 62.53
128 18.22 16.65 21.49 30.73 52.99 61.05 58.44 60.38
256 19.78 20.56 24.72 34.94 56.73 71.11 61.83 62.77
512 20.27 21.40 27.47 39.23 65.72 67.97 70.48 71.39
1024 20.48 21.48 27.48 38.30 68.62 77.94 78.00 78.95
[1]: https://github.com/x-y-z/thp-migration-bench/blob/arm64/move_thp.c
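For anyone following along, the explicit-THP setup in [1] boils down to roughly the
following. This is a simplified userspace sketch, not the exact test code; error
handling is trimmed, the target node and sizes are illustrative, and it links
against libnuma (cc move_thp_sketch.c -lnuma):

#include <numaif.h>     /* move_pages(), MPOL_MF_MOVE */
#include <sys/mman.h>   /* madvise(), MADV_HUGEPAGE */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define THP_SIZE        (2UL << 20)
#define NR_THPS         16              /* illustrative */

int main(void)
{
        size_t len = NR_THPS * THP_SIZE;
        /* 2MB-aligned buffer so each 2MB chunk can be backed by one THP */
        char *buf = aligned_alloc(THP_SIZE, len);
        void *pages[NR_THPS];
        int nodes[NR_THPS], status[NR_THPS];

        madvise(buf, len, MADV_HUGEPAGE);
        memset(buf, 1, len);            /* fault the THPs in */

        for (int i = 0; i < NR_THPS; i++) {
                pages[i] = buf + i * THP_SIZE;  /* one address per 2MB THP */
                nodes[i] = 1;                   /* illustrative target node */
        }

        /* One entry per THP, so no redundant per-4KB vma_lookup()/folio walks */
        if (move_pages(getpid(), NR_THPS, pages, nodes, status, MPOL_MF_MOVE) < 0)
                perror("move_pages");

        free(buf);
        return 0;
}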
>
> Best Regards,
> Shivank
>>
>>
>> Best Regards,
>> Yan, Zi
>>
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
2025-01-03 17:24 [RFC PATCH 0/5] Accelerate page migration with batching and multi threads Zi Yan
` (7 preceding siblings ...)
2025-01-09 11:47 ` Shivank Garg
@ 2025-02-13 8:17 ` Byungchul Park
2025-02-13 15:36 ` Zi Yan
8 siblings, 1 reply; 30+ messages in thread
From: Byungchul Park @ 2025-02-13 8:17 UTC (permalink / raw)
To: Zi Yan
Cc: linux-mm, David Rientjes, Shivank Garg, Aneesh Kumar,
David Hildenbrand, John Hubbard, Kirill Shutemov, Matthew Wilcox,
Mel Gorman, Rao, Bharata Bhasker, Rik van Riel, RaghavendraKT,
Wei Xu, Suyeon Lee, Lei Chen, Shukla, Santosh, Grimm, Jon, sj,
shy828301, Liam Howlett, Gregory Price, Huang, Ying, kernel_team
On Fri, Jan 03, 2025 at 12:24:14PM -0500, Zi Yan wrote:
> Hi all,
Hi,
It'd be appreciated if you could cc me on the next version.
Byungchul
>
> This patchset accelerates page migration by batching folio copy operations and
> using multiple CPU threads and is based on Shivank's Enhancements to Page
> Migration with Batch Offloading via DMA patchset[1] and my original accelerate
> page migration patchset[2]. It is on top of mm-everything-2025-01-03-05-59.
> The last patch is for testing purpose and should not be considered.
>
> The motivations are:
>
> 1. Batching folio copy increases copy throughput. Especially for base page
> migrations, folio copy throughput is low since there are kernel activities like
> moving folio metadata and updating page table entries sit between two folio
> copies. And base page sizes are relatively small, 4KB on x86_64, ARM64
> and 64KB on ARM64.
>
> 2. Single CPU thread has limited copy throughput. Using multi threads is
> a natural extension to speed up folio copy, when DMA engine is NOT
> available in a system.
>
>
> Design
> ===
>
> It is based on Shivank's patchset and revise MIGRATE_SYNC_NO_COPY
> (renamed to MIGRATE_NO_COPY) to avoid folio copy operation inside
> migrate_folio_move() and perform them in one shot afterwards. A
> copy_page_lists_mt() function is added to use multi threads to copy
> folios from src list to dst list.
>
> Changes compared to Shivank's patchset (mainly rewrote batching folio
> copy code)
> ===
>
> 1. mig_info is removed, so no memory allocation is needed during
> batching folio copies. src->private is used to store old page state and
> anon_vma after folio metadata is copied from src to dst.
>
> 2. move_to_new_folio() and migrate_folio_move() are refactored to remove
> redundant code in migrate_folios_batch_move().
>
> 3. folio_mc_copy() is used for the single threaded copy code to keep the
> original kernel behavior.
>
>
> Performance
> ===
>
> I benchmarked move_pages() throughput on a two socket NUMA system with two
> NVIDIA Grace CPUs. The base page size is 64KB. Both 64KB page migration and 2MB
> mTHP page migration are measured.
>
> The tables below show move_pages() throughput with different
> configurations and different numbers of copied pages. The x-axis is the
> configurations, from vanilla Linux kernel to using 1, 2, 4, 8, 16, 32
> threads with this patchset applied. And the unit is GB/s.
>
> The 32-thread copy throughput can be up to 10x of single thread serial folio
> copy. Batching folio copy not only benefits huge page but also base
> page.
>
> 64KB (GB/s):
>
> vanilla mt_1 mt_2 mt_4 mt_8 mt_16 mt_32
> 32 5.43 4.90 5.65 7.31 7.60 8.61 6.43
> 256 6.95 6.89 9.28 14.67 22.41 23.39 23.93
> 512 7.88 7.26 10.15 17.53 27.82 27.88 33.93
> 768 7.65 7.42 10.46 18.59 28.65 29.67 30.76
> 1024 7.46 8.01 10.90 17.77 27.04 32.18 38.80
>
> 2MB mTHP (GB/s):
>
> vanilla mt_1 mt_2 mt_4 mt_8 mt_16 mt_32
> 1 5.94 2.90 6.90 8.56 11.16 8.76 6.41
> 2 7.67 5.57 7.11 12.48 17.37 15.68 14.10
> 4 8.01 6.04 10.25 20.14 22.52 27.79 25.28
> 8 8.42 7.00 11.41 24.73 33.96 32.62 39.55
> 16 9.41 6.91 12.23 27.51 43.95 49.15 51.38
> 32 10.23 7.15 13.03 29.52 49.49 69.98 71.51
> 64 9.40 7.37 13.88 30.38 52.00 76.89 79.41
> 128 8.59 7.23 14.20 28.39 49.98 78.27 90.18
> 256 8.43 7.16 14.59 28.14 48.78 76.88 92.28
> 512 8.31 7.78 14.40 26.20 43.31 63.91 75.21
> 768 8.30 7.86 14.83 27.41 46.25 69.85 81.31
> 1024 8.31 7.90 14.96 27.62 46.75 71.76 83.84
>
>
> TODOs
> ===
> 1. Multi-threaded folio copy routine needs to look at CPU scheduler and
> only use idle CPUs to avoid interfering userspace workloads. Of course
> more complicated policies can be used based on migration issuing thread
> priority.
>
> 2. Eliminate memory allocation during multi-threaded folio copy routine
> if possible.
>
> 3. A runtime check to decide when use multi-threaded folio copy.
> Something like cache hotness issue mentioned by Matthew[3].
>
> 4. Use non-temporal CPU instructions to avoid cache pollution issues.
>
> 5. Explicitly make multi-threaded folio copy only available to
> !HIGHMEM, since kmap_local_page() would be needed for each kernel
> folio copy work threads and expensive.
>
> 6. A better interface than copy_page_lists_mt() to allow DMA data copy
> to be used as well.
>
> Let me know your thoughts. Thanks.
>
>
> [1] https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@amd.com/
> [2] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
> [3] https://lore.kernel.org/linux-mm/Zm0SWZKcRrngCUUW@casper.infradead.org/
>
> Byungchul Park (1):
> mm: separate move/undo doing on folio list from migrate_pages_batch()
>
> Zi Yan (4):
> mm/migrate: factor out code in move_to_new_folio() and
> migrate_folio_move()
> mm/migrate: add migrate_folios_batch_move to batch the folio move
> operations
> mm/migrate: introduce multi-threaded page copy routine
> test: add sysctl for folio copy tests and adjust
> NR_MAX_BATCHED_MIGRATION
>
> include/linux/migrate.h | 3 +
> include/linux/migrate_mode.h | 2 +
> include/linux/mm.h | 4 +
> include/linux/sysctl.h | 1 +
> kernel/sysctl.c | 29 ++-
> mm/Makefile | 2 +-
> mm/copy_pages.c | 190 +++++++++++++++
> mm/migrate.c | 443 +++++++++++++++++++++++++++--------
> 8 files changed, 577 insertions(+), 97 deletions(-)
> create mode 100644 mm/copy_pages.c
>
> --
> 2.45.2
>
* Re: [RFC PATCH 4/5] mm/migrate: introduce multi-threaded page copy routine
2025-01-03 17:24 ` [RFC PATCH 4/5] mm/migrate: introduce multi-threaded page copy routine Zi Yan
2025-01-06 1:18 ` Hyeonggon Yoo
@ 2025-02-13 12:44 ` Byungchul Park
2025-02-13 15:34 ` Zi Yan
1 sibling, 1 reply; 30+ messages in thread
From: Byungchul Park @ 2025-02-13 12:44 UTC (permalink / raw)
To: Zi Yan
Cc: linux-mm, David Rientjes, Shivank Garg, Aneesh Kumar,
David Hildenbrand, John Hubbard, Kirill Shutemov, Matthew Wilcox,
Mel Gorman, Rao, Bharata Bhasker, Rik van Riel, RaghavendraKT,
Wei Xu, Suyeon Lee, Lei Chen, Shukla, Santosh, Grimm, Jon, sj,
shy828301, Liam Howlett, Gregory Price, Huang, Ying, kernel_team
On Fri, Jan 03, 2025 at 12:24:18PM -0500, Zi Yan wrote:
> Now that page copies are batched, multi-threaded page copy can be used
> to increase copy throughput. Add copy_page_lists_mt() to copy pages in
> a multi-threaded manner. Empirical data show that more than 32 base
> pages are needed for multi-threaded copy to show a benefit, so use 32
> as the threshold.
>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
> include/linux/migrate.h | 3 +
> mm/Makefile | 2 +-
> mm/copy_pages.c | 186 ++++++++++++++++++++++++++++++++++++++++
> mm/migrate.c | 19 ++--
> 4 files changed, 199 insertions(+), 11 deletions(-)
> create mode 100644 mm/copy_pages.c
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 29919faea2f1..a0124f4893b0 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -80,6 +80,9 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio);
> int folio_migrate_mapping(struct address_space *mapping,
> struct folio *newfolio, struct folio *folio, int extra_count);
>
> +int copy_page_lists_mt(struct list_head *dst_folios,
> + struct list_head *src_folios, int nr_items);
> +
> #else
>
> static inline void putback_movable_pages(struct list_head *l) {}
> diff --git a/mm/Makefile b/mm/Makefile
> index 850386a67b3e..f8c7f6b4cebb 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -92,7 +92,7 @@ obj-$(CONFIG_KMSAN) += kmsan/
> obj-$(CONFIG_FAILSLAB) += failslab.o
> obj-$(CONFIG_FAIL_PAGE_ALLOC) += fail_page_alloc.o
> obj-$(CONFIG_MEMTEST) += memtest.o
> -obj-$(CONFIG_MIGRATION) += migrate.o
> +obj-$(CONFIG_MIGRATION) += migrate.o copy_pages.o
> obj-$(CONFIG_NUMA) += memory-tiers.o
> obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> diff --git a/mm/copy_pages.c b/mm/copy_pages.c
> new file mode 100644
> index 000000000000..0e2231199f66
> --- /dev/null
> +++ b/mm/copy_pages.c
> @@ -0,0 +1,186 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Parallel page copy routine.
> + */
> +
> +#include <linux/sysctl.h>
> +#include <linux/highmem.h>
> +#include <linux/workqueue.h>
> +#include <linux/slab.h>
> +#include <linux/migrate.h>
> +
> +
> +unsigned int limit_mt_num = 4;
> +
> +struct copy_item {
> + char *to;
> + char *from;
> + unsigned long chunk_size;
> +};
> +
> +struct copy_page_info {
> + struct work_struct copy_page_work;
> + unsigned long num_items;
> + struct copy_item item_list[];
> +};
> +
> +static void copy_page_routine(char *vto, char *vfrom,
> + unsigned long chunk_size)
> +{
> + memcpy(vto, vfrom, chunk_size);
> +}
> +
> +static void copy_page_work_queue_thread(struct work_struct *work)
> +{
> + struct copy_page_info *my_work = (struct copy_page_info *)work;
> + int i;
> +
> + for (i = 0; i < my_work->num_items; ++i)
> + copy_page_routine(my_work->item_list[i].to,
> + my_work->item_list[i].from,
> + my_work->item_list[i].chunk_size);
> +}
> +
> +int copy_page_lists_mt(struct list_head *dst_folios,
> + struct list_head *src_folios, int nr_items)
> +{
> + int err = 0;
> + unsigned int total_mt_num = limit_mt_num;
> + int to_node = folio_nid(list_first_entry(dst_folios, struct folio, lru));
> + int i;
> + struct copy_page_info *work_items[32] = {0};
> + const struct cpumask *per_node_cpumask = cpumask_of_node(to_node);
Hi,
Why do you use the cpumask of dst's node rather than src's node to
decide where the works are queued? Is it to utilize the CPU cache?
Isn't it better to use src's node than dst's, where nothing has been
loaded into the CPU cache yet?
Or why don't you avoid specifying the CPUs to queue on and instead
let system_unbound_wq select appropriate CPUs, e.g. the idlest ones,
when the system is not that idle?
Am I missing something?
Byungchul
> + int cpu_id_list[32] = {0};
> + int cpu;
> + int max_items_per_thread;
> + int item_idx;
> + struct folio *src, *src2, *dst, *dst2;
> +
> + total_mt_num = min_t(unsigned int, total_mt_num,
> + cpumask_weight(per_node_cpumask));
> +
> + if (total_mt_num > 32)
> + total_mt_num = 32;
> +
> + /* Each thread gets part of each page, if nr_items < total_mt_num */
> + if (nr_items < total_mt_num)
> + max_items_per_thread = nr_items;
> + else
> + max_items_per_thread = (nr_items / total_mt_num) +
> + ((nr_items % total_mt_num) ? 1 : 0);
> +
> +
> + for (cpu = 0; cpu < total_mt_num; ++cpu) {
> + work_items[cpu] = kzalloc(sizeof(struct copy_page_info) +
> + sizeof(struct copy_item) * max_items_per_thread,
> + GFP_NOWAIT);
> + if (!work_items[cpu]) {
> + err = -ENOMEM;
> + goto free_work_items;
> + }
> + }
> +
> + i = 0;
> + /* TODO: need a better cpu selection method */
> + for_each_cpu(cpu, per_node_cpumask) {
> + if (i >= total_mt_num)
> + break;
> + cpu_id_list[i] = cpu;
> + ++i;
> + }
> +
> + if (nr_items < total_mt_num) {
> + for (cpu = 0; cpu < total_mt_num; ++cpu) {
> + INIT_WORK((struct work_struct *)work_items[cpu],
> + copy_page_work_queue_thread);
> + work_items[cpu]->num_items = max_items_per_thread;
> + }
> +
> + item_idx = 0;
> + dst = list_first_entry(dst_folios, struct folio, lru);
> + dst2 = list_next_entry(dst, lru);
> + list_for_each_entry_safe(src, src2, src_folios, lru) {
> + unsigned long chunk_size = PAGE_SIZE * folio_nr_pages(src) / total_mt_num;
> + /* XXX: not working in HIGHMEM */
> + char *vfrom = page_address(&src->page);
> + char *vto = page_address(&dst->page);
> +
> + VM_WARN_ON(PAGE_SIZE * folio_nr_pages(src) % total_mt_num);
> + VM_WARN_ON(folio_nr_pages(dst) != folio_nr_pages(src));
> +
> + for (cpu = 0; cpu < total_mt_num; ++cpu) {
> + work_items[cpu]->item_list[item_idx].to =
> + vto + chunk_size * cpu;
> + work_items[cpu]->item_list[item_idx].from =
> + vfrom + chunk_size * cpu;
> + work_items[cpu]->item_list[item_idx].chunk_size =
> + chunk_size;
> + }
> +
> + item_idx++;
> + dst = dst2;
> + dst2 = list_next_entry(dst, lru);
> + }
> +
> + for (cpu = 0; cpu < total_mt_num; ++cpu)
> + queue_work_on(cpu_id_list[cpu],
> + system_unbound_wq,
> + (struct work_struct *)work_items[cpu]);
> + } else {
> + int num_xfer_per_thread = nr_items / total_mt_num;
> + int per_cpu_item_idx;
> +
> +
> + for (cpu = 0; cpu < total_mt_num; ++cpu) {
> + INIT_WORK((struct work_struct *)work_items[cpu],
> + copy_page_work_queue_thread);
> +
> + work_items[cpu]->num_items = num_xfer_per_thread +
> + (cpu < (nr_items % total_mt_num));
> + }
> +
> + cpu = 0;
> + per_cpu_item_idx = 0;
> + item_idx = 0;
> + dst = list_first_entry(dst_folios, struct folio, lru);
> + dst2 = list_next_entry(dst, lru);
> + list_for_each_entry_safe(src, src2, src_folios, lru) {
> + /* XXX: not working in HIGHMEM */
> + work_items[cpu]->item_list[per_cpu_item_idx].to =
> + page_address(&dst->page);
> + work_items[cpu]->item_list[per_cpu_item_idx].from =
> + page_address(&src->page);
> + work_items[cpu]->item_list[per_cpu_item_idx].chunk_size =
> + PAGE_SIZE * folio_nr_pages(src);
> +
> + VM_WARN_ON(folio_nr_pages(dst) !=
> + folio_nr_pages(src));
> +
> + per_cpu_item_idx++;
> + item_idx++;
> + dst = dst2;
> + dst2 = list_next_entry(dst, lru);
> +
> + if (per_cpu_item_idx == work_items[cpu]->num_items) {
> + queue_work_on(cpu_id_list[cpu],
> + system_unbound_wq,
> + (struct work_struct *)work_items[cpu]);
> + per_cpu_item_idx = 0;
> + cpu++;
> + }
> + }
> + if (item_idx != nr_items)
> + pr_warn("%s: only %d out of %d pages are transferred\n",
> + __func__, item_idx - 1, nr_items);
> + }
> +
> + /* Wait until it finishes */
> + for (i = 0; i < total_mt_num; ++i)
> + flush_work((struct work_struct *)work_items[i]);
> +
> +free_work_items:
> + for (cpu = 0; cpu < total_mt_num; ++cpu)
> + kfree(work_items[cpu]);
> +
> + return err;
> +}
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 95c4cc4a7823..18440180d747 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1799,7 +1799,7 @@ static void migrate_folios_batch_move(struct list_head *src_folios,
> int *nr_retry_pages)
> {
> struct folio *folio, *folio2, *dst, *dst2;
> - int rc, nr_pages = 0, nr_mig_folios = 0;
> + int rc, nr_pages = 0, total_nr_pages = 0, total_nr_folios = 0;
> int old_page_state = 0;
> struct anon_vma *anon_vma = NULL;
> bool is_lru;
> @@ -1807,11 +1807,6 @@ static void migrate_folios_batch_move(struct list_head *src_folios,
> LIST_HEAD(err_src);
> LIST_HEAD(err_dst);
>
> - if (mode != MIGRATE_ASYNC) {
> - *retry += 1;
> - return;
> - }
> -
> /*
> * Iterate over the list of locked src/dst folios to copy the metadata
> */
> @@ -1859,19 +1854,23 @@ static void migrate_folios_batch_move(struct list_head *src_folios,
> migrate_folio_undo_src(folio, old_page_state & PAGE_WAS_MAPPED,
> anon_vma, true, ret_folios);
> migrate_folio_undo_dst(dst, true, put_new_folio, private);
> - } else /* MIGRATEPAGE_SUCCESS */
> - nr_mig_folios++;
> + } else { /* MIGRATEPAGE_SUCCESS */
> + total_nr_pages += nr_pages;
> + total_nr_folios++;
> + }
>
> dst = dst2;
> dst2 = list_next_entry(dst, lru);
> }
>
> /* Exit if folio list for batch migration is empty */
> - if (!nr_mig_folios)
> + if (!total_nr_pages)
> goto out;
>
> /* Batch copy the folios */
> - {
> + if (total_nr_pages > 32) {
> + copy_page_lists_mt(dst_folios, src_folios, total_nr_folios);
> + } else {
> dst = list_first_entry(dst_folios, struct folio, lru);
> dst2 = list_next_entry(dst, lru);
> list_for_each_entry_safe(folio, folio2, src_folios, lru) {
> --
> 2.45.2
>
* Re: [RFC PATCH 4/5] mm/migrate: introduce multi-threaded page copy routine
2025-02-13 12:44 ` Byungchul Park
@ 2025-02-13 15:34 ` Zi Yan
2025-02-13 21:34 ` Byungchul Park
0 siblings, 1 reply; 30+ messages in thread
From: Zi Yan @ 2025-02-13 15:34 UTC (permalink / raw)
To: Byungchul Park
Cc: linux-mm, David Rientjes, Shivank Garg, Aneesh Kumar,
David Hildenbrand, John Hubbard, Kirill Shutemov, Matthew Wilcox,
Mel Gorman, Rao, Bharata Bhasker, Rik van Riel, RaghavendraKT,
Wei Xu, Suyeon Lee, Lei Chen, Shukla, Santosh, Grimm, Jon, sj,
shy828301, Liam Howlett, Gregory Price, Huang, Ying, kernel_team
On 13 Feb 2025, at 7:44, Byungchul Park wrote:
> On Fri, Jan 03, 2025 at 12:24:18PM -0500, Zi Yan wrote:
>> Now that page copies are batched, multi-threaded page copy can be used
>> to increase copy throughput. Add copy_page_lists_mt() to copy pages in
>> a multi-threaded manner. Empirical data show that more than 32 base
>> pages are needed for multi-threaded copy to show a benefit, so use 32
>> as the threshold.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>> include/linux/migrate.h | 3 +
>> mm/Makefile | 2 +-
>> mm/copy_pages.c | 186 ++++++++++++++++++++++++++++++++++++++++
>> mm/migrate.c | 19 ++--
>> 4 files changed, 199 insertions(+), 11 deletions(-)
>> create mode 100644 mm/copy_pages.c
>>
>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>> index 29919faea2f1..a0124f4893b0 100644
>> --- a/include/linux/migrate.h
>> +++ b/include/linux/migrate.h
>> @@ -80,6 +80,9 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio);
>> int folio_migrate_mapping(struct address_space *mapping,
>> struct folio *newfolio, struct folio *folio, int extra_count);
>>
>> +int copy_page_lists_mt(struct list_head *dst_folios,
>> + struct list_head *src_folios, int nr_items);
>> +
>> #else
>>
>> static inline void putback_movable_pages(struct list_head *l) {}
>> diff --git a/mm/Makefile b/mm/Makefile
>> index 850386a67b3e..f8c7f6b4cebb 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -92,7 +92,7 @@ obj-$(CONFIG_KMSAN) += kmsan/
>> obj-$(CONFIG_FAILSLAB) += failslab.o
>> obj-$(CONFIG_FAIL_PAGE_ALLOC) += fail_page_alloc.o
>> obj-$(CONFIG_MEMTEST) += memtest.o
>> -obj-$(CONFIG_MIGRATION) += migrate.o
>> +obj-$(CONFIG_MIGRATION) += migrate.o copy_pages.o
>> obj-$(CONFIG_NUMA) += memory-tiers.o
>> obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>> obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>> diff --git a/mm/copy_pages.c b/mm/copy_pages.c
>> new file mode 100644
>> index 000000000000..0e2231199f66
>> --- /dev/null
>> +++ b/mm/copy_pages.c
>> @@ -0,0 +1,186 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Parallel page copy routine.
>> + */
>> +
>> +#include <linux/sysctl.h>
>> +#include <linux/highmem.h>
>> +#include <linux/workqueue.h>
>> +#include <linux/slab.h>
>> +#include <linux/migrate.h>
>> +
>> +
>> +unsigned int limit_mt_num = 4;
>> +
>> +struct copy_item {
>> + char *to;
>> + char *from;
>> + unsigned long chunk_size;
>> +};
>> +
>> +struct copy_page_info {
>> + struct work_struct copy_page_work;
>> + unsigned long num_items;
>> + struct copy_item item_list[];
>> +};
>> +
>> +static void copy_page_routine(char *vto, char *vfrom,
>> + unsigned long chunk_size)
>> +{
>> + memcpy(vto, vfrom, chunk_size);
>> +}
>> +
>> +static void copy_page_work_queue_thread(struct work_struct *work)
>> +{
>> + struct copy_page_info *my_work = (struct copy_page_info *)work;
>> + int i;
>> +
>> + for (i = 0; i < my_work->num_items; ++i)
>> + copy_page_routine(my_work->item_list[i].to,
>> + my_work->item_list[i].from,
>> + my_work->item_list[i].chunk_size);
>> +}
>> +
>> +int copy_page_lists_mt(struct list_head *dst_folios,
>> + struct list_head *src_folios, int nr_items)
>> +{
>> + int err = 0;
>> + unsigned int total_mt_num = limit_mt_num;
>> + int to_node = folio_nid(list_first_entry(dst_folios, struct folio, lru));
>> + int i;
>> + struct copy_page_info *work_items[32] = {0};
>> + const struct cpumask *per_node_cpumask = cpumask_of_node(to_node);
>
> Hi,
>
> Why do you use the cpumask of dst's node rather than src's node to
> decide where the works are queued? Is it to utilize the CPU cache?
> Isn't it better to use src's node than dst's, where nothing has been
> loaded into the CPU cache yet?
Because some vendors' CPUs achieve higher copy throughput by pushing
data from src to dst (the copy threads run on the src node), whereas
other vendors' CPUs do better by pulling data from src into dst (the
copy threads run on the dst node). More in [1].
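A minimal sketch of that choice, assuming a made-up push_data policy
flag (the real knob is the option added in 5/5; the name and signature
here are illustrative only):

#include <linux/mm.h>
#include <linux/topology.h>

/*
 * Illustrative only: pick which node's CPUs should run the copy work.
 * "push" means the copy threads run near the source and write to the
 * remote destination; "pull" means they run near the destination and
 * read from the remote source. push_data is a made-up policy flag.
 */
static const struct cpumask *copy_cpumask(struct folio *src, struct folio *dst,
					  bool push_data)
{
	int copy_node = push_data ? folio_nid(src) : folio_nid(dst);

	return cpumask_of_node(copy_node);
}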
> Or why don't you avoid specifying the CPUs to queue on and instead
> let system_unbound_wq select appropriate CPUs, e.g. the idlest ones,
> when the system is not that idle?
Based on wq_select_unbound_cpu()[2], a round-robin method is used to
select the target CPUs, not the idlest ones. queue_work_node(), which
queues jobs on a NUMA node, uses select_numa_node_cpu(), and that
simply picks an effectively arbitrary (in practice, the first) CPU
from the NUMA node[3]. There is no idleness detection in the workqueue
implementation yet.
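For reference, a minimal sketch contrasting the three queueing paths
mentioned above; it is illustrative only and not part of the patch:

#include <linux/workqueue.h>

static void copy_work_fn(struct work_struct *work)
{
	/* ... perform the copy chunks assigned to this work item ... */
}

static void queueing_examples(int cpu, int nid)
{
	static struct work_struct w1, w2, w3;

	INIT_WORK(&w1, copy_work_fn);
	INIT_WORK(&w2, copy_work_fn);
	INIT_WORK(&w3, copy_work_fn);

	/* Pin the work to an explicit CPU, as copy_page_lists_mt() does. */
	queue_work_on(cpu, system_unbound_wq, &w1);

	/* Hint a NUMA node; select_numa_node_cpu() picks a CPU from it. */
	queue_work_node(nid, system_unbound_wq, &w2);

	/* No placement hint; CPU choice is left to wq_select_unbound_cpu(). */
	queue_work(system_unbound_wq, &w3);
}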
In addition, based on CPU topology, not all idle CPUs are equal. For
example, AMD CPUs are organized into CCDs, and two cores from one CCD
can already saturate that CCD's bandwidth. This means that to achieve
high copy throughput, even if all cores in one CCD are idle, idle CPUs
from other CCDs should be chosen first[4].
I am planning to reach out to scheduling folks to learn more about CPU scheduling
and come up with a better workqueue or an alternative for multithreading page
migration.
[1] https://lore.kernel.org/linux-mm/8B66C7BA-96D6-4E04-89F7-13829BF480D7@nvidia.com/
[2] https://elixir.bootlin.com/linux/v6.13.2/source/kernel/workqueue.c#L2212
[3] https://elixir.bootlin.com/linux/v6.13.2/source/kernel/workqueue.c#L2408
[4] https://lore.kernel.org/linux-mm/D969919C-A241-432E-A0E3-353CCD8AC7E8@nvidia.com/
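To make the CCD point above concrete, a rough sketch of picking at
most one worker CPU per CCD on the chosen node; ccd_mask_of() is a
hypothetical helper, not an existing kernel API:

#include <linux/cpumask.h>
#include <linux/gfp.h>

/*
 * Hypothetical helper: the set of CPUs sharing a CCD (last-level cache)
 * with @cpu. A real implementation would come from arch topology code.
 */
const struct cpumask *ccd_mask_of(int cpu);

/*
 * Pick up to @nr worker CPUs from @node_mask, taking at most one CPU per
 * CCD so that a single CCD's bandwidth does not become the bottleneck.
 * Returns the number of CPUs actually picked.
 */
static int pick_copy_cpus(const struct cpumask *node_mask, int *cpus, int nr)
{
	cpumask_var_t used_ccds;
	int cpu, n = 0;

	if (!zalloc_cpumask_var(&used_ccds, GFP_KERNEL))
		return 0;

	for_each_cpu(cpu, node_mask) {
		if (n >= nr)
			break;
		/* Skip CPUs whose CCD already hosts a copy worker. */
		if (cpumask_intersects(used_ccds, ccd_mask_of(cpu)))
			continue;
		cpumask_or(used_ccds, used_ccds, ccd_mask_of(cpu));
		cpus[n++] = cpu;
	}

	free_cpumask_var(used_ccds);
	return n;
}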
>
> Byungchul
>
>> + int cpu_id_list[32] = {0};
>> + int cpu;
>> + int max_items_per_thread;
>> + int item_idx;
>> + struct folio *src, *src2, *dst, *dst2;
>> +
>> + total_mt_num = min_t(unsigned int, total_mt_num,
>> + cpumask_weight(per_node_cpumask));
>> +
>> + if (total_mt_num > 32)
>> + total_mt_num = 32;
>> +
>> + /* Each thread gets part of each page, if nr_items < total_mt_num */
>> + if (nr_items < total_mt_num)
>> + max_items_per_thread = nr_items;
>> + else
>> + max_items_per_thread = (nr_items / total_mt_num) +
>> + ((nr_items % total_mt_num) ? 1 : 0);
>> +
>> +
>> + for (cpu = 0; cpu < total_mt_num; ++cpu) {
>> + work_items[cpu] = kzalloc(sizeof(struct copy_page_info) +
>> + sizeof(struct copy_item) * max_items_per_thread,
>> + GFP_NOWAIT);
>> + if (!work_items[cpu]) {
>> + err = -ENOMEM;
>> + goto free_work_items;
>> + }
>> + }
>> +
>> + i = 0;
>> + /* TODO: need a better cpu selection method */
>> + for_each_cpu(cpu, per_node_cpumask) {
>> + if (i >= total_mt_num)
>> + break;
>> + cpu_id_list[i] = cpu;
>> + ++i;
>> + }
>> +
>> + if (nr_items < total_mt_num) {
>> + for (cpu = 0; cpu < total_mt_num; ++cpu) {
>> + INIT_WORK((struct work_struct *)work_items[cpu],
>> + copy_page_work_queue_thread);
>> + work_items[cpu]->num_items = max_items_per_thread;
>> + }
>> +
>> + item_idx = 0;
>> + dst = list_first_entry(dst_folios, struct folio, lru);
>> + dst2 = list_next_entry(dst, lru);
>> + list_for_each_entry_safe(src, src2, src_folios, lru) {
>> + unsigned long chunk_size = PAGE_SIZE * folio_nr_pages(src) / total_mt_num;
>> + /* XXX: not working in HIGHMEM */
>> + char *vfrom = page_address(&src->page);
>> + char *vto = page_address(&dst->page);
>> +
>> + VM_WARN_ON(PAGE_SIZE * folio_nr_pages(src) % total_mt_num);
>> + VM_WARN_ON(folio_nr_pages(dst) != folio_nr_pages(src));
>> +
>> + for (cpu = 0; cpu < total_mt_num; ++cpu) {
>> + work_items[cpu]->item_list[item_idx].to =
>> + vto + chunk_size * cpu;
>> + work_items[cpu]->item_list[item_idx].from =
>> + vfrom + chunk_size * cpu;
>> + work_items[cpu]->item_list[item_idx].chunk_size =
>> + chunk_size;
>> + }
>> +
>> + item_idx++;
>> + dst = dst2;
>> + dst2 = list_next_entry(dst, lru);
>> + }
>> +
>> + for (cpu = 0; cpu < total_mt_num; ++cpu)
>> + queue_work_on(cpu_id_list[cpu],
>> + system_unbound_wq,
>> + (struct work_struct *)work_items[cpu]);
>> + } else {
>> + int num_xfer_per_thread = nr_items / total_mt_num;
>> + int per_cpu_item_idx;
>> +
>> +
>> + for (cpu = 0; cpu < total_mt_num; ++cpu) {
>> + INIT_WORK((struct work_struct *)work_items[cpu],
>> + copy_page_work_queue_thread);
>> +
>> + work_items[cpu]->num_items = num_xfer_per_thread +
>> + (cpu < (nr_items % total_mt_num));
>> + }
>> +
>> + cpu = 0;
>> + per_cpu_item_idx = 0;
>> + item_idx = 0;
>> + dst = list_first_entry(dst_folios, struct folio, lru);
>> + dst2 = list_next_entry(dst, lru);
>> + list_for_each_entry_safe(src, src2, src_folios, lru) {
>> + /* XXX: not working in HIGHMEM */
>> + work_items[cpu]->item_list[per_cpu_item_idx].to =
>> + page_address(&dst->page);
>> + work_items[cpu]->item_list[per_cpu_item_idx].from =
>> + page_address(&src->page);
>> + work_items[cpu]->item_list[per_cpu_item_idx].chunk_size =
>> + PAGE_SIZE * folio_nr_pages(src);
>> +
>> + VM_WARN_ON(folio_nr_pages(dst) !=
>> + folio_nr_pages(src));
>> +
>> + per_cpu_item_idx++;
>> + item_idx++;
>> + dst = dst2;
>> + dst2 = list_next_entry(dst, lru);
>> +
>> + if (per_cpu_item_idx == work_items[cpu]->num_items) {
>> + queue_work_on(cpu_id_list[cpu],
>> + system_unbound_wq,
>> + (struct work_struct *)work_items[cpu]);
>> + per_cpu_item_idx = 0;
>> + cpu++;
>> + }
>> + }
>> + if (item_idx != nr_items)
>> + pr_warn("%s: only %d out of %d pages are transferred\n",
>> + __func__, item_idx - 1, nr_items);
>> + }
>> +
>> + /* Wait until it finishes */
>> + for (i = 0; i < total_mt_num; ++i)
>> + flush_work((struct work_struct *)work_items[i]);
>> +
>> +free_work_items:
>> + for (cpu = 0; cpu < total_mt_num; ++cpu)
>> + kfree(work_items[cpu]);
>> +
>> + return err;
>> +}
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 95c4cc4a7823..18440180d747 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1799,7 +1799,7 @@ static void migrate_folios_batch_move(struct list_head *src_folios,
>> int *nr_retry_pages)
>> {
>> struct folio *folio, *folio2, *dst, *dst2;
>> - int rc, nr_pages = 0, nr_mig_folios = 0;
>> + int rc, nr_pages = 0, total_nr_pages = 0, total_nr_folios = 0;
>> int old_page_state = 0;
>> struct anon_vma *anon_vma = NULL;
>> bool is_lru;
>> @@ -1807,11 +1807,6 @@ static void migrate_folios_batch_move(struct list_head *src_folios,
>> LIST_HEAD(err_src);
>> LIST_HEAD(err_dst);
>>
>> - if (mode != MIGRATE_ASYNC) {
>> - *retry += 1;
>> - return;
>> - }
>> -
>> /*
>> * Iterate over the list of locked src/dst folios to copy the metadata
>> */
>> @@ -1859,19 +1854,23 @@ static void migrate_folios_batch_move(struct list_head *src_folios,
>> migrate_folio_undo_src(folio, old_page_state & PAGE_WAS_MAPPED,
>> anon_vma, true, ret_folios);
>> migrate_folio_undo_dst(dst, true, put_new_folio, private);
>> - } else /* MIGRATEPAGE_SUCCESS */
>> - nr_mig_folios++;
>> + } else { /* MIGRATEPAGE_SUCCESS */
>> + total_nr_pages += nr_pages;
>> + total_nr_folios++;
>> + }
>>
>> dst = dst2;
>> dst2 = list_next_entry(dst, lru);
>> }
>>
>> /* Exit if folio list for batch migration is empty */
>> - if (!nr_mig_folios)
>> + if (!total_nr_pages)
>> goto out;
>>
>> /* Batch copy the folios */
>> - {
>> + if (total_nr_pages > 32) {
>> + copy_page_lists_mt(dst_folios, src_folios, total_nr_folios);
>> + } else {
>> dst = list_first_entry(dst_folios, struct folio, lru);
>> dst2 = list_next_entry(dst, lru);
>> list_for_each_entry_safe(folio, folio2, src_folios, lru) {
>> --
>> 2.45.2
>>
Best Regards,
Yan, Zi
* Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
2025-02-13 8:17 ` Byungchul Park
@ 2025-02-13 15:36 ` Zi Yan
0 siblings, 0 replies; 30+ messages in thread
From: Zi Yan @ 2025-02-13 15:36 UTC (permalink / raw)
To: Byungchul Park
Cc: linux-mm, David Rientjes, Shivank Garg, Aneesh Kumar,
David Hildenbrand, John Hubbard, Kirill Shutemov, Matthew Wilcox,
Mel Gorman, Rao, Bharata Bhasker, Rik van Riel, RaghavendraKT,
Wei Xu, Suyeon Lee, Lei Chen, Shukla, Santosh, Grimm, Jon, sj,
shy828301, Liam Howlett, Gregory Price, Huang, Ying, kernel_team
On 13 Feb 2025, at 3:17, Byungchul Park wrote:
> On Fri, Jan 03, 2025 at 12:24:14PM -0500, Zi Yan wrote:
>> Hi all,
>
> Hi,
>
> I'd appreciate being cc'ed starting from the next version.
Sure.
Best Regards,
Yan, Zi
* Re: [RFC PATCH 4/5] mm/migrate: introduce multi-threaded page copy routine
2025-02-13 15:34 ` Zi Yan
@ 2025-02-13 21:34 ` Byungchul Park
0 siblings, 0 replies; 30+ messages in thread
From: Byungchul Park @ 2025-02-13 21:34 UTC (permalink / raw)
To: Zi Yan
Cc: linux-mm, David Rientjes, Shivank Garg, Aneesh Kumar,
David Hildenbrand, John Hubbard, Kirill Shutemov, Matthew Wilcox,
Mel Gorman, Rao, Bharata Bhasker, Rik van Riel, RaghavendraKT,
Wei Xu, Suyeon Lee, Lei Chen, Shukla, Santosh, Grimm, Jon, sj,
shy828301, Liam Howlett, Gregory Price, Huang, Ying, kernel_team
On Thu, Feb 13, 2025 at 10:34:05AM -0500, Zi Yan wrote:
> On 13 Feb 2025, at 7:44, Byungchul Park wrote:
>
> > On Fri, Jan 03, 2025 at 12:24:18PM -0500, Zi Yan wrote:
> >> Now that page copies are batched, multi-threaded page copy can be used
> >> to increase copy throughput. Add copy_page_lists_mt() to copy pages in
> >> a multi-threaded manner. Empirical data show that more than 32 base
> >> pages are needed for multi-threaded copy to show a benefit, so use 32
> >> as the threshold.
> >>
> >> Signed-off-by: Zi Yan <ziy@nvidia.com>
> >> ---
> >> include/linux/migrate.h | 3 +
> >> mm/Makefile | 2 +-
> >> mm/copy_pages.c | 186 ++++++++++++++++++++++++++++++++++++++++
> >> mm/migrate.c | 19 ++--
> >> 4 files changed, 199 insertions(+), 11 deletions(-)
> >> create mode 100644 mm/copy_pages.c
> >>
> >> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> >> index 29919faea2f1..a0124f4893b0 100644
> >> --- a/include/linux/migrate.h
> >> +++ b/include/linux/migrate.h
> >> @@ -80,6 +80,9 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio);
> >> int folio_migrate_mapping(struct address_space *mapping,
> >> struct folio *newfolio, struct folio *folio, int extra_count);
> >>
> >> +int copy_page_lists_mt(struct list_head *dst_folios,
> >> + struct list_head *src_folios, int nr_items);
> >> +
> >> #else
> >>
> >> static inline void putback_movable_pages(struct list_head *l) {}
> >> diff --git a/mm/Makefile b/mm/Makefile
> >> index 850386a67b3e..f8c7f6b4cebb 100644
> >> --- a/mm/Makefile
> >> +++ b/mm/Makefile
> >> @@ -92,7 +92,7 @@ obj-$(CONFIG_KMSAN) += kmsan/
> >> obj-$(CONFIG_FAILSLAB) += failslab.o
> >> obj-$(CONFIG_FAIL_PAGE_ALLOC) += fail_page_alloc.o
> >> obj-$(CONFIG_MEMTEST) += memtest.o
> >> -obj-$(CONFIG_MIGRATION) += migrate.o
> >> +obj-$(CONFIG_MIGRATION) += migrate.o copy_pages.o
> >> obj-$(CONFIG_NUMA) += memory-tiers.o
> >> obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> >> obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> >> diff --git a/mm/copy_pages.c b/mm/copy_pages.c
> >> new file mode 100644
> >> index 000000000000..0e2231199f66
> >> --- /dev/null
> >> +++ b/mm/copy_pages.c
> >> @@ -0,0 +1,186 @@
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +/*
> >> + * Parallel page copy routine.
> >> + */
> >> +
> >> +#include <linux/sysctl.h>
> >> +#include <linux/highmem.h>
> >> +#include <linux/workqueue.h>
> >> +#include <linux/slab.h>
> >> +#include <linux/migrate.h>
> >> +
> >> +
> >> +unsigned int limit_mt_num = 4;
> >> +
> >> +struct copy_item {
> >> + char *to;
> >> + char *from;
> >> + unsigned long chunk_size;
> >> +};
> >> +
> >> +struct copy_page_info {
> >> + struct work_struct copy_page_work;
> >> + unsigned long num_items;
> >> + struct copy_item item_list[];
> >> +};
> >> +
> >> +static void copy_page_routine(char *vto, char *vfrom,
> >> + unsigned long chunk_size)
> >> +{
> >> + memcpy(vto, vfrom, chunk_size);
> >> +}
> >> +
> >> +static void copy_page_work_queue_thread(struct work_struct *work)
> >> +{
> >> + struct copy_page_info *my_work = (struct copy_page_info *)work;
> >> + int i;
> >> +
> >> + for (i = 0; i < my_work->num_items; ++i)
> >> + copy_page_routine(my_work->item_list[i].to,
> >> + my_work->item_list[i].from,
> >> + my_work->item_list[i].chunk_size);
> >> +}
> >> +
> >> +int copy_page_lists_mt(struct list_head *dst_folios,
> >> + struct list_head *src_folios, int nr_items)
> >> +{
> >> + int err = 0;
> >> + unsigned int total_mt_num = limit_mt_num;
> >> + int to_node = folio_nid(list_first_entry(dst_folios, struct folio, lru));
> >> + int i;
> >> + struct copy_page_info *work_items[32] = {0};
> >> + const struct cpumask *per_node_cpumask = cpumask_of_node(to_node);
> >
> > Hi,
> >
> > Why do you use the cpumask of dst's node rather than src's node to
> > decide where the works are queued? Is it to utilize the CPU cache?
> > Isn't it better to use src's node than dst's, where nothing has been
> > loaded into the CPU cache yet?
>
> Because some vendors' CPUs achieve higher copy throughput by pushing
> data from src to dst (the copy threads run on the src node), whereas
> other vendors' CPUs do better by pulling data from src into dst (the
> copy threads run on the dst node). More in [1].
Ah, okay. You have already added the additional option for it in 5/5.
> > Or why don't you avoid specifying the CPUs to queue on and instead
> > let system_unbound_wq select appropriate CPUs, e.g. the idlest ones,
> > when the system is not that idle?
>
> Based on wq_select_unbound_cpu()[2], a round-robin method is used to
> select the target CPUs, not the idlest ones.
It was just an example of what I think it should be... but indeed!
> queue_work_node(), which queues jobs on a NUMA node, uses
> select_numa_node_cpu(), and that simply picks an effectively arbitrary
> (in practice, the first) CPU from the NUMA node[3]. There is no
> idleness detection in the workqueue implementation yet.
>
> In addition, based on CPU topology, not all idle CPUs are equal. For
> example, AMD CPUs are organized into CCDs, and two cores from one CCD
> can already saturate that CCD's bandwidth. This means that to achieve
> high copy throughput, even if all cores in one CCD are idle, idle CPUs
> from other CCDs should be chosen first[4].
Yeah... I'd like to think more about what the best design for this would be.
> I am planning to reach out to scheduling folks to learn more about CPU scheduling
> and come up with a better workqueue or an alternative for multithreading page
> migration.
Good luck.
Byungchul
> [1] https://lore.kernel.org/linux-mm/8B66C7BA-96D6-4E04-89F7-13829BF480D7@nvidia.com/
> [2] https://elixir.bootlin.com/linux/v6.13.2/source/kernel/workqueue.c#L2212
> [3] https://elixir.bootlin.com/linux/v6.13.2/source/kernel/workqueue.c#L2408
> [4] https://lore.kernel.org/linux-mm/D969919C-A241-432E-A0E3-353CCD8AC7E8@nvidia.com/
>
>
>
> >
> > Byungchul
> >
> >> + int cpu_id_list[32] = {0};
> >> + int cpu;
> >> + int max_items_per_thread;
> >> + int item_idx;
> >> + struct folio *src, *src2, *dst, *dst2;
> >> +
> >> + total_mt_num = min_t(unsigned int, total_mt_num,
> >> + cpumask_weight(per_node_cpumask));
> >> +
> >> + if (total_mt_num > 32)
> >> + total_mt_num = 32;
> >> +
> >> + /* Each thread gets part of each page, if nr_items < total_mt_num */
> >> + if (nr_items < total_mt_num)
> >> + max_items_per_thread = nr_items;
> >> + else
> >> + max_items_per_thread = (nr_items / total_mt_num) +
> >> + ((nr_items % total_mt_num) ? 1 : 0);
> >> +
> >> +
> >> + for (cpu = 0; cpu < total_mt_num; ++cpu) {
> >> + work_items[cpu] = kzalloc(sizeof(struct copy_page_info) +
> >> + sizeof(struct copy_item) * max_items_per_thread,
> >> + GFP_NOWAIT);
> >> + if (!work_items[cpu]) {
> >> + err = -ENOMEM;
> >> + goto free_work_items;
> >> + }
> >> + }
> >> +
> >> + i = 0;
> >> + /* TODO: need a better cpu selection method */
> >> + for_each_cpu(cpu, per_node_cpumask) {
> >> + if (i >= total_mt_num)
> >> + break;
> >> + cpu_id_list[i] = cpu;
> >> + ++i;
> >> + }
> >> +
> >> + if (nr_items < total_mt_num) {
> >> + for (cpu = 0; cpu < total_mt_num; ++cpu) {
> >> + INIT_WORK((struct work_struct *)work_items[cpu],
> >> + copy_page_work_queue_thread);
> >> + work_items[cpu]->num_items = max_items_per_thread;
> >> + }
> >> +
> >> + item_idx = 0;
> >> + dst = list_first_entry(dst_folios, struct folio, lru);
> >> + dst2 = list_next_entry(dst, lru);
> >> + list_for_each_entry_safe(src, src2, src_folios, lru) {
> >> + unsigned long chunk_size = PAGE_SIZE * folio_nr_pages(src) / total_mt_num;
> >> + /* XXX: not working in HIGHMEM */
> >> + char *vfrom = page_address(&src->page);
> >> + char *vto = page_address(&dst->page);
> >> +
> >> + VM_WARN_ON(PAGE_SIZE * folio_nr_pages(src) % total_mt_num);
> >> + VM_WARN_ON(folio_nr_pages(dst) != folio_nr_pages(src));
> >> +
> >> + for (cpu = 0; cpu < total_mt_num; ++cpu) {
> >> + work_items[cpu]->item_list[item_idx].to =
> >> + vto + chunk_size * cpu;
> >> + work_items[cpu]->item_list[item_idx].from =
> >> + vfrom + chunk_size * cpu;
> >> + work_items[cpu]->item_list[item_idx].chunk_size =
> >> + chunk_size;
> >> + }
> >> +
> >> + item_idx++;
> >> + dst = dst2;
> >> + dst2 = list_next_entry(dst, lru);
> >> + }
> >> +
> >> + for (cpu = 0; cpu < total_mt_num; ++cpu)
> >> + queue_work_on(cpu_id_list[cpu],
> >> + system_unbound_wq,
> >> + (struct work_struct *)work_items[cpu]);
> >> + } else {
> >> + int num_xfer_per_thread = nr_items / total_mt_num;
> >> + int per_cpu_item_idx;
> >> +
> >> +
> >> + for (cpu = 0; cpu < total_mt_num; ++cpu) {
> >> + INIT_WORK((struct work_struct *)work_items[cpu],
> >> + copy_page_work_queue_thread);
> >> +
> >> + work_items[cpu]->num_items = num_xfer_per_thread +
> >> + (cpu < (nr_items % total_mt_num));
> >> + }
> >> +
> >> + cpu = 0;
> >> + per_cpu_item_idx = 0;
> >> + item_idx = 0;
> >> + dst = list_first_entry(dst_folios, struct folio, lru);
> >> + dst2 = list_next_entry(dst, lru);
> >> + list_for_each_entry_safe(src, src2, src_folios, lru) {
> >> + /* XXX: not working in HIGHMEM */
> >> + work_items[cpu]->item_list[per_cpu_item_idx].to =
> >> + page_address(&dst->page);
> >> + work_items[cpu]->item_list[per_cpu_item_idx].from =
> >> + page_address(&src->page);
> >> + work_items[cpu]->item_list[per_cpu_item_idx].chunk_size =
> >> + PAGE_SIZE * folio_nr_pages(src);
> >> +
> >> + VM_WARN_ON(folio_nr_pages(dst) !=
> >> + folio_nr_pages(src));
> >> +
> >> + per_cpu_item_idx++;
> >> + item_idx++;
> >> + dst = dst2;
> >> + dst2 = list_next_entry(dst, lru);
> >> +
> >> + if (per_cpu_item_idx == work_items[cpu]->num_items) {
> >> + queue_work_on(cpu_id_list[cpu],
> >> + system_unbound_wq,
> >> + (struct work_struct *)work_items[cpu]);
> >> + per_cpu_item_idx = 0;
> >> + cpu++;
> >> + }
> >> + }
> >> + if (item_idx != nr_items)
> >> + pr_warn("%s: only %d out of %d pages are transferred\n",
> >> + __func__, item_idx - 1, nr_items);
> >> + }
> >> +
> >> + /* Wait until it finishes */
> >> + for (i = 0; i < total_mt_num; ++i)
> >> + flush_work((struct work_struct *)work_items[i]);
> >> +
> >> +free_work_items:
> >> + for (cpu = 0; cpu < total_mt_num; ++cpu)
> >> + kfree(work_items[cpu]);
> >> +
> >> + return err;
> >> +}
> >> diff --git a/mm/migrate.c b/mm/migrate.c
> >> index 95c4cc4a7823..18440180d747 100644
> >> --- a/mm/migrate.c
> >> +++ b/mm/migrate.c
> >> @@ -1799,7 +1799,7 @@ static void migrate_folios_batch_move(struct list_head *src_folios,
> >> int *nr_retry_pages)
> >> {
> >> struct folio *folio, *folio2, *dst, *dst2;
> >> - int rc, nr_pages = 0, nr_mig_folios = 0;
> >> + int rc, nr_pages = 0, total_nr_pages = 0, total_nr_folios = 0;
> >> int old_page_state = 0;
> >> struct anon_vma *anon_vma = NULL;
> >> bool is_lru;
> >> @@ -1807,11 +1807,6 @@ static void migrate_folios_batch_move(struct list_head *src_folios,
> >> LIST_HEAD(err_src);
> >> LIST_HEAD(err_dst);
> >>
> >> - if (mode != MIGRATE_ASYNC) {
> >> - *retry += 1;
> >> - return;
> >> - }
> >> -
> >> /*
> >> * Iterate over the list of locked src/dst folios to copy the metadata
> >> */
> >> @@ -1859,19 +1854,23 @@ static void migrate_folios_batch_move(struct list_head *src_folios,
> >> migrate_folio_undo_src(folio, old_page_state & PAGE_WAS_MAPPED,
> >> anon_vma, true, ret_folios);
> >> migrate_folio_undo_dst(dst, true, put_new_folio, private);
> >> - } else /* MIGRATEPAGE_SUCCESS */
> >> - nr_mig_folios++;
> >> + } else { /* MIGRATEPAGE_SUCCESS */
> >> + total_nr_pages += nr_pages;
> >> + total_nr_folios++;
> >> + }
> >>
> >> dst = dst2;
> >> dst2 = list_next_entry(dst, lru);
> >> }
> >>
> >> /* Exit if folio list for batch migration is empty */
> >> - if (!nr_mig_folios)
> >> + if (!total_nr_pages)
> >> goto out;
> >>
> >> /* Batch copy the folios */
> >> - {
> >> + if (total_nr_pages > 32) {
> >> + copy_page_lists_mt(dst_folios, src_folios, total_nr_folios);
> >> + } else {
> >> dst = list_first_entry(dst_folios, struct folio, lru);
> >> dst2 = list_next_entry(dst, lru);
> >> list_for_each_entry_safe(folio, folio2, src_folios, lru) {
> >> --
> >> 2.45.2
> >>
>
>
> Best Regards,
> Yan, Zi
* [RFC PATCH 1/5] mm: separate move/undo doing on folio list from migrate_pages_batch()
2024-06-14 22:15 [RFC PATCH 0/5] Enhancements to Page Migration with Batch Offloading via DMA Shivank Garg
@ 2024-06-14 22:15 ` Shivank Garg
0 siblings, 0 replies; 30+ messages in thread
From: Shivank Garg @ 2024-06-14 22:15 UTC (permalink / raw)
To: akpm, linux-kernel, linux-mm
Cc: bharata, raghavendra.kodsarathimmappa, Michael.Day, dmaengine,
vkoul, shivankg, Byungchul Park
From: Byungchul Park <byungchul@sk.com>
Functionally, no change. This is a preparatory patch picked from the
luf (lazy unmap flush) patch series. It improves code organization and
readability for the steps involving migrate_folio_move().
Refactor migrate_pages_batch() by separating the move and undo parts
that operate on the folio lists into their own helpers.
Signed-off-by: Byungchul Park <byungchul@sk.com>
Signed-off-by: Shivank Garg <shivankg@amd.com>
---
mm/migrate.c | 134 +++++++++++++++++++++++++++++++--------------------
1 file changed, 83 insertions(+), 51 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index c27b1f8097d4..6c36c6e0a360 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1606,6 +1606,81 @@ static int migrate_hugetlbs(struct list_head *from, new_folio_t get_new_folio,
return nr_failed;
}
+static void migrate_folios_move(struct list_head *src_folios,
+ struct list_head *dst_folios,
+ free_folio_t put_new_folio, unsigned long private,
+ enum migrate_mode mode, int reason,
+ struct list_head *ret_folios,
+ struct migrate_pages_stats *stats,
+ int *retry, int *thp_retry, int *nr_failed,
+ int *nr_retry_pages)
+{
+ struct folio *folio, *folio2, *dst, *dst2;
+ bool is_thp;
+ int nr_pages;
+ int rc;
+
+ dst = list_first_entry(dst_folios, struct folio, lru);
+ dst2 = list_next_entry(dst, lru);
+ list_for_each_entry_safe(folio, folio2, src_folios, lru) {
+ is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
+ nr_pages = folio_nr_pages(folio);
+
+ cond_resched();
+
+ rc = migrate_folio_move(put_new_folio, private,
+ folio, dst, mode,
+ reason, ret_folios);
+ /*
+ * The rules are:
+ * Success: folio will be freed
+ * -EAGAIN: stay on the unmap_folios list
+ * Other errno: put on ret_folios list
+ */
+ switch (rc) {
+ case -EAGAIN:
+ *retry += 1;
+ *thp_retry += is_thp;
+ *nr_retry_pages += nr_pages;
+ break;
+ case MIGRATEPAGE_SUCCESS:
+ stats->nr_succeeded += nr_pages;
+ stats->nr_thp_succeeded += is_thp;
+ break;
+ default:
+ *nr_failed += 1;
+ stats->nr_thp_failed += is_thp;
+ stats->nr_failed_pages += nr_pages;
+ break;
+ }
+ dst = dst2;
+ dst2 = list_next_entry(dst, lru);
+ }
+}
+
+static void migrate_folios_undo(struct list_head *src_folios,
+ struct list_head *dst_folios,
+ free_folio_t put_new_folio, unsigned long private,
+ struct list_head *ret_folios)
+{
+ struct folio *folio, *folio2, *dst, *dst2;
+
+ dst = list_first_entry(dst_folios, struct folio, lru);
+ dst2 = list_next_entry(dst, lru);
+ list_for_each_entry_safe(folio, folio2, src_folios, lru) {
+ int old_page_state = 0;
+ struct anon_vma *anon_vma = NULL;
+
+ __migrate_folio_extract(dst, &old_page_state, &anon_vma);
+ migrate_folio_undo_src(folio, old_page_state & PAGE_WAS_MAPPED,
+ anon_vma, true, ret_folios);
+ list_del(&dst->lru);
+ migrate_folio_undo_dst(dst, true, put_new_folio, private);
+ dst = dst2;
+ dst2 = list_next_entry(dst, lru);
+ }
+}
+
/*
* migrate_pages_batch() first unmaps folios in the from list as many as
* possible, then move the unmapped folios.
@@ -1628,7 +1703,7 @@ static int migrate_pages_batch(struct list_head *from,
int pass = 0;
bool is_thp = false;
bool is_large = false;
- struct folio *folio, *folio2, *dst = NULL, *dst2;
+ struct folio *folio, *folio2, *dst = NULL;
int rc, rc_saved = 0, nr_pages;
LIST_HEAD(unmap_folios);
LIST_HEAD(dst_folios);
@@ -1764,42 +1839,11 @@ static int migrate_pages_batch(struct list_head *from,
thp_retry = 0;
nr_retry_pages = 0;
- dst = list_first_entry(&dst_folios, struct folio, lru);
- dst2 = list_next_entry(dst, lru);
- list_for_each_entry_safe(folio, folio2, &unmap_folios, lru) {
- is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
- nr_pages = folio_nr_pages(folio);
-
- cond_resched();
-
- rc = migrate_folio_move(put_new_folio, private,
- folio, dst, mode,
- reason, ret_folios);
- /*
- * The rules are:
- * Success: folio will be freed
- * -EAGAIN: stay on the unmap_folios list
- * Other errno: put on ret_folios list
- */
- switch(rc) {
- case -EAGAIN:
- retry++;
- thp_retry += is_thp;
- nr_retry_pages += nr_pages;
- break;
- case MIGRATEPAGE_SUCCESS:
- stats->nr_succeeded += nr_pages;
- stats->nr_thp_succeeded += is_thp;
- break;
- default:
- nr_failed++;
- stats->nr_thp_failed += is_thp;
- stats->nr_failed_pages += nr_pages;
- break;
- }
- dst = dst2;
- dst2 = list_next_entry(dst, lru);
- }
+ /* Move the unmapped folios */
+ migrate_folios_move(&unmap_folios, &dst_folios,
+ put_new_folio, private, mode, reason,
+ ret_folios, stats, &retry, &thp_retry,
+ &nr_failed, &nr_retry_pages);
}
nr_failed += retry;
stats->nr_thp_failed += thp_retry;
@@ -1808,20 +1852,8 @@ static int migrate_pages_batch(struct list_head *from,
rc = rc_saved ? : nr_failed;
out:
/* Cleanup remaining folios */
- dst = list_first_entry(&dst_folios, struct folio, lru);
- dst2 = list_next_entry(dst, lru);
- list_for_each_entry_safe(folio, folio2, &unmap_folios, lru) {
- int old_page_state = 0;
- struct anon_vma *anon_vma = NULL;
-
- __migrate_folio_extract(dst, &old_page_state, &anon_vma);
- migrate_folio_undo_src(folio, old_page_state & PAGE_WAS_MAPPED,
- anon_vma, true, ret_folios);
- list_del(&dst->lru);
- migrate_folio_undo_dst(dst, true, put_new_folio, private);
- dst = dst2;
- dst2 = list_next_entry(dst, lru);
- }
+ migrate_folios_undo(&unmap_folios, &dst_folios,
+ put_new_folio, private, ret_folios);
return rc;
}
--
2.34.1
end of thread, other threads:[~2025-02-13 21:34 UTC | newest]
Thread overview: 30+ messages
2025-01-03 17:24 [RFC PATCH 0/5] Accelerate page migration with batching and multi threads Zi Yan
2025-01-03 17:24 ` [RFC PATCH 1/5] mm: separate move/undo doing on folio list from migrate_pages_batch() Zi Yan
2025-01-03 17:24 ` [RFC PATCH 2/5] mm/migrate: factor out code in move_to_new_folio() and migrate_folio_move() Zi Yan
2025-01-03 17:24 ` [RFC PATCH 3/5] mm/migrate: add migrate_folios_batch_move to batch the folio move operations Zi Yan
2025-01-09 11:47 ` Shivank Garg
2025-01-09 14:08 ` Zi Yan
2025-01-03 17:24 ` [RFC PATCH 4/5] mm/migrate: introduce multi-threaded page copy routine Zi Yan
2025-01-06 1:18 ` Hyeonggon Yoo
2025-01-06 2:01 ` Zi Yan
2025-02-13 12:44 ` Byungchul Park
2025-02-13 15:34 ` Zi Yan
2025-02-13 21:34 ` Byungchul Park
2025-01-03 17:24 ` [RFC PATCH 5/5] test: add sysctl for folio copy tests and adjust NR_MAX_BATCHED_MIGRATION Zi Yan
2025-01-03 22:21 ` Gregory Price
2025-01-03 22:56 ` Zi Yan
2025-01-03 19:17 ` [RFC PATCH 0/5] Accelerate page migration with batching and multi threads Gregory Price
2025-01-03 19:32 ` Zi Yan
2025-01-03 22:09 ` Yang Shi
2025-01-06 2:33 ` Zi Yan
2025-01-09 11:47 ` Shivank Garg
2025-01-09 15:04 ` Zi Yan
2025-01-09 18:03 ` Shivank Garg
2025-01-09 19:32 ` Zi Yan
2025-01-10 17:05 ` Zi Yan
2025-01-10 19:51 ` Zi Yan
2025-01-16 4:57 ` Shivank Garg
2025-01-21 6:15 ` Shivank Garg
2025-02-13 8:17 ` Byungchul Park
2025-02-13 15:36 ` Zi Yan
-- strict thread matches above, loose matches on Subject: below --
2024-06-14 22:15 [RFC PATCH 0/5] Enhancements to Page Migration with Batch Offloading via DMA Shivank Garg
2024-06-14 22:15 ` [RFC PATCH 1/5] mm: separate move/undo doing on folio list from migrate_pages_batch() Shivank Garg