[RFC V3 0/9] Accelerate page migration with batch copying and hardware offload

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload
@ 2025-09-23 17:47 Shivank Garg
  2025-09-23 17:47 ` [RFC V3 1/9] mm/migrate: factor out code in move_to_new_folio() and migrate_folio_move() Shivank Garg
                   ` (10 more replies)
  0 siblings, 11 replies; 26+ messages in thread
From: Shivank Garg @ 2025-09-23 17:47 UTC (permalink / raw)
  To: akpm, david
  Cc: ziy, willy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, vkoul, lucas.demarchi, rdunlap,
	jgg, kuba, justonli, ivecera, dave.jiang, Jonathan.Cameron,
	dan.j.williams, rientjes, Raghavendra.KodsaraThimmappa, bharata,
	shivankg, alirad.malek, yiannis, weixugc, linux-kernel, linux-mm

This is the third RFC of the patchset to enhance page migration by batching
folio-copy operations and enabling acceleration via multi-threaded CPU or
DMA offload.

Single-threaded, folio-by-folio copying bottlenecks page migration
in modern systems with deep memory hierarchies, especially for large
folios where copy overhead dominates, leaving significant hardware
potential untapped. 

By batching the copy phase, we create an opportunity for significant
hardware acceleration. This series builds a framework for this acceleration
and provides two initial offload driver implementations: one using multiple
CPU threads (mtcopy) and another leveraging the DMAEngine subsystem (dcbm).

This version incorporates significant feedback to improve correctness,
robustness, and the efficiency of the DMA offload path.

Changelog since V2:

1. DMA Engine Rewrite:
   - Switched from per-folio dma_map_page() to batch dma_map_sgtable()
   - Single completion interrupt per batch (reduced overhead)
   - Order of magnitude improvement in setup time for large batches
2. Code cleanups and refactoring
3. Rebased on latest mainline (6.17-rc6+)

MOTIVATION:
-----------

Current Migration Flow:
[ move_pages(), Compaction, Tiering, etc. ]
              |
              v
     [ migrate_pages() ] // Common entry point
              |
              v
    [ migrate_pages_batch() ] // NR_MAX_BATCHED_MIGRATION (512) folios at a time
      |
      |--> [ migrate_folio_unmap() ]
      |
      |--> [ try_to_unmap_flush() ] // Perform a single, batched TLB flush
      |
      |--> [ migrate_folios_move() ] // Bottleneck: Interleaved copy
           - For each folio:
             - Metadata prep: Copy flags, mappings, etc.
             - folio_copy()  <-- Single-threaded, serial data copy.
             - Update PTEs & finalize for that single folio.

Understanding overheads in page migration (move_pages() syscall):

Total move_pages() overheads = folio_copy() + Other overheads
1. folio_copy() is the core copy operation that interests us.
2. The remaining operations are user/kernel transitions, page table walks,
locking, folio unmap, dst folio alloc, TLB flush, copying flags, updating
mappings and PTEs etc. that contribute to the remaining overheads.

Percentage of folio_copy() overheads in move_pages(N pages) syscall time:
Number of pages being migrated and folio size:
            4KB     2MB
1 page     <1%     ~66%
512 page   ~35%    ~97%

Based on Amdahl's Law, optimizing folio_copy() for large pages offers a
substantial performance opportunity.

move_pages() syscall speedup = 1 / ((1 - F) + (F / S))
Where F is the fraction of time spent in folio_copy() and S is the speedup of
folio_copy().

For 4KB folios, folio copy overheads are significantly small in single-page
migrations to impact overall speedup, even for 512 pages, maximum theoretical
speedup is limited to ~1.54x with infinite folio_copy() speedup.

For 2MB THPs, folio copy overheads are significant even in single page
migrations, with a theoretical speedup of ~3x with infinite folio_copy()
speedup and up to ~33x for 512 pages.

A realistic value of S (speedup of folio_copy()) is 7.5x for DMA offload
based on my measurements for copying 512 2MB pages.
This gives move_pages(), a practical speedup of 6.3x for 512 2MB page (also
observed in the experiments below).

DESIGN: A Pluggable Migrator Framework
---------------------------------------

Introduce migrate_folios_batch_move():

[ migrate_pages_batch() ]
    |
    |--> migrate_folio_unmap()
    |      
    |--> try_to_unmap_flush()
    |
    +--> [ migrate_folios_batch_move() ] // new batched design
            |
            |--> Metadata migration
            |    - Metadata prep: Copy flags, mappings, etc.
            |    - Use MIGRATE_NO_COPY to skip the actual data copy.
            |
            |--> Batch copy folio data
            |    - Migrator is configurable at runtime via sysfs.
            |
            |          static_call(_folios_copy) // Pluggable migrators
            |          /          |            \
            |         v           v             v
            | [ Default ]  [ MT CPU copy ]  [ DMA Offload ]
            |
            +--> Update PTEs to point to dst folios and complete migration.

User Control of Migrator:

# echo 1 > /sys/kernel/dcbm/offloading
   |
   +--> Driver's sysfs handler
        |
        +--> calls start_offloading(&cpu_migrator)
              |
              +--> calls offc_update_migrator()
                    |
                    +--> static_call_update(_folios_copy, mig->migrate_offc)

Later, During Migration ...
migrate_folios_batch_move()
    |
    +--> static_call(_folios_copy) // Now dispatches to the selected migrator
          |
          +-> [ mtcopy | dcbm | kernel_default ]

PERFORMANCE RESULTS:
--------------------

System Info: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT Enabled),
1 NUMA node per socket, Linux Kernel 6.16.0-rc6, DVFS set to Performance,
PTDMA hardware.

Benchmark: Use move_pages() syscall to move pages between two NUMA nodes.

1. Moving different sized folios (4KB, 16KB,..., 2MB) such that total transfer size is constant
(1GB), with different number of parallel threads/channels.
Metric: Throughput is reported in GB/s.

a. Baseline (Vanilla kernel, single-threaded, folio-by-folio migration):

Folio size|4K       | 16K       | 64K        | 128K       | 256K       | 512K       | 1M         | 2M         |
===============================================================================================================
Tput(GB/s)|3.73±0.33| 5.53±0.36 | 5.90±0.56  | 6.34±0.08  | 6.50±0.05  | 6.86±0.61  | 6.92±0.71  | 10.67±0.36 |

b. Multi-threaded CPU copy offload (mtcopy driver, use N Parallel Threads=1,2,4,8,12,16):

Thread | 4K         | 16K       | 64K        | 128K       | 256K       | 512K       | 1M         | 2M         |
===============================================================================================================
1      | 3.84±0.10  | 5.23±0.31 | 6.01±0.55  | 6.34±0.60  | 7.16±1.00  | 7.12±0.78  | 7.10±0.86  | 10.94±0.13 |
2      | 4.04±0.19  | 6.72±0.38 | 7.68±0.12  | 8.15±0.06  | 8.45±0.09  | 9.29±0.17  | 9.87±1.01  | 17.80±0.12 |
4      | 4.72±0.21  | 8.41±0.70 | 10.08±1.67 | 11.44±2.42 | 10.45±0.17 | 12.60±1.97 | 12.38±1.73 | 31.41±1.14 |
8      | 4.91±0.28  | 8.62±0.13 | 10.40±0.20 | 13.94±3.75 | 11.03±0.61 | 14.96±3.29 | 12.84±0.63 | 33.50±3.29 |
12     | 4.84±0.24  | 8.75±0.08 | 10.16±0.26 | 10.92±0.22 | 11.72±0.14 | 14.02±2.51 | 14.09±2.65 | 34.70±2.38 |
16     | 4.77±0.22  | 8.95±0.69 | 10.36±0.26 | 11.03±0.22 | 11.58±0.30 | 13.88±2.71 | 13.00±0.75 | 35.89±2.07 |

c. DMA offload (dcbm driver, use N DMA Channels=1,2,4,8,12,16):

Chan Cnt| 4K        | 16K       | 64K        | 128K       | 256K       | 512K       | 1M         | 2M         |
===============================================================================================================
1      | 2.75±0.19  | 2.86±0.13 | 3.28±0.20  | 4.57±0.72  | 5.03±0.62  | 4.69±0.25  | 4.78±0.34  | 12.50±0.24 |
2      | 3.35±0.19  | 4.57±0.19 | 5.35±0.55  | 6.71±0.71  | 7.40±1.07  | 7.38±0.61  | 7.21±0.73  | 14.23±0.34 |
4      | 4.01±0.17  | 6.36±0.26 | 7.71±0.89  | 9.40±1.35  | 10.27±1.96 | 10.60±1.42 | 12.35±2.64 | 26.84±0.91 |
8      | 4.46±0.16  | 7.74±0.13 | 9.72±1.29  | 10.88±0.16 | 12.12±2.54 | 15.62±3.96 | 13.29±2.65 | 45.27±2.60 |
12     | 4.60±0.22  | 8.90±0.84 | 11.26±2.19 | 16.00±4.41 | 14.90±4.38 | 14.57±2.84 | 13.79±3.18 | 59.94±4.19 |
16     | 4.61±0.25  | 9.08±0.79 | 11.14±1.75 | 13.95±3.85 | 13.69±3.39 | 15.47±3.44 | 15.44±4.65 | 63.69±5.01 |

- Throughput increases with folio size. Larger folios benefit more from DMA.
- Scaling shows diminishing returns beyond 8-12 threads/channels.
- Multi-threading and DMA offloading both provide significant gains - up to 3.4x and 6x respectively.

2. Varying total move size: (folio count = 1,8,..8192) for a fixed folio size of 2MB
   using only single thread/channel

folio_cnt | Baseline    | MTCPU      | DMA 
====================================================
1         | 7.96±2.22   | 6.43±0.66  | 6.52±0.45   |
8         | 8.20±0.75   | 8.82±1.10  | 8.88±0.54   |
16        | 7.54±0.61   | 9.06±0.95  | 9.03±0.62   |
32        | 8.68±0.77   | 10.11±0.42 | 10.17±0.50  |
64        | 9.08±1.03   | 10.12±0.44 | 11.21±0.24  |
256       | 10.53±0.39  | 10.77±0.28 | 12.43±0.12  |
512       | 10.59±0.29  | 10.81±0.19 | 12.61±0.07  |
2048      | 10.86±0.26  | 11.05±0.05 | 12.75±0.03  |
8192      | 10.84±0.18  | 11.12±0.05 | 12.81±0.02  |

- Throughput increases with folios count but plateaus after a threshold.
  (The migrate_pages function uses a folio batch size of 512)

Performance Analysis (V2 vs V3):

The new SG-based DMA driver dramatically reduces software overhead. By
switching from per-folio dma_map_page() to batch dma_map_sgtable(), setup
time improves by an order of magnitude for large batches.
This is most visible with 4KB folios, making DMA viable even for smaller
page sizes. For 2MB THP migrations, where hardware transfer time is more
dominant, the gains are more modest.

OPEN QUESTIONS:
---------------

User-Interface:

1. Control Interface Design:
The current interface creates separate sysfs files
for each driver, which can be confusing. Should we implement a unified interface
(/sys/kernel/mm/migration/offload_migrator), which accepts the name of the desired migrator
("kernel", "mtcopy", "dcbm"). This would ensure only one migrator is active at a time.
Is this the right approach?

2. Dynamic Migrator Selection:
Currently, active migrator is a global state, and only one can be active a time.
A more flexible approach might be for the caller of migrate_pages() to specify/hint which
offload mechanism to use, if any. This would allow a CXL driver to explicitly request DMA while a GPU driver might prefer
multi-threaded CPU copy.

3. Tuning Parameters: Expose parameters like number of threads/channels, batch size,
and thresholds for using migrators. Who should own these parameters?

4. Resources Accounting[3]:
a. CPU cgroups accounting and fairness
b. Migration cost attribution

FUTURE WORK:
------------

1. Enhance DMA drivers for bulk copying (e.g., SDXi Engine).
2. Enhance multi-threaded CPU copying for platform-specific scheduling of worker threads to optimize bandwidth utilization. Explore sched-ext for this. [2]
3. Enable kpromoted [4] to use the migration offload infrastructure.

EARLIER POSTINGS:
-----------------

- RFC V2: https://lore.kernel.org/all/20250319192211.10092-1-shivankg@amd.com
- RFC V1: https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com

REFERENCES:
-----------

[1] RFC from Zi Yan: https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
[2] LSFMM: https://lore.kernel.org/all/cf6fc05d-c0b0-4de3-985e-5403977aa3aa@amd.com
[3] https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com
[4] https://lore.kernel.org/all/20250910144653.212066-1-bharata@amd.com

Mike Day (1):
  mm: add support for copy offload for folio Migration

Shivank Garg (4):
  mm: Introduce folios_mc_copy() for batch copying folios
  mm/migrate: add migrate_folios_batch_move to  batch the folio move
    operations
  dcbm: add dma core batch migrator for batch page offloading
  mtcopy: spread threads across die for testing

Zi Yan (4):
  mm/migrate: factor out code in move_to_new_folio() and
    migrate_folio_move()
  mm/migrate: revive MIGRATE_NO_COPY in migrate_mode
  mtcopy: introduce multi-threaded page copy routine
  adjust NR_MAX_BATCHED_MIGRATION for testing

 drivers/Kconfig                        |   2 +
 drivers/Makefile                       |   3 +
 drivers/migoffcopy/Kconfig             |  17 +
 drivers/migoffcopy/Makefile            |   2 +
 drivers/migoffcopy/dcbm/Makefile       |   1 +
 drivers/migoffcopy/dcbm/dcbm.c         | 415 +++++++++++++++++++++++++
 drivers/migoffcopy/mtcopy/Makefile     |   1 +
 drivers/migoffcopy/mtcopy/copy_pages.c | 397 +++++++++++++++++++++++
 include/linux/migrate_mode.h           |   2 +
 include/linux/migrate_offc.h           |  34 ++
 include/linux/mm.h                     |   2 +
 mm/Kconfig                             |   8 +
 mm/Makefile                            |   1 +
 mm/migrate.c                           | 358 ++++++++++++++++++---
 mm/migrate_offc.c                      |  58 ++++
 mm/util.c                              |  29 ++
 16 files changed, 1284 insertions(+), 46 deletions(-)
 create mode 100644 drivers/migoffcopy/Kconfig
 create mode 100644 drivers/migoffcopy/Makefile
 create mode 100644 drivers/migoffcopy/dcbm/Makefile
 create mode 100644 drivers/migoffcopy/dcbm/dcbm.c
 create mode 100644 drivers/migoffcopy/mtcopy/Makefile
 create mode 100644 drivers/migoffcopy/mtcopy/copy_pages.c
 create mode 100644 include/linux/migrate_offc.h
 create mode 100644 mm/migrate_offc.c

-- 
2.43.0

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC V3 1/9] mm/migrate: factor out code in move_to_new_folio() and migrate_folio_move()
  2025-09-23 17:47 [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload Shivank Garg
@ 2025-09-23 17:47 ` Shivank Garg
  2025-10-02 10:30   ` Jonathan Cameron
  2025-09-23 17:47 ` [RFC V3 2/9] mm/migrate: revive MIGRATE_NO_COPY in migrate_mode Shivank Garg
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Shivank Garg @ 2025-09-23 17:47 UTC (permalink / raw)
  To: akpm, david
  Cc: ziy, willy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, vkoul, lucas.demarchi, rdunlap,
	jgg, kuba, justonli, ivecera, dave.jiang, Jonathan.Cameron,
	dan.j.williams, rientjes, Raghavendra.KodsaraThimmappa, bharata,
	shivankg, alirad.malek, yiannis, weixugc, linux-kernel, linux-mm

From: Zi Yan <ziy@nvidia.com>

No function change is intended. The factored out code will be reused in
an upcoming batched folio move function.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Shivank Garg <shivankg@amd.com>
---
 mm/migrate.c | 106 ++++++++++++++++++++++++++++++++-------------------
 1 file changed, 67 insertions(+), 39 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 9e5ef39ce73a..ad03e7257847 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1061,19 +1061,7 @@ static int fallback_migrate_folio(struct address_space *mapping,
 	return migrate_folio(mapping, dst, src, mode);
 }
 
-/*
- * Move a src folio to a newly allocated dst folio.
- *
- * The src and dst folios are locked and the src folios was unmapped from
- * the page tables.
- *
- * On success, the src folio was replaced by the dst folio.
- *
- * Return value:
- *   < 0 - error code
- *  MIGRATEPAGE_SUCCESS - success
- */
-static int move_to_new_folio(struct folio *dst, struct folio *src,
+static int _move_to_new_folio_prep(struct folio *dst, struct folio *src,
 				enum migrate_mode mode)
 {
 	struct address_space *mapping = folio_mapping(src);
@@ -1098,7 +1086,12 @@ static int move_to_new_folio(struct folio *dst, struct folio *src,
 							mode);
 	else
 		rc = fallback_migrate_folio(mapping, dst, src, mode);
+	return rc;
+}
 
+static void _move_to_new_folio_finalize(struct folio *dst, struct folio *src,
+				int rc)
+{
 	if (rc == MIGRATEPAGE_SUCCESS) {
 		/*
 		 * For pagecache folios, src->mapping must be cleared before src
@@ -1110,6 +1103,29 @@ static int move_to_new_folio(struct folio *dst, struct folio *src,
 		if (likely(!folio_is_zone_device(dst)))
 			flush_dcache_folio(dst);
 	}
+}
+
+/*
+ * Move a src folio to a newly allocated dst folio.
+ *
+ * The src and dst folios are locked and the src folios was unmapped from
+ * the page tables.
+ *
+ * On success, the src folio was replaced by the dst folio.
+ *
+ * Return value:
+ *   < 0 - error code
+ *  MIGRATEPAGE_SUCCESS - success
+ */
+static int move_to_new_folio(struct folio *dst, struct folio *src,
+			enum migrate_mode mode)
+{
+	int rc;
+
+	rc = _move_to_new_folio_prep(dst, src, mode);
+
+	_move_to_new_folio_finalize(dst, src, rc);
+
 	return rc;
 }
 
@@ -1345,32 +1361,9 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
 	return rc;
 }
 
-/* Migrate the folio to the newly allocated folio in dst. */
-static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
-			      struct folio *src, struct folio *dst,
-			      enum migrate_mode mode, enum migrate_reason reason,
-			      struct list_head *ret)
+static void _migrate_folio_move_finalize1(struct folio *src, struct folio *dst,
+					  int old_page_state)
 {
-	int rc;
-	int old_page_state = 0;
-	struct anon_vma *anon_vma = NULL;
-	struct list_head *prev;
-
-	__migrate_folio_extract(dst, &old_page_state, &anon_vma);
-	prev = dst->lru.prev;
-	list_del(&dst->lru);
-
-	if (unlikely(page_has_movable_ops(&src->page))) {
-		rc = migrate_movable_ops_page(&dst->page, &src->page, mode);
-		if (rc)
-			goto out;
-		goto out_unlock_both;
-	}
-
-	rc = move_to_new_folio(dst, src, mode);
-	if (rc)
-		goto out;
-
 	/*
 	 * When successful, push dst to LRU immediately: so that if it
 	 * turns out to be an mlocked page, remove_migration_ptes() will
@@ -1386,8 +1379,12 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
 
 	if (old_page_state & PAGE_WAS_MAPPED)
 		remove_migration_ptes(src, dst, 0);
+}
 
-out_unlock_both:
+static void _migrate_folio_move_finalize2(struct folio *src, struct folio *dst,
+					  enum migrate_reason reason,
+					  struct anon_vma *anon_vma)
+{
 	folio_unlock(dst);
 	folio_set_owner_migrate_reason(dst, reason);
 	/*
@@ -1407,6 +1404,37 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
 		put_anon_vma(anon_vma);
 	folio_unlock(src);
 	migrate_folio_done(src, reason);
+}
+
+/* Migrate the folio to the newly allocated folio in dst. */
+static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
+			      struct folio *src, struct folio *dst,
+			      enum migrate_mode mode, enum migrate_reason reason,
+			      struct list_head *ret)
+{
+	int rc;
+	int old_page_state = 0;
+	struct anon_vma *anon_vma = NULL;
+	struct list_head *prev;
+
+	__migrate_folio_extract(dst, &old_page_state, &anon_vma);
+	prev = dst->lru.prev;
+	list_del(&dst->lru);
+
+	if (unlikely(page_has_movable_ops(&src->page))) {
+		rc = migrate_movable_ops_page(&dst->page, &src->page, mode);
+		if (rc)
+			goto out;
+		goto out_unlock_both;
+	}
+
+	rc = move_to_new_folio(dst, src, mode);
+	if (rc)
+		goto out;
+
+	_migrate_folio_move_finalize1(src, dst, old_page_state);
+out_unlock_both:
+	_migrate_folio_move_finalize2(src, dst, reason, anon_vma);
 
 	return rc;
 out:
-- 
2.43.0



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC V3 1/9] mm/migrate: factor out code in move_to_new_folio() and migrate_folio_move()
  2025-09-23 17:47 ` [RFC V3 1/9] mm/migrate: factor out code in move_to_new_folio() and migrate_folio_move() Shivank Garg
@ 2025-10-02 10:30   ` Jonathan Cameron
  0 siblings, 0 replies; 26+ messages in thread
From: Jonathan Cameron @ 2025-10-02 10:30 UTC (permalink / raw)
  To: Shivank Garg
  Cc: akpm, david, ziy, willy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, lorenzo.stoakes,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, vkoul,
	lucas.demarchi, rdunlap, jgg, kuba, justonli, ivecera,
	dave.jiang, dan.j.williams, rientjes,
	Raghavendra.KodsaraThimmappa, bharata, alirad.malek, yiannis,
	weixugc, linux-kernel, linux-mm

On Tue, 23 Sep 2025 17:47:36 +0000
Shivank Garg <shivankg@amd.com> wrote:

> From: Zi Yan <ziy@nvidia.com>
> 
> No function change is intended. The factored out code will be reused in
> an upcoming batched folio move function.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Shivank Garg <shivankg@amd.com>
Hi.  A few code structure things inline.

The naming of the various helpers needs some more thought I think as
with it like this the loss of readability of existing code is painful.

Jonathan

> ---
>  mm/migrate.c | 106 ++++++++++++++++++++++++++++++++-------------------
>  1 file changed, 67 insertions(+), 39 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 9e5ef39ce73a..ad03e7257847 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1061,19 +1061,7 @@ static int fallback_migrate_folio(struct address_space *mapping,
>  	return migrate_folio(mapping, dst, src, mode);
>  }
>  
> -/*
> - * Move a src folio to a newly allocated dst folio.
> - *
> - * The src and dst folios are locked and the src folios was unmapped from
> - * the page tables.
> - *
> - * On success, the src folio was replaced by the dst folio.
> - *
> - * Return value:
> - *   < 0 - error code
> - *  MIGRATEPAGE_SUCCESS - success
> - */
> -static int move_to_new_folio(struct folio *dst, struct folio *src,
> +static int _move_to_new_folio_prep(struct folio *dst, struct folio *src,

I'm not sure the _ prefix is needed. Or maybe it should be __ like
__buffer_migrate_folio()


>  				enum migrate_mode mode)
>  {
>  	struct address_space *mapping = folio_mapping(src);
> @@ -1098,7 +1086,12 @@ static int move_to_new_folio(struct folio *dst, struct folio *src,
>  							mode);
>  	else
>  		rc = fallback_migrate_folio(mapping, dst, src, mode);
> +	return rc;

May be worth switching this whole function to early returns given we no longer
have a shared block of stuff to do at the end.

	if (!mapping)
		return migrate_folio(mapping, st, src, mode);

	if (mapping_inaccessible(mapping))
		return -EOPNOTSUPP;

	if (mapping->a_ops->migrate_folio)
		return mapping->a_ops->migrate_folio(mapping, dst, src, mode);

	return fallback_migrate_folio(mapping, dst, src, mode);

> +}
>  
> +static void _move_to_new_folio_finalize(struct folio *dst, struct folio *src,
> +				int rc)
> +{
>  	if (rc == MIGRATEPAGE_SUCCESS) {

Perhaps
	if (rc != MIGRATE_PAGE_SUCCESS)
		return rc;

	/*
	 * For pagecache folios,....

...

	return rc;

Unless other stuff is likely to get added in here.
Or drag the condition to the caller.

>  		/*
>  		 * For pagecache folios, src->mapping must be cleared before src
> @@ -1110,6 +1103,29 @@ static int move_to_new_folio(struct folio *dst, struct folio *src,
>  		if (likely(!folio_is_zone_device(dst)))
>  			flush_dcache_folio(dst);
>  	}
> +}
> +
> +/*
> + * Move a src folio to a newly allocated dst folio.
> + *
> + * The src and dst folios are locked and the src folios was unmapped from
> + * the page tables.
> + *
> + * On success, the src folio was replaced by the dst folio.
> + *
> + * Return value:
> + *   < 0 - error code
> + *  MIGRATEPAGE_SUCCESS - success
> + */
> +static int move_to_new_folio(struct folio *dst, struct folio *src,
> +			enum migrate_mode mode)
> +{
> +	int rc;
> +
> +	rc = _move_to_new_folio_prep(dst, src, mode);
> +
> +	_move_to_new_folio_finalize(dst, src, rc);
> +
>  	return rc;
>  }
>  
> @@ -1345,32 +1361,9 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>  	return rc;
>  }
>  
> -/* Migrate the folio to the newly allocated folio in dst. */
> -static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
> -			      struct folio *src, struct folio *dst,
> -			      enum migrate_mode mode, enum migrate_reason reason,
> -			      struct list_head *ret)
> +static void _migrate_folio_move_finalize1(struct folio *src, struct folio *dst,
> +					  int old_page_state)
>  {
> -	int rc;
> -	int old_page_state = 0;
> -	struct anon_vma *anon_vma = NULL;
> -	struct list_head *prev;
> -
> -	__migrate_folio_extract(dst, &old_page_state, &anon_vma);
> -	prev = dst->lru.prev;
> -	list_del(&dst->lru);
> -
> -	if (unlikely(page_has_movable_ops(&src->page))) {
> -		rc = migrate_movable_ops_page(&dst->page, &src->page, mode);
> -		if (rc)
> -			goto out;
> -		goto out_unlock_both;
> -	}
> -
> -	rc = move_to_new_folio(dst, src, mode);
> -	if (rc)
> -		goto out;
> -
>  	/*
>  	 * When successful, push dst to LRU immediately: so that if it
>  	 * turns out to be an mlocked page, remove_migration_ptes() will
> @@ -1386,8 +1379,12 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
>  
>  	if (old_page_state & PAGE_WAS_MAPPED)
>  		remove_migration_ptes(src, dst, 0);
> +}
>  
> -out_unlock_both:
> +static void _migrate_folio_move_finalize2(struct folio *src, struct folio *dst,
> +					  enum migrate_reason reason,
> +					  struct anon_vma *anon_vma)
> +{
>  	folio_unlock(dst);
>  	folio_set_owner_migrate_reason(dst, reason);
>  	/*
> @@ -1407,6 +1404,37 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
>  		put_anon_vma(anon_vma);
>  	folio_unlock(src);
>  	migrate_folio_done(src, reason);
> +}
> +
> +/* Migrate the folio to the newly allocated folio in dst. */
> +static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
> +			      struct folio *src, struct folio *dst,
> +			      enum migrate_mode mode, enum migrate_reason reason,
> +			      struct list_head *ret)
> +{
> +	int rc;
> +	int old_page_state = 0;
> +	struct anon_vma *anon_vma = NULL;
> +	struct list_head *prev;
> +
> +	__migrate_folio_extract(dst, &old_page_state, &anon_vma);
> +	prev = dst->lru.prev;
> +	list_del(&dst->lru);
> +
> +	if (unlikely(page_has_movable_ops(&src->page))) {
> +		rc = migrate_movable_ops_page(&dst->page, &src->page, mode);
> +		if (rc)
> +			goto out;
> +		goto out_unlock_both;
I would drop this..
> +	}
and do
	} else {
		rc = move_to_new_folio(dst, src, mode);
		if (rc)
			goto out;
		_migrate_folio_move_finalize1(src, dst, old_page_state);
	}
	_migrate_folio_move_finalize2(src, dst, reason, anon_vma);
  
  	return rc;

This makes sense now as the amount of code indented more in this approach
is much smaller than it would have been before you factored stuff out.


> +
> +	rc = move_to_new_folio(dst, src, mode);
> +	if (rc)
> +		goto out;
> +
Hmm. These two functions might be useful but this is hurting readability
here.  Can we come up with some more meaningful names perhaps?

> +	_migrate_folio_move_finalize1(src, dst, old_page_state);
> +out_unlock_both:
> +	_migrate_folio_move_finalize2(src, dst, reason, anon_vma);
>  
>  	return rc;
>  out:



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC V3 2/9] mm/migrate: revive MIGRATE_NO_COPY in migrate_mode
  2025-09-23 17:47 [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload Shivank Garg
  2025-09-23 17:47 ` [RFC V3 1/9] mm/migrate: factor out code in move_to_new_folio() and migrate_folio_move() Shivank Garg
@ 2025-09-23 17:47 ` Shivank Garg
  2025-09-23 17:47 ` [RFC V3 3/9] mm: Introduce folios_mc_copy() for batch copying folios Shivank Garg
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Shivank Garg @ 2025-09-23 17:47 UTC (permalink / raw)
  To: akpm, david
  Cc: ziy, willy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, vkoul, lucas.demarchi, rdunlap,
	jgg, kuba, justonli, ivecera, dave.jiang, Jonathan.Cameron,
	dan.j.williams, rientjes, Raghavendra.KodsaraThimmappa, bharata,
	shivankg, alirad.malek, yiannis, weixugc, linux-kernel, linux-mm

From: Zi Yan <ziy@nvidia.com>

It is a preparation patch. The added MIGRATE_NO_COPY will be used by the
following patches to implement batched page copy functions by skipping
folio copy process in __migrate_folio() and copying folios in one shot
at the end.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Shivank Garg <shivankg@amd.com>
---
 include/linux/migrate_mode.h | 2 ++
 mm/migrate.c                 | 8 +++++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index 265c4328b36a..9af6c949a057 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -7,11 +7,13 @@
  *	on most operations but not ->writepage as the potential stall time
  *	is too significant
  * MIGRATE_SYNC will block when migrating pages
+ * MIGRATE_NO_COPY will not copy page content
  */
 enum migrate_mode {
 	MIGRATE_ASYNC,
 	MIGRATE_SYNC_LIGHT,
 	MIGRATE_SYNC,
+	MIGRATE_NO_COPY,
 };
 
 enum migrate_reason {
diff --git a/mm/migrate.c b/mm/migrate.c
index ad03e7257847..3fe78ecb146a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -848,9 +848,11 @@ static int __migrate_folio(struct address_space *mapping, struct folio *dst,
 	if (folio_ref_count(src) != expected_count)
 		return -EAGAIN;
 
-	rc = folio_mc_copy(dst, src);
-	if (unlikely(rc))
-		return rc;
+	if (mode != MIGRATE_NO_COPY) {
+		rc = folio_mc_copy(dst, src);
+		if (unlikely(rc))
+			return rc;
+	}
 
 	rc = __folio_migrate_mapping(mapping, dst, src, expected_count);
 	if (rc != MIGRATEPAGE_SUCCESS)
-- 
2.43.0



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC V3 3/9] mm: Introduce folios_mc_copy() for batch copying folios
  2025-09-23 17:47 [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload Shivank Garg
  2025-09-23 17:47 ` [RFC V3 1/9] mm/migrate: factor out code in move_to_new_folio() and migrate_folio_move() Shivank Garg
  2025-09-23 17:47 ` [RFC V3 2/9] mm/migrate: revive MIGRATE_NO_COPY in migrate_mode Shivank Garg
@ 2025-09-23 17:47 ` Shivank Garg
  2025-09-23 17:47 ` [RFC V3 4/9] mm/migrate: add migrate_folios_batch_move to batch the folio move operations Shivank Garg
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Shivank Garg @ 2025-09-23 17:47 UTC (permalink / raw)
  To: akpm, david
  Cc: ziy, willy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, vkoul, lucas.demarchi, rdunlap,
	jgg, kuba, justonli, ivecera, dave.jiang, Jonathan.Cameron,
	dan.j.williams, rientjes, Raghavendra.KodsaraThimmappa, bharata,
	shivankg, alirad.malek, yiannis, weixugc, linux-kernel, linux-mm

Introduce the folios_mc_copy() to copy the folio content from the list
of src folios to the list of dst folios.

This is preparatory patch for batch page migration offloading.

Signed-off-by: Shivank Garg <shivankg@amd.com>
---
 include/linux/mm.h |  2 ++
 mm/util.c          | 29 +++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ae97a0b8ec7..383702a819ac 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1187,6 +1187,8 @@ void __folio_put(struct folio *folio);
 void split_page(struct page *page, unsigned int order);
 void folio_copy(struct folio *dst, struct folio *src);
 int folio_mc_copy(struct folio *dst, struct folio *src);
+int folios_mc_copy(struct list_head *dst_list, struct list_head *src_list,
+		 unsigned int __maybe_unused folios_cnt);
 
 unsigned long nr_free_buffer_pages(void);
 
diff --git a/mm/util.c b/mm/util.c
index f814e6a59ab1..2d7758f33fc6 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -748,6 +748,35 @@ int folio_mc_copy(struct folio *dst, struct folio *src)
 }
 EXPORT_SYMBOL(folio_mc_copy);
 
+/**
+ * folios_mc_copy - Copy the contents of list of folios.
+ * @dst_list: Folios to copy to.
+ * @src_list: Folios to copy from.
+ * @folios_cnt: Number of folios in each list (unused).
+ *
+ * The folio contents are copied from @src_list to @dst_list.
+ * Assume the caller has validated that lists are not empty and both lists
+ * have equal number of folios. This may sleep.
+ */
+int folios_mc_copy(struct list_head *dst_list, struct list_head *src_list,
+		 unsigned int __maybe_unused folios_cnt)
+{
+	struct folio *src, *dst;
+	int ret;
+
+	dst = list_first_entry(dst_list, struct folio, lru);
+	list_for_each_entry(src, src_list, lru) {
+		cond_resched();
+		ret = folio_mc_copy(dst, src);
+		if (ret)
+			return ret;
+		dst = list_next_entry(dst, lru);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(folios_mc_copy);
+
 int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;
 static int sysctl_overcommit_ratio __read_mostly = 50;
 static unsigned long sysctl_overcommit_kbytes __read_mostly;
-- 
2.43.0



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC V3 4/9] mm/migrate: add migrate_folios_batch_move to  batch the folio move operations
  2025-09-23 17:47 [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload Shivank Garg
                   ` (2 preceding siblings ...)
  2025-09-23 17:47 ` [RFC V3 3/9] mm: Introduce folios_mc_copy() for batch copying folios Shivank Garg
@ 2025-09-23 17:47 ` Shivank Garg
  2025-10-02 11:03   ` Jonathan Cameron
  2025-09-23 17:47 ` [RFC V3 5/9] mm: add support for copy offload for folio Migration Shivank Garg
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Shivank Garg @ 2025-09-23 17:47 UTC (permalink / raw)
  To: akpm, david
  Cc: ziy, willy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, vkoul, lucas.demarchi, rdunlap,
	jgg, kuba, justonli, ivecera, dave.jiang, Jonathan.Cameron,
	dan.j.williams, rientjes, Raghavendra.KodsaraThimmappa, bharata,
	shivankg, alirad.malek, yiannis, weixugc, linux-kernel, linux-mm

This is a preparatory patch that enables batch copying for folios
undergoing migration. By enabling batch copying the folio content, we can
efficiently utilize the capabilities of DMA hardware or multi-threaded
folio copy. It uses MIGRATE_NO_COPY to skip folio copy during metadata
copy process and performed the copies in a batch later.

Currently, the folio move operation is performed individually for each
folio in sequential manner:
for_each_folio() {
        Copy folio metadata like flags and mappings
        Copy the folio content from src to dst
        Update page tables with dst folio
}

With this patch, we transition to a batch processing approach as shown
below:
for_each_folio() {
        Copy folio metadata like flags and mappings
}
Batch copy all src folios to dst
for_each_folio() {
        Update page tables with dst folios
}

dst->private is used to store page states and possible anon_vma value,
thus needs to be cleared during metadata copy process. To avoid additional
memory allocation to store the data during batch copy process, src->private
is used to store the data after metadata copy process, since src is no
longer used.

Co-developed-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Shivank Garg <shivankg@amd.com>
---
 mm/migrate.c | 197 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 193 insertions(+), 4 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 3fe78ecb146a..ce94e73a930d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -843,12 +843,15 @@ static int __migrate_folio(struct address_space *mapping, struct folio *dst,
 			   enum migrate_mode mode)
 {
 	int rc, expected_count = folio_expected_ref_count(src) + 1;
+	unsigned long dst_private = (unsigned long)dst->private;
 
 	/* Check whether src does not have extra refs before we do more work */
 	if (folio_ref_count(src) != expected_count)
 		return -EAGAIN;
 
-	if (mode != MIGRATE_NO_COPY) {
+	if (mode == MIGRATE_NO_COPY) {
+		dst->private = NULL;
+	} else {
 		rc = folio_mc_copy(dst, src);
 		if (unlikely(rc))
 			return rc;
@@ -862,6 +865,10 @@ static int __migrate_folio(struct address_space *mapping, struct folio *dst,
 		folio_attach_private(dst, folio_detach_private(src));
 
 	folio_migrate_flags(dst, src);
+
+	if (mode == MIGRATE_NO_COPY)
+		src->private = (void *)dst_private;
+
 	return MIGRATEPAGE_SUCCESS;
 }
 
@@ -1149,7 +1156,7 @@ static void __migrate_folio_record(struct folio *dst,
 	dst->private = (void *)anon_vma + old_page_state;
 }
 
-static void __migrate_folio_extract(struct folio *dst,
+static void __migrate_folio_read(struct folio *dst,
 				   int *old_page_state,
 				   struct anon_vma **anon_vmap)
 {
@@ -1157,6 +1164,12 @@ static void __migrate_folio_extract(struct folio *dst,
 
 	*anon_vmap = (struct anon_vma *)(private & ~PAGE_OLD_STATES);
 	*old_page_state = private & PAGE_OLD_STATES;
+}
+static void __migrate_folio_extract(struct folio *dst,
+				   int *old_page_state,
+				   struct anon_vma **anon_vmap)
+{
+	__migrate_folio_read(dst, old_page_state, anon_vmap);
 	dst->private = NULL;
 }
 
@@ -1776,6 +1789,176 @@ static void migrate_folios_move(struct list_head *src_folios,
 	}
 }
 
+static void migrate_folios_batch_move(struct list_head *src_folios,
+		struct list_head *dst_folios,
+		free_folio_t put_new_folio, unsigned long private,
+		enum migrate_mode mode, int reason,
+		struct list_head *ret_folios,
+		struct migrate_pages_stats *stats,
+		int *retry, int *thp_retry, int *nr_failed,
+		int *nr_retry_pages)
+{
+	struct folio *folio, *folio2, *dst, *dst2;
+	int rc, nr_pages = 0, nr_batched_folios = 0;
+	int old_page_state = 0;
+	struct anon_vma *anon_vma = NULL;
+	int is_thp = 0;
+	LIST_HEAD(err_src);
+	LIST_HEAD(err_dst);
+
+	/*
+	 * Iterate over the list of locked src/dst folios to copy the metadata
+	 */
+	dst = list_first_entry(dst_folios, struct folio, lru);
+	dst2 = list_next_entry(dst, lru);
+	list_for_each_entry_safe(folio, folio2, src_folios, lru) {
+		is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
+		nr_pages = folio_nr_pages(folio);
+
+		/*
+		 * dst->private is not cleared here. It is cleared and moved to
+		 * src->private in __migrate_folio().
+		 */
+		__migrate_folio_read(dst, &old_page_state, &anon_vma);
+
+		/*
+		 * Use MIGRATE_NO_COPY mode in migrate_folio family functions
+		 * to copy the flags, mapping and some other ancillary information.
+		 * This does everything except the page copy. The actual page copy
+		 * is handled later in a batch manner.
+		 */
+		if (unlikely(page_movable_ops(&folio->page)))
+			rc = -EAGAIN;
+		else
+			rc = _move_to_new_folio_prep(dst, folio, MIGRATE_NO_COPY);
+		/*
+		 * The rules are:
+		 *	Success: folio will be copied in batch
+		 *	-EAGAIN: move src/dst folios to tmp lists for
+		 *	         non-batch retry
+		 *	Other errno: put src folio on ret_folios list, restore
+		 *	             the dst folio
+		 */
+		if (rc == -EAGAIN) {
+			*retry += 1;
+			*thp_retry += is_thp;
+			*nr_retry_pages += nr_pages;
+
+			list_move_tail(&folio->lru, &err_src);
+			list_move_tail(&dst->lru, &err_dst);
+			__migrate_folio_record(dst, old_page_state, anon_vma);
+		} else if (rc != MIGRATEPAGE_SUCCESS) {
+			*nr_failed += 1;
+			stats->nr_thp_failed += is_thp;
+			stats->nr_failed_pages += nr_pages;
+
+			list_del(&dst->lru);
+			migrate_folio_undo_src(folio,
+					old_page_state & PAGE_WAS_MAPPED,
+					anon_vma, true, ret_folios);
+			migrate_folio_undo_dst(dst, true, put_new_folio, private);
+		} else { /* MIGRATEPAGE_SUCCESS */
+			nr_batched_folios++;
+		}
+
+		dst = dst2;
+		dst2 = list_next_entry(dst, lru);
+	}
+
+	/* Exit if folio list for batch migration is empty */
+	if (!nr_batched_folios)
+		goto out;
+
+	/* Batch copy the folios */
+	rc = folios_mc_copy(dst_folios, src_folios, nr_batched_folios);
+
+	/* TODO:  Is there a better way of handling the poison
+	 * recover for batch copy, instead of falling back to serial copy?
+	 */
+	/* fallback to serial page copy if needed */
+	if (rc) {
+		dst = list_first_entry(dst_folios, struct folio, lru);
+		dst2 = list_next_entry(dst, lru);
+		list_for_each_entry_safe(folio, folio2, src_folios, lru) {
+			is_thp = folio_test_large(folio) &&
+				 folio_test_pmd_mappable(folio);
+			nr_pages = folio_nr_pages(folio);
+			rc = folio_mc_copy(dst, folio);
+
+			if (rc) {
+				/*
+				 * dst->private is moved to src->private in
+				 * __migrate_folio(), so page state and anon_vma
+				 * values can be extracted from (src) folio.
+				 */
+				__migrate_folio_extract(folio, &old_page_state,
+						&anon_vma);
+				migrate_folio_undo_src(folio,
+						old_page_state & PAGE_WAS_MAPPED,
+						anon_vma, true, ret_folios);
+				list_del(&dst->lru);
+				migrate_folio_undo_dst(dst, true, put_new_folio,
+						private);
+			}
+
+			switch (rc) {
+			case MIGRATEPAGE_SUCCESS:
+				stats->nr_succeeded += nr_pages;
+				stats->nr_thp_succeeded += is_thp;
+				break;
+			default:
+				*nr_failed += 1;
+				stats->nr_thp_failed += is_thp;
+				stats->nr_failed_pages += nr_pages;
+				break;
+			}
+
+			dst = dst2;
+			dst2 = list_next_entry(dst, lru);
+		}
+	}
+
+	/*
+	 * Iterate the folio lists to remove migration pte and restore them
+	 * as working pte. Unlock the folios, add/remove them to LRU lists (if
+	 * applicable) and release the src folios.
+	 */
+	dst = list_first_entry(dst_folios, struct folio, lru);
+	dst2 = list_next_entry(dst, lru);
+	list_for_each_entry_safe(folio, folio2, src_folios, lru) {
+		is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
+		nr_pages = folio_nr_pages(folio);
+		/*
+		 * dst->private is moved to src->private in __migrate_folio(),
+		 * so page state and anon_vma values can be extracted from
+		 * (src) folio.
+		 */
+		__migrate_folio_extract(folio, &old_page_state, &anon_vma);
+		list_del(&dst->lru);
+
+		_move_to_new_folio_finalize(dst, folio, MIGRATEPAGE_SUCCESS);
+
+		/*
+		 * Below few steps are only applicable for lru pages which is
+		 * ensured as we have removed the non-lru pages from our list.
+		 */
+		_migrate_folio_move_finalize1(folio, dst, old_page_state);
+
+		_migrate_folio_move_finalize2(folio, dst, reason, anon_vma);
+
+		/* Page migration successful, increase stat counter */
+		stats->nr_succeeded += nr_pages;
+		stats->nr_thp_succeeded += is_thp;
+
+		dst = dst2;
+		dst2 = list_next_entry(dst, lru);
+	}
+out:
+	/* Add tmp folios back to the list to re-attempt migration. */
+	list_splice(&err_src, src_folios);
+	list_splice(&err_dst, dst_folios);
+}
+
 static void migrate_folios_undo(struct list_head *src_folios,
 		struct list_head *dst_folios,
 		free_folio_t put_new_folio, unsigned long private,
@@ -1986,13 +2169,19 @@ static int migrate_pages_batch(struct list_head *from,
 	/* Flush TLBs for all unmapped folios */
 	try_to_unmap_flush();
 
-	retry = 1;
+	retry = 0;
+	/* Batch move the unmapped folios */
+	migrate_folios_batch_move(&unmap_folios, &dst_folios,
+				put_new_folio, private, mode, reason,
+				ret_folios, stats, &retry, &thp_retry,
+				&nr_failed, &nr_retry_pages);
+
 	for (pass = 0; pass < nr_pass && retry; pass++) {
 		retry = 0;
 		thp_retry = 0;
 		nr_retry_pages = 0;
 
-		/* Move the unmapped folios */
+		/* Move the remaining unmapped folios */
 		migrate_folios_move(&unmap_folios, &dst_folios,
 				put_new_folio, private, mode, reason,
 				ret_folios, stats, &retry, &thp_retry,
-- 
2.43.0



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC V3 4/9] mm/migrate: add migrate_folios_batch_move to  batch the folio move operations
  2025-09-23 17:47 ` [RFC V3 4/9] mm/migrate: add migrate_folios_batch_move to batch the folio move operations Shivank Garg
@ 2025-10-02 11:03   ` Jonathan Cameron
  2025-10-16  9:17     ` Garg, Shivank
  0 siblings, 1 reply; 26+ messages in thread
From: Jonathan Cameron @ 2025-10-02 11:03 UTC (permalink / raw)
  To: Shivank Garg
  Cc: akpm, david, ziy, willy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, lorenzo.stoakes,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, vkoul,
	lucas.demarchi, rdunlap, jgg, kuba, justonli, ivecera,
	dave.jiang, dan.j.williams, rientjes,
	Raghavendra.KodsaraThimmappa, bharata, alirad.malek, yiannis,
	weixugc, linux-kernel, linux-mm

On Tue, 23 Sep 2025 17:47:39 +0000
Shivank Garg <shivankg@amd.com> wrote:

> This is a preparatory patch that enables batch copying for folios
> undergoing migration. By enabling batch copying the folio content, we can
> efficiently utilize the capabilities of DMA hardware or multi-threaded
> folio copy. It uses MIGRATE_NO_COPY to skip folio copy during metadata
> copy process and performed the copies in a batch later.
> 
> Currently, the folio move operation is performed individually for each
> folio in sequential manner:
> for_each_folio() {
>         Copy folio metadata like flags and mappings
>         Copy the folio content from src to dst
>         Update page tables with dst folio
> }
> 
> With this patch, we transition to a batch processing approach as shown
> below:
> for_each_folio() {
>         Copy folio metadata like flags and mappings
> }
> Batch copy all src folios to dst
> for_each_folio() {
>         Update page tables with dst folios
> }
> 
> dst->private is used to store page states and possible anon_vma value,
> thus needs to be cleared during metadata copy process. To avoid additional
> memory allocation to store the data during batch copy process, src->private
> is used to store the data after metadata copy process, since src is no
> longer used.
> 
> Co-developed-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: Shivank Garg <shivankg@amd.com>
> ---
>  mm/migrate.c | 197 +++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 193 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 3fe78ecb146a..ce94e73a930d 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -843,12 +843,15 @@ static int __migrate_folio(struct address_space *mapping, struct folio *dst,
>  			   enum migrate_mode mode)
>  {
>  	int rc, expected_count = folio_expected_ref_count(src) + 1;
> +	unsigned long dst_private = (unsigned long)dst->private;
Why not just stash it in a void * and void the casts?

>  
>  	/* Check whether src does not have extra refs before we do more work */
>  	if (folio_ref_count(src) != expected_count)
>  		return -EAGAIN;
>  
> -	if (mode != MIGRATE_NO_COPY) {
> +	if (mode == MIGRATE_NO_COPY) {
> +		dst->private = NULL;
> +	} else {
>  		rc = folio_mc_copy(dst, src);
>  		if (unlikely(rc))
>  			return rc;
> @@ -862,6 +865,10 @@ static int __migrate_folio(struct address_space *mapping, struct folio *dst,
>  		folio_attach_private(dst, folio_detach_private(src));
>  
>  	folio_migrate_flags(dst, src);
> +
> +	if (mode == MIGRATE_NO_COPY)

I'd add a comment on what you mention in the commit message about this being a safe place
to stash this.

> +		src->private = (void *)dst_private;
> +
>  	return MIGRATEPAGE_SUCCESS;
>  }
>  
> @@ -1149,7 +1156,7 @@ static void __migrate_folio_record(struct folio *dst,
>  	dst->private = (void *)anon_vma + old_page_state;
>  }
>  
> -static void __migrate_folio_extract(struct folio *dst,
> +static void __migrate_folio_read(struct folio *dst,
>  				   int *old_page_state,
>  				   struct anon_vma **anon_vmap)
>  {
> @@ -1157,6 +1164,12 @@ static void __migrate_folio_extract(struct folio *dst,
>  
>  	*anon_vmap = (struct anon_vma *)(private & ~PAGE_OLD_STATES);
>  	*old_page_state = private & PAGE_OLD_STATES;
> +}

Probably a blank line here.

> +static void __migrate_folio_extract(struct folio *dst,
> +				   int *old_page_state,
> +				   struct anon_vma **anon_vmap)
> +{
> +	__migrate_folio_read(dst, old_page_state, anon_vmap);
>  	dst->private = NULL;
>  }
>  
> @@ -1776,6 +1789,176 @@ static void migrate_folios_move(struct list_head *src_folios,
>  	}
>  }
>  
> +static void migrate_folios_batch_move(struct list_head *src_folios,
> +		struct list_head *dst_folios,
> +		free_folio_t put_new_folio, unsigned long private,
> +		enum migrate_mode mode, int reason,
> +		struct list_head *ret_folios,
> +		struct migrate_pages_stats *stats,
> +		int *retry, int *thp_retry, int *nr_failed,
> +		int *nr_retry_pages)
> +{
> +	struct folio *folio, *folio2, *dst, *dst2;
> +	int rc, nr_pages = 0, nr_batched_folios = 0;
> +	int old_page_state = 0;
> +	struct anon_vma *anon_vma = NULL;
> +	int is_thp = 0;

Always set in each loop before use. So no need to init here that I can see.

> +	LIST_HEAD(err_src);
> +	LIST_HEAD(err_dst);

> +	/* Batch copy the folios */
> +	rc = folios_mc_copy(dst_folios, src_folios, nr_batched_folios);
> +
> +	/* TODO:  Is there a better way of handling the poison
> +	 * recover for batch copy, instead of falling back to serial copy?

Is there a reason we might expect this to be common enough to care about
not using the serial path?

> +	 */
> +	/* fallback to serial page copy if needed */
> +	if (rc) {
> +		dst = list_first_entry(dst_folios, struct folio, lru);
> +		dst2 = list_next_entry(dst, lru);
> +		list_for_each_entry_safe(folio, folio2, src_folios, lru) {
> +			is_thp = folio_test_large(folio) &&
> +				 folio_test_pmd_mappable(folio);
> +			nr_pages = folio_nr_pages(folio);
> +			rc = folio_mc_copy(dst, folio);
> +



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC V3 4/9] mm/migrate: add migrate_folios_batch_move to batch the folio move operations
  2025-10-02 11:03   ` Jonathan Cameron
@ 2025-10-16  9:17     ` Garg, Shivank
  0 siblings, 0 replies; 26+ messages in thread
From: Garg, Shivank @ 2025-10-16  9:17 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: akpm, david, ziy, willy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, lorenzo.stoakes,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, vkoul,
	lucas.demarchi, rdunlap, jgg, kuba, justonli, ivecera,
	dave.jiang, dan.j.williams, rientjes,
	Raghavendra.KodsaraThimmappa, bharata, alirad.malek, yiannis,
	weixugc, linux-kernel, linux-mm



On 10/2/2025 4:33 PM, Jonathan Cameron wrote:
>> +	/* TODO:  Is there a better way of handling the poison
>> +	 * recover for batch copy, instead of falling back to serial copy?
> Is there a reason we might expect this to be common enough to care about
> not using the serial path?

Not common enough, I guess!

> 
>> +	 */
>> +	/* fallback to serial page copy if needed */
>> +	if (rc) {
>> +		dst = list_first_entry(dst_folios, struct folio, lru);
>> +		dst2 = list_next_entry(dst, lru);
>> +		list_for_each_entry_safe(folio, folio2, src_folios, lru) {
>> +			is_thp = folio_test_large(folio) &&
>> +				 folio_test_pmd_mappable(folio);
>> +			nr_pages = folio_nr_pages(folio);
>> +			rc = folio_mc_copy(dst, folio);



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC V3 5/9] mm: add support for copy offload for folio Migration
  2025-09-23 17:47 [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload Shivank Garg
                   ` (3 preceding siblings ...)
  2025-09-23 17:47 ` [RFC V3 4/9] mm/migrate: add migrate_folios_batch_move to batch the folio move operations Shivank Garg
@ 2025-09-23 17:47 ` Shivank Garg
  2025-10-02 11:10   ` Jonathan Cameron
  2025-09-23 17:47 ` [RFC V3 6/9] mtcopy: introduce multi-threaded page copy routine Shivank Garg
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Shivank Garg @ 2025-09-23 17:47 UTC (permalink / raw)
  To: akpm, david
  Cc: ziy, willy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, vkoul, lucas.demarchi, rdunlap,
	jgg, kuba, justonli, ivecera, dave.jiang, Jonathan.Cameron,
	dan.j.williams, rientjes, Raghavendra.KodsaraThimmappa, bharata,
	shivankg, alirad.malek, yiannis, weixugc, linux-kernel, linux-mm

From: Mike Day <michael.day@amd.com>

Offload-Copy drivers should implement following functions to enable folio
migration offloading:
migrate_offc() - This function takes src and dst folios list undergoing
migration. It is responsible for transfer of page content between the
src and dst folios.
can_migrate_offc() - It performs necessary checks if offload copying
migration is supported for the give src and dst folios.

Offload-Copy driver should include a mechanism to call start_offloading and
stop_offloading for enabling and disabling migration offload respectively.

Signed-off-by: Mike Day <michael.day@amd.com>
Co-developed-by: Shivank Garg <shivankg@amd.com>
Signed-off-by: Shivank Garg <shivankg@amd.com>
---
 include/linux/migrate_offc.h | 34 +++++++++++++++++++++
 mm/Kconfig                   |  8 +++++
 mm/Makefile                  |  1 +
 mm/migrate.c                 | 49 +++++++++++++++++++++++++++++-
 mm/migrate_offc.c            | 58 ++++++++++++++++++++++++++++++++++++
 5 files changed, 149 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/migrate_offc.h
 create mode 100644 mm/migrate_offc.c

diff --git a/include/linux/migrate_offc.h b/include/linux/migrate_offc.h
new file mode 100644
index 000000000000..e9e8a30f40f0
--- /dev/null
+++ b/include/linux/migrate_offc.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _MIGRATE_OFFC_H
+#define _MIGRATE_OFFC_H
+#include <linux/migrate_mode.h>
+
+#define MIGRATOR_NAME_LEN 32
+struct migrator {
+	char name[MIGRATOR_NAME_LEN];
+	int (*migrate_offc)(struct list_head *dst_list, struct list_head *src_list,
+			    unsigned int folio_cnt);
+	struct rcu_head srcu_head;
+	struct module *owner;
+};
+
+extern struct migrator migrator;
+extern struct mutex migrator_mut;
+extern struct srcu_struct mig_srcu;
+
+#ifdef CONFIG_OFFC_MIGRATION
+void srcu_mig_cb(struct rcu_head *head);
+int offc_update_migrator(struct migrator *mig);
+unsigned char *get_active_migrator_name(void);
+int start_offloading(struct migrator *migrator);
+int stop_offloading(void);
+#else
+static inline void srcu_mig_cb(struct rcu_head *head) { };
+static inline int offc_update_migrator(struct migrator *mig) { return 0; };
+static inline unsigned char *get_active_migrator_name(void) { return NULL; };
+static inline void start_offloading(struct migrator *migrator) { return 0; };
+static inline void stop_offloading(void) { return 0; };
+#endif /* CONFIG_OFFC_MIGRATION */
+
+#endif /* _MIGRATE_OFFC_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index e443fe8cd6cf..a9cbb8d1f1f6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -689,6 +689,14 @@ config MIGRATION
 config DEVICE_MIGRATION
 	def_bool MIGRATION && ZONE_DEVICE
 
+config OFFC_MIGRATION
+	bool "Migrate Pages offloading copy"
+	def_bool n
+	depends on MIGRATION
+	help
+	 An interface allowing external modules or driver to offload
+	 page copying in page migration.
+
 config ARCH_ENABLE_HUGEPAGE_MIGRATION
 	bool
 
diff --git a/mm/Makefile b/mm/Makefile
index ef54aa615d9d..f609d3899992 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -96,6 +96,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_FAIL_PAGE_ALLOC) += fail_page_alloc.o
 obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_OFFC_MIGRATION) += migrate_offc.o
 obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
diff --git a/mm/migrate.c b/mm/migrate.c
index ce94e73a930d..41bea48d823c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -43,6 +43,7 @@
 #include <linux/sched/sysctl.h>
 #include <linux/memory-tiers.h>
 #include <linux/pagewalk.h>
+#include <linux/migrate_offc.h>
 
 #include <asm/tlbflush.h>
 
@@ -834,6 +835,52 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio)
 }
 EXPORT_SYMBOL(folio_migrate_flags);
 
+#ifdef CONFIG_HAVE_STATIC_CALL
+DEFINE_STATIC_CALL(_folios_copy, folios_mc_copy);
+
+#ifdef CONFIG_OFFC_MIGRATION
+void srcu_mig_cb(struct rcu_head *head)
+{
+	static_call_query(_folios_copy);
+}
+
+int offc_update_migrator(struct migrator *mig)
+{
+	struct module *old_owner, *new_owner;
+	int index;
+	int ret = 0;
+
+	mutex_lock(&migrator_mut);
+	index = srcu_read_lock(&mig_srcu);
+	old_owner = READ_ONCE(migrator.owner);
+	new_owner = mig ? mig->owner : NULL;
+
+	if (new_owner && !try_module_get(new_owner)) {
+		ret = -ENODEV;
+		goto out_unlock;
+	}
+
+	strscpy(migrator.name, mig ? mig->name : "kernel", MIGRATOR_NAME_LEN);
+	static_call_update(_folios_copy, mig ? mig->migrate_offc : folios_mc_copy);
+	xchg(&migrator.owner, mig ? mig->owner : NULL);
+	if (old_owner)
+		module_put(old_owner);
+
+out_unlock:
+	WARN_ON(ret < 0);
+	srcu_read_unlock(&mig_srcu, index);
+	mutex_unlock(&migrator_mut);
+
+	if (ret == 0) {
+		call_srcu(&mig_srcu, &migrator.srcu_head, srcu_mig_cb);
+		srcu_barrier(&mig_srcu);
+	}
+	return ret;
+}
+
+#endif /* CONFIG_OFFC_MIGRATION */
+#endif /* CONFIG_HAVE_STATIC_CALL */
+
 /************************************************************
  *                    Migration functions
  ***********************************************************/
@@ -1870,7 +1917,7 @@ static void migrate_folios_batch_move(struct list_head *src_folios,
 		goto out;
 
 	/* Batch copy the folios */
-	rc = folios_mc_copy(dst_folios, src_folios, nr_batched_folios);
+	rc = static_call(_folios_copy)(dst_folios, src_folios, nr_batched_folios);
 
 	/* TODO:  Is there a better way of handling the poison
 	 * recover for batch copy, instead of falling back to serial copy?
diff --git a/mm/migrate_offc.c b/mm/migrate_offc.c
new file mode 100644
index 000000000000..a6530658a3f7
--- /dev/null
+++ b/mm/migrate_offc.c
@@ -0,0 +1,58 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/migrate.h>
+#include <linux/migrate_offc.h>
+#include <linux/rculist.h>
+#include <linux/static_call.h>
+
+atomic_t dispatch_to_offc = ATOMIC_INIT(0);
+EXPORT_SYMBOL_GPL(dispatch_to_offc);
+
+DEFINE_MUTEX(migrator_mut);
+DEFINE_SRCU(mig_srcu);
+
+struct migrator migrator = {
+	.name = "kernel",
+	.migrate_offc = folios_mc_copy,
+	.srcu_head.func = srcu_mig_cb,
+	.owner = NULL,
+};
+
+int start_offloading(struct migrator *m)
+{
+	int offloading = 0;
+	int ret;
+
+	pr_info("starting migration offload by %s\n", m->name);
+	ret = offc_update_migrator(m);
+	if (ret < 0) {
+		pr_err("failed to start migration offload by %s, err=%d\n",
+		       m->name, ret);
+		return ret;
+	}
+	atomic_try_cmpxchg(&dispatch_to_offc, &offloading, 1);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(start_offloading);
+
+int stop_offloading(void)
+{
+	int offloading = 1;
+	int ret;
+
+	pr_info("stopping migration offload by %s\n", migrator.name);
+	ret = offc_update_migrator(NULL);
+	if (ret < 0) {
+		pr_err("failed to stop migration offload by %s, err=%d\n",
+		       migrator.name, ret);
+		return ret;
+	}
+	atomic_try_cmpxchg(&dispatch_to_offc, &offloading, 0);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(stop_offloading);
+
+unsigned char *get_active_migrator_name(void)
+{
+	return migrator.name;
+}
+EXPORT_SYMBOL_GPL(get_active_migrator_name);
-- 
2.43.0



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC V3 5/9] mm: add support for copy offload for folio Migration
  2025-09-23 17:47 ` [RFC V3 5/9] mm: add support for copy offload for folio Migration Shivank Garg
@ 2025-10-02 11:10   ` Jonathan Cameron
  2025-10-16  9:40     ` Garg, Shivank
  0 siblings, 1 reply; 26+ messages in thread
From: Jonathan Cameron @ 2025-10-02 11:10 UTC (permalink / raw)
  To: Shivank Garg, lorenzo.stoakes
  Cc: akpm, david, ziy, willy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, Liam.Howlett, vbabka,
	rppt, surenb, mhocko, vkoul, lucas.demarchi, rdunlap, jgg, kuba,
	justonli, ivecera, dave.jiang, dan.j.williams, rientjes,
	Raghavendra.KodsaraThimmappa, bharata, alirad.malek, yiannis,
	weixugc, linux-kernel, linux-mm

On Tue, 23 Sep 2025 17:47:40 +0000
Shivank Garg <shivankg@amd.com> wrote:

> From: Mike Day <michael.day@amd.com>
> 
> Offload-Copy drivers should implement following functions to enable folio
> migration offloading:
> migrate_offc() - This function takes src and dst folios list undergoing

Trivial but I'd burn the characters to just spell out offc.
migrate_offload_copy() isn't exactly long.

> migration. It is responsible for transfer of page content between the
> src and dst folios.
> can_migrate_offc() - It performs necessary checks if offload copying
> migration is supported for the give src and dst folios.
> 
> Offload-Copy driver should include a mechanism to call start_offloading and
> stop_offloading for enabling and disabling migration offload respectively.
> 
> Signed-off-by: Mike Day <michael.day@amd.com>
> Co-developed-by: Shivank Garg <shivankg@amd.com>
> Signed-off-by: Shivank Garg <shivankg@amd.com>

Just a trivial comment inline.

Ultimately feels like more complexity will be needed to deal with
multiple providers of copying facilities being available, but I guess
this works for now.

Jonathan

> diff --git a/mm/migrate_offc.c b/mm/migrate_offc.c
> new file mode 100644
> index 000000000000..a6530658a3f7
> --- /dev/null
> +++ b/mm/migrate_offc.c
> @@ -0,0 +1,58 @@


> +
> +struct migrator migrator = {
> +	.name = "kernel",
> +	.migrate_offc = folios_mc_copy,
> +	.srcu_head.func = srcu_mig_cb,
> +	.owner = NULL,

No point in setting this to null explicitly unless intent
is to act as some sort of documentation.


> +};
> +
> +int start_offloading(struct migrator *m)
> +{
> +	int offloading = 0;
> +	int ret;
> +
> +	pr_info("starting migration offload by %s\n", m->name);
> +	ret = offc_update_migrator(m);
> +	if (ret < 0) {
> +		pr_err("failed to start migration offload by %s, err=%d\n",
> +		       m->name, ret);
> +		return ret;
> +	}
> +	atomic_try_cmpxchg(&dispatch_to_offc, &offloading, 1);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(start_offloading);
> +
> +int stop_offloading(void)
> +{
> +	int offloading = 1;
> +	int ret;
> +
> +	pr_info("stopping migration offload by %s\n", migrator.name);
> +	ret = offc_update_migrator(NULL);
> +	if (ret < 0) {
> +		pr_err("failed to stop migration offload by %s, err=%d\n",
> +		       migrator.name, ret);
> +		return ret;
> +	}
> +	atomic_try_cmpxchg(&dispatch_to_offc, &offloading, 0);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(stop_offloading);



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC V3 5/9] mm: add support for copy offload for folio Migration
  2025-10-02 11:10   ` Jonathan Cameron
@ 2025-10-16  9:40     ` Garg, Shivank
  0 siblings, 0 replies; 26+ messages in thread
From: Garg, Shivank @ 2025-10-16  9:40 UTC (permalink / raw)
  To: Jonathan Cameron, lorenzo.stoakes
  Cc: akpm, david, ziy, willy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, Liam.Howlett, vbabka,
	rppt, surenb, mhocko, vkoul, lucas.demarchi, rdunlap, jgg, kuba,
	justonli, ivecera, dave.jiang, dan.j.williams, rientjes,
	Raghavendra.KodsaraThimmappa, bharata, alirad.malek, yiannis,
	weixugc, linux-kernel, linux-mm



On 10/2/2025 4:40 PM, Jonathan Cameron wrote:
> Ultimately feels like more complexity will be needed to deal with
> multiple providers of copying facilities being available, but I guess
> this works for now.

I agree. The current design is simple to keep the implementation clean and focused.

Depending on usecases, if there is need to support multiple concurrent migrators,
we will need to revisit this design and implement a more dynamic selection mechanism.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC V3 6/9] mtcopy: introduce multi-threaded page copy routine
  2025-09-23 17:47 [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload Shivank Garg
                   ` (4 preceding siblings ...)
  2025-09-23 17:47 ` [RFC V3 5/9] mm: add support for copy offload for folio Migration Shivank Garg
@ 2025-09-23 17:47 ` Shivank Garg
  2025-10-02 11:29   ` Jonathan Cameron
  2025-10-20  8:28   ` Byungchul Park
  2025-09-23 17:47 ` [RFC V3 7/9] dcbm: add dma core batch migrator for batch page offloading Shivank Garg
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 26+ messages in thread
From: Shivank Garg @ 2025-09-23 17:47 UTC (permalink / raw)
  To: akpm, david
  Cc: ziy, willy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, vkoul, lucas.demarchi, rdunlap,
	jgg, kuba, justonli, ivecera, dave.jiang, Jonathan.Cameron,
	dan.j.williams, rientjes, Raghavendra.KodsaraThimmappa, bharata,
	shivankg, alirad.malek, yiannis, weixugc, linux-kernel, linux-mm

From: Zi Yan <ziy@nvidia.com>

Now page copies are batched, multi-threaded page copy can be used to
increase page copy throughput.

Enable using:
echo 1 >  /sys/kernel/cpu_mt/offloading
echo NR_THREADS >  /sys/kernel/cpu_mt/threads

Disable:
echo 0 >  /sys/kernel/cpu_mt/offloading

Signed-off-by: Zi Yan <ziy@nvidia.com>
Co-developed-by: Shivank Garg <shivankg@amd.com>
Signed-off-by: Shivank Garg <shivankg@amd.com>
---
 drivers/Kconfig                        |   2 +
 drivers/Makefile                       |   3 +
 drivers/migoffcopy/Kconfig             |   9 +
 drivers/migoffcopy/Makefile            |   1 +
 drivers/migoffcopy/mtcopy/Makefile     |   1 +
 drivers/migoffcopy/mtcopy/copy_pages.c | 327 +++++++++++++++++++++++++
 6 files changed, 343 insertions(+)
 create mode 100644 drivers/migoffcopy/Kconfig
 create mode 100644 drivers/migoffcopy/Makefile
 create mode 100644 drivers/migoffcopy/mtcopy/Makefile
 create mode 100644 drivers/migoffcopy/mtcopy/copy_pages.c

diff --git a/drivers/Kconfig b/drivers/Kconfig
index 4915a63866b0..d2cbc97a7683 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -251,4 +251,6 @@ source "drivers/hte/Kconfig"
 
 source "drivers/cdx/Kconfig"
 
+source "drivers/migoffcopy/Kconfig"
+
 endmenu
diff --git a/drivers/Makefile b/drivers/Makefile
index b5749cf67044..5326d88cf31c 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -42,6 +42,9 @@ obj-y				+= clk/
 # really early.
 obj-$(CONFIG_DMADEVICES)	+= dma/
 
+# Migration copy Offload
+obj-$(CONFIG_OFFC_MIGRATION)	+= migoffcopy/
+
 # SOC specific infrastructure drivers.
 obj-y				+= soc/
 obj-$(CONFIG_PM_GENERIC_DOMAINS)	+= pmdomain/
diff --git a/drivers/migoffcopy/Kconfig b/drivers/migoffcopy/Kconfig
new file mode 100644
index 000000000000..e73698af3e72
--- /dev/null
+++ b/drivers/migoffcopy/Kconfig
@@ -0,0 +1,9 @@
+config MTCOPY_CPU
+       bool "Multi-Threaded Copy with CPU"
+       depends on OFFC_MIGRATION
+       default n
+       help
+         Interface MT COPY CPU driver for batch page migration
+         offloading. Say Y if you want to try offloading with
+         MultiThreaded CPU copy APIs.
+
diff --git a/drivers/migoffcopy/Makefile b/drivers/migoffcopy/Makefile
new file mode 100644
index 000000000000..0a3c356d67e6
--- /dev/null
+++ b/drivers/migoffcopy/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_MTCOPY_CPU) += mtcopy/
diff --git a/drivers/migoffcopy/mtcopy/Makefile b/drivers/migoffcopy/mtcopy/Makefile
new file mode 100644
index 000000000000..b4d7da85eda9
--- /dev/null
+++ b/drivers/migoffcopy/mtcopy/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_MTCOPY_CPU) += copy_pages.o
diff --git a/drivers/migoffcopy/mtcopy/copy_pages.c b/drivers/migoffcopy/mtcopy/copy_pages.c
new file mode 100644
index 000000000000..68e50de602d6
--- /dev/null
+++ b/drivers/migoffcopy/mtcopy/copy_pages.c
@@ -0,0 +1,327 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Parallel page copy routine.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/printk.h>
+#include <linux/init.h>
+#include <linux/sysctl.h>
+#include <linux/sysfs.h>
+#include <linux/highmem.h>
+#include <linux/workqueue.h>
+#include <linux/slab.h>
+#include <linux/migrate.h>
+#include <linux/migrate_offc.h>
+
+#define MAX_NUM_COPY_THREADS 64
+
+unsigned int limit_mt_num = 4;
+static int is_dispatching;
+
+static int copy_page_lists_mt(struct list_head *dst_folios,
+		struct list_head *src_folios, unsigned int nr_items);
+
+static DEFINE_MUTEX(migratecfg_mutex);
+
+/* CPU Multithreaded Batch Migrator */
+struct migrator cpu_migrator = {
+	.name = "CPU_MT_COPY\0",
+	.migrate_offc = copy_page_lists_mt,
+	.owner = THIS_MODULE,
+};
+
+struct copy_item {
+	char *to;
+	char *from;
+	unsigned long chunk_size;
+};
+
+struct copy_page_info {
+	struct work_struct copy_page_work;
+	int ret;
+	unsigned long num_items;
+	struct copy_item item_list[];
+};
+
+static unsigned long copy_page_routine(char *vto, char *vfrom,
+	unsigned long chunk_size)
+{
+	return copy_mc_to_kernel(vto, vfrom, chunk_size);
+}
+
+static void copy_page_work_queue_thread(struct work_struct *work)
+{
+	struct copy_page_info *my_work = (struct copy_page_info *)work;
+	int i;
+
+	my_work->ret = 0;
+	for (i = 0; i < my_work->num_items; ++i)
+		my_work->ret |= !!copy_page_routine(my_work->item_list[i].to,
+					my_work->item_list[i].from,
+					my_work->item_list[i].chunk_size);
+}
+
+static ssize_t mt_offloading_set(struct kobject *kobj, struct kobj_attribute *attr,
+		const char *buf, size_t count)
+{
+	int ccode;
+	int action;
+
+	ccode = kstrtoint(buf, 0, &action);
+	if (ccode) {
+		pr_debug("(%s:) error parsing input %s\n", __func__, buf);
+		return ccode;
+	}
+
+	/*
+	 * action is 0: User wants to disable MT offloading.
+	 * action is 1: User wants to enable MT offloading.
+	 */
+	switch (action) {
+	case 0:
+		mutex_lock(&migratecfg_mutex);
+		if (is_dispatching == 1) {
+			stop_offloading();
+			is_dispatching = 0;
+		} else
+			pr_debug("MT migration offloading is already OFF\n");
+		mutex_unlock(&migratecfg_mutex);
+		break;
+	case 1:
+		mutex_lock(&migratecfg_mutex);
+		if (is_dispatching == 0) {
+			start_offloading(&cpu_migrator);
+			is_dispatching = 1;
+		} else
+			pr_debug("MT migration offloading is already ON\n");
+		mutex_unlock(&migratecfg_mutex);
+		break;
+	default:
+		pr_debug("input should be zero or one, parsed as %d\n", action);
+	}
+	return sizeof(action);
+}
+
+static ssize_t mt_offloading_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%d\n", is_dispatching);
+}
+
+static ssize_t mt_threads_set(struct kobject *kobj, struct kobj_attribute *attr,
+		const char *buf, size_t count)
+{
+	int ccode;
+	unsigned int threads;
+
+	ccode = kstrtouint(buf, 0, &threads);
+	if (ccode) {
+		pr_debug("(%s:) error parsing input %s\n", __func__, buf);
+		return ccode;
+	}
+
+	if (threads > 0 && threads <= MAX_NUM_COPY_THREADS) {
+		mutex_lock(&migratecfg_mutex);
+		limit_mt_num = threads;
+		mutex_unlock(&migratecfg_mutex);
+		pr_debug("MT threads set to %u\n", limit_mt_num);
+	} else {
+		pr_debug("Invalid thread count. Must be between 1 and %d\n", MAX_NUM_COPY_THREADS);
+		return -EINVAL;
+	}
+
+	return count;
+}
+
+static ssize_t mt_threads_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%u\n", limit_mt_num);
+}
+
+int copy_page_lists_mt(struct list_head *dst_folios,
+		struct list_head *src_folios, unsigned int nr_items)
+{
+	struct copy_page_info *work_items[MAX_NUM_COPY_THREADS] = {0};
+	unsigned int total_mt_num = limit_mt_num;
+	struct folio *src, *src2, *dst, *dst2;
+	int max_items_per_thread;
+	int item_idx;
+	int err = 0;
+	int cpu;
+	int i;
+
+	if (IS_ENABLED(CONFIG_HIGHMEM))
+		return -EOPNOTSUPP;
+
+	/* Each threads get part of each page, if nr_items < totla_mt_num */
+	if (nr_items < total_mt_num)
+		max_items_per_thread = nr_items;
+	else
+		max_items_per_thread = (nr_items / total_mt_num) +
+				((nr_items % total_mt_num) ? 1 : 0);
+
+
+	for (cpu = 0; cpu < total_mt_num; ++cpu) {
+		work_items[cpu] = kzalloc(sizeof(struct copy_page_info) +
+						sizeof(struct copy_item) *
+							max_items_per_thread,
+					  GFP_NOWAIT);
+		if (!work_items[cpu]) {
+			err = -ENOMEM;
+			goto free_work_items;
+		}
+	}
+
+	if (nr_items < total_mt_num) {
+		for (cpu = 0; cpu < total_mt_num; ++cpu) {
+			INIT_WORK((struct work_struct *)work_items[cpu],
+					  copy_page_work_queue_thread);
+			work_items[cpu]->num_items = max_items_per_thread;
+		}
+
+		item_idx = 0;
+		dst = list_first_entry(dst_folios, struct folio, lru);
+		dst2 = list_next_entry(dst, lru);
+		list_for_each_entry_safe(src, src2, src_folios, lru) {
+			unsigned long chunk_size = PAGE_SIZE * folio_nr_pages(src) / total_mt_num;
+			char *vfrom = page_address(&src->page);
+			char *vto = page_address(&dst->page);
+
+			VM_WARN_ON(PAGE_SIZE * folio_nr_pages(src) % total_mt_num);
+			VM_WARN_ON(folio_nr_pages(dst) != folio_nr_pages(src));
+
+			for (cpu = 0; cpu < total_mt_num; ++cpu) {
+				work_items[cpu]->item_list[item_idx].to =
+					vto + chunk_size * cpu;
+				work_items[cpu]->item_list[item_idx].from =
+					vfrom + chunk_size * cpu;
+				work_items[cpu]->item_list[item_idx].chunk_size =
+					chunk_size;
+			}
+
+			item_idx++;
+			dst = dst2;
+			dst2 = list_next_entry(dst, lru);
+		}
+
+		for (cpu = 0; cpu < total_mt_num; ++cpu)
+			queue_work(system_unbound_wq,
+				   (struct work_struct *)work_items[cpu]);
+	} else {
+		int num_xfer_per_thread = nr_items / total_mt_num;
+		int per_cpu_item_idx;
+
+
+		for (cpu = 0; cpu < total_mt_num; ++cpu) {
+			INIT_WORK((struct work_struct *)work_items[cpu],
+					  copy_page_work_queue_thread);
+
+			work_items[cpu]->num_items = num_xfer_per_thread +
+					(cpu < (nr_items % total_mt_num));
+		}
+
+		cpu = 0;
+		per_cpu_item_idx = 0;
+		item_idx = 0;
+		dst = list_first_entry(dst_folios, struct folio, lru);
+		dst2 = list_next_entry(dst, lru);
+		list_for_each_entry_safe(src, src2, src_folios, lru) {
+			work_items[cpu]->item_list[per_cpu_item_idx].to =
+				page_address(&dst->page);
+			work_items[cpu]->item_list[per_cpu_item_idx].from =
+				page_address(&src->page);
+			work_items[cpu]->item_list[per_cpu_item_idx].chunk_size =
+				PAGE_SIZE * folio_nr_pages(src);
+
+			VM_WARN_ON(folio_nr_pages(dst) !=
+				   folio_nr_pages(src));
+
+			per_cpu_item_idx++;
+			item_idx++;
+			dst = dst2;
+			dst2 = list_next_entry(dst, lru);
+
+			if (per_cpu_item_idx == work_items[cpu]->num_items) {
+				queue_work(system_unbound_wq,
+					(struct work_struct *)work_items[cpu]);
+				per_cpu_item_idx = 0;
+				cpu++;
+			}
+		}
+		if (item_idx != nr_items)
+			pr_warn("%s: only %d out of %d pages are transferred\n",
+				__func__, item_idx - 1, nr_items);
+	}
+
+	/* Wait until it finishes  */
+	for (i = 0; i < total_mt_num; ++i) {
+		flush_work((struct work_struct *)work_items[i]);
+		/* retry if any copy fails */
+		if (work_items[i]->ret)
+			err = -EAGAIN;
+	}
+
+free_work_items:
+	for (cpu = 0; cpu < total_mt_num; ++cpu)
+		kfree(work_items[cpu]);
+
+	return err;
+}
+
+static struct kobject *mt_kobj_ref;
+static struct kobj_attribute mt_offloading_attribute = __ATTR(offloading, 0664,
+		mt_offloading_show, mt_offloading_set);
+static struct kobj_attribute mt_threads_attribute = __ATTR(threads, 0664,
+		mt_threads_show, mt_threads_set);
+
+static int __init cpu_mt_module_init(void)
+{
+	int ret = 0;
+
+	mt_kobj_ref = kobject_create_and_add("cpu_mt", kernel_kobj);
+	if (!mt_kobj_ref)
+		return -ENOMEM;
+
+	ret = sysfs_create_file(mt_kobj_ref, &mt_offloading_attribute.attr);
+	if (ret)
+		goto out_offloading;
+
+	ret = sysfs_create_file(mt_kobj_ref, &mt_threads_attribute.attr);
+	if (ret)
+		goto out_threads;
+
+	is_dispatching = 0;
+
+	return 0;
+
+out_threads:
+	sysfs_remove_file(mt_kobj_ref, &mt_offloading_attribute.attr);
+out_offloading:
+	kobject_put(mt_kobj_ref);
+	return ret;
+}
+
+static void __exit cpu_mt_module_exit(void)
+{
+	/* Stop the MT offloading to unload the module */
+	mutex_lock(&migratecfg_mutex);
+	if (is_dispatching == 1) {
+		stop_offloading();
+		is_dispatching = 0;
+	}
+	mutex_unlock(&migratecfg_mutex);
+
+	sysfs_remove_file(mt_kobj_ref, &mt_threads_attribute.attr);
+	sysfs_remove_file(mt_kobj_ref, &mt_offloading_attribute.attr);
+	kobject_put(mt_kobj_ref);
+}
+
+module_init(cpu_mt_module_init);
+module_exit(cpu_mt_module_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Zi Yan");
+MODULE_DESCRIPTION("CPU_MT_COPY"); /* CPU Multithreaded Batch Migrator */
-- 
2.43.0



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC V3 6/9] mtcopy: introduce multi-threaded page copy routine
  2025-09-23 17:47 ` [RFC V3 6/9] mtcopy: introduce multi-threaded page copy routine Shivank Garg
@ 2025-10-02 11:29   ` Jonathan Cameron
  2025-10-20  8:28   ` Byungchul Park
  1 sibling, 0 replies; 26+ messages in thread
From: Jonathan Cameron @ 2025-10-02 11:29 UTC (permalink / raw)
  To: Shivank Garg
  Cc: akpm, david, ziy, willy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, lorenzo.stoakes,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, vkoul,
	lucas.demarchi, rdunlap, jgg, kuba, justonli, ivecera,
	dave.jiang, dan.j.williams, rientjes,
	Raghavendra.KodsaraThimmappa, bharata, alirad.malek, yiannis,
	weixugc, linux-kernel, linux-mm

On Tue, 23 Sep 2025 17:47:41 +0000
Shivank Garg <shivankg@amd.com> wrote:

> From: Zi Yan <ziy@nvidia.com>
> 
> Now page copies are batched, multi-threaded page copy can be used to
> increase page copy throughput.
> 
> Enable using:
> echo 1 >  /sys/kernel/cpu_mt/offloading
> echo NR_THREADS >  /sys/kernel/cpu_mt/threads

I guess this order is to show that you can update threads with it on
as system load changes.

Maybe call this out explicitly?

> 
> Disable:
> echo 0 >  /sys/kernel/cpu_mt/offloading
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> Co-developed-by: Shivank Garg <shivankg@amd.com>
> Signed-off-by: Shivank Garg <shivankg@amd.com>

Various other things inline.

Thanks,

Jonathan

> diff --git a/drivers/migoffcopy/Kconfig b/drivers/migoffcopy/Kconfig
> new file mode 100644
> index 000000000000..e73698af3e72
> --- /dev/null
> +++ b/drivers/migoffcopy/Kconfig
> @@ -0,0 +1,9 @@
> +config MTCOPY_CPU
> +       bool "Multi-Threaded Copy with CPU"
> +       depends on OFFC_MIGRATION
> +       default n
> +       help
> +         Interface MT COPY CPU driver for batch page migration
> +         offloading. Say Y if you want to try offloading with
> +         MultiThreaded CPU copy APIs.
Try?  I'd be more positive in the help text :)

> +

> diff --git a/drivers/migoffcopy/mtcopy/copy_pages.c b/drivers/migoffcopy/mtcopy/copy_pages.c
> new file mode 100644
> index 000000000000..68e50de602d6
> --- /dev/null
> +++ b/drivers/migoffcopy/mtcopy/copy_pages.c
> @@ -0,0 +1,327 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Parallel page copy routine.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/kernel.h>
Generally we are trying to get away from anything including kernel.h
directly.  There is relatively little still in there, so maybe check
you actually need it here.

> +#include <linux/printk.h>
> +#include <linux/init.h>
> +#include <linux/sysctl.h>
> +#include <linux/sysfs.h>
> +#include <linux/highmem.h>
> +#include <linux/workqueue.h>
> +#include <linux/slab.h>
> +#include <linux/migrate.h>
> +#include <linux/migrate_offc.h>
> +
> +#define MAX_NUM_COPY_THREADS 64
Add a comment on why this number.

> +
> +struct copy_page_info {
> +	struct work_struct copy_page_work;
> +	int ret;
> +	unsigned long num_items;
> +	struct copy_item item_list[];
__counted_by

> +};
> +
> +static unsigned long copy_page_routine(char *vto, char *vfrom,
> +	unsigned long chunk_size)
> +{
> +	return copy_mc_to_kernel(vto, vfrom, chunk_size);
> +}
> +
> +static void copy_page_work_queue_thread(struct work_struct *work)
> +{
> +	struct copy_page_info *my_work = (struct copy_page_info *)work;

container_of()

> +	int i;
> +
> +	my_work->ret = 0;
> +	for (i = 0; i < my_work->num_items; ++i)
> +		my_work->ret |= !!copy_page_routine(my_work->item_list[i].to,
> +					my_work->item_list[i].from,
> +					my_work->item_list[i].chunk_size);
> +}
> +
> +static ssize_t mt_offloading_set(struct kobject *kobj, struct kobj_attribute *attr,
> +		const char *buf, size_t count)
> +{
> +	int ccode;
> +	int action;
> +
> +	ccode = kstrtoint(buf, 0, &action);
> +	if (ccode) {
> +		pr_debug("(%s:) error parsing input %s\n", __func__, buf);
> +		return ccode;
> +	}
> +
> +	/*
> +	 * action is 0: User wants to disable MT offloading.
> +	 * action is 1: User wants to enable MT offloading.
> +	 */
> +	switch (action) {
> +	case 0:
> +		mutex_lock(&migratecfg_mutex);
> +		if (is_dispatching == 1) {
> +			stop_offloading();
> +			is_dispatching = 0;
> +		} else
> +			pr_debug("MT migration offloading is already OFF\n");
> +		mutex_unlock(&migratecfg_mutex);
> +		break;
> +	case 1:
> +		mutex_lock(&migratecfg_mutex);
> +		if (is_dispatching == 0) {
> +			start_offloading(&cpu_migrator);
> +			is_dispatching = 1;
> +		} else
> +			pr_debug("MT migration offloading is already ON\n");
> +		mutex_unlock(&migratecfg_mutex);
> +		break;
> +	default:
> +		pr_debug("input should be zero or one, parsed as %d\n", action);
> +	}
> +	return sizeof(action);
> +}
> +
> +static ssize_t mt_offloading_show(struct kobject *kobj,
> +		struct kobj_attribute *attr, char *buf)
> +{
> +	return sysfs_emit(buf, "%d\n", is_dispatching);
> +}
> +
> +static ssize_t mt_threads_set(struct kobject *kobj, struct kobj_attribute *attr,
> +		const char *buf, size_t count)
> +{
> +	int ccode;
> +	unsigned int threads;
> +
> +	ccode = kstrtouint(buf, 0, &threads);
> +	if (ccode) {
> +		pr_debug("(%s:) error parsing input %s\n", __func__, buf);

I'm fairly sure you can use dynamic debug here to add the __func__ so no need
to do it by hand.

> +		return ccode;
> +	}
> +
> +	if (threads > 0 && threads <= MAX_NUM_COPY_THREADS) {
> +		mutex_lock(&migratecfg_mutex);
> +		limit_mt_num = threads;
> +		mutex_unlock(&migratecfg_mutex);
> +		pr_debug("MT threads set to %u\n", limit_mt_num);
> +	} else {

I'd flip the logic to test first for in range and exit if not. Then
no indent on the good path.

> +		pr_debug("Invalid thread count. Must be between 1 and %d\n", MAX_NUM_COPY_THREADS);
> +		return -EINVAL;
> +	}
> +
> +	return count;
> +}

> +int copy_page_lists_mt(struct list_head *dst_folios,
> +		struct list_head *src_folios, unsigned int nr_items)
> +{
> +	struct copy_page_info *work_items[MAX_NUM_COPY_THREADS] = {0};

{} or { NULL } perhaps given it's an array of pointers.

> +	unsigned int total_mt_num = limit_mt_num;
> +	struct folio *src, *src2, *dst, *dst2;
> +	int max_items_per_thread;
> +	int item_idx;
> +	int err = 0;
> +	int cpu;
> +	int i;
> +
> +	if (IS_ENABLED(CONFIG_HIGHMEM))
> +		return -EOPNOTSUPP;
> +
> +	/* Each threads get part of each page, if nr_items < totla_mt_num */
Each thread gets part of each page

total_mt_num  Though isn't the comment talking about when it's greater than or equal?


> +	if (nr_items < total_mt_num)
> +		max_items_per_thread = nr_items;
> +	else
> +		max_items_per_thread = (nr_items / total_mt_num) +
> +				((nr_items % total_mt_num) ? 1 : 0);
> +
> +
> +	for (cpu = 0; cpu < total_mt_num; ++cpu) {
> +		work_items[cpu] = kzalloc(sizeof(struct copy_page_info) +
> +						sizeof(struct copy_item) *
> +							max_items_per_thread,

struct_size() looks appropriate here.

> +					  GFP_NOWAIT);
> +		if (!work_items[cpu]) {
> +			err = -ENOMEM;
> +			goto free_work_items;
> +		}
> +	}
> +
> +	if (nr_items < total_mt_num) {
> +		for (cpu = 0; cpu < total_mt_num; ++cpu) {
> +			INIT_WORK((struct work_struct *)work_items[cpu],
Why not avoid having to know it is at start of structure by using
work_items[cpu]->copy_page_work instead.

> +					  copy_page_work_queue_thread);
> +			work_items[cpu]->num_items = max_items_per_thread;
> +		}
> +
> +		item_idx = 0;
> +		dst = list_first_entry(dst_folios, struct folio, lru);
> +		dst2 = list_next_entry(dst, lru);
> +		list_for_each_entry_safe(src, src2, src_folios, lru) {
> +			unsigned long chunk_size = PAGE_SIZE * folio_nr_pages(src) / total_mt_num;
> +			char *vfrom = page_address(&src->page);
> +			char *vto = page_address(&dst->page);
> +
> +			VM_WARN_ON(PAGE_SIZE * folio_nr_pages(src) % total_mt_num);
> +			VM_WARN_ON(folio_nr_pages(dst) != folio_nr_pages(src));
> +
> +			for (cpu = 0; cpu < total_mt_num; ++cpu) {
> +				work_items[cpu]->item_list[item_idx].to =
> +					vto + chunk_size * cpu;
> +				work_items[cpu]->item_list[item_idx].from =
> +					vfrom + chunk_size * cpu;
> +				work_items[cpu]->item_list[item_idx].chunk_size =
> +					chunk_size;
> +			}
> +
> +			item_idx++;
> +			dst = dst2;
> +			dst2 = list_next_entry(dst, lru);
> +		}
> +
> +		for (cpu = 0; cpu < total_mt_num; ++cpu)
> +			queue_work(system_unbound_wq,
> +				   (struct work_struct *)work_items[cpu]);

As above. If you want the work struct, using the member that is the right type.

> +	} else {
> +		int num_xfer_per_thread = nr_items / total_mt_num;
> +		int per_cpu_item_idx;
> +
> +
> +		for (cpu = 0; cpu < total_mt_num; ++cpu) {
> +			INIT_WORK((struct work_struct *)work_items[cpu],

Same again.

> +					  copy_page_work_queue_thread);
> +
> +			work_items[cpu]->num_items = num_xfer_per_thread +
> +					(cpu < (nr_items % total_mt_num));
> +		}
> +
> +		cpu = 0;
> +		per_cpu_item_idx = 0;
> +		item_idx = 0;
> +		dst = list_first_entry(dst_folios, struct folio, lru);
> +		dst2 = list_next_entry(dst, lru);
> +		list_for_each_entry_safe(src, src2, src_folios, lru) {
> +			work_items[cpu]->item_list[per_cpu_item_idx].to =
> +				page_address(&dst->page);
> +			work_items[cpu]->item_list[per_cpu_item_idx].from =
> +				page_address(&src->page);
> +			work_items[cpu]->item_list[per_cpu_item_idx].chunk_size =
> +				PAGE_SIZE * folio_nr_pages(src);
> +
> +			VM_WARN_ON(folio_nr_pages(dst) !=
> +				   folio_nr_pages(src));
> +
> +			per_cpu_item_idx++;
> +			item_idx++;
> +			dst = dst2;
> +			dst2 = list_next_entry(dst, lru);
> +
> +			if (per_cpu_item_idx == work_items[cpu]->num_items) {
> +				queue_work(system_unbound_wq,
> +					(struct work_struct *)work_items[cpu]);
and one more.
> +				per_cpu_item_idx = 0;
> +				cpu++;
> +			}
> +		}
> +		if (item_idx != nr_items)
> +			pr_warn("%s: only %d out of %d pages are transferred\n",
> +				__func__, item_idx - 1, nr_items);
> +	}
> +
> +	/* Wait until it finishes  */
> +	for (i = 0; i < total_mt_num; ++i) {
> +		flush_work((struct work_struct *)work_items[i]);
> +		/* retry if any copy fails */
> +		if (work_items[i]->ret)
> +			err = -EAGAIN;
> +	}
> +
> +free_work_items:
> +	for (cpu = 0; cpu < total_mt_num; ++cpu)
> +		kfree(work_items[cpu]);
> +
> +	return err;
> +}
> +
> +static struct kobject *mt_kobj_ref;
> +static struct kobj_attribute mt_offloading_attribute = __ATTR(offloading, 0664,
> +		mt_offloading_show, mt_offloading_set);
> +static struct kobj_attribute mt_threads_attribute = __ATTR(threads, 0664,
> +		mt_threads_show, mt_threads_set);
> +
> +static int __init cpu_mt_module_init(void)
> +{
> +	int ret = 0;

Always set before use so don't init here.

> +
> +	mt_kobj_ref = kobject_create_and_add("cpu_mt", kernel_kobj);
> +	if (!mt_kobj_ref)
> +		return -ENOMEM;
> +
> +	ret = sysfs_create_file(mt_kobj_ref, &mt_offloading_attribute.attr);
> +	if (ret)
> +		goto out_offloading;
> +
> +	ret = sysfs_create_file(mt_kobj_ref, &mt_threads_attribute.attr);
> +	if (ret)
> +		goto out_threads;
> +
> +	is_dispatching = 0;
> +
> +	return 0;
> +
> +out_threads:
> +	sysfs_remove_file(mt_kobj_ref, &mt_offloading_attribute.attr);
> +out_offloading:
> +	kobject_put(mt_kobj_ref);
> +	return ret;
> +}

> +module_init(cpu_mt_module_init);
> +module_exit(cpu_mt_module_exit);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Zi Yan");
> +MODULE_DESCRIPTION("CPU_MT_COPY"); /* CPU Multithreaded Batch Migrator */

If a module description needs a comment after it I'd rewrite that description!




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC V3 6/9] mtcopy: introduce multi-threaded page copy routine
  2025-09-23 17:47 ` [RFC V3 6/9] mtcopy: introduce multi-threaded page copy routine Shivank Garg
  2025-10-02 11:29   ` Jonathan Cameron
@ 2025-10-20  8:28   ` Byungchul Park
  2025-11-06  6:27     ` Garg, Shivank
  1 sibling, 1 reply; 26+ messages in thread
From: Byungchul Park @ 2025-10-20  8:28 UTC (permalink / raw)
  To: Shivank Garg
  Cc: akpm, david, ziy, willy, matthew.brost, joshua.hahnjy, rakie.kim,
	gourry, ying.huang, apopple, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, vkoul, lucas.demarchi, rdunlap,
	jgg, kuba, justonli, ivecera, dave.jiang, Jonathan.Cameron,
	dan.j.williams, rientjes, Raghavendra.KodsaraThimmappa, bharata,
	alirad.malek, yiannis, weixugc, linux-kernel, linux-mm,
	kernel_team

On Tue, Sep 23, 2025 at 05:47:41PM +0000, Shivank Garg wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Now page copies are batched, multi-threaded page copy can be used to
> increase page copy throughput.
> 
> Enable using:
> echo 1 >  /sys/kernel/cpu_mt/offloading
> echo NR_THREADS >  /sys/kernel/cpu_mt/threads
> 
> Disable:
> echo 0 >  /sys/kernel/cpu_mt/offloading
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> Co-developed-by: Shivank Garg <shivankg@amd.com>
> Signed-off-by: Shivank Garg <shivankg@amd.com>
> ---
>  drivers/Kconfig                        |   2 +
>  drivers/Makefile                       |   3 +
>  drivers/migoffcopy/Kconfig             |   9 +
>  drivers/migoffcopy/Makefile            |   1 +
>  drivers/migoffcopy/mtcopy/Makefile     |   1 +
>  drivers/migoffcopy/mtcopy/copy_pages.c | 327 +++++++++++++++++++++++++
>  6 files changed, 343 insertions(+)
>  create mode 100644 drivers/migoffcopy/Kconfig
>  create mode 100644 drivers/migoffcopy/Makefile
>  create mode 100644 drivers/migoffcopy/mtcopy/Makefile
>  create mode 100644 drivers/migoffcopy/mtcopy/copy_pages.c
> 
> diff --git a/drivers/Kconfig b/drivers/Kconfig
> index 4915a63866b0..d2cbc97a7683 100644
> --- a/drivers/Kconfig
> +++ b/drivers/Kconfig
> @@ -251,4 +251,6 @@ source "drivers/hte/Kconfig"
> 
>  source "drivers/cdx/Kconfig"
> 
> +source "drivers/migoffcopy/Kconfig"
> +
>  endmenu
> diff --git a/drivers/Makefile b/drivers/Makefile
> index b5749cf67044..5326d88cf31c 100644
> --- a/drivers/Makefile
> +++ b/drivers/Makefile
> @@ -42,6 +42,9 @@ obj-y                         += clk/
>  # really early.
>  obj-$(CONFIG_DMADEVICES)       += dma/
> 
> +# Migration copy Offload
> +obj-$(CONFIG_OFFC_MIGRATION)   += migoffcopy/
> +
>  # SOC specific infrastructure drivers.
>  obj-y                          += soc/
>  obj-$(CONFIG_PM_GENERIC_DOMAINS)       += pmdomain/
> diff --git a/drivers/migoffcopy/Kconfig b/drivers/migoffcopy/Kconfig
> new file mode 100644
> index 000000000000..e73698af3e72
> --- /dev/null
> +++ b/drivers/migoffcopy/Kconfig
> @@ -0,0 +1,9 @@
> +config MTCOPY_CPU
> +       bool "Multi-Threaded Copy with CPU"
> +       depends on OFFC_MIGRATION
> +       default n
> +       help
> +         Interface MT COPY CPU driver for batch page migration
> +         offloading. Say Y if you want to try offloading with
> +         MultiThreaded CPU copy APIs.
> +
> diff --git a/drivers/migoffcopy/Makefile b/drivers/migoffcopy/Makefile
> new file mode 100644
> index 000000000000..0a3c356d67e6
> --- /dev/null
> +++ b/drivers/migoffcopy/Makefile
> @@ -0,0 +1 @@
> +obj-$(CONFIG_MTCOPY_CPU) += mtcopy/
> diff --git a/drivers/migoffcopy/mtcopy/Makefile b/drivers/migoffcopy/mtcopy/Makefile
> new file mode 100644
> index 000000000000..b4d7da85eda9
> --- /dev/null
> +++ b/drivers/migoffcopy/mtcopy/Makefile
> @@ -0,0 +1 @@
> +obj-$(CONFIG_MTCOPY_CPU) += copy_pages.o
> diff --git a/drivers/migoffcopy/mtcopy/copy_pages.c b/drivers/migoffcopy/mtcopy/copy_pages.c
> new file mode 100644
> index 000000000000..68e50de602d6
> --- /dev/null
> +++ b/drivers/migoffcopy/mtcopy/copy_pages.c
> @@ -0,0 +1,327 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Parallel page copy routine.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/printk.h>
> +#include <linux/init.h>
> +#include <linux/sysctl.h>
> +#include <linux/sysfs.h>
> +#include <linux/highmem.h>
> +#include <linux/workqueue.h>
> +#include <linux/slab.h>
> +#include <linux/migrate.h>
> +#include <linux/migrate_offc.h>
> +
> +#define MAX_NUM_COPY_THREADS 64
> +
> +unsigned int limit_mt_num = 4;
> +static int is_dispatching;
> +
> +static int copy_page_lists_mt(struct list_head *dst_folios,
> +               struct list_head *src_folios, unsigned int nr_items);
> +
> +static DEFINE_MUTEX(migratecfg_mutex);
> +
> +/* CPU Multithreaded Batch Migrator */
> +struct migrator cpu_migrator = {
> +       .name = "CPU_MT_COPY\0",
> +       .migrate_offc = copy_page_lists_mt,
> +       .owner = THIS_MODULE,
> +};
> +
> +struct copy_item {
> +       char *to;
> +       char *from;
> +       unsigned long chunk_size;
> +};
> +
> +struct copy_page_info {
> +       struct work_struct copy_page_work;
> +       int ret;
> +       unsigned long num_items;
> +       struct copy_item item_list[];
> +};
> +
> +static unsigned long copy_page_routine(char *vto, char *vfrom,
> +       unsigned long chunk_size)
> +{
> +       return copy_mc_to_kernel(vto, vfrom, chunk_size);
> +}
> +
> +static void copy_page_work_queue_thread(struct work_struct *work)
> +{
> +       struct copy_page_info *my_work = (struct copy_page_info *)work;
> +       int i;
> +
> +       my_work->ret = 0;
> +       for (i = 0; i < my_work->num_items; ++i)
> +               my_work->ret |= !!copy_page_routine(my_work->item_list[i].to,
> +                                       my_work->item_list[i].from,
> +                                       my_work->item_list[i].chunk_size);
> +}
> +
> +static ssize_t mt_offloading_set(struct kobject *kobj, struct kobj_attribute *attr,
> +               const char *buf, size_t count)
> +{
> +       int ccode;
> +       int action;
> +
> +       ccode = kstrtoint(buf, 0, &action);
> +       if (ccode) {
> +               pr_debug("(%s:) error parsing input %s\n", __func__, buf);
> +               return ccode;
> +       }
> +
> +       /*
> +        * action is 0: User wants to disable MT offloading.
> +        * action is 1: User wants to enable MT offloading.
> +        */
> +       switch (action) {
> +       case 0:
> +               mutex_lock(&migratecfg_mutex);
> +               if (is_dispatching == 1) {
> +                       stop_offloading();
> +                       is_dispatching = 0;
> +               } else
> +                       pr_debug("MT migration offloading is already OFF\n");
> +               mutex_unlock(&migratecfg_mutex);
> +               break;
> +       case 1:
> +               mutex_lock(&migratecfg_mutex);
> +               if (is_dispatching == 0) {
> +                       start_offloading(&cpu_migrator);
> +                       is_dispatching = 1;
> +               } else
> +                       pr_debug("MT migration offloading is already ON\n");
> +               mutex_unlock(&migratecfg_mutex);
> +               break;
> +       default:
> +               pr_debug("input should be zero or one, parsed as %d\n", action);
> +       }
> +       return sizeof(action);
> +}
> +
> +static ssize_t mt_offloading_show(struct kobject *kobj,
> +               struct kobj_attribute *attr, char *buf)
> +{
> +       return sysfs_emit(buf, "%d\n", is_dispatching);
> +}
> +
> +static ssize_t mt_threads_set(struct kobject *kobj, struct kobj_attribute *attr,
> +               const char *buf, size_t count)
> +{
> +       int ccode;
> +       unsigned int threads;
> +
> +       ccode = kstrtouint(buf, 0, &threads);
> +       if (ccode) {
> +               pr_debug("(%s:) error parsing input %s\n", __func__, buf);
> +               return ccode;
> +       }
> +
> +       if (threads > 0 && threads <= MAX_NUM_COPY_THREADS) {
> +               mutex_lock(&migratecfg_mutex);
> +               limit_mt_num = threads;
> +               mutex_unlock(&migratecfg_mutex);
> +               pr_debug("MT threads set to %u\n", limit_mt_num);
> +       } else {
> +               pr_debug("Invalid thread count. Must be between 1 and %d\n", MAX_NUM_COPY_THREADS);
> +               return -EINVAL;
> +       }
> +
> +       return count;
> +}
> +
> +static ssize_t mt_threads_show(struct kobject *kobj,
> +               struct kobj_attribute *attr, char *buf)
> +{
> +       return sysfs_emit(buf, "%u\n", limit_mt_num);
> +}
> +
> +int copy_page_lists_mt(struct list_head *dst_folios,
> +               struct list_head *src_folios, unsigned int nr_items)
> +{
> +       struct copy_page_info *work_items[MAX_NUM_COPY_THREADS] = {0};
> +       unsigned int total_mt_num = limit_mt_num;
> +       struct folio *src, *src2, *dst, *dst2;
> +       int max_items_per_thread;
> +       int item_idx;
> +       int err = 0;
> +       int cpu;
> +       int i;
> +
> +       if (IS_ENABLED(CONFIG_HIGHMEM))
> +               return -EOPNOTSUPP;
> +
> +       /* Each threads get part of each page, if nr_items < totla_mt_num */
> +       if (nr_items < total_mt_num)
> +               max_items_per_thread = nr_items;
> +       else
> +               max_items_per_thread = (nr_items / total_mt_num) +
> +                               ((nr_items % total_mt_num) ? 1 : 0);
> +
> +
> +       for (cpu = 0; cpu < total_mt_num; ++cpu) {
> +               work_items[cpu] = kzalloc(sizeof(struct copy_page_info) +
> +                                               sizeof(struct copy_item) *
> +                                                       max_items_per_thread,
> +                                         GFP_NOWAIT);
> +               if (!work_items[cpu]) {
> +                       err = -ENOMEM;
> +                       goto free_work_items;
> +               }
> +       }
> +
> +       if (nr_items < total_mt_num) {
> +               for (cpu = 0; cpu < total_mt_num; ++cpu) {
> +                       INIT_WORK((struct work_struct *)work_items[cpu],
> +                                         copy_page_work_queue_thread);
> +                       work_items[cpu]->num_items = max_items_per_thread;
> +               }
> +
> +               item_idx = 0;
> +               dst = list_first_entry(dst_folios, struct folio, lru);
> +               dst2 = list_next_entry(dst, lru);
> +               list_for_each_entry_safe(src, src2, src_folios, lru) {
> +                       unsigned long chunk_size = PAGE_SIZE * folio_nr_pages(src) / total_mt_num;
> +                       char *vfrom = page_address(&src->page);
> +                       char *vto = page_address(&dst->page);
> +
> +                       VM_WARN_ON(PAGE_SIZE * folio_nr_pages(src) % total_mt_num);
> +                       VM_WARN_ON(folio_nr_pages(dst) != folio_nr_pages(src));
> +
> +                       for (cpu = 0; cpu < total_mt_num; ++cpu) {
> +                               work_items[cpu]->item_list[item_idx].to =
> +                                       vto + chunk_size * cpu;
> +                               work_items[cpu]->item_list[item_idx].from =
> +                                       vfrom + chunk_size * cpu;
> +                               work_items[cpu]->item_list[item_idx].chunk_size =
> +                                       chunk_size;
> +                       }
> +
> +                       item_idx++;
> +                       dst = dst2;
> +                       dst2 = list_next_entry(dst, lru);
> +               }
> +
> +               for (cpu = 0; cpu < total_mt_num; ++cpu)
> +                       queue_work(system_unbound_wq,
> +                                  (struct work_struct *)work_items[cpu]);
> +       } else {
> +               int num_xfer_per_thread = nr_items / total_mt_num;
> +               int per_cpu_item_idx;
> +
> +
> +               for (cpu = 0; cpu < total_mt_num; ++cpu) {
> +                       INIT_WORK((struct work_struct *)work_items[cpu],
> +                                         copy_page_work_queue_thread);
> +
> +                       work_items[cpu]->num_items = num_xfer_per_thread +
> +                                       (cpu < (nr_items % total_mt_num));
> +               }
> +
> +               cpu = 0;
> +               per_cpu_item_idx = 0;
> +               item_idx = 0;
> +               dst = list_first_entry(dst_folios, struct folio, lru);
> +               dst2 = list_next_entry(dst, lru);
> +               list_for_each_entry_safe(src, src2, src_folios, lru) {
> +                       work_items[cpu]->item_list[per_cpu_item_idx].to =
> +                               page_address(&dst->page);
> +                       work_items[cpu]->item_list[per_cpu_item_idx].from =
> +                               page_address(&src->page);
> +                       work_items[cpu]->item_list[per_cpu_item_idx].chunk_size =
> +                               PAGE_SIZE * folio_nr_pages(src);
> +
> +                       VM_WARN_ON(folio_nr_pages(dst) !=
> +                                  folio_nr_pages(src));
> +
> +                       per_cpu_item_idx++;
> +                       item_idx++;
> +                       dst = dst2;
> +                       dst2 = list_next_entry(dst, lru);
> +
> +                       if (per_cpu_item_idx == work_items[cpu]->num_items) {
> +                               queue_work(system_unbound_wq,
> +                                       (struct work_struct *)work_items[cpu]);

Thanks for the great work.

By the way, is it okay to use work queue?  When the system is idle, this
patch will improve the migration performance, but when there are a lot
of other runnable tasks in the system, it might be worse than the
current one.  That's gonna be even worse if there are some other tasks
that wait for the migration to end.  It's worth noting that
padata_do_multithreaded() also uses work queue internally.

I think, at worst, the performance should be same as is.  Or am I
missing something?

	Byungchul

> +                               per_cpu_item_idx = 0;
> +                               cpu++;
> +                       }
> +               }
> +               if (item_idx != nr_items)
> +                       pr_warn("%s: only %d out of %d pages are transferred\n",
> +                               __func__, item_idx - 1, nr_items);
> +       }
> +
> +       /* Wait until it finishes  */
> +       for (i = 0; i < total_mt_num; ++i) {
> +               flush_work((struct work_struct *)work_items[i]);
> +               /* retry if any copy fails */
> +               if (work_items[i]->ret)
> +                       err = -EAGAIN;
> +       }
> +
> +free_work_items:
> +       for (cpu = 0; cpu < total_mt_num; ++cpu)
> +               kfree(work_items[cpu]);
> +
> +       return err;
> +}
> +
> +static struct kobject *mt_kobj_ref;
> +static struct kobj_attribute mt_offloading_attribute = __ATTR(offloading, 0664,
> +               mt_offloading_show, mt_offloading_set);
> +static struct kobj_attribute mt_threads_attribute = __ATTR(threads, 0664,
> +               mt_threads_show, mt_threads_set);
> +
> +static int __init cpu_mt_module_init(void)
> +{
> +       int ret = 0;
> +
> +       mt_kobj_ref = kobject_create_and_add("cpu_mt", kernel_kobj);
> +       if (!mt_kobj_ref)
> +               return -ENOMEM;
> +
> +       ret = sysfs_create_file(mt_kobj_ref, &mt_offloading_attribute.attr);
> +       if (ret)
> +               goto out_offloading;
> +
> +       ret = sysfs_create_file(mt_kobj_ref, &mt_threads_attribute.attr);
> +       if (ret)
> +               goto out_threads;
> +
> +       is_dispatching = 0;
> +
> +       return 0;
> +
> +out_threads:
> +       sysfs_remove_file(mt_kobj_ref, &mt_offloading_attribute.attr);
> +out_offloading:
> +       kobject_put(mt_kobj_ref);
> +       return ret;
> +}
> +
> +static void __exit cpu_mt_module_exit(void)
> +{
> +       /* Stop the MT offloading to unload the module */
> +       mutex_lock(&migratecfg_mutex);
> +       if (is_dispatching == 1) {
> +               stop_offloading();
> +               is_dispatching = 0;
> +       }
> +       mutex_unlock(&migratecfg_mutex);
> +
> +       sysfs_remove_file(mt_kobj_ref, &mt_threads_attribute.attr);
> +       sysfs_remove_file(mt_kobj_ref, &mt_offloading_attribute.attr);
> +       kobject_put(mt_kobj_ref);
> +}
> +
> +module_init(cpu_mt_module_init);
> +module_exit(cpu_mt_module_exit);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Zi Yan");
> +MODULE_DESCRIPTION("CPU_MT_COPY"); /* CPU Multithreaded Batch Migrator */
> --
> 2.43.0


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC V3 6/9] mtcopy: introduce multi-threaded page copy routine
  2025-10-20  8:28   ` Byungchul Park
@ 2025-11-06  6:27     ` Garg, Shivank
  2025-11-12  2:12       ` Byungchul Park
  0 siblings, 1 reply; 26+ messages in thread
From: Garg, Shivank @ 2025-11-06  6:27 UTC (permalink / raw)
  To: Byungchul Park
  Cc: akpm, david, ziy, willy, matthew.brost, joshua.hahnjy, rakie.kim,
	gourry, ying.huang, apopple, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, vkoul, lucas.demarchi, rdunlap,
	jgg, kuba, justonli, ivecera, dave.jiang, Jonathan.Cameron,
	dan.j.williams, rientjes, Raghavendra.KodsaraThimmappa, bharata,
	alirad.malek, yiannis, weixugc, linux-kernel, linux-mm,
	kernel_team



On 10/20/2025 1:58 PM, Byungchul Park wrote:
> On Tue, Sep 23, 2025 at 05:47:41PM +0000, Shivank Garg wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> Now page copies are batched, multi-threaded page copy can be used to
>> increase page copy throughput.
>>
>> Enable using:
>> echo 1 >  /sys/kernel/cpu_mt/offloading
>> echo NR_THREADS >  /sys/kernel/cpu_mt/threads
>>
>> Disable:
>> echo 0 >  /sys/kernel/cpu_mt/offloading
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> Co-developed-by: Shivank Garg <shivankg@amd.com>
>> Signed-off-by: Shivank Garg <shivankg@amd.com>
>> ---
>>  drivers/Kconfig                        |   2 +
>>  drivers/Makefile                       |   3 +
>>  drivers/migoffcopy/Kconfig             |   9 +
>>  drivers/migoffcopy/Makefile            |   1 +
>>  drivers/migoffcopy/mtcopy/Makefile     |   1 +
>>  drivers/migoffcopy/mtcopy/copy_pages.c | 327 +++++++++++++++++++++++++
>>  6 files changed, 343 insertions(+)
>>  create mode 100644 drivers/migoffcopy/Kconfig
>>  create mode 100644 drivers/migoffcopy/Makefile
>>  create mode 100644 drivers/migoffcopy/mtcopy/Makefile
>>  create mode 100644 drivers/migoffcopy/mtcopy/copy_pages.c
>>
>> diff --git a/drivers/Kconfig b/drivers/Kconfig
>> index 4915a63866b0..d2cbc97a7683 100644
>> --- a/drivers/Kconfig
>> +++ b/drivers/Kconfig
>> @@ -251,4 +251,6 @@ source "drivers/hte/Kconfig"
>>
>>  source "drivers/cdx/Kconfig"
>>
>> +source "drivers/migoffcopy/Kconfig"
>> +
>>  endmenu
>> diff --git a/drivers/Makefile b/drivers/Makefile
>> index b5749cf67044..5326d88cf31c 100644
>> --- a/drivers/Makefile
>> +++ b/drivers/Makefile
>> @@ -42,6 +42,9 @@ obj-y                         += clk/
>>  # really early.
>>  obj-$(CONFIG_DMADEVICES)       += dma/
>>
>> +# Migration copy Offload
>> +obj-$(CONFIG_OFFC_MIGRATION)   += migoffcopy/
>> +
>>  # SOC specific infrastructure drivers.
>>  obj-y                          += soc/
>>  obj-$(CONFIG_PM_GENERIC_DOMAINS)       += pmdomain/
>> diff --git a/drivers/migoffcopy/Kconfig b/drivers/migoffcopy/Kconfig
>> new file mode 100644
>> index 000000000000..e73698af3e72
>> --- /dev/null
>> +++ b/drivers/migoffcopy/Kconfig
>> @@ -0,0 +1,9 @@
>> +config MTCOPY_CPU
>> +       bool "Multi-Threaded Copy with CPU"
>> +       depends on OFFC_MIGRATION
>> +       default n
>> +       help
>> +         Interface MT COPY CPU driver for batch page migration
>> +         offloading. Say Y if you want to try offloading with
>> +         MultiThreaded CPU copy APIs.
>> +
>> diff --git a/drivers/migoffcopy/Makefile b/drivers/migoffcopy/Makefile
>> new file mode 100644
>> index 000000000000..0a3c356d67e6
>> --- /dev/null
>> +++ b/drivers/migoffcopy/Makefile
>> @@ -0,0 +1 @@
>> +obj-$(CONFIG_MTCOPY_CPU) += mtcopy/
>> diff --git a/drivers/migoffcopy/mtcopy/Makefile b/drivers/migoffcopy/mtcopy/Makefile
>> new file mode 100644
>> index 000000000000..b4d7da85eda9
>> --- /dev/null
>> +++ b/drivers/migoffcopy/mtcopy/Makefile
>> @@ -0,0 +1 @@
>> +obj-$(CONFIG_MTCOPY_CPU) += copy_pages.o
>> diff --git a/drivers/migoffcopy/mtcopy/copy_pages.c b/drivers/migoffcopy/mtcopy/copy_pages.c
>> new file mode 100644
>> index 000000000000..68e50de602d6
>> --- /dev/null
>> +++ b/drivers/migoffcopy/mtcopy/copy_pages.c
>> @@ -0,0 +1,327 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Parallel page copy routine.
>> + */
>> +
>> +#include <linux/module.h>
>> +#include <linux/kernel.h>
>> +#include <linux/printk.h>
>> +#include <linux/init.h>
>> +#include <linux/sysctl.h>
>> +#include <linux/sysfs.h>
>> +#include <linux/highmem.h>
>> +#include <linux/workqueue.h>
>> +#include <linux/slab.h>
>> +#include <linux/migrate.h>
>> +#include <linux/migrate_offc.h>
>> +
>> +#define MAX_NUM_COPY_THREADS 64
>> +
>> +unsigned int limit_mt_num = 4;
>> +static int is_dispatching;
>> +
>> +static int copy_page_lists_mt(struct list_head *dst_folios,
>> +               struct list_head *src_folios, unsigned int nr_items);
>> +
>> +static DEFINE_MUTEX(migratecfg_mutex);
>> +
>> +/* CPU Multithreaded Batch Migrator */
>> +struct migrator cpu_migrator = {
>> +       .name = "CPU_MT_COPY\0",
>> +       .migrate_offc = copy_page_lists_mt,
>> +       .owner = THIS_MODULE,
>> +};
>> +
>> +struct copy_item {
>> +       char *to;
>> +       char *from;
>> +       unsigned long chunk_size;
>> +};
>> +
>> +struct copy_page_info {
>> +       struct work_struct copy_page_work;
>> +       int ret;
>> +       unsigned long num_items;
>> +       struct copy_item item_list[];
>> +};
>> +
>> +static unsigned long copy_page_routine(char *vto, char *vfrom,
>> +       unsigned long chunk_size)
>> +{
>> +       return copy_mc_to_kernel(vto, vfrom, chunk_size);
>> +}
>> +
>> +static void copy_page_work_queue_thread(struct work_struct *work)
>> +{
>> +       struct copy_page_info *my_work = (struct copy_page_info *)work;
>> +       int i;
>> +
>> +       my_work->ret = 0;
>> +       for (i = 0; i < my_work->num_items; ++i)
>> +               my_work->ret |= !!copy_page_routine(my_work->item_list[i].to,
>> +                                       my_work->item_list[i].from,
>> +                                       my_work->item_list[i].chunk_size);
>> +}
>> +
>> +static ssize_t mt_offloading_set(struct kobject *kobj, struct kobj_attribute *attr,
>> +               const char *buf, size_t count)
>> +{
>> +       int ccode;
>> +       int action;
>> +
>> +       ccode = kstrtoint(buf, 0, &action);
>> +       if (ccode) {
>> +               pr_debug("(%s:) error parsing input %s\n", __func__, buf);
>> +               return ccode;
>> +       }
>> +
>> +       /*
>> +        * action is 0: User wants to disable MT offloading.
>> +        * action is 1: User wants to enable MT offloading.
>> +        */
>> +       switch (action) {
>> +       case 0:
>> +               mutex_lock(&migratecfg_mutex);
>> +               if (is_dispatching == 1) {
>> +                       stop_offloading();
>> +                       is_dispatching = 0;
>> +               } else
>> +                       pr_debug("MT migration offloading is already OFF\n");
>> +               mutex_unlock(&migratecfg_mutex);
>> +               break;
>> +       case 1:
>> +               mutex_lock(&migratecfg_mutex);
>> +               if (is_dispatching == 0) {
>> +                       start_offloading(&cpu_migrator);
>> +                       is_dispatching = 1;
>> +               } else
>> +                       pr_debug("MT migration offloading is already ON\n");
>> +               mutex_unlock(&migratecfg_mutex);
>> +               break;
>> +       default:
>> +               pr_debug("input should be zero or one, parsed as %d\n", action);
>> +       }
>> +       return sizeof(action);
>> +}
>> +
>> +static ssize_t mt_offloading_show(struct kobject *kobj,
>> +               struct kobj_attribute *attr, char *buf)
>> +{
>> +       return sysfs_emit(buf, "%d\n", is_dispatching);
>> +}
>> +
>> +static ssize_t mt_threads_set(struct kobject *kobj, struct kobj_attribute *attr,
>> +               const char *buf, size_t count)
>> +{
>> +       int ccode;
>> +       unsigned int threads;
>> +
>> +       ccode = kstrtouint(buf, 0, &threads);
>> +       if (ccode) {
>> +               pr_debug("(%s:) error parsing input %s\n", __func__, buf);
>> +               return ccode;
>> +       }
>> +
>> +       if (threads > 0 && threads <= MAX_NUM_COPY_THREADS) {
>> +               mutex_lock(&migratecfg_mutex);
>> +               limit_mt_num = threads;
>> +               mutex_unlock(&migratecfg_mutex);
>> +               pr_debug("MT threads set to %u\n", limit_mt_num);
>> +       } else {
>> +               pr_debug("Invalid thread count. Must be between 1 and %d\n", MAX_NUM_COPY_THREADS);
>> +               return -EINVAL;
>> +       }
>> +
>> +       return count;
>> +}
>> +
>> +static ssize_t mt_threads_show(struct kobject *kobj,
>> +               struct kobj_attribute *attr, char *buf)
>> +{
>> +       return sysfs_emit(buf, "%u\n", limit_mt_num);
>> +}
>> +
>> +int copy_page_lists_mt(struct list_head *dst_folios,
>> +               struct list_head *src_folios, unsigned int nr_items)
>> +{
>> +       struct copy_page_info *work_items[MAX_NUM_COPY_THREADS] = {0};
>> +       unsigned int total_mt_num = limit_mt_num;
>> +       struct folio *src, *src2, *dst, *dst2;
>> +       int max_items_per_thread;
>> +       int item_idx;
>> +       int err = 0;
>> +       int cpu;
>> +       int i;
>> +
>> +       if (IS_ENABLED(CONFIG_HIGHMEM))
>> +               return -EOPNOTSUPP;
>> +
>> +       /* Each threads get part of each page, if nr_items < totla_mt_num */
>> +       if (nr_items < total_mt_num)
>> +               max_items_per_thread = nr_items;
>> +       else
>> +               max_items_per_thread = (nr_items / total_mt_num) +
>> +                               ((nr_items % total_mt_num) ? 1 : 0);
>> +
>> +
>> +       for (cpu = 0; cpu < total_mt_num; ++cpu) {
>> +               work_items[cpu] = kzalloc(sizeof(struct copy_page_info) +
>> +                                               sizeof(struct copy_item) *
>> +                                                       max_items_per_thread,
>> +                                         GFP_NOWAIT);
>> +               if (!work_items[cpu]) {
>> +                       err = -ENOMEM;
>> +                       goto free_work_items;
>> +               }
>> +       }
>> +
>> +       if (nr_items < total_mt_num) {
>> +               for (cpu = 0; cpu < total_mt_num; ++cpu) {
>> +                       INIT_WORK((struct work_struct *)work_items[cpu],
>> +                                         copy_page_work_queue_thread);
>> +                       work_items[cpu]->num_items = max_items_per_thread;
>> +               }
>> +
>> +               item_idx = 0;
>> +               dst = list_first_entry(dst_folios, struct folio, lru);
>> +               dst2 = list_next_entry(dst, lru);
>> +               list_for_each_entry_safe(src, src2, src_folios, lru) {
>> +                       unsigned long chunk_size = PAGE_SIZE * folio_nr_pages(src) / total_mt_num;
>> +                       char *vfrom = page_address(&src->page);
>> +                       char *vto = page_address(&dst->page);
>> +
>> +                       VM_WARN_ON(PAGE_SIZE * folio_nr_pages(src) % total_mt_num);
>> +                       VM_WARN_ON(folio_nr_pages(dst) != folio_nr_pages(src));
>> +
>> +                       for (cpu = 0; cpu < total_mt_num; ++cpu) {
>> +                               work_items[cpu]->item_list[item_idx].to =
>> +                                       vto + chunk_size * cpu;
>> +                               work_items[cpu]->item_list[item_idx].from =
>> +                                       vfrom + chunk_size * cpu;
>> +                               work_items[cpu]->item_list[item_idx].chunk_size =
>> +                                       chunk_size;
>> +                       }
>> +
>> +                       item_idx++;
>> +                       dst = dst2;
>> +                       dst2 = list_next_entry(dst, lru);
>> +               }
>> +
>> +               for (cpu = 0; cpu < total_mt_num; ++cpu)
>> +                       queue_work(system_unbound_wq,
>> +                                  (struct work_struct *)work_items[cpu]);
>> +       } else {
>> +               int num_xfer_per_thread = nr_items / total_mt_num;
>> +               int per_cpu_item_idx;
>> +
>> +
>> +               for (cpu = 0; cpu < total_mt_num; ++cpu) {
>> +                       INIT_WORK((struct work_struct *)work_items[cpu],
>> +                                         copy_page_work_queue_thread);
>> +
>> +                       work_items[cpu]->num_items = num_xfer_per_thread +
>> +                                       (cpu < (nr_items % total_mt_num));
>> +               }
>> +
>> +               cpu = 0;
>> +               per_cpu_item_idx = 0;
>> +               item_idx = 0;
>> +               dst = list_first_entry(dst_folios, struct folio, lru);
>> +               dst2 = list_next_entry(dst, lru);
>> +               list_for_each_entry_safe(src, src2, src_folios, lru) {
>> +                       work_items[cpu]->item_list[per_cpu_item_idx].to =
>> +                               page_address(&dst->page);
>> +                       work_items[cpu]->item_list[per_cpu_item_idx].from =
>> +                               page_address(&src->page);
>> +                       work_items[cpu]->item_list[per_cpu_item_idx].chunk_size =
>> +                               PAGE_SIZE * folio_nr_pages(src);
>> +
>> +                       VM_WARN_ON(folio_nr_pages(dst) !=
>> +                                  folio_nr_pages(src));
>> +
>> +                       per_cpu_item_idx++;
>> +                       item_idx++;
>> +                       dst = dst2;
>> +                       dst2 = list_next_entry(dst, lru);
>> +
>> +                       if (per_cpu_item_idx == work_items[cpu]->num_items) {
>> +                               queue_work(system_unbound_wq,
>> +                                       (struct work_struct *)work_items[cpu]);
> 
> Thanks for the great work.
> 
> By the way, is it okay to use work queue?  When the system is idle, this
> patch will improve the migration performance, but when there are a lot
> of other runnable tasks in the system, it might be worse than the
> current one.  That's gonna be even worse if there are some other tasks
> that wait for the migration to end.  It's worth noting that
> padata_do_multithreaded() also uses work queue internally.
> 
> I think, at worst, the performance should be same as is.  Or am I
> missing something?
> 
> 	Byungchul

Hi Byungchul,

This was addressed by Zi in the mail:
https://lore.kernel.org/linux-mm/61F6152C-A91E-453B-9521-34B7497AE532@nvidia.com

So, there are some specific use cases that can benefit significantly when CPU cores are idle
while GPUs or accelerators handle most of the workload.
In such scenarios, migrating pages to and from device memory (GPU or AI accelerator) quickly
is critical and ensure hot data is always available for accelerators.

Thanks,
Shivank

> 
>> +                               per_cpu_item_idx = 0;
>> +                               cpu++;
>> +                       }
>> +               }
>> +               if (item_idx != nr_items)
>> +                       pr_warn("%s: only %d out of %d pages are transferred\n",
>> +                               __func__, item_idx - 1, nr_items);
>> +       }
>> +
>> +       /* Wait until it finishes  */
>> +       for (i = 0; i < total_mt_num; ++i) {
>> +               flush_work((struct work_struct *)work_items[i]);
>> +               /* retry if any copy fails */
>> +               if (work_items[i]->ret)
>> +                       err = -EAGAIN;
>> +       }
>> +
>> +free_work_items:
>> +       for (cpu = 0; cpu < total_mt_num; ++cpu)
>> +               kfree(work_items[cpu]);
>> +
>> +       return err;
>> +}
>> +
>> +static struct kobject *mt_kobj_ref;
>> +static struct kobj_attribute mt_offloading_attribute = __ATTR(offloading, 0664,
>> +               mt_offloading_show, mt_offloading_set);
>> +static struct kobj_attribute mt_threads_attribute = __ATTR(threads, 0664,
>> +               mt_threads_show, mt_threads_set);
>> +
>> +static int __init cpu_mt_module_init(void)
>> +{
>> +       int ret = 0;
>> +
>> +       mt_kobj_ref = kobject_create_and_add("cpu_mt", kernel_kobj);
>> +       if (!mt_kobj_ref)
>> +               return -ENOMEM;
>> +
>> +       ret = sysfs_create_file(mt_kobj_ref, &mt_offloading_attribute.attr);
>> +       if (ret)
>> +               goto out_offloading;
>> +
>> +       ret = sysfs_create_file(mt_kobj_ref, &mt_threads_attribute.attr);
>> +       if (ret)
>> +               goto out_threads;
>> +
>> +       is_dispatching = 0;
>> +
>> +       return 0;
>> +
>> +out_threads:
>> +       sysfs_remove_file(mt_kobj_ref, &mt_offloading_attribute.attr);
>> +out_offloading:
>> +       kobject_put(mt_kobj_ref);
>> +       return ret;
>> +}
>> +
>> +static void __exit cpu_mt_module_exit(void)
>> +{
>> +       /* Stop the MT offloading to unload the module */
>> +       mutex_lock(&migratecfg_mutex);
>> +       if (is_dispatching == 1) {
>> +               stop_offloading();
>> +               is_dispatching = 0;
>> +       }
>> +       mutex_unlock(&migratecfg_mutex);
>> +
>> +       sysfs_remove_file(mt_kobj_ref, &mt_threads_attribute.attr);
>> +       sysfs_remove_file(mt_kobj_ref, &mt_offloading_attribute.attr);
>> +       kobject_put(mt_kobj_ref);
>> +}
>> +
>> +module_init(cpu_mt_module_init);
>> +module_exit(cpu_mt_module_exit);
>> +
>> +MODULE_LICENSE("GPL");
>> +MODULE_AUTHOR("Zi Yan");
>> +MODULE_DESCRIPTION("CPU_MT_COPY"); /* CPU Multithreaded Batch Migrator */
>> --
>> 2.43.0



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC V3 6/9] mtcopy: introduce multi-threaded page copy routine
  2025-11-06  6:27     ` Garg, Shivank
@ 2025-11-12  2:12       ` Byungchul Park
  0 siblings, 0 replies; 26+ messages in thread
From: Byungchul Park @ 2025-11-12  2:12 UTC (permalink / raw)
  To: Garg, Shivank
  Cc: akpm, david, ziy, willy, matthew.brost, joshua.hahnjy, rakie.kim,
	gourry, ying.huang, apopple, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, vkoul, lucas.demarchi, rdunlap,
	jgg, kuba, justonli, ivecera, dave.jiang, Jonathan.Cameron,
	dan.j.williams, rientjes, Raghavendra.KodsaraThimmappa, bharata,
	alirad.malek, yiannis, weixugc, linux-kernel, linux-mm,
	kernel_team

On Thu, Nov 06, 2025 at 11:57:54AM +0530, Garg, Shivank wrote:
> On 10/20/2025 1:58 PM, Byungchul Park wrote:
> > Thanks for the great work.
> >
> > By the way, is it okay to use work queue?  When the system is idle, this
> > patch will improve the migration performance, but when there are a lot
> > of other runnable tasks in the system, it might be worse than the
> > current one.  That's gonna be even worse if there are some other tasks
> > that wait for the migration to end.  It's worth noting that
> > padata_do_multithreaded() also uses work queue internally.
> >
> > I think, at worst, the performance should be same as is.  Or am I
> > missing something?
> >
> >       Byungchul
> 
> Hi Byungchul,
> 
> This was addressed by Zi in the mail:
> https://lore.kernel.org/linux-mm/61F6152C-A91E-453B-9521-34B7497AE532@nvidia.com
> 
> So, there are some specific use cases that can benefit significantly when CPU cores are idle

Sure.  I think so.  I meant the mechanism using multi-threads would
better be performed for faster migration when a system is idle, but it'd
better avoid the aggressiveness when the system is busy.

Or we might observe a performance degradation due to this work.

	Byungchul

> while GPUs or accelerators handle most of the workload.
> In such scenarios, migrating pages to and from device memory (GPU or AI accelerator) quickly
> is critical and ensure hot data is always available for accelerators.
> 
> Thanks,
> Shivank
> 
> >
> >> +                               per_cpu_item_idx = 0;
> >> +                               cpu++;
> >> +                       }
> >> +               }
> >> +               if (item_idx != nr_items)
> >> +                       pr_warn("%s: only %d out of %d pages are transferred\n",
> >> +                               __func__, item_idx - 1, nr_items);
> >> +       }
> >> +
> >> +       /* Wait until it finishes  */
> >> +       for (i = 0; i < total_mt_num; ++i) {
> >> +               flush_work((struct work_struct *)work_items[i]);
> >> +               /* retry if any copy fails */
> >> +               if (work_items[i]->ret)
> >> +                       err = -EAGAIN;
> >> +       }
> >> +
> >> +free_work_items:
> >> +       for (cpu = 0; cpu < total_mt_num; ++cpu)
> >> +               kfree(work_items[cpu]);
> >> +
> >> +       return err;
> >> +}
> >> +
> >> +static struct kobject *mt_kobj_ref;
> >> +static struct kobj_attribute mt_offloading_attribute = __ATTR(offloading, 0664,
> >> +               mt_offloading_show, mt_offloading_set);
> >> +static struct kobj_attribute mt_threads_attribute = __ATTR(threads, 0664,
> >> +               mt_threads_show, mt_threads_set);
> >> +
> >> +static int __init cpu_mt_module_init(void)
> >> +{
> >> +       int ret = 0;
> >> +
> >> +       mt_kobj_ref = kobject_create_and_add("cpu_mt", kernel_kobj);
> >> +       if (!mt_kobj_ref)
> >> +               return -ENOMEM;
> >> +
> >> +       ret = sysfs_create_file(mt_kobj_ref, &mt_offloading_attribute.attr);
> >> +       if (ret)
> >> +               goto out_offloading;
> >> +
> >> +       ret = sysfs_create_file(mt_kobj_ref, &mt_threads_attribute.attr);
> >> +       if (ret)
> >> +               goto out_threads;
> >> +
> >> +       is_dispatching = 0;
> >> +
> >> +       return 0;
> >> +
> >> +out_threads:
> >> +       sysfs_remove_file(mt_kobj_ref, &mt_offloading_attribute.attr);
> >> +out_offloading:
> >> +       kobject_put(mt_kobj_ref);
> >> +       return ret;
> >> +}
> >> +
> >> +static void __exit cpu_mt_module_exit(void)
> >> +{
> >> +       /* Stop the MT offloading to unload the module */
> >> +       mutex_lock(&migratecfg_mutex);
> >> +       if (is_dispatching == 1) {
> >> +               stop_offloading();
> >> +               is_dispatching = 0;
> >> +       }
> >> +       mutex_unlock(&migratecfg_mutex);
> >> +
> >> +       sysfs_remove_file(mt_kobj_ref, &mt_threads_attribute.attr);
> >> +       sysfs_remove_file(mt_kobj_ref, &mt_offloading_attribute.attr);
> >> +       kobject_put(mt_kobj_ref);
> >> +}
> >> +
> >> +module_init(cpu_mt_module_init);
> >> +module_exit(cpu_mt_module_exit);
> >> +
> >> +MODULE_LICENSE("GPL");
> >> +MODULE_AUTHOR("Zi Yan");
> >> +MODULE_DESCRIPTION("CPU_MT_COPY"); /* CPU Multithreaded Batch Migrator */
> >> --
> >> 2.43.0


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC V3 7/9] dcbm: add dma core batch migrator for batch page offloading
  2025-09-23 17:47 [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload Shivank Garg
                   ` (5 preceding siblings ...)
  2025-09-23 17:47 ` [RFC V3 6/9] mtcopy: introduce multi-threaded page copy routine Shivank Garg
@ 2025-09-23 17:47 ` Shivank Garg
  2025-10-02 11:38   ` Jonathan Cameron
  2025-09-23 17:47 ` [RFC V3 8/9] adjust NR_MAX_BATCHED_MIGRATION for testing Shivank Garg
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Shivank Garg @ 2025-09-23 17:47 UTC (permalink / raw)
  To: akpm, david
  Cc: ziy, willy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, vkoul, lucas.demarchi, rdunlap,
	jgg, kuba, justonli, ivecera, dave.jiang, Jonathan.Cameron,
	dan.j.williams, rientjes, Raghavendra.KodsaraThimmappa, bharata,
	shivankg, alirad.malek, yiannis, weixugc, linux-kernel, linux-mm

The dcbm (DMA core batch migrator) provides a generic interface using
DMAEngine for end-to-end testing of the batch page migration offload
feature.

Enable DCBM offload:
echo 1 > /sys/kernel/dcbm/offloading
echo NR_DMA_CHAN_TO_USE > /sys/kernel/dcbm/nr_dma_chan

Disable DCBM offload:
echo 0 > /sys/kernel/dcbm/offloading

Signed-off-by: Shivank Garg <shivankg@amd.com>
---
 drivers/migoffcopy/Kconfig       |   8 +
 drivers/migoffcopy/Makefile      |   1 +
 drivers/migoffcopy/dcbm/Makefile |   1 +
 drivers/migoffcopy/dcbm/dcbm.c   | 415 +++++++++++++++++++++++++++++++
 4 files changed, 425 insertions(+)
 create mode 100644 drivers/migoffcopy/dcbm/Makefile
 create mode 100644 drivers/migoffcopy/dcbm/dcbm.c

diff --git a/drivers/migoffcopy/Kconfig b/drivers/migoffcopy/Kconfig
index e73698af3e72..c1b2eff7650d 100644
--- a/drivers/migoffcopy/Kconfig
+++ b/drivers/migoffcopy/Kconfig
@@ -6,4 +6,12 @@ config MTCOPY_CPU
          Interface MT COPY CPU driver for batch page migration
          offloading. Say Y if you want to try offloading with
          MultiThreaded CPU copy APIs.
+config DCBM_DMA
+       bool "DMA Core Batch Migrator"
+       depends on OFFC_MIGRATION && DMA_ENGINE
+       default n
+       help
+         Interface DMA driver for batch page migration offloading.
+         Say Y if you want to try offloading with DMAEngine APIs
+         based driver.
 
diff --git a/drivers/migoffcopy/Makefile b/drivers/migoffcopy/Makefile
index 0a3c356d67e6..dedc86ff54c1 100644
--- a/drivers/migoffcopy/Makefile
+++ b/drivers/migoffcopy/Makefile
@@ -1 +1,2 @@
 obj-$(CONFIG_MTCOPY_CPU) += mtcopy/
+obj-$(CONFIG_DCBM_DMA)		+= dcbm/
diff --git a/drivers/migoffcopy/dcbm/Makefile b/drivers/migoffcopy/dcbm/Makefile
new file mode 100644
index 000000000000..56ba47cce0f1
--- /dev/null
+++ b/drivers/migoffcopy/dcbm/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_DCBM_DMA) += dcbm.o
diff --git a/drivers/migoffcopy/dcbm/dcbm.c b/drivers/migoffcopy/dcbm/dcbm.c
new file mode 100644
index 000000000000..87a58c0c3b9b
--- /dev/null
+++ b/drivers/migoffcopy/dcbm/dcbm.c
@@ -0,0 +1,415 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *
+ * DMA batch-offloading interface driver
+ *
+ * Copyright (C) 2024-25 Advanced Micro Devices, Inc.
+ */
+
+#include <linux/module.h>
+#include <linux/dma-mapping.h>
+#include <linux/dmaengine.h>
+#include <linux/migrate.h>
+#include <linux/migrate_offc.h>
+
+#define MAX_DMA_CHANNELS	16
+
+static bool offloading_enabled;
+static unsigned int nr_dma_channels = 1;
+static DEFINE_MUTEX(dcbm_mutex);
+
+struct dma_work {
+	struct dma_chan *chan;
+	struct completion done;
+	atomic_t pending;
+	struct sg_table *src_sgt;
+	struct sg_table *dst_sgt;
+	bool mapped;
+};
+
+static void dma_completion_callback(void *data)
+{
+	struct dma_work *work = data;
+
+	if (atomic_dec_and_test(&work->pending))
+		complete(&work->done);
+}
+
+static int setup_sg_tables(struct dma_work *work, struct list_head **src_pos,
+		struct list_head **dst_pos, int nr)
+{
+	struct scatterlist *sg_src, *sg_dst;
+	struct device *dev;
+	int i, ret;
+
+	work->src_sgt = kmalloc(sizeof(*work->src_sgt), GFP_KERNEL);
+	if (!work->src_sgt)
+		return -ENOMEM;
+	work->dst_sgt = kmalloc(sizeof(*work->dst_sgt), GFP_KERNEL);
+	if (!work->dst_sgt)
+		goto err_free_src;
+
+	ret = sg_alloc_table(work->src_sgt, nr, GFP_KERNEL);
+	if (ret)
+		goto err_free_dst;
+	ret = sg_alloc_table(work->dst_sgt, nr, GFP_KERNEL);
+	if (ret)
+		goto err_free_src_table;
+
+	sg_src = work->src_sgt->sgl;
+	sg_dst = work->dst_sgt->sgl;
+	for (i = 0; i < nr; i++) {
+		struct folio *src = list_entry(*src_pos, struct folio, lru);
+		struct folio *dst = list_entry(*dst_pos, struct folio, lru);
+
+		sg_set_folio(sg_src, src, folio_size(src), 0);
+		sg_set_folio(sg_dst, dst, folio_size(dst), 0);
+
+		*src_pos = (*src_pos)->next;
+		*dst_pos = (*dst_pos)->next;
+
+		if (i < nr - 1) {
+			sg_src = sg_next(sg_src);
+			sg_dst = sg_next(sg_dst);
+		}
+	}
+
+	dev = dmaengine_get_dma_device(work->chan);
+	if (!dev) {
+		ret = -ENODEV;
+		goto err_free_dst_table;
+	}
+	ret = dma_map_sgtable(dev, work->src_sgt, DMA_TO_DEVICE,
+			      DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_NO_KERNEL_MAPPING);
+	if (ret)
+		goto err_free_dst_table;
+	ret = dma_map_sgtable(dev, work->dst_sgt, DMA_FROM_DEVICE,
+			      DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_NO_KERNEL_MAPPING);
+	if (ret)
+		goto err_unmap_src;
+	/* Verify mapping produced same number of entries */
+	if (work->src_sgt->nents != work->dst_sgt->nents) {
+		pr_err("Mismatched SG entries after mapping: src=%d dst=%d\n",
+			work->src_sgt->nents, work->dst_sgt->nents);
+		ret = -EINVAL;
+		goto err_unmap_dst;
+	}
+	work->mapped = true;
+	return 0;
+
+err_unmap_dst:
+	dma_unmap_sgtable(dev, work->dst_sgt, DMA_FROM_DEVICE,
+			  DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_NO_KERNEL_MAPPING);
+err_unmap_src:
+	dma_unmap_sgtable(dev, work->src_sgt, DMA_TO_DEVICE,
+			  DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_NO_KERNEL_MAPPING);
+err_free_dst_table:
+	sg_free_table(work->dst_sgt);
+err_free_src_table:
+	sg_free_table(work->src_sgt);
+err_free_dst:
+	kfree(work->dst_sgt);
+	work->dst_sgt = NULL;
+err_free_src:
+	kfree(work->src_sgt);
+	work->src_sgt = NULL;
+	pr_err("DCBM: Failed to setup SG tables\n");
+	return ret;
+}
+
+static void cleanup_dma_work(struct dma_work *works, int actual_channels)
+{
+	struct device *dev;
+	int i;
+
+	if (!works)
+		return;
+
+	for (i = 0; i < actual_channels; i++) {
+		if (!works[i].chan)
+			continue;
+
+		dev = dmaengine_get_dma_device(works[i].chan);
+
+		/* Terminate any pending transfers */
+		if (atomic_read(&works[i].pending) > 0)
+			dmaengine_terminate_sync(works[i].chan);
+
+		if (dev && works[i].mapped) {
+			if (works[i].src_sgt) {
+				dma_unmap_sgtable(dev, works[i].src_sgt, DMA_TO_DEVICE,
+					DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_NO_KERNEL_MAPPING);
+				sg_free_table(works[i].src_sgt);
+				kfree(works[i].src_sgt);
+			}
+			if (works[i].dst_sgt) {
+				dma_unmap_sgtable(dev, works[i].dst_sgt, DMA_FROM_DEVICE,
+					DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_NO_KERNEL_MAPPING);
+				sg_free_table(works[i].dst_sgt);
+				kfree(works[i].dst_sgt);
+			}
+		}
+		dma_release_channel(works[i].chan);
+	}
+	kfree(works);
+}
+
+static int submit_dma_transfers(struct dma_work *work)
+{
+	struct scatterlist *sg_src, *sg_dst;
+	struct dma_async_tx_descriptor *tx;
+	unsigned long flags = DMA_CTRL_ACK;
+	dma_cookie_t cookie;
+	int i;
+
+	/* Logic: send single callback for the entire batch */
+	atomic_set(&work->pending, 1);
+
+	sg_src = work->src_sgt->sgl;
+	sg_dst = work->dst_sgt->sgl;
+	/* Iterate over DMA-mapped entries */
+	for_each_sgtable_dma_sg(work->src_sgt, sg_src, i) {
+		/* Only interrupt on the last transfer */
+		if (i == work->src_sgt->nents - 1)
+			flags |= DMA_PREP_INTERRUPT;
+
+		tx = dmaengine_prep_dma_memcpy(work->chan,
+					       sg_dma_address(sg_dst),
+					       sg_dma_address(sg_src),
+					       sg_dma_len(sg_src),
+					       flags);
+		if (!tx) {
+			atomic_set(&work->pending, 0);
+			return -EIO;
+		}
+
+		/* Only set callback on last transfer */
+		if (i == work->src_sgt->nents - 1) {
+			tx->callback = dma_completion_callback;
+			tx->callback_param = work;
+		}
+
+		cookie = dmaengine_submit(tx);
+		if (dma_submit_error(cookie)) {
+			atomic_set(&work->pending, 0);
+			return -EIO;
+		}
+		sg_dst = sg_next(sg_dst);
+	}
+	return 0;
+}
+
+/**
+ * folios_copy_dma - Copy folios using DMA engine
+ * @dst_list: Destination folio list
+ * @src_list: Source folio list
+ * @nr_folios: Number of folios to copy
+ *
+ * Return: 0. Fallback to CPU copy on any error.
+ */
+static int folios_copy_dma(struct list_head *dst_list,
+			   struct list_head *src_list,
+			   unsigned int nr_folios)
+{
+	struct dma_work *works;
+	struct list_head *src_pos = src_list->next;
+	struct list_head *dst_pos = dst_list->next;
+	int i, folios_per_chan, ret = 0;
+	dma_cap_mask_t mask;
+	int actual_channels = 0;
+	int max_channels;
+
+	max_channels = min3(nr_dma_channels, nr_folios, MAX_DMA_CHANNELS);
+
+	works = kcalloc(max_channels, sizeof(*works), GFP_KERNEL);
+	if (!works)
+		goto fallback;
+
+	dma_cap_zero(mask);
+	dma_cap_set(DMA_MEMCPY, mask);
+
+	for (i = 0; i < max_channels; i++) {
+		works[actual_channels].chan = dma_request_chan_by_mask(&mask);
+		if (IS_ERR(works[actual_channels].chan))
+			break;
+		init_completion(&works[actual_channels].done);
+		actual_channels++;
+	}
+
+	if (actual_channels == 0) {
+		kfree(works);
+		goto fallback;
+	}
+
+	for (i = 0; i < actual_channels; i++) {
+		folios_per_chan = nr_folios * (i + 1) / actual_channels -
+					(nr_folios * i) / actual_channels;
+		if (folios_per_chan == 0)
+			continue;
+
+		ret = setup_sg_tables(&works[i], &src_pos, &dst_pos, folios_per_chan);
+	if (ret)
+		goto cleanup;
+	}
+
+	for (i = 0; i < actual_channels; i++) {
+		ret = submit_dma_transfers(&works[i]);
+		if (ret) {
+			dev_err(dmaengine_get_dma_device(works[i].chan),
+				"Failed to submit transfers for channel %d\n", i);
+			goto cleanup;
+		}
+	}
+
+	for (i = 0; i < actual_channels; i++) {
+		if (atomic_read(&works[i].pending) > 0)
+			dma_async_issue_pending(works[i].chan);
+	}
+
+	for (i = 0; i < actual_channels; i++) {
+		if (atomic_read(&works[i].pending) > 0) {
+			if (!wait_for_completion_timeout(&works[i].done, msecs_to_jiffies(10000))) {
+				dev_err(dmaengine_get_dma_device(works[i].chan),
+					"DMA timeout on channel %d\n", i);
+				ret = -ETIMEDOUT;
+				goto cleanup;
+			}
+		}
+	}
+
+cleanup:
+	cleanup_dma_work(works, actual_channels);
+	if (ret)
+		goto fallback;
+	return 0;
+fallback:
+	/* Fall back to CPU copy */
+	pr_err("DCBM: Falling back to CPU copy\n");
+	folios_mc_copy(dst_list, src_list, nr_folios);
+	return 0;
+}
+
+static struct migrator dma_migrator = {
+	.name = "DCBM",
+	.migrate_offc = folios_copy_dma,
+	.owner = THIS_MODULE,
+};
+
+static ssize_t offloading_show(struct kobject *kobj,
+			       struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%d\n", offloading_enabled);
+}
+
+static ssize_t offloading_store(struct kobject *kobj,
+				struct kobj_attribute *attr,
+				const char *buf, size_t count)
+{
+	bool enable;
+	int ret;
+
+	ret = kstrtobool(buf, &enable);
+	if (ret)
+		return ret;
+
+	mutex_lock(&dcbm_mutex);
+
+	if (enable == offloading_enabled) {
+		pr_err("migration offloading is already %s\n",
+			 enable ? "ON" : "OFF");
+		goto out;
+	}
+
+	if (enable) {
+		start_offloading(&dma_migrator);
+		offloading_enabled = true;
+		pr_info("migration offloading is now ON\n");
+	} else {
+		stop_offloading();
+		offloading_enabled = false;
+		pr_info("migration offloading is now OFF\n");
+	}
+out:
+	mutex_unlock(&dcbm_mutex);
+	return count;
+}
+
+static ssize_t nr_dma_chan_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%u\n", nr_dma_channels);
+}
+
+static ssize_t nr_dma_chan_store(struct kobject *kobj,
+				 struct kobj_attribute *attr,
+				 const char *buf, size_t count)
+{
+	unsigned int val;
+	int ret;
+
+	ret = kstrtouint(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	if (val < 1 || val > MAX_DMA_CHANNELS)
+		return -EINVAL;
+
+	mutex_lock(&dcbm_mutex);
+	nr_dma_channels = val;
+	mutex_unlock(&dcbm_mutex);
+
+	return count;
+}
+
+static struct kobj_attribute offloading_attr = __ATTR_RW(offloading);
+static struct kobj_attribute nr_dma_chan_attr = __ATTR_RW(nr_dma_chan);
+
+static struct attribute *dcbm_attrs[] = {
+	&offloading_attr.attr,
+	&nr_dma_chan_attr.attr,
+	NULL,
+};
+ATTRIBUTE_GROUPS(dcbm);
+
+static struct kobject *dcbm_kobj;
+
+static int __init dcbm_init(void)
+{
+	int ret;
+
+	dcbm_kobj = kobject_create_and_add("dcbm", kernel_kobj);
+	if (!dcbm_kobj)
+		return -ENOMEM;
+
+	ret = sysfs_create_groups(dcbm_kobj, dcbm_groups);
+	if (ret) {
+		kobject_put(dcbm_kobj);
+		return ret;
+	}
+
+	pr_info("DMA Core Batch Migrator initialized\n");
+	return 0;
+}
+
+static void __exit dcbm_exit(void)
+{
+	/* Ensure offloading is stopped before module unload */
+	mutex_lock(&dcbm_mutex);
+	if (offloading_enabled) {
+		stop_offloading();
+		offloading_enabled = false;
+	}
+	mutex_unlock(&dcbm_mutex);
+
+	sysfs_remove_groups(dcbm_kobj, dcbm_groups);
+	kobject_put(dcbm_kobj);
+
+	pr_info("DMA Core Batch Migrator unloaded\n");
+}
+
+module_init(dcbm_init);
+module_exit(dcbm_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Shivank Garg");
+MODULE_DESCRIPTION("DMA Core Batch Migrator");
-- 
2.43.0



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC V3 7/9] dcbm: add dma core batch migrator for batch page offloading
  2025-09-23 17:47 ` [RFC V3 7/9] dcbm: add dma core batch migrator for batch page offloading Shivank Garg
@ 2025-10-02 11:38   ` Jonathan Cameron
  2025-10-16  9:59     ` Garg, Shivank
  0 siblings, 1 reply; 26+ messages in thread
From: Jonathan Cameron @ 2025-10-02 11:38 UTC (permalink / raw)
  To: Shivank Garg
  Cc: akpm, david, ziy, willy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, lorenzo.stoakes,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, vkoul,
	lucas.demarchi, rdunlap, jgg, kuba, justonli, ivecera,
	dave.jiang, dan.j.williams, rientjes,
	Raghavendra.KodsaraThimmappa, bharata, alirad.malek, yiannis,
	weixugc, linux-kernel, linux-mm

On Tue, 23 Sep 2025 17:47:42 +0000
Shivank Garg <shivankg@amd.com> wrote:

> The dcbm (DMA core batch migrator) provides a generic interface using
> DMAEngine for end-to-end testing of the batch page migration offload
> feature.
> 
> Enable DCBM offload:
> echo 1 > /sys/kernel/dcbm/offloading
> echo NR_DMA_CHAN_TO_USE > /sys/kernel/dcbm/nr_dma_chan
> 
> Disable DCBM offload:
> echo 0 > /sys/kernel/dcbm/offloading
> 
> Signed-off-by: Shivank Garg <shivankg@amd.com>
Hi Shivank,

Some minor comments inline.

J
> ---
>  drivers/migoffcopy/Kconfig       |   8 +
>  drivers/migoffcopy/Makefile      |   1 +
>  drivers/migoffcopy/dcbm/Makefile |   1 +
>  drivers/migoffcopy/dcbm/dcbm.c   | 415 +++++++++++++++++++++++++++++++
>  4 files changed, 425 insertions(+)
>  create mode 100644 drivers/migoffcopy/dcbm/Makefile
>  create mode 100644 drivers/migoffcopy/dcbm/dcbm.c
> 
> diff --git a/drivers/migoffcopy/Kconfig b/drivers/migoffcopy/Kconfig
> index e73698af3e72..c1b2eff7650d 100644
> --- a/drivers/migoffcopy/Kconfig
> +++ b/drivers/migoffcopy/Kconfig
> @@ -6,4 +6,12 @@ config MTCOPY_CPU
>           Interface MT COPY CPU driver for batch page migration
>           offloading. Say Y if you want to try offloading with
>           MultiThreaded CPU copy APIs.

I'd put a blank line here.

> +config DCBM_DMA
> +       bool "DMA Core Batch Migrator"
> +       depends on OFFC_MIGRATION && DMA_ENGINE
> +       default n

no need to say this. Everything is default n.

> +       help
> +         Interface DMA driver for batch page migration offloading.
> +         Say Y if you want to try offloading with DMAEngine APIs
> +         based driver.

Similar comment on the 'try'

>  
> diff --git a/drivers/migoffcopy/Makefile b/drivers/migoffcopy/Makefile
> index 0a3c356d67e6..dedc86ff54c1 100644
> --- a/drivers/migoffcopy/Makefile
> +++ b/drivers/migoffcopy/Makefile
> @@ -1 +1,2 @@
>  obj-$(CONFIG_MTCOPY_CPU) += mtcopy/
> +obj-$(CONFIG_DCBM_DMA)		+= dcbm/
> diff --git a/drivers/migoffcopy/dcbm/Makefile b/drivers/migoffcopy/dcbm/Makefile
> new file mode 100644
> index 000000000000..56ba47cce0f1
> --- /dev/null
> +++ b/drivers/migoffcopy/dcbm/Makefile
> @@ -0,0 +1 @@
> +obj-$(CONFIG_DCBM_DMA) += dcbm.o
> diff --git a/drivers/migoffcopy/dcbm/dcbm.c b/drivers/migoffcopy/dcbm/dcbm.c
> new file mode 100644
> index 000000000000..87a58c0c3b9b
> --- /dev/null
> +++ b/drivers/migoffcopy/dcbm/dcbm.c


> +/**
> + * folios_copy_dma - Copy folios using DMA engine
> + * @dst_list: Destination folio list
> + * @src_list: Source folio list
> + * @nr_folios: Number of folios to copy
> + *
> + * Return: 0. Fallback to CPU copy on any error.
> + */
> +static int folios_copy_dma(struct list_head *dst_list,
> +			   struct list_head *src_list,
> +			   unsigned int nr_folios)
> +{
> +	struct dma_work *works;
> +	struct list_head *src_pos = src_list->next;
> +	struct list_head *dst_pos = dst_list->next;
> +	int i, folios_per_chan, ret = 0;
> +	dma_cap_mask_t mask;
> +	int actual_channels = 0;
> +	int max_channels;
> +
> +	max_channels = min3(nr_dma_channels, nr_folios, MAX_DMA_CHANNELS);
> +
> +	works = kcalloc(max_channels, sizeof(*works), GFP_KERNEL);
> +	if (!works)
> +		goto fallback;
> +
> +	dma_cap_zero(mask);
> +	dma_cap_set(DMA_MEMCPY, mask);
> +
> +	for (i = 0; i < max_channels; i++) {
> +		works[actual_channels].chan = dma_request_chan_by_mask(&mask);
> +		if (IS_ERR(works[actual_channels].chan))
> +			break;
> +		init_completion(&works[actual_channels].done);
> +		actual_channels++;
> +	}
> +
> +	if (actual_channels == 0) {
> +		kfree(works);
> +		goto fallback;
> +	}
> +
> +	for (i = 0; i < actual_channels; i++) {
> +		folios_per_chan = nr_folios * (i + 1) / actual_channels -
> +					(nr_folios * i) / actual_channels;
> +		if (folios_per_chan == 0)
> +			continue;
> +
> +		ret = setup_sg_tables(&works[i], &src_pos, &dst_pos, folios_per_chan);
> +	if (ret)
> +		goto cleanup;
> +	}

Indent issues here.

> +
> +	for (i = 0; i < actual_channels; i++) {
> +		ret = submit_dma_transfers(&works[i]);
> +		if (ret) {
> +			dev_err(dmaengine_get_dma_device(works[i].chan),
> +				"Failed to submit transfers for channel %d\n", i);
> +			goto cleanup;
> +		}
> +	}
> +
> +	for (i = 0; i < actual_channels; i++) {
> +		if (atomic_read(&works[i].pending) > 0)
> +			dma_async_issue_pending(works[i].chan);
> +	}
> +
> +	for (i = 0; i < actual_channels; i++) {
> +		if (atomic_read(&works[i].pending) > 0) {
I'd flip logic to something like.
		if (!atomic_read(&works[i].pending)
			continue;

		if (!wait_for_...

Just to reduce the deep indent.

> +			if (!wait_for_completion_timeout(&works[i].done, msecs_to_jiffies(10000))) {
> +				dev_err(dmaengine_get_dma_device(works[i].chan),
> +					"DMA timeout on channel %d\n", i);
> +				ret = -ETIMEDOUT;
> +				goto cleanup;
> +			}
> +		}
> +	}
> +
> +cleanup:
> +	cleanup_dma_work(works, actual_channels);
> +	if (ret)
> +		goto fallback;

This goto goto dance is probably not worth it.  I'd just duplicate the
cleanup_dma_work() call to have a copy in the error path and one in the non error
path.  Then you just end up with a conventional error block of labels + stuff to do.

> +	return 0;
> +fallback:
> +	/* Fall back to CPU copy */
> +	pr_err("DCBM: Falling back to CPU copy\n");
> +	folios_mc_copy(dst_list, src_list, nr_folios);
> +	return 0;
> +}

> +static ssize_t offloading_store(struct kobject *kobj,
> +				struct kobj_attribute *attr,
> +				const char *buf, size_t count)
> +{
> +	bool enable;
> +	int ret;
> +
> +	ret = kstrtobool(buf, &enable);
> +	if (ret)
> +		return ret;
> +
> +	mutex_lock(&dcbm_mutex);
> +
> +	if (enable == offloading_enabled) {
> +		pr_err("migration offloading is already %s\n",
> +			 enable ? "ON" : "OFF");

To me that's not an error. Pointless, but not worth moaning about.
Just exit saying nothing.

> +		goto out;
> +	}
> +
> +	if (enable) {
> +		start_offloading(&dma_migrator);
> +		offloading_enabled = true;
> +		pr_info("migration offloading is now ON\n");
> +	} else {
> +		stop_offloading();
> +		offloading_enabled = false;
> +		pr_info("migration offloading is now OFF\n");
> +	}
> +out:
> +	mutex_unlock(&dcbm_mutex);

Perhaps guard and direct return above


> +	return count;
> +}

> +static struct kobj_attribute offloading_attr = __ATTR_RW(offloading);
> +static struct kobj_attribute nr_dma_chan_attr = __ATTR_RW(nr_dma_chan);
> +
> +static struct attribute *dcbm_attrs[] = {
> +	&offloading_attr.attr,
> +	&nr_dma_chan_attr.attr,
> +	NULL,
Trivial but doesn't need a trailing comma given this is a terminating entry
and nothing should ever come after it.

> +};
> +ATTRIBUTE_GROUPS(dcbm);
> +
> +static struct kobject *dcbm_kobj;
> +
> +static int __init dcbm_init(void)
> +{
> +	int ret;
> +
> +	dcbm_kobj = kobject_create_and_add("dcbm", kernel_kobj);
> +	if (!dcbm_kobj)
> +		return -ENOMEM;
> +
> +	ret = sysfs_create_groups(dcbm_kobj, dcbm_groups);
Why use a group here and separate files in the CPU thread one?

I'd prefer a group there as well given slightly simpler error
handling as seen here.


> +	if (ret) {
> +		kobject_put(dcbm_kobj);
> +		return ret;
> +	}
> +
> +	pr_info("DMA Core Batch Migrator initialized\n");
> +	return 0;
> +}
> +



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC V3 7/9] dcbm: add dma core batch migrator for batch page offloading
  2025-10-02 11:38   ` Jonathan Cameron
@ 2025-10-16  9:59     ` Garg, Shivank
  0 siblings, 0 replies; 26+ messages in thread
From: Garg, Shivank @ 2025-10-16  9:59 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: akpm, david, ziy, willy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, lorenzo.stoakes,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, vkoul,
	lucas.demarchi, rdunlap, jgg, kuba, justonli, ivecera,
	dave.jiang, dan.j.williams, rientjes,
	Raghavendra.KodsaraThimmappa, bharata, alirad.malek, yiannis,
	weixugc, linux-kernel, linux-mm



On 10/2/2025 5:08 PM, Jonathan Cameron wrote:
> On Tue, 23 Sep 2025 17:47:42 +0000
> Shivank Garg <shivankg@amd.com> wrote:
> 
>> The dcbm (DMA core batch migrator) provides a generic interface using
>> DMAEngine for end-to-end testing of the batch page migration offload
>> feature.
>>
>> Enable DCBM offload:
>> echo 1 > /sys/kernel/dcbm/offloading
>> echo NR_DMA_CHAN_TO_USE > /sys/kernel/dcbm/nr_dma_chan
>>
>> Disable DCBM offload:
>> echo 0 > /sys/kernel/dcbm/offloading
>>
>> Signed-off-by: Shivank Garg <shivankg@amd.com>
> Hi Shivank,
> 
> Some minor comments inline.
> 
> J

Hi Jonathan,

Thank you very much for your detailed feedback and review comments.
I have incorporated your suggestions and updated commits are available
on my development branch:
https://github.com/shivankgarg98/linux/tree/shivank/V4_migrate_pages_optimization

I will hold off on posting the RFC V4 series to the mailing list until I have
new findings or other substantial changes to report.

Thanks again,
Shivank


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC V3 8/9] adjust NR_MAX_BATCHED_MIGRATION for testing
  2025-09-23 17:47 [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload Shivank Garg
                   ` (6 preceding siblings ...)
  2025-09-23 17:47 ` [RFC V3 7/9] dcbm: add dma core batch migrator for batch page offloading Shivank Garg
@ 2025-09-23 17:47 ` Shivank Garg
  2025-09-23 17:47 ` [RFC V3 9/9] mtcopy: spread threads across die " Shivank Garg
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Shivank Garg @ 2025-09-23 17:47 UTC (permalink / raw)
  To: akpm, david
  Cc: ziy, willy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, vkoul, lucas.demarchi, rdunlap,
	jgg, kuba, justonli, ivecera, dave.jiang, Jonathan.Cameron,
	dan.j.williams, rientjes, Raghavendra.KodsaraThimmappa, bharata,
	shivankg, alirad.malek, yiannis, weixugc, linux-kernel, linux-mm

From: Zi Yan <ziy@nvidia.com>

change NR_MAX_BATCHED_MIGRATION to HPAGE_PUD_NR to allow batching THP
copies.

These are for testing purpose only.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Shivank Garg <shivankg@amd.com>
---
 mm/migrate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 41bea48d823c..7f50813d87e4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1668,7 +1668,7 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define NR_MAX_BATCHED_MIGRATION	HPAGE_PMD_NR
+#define NR_MAX_BATCHED_MIGRATION	HPAGE_PUD_NR
 #else
 #define NR_MAX_BATCHED_MIGRATION	512
 #endif
-- 
2.43.0



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC V3 9/9] mtcopy: spread threads across die for testing
  2025-09-23 17:47 [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload Shivank Garg
                   ` (7 preceding siblings ...)
  2025-09-23 17:47 ` [RFC V3 8/9] adjust NR_MAX_BATCHED_MIGRATION for testing Shivank Garg
@ 2025-09-23 17:47 ` Shivank Garg
  2025-09-24  1:49 ` [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload Huang, Ying
  2025-09-24  3:22 ` Zi Yan
  10 siblings, 0 replies; 26+ messages in thread
From: Shivank Garg @ 2025-09-23 17:47 UTC (permalink / raw)
  To: akpm, david
  Cc: ziy, willy, matthew.brost, joshua.hahnjy, rakie.kim, byungchul,
	gourry, ying.huang, apopple, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, vkoul, lucas.demarchi, rdunlap,
	jgg, kuba, justonli, ivecera, dave.jiang, Jonathan.Cameron,
	dan.j.williams, rientjes, Raghavendra.KodsaraThimmappa, bharata,
	shivankg, alirad.malek, yiannis, weixugc, linux-kernel, linux-mm

Select CPUs using sysfs
For testing purpose only.

Signed-off-by: Shivank Garg <shivankg@amd.com>
---
 drivers/migoffcopy/mtcopy/copy_pages.c | 76 +++++++++++++++++++++++++-
 1 file changed, 73 insertions(+), 3 deletions(-)

diff --git a/drivers/migoffcopy/mtcopy/copy_pages.c b/drivers/migoffcopy/mtcopy/copy_pages.c
index 68e50de602d6..e605acca39d0 100644
--- a/drivers/migoffcopy/mtcopy/copy_pages.c
+++ b/drivers/migoffcopy/mtcopy/copy_pages.c
@@ -15,11 +15,37 @@
 #include <linux/migrate.h>
 #include <linux/migrate_offc.h>
 
-#define MAX_NUM_COPY_THREADS 64
+#define MAX_NUM_COPY_THREADS 32
 
 unsigned int limit_mt_num = 4;
 static int is_dispatching;
 
+static int cpuselect;
+
+// spread across die
+static const int cpu_id_list_0[] = {
+	0, 8, 16, 24,
+	32, 40, 48, 56,
+	64, 72, 80, 88,
+	96, 104, 112, 120,
+	128, 136, 144, 152,
+	160, 168, 176, 184,
+	192, 200, 208, 216,
+	224, 232, 240, 248};
+
+// don't spread, fill the die
+static const int cpu_id_list_1[] = {
+	0, 1, 2, 3,
+	4, 5, 6, 7,
+	8, 9, 10, 11,
+	12, 13, 14, 15,
+	16, 17, 18, 19,
+	20, 21, 22, 23,
+	24, 25, 26, 27,
+	28, 29, 30, 31};
+
+int cpu_id_list[32] = {0};
+
 static int copy_page_lists_mt(struct list_head *dst_folios,
 		struct list_head *src_folios, unsigned int nr_items);
 
@@ -141,6 +167,39 @@ static ssize_t mt_threads_show(struct kobject *kobj,
 	return sysfs_emit(buf, "%u\n", limit_mt_num);
 }
 
+static ssize_t mt_cpuselect_set(struct kobject *kobj, struct kobj_attribute *attr,
+		const char *buf, size_t count)
+{
+	int ccode;
+	unsigned int cpuconfig;
+
+	ccode = kstrtouint(buf, 0, &cpuconfig);
+	if (ccode) {
+		pr_debug("(%s:) error parsing input %s\n", __func__, buf);
+		return ccode;
+	}
+	mutex_lock(&migratecfg_mutex);
+	cpuselect = cpuconfig;
+	switch (cpuselect) {
+	case 1:
+		memcpy(cpu_id_list, cpu_id_list_1, MAX_NUM_COPY_THREADS*sizeof(int));
+		break;
+	default:
+		memcpy(cpu_id_list, cpu_id_list_0, MAX_NUM_COPY_THREADS*sizeof(int));
+		break;
+	}
+
+	mutex_unlock(&migratecfg_mutex);
+
+	return count;
+}
+
+static ssize_t mt_cpuselect_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%u\n", cpuselect);
+}
+
 int copy_page_lists_mt(struct list_head *dst_folios,
 		struct list_head *src_folios, unsigned int nr_items)
 {
@@ -208,7 +267,7 @@ int copy_page_lists_mt(struct list_head *dst_folios,
 		}
 
 		for (cpu = 0; cpu < total_mt_num; ++cpu)
-			queue_work(system_unbound_wq,
+			queue_work_on(cpu_id_list[cpu], system_unbound_wq,
 				   (struct work_struct *)work_items[cpu]);
 	} else {
 		int num_xfer_per_thread = nr_items / total_mt_num;
@@ -245,7 +304,7 @@ int copy_page_lists_mt(struct list_head *dst_folios,
 			dst2 = list_next_entry(dst, lru);
 
 			if (per_cpu_item_idx == work_items[cpu]->num_items) {
-				queue_work(system_unbound_wq,
+				queue_work_on(cpu_id_list[cpu], system_unbound_wq,
 					(struct work_struct *)work_items[cpu]);
 				per_cpu_item_idx = 0;
 				cpu++;
@@ -276,6 +335,8 @@ static struct kobj_attribute mt_offloading_attribute = __ATTR(offloading, 0664,
 		mt_offloading_show, mt_offloading_set);
 static struct kobj_attribute mt_threads_attribute = __ATTR(threads, 0664,
 		mt_threads_show, mt_threads_set);
+static struct kobj_attribute mt_cpuselect_attribute = __ATTR(cpuselect, 0664,
+		mt_cpuselect_show, mt_cpuselect_set);
 
 static int __init cpu_mt_module_init(void)
 {
@@ -293,10 +354,18 @@ static int __init cpu_mt_module_init(void)
 	if (ret)
 		goto out_threads;
 
+	ret = sysfs_create_file(mt_kobj_ref, &mt_cpuselect_attribute.attr);
+	if (ret)
+		goto out_cpuselect;
+
+	memcpy(cpu_id_list, cpu_id_list_0, MAX_NUM_COPY_THREADS*sizeof(int));
+
 	is_dispatching = 0;
 
 	return 0;
 
+out_cpuselect:
+	sysfs_remove_file(mt_kobj_ref, &mt_threads_attribute.attr);
 out_threads:
 	sysfs_remove_file(mt_kobj_ref, &mt_offloading_attribute.attr);
 out_offloading:
@@ -314,6 +383,7 @@ static void __exit cpu_mt_module_exit(void)
 	}
 	mutex_unlock(&migratecfg_mutex);
 
+	sysfs_remove_file(mt_kobj_ref, &mt_cpuselect_attribute.attr);
 	sysfs_remove_file(mt_kobj_ref, &mt_threads_attribute.attr);
 	sysfs_remove_file(mt_kobj_ref, &mt_offloading_attribute.attr);
 	kobject_put(mt_kobj_ref);
-- 
2.43.0



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload
  2025-09-23 17:47 [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload Shivank Garg
                   ` (8 preceding siblings ...)
  2025-09-23 17:47 ` [RFC V3 9/9] mtcopy: spread threads across die " Shivank Garg
@ 2025-09-24  1:49 ` Huang, Ying
  2025-09-24  2:03   ` Zi Yan
  2025-09-24  3:22 ` Zi Yan
  10 siblings, 1 reply; 26+ messages in thread
From: Huang, Ying @ 2025-09-24  1:49 UTC (permalink / raw)
  To: Shivank Garg
  Cc: akpm, david, ziy, willy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, apopple, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, vkoul, lucas.demarchi, rdunlap,
	jgg, kuba, justonli, ivecera, dave.jiang, Jonathan.Cameron,
	dan.j.williams, rientjes, Raghavendra.KodsaraThimmappa, bharata,
	alirad.malek, yiannis, weixugc, linux-kernel, linux-mm

Hi, Shivank,

Thanks for working on this!

Shivank Garg <shivankg@amd.com> writes:

> This is the third RFC of the patchset to enhance page migration by batching
> folio-copy operations and enabling acceleration via multi-threaded CPU or
> DMA offload.
>
> Single-threaded, folio-by-folio copying bottlenecks page migration
> in modern systems with deep memory hierarchies, especially for large
> folios where copy overhead dominates, leaving significant hardware
> potential untapped. 
>
> By batching the copy phase, we create an opportunity for significant
> hardware acceleration. This series builds a framework for this acceleration
> and provides two initial offload driver implementations: one using multiple
> CPU threads (mtcopy) and another leveraging the DMAEngine subsystem (dcbm).
>
> This version incorporates significant feedback to improve correctness,
> robustness, and the efficiency of the DMA offload path.
>
> Changelog since V2:
>
> 1. DMA Engine Rewrite:
>    - Switched from per-folio dma_map_page() to batch dma_map_sgtable()
>    - Single completion interrupt per batch (reduced overhead)
>    - Order of magnitude improvement in setup time for large batches
> 2. Code cleanups and refactoring
> 3. Rebased on latest mainline (6.17-rc6+)
>
> MOTIVATION:
> -----------
>
> Current Migration Flow:
> [ move_pages(), Compaction, Tiering, etc. ]
>               |
>               v
>      [ migrate_pages() ] // Common entry point
>               |
>               v
>     [ migrate_pages_batch() ] // NR_MAX_BATCHED_MIGRATION (512) folios at a time
>       |
>       |--> [ migrate_folio_unmap() ]
>       |
>       |--> [ try_to_unmap_flush() ] // Perform a single, batched TLB flush
>       |
>       |--> [ migrate_folios_move() ] // Bottleneck: Interleaved copy
>            - For each folio:
>              - Metadata prep: Copy flags, mappings, etc.
>              - folio_copy()  <-- Single-threaded, serial data copy.
>              - Update PTEs & finalize for that single folio.
>              
> Understanding overheads in page migration (move_pages() syscall):
>
> Total move_pages() overheads = folio_copy() + Other overheads
> 1. folio_copy() is the core copy operation that interests us.
> 2. The remaining operations are user/kernel transitions, page table walks,
> locking, folio unmap, dst folio alloc, TLB flush, copying flags, updating
> mappings and PTEs etc. that contribute to the remaining overheads.
>
> Percentage of folio_copy() overheads in move_pages(N pages) syscall time:
> Number of pages being migrated and folio size:
>             4KB     2MB
> 1 page     <1%     ~66%
> 512 page   ~35%    ~97%
>
> Based on Amdahl's Law, optimizing folio_copy() for large pages offers a
> substantial performance opportunity.
>
> move_pages() syscall speedup = 1 / ((1 - F) + (F / S))
> Where F is the fraction of time spent in folio_copy() and S is the speedup of
> folio_copy().
>
> For 4KB folios, folio copy overheads are significantly small in single-page
> migrations to impact overall speedup, even for 512 pages, maximum theoretical
> speedup is limited to ~1.54x with infinite folio_copy() speedup.
>
> For 2MB THPs, folio copy overheads are significant even in single page
> migrations, with a theoretical speedup of ~3x with infinite folio_copy()
> speedup and up to ~33x for 512 pages.
>
> A realistic value of S (speedup of folio_copy()) is 7.5x for DMA offload
> based on my measurements for copying 512 2MB pages.
> This gives move_pages(), a practical speedup of 6.3x for 512 2MB page (also
> observed in the experiments below).
>
> DESIGN: A Pluggable Migrator Framework
> ---------------------------------------
>
> Introduce migrate_folios_batch_move():
>
> [ migrate_pages_batch() ]
>     |
>     |--> migrate_folio_unmap()
>     |      
>     |--> try_to_unmap_flush()
>     |
>     +--> [ migrate_folios_batch_move() ] // new batched design
>             |
>             |--> Metadata migration
>             |    - Metadata prep: Copy flags, mappings, etc.
>             |    - Use MIGRATE_NO_COPY to skip the actual data copy.
>             |
>             |--> Batch copy folio data
>             |    - Migrator is configurable at runtime via sysfs.
>             |
>             |          static_call(_folios_copy) // Pluggable migrators
>             |          /          |            \
>             |         v           v             v
>             | [ Default ]  [ MT CPU copy ]  [ DMA Offload ]
>             |
>             +--> Update PTEs to point to dst folios and complete migration.
>

I just jump in the discussion, so this may be discussed before already.
Sorry if so.  Why not

migrate_folios_unmap()
try_to_unmap_flush()
copy folios in parallel if possible
migrate_folios_move(): with MIGRATE_NO_COPY?

> User Control of Migrator:
>
> # echo 1 > /sys/kernel/dcbm/offloading
>    |
>    +--> Driver's sysfs handler
>         |
>         +--> calls start_offloading(&cpu_migrator)
>               |
>               +--> calls offc_update_migrator()
>                     |
>                     +--> static_call_update(_folios_copy, mig->migrate_offc)
>
> Later, During Migration ...
> migrate_folios_batch_move()
>     |
>     +--> static_call(_folios_copy) // Now dispatches to the selected migrator
>           |
>           +-> [ mtcopy | dcbm | kernel_default ]
>

[snip]

---
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload
  2025-09-24  1:49 ` [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload Huang, Ying
@ 2025-09-24  2:03   ` Zi Yan
  2025-09-24  3:11     ` Huang, Ying
  0 siblings, 1 reply; 26+ messages in thread
From: Zi Yan @ 2025-09-24  2:03 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Shivank Garg, akpm, david, willy, matthew.brost, joshua.hahnjy,
	rakie.kim, byungchul, gourry, apopple, lorenzo.stoakes,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, vkoul,
	lucas.demarchi, rdunlap, jgg, kuba, justonli, ivecera,
	dave.jiang, Jonathan.Cameron, dan.j.williams, rientjes,
	Raghavendra.KodsaraThimmappa, bharata, alirad.malek, yiannis,
	weixugc, linux-kernel, linux-mm

On 23 Sep 2025, at 21:49, Huang, Ying wrote:

> Hi, Shivank,
>
> Thanks for working on this!
>
> Shivank Garg <shivankg@amd.com> writes:
>
>> This is the third RFC of the patchset to enhance page migration by batching
>> folio-copy operations and enabling acceleration via multi-threaded CPU or
>> DMA offload.
>>
>> Single-threaded, folio-by-folio copying bottlenecks page migration
>> in modern systems with deep memory hierarchies, especially for large
>> folios where copy overhead dominates, leaving significant hardware
>> potential untapped.
>>
>> By batching the copy phase, we create an opportunity for significant
>> hardware acceleration. This series builds a framework for this acceleration
>> and provides two initial offload driver implementations: one using multiple
>> CPU threads (mtcopy) and another leveraging the DMAEngine subsystem (dcbm).
>>
>> This version incorporates significant feedback to improve correctness,
>> robustness, and the efficiency of the DMA offload path.
>>
>> Changelog since V2:
>>
>> 1. DMA Engine Rewrite:
>>    - Switched from per-folio dma_map_page() to batch dma_map_sgtable()
>>    - Single completion interrupt per batch (reduced overhead)
>>    - Order of magnitude improvement in setup time for large batches
>> 2. Code cleanups and refactoring
>> 3. Rebased on latest mainline (6.17-rc6+)
>>
>> MOTIVATION:
>> -----------
>>
>> Current Migration Flow:
>> [ move_pages(), Compaction, Tiering, etc. ]
>>               |
>>               v
>>      [ migrate_pages() ] // Common entry point
>>               |
>>               v
>>     [ migrate_pages_batch() ] // NR_MAX_BATCHED_MIGRATION (512) folios at a time
>>       |
>>       |--> [ migrate_folio_unmap() ]
>>       |
>>       |--> [ try_to_unmap_flush() ] // Perform a single, batched TLB flush
>>       |
>>       |--> [ migrate_folios_move() ] // Bottleneck: Interleaved copy
>>            - For each folio:
>>              - Metadata prep: Copy flags, mappings, etc.
>>              - folio_copy()  <-- Single-threaded, serial data copy.
>>              - Update PTEs & finalize for that single folio.
>>
>> Understanding overheads in page migration (move_pages() syscall):
>>
>> Total move_pages() overheads = folio_copy() + Other overheads
>> 1. folio_copy() is the core copy operation that interests us.
>> 2. The remaining operations are user/kernel transitions, page table walks,
>> locking, folio unmap, dst folio alloc, TLB flush, copying flags, updating
>> mappings and PTEs etc. that contribute to the remaining overheads.
>>
>> Percentage of folio_copy() overheads in move_pages(N pages) syscall time:
>> Number of pages being migrated and folio size:
>>             4KB     2MB
>> 1 page     <1%     ~66%
>> 512 page   ~35%    ~97%
>>
>> Based on Amdahl's Law, optimizing folio_copy() for large pages offers a
>> substantial performance opportunity.
>>
>> move_pages() syscall speedup = 1 / ((1 - F) + (F / S))
>> Where F is the fraction of time spent in folio_copy() and S is the speedup of
>> folio_copy().
>>
>> For 4KB folios, folio copy overheads are significantly small in single-page
>> migrations to impact overall speedup, even for 512 pages, maximum theoretical
>> speedup is limited to ~1.54x with infinite folio_copy() speedup.
>>
>> For 2MB THPs, folio copy overheads are significant even in single page
>> migrations, with a theoretical speedup of ~3x with infinite folio_copy()
>> speedup and up to ~33x for 512 pages.
>>
>> A realistic value of S (speedup of folio_copy()) is 7.5x for DMA offload
>> based on my measurements for copying 512 2MB pages.
>> This gives move_pages(), a practical speedup of 6.3x for 512 2MB page (also
>> observed in the experiments below).
>>
>> DESIGN: A Pluggable Migrator Framework
>> ---------------------------------------
>>
>> Introduce migrate_folios_batch_move():
>>
>> [ migrate_pages_batch() ]
>>     |
>>     |--> migrate_folio_unmap()
>>     |
>>     |--> try_to_unmap_flush()
>>     |
>>     +--> [ migrate_folios_batch_move() ] // new batched design
>>             |
>>             |--> Metadata migration
>>             |    - Metadata prep: Copy flags, mappings, etc.
>>             |    - Use MIGRATE_NO_COPY to skip the actual data copy.
>>             |
>>             |--> Batch copy folio data
>>             |    - Migrator is configurable at runtime via sysfs.
>>             |
>>             |          static_call(_folios_copy) // Pluggable migrators
>>             |          /          |            \
>>             |         v           v             v
>>             | [ Default ]  [ MT CPU copy ]  [ DMA Offload ]
>>             |
>>             +--> Update PTEs to point to dst folios and complete migration.
>>
>
> I just jump in the discussion, so this may be discussed before already.
> Sorry if so.  Why not
>
> migrate_folios_unmap()
> try_to_unmap_flush()
> copy folios in parallel if possible
> migrate_folios_move(): with MIGRATE_NO_COPY?

Since in move_to_new_folio(), there are various migration preparation
works, which can fail. Copying folios regardless might lead to some
unnecessary work. What is your take on this?

>
>> User Control of Migrator:
>>
>> # echo 1 > /sys/kernel/dcbm/offloading
>>    |
>>    +--> Driver's sysfs handler
>>         |
>>         +--> calls start_offloading(&cpu_migrator)
>>               |
>>               +--> calls offc_update_migrator()
>>                     |
>>                     +--> static_call_update(_folios_copy, mig->migrate_offc)
>>
>> Later, During Migration ...
>> migrate_folios_batch_move()
>>     |
>>     +--> static_call(_folios_copy) // Now dispatches to the selected migrator
>>           |
>>           +-> [ mtcopy | dcbm | kernel_default ]
>>
>
> [snip]
>
> ---
> Best Regards,
> Huang, Ying


Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload
  2025-09-24  2:03   ` Zi Yan
@ 2025-09-24  3:11     ` Huang, Ying
  0 siblings, 0 replies; 26+ messages in thread
From: Huang, Ying @ 2025-09-24  3:11 UTC (permalink / raw)
  To: Zi Yan
  Cc: Shivank Garg, akpm, david, willy, matthew.brost, joshua.hahnjy,
	rakie.kim, byungchul, gourry, apopple, lorenzo.stoakes,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, vkoul,
	lucas.demarchi, rdunlap, jgg, kuba, justonli, ivecera,
	dave.jiang, Jonathan.Cameron, dan.j.williams, rientjes,
	Raghavendra.KodsaraThimmappa, bharata, alirad.malek, yiannis,
	weixugc, linux-kernel, linux-mm

Zi Yan <ziy@nvidia.com> writes:

> On 23 Sep 2025, at 21:49, Huang, Ying wrote:
>
>> Hi, Shivank,
>>
>> Thanks for working on this!
>>
>> Shivank Garg <shivankg@amd.com> writes:
>>
>>> This is the third RFC of the patchset to enhance page migration by batching
>>> folio-copy operations and enabling acceleration via multi-threaded CPU or
>>> DMA offload.
>>>
>>> Single-threaded, folio-by-folio copying bottlenecks page migration
>>> in modern systems with deep memory hierarchies, especially for large
>>> folios where copy overhead dominates, leaving significant hardware
>>> potential untapped.
>>>
>>> By batching the copy phase, we create an opportunity for significant
>>> hardware acceleration. This series builds a framework for this acceleration
>>> and provides two initial offload driver implementations: one using multiple
>>> CPU threads (mtcopy) and another leveraging the DMAEngine subsystem (dcbm).
>>>
>>> This version incorporates significant feedback to improve correctness,
>>> robustness, and the efficiency of the DMA offload path.
>>>
>>> Changelog since V2:
>>>
>>> 1. DMA Engine Rewrite:
>>>    - Switched from per-folio dma_map_page() to batch dma_map_sgtable()
>>>    - Single completion interrupt per batch (reduced overhead)
>>>    - Order of magnitude improvement in setup time for large batches
>>> 2. Code cleanups and refactoring
>>> 3. Rebased on latest mainline (6.17-rc6+)
>>>
>>> MOTIVATION:
>>> -----------
>>>
>>> Current Migration Flow:
>>> [ move_pages(), Compaction, Tiering, etc. ]
>>>               |
>>>               v
>>>      [ migrate_pages() ] // Common entry point
>>>               |
>>>               v
>>>     [ migrate_pages_batch() ] // NR_MAX_BATCHED_MIGRATION (512) folios at a time
>>>       |
>>>       |--> [ migrate_folio_unmap() ]
>>>       |
>>>       |--> [ try_to_unmap_flush() ] // Perform a single, batched TLB flush
>>>       |
>>>       |--> [ migrate_folios_move() ] // Bottleneck: Interleaved copy
>>>            - For each folio:
>>>              - Metadata prep: Copy flags, mappings, etc.
>>>              - folio_copy()  <-- Single-threaded, serial data copy.
>>>              - Update PTEs & finalize for that single folio.
>>>
>>> Understanding overheads in page migration (move_pages() syscall):
>>>
>>> Total move_pages() overheads = folio_copy() + Other overheads
>>> 1. folio_copy() is the core copy operation that interests us.
>>> 2. The remaining operations are user/kernel transitions, page table walks,
>>> locking, folio unmap, dst folio alloc, TLB flush, copying flags, updating
>>> mappings and PTEs etc. that contribute to the remaining overheads.
>>>
>>> Percentage of folio_copy() overheads in move_pages(N pages) syscall time:
>>> Number of pages being migrated and folio size:
>>>             4KB     2MB
>>> 1 page     <1%     ~66%
>>> 512 page   ~35%    ~97%
>>>
>>> Based on Amdahl's Law, optimizing folio_copy() for large pages offers a
>>> substantial performance opportunity.
>>>
>>> move_pages() syscall speedup = 1 / ((1 - F) + (F / S))
>>> Where F is the fraction of time spent in folio_copy() and S is the speedup of
>>> folio_copy().
>>>
>>> For 4KB folios, folio copy overheads are significantly small in single-page
>>> migrations to impact overall speedup, even for 512 pages, maximum theoretical
>>> speedup is limited to ~1.54x with infinite folio_copy() speedup.
>>>
>>> For 2MB THPs, folio copy overheads are significant even in single page
>>> migrations, with a theoretical speedup of ~3x with infinite folio_copy()
>>> speedup and up to ~33x for 512 pages.
>>>
>>> A realistic value of S (speedup of folio_copy()) is 7.5x for DMA offload
>>> based on my measurements for copying 512 2MB pages.
>>> This gives move_pages(), a practical speedup of 6.3x for 512 2MB page (also
>>> observed in the experiments below).
>>>
>>> DESIGN: A Pluggable Migrator Framework
>>> ---------------------------------------
>>>
>>> Introduce migrate_folios_batch_move():
>>>
>>> [ migrate_pages_batch() ]
>>>     |
>>>     |--> migrate_folio_unmap()
>>>     |
>>>     |--> try_to_unmap_flush()
>>>     |
>>>     +--> [ migrate_folios_batch_move() ] // new batched design
>>>             |
>>>             |--> Metadata migration
>>>             |    - Metadata prep: Copy flags, mappings, etc.
>>>             |    - Use MIGRATE_NO_COPY to skip the actual data copy.
>>>             |
>>>             |--> Batch copy folio data
>>>             |    - Migrator is configurable at runtime via sysfs.
>>>             |
>>>             |          static_call(_folios_copy) // Pluggable migrators
>>>             |          /          |            \
>>>             |         v           v             v
>>>             | [ Default ]  [ MT CPU copy ]  [ DMA Offload ]
>>>             |
>>>             +--> Update PTEs to point to dst folios and complete migration.
>>>
>>
>> I just jump in the discussion, so this may be discussed before already.
>> Sorry if so.  Why not
>>
>> migrate_folios_unmap()
>> try_to_unmap_flush()
>> copy folios in parallel if possible
>> migrate_folios_move(): with MIGRATE_NO_COPY?
>
> Since in move_to_new_folio(), there are various migration preparation
> works, which can fail. Copying folios regardless might lead to some
> unnecessary work. What is your take on this?

Good point, we should skip copying folios that fails the checks.

>>
>>> User Control of Migrator:
>>>
>>> # echo 1 > /sys/kernel/dcbm/offloading
>>>    |
>>>    +--> Driver's sysfs handler
>>>         |
>>>         +--> calls start_offloading(&cpu_migrator)
>>>               |
>>>               +--> calls offc_update_migrator()
>>>                     |
>>>                     +--> static_call_update(_folios_copy, mig->migrate_offc)
>>>
>>> Later, During Migration ...
>>> migrate_folios_batch_move()
>>>     |
>>>     +--> static_call(_folios_copy) // Now dispatches to the selected migrator
>>>           |
>>>           +-> [ mtcopy | dcbm | kernel_default ]
>>>
>>
>> [snip]

---
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload
  2025-09-23 17:47 [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload Shivank Garg
                   ` (9 preceding siblings ...)
  2025-09-24  1:49 ` [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload Huang, Ying
@ 2025-09-24  3:22 ` Zi Yan
  2025-10-02 17:10   ` Garg, Shivank
  10 siblings, 1 reply; 26+ messages in thread
From: Zi Yan @ 2025-09-24  3:22 UTC (permalink / raw)
  To: Shivank Garg
  Cc: akpm, david, willy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, lorenzo.stoakes,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, vkoul,
	lucas.demarchi, rdunlap, jgg, kuba, justonli, ivecera,
	dave.jiang, Jonathan.Cameron, dan.j.williams, rientjes,
	Raghavendra.KodsaraThimmappa, bharata, alirad.malek, yiannis,
	weixugc, linux-kernel, linux-mm

On 23 Sep 2025, at 13:47, Shivank Garg wrote:

> This is the third RFC of the patchset to enhance page migration by batching
> folio-copy operations and enabling acceleration via multi-threaded CPU or
> DMA offload.
>
> Single-threaded, folio-by-folio copying bottlenecks page migration
> in modern systems with deep memory hierarchies, especially for large
> folios where copy overhead dominates, leaving significant hardware
> potential untapped.
>
> By batching the copy phase, we create an opportunity for significant
> hardware acceleration. This series builds a framework for this acceleration
> and provides two initial offload driver implementations: one using multiple
> CPU threads (mtcopy) and another leveraging the DMAEngine subsystem (dcbm).
>
> This version incorporates significant feedback to improve correctness,
> robustness, and the efficiency of the DMA offload path.
>
> Changelog since V2:
>
> 1. DMA Engine Rewrite:
>    - Switched from per-folio dma_map_page() to batch dma_map_sgtable()
>    - Single completion interrupt per batch (reduced overhead)
>    - Order of magnitude improvement in setup time for large batches
> 2. Code cleanups and refactoring
> 3. Rebased on latest mainline (6.17-rc6+)

Thanks for working on this.

It is better to rebase on top of Andrew’s mm-new tree.

I have a version at: https://github.com/x-y-z/linux-dev/tree/batched_page_migration_copy_amd_v3-mm-everything-2025-09-23-00-13.

The difference is that I changed Patch 6 to use padata_do_multithreaded()
instead of my own implementation, since padata is a nice framework
for doing multithreaded jobs. The downside is that your patch 9
no longer applies and you will need to hack kernel/padata.c to
achieve the same thing.

I also tried to attribute back page copy kthread time to the initiating
thread so that page copy time does not disappear when it is parallelized
using CPU threads. It is currently a hack in the last patch from
the above repo. With the patch, I can see system time of a page migration
process with multithreaded page copy looks almost the same as without it,
while wall clock time is smaller. But I have not found time to ask
scheduler people about a proper implementation yet.


>
> MOTIVATION:
> -----------
>
> Current Migration Flow:
> [ move_pages(), Compaction, Tiering, etc. ]
>               |
>               v
>      [ migrate_pages() ] // Common entry point
>               |
>               v
>     [ migrate_pages_batch() ] // NR_MAX_BATCHED_MIGRATION (512) folios at a time
>       |
>       |--> [ migrate_folio_unmap() ]
>       |
>       |--> [ try_to_unmap_flush() ] // Perform a single, batched TLB flush
>       |
>       |--> [ migrate_folios_move() ] // Bottleneck: Interleaved copy
>            - For each folio:
>              - Metadata prep: Copy flags, mappings, etc.
>              - folio_copy()  <-- Single-threaded, serial data copy.
>              - Update PTEs & finalize for that single folio.
>
> Understanding overheads in page migration (move_pages() syscall):
>
> Total move_pages() overheads = folio_copy() + Other overheads
> 1. folio_copy() is the core copy operation that interests us.
> 2. The remaining operations are user/kernel transitions, page table walks,
> locking, folio unmap, dst folio alloc, TLB flush, copying flags, updating
> mappings and PTEs etc. that contribute to the remaining overheads.
>
> Percentage of folio_copy() overheads in move_pages(N pages) syscall time:
> Number of pages being migrated and folio size:
>             4KB     2MB
> 1 page     <1%     ~66%
> 512 page   ~35%    ~97%
>
> Based on Amdahl's Law, optimizing folio_copy() for large pages offers a
> substantial performance opportunity.
>
> move_pages() syscall speedup = 1 / ((1 - F) + (F / S))
> Where F is the fraction of time spent in folio_copy() and S is the speedup of
> folio_copy().
>
> For 4KB folios, folio copy overheads are significantly small in single-page
> migrations to impact overall speedup, even for 512 pages, maximum theoretical
> speedup is limited to ~1.54x with infinite folio_copy() speedup.
>
> For 2MB THPs, folio copy overheads are significant even in single page
> migrations, with a theoretical speedup of ~3x with infinite folio_copy()
> speedup and up to ~33x for 512 pages.
>
> A realistic value of S (speedup of folio_copy()) is 7.5x for DMA offload
> based on my measurements for copying 512 2MB pages.
> This gives move_pages(), a practical speedup of 6.3x for 512 2MB page (also
> observed in the experiments below).
>
> DESIGN: A Pluggable Migrator Framework
> ---------------------------------------
>
> Introduce migrate_folios_batch_move():
>
> [ migrate_pages_batch() ]
>     |
>     |--> migrate_folio_unmap()
>     |
>     |--> try_to_unmap_flush()
>     |
>     +--> [ migrate_folios_batch_move() ] // new batched design
>             |
>             |--> Metadata migration
>             |    - Metadata prep: Copy flags, mappings, etc.
>             |    - Use MIGRATE_NO_COPY to skip the actual data copy.
>             |
>             |--> Batch copy folio data
>             |    - Migrator is configurable at runtime via sysfs.
>             |
>             |          static_call(_folios_copy) // Pluggable migrators
>             |          /          |            \
>             |         v           v             v
>             | [ Default ]  [ MT CPU copy ]  [ DMA Offload ]
>             |
>             +--> Update PTEs to point to dst folios and complete migration.
>
>
> User Control of Migrator:
>
> # echo 1 > /sys/kernel/dcbm/offloading
>    |
>    +--> Driver's sysfs handler
>         |
>         +--> calls start_offloading(&cpu_migrator)
>               |
>               +--> calls offc_update_migrator()
>                     |
>                     +--> static_call_update(_folios_copy, mig->migrate_offc)
>
> Later, During Migration ...
> migrate_folios_batch_move()
>     |
>     +--> static_call(_folios_copy) // Now dispatches to the selected migrator
>           |
>           +-> [ mtcopy | dcbm | kernel_default ]
>
>
> PERFORMANCE RESULTS:
> --------------------
>
> System Info: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT Enabled),
> 1 NUMA node per socket, Linux Kernel 6.16.0-rc6, DVFS set to Performance,
> PTDMA hardware.
>
> Benchmark: Use move_pages() syscall to move pages between two NUMA nodes.
>
> 1. Moving different sized folios (4KB, 16KB,..., 2MB) such that total transfer size is constant
> (1GB), with different number of parallel threads/channels.
> Metric: Throughput is reported in GB/s.
>
> a. Baseline (Vanilla kernel, single-threaded, folio-by-folio migration):
>
> Folio size|4K       | 16K       | 64K        | 128K       | 256K       | 512K       | 1M         | 2M         |
> ===============================================================================================================
> Tput(GB/s)|3.73±0.33| 5.53±0.36 | 5.90±0.56  | 6.34±0.08  | 6.50±0.05  | 6.86±0.61  | 6.92±0.71  | 10.67±0.36 |
>
> b. Multi-threaded CPU copy offload (mtcopy driver, use N Parallel Threads=1,2,4,8,12,16):
>
> Thread | 4K         | 16K       | 64K        | 128K       | 256K       | 512K       | 1M         | 2M         |
> ===============================================================================================================
> 1      | 3.84±0.10  | 5.23±0.31 | 6.01±0.55  | 6.34±0.60  | 7.16±1.00  | 7.12±0.78  | 7.10±0.86  | 10.94±0.13 |
> 2      | 4.04±0.19  | 6.72±0.38 | 7.68±0.12  | 8.15±0.06  | 8.45±0.09  | 9.29±0.17  | 9.87±1.01  | 17.80±0.12 |
> 4      | 4.72±0.21  | 8.41±0.70 | 10.08±1.67 | 11.44±2.42 | 10.45±0.17 | 12.60±1.97 | 12.38±1.73 | 31.41±1.14 |
> 8      | 4.91±0.28  | 8.62±0.13 | 10.40±0.20 | 13.94±3.75 | 11.03±0.61 | 14.96±3.29 | 12.84±0.63 | 33.50±3.29 |
> 12     | 4.84±0.24  | 8.75±0.08 | 10.16±0.26 | 10.92±0.22 | 11.72±0.14 | 14.02±2.51 | 14.09±2.65 | 34.70±2.38 |
> 16     | 4.77±0.22  | 8.95±0.69 | 10.36±0.26 | 11.03±0.22 | 11.58±0.30 | 13.88±2.71 | 13.00±0.75 | 35.89±2.07 |
>
> c. DMA offload (dcbm driver, use N DMA Channels=1,2,4,8,12,16):
>
> Chan Cnt| 4K        | 16K       | 64K        | 128K       | 256K       | 512K       | 1M         | 2M         |
> ===============================================================================================================
> 1      | 2.75±0.19  | 2.86±0.13 | 3.28±0.20  | 4.57±0.72  | 5.03±0.62  | 4.69±0.25  | 4.78±0.34  | 12.50±0.24 |
> 2      | 3.35±0.19  | 4.57±0.19 | 5.35±0.55  | 6.71±0.71  | 7.40±1.07  | 7.38±0.61  | 7.21±0.73  | 14.23±0.34 |
> 4      | 4.01±0.17  | 6.36±0.26 | 7.71±0.89  | 9.40±1.35  | 10.27±1.96 | 10.60±1.42 | 12.35±2.64 | 26.84±0.91 |
> 8      | 4.46±0.16  | 7.74±0.13 | 9.72±1.29  | 10.88±0.16 | 12.12±2.54 | 15.62±3.96 | 13.29±2.65 | 45.27±2.60 |
> 12     | 4.60±0.22  | 8.90±0.84 | 11.26±2.19 | 16.00±4.41 | 14.90±4.38 | 14.57±2.84 | 13.79±3.18 | 59.94±4.19 |
> 16     | 4.61±0.25  | 9.08±0.79 | 11.14±1.75 | 13.95±3.85 | 13.69±3.39 | 15.47±3.44 | 15.44±4.65 | 63.69±5.01 |
>
> - Throughput increases with folio size. Larger folios benefit more from DMA.
> - Scaling shows diminishing returns beyond 8-12 threads/channels.
> - Multi-threading and DMA offloading both provide significant gains - up to 3.4x and 6x respectively.
>
> 2. Varying total move size: (folio count = 1,8,..8192) for a fixed folio size of 2MB
>    using only single thread/channel
>
> folio_cnt | Baseline    | MTCPU      | DMA
> ====================================================
> 1         | 7.96±2.22   | 6.43±0.66  | 6.52±0.45   |
> 8         | 8.20±0.75   | 8.82±1.10  | 8.88±0.54   |
> 16        | 7.54±0.61   | 9.06±0.95  | 9.03±0.62   |
> 32        | 8.68±0.77   | 10.11±0.42 | 10.17±0.50  |
> 64        | 9.08±1.03   | 10.12±0.44 | 11.21±0.24  |
> 256       | 10.53±0.39  | 10.77±0.28 | 12.43±0.12  |
> 512       | 10.59±0.29  | 10.81±0.19 | 12.61±0.07  |
> 2048      | 10.86±0.26  | 11.05±0.05 | 12.75±0.03  |
> 8192      | 10.84±0.18  | 11.12±0.05 | 12.81±0.02  |
>
> - Throughput increases with folios count but plateaus after a threshold.
>   (The migrate_pages function uses a folio batch size of 512)
>
> Performance Analysis (V2 vs V3):
>
> The new SG-based DMA driver dramatically reduces software overhead. By
> switching from per-folio dma_map_page() to batch dma_map_sgtable(), setup
> time improves by an order of magnitude for large batches.
> This is most visible with 4KB folios, making DMA viable even for smaller
> page sizes. For 2MB THP migrations, where hardware transfer time is more
> dominant, the gains are more modest.
>
> OPEN QUESTIONS:
> ---------------
>
> User-Interface:
>
> 1. Control Interface Design:
> The current interface creates separate sysfs files
> for each driver, which can be confusing. Should we implement a unified interface
> (/sys/kernel/mm/migration/offload_migrator), which accepts the name of the desired migrator
> ("kernel", "mtcopy", "dcbm"). This would ensure only one migrator is active at a time.
> Is this the right approach?
>
> 2. Dynamic Migrator Selection:
> Currently, active migrator is a global state, and only one can be active a time.
> A more flexible approach might be for the caller of migrate_pages() to specify/hint which
> offload mechanism to use, if any. This would allow a CXL driver to explicitly request DMA while a GPU driver might prefer
> multi-threaded CPU copy.
>
> 3. Tuning Parameters: Expose parameters like number of threads/channels, batch size,
> and thresholds for using migrators. Who should own these parameters?
>
> 4. Resources Accounting[3]:
> a. CPU cgroups accounting and fairness
> b. Migration cost attribution
>
> FUTURE WORK:
> ------------
>
> 1. Enhance DMA drivers for bulk copying (e.g., SDXi Engine).
> 2. Enhance multi-threaded CPU copying for platform-specific scheduling of worker threads to optimize bandwidth utilization. Explore sched-ext for this. [2]
> 3. Enable kpromoted [4] to use the migration offload infrastructure.
>
> EARLIER POSTINGS:
> -----------------
>
> - RFC V2: https://lore.kernel.org/all/20250319192211.10092-1-shivankg@amd.com
> - RFC V1: https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
>
> REFERENCES:
> -----------
>
> [1] RFC from Zi Yan: https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
> [2] LSFMM: https://lore.kernel.org/all/cf6fc05d-c0b0-4de3-985e-5403977aa3aa@amd.com
> [3] https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com
> [4] https://lore.kernel.org/all/20250910144653.212066-1-bharata@amd.com
>
> Mike Day (1):
>   mm: add support for copy offload for folio Migration
>
> Shivank Garg (4):
>   mm: Introduce folios_mc_copy() for batch copying folios
>   mm/migrate: add migrate_folios_batch_move to  batch the folio move
>     operations
>   dcbm: add dma core batch migrator for batch page offloading
>   mtcopy: spread threads across die for testing
>
> Zi Yan (4):
>   mm/migrate: factor out code in move_to_new_folio() and
>     migrate_folio_move()
>   mm/migrate: revive MIGRATE_NO_COPY in migrate_mode
>   mtcopy: introduce multi-threaded page copy routine
>   adjust NR_MAX_BATCHED_MIGRATION for testing
>
>  drivers/Kconfig                        |   2 +
>  drivers/Makefile                       |   3 +
>  drivers/migoffcopy/Kconfig             |  17 +
>  drivers/migoffcopy/Makefile            |   2 +
>  drivers/migoffcopy/dcbm/Makefile       |   1 +
>  drivers/migoffcopy/dcbm/dcbm.c         | 415 +++++++++++++++++++++++++
>  drivers/migoffcopy/mtcopy/Makefile     |   1 +
>  drivers/migoffcopy/mtcopy/copy_pages.c | 397 +++++++++++++++++++++++
>  include/linux/migrate_mode.h           |   2 +
>  include/linux/migrate_offc.h           |  34 ++
>  include/linux/mm.h                     |   2 +
>  mm/Kconfig                             |   8 +
>  mm/Makefile                            |   1 +
>  mm/migrate.c                           | 358 ++++++++++++++++++---
>  mm/migrate_offc.c                      |  58 ++++
>  mm/util.c                              |  29 ++
>  16 files changed, 1284 insertions(+), 46 deletions(-)
>  create mode 100644 drivers/migoffcopy/Kconfig
>  create mode 100644 drivers/migoffcopy/Makefile
>  create mode 100644 drivers/migoffcopy/dcbm/Makefile
>  create mode 100644 drivers/migoffcopy/dcbm/dcbm.c
>  create mode 100644 drivers/migoffcopy/mtcopy/Makefile
>  create mode 100644 drivers/migoffcopy/mtcopy/copy_pages.c
>  create mode 100644 include/linux/migrate_offc.h
>  create mode 100644 mm/migrate_offc.c
>
> -- 
> 2.43.0


Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload
  2025-09-24  3:22 ` Zi Yan
@ 2025-10-02 17:10   ` Garg, Shivank
  0 siblings, 0 replies; 26+ messages in thread
From: Garg, Shivank @ 2025-10-02 17:10 UTC (permalink / raw)
  To: Zi Yan
  Cc: akpm, david, willy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, gourry, ying.huang, apopple, lorenzo.stoakes,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, vkoul,
	lucas.demarchi, rdunlap, jgg, kuba, justonli, ivecera,
	dave.jiang, Jonathan.Cameron, dan.j.williams, rientjes,
	Raghavendra.KodsaraThimmappa, bharata, alirad.malek, yiannis,
	weixugc, linux-kernel, linux-mm



On 9/24/2025 8:52 AM, Zi Yan wrote:
> On 23 Sep 2025, at 13:47, Shivank Garg wrote:
> 
>> This is the third RFC of the patchset to enhance page migration by batching
>> folio-copy operations and enabling acceleration via multi-threaded CPU or
>> DMA offload.
>>
>> Single-threaded, folio-by-folio copying bottlenecks page migration
>> in modern systems with deep memory hierarchies, especially for large
>> folios where copy overhead dominates, leaving significant hardware
>> potential untapped.
>>
>> By batching the copy phase, we create an opportunity for significant
>> hardware acceleration. This series builds a framework for this acceleration
>> and provides two initial offload driver implementations: one using multiple
>> CPU threads (mtcopy) and another leveraging the DMAEngine subsystem (dcbm).
>>
>> This version incorporates significant feedback to improve correctness,
>> robustness, and the efficiency of the DMA offload path.
>>
>> Changelog since V2:
>>
>> 1. DMA Engine Rewrite:
>>    - Switched from per-folio dma_map_page() to batch dma_map_sgtable()
>>    - Single completion interrupt per batch (reduced overhead)
>>    - Order of magnitude improvement in setup time for large batches
>> 2. Code cleanups and refactoring
>> 3. Rebased on latest mainline (6.17-rc6+)
> 
> Thanks for working on this.
> 
> It is better to rebase on top of Andrew’s mm-new tree.
> 
> I have a version at: https://github.com/x-y-z/linux-dev/tree/batched_page_migration_copy_amd_v3-mm-everything-2025-09-23-00-13.
> 
> The difference is that I changed Patch 6 to use padata_do_multithreaded()
> instead of my own implementation, since padata is a nice framework
> for doing multithreaded jobs. The downside is that your patch 9
> no longer applies and you will need to hack kernel/padata.c to
> achieve the same thing.

This looks good. For now, I'll hack padata.c locally.

Currently, with numa_aware=true, padata round-robins work items across
NUMA nodes using queue_work_node().
For an upstream-able solution, I think we need a similar mechanism to
spread work across CCDs.

> I also tried to attribute back page copy kthread time to the initiating
> thread so that page copy time does not disappear when it is parallelized
> using CPU threads. It is currently a hack in the last patch from
> the above repo. With the patch, I can see system time of a page migration
> process with multithreaded page copy looks almost the same as without it,
> while wall clock time is smaller. But I have not found time to ask
> scheduler people about a proper implementation yet.
> 
> 
>>
>> MOTIVATION:
>> -----------
>>
>> Current Migration Flow:
>> [ move_pages(), Compaction, Tiering, etc. ]
>>               |
>>               v
>>      [ migrate_pages() ] // Common entry point
>>               |
>>               v
>>     [ migrate_pages_batch() ] // NR_MAX_BATCHED_MIGRATION (512) folios at a time
>>       |
>>       |--> [ migrate_folio_unmap() ]
>>       |
>>       |--> [ try_to_unmap_flush() ] // Perform a single, batched TLB flush
>>       |
>>       |--> [ migrate_folios_move() ] // Bottleneck: Interleaved copy
>>            - For each folio:
>>              - Metadata prep: Copy flags, mappings, etc.
>>              - folio_copy()  <-- Single-threaded, serial data copy.
>>              - Update PTEs & finalize for that single folio.
>>
>> Understanding overheads in page migration (move_pages() syscall):
>>
>> Total move_pages() overheads = folio_copy() + Other overheads
>> 1. folio_copy() is the core copy operation that interests us.
>> 2. The remaining operations are user/kernel transitions, page table walks,
>> locking, folio unmap, dst folio alloc, TLB flush, copying flags, updating
>> mappings and PTEs etc. that contribute to the remaining overheads.
>>
>> Percentage of folio_copy() overheads in move_pages(N pages) syscall time:
>> Number of pages being migrated and folio size:
>>             4KB     2MB
>> 1 page     <1%     ~66%
>> 512 page   ~35%    ~97%
>>
>> Based on Amdahl's Law, optimizing folio_copy() for large pages offers a
>> substantial performance opportunity.
>>
>> move_pages() syscall speedup = 1 / ((1 - F) + (F / S))
>> Where F is the fraction of time spent in folio_copy() and S is the speedup of
>> folio_copy().
>>
>> For 4KB folios, folio copy overheads are significantly small in single-page
>> migrations to impact overall speedup, even for 512 pages, maximum theoretical
>> speedup is limited to ~1.54x with infinite folio_copy() speedup.
>>
>> For 2MB THPs, folio copy overheads are significant even in single page
>> migrations, with a theoretical speedup of ~3x with infinite folio_copy()
>> speedup and up to ~33x for 512 pages.
>>
>> A realistic value of S (speedup of folio_copy()) is 7.5x for DMA offload
>> based on my measurements for copying 512 2MB pages.
>> This gives move_pages(), a practical speedup of 6.3x for 512 2MB page (also
>> observed in the experiments below).
>>
>> DESIGN: A Pluggable Migrator Framework
>> ---------------------------------------
>>
>> Introduce migrate_folios_batch_move():
>>
>> [ migrate_pages_batch() ]
>>     |
>>     |--> migrate_folio_unmap()
>>     |
>>     |--> try_to_unmap_flush()
>>     |
>>     +--> [ migrate_folios_batch_move() ] // new batched design
>>             |
>>             |--> Metadata migration
>>             |    - Metadata prep: Copy flags, mappings, etc.
>>             |    - Use MIGRATE_NO_COPY to skip the actual data copy.
>>             |
>>             |--> Batch copy folio data
>>             |    - Migrator is configurable at runtime via sysfs.
>>             |
>>             |          static_call(_folios_copy) // Pluggable migrators
>>             |          /          |            \
>>             |         v           v             v
>>             | [ Default ]  [ MT CPU copy ]  [ DMA Offload ]
>>             |
>>             +--> Update PTEs to point to dst folios and complete migration.
>>
>>
>> User Control of Migrator:
>>
>> # echo 1 > /sys/kernel/dcbm/offloading
>>    |
>>    +--> Driver's sysfs handler
>>         |
>>         +--> calls start_offloading(&cpu_migrator)
>>               |
>>               +--> calls offc_update_migrator()
>>                     |
>>                     +--> static_call_update(_folios_copy, mig->migrate_offc)
>>
>> Later, During Migration ...
>> migrate_folios_batch_move()
>>     |
>>     +--> static_call(_folios_copy) // Now dispatches to the selected migrator
>>           |
>>           +-> [ mtcopy | dcbm | kernel_default ]
>>
>>
>> PERFORMANCE RESULTS:
>> --------------------
>>
>> System Info: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT Enabled),
>> 1 NUMA node per socket, Linux Kernel 6.16.0-rc6, DVFS set to Performance,
>> PTDMA hardware.
>>
>> Benchmark: Use move_pages() syscall to move pages between two NUMA nodes.
>>
>> 1. Moving different sized folios (4KB, 16KB,..., 2MB) such that total transfer size is constant
>> (1GB), with different number of parallel threads/channels.
>> Metric: Throughput is reported in GB/s.
>>
>> a. Baseline (Vanilla kernel, single-threaded, folio-by-folio migration):
>>
>> Folio size|4K       | 16K       | 64K        | 128K       | 256K       | 512K       | 1M         | 2M         |
>> ===============================================================================================================
>> Tput(GB/s)|3.73±0.33| 5.53±0.36 | 5.90±0.56  | 6.34±0.08  | 6.50±0.05  | 6.86±0.61  | 6.92±0.71  | 10.67±0.36 |
>>
>> b. Multi-threaded CPU copy offload (mtcopy driver, use N Parallel Threads=1,2,4,8,12,16):
>>
>> Thread | 4K         | 16K       | 64K        | 128K       | 256K       | 512K       | 1M         | 2M         |
>> ===============================================================================================================
>> 1      | 3.84±0.10  | 5.23±0.31 | 6.01±0.55  | 6.34±0.60  | 7.16±1.00  | 7.12±0.78  | 7.10±0.86  | 10.94±0.13 |
>> 2      | 4.04±0.19  | 6.72±0.38 | 7.68±0.12  | 8.15±0.06  | 8.45±0.09  | 9.29±0.17  | 9.87±1.01  | 17.80±0.12 |
>> 4      | 4.72±0.21  | 8.41±0.70 | 10.08±1.67 | 11.44±2.42 | 10.45±0.17 | 12.60±1.97 | 12.38±1.73 | 31.41±1.14 |
>> 8      | 4.91±0.28  | 8.62±0.13 | 10.40±0.20 | 13.94±3.75 | 11.03±0.61 | 14.96±3.29 | 12.84±0.63 | 33.50±3.29 |
>> 12     | 4.84±0.24  | 8.75±0.08 | 10.16±0.26 | 10.92±0.22 | 11.72±0.14 | 14.02±2.51 | 14.09±2.65 | 34.70±2.38 |
>> 16     | 4.77±0.22  | 8.95±0.69 | 10.36±0.26 | 11.03±0.22 | 11.58±0.30 | 13.88±2.71 | 13.00±0.75 | 35.89±2.07 |
>>
>> c. DMA offload (dcbm driver, use N DMA Channels=1,2,4,8,12,16):
>>
>> Chan Cnt| 4K        | 16K       | 64K        | 128K       | 256K       | 512K       | 1M         | 2M         |
>> ===============================================================================================================
>> 1      | 2.75±0.19  | 2.86±0.13 | 3.28±0.20  | 4.57±0.72  | 5.03±0.62  | 4.69±0.25  | 4.78±0.34  | 12.50±0.24 |
>> 2      | 3.35±0.19  | 4.57±0.19 | 5.35±0.55  | 6.71±0.71  | 7.40±1.07  | 7.38±0.61  | 7.21±0.73  | 14.23±0.34 |
>> 4      | 4.01±0.17  | 6.36±0.26 | 7.71±0.89  | 9.40±1.35  | 10.27±1.96 | 10.60±1.42 | 12.35±2.64 | 26.84±0.91 |
>> 8      | 4.46±0.16  | 7.74±0.13 | 9.72±1.29  | 10.88±0.16 | 12.12±2.54 | 15.62±3.96 | 13.29±2.65 | 45.27±2.60 |
>> 12     | 4.60±0.22  | 8.90±0.84 | 11.26±2.19 | 16.00±4.41 | 14.90±4.38 | 14.57±2.84 | 13.79±3.18 | 59.94±4.19 |
>> 16     | 4.61±0.25  | 9.08±0.79 | 11.14±1.75 | 13.95±3.85 | 13.69±3.39 | 15.47±3.44 | 15.44±4.65 | 63.69±5.01 |
>>
>> - Throughput increases with folio size. Larger folios benefit more from DMA.
>> - Scaling shows diminishing returns beyond 8-12 threads/channels.
>> - Multi-threading and DMA offloading both provide significant gains - up to 3.4x and 6x respectively.
>>
>> 2. Varying total move size: (folio count = 1,8,..8192) for a fixed folio size of 2MB
>>    using only single thread/channel
>>
>> folio_cnt | Baseline    | MTCPU      | DMA
>> ====================================================
>> 1         | 7.96±2.22   | 6.43±0.66  | 6.52±0.45   |
>> 8         | 8.20±0.75   | 8.82±1.10  | 8.88±0.54   |
>> 16        | 7.54±0.61   | 9.06±0.95  | 9.03±0.62   |
>> 32        | 8.68±0.77   | 10.11±0.42 | 10.17±0.50  |
>> 64        | 9.08±1.03   | 10.12±0.44 | 11.21±0.24  |
>> 256       | 10.53±0.39  | 10.77±0.28 | 12.43±0.12  |
>> 512       | 10.59±0.29  | 10.81±0.19 | 12.61±0.07  |
>> 2048      | 10.86±0.26  | 11.05±0.05 | 12.75±0.03  |
>> 8192      | 10.84±0.18  | 11.12±0.05 | 12.81±0.02  |
>>
>> - Throughput increases with folios count but plateaus after a threshold.
>>   (The migrate_pages function uses a folio batch size of 512)
>>
>> Performance Analysis (V2 vs V3):
>>
>> The new SG-based DMA driver dramatically reduces software overhead. By
>> switching from per-folio dma_map_page() to batch dma_map_sgtable(), setup
>> time improves by an order of magnitude for large batches.
>> This is most visible with 4KB folios, making DMA viable even for smaller
>> page sizes. For 2MB THP migrations, where hardware transfer time is more
>> dominant, the gains are more modest.
>>
>> OPEN QUESTIONS:
>> ---------------
>>
>> User-Interface:
>>
>> 1. Control Interface Design:
>> The current interface creates separate sysfs files
>> for each driver, which can be confusing. Should we implement a unified interface
>> (/sys/kernel/mm/migration/offload_migrator), which accepts the name of the desired migrator
>> ("kernel", "mtcopy", "dcbm"). This would ensure only one migrator is active at a time.
>> Is this the right approach?
>>
>> 2. Dynamic Migrator Selection:
>> Currently, active migrator is a global state, and only one can be active a time.
>> A more flexible approach might be for the caller of migrate_pages() to specify/hint which
>> offload mechanism to use, if any. This would allow a CXL driver to explicitly request DMA while a GPU driver might prefer
>> multi-threaded CPU copy.
>>
>> 3. Tuning Parameters: Expose parameters like number of threads/channels, batch size,
>> and thresholds for using migrators. Who should own these parameters?
>>
>> 4. Resources Accounting[3]:
>> a. CPU cgroups accounting and fairness
>> b. Migration cost attribution
>>
>> FUTURE WORK:
>> ------------
>>
>> 1. Enhance DMA drivers for bulk copying (e.g., SDXi Engine).
>> 2. Enhance multi-threaded CPU copying for platform-specific scheduling of worker threads to optimize bandwidth utilization. Explore sched-ext for this. [2]
>> 3. Enable kpromoted [4] to use the migration offload infrastructure.
>>
>> EARLIER POSTINGS:
>> -----------------
>>
>> - RFC V2: https://lore.kernel.org/all/20250319192211.10092-1-shivankg@amd.com
>> - RFC V1: https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
>>
>> REFERENCES:
>> -----------
>>
>> [1] RFC from Zi Yan: https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
>> [2] LSFMM: https://lore.kernel.org/all/cf6fc05d-c0b0-4de3-985e-5403977aa3aa@amd.com
>> [3] https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com
>> [4] https://lore.kernel.org/all/20250910144653.212066-1-bharata@amd.com
>>
>> Mike Day (1):
>>   mm: add support for copy offload for folio Migration
>>
>> Shivank Garg (4):
>>   mm: Introduce folios_mc_copy() for batch copying folios
>>   mm/migrate: add migrate_folios_batch_move to  batch the folio move
>>     operations
>>   dcbm: add dma core batch migrator for batch page offloading
>>   mtcopy: spread threads across die for testing
>>
>> Zi Yan (4):
>>   mm/migrate: factor out code in move_to_new_folio() and
>>     migrate_folio_move()
>>   mm/migrate: revive MIGRATE_NO_COPY in migrate_mode
>>   mtcopy: introduce multi-threaded page copy routine
>>   adjust NR_MAX_BATCHED_MIGRATION for testing
>>
>>  drivers/Kconfig                        |   2 +
>>  drivers/Makefile                       |   3 +
>>  drivers/migoffcopy/Kconfig             |  17 +
>>  drivers/migoffcopy/Makefile            |   2 +
>>  drivers/migoffcopy/dcbm/Makefile       |   1 +
>>  drivers/migoffcopy/dcbm/dcbm.c         | 415 +++++++++++++++++++++++++
>>  drivers/migoffcopy/mtcopy/Makefile     |   1 +
>>  drivers/migoffcopy/mtcopy/copy_pages.c | 397 +++++++++++++++++++++++
>>  include/linux/migrate_mode.h           |   2 +
>>  include/linux/migrate_offc.h           |  34 ++
>>  include/linux/mm.h                     |   2 +
>>  mm/Kconfig                             |   8 +
>>  mm/Makefile                            |   1 +
>>  mm/migrate.c                           | 358 ++++++++++++++++++---
>>  mm/migrate_offc.c                      |  58 ++++
>>  mm/util.c                              |  29 ++
>>  16 files changed, 1284 insertions(+), 46 deletions(-)
>>  create mode 100644 drivers/migoffcopy/Kconfig
>>  create mode 100644 drivers/migoffcopy/Makefile
>>  create mode 100644 drivers/migoffcopy/dcbm/Makefile
>>  create mode 100644 drivers/migoffcopy/dcbm/dcbm.c
>>  create mode 100644 drivers/migoffcopy/mtcopy/Makefile
>>  create mode 100644 drivers/migoffcopy/mtcopy/copy_pages.c
>>  create mode 100644 include/linux/migrate_offc.h
>>  create mode 100644 mm/migrate_offc.c
>>
>> -- 
>> 2.43.0
> 
> 
> Best Regards,
> Yan, Zi



^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2025-11-12  2:12 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-23 17:47 [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload Shivank Garg
2025-09-23 17:47 ` [RFC V3 1/9] mm/migrate: factor out code in move_to_new_folio() and migrate_folio_move() Shivank Garg
2025-10-02 10:30   ` Jonathan Cameron
2025-09-23 17:47 ` [RFC V3 2/9] mm/migrate: revive MIGRATE_NO_COPY in migrate_mode Shivank Garg
2025-09-23 17:47 ` [RFC V3 3/9] mm: Introduce folios_mc_copy() for batch copying folios Shivank Garg
2025-09-23 17:47 ` [RFC V3 4/9] mm/migrate: add migrate_folios_batch_move to batch the folio move operations Shivank Garg
2025-10-02 11:03   ` Jonathan Cameron
2025-10-16  9:17     ` Garg, Shivank
2025-09-23 17:47 ` [RFC V3 5/9] mm: add support for copy offload for folio Migration Shivank Garg
2025-10-02 11:10   ` Jonathan Cameron
2025-10-16  9:40     ` Garg, Shivank
2025-09-23 17:47 ` [RFC V3 6/9] mtcopy: introduce multi-threaded page copy routine Shivank Garg
2025-10-02 11:29   ` Jonathan Cameron
2025-10-20  8:28   ` Byungchul Park
2025-11-06  6:27     ` Garg, Shivank
2025-11-12  2:12       ` Byungchul Park
2025-09-23 17:47 ` [RFC V3 7/9] dcbm: add dma core batch migrator for batch page offloading Shivank Garg
2025-10-02 11:38   ` Jonathan Cameron
2025-10-16  9:59     ` Garg, Shivank
2025-09-23 17:47 ` [RFC V3 8/9] adjust NR_MAX_BATCHED_MIGRATION for testing Shivank Garg
2025-09-23 17:47 ` [RFC V3 9/9] mtcopy: spread threads across die " Shivank Garg
2025-09-24  1:49 ` [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload Huang, Ying
2025-09-24  2:03   ` Zi Yan
2025-09-24  3:11     ` Huang, Ying
2025-09-24  3:22 ` Zi Yan
2025-10-02 17:10   ` Garg, Shivank

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox