linux-mm.kvack.org archive mirror
* [RFC PATCH 0/5] Enhancements to Page Migration with Batch Offloading via DMA
@ 2024-06-14 22:15 Shivank Garg
  2024-06-14 22:15 ` [RFC PATCH 1/5] mm: separate move/undo doing on folio list from migrate_pages_batch() Shivank Garg
                   ` (5 more replies)
  0 siblings, 6 replies; 10+ messages in thread
From: Shivank Garg @ 2024-06-14 22:15 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: bharata, raghavendra.kodsarathimmappa, Michael.Day, dmaengine,
	vkoul, shivankg

This series introduces enhancements to the page migration code to optimize
the "folio move" operations by batching them and enabling their offload to
DMA hardware accelerators.

Page migration involves three key steps:
1. Unmap: Allocate dst folios and replace the src folio PTEs with
migration PTEs.
2. TLB Flush: Flush the TLB for all unmapped folios.
3. Move: Copy the page mappings, flags and contents from src to dst,
update metadata, lists and refcounts, and restore working PTEs.

While the first two steps (setting the TLB-flush-pending state for unmapped
folios and the batched TLB flush) have already been optimized with batching,
this series focuses on optimizing the folio move step.

In the current design, the folio move operation is performed sequentially
for each folio:
for_each_folio() {
        Copy folio metadata like flags and mappings
        Copy the folio content from src to dst
        Update PTEs with new mappings
}

In the proposed design, we batch the folio copy operations to leverage DMA
offloading. The updated design is as follows:
for_each_folio() {
        Copy folio metadata like flags and mappings
}
Batch copy the page content from src to dst by offloading to the DMA engine
for_each_folio() {
        Update PTEs with new mappings
}
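
For illustration, below is a minimal sketch of how such a batch could be
issued through the kernel's DMAengine memcpy API and completed with a
single wait. This is not the code from this series (the actual
implementation is in patches 4 and 5); folios_copy_dma_sketch() is a
made-up name, and DMA unmapping and most error handling are omitted:

/*
 * Illustrative sketch only: issue one DMAengine memcpy per folio and
 * wait once for the whole batch. Needs <linux/dmaengine.h>,
 * <linux/dma-mapping.h> and <linux/migrate.h>.
 */
static int folios_copy_dma_sketch(struct dma_chan *chan,
                                  struct list_head *dst_list,
                                  struct list_head *src_list)
{
        struct device *dev = dmaengine_get_dma_device(chan);
        struct dma_async_tx_descriptor *tx;
        struct folio *src, *dst;
        dma_cookie_t cookie = 0;

        dst = list_first_entry(dst_list, struct folio, lru);
        list_for_each_entry(src, src_list, lru) {
                dma_addr_t s = dma_map_page(dev, folio_page(src, 0), 0,
                                            folio_size(src), DMA_TO_DEVICE);
                dma_addr_t d = dma_map_page(dev, folio_page(dst, 0), 0,
                                            folio_size(dst), DMA_FROM_DEVICE);

                tx = dmaengine_prep_dma_memcpy(chan, d, s, folio_size(src), 0);
                if (!tx)
                        return -EIO;    /* caller falls back to CPU copy */
                cookie = dmaengine_submit(tx);
                dst = list_next_entry(dst, lru);
        }
        dma_async_issue_pending(chan);
        /* memcpy channels complete in issue order; wait on the last cookie */
        return dma_sync_wait(chan, cookie) == DMA_COMPLETE ? 0 : -EIO;
}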

Motivation:
Data copying across NUMA nodes during page migration incurs significant
overhead. For instance, folio copy can take up to 26.6% of the total
migration cost for migrating 256MB of data.
Modern systems are equipped with powerful DMA engines for bulk data
copying. Utilizing these hardware accelerators will become essential for
large-scale tiered-memory systems with CXL nodes, where frequent page
promotion and demotion occur.
Following the trend of batching operations in the memory migration core
path (like batch migration and batch TLB flush), batch copying folio data
is a logical progression in this direction.

We conducted experiments to measure folio copy overheads for page
migration from a remote node to a local NUMA node, modeling page
promotions for different workload sizes (4KB, 2MB, 256MB and 1GB).

Setup Information: AMD Zen 3 EPYC server (2 sockets, 32 cores, SMT
enabled), 1 NUMA node connected to each socket.
Linux kernel 6.8.0, DVFS set to Performance, cpuinfo_cur_freq: 2 GHz.
THP, compaction and numa_balancing are disabled to reduce interference.

migrate_pages() { <- t1
	..
	<- t2
	folio_copy()
	<- t3 
	..
} <- t4

Overhead fraction: F = (t3-t2)/(t4-t1)
Measurement: mean ± SD, in CPU cycles/page
Results with generic (unpatched) kernel:
4KB::   migrate_pages:17799.00±4278.25  folio_copy:794±232.87  F:0.0478±0.0199
2MB::   migrate_pages:3478.42±94.93  folio_copy:493.84±28.21  F:0.1418±0.0050
256MB:: migrate_pages:3668.56±158.47  folio_copy:815.40±171.76  F:0.2206±0.0371
1GB::   migrate_pages:3769.98±55.79  folio_copy:804.68±60.07  F:0.2132±0.0134

Results with patched kernel:
1. Offload disabled - folios batch-move using CPU
4KB::   migrate_pages:14941.60±2556.53  folio_copy:799.60±211.66  F:0.0554±0.0190
2MB::   migrate_pages:3448.44±83.74  folio_copy:533.34±37.81  F:0.1545±0.0085
256MB:: migrate_pages:3723.56±132.93  folio_copy:907.64±132.63  F:0.2427±0.0270
1GB::   migrate_pages:3788.20±46.65  folio_copy:888.46±49.50  F:0.2344±0.0107

2. Offload enabled - folios batch-move using DMAengine
4KB::   migrate_pages:46739.80±4827.15  folio_copy:32222.40±3543.42  F:0.6904±0.0423
2MB::   migrate_pages:13798.10±205.33  folio_copy:10971.60±202.50  F:0.7951±0.0033
256MB:: migrate_pages:13217.20±163.99  folio_copy:10431.20±167.25  F:0.7891±0.0029
1GB::   migrate_pages:13309.70±113.93  folio_copy:10410.00±117.77  F:0.7821±0.0023

Discussion:
The DMAengine achieved a net throughput of 768 MB/s. Additional
optimizations are needed to make DMA offloading beneficial compared to
CPU-based migration; these could include parallelism, specialized DMA
hardware, and asynchronous or speculative data migration.

Status:
The current patchset is functional, except for non-LRU folios.

Dependencies:
1. This series is based on Linux v6.8.
2. Patches 1-3 contain preparatory work and the implementation of the
batched folio move. Patch 4 adds support for DMA offload.
3. DMA hardware and driver support are required to enable DMA offload.
Without suitable support, the CPU is used for the batch migration. The
requirements are described in Patch 4.
4. Patch 5 adds a DMA driver using the DMAengine APIs for end-to-end
testing and validation.

Testing:
The patch series has been tested with migrate_pages(2) and move_pages(2)
using anonymous memory and memory-mapped files.
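
For reference, a minimal userspace test along those lines could look like
the sketch below (hypothetical, not part of the series). It faults
anonymous memory in on node 0, then asks the kernel to migrate it to
node 1 via move_pages(2); it assumes a machine with at least two NUMA
nodes and links against libnuma:

/* gcc -o movetest movetest.c -lnuma */
#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        long psz = sysconf(_SC_PAGESIZE);
        int npages = 512;
        char *buf = numa_alloc_onnode(npages * psz, 0);
        void **pages = malloc(npages * sizeof(*pages));
        int *nodes = malloc(npages * sizeof(*nodes));
        int *status = malloc(npages * sizeof(*status));
        long rc;

        memset(buf, 1, npages * psz);           /* fault pages in on node 0 */
        for (int i = 0; i < npages; i++) {
                pages[i] = buf + i * psz;
                nodes[i] = 1;                   /* migration target node */
        }
        /* pid 0 == calling process; MPOL_MF_MOVE moves this process's pages */
        rc = move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE);
        if (rc < 0)
                perror("move_pages");
        else
                printf("first page now on node %d\n", status[0]);
        return 0;
}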

Byungchul Park (1):
  mm: separate move/undo doing on folio list from migrate_pages_batch()

Mike Day (1):
  mm: add support for DMA folio Migration

Shivank Garg (3):
  mm: add folios_copy() for copying pages in batch during migration
  mm: add migrate_folios_batch_move to batch the folio move operations
  dcbm: add dma core batch migrator for batch page offloading

 drivers/dma/Kconfig         |   2 +
 drivers/dma/Makefile        |   1 +
 drivers/dma/dcbm/Kconfig    |   7 +
 drivers/dma/dcbm/Makefile   |   1 +
 drivers/dma/dcbm/dcbm.c     | 229 +++++++++++++++++++++
 include/linux/migrate_dma.h |  36 ++++
 include/linux/mm.h          |   1 +
 mm/Kconfig                  |   8 +
 mm/Makefile                 |   1 +
 mm/migrate.c                | 385 +++++++++++++++++++++++++++++++-----
 mm/migrate_dma.c            |  51 +++++
 mm/util.c                   |  22 +++
 12 files changed, 692 insertions(+), 52 deletions(-)
 create mode 100644 drivers/dma/dcbm/Kconfig
 create mode 100644 drivers/dma/dcbm/Makefile
 create mode 100644 drivers/dma/dcbm/dcbm.c
 create mode 100644 include/linux/migrate_dma.h
 create mode 100644 mm/migrate_dma.c

-- 
2.34.1



* [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
@ 2025-01-03 17:24 Zi Yan
  2025-01-03 17:24 ` [RFC PATCH 1/5] mm: separate move/undo doing on folio list from migrate_pages_batch() Zi Yan
  0 siblings, 1 reply; 10+ messages in thread
From: Zi Yan @ 2025-01-03 17:24 UTC (permalink / raw)
  To: linux-mm
  Cc: David Rientjes, Shivank Garg, Aneesh Kumar, David Hildenbrand,
	John Hubbard, Kirill Shutemov, Matthew Wilcox, Mel Gorman, Rao,
	Bharata Bhasker, Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
	Lei Chen, Shukla, Santosh, Grimm, Jon, sj, shy828301,
	Liam Howlett, Gregory Price, Huang, Ying, Zi Yan

Hi all,

This patchset accelerates page migration by batching folio copy operations
and using multiple CPU threads. It is based on Shivank's "Enhancements to
Page Migration with Batch Offloading via DMA" patchset[1] and my original
accelerate-page-migration patchset[2], and applies on top of
mm-everything-2025-01-03-05-59. The last patch is for testing purposes and
should not be considered for merging.

The motivations are:

1. Batching folio copies increases copy throughput, especially for base
page migrations, where copy throughput is low because kernel activities
like moving folio metadata and updating page table entries sit between
consecutive folio copies, and base page sizes are relatively small (4KB
on x86_64 and ARM64, or 64KB on ARM64).

2. A single CPU thread has limited copy throughput. Using multiple
threads is a natural extension to speed up folio copy when a DMA engine
is not available in the system.


Design
===

This patchset is based on Shivank's patchset and revises
MIGRATE_SYNC_NO_COPY (renamed to MIGRATE_NO_COPY) to skip the folio copy
operation inside migrate_folio_move() and perform the copies in one shot
afterwards. A copy_page_lists_mt() function is added to use multiple
threads to copy folios from the src list to the dst list.
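
As a rough illustration (the actual copy_page_lists_mt() may differ), one
way to structure the multi-threaded copy is to split the batch into
per-thread chunks and run each chunk as a work item on an unbound
workqueue; copy_folios_mt_sketch() below is a made-up name:

/* Illustrative sketch only: fan a batch of folio copies out to workers. */
struct copy_work {
        struct work_struct work;
        struct folio **src, **dst;      /* this worker's chunk */
        int nr;
};

static void copy_chunk_fn(struct work_struct *w)
{
        struct copy_work *cw = container_of(w, struct copy_work, work);

        for (int i = 0; i < cw->nr; i++)
                folio_copy(cw->dst[i], cw->src[i]);     /* CPU page copy */
}

static void copy_folios_mt_sketch(struct folio **dst, struct folio **src,
                                  int nr, int nr_threads)
{
        int chunk = DIV_ROUND_UP(nr, nr_threads);
        struct copy_work *works;
        int queued = 0;

        /* note: allocating here is what TODO 2 below wants to avoid */
        works = kcalloc(nr_threads, sizeof(*works), GFP_KERNEL);
        for (int t = 0; t < nr_threads && t * chunk < nr; t++, queued++) {
                works[t].src = src + t * chunk;
                works[t].dst = dst + t * chunk;
                works[t].nr = min(chunk, nr - t * chunk);
                INIT_WORK(&works[t].work, copy_chunk_fn);
                queue_work(system_unbound_wq, &works[t].work);
        }
        while (queued--)
                flush_work(&works[queued].work);        /* wait for all */
        kfree(works);
}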

Changes compared to Shivank's patchset (the batched folio copy code is
mainly rewritten)
===

1. mig_info is removed, so no memory allocation is needed during the
batched folio copies. src->private is used to store the old page state
and the anon_vma after folio metadata is copied from src to dst.

2. move_to_new_folio() and migrate_folio_move() are refactored to remove
redundant code in migrate_folios_batch_move().

3. folio_mc_copy() is used for the single threaded copy code to keep the
original kernel behavior.


Performance
===

I benchmarked move_pages() throughput on a two-socket NUMA system with
two NVIDIA Grace CPUs. The base page size is 64KB. Both 64KB page
migration and 2MB mTHP page migration are measured.

The tables below show move_pages() throughput for different
configurations and different numbers of copied pages. The columns are the
configurations, from the vanilla Linux kernel to 1, 2, 4, 8, 16 and 32
threads with this patchset applied; the rows are the numbers of copied
pages; the unit is GB/s.

The 32-thread copy throughput can be up to 10x that of single-threaded
serial folio copy. Batching folio copies benefits not only huge pages but
also base pages.

64KB (GB/s):

        vanilla   mt_1    mt_2    mt_4    mt_8    mt_16   mt_32
32      5.43      4.90    5.65    7.31    7.60    8.61    6.43
256     6.95      6.89    9.28    14.67   22.41   23.39   23.93
512     7.88      7.26    10.15   17.53   27.82   27.88   33.93
768     7.65      7.42    10.46   18.59   28.65   29.67   30.76
1024    7.46      8.01    10.90   17.77   27.04   32.18   38.80

2MB mTHP (GB/s):

        vanilla   mt_1    mt_2    mt_4    mt_8    mt_16   mt_32
1       5.94      2.90    6.90    8.56    11.16   8.76    6.41
2       7.67      5.57    7.11    12.48   17.37   15.68   14.10
4       8.01      6.04    10.25   20.14   22.52   27.79   25.28
8       8.42      7.00    11.41   24.73   33.96   32.62   39.55
16      9.41      6.91    12.23   27.51   43.95   49.15   51.38
32      10.23     7.15    13.03   29.52   49.49   69.98   71.51
64      9.40      7.37    13.88   30.38   52.00   76.89   79.41
128     8.59      7.23    14.20   28.39   49.98   78.27   90.18
256     8.43      7.16    14.59   28.14   48.78   76.88   92.28
512     8.31      7.78    14.40   26.20   43.31   63.91   75.21
768     8.30      7.86    14.83   27.41   46.25   69.85   81.31
1024    8.31      7.90    14.96   27.62   46.75   71.76   83.84


TODOs
===
1. The multi-threaded folio copy routine needs to consult the CPU
scheduler and only use idle CPUs to avoid interfering with userspace
workloads. More sophisticated policies could be based on the priority of
the thread issuing the migration.

2. Eliminate memory allocation during the multi-threaded folio copy
routine if possible.

3. A runtime check to decide when to use multi-threaded folio copy, e.g.
based on the cache hotness issue mentioned by Matthew[3].

4. Use non-temporal CPU instructions to avoid cache pollution issues (see
the sketch after this list).

5. Explicitly make multi-threaded folio copy available only on !HIGHMEM
configurations, since kmap_local_page() would be needed in each folio
copy worker thread and is expensive.

6. A better interface than copy_page_lists_mt() to allow DMA data copy
to be used as well.
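
On TODO 4, the idea is to use streaming (write-combining) stores so a
large copy does not evict the workload's hot cache lines. A minimal
userspace illustration for x86-64 with SSE2 is below; the kernel would
use its own primitives (e.g. the nocache copy variants on x86), so this
only shows the concept:

#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_stream_si128 */
#include <stddef.h>

/* Copy one 4KB page with non-temporal stores, bypassing the CPU caches.
 * Assumes 16-byte-aligned buffers (page-aligned buffers qualify). */
static void copy_page_nocache(void *dst, const void *src)
{
        const __m128i *s = src;
        __m128i *d = dst;

        for (size_t i = 0; i < 4096 / sizeof(__m128i); i++)
                _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
        _mm_sfence();   /* make the streaming stores globally visible */
}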

Let me know your thoughts. Thanks.


[1] https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@amd.com/
[2] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
[3] https://lore.kernel.org/linux-mm/Zm0SWZKcRrngCUUW@casper.infradead.org/

Byungchul Park (1):
  mm: separate move/undo doing on folio list from migrate_pages_batch()

Zi Yan (4):
  mm/migrate: factor out code in move_to_new_folio() and
    migrate_folio_move()
  mm/migrate: add migrate_folios_batch_move to batch the folio move
    operations
  mm/migrate: introduce multi-threaded page copy routine
  test: add sysctl for folio copy tests and adjust
    NR_MAX_BATCHED_MIGRATION

 include/linux/migrate.h      |   3 +
 include/linux/migrate_mode.h |   2 +
 include/linux/mm.h           |   4 +
 include/linux/sysctl.h       |   1 +
 kernel/sysctl.c              |  29 ++-
 mm/Makefile                  |   2 +-
 mm/copy_pages.c              | 190 +++++++++++++++
 mm/migrate.c                 | 443 +++++++++++++++++++++++++++--------
 8 files changed, 577 insertions(+), 97 deletions(-)
 create mode 100644 mm/copy_pages.c

-- 
2.45.2




Thread overview: 10+ messages
-- links below jump to the message on this page --
2024-06-14 22:15 [RFC PATCH 0/5] Enhancements to Page Migration with Batch Offloading via DMA Shivank Garg
2024-06-14 22:15 ` [RFC PATCH 1/5] mm: separate move/undo doing on folio list from migrate_pages_batch() Shivank Garg
2024-06-14 22:15 ` [RFC PATCH 2/5] mm: add folios_copy() for copying pages in batch during migration Shivank Garg
2024-06-14 22:15 ` [RFC PATCH 3/5] mm: add migrate_folios_batch_move to batch the folio move operations Shivank Garg
2024-06-14 22:15 ` [RFC PATCH 4/5] mm: add support for DMA folio Migration Shivank Garg
2024-06-14 22:15 ` [RFC PATCH 5/5] dcbm: add dma core batch migrator for batch page offloading Shivank Garg
2024-06-15  4:02 ` [RFC PATCH 0/5] Enhancements to Page Migration with Batch Offloading via DMA Matthew Wilcox
2024-06-17 11:40   ` Garg, Shivank
2024-06-25  8:57     ` Garg, Shivank
2025-01-03 17:24 [RFC PATCH 0/5] Accelerate page migration with batching and multi threads Zi Yan
2025-01-03 17:24 ` [RFC PATCH 1/5] mm: separate move/undo doing on folio list from migrate_pages_batch() Zi Yan
