From: "Garg, Shivank" <shivankg@amd.com>
To: Zi Yan <ziy@nvidia.com>
Cc: akpm@linux-foundation.org, david@redhat.com, willy@infradead.org,
matthew.brost@intel.com, joshua.hahnjy@gmail.com,
rakie.kim@sk.com, byungchul@sk.com, gourry@gourry.net,
ying.huang@linux.alibaba.com, apopple@nvidia.com,
lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
mhocko@suse.com, vkoul@kernel.org, lucas.demarchi@intel.com,
rdunlap@infradead.org, jgg@ziepe.ca, kuba@kernel.org,
justonli@chromium.org, ivecera@redhat.com, dave.jiang@intel.com,
Jonathan.Cameron@huawei.com, dan.j.williams@intel.com,
rientjes@google.com, Raghavendra.KodsaraThimmappa@amd.com,
bharata@amd.com, alirad.malek@zptcorp.com, yiannis@zptcorp.com,
weixugc@google.com, linux-kernel@vger.kernel.org,
linux-mm@kvack.org
Subject: Re: [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload
Date: Thu, 2 Oct 2025 22:40:48 +0530
Message-ID: <1c8c91a7-9f6d-442f-8e20-736fd5d41ef3@amd.com>
In-Reply-To: <633F4EFC-13A9-40DF-A27D-DBBDD0AF44F3@nvidia.com>

On 9/24/2025 8:52 AM, Zi Yan wrote:
> On 23 Sep 2025, at 13:47, Shivank Garg wrote:
>
>> This is the third RFC of the patchset to enhance page migration by batching
>> folio-copy operations and enabling acceleration via multi-threaded CPU or
>> DMA offload.
>>
>> Single-threaded, folio-by-folio copying bottlenecks page migration
>> in modern systems with deep memory hierarchies, especially for large
>> folios where copy overhead dominates, leaving significant hardware
>> potential untapped.
>>
>> By batching the copy phase, we create an opportunity for significant
>> hardware acceleration. This series builds a framework for this acceleration
>> and provides two initial offload driver implementations: one using multiple
>> CPU threads (mtcopy) and another leveraging the DMAEngine subsystem (dcbm).
>>
>> This version incorporates significant feedback to improve correctness,
>> robustness, and the efficiency of the DMA offload path.
>>
>> Changelog since V2:
>>
>> 1. DMA Engine Rewrite:
>> - Switched from per-folio dma_map_page() to batch dma_map_sgtable()
>> - Single completion interrupt per batch (reduced overhead)
>> - Order of magnitude improvement in setup time for large batches
>> 2. Code cleanups and refactoring
>> 3. Rebased on latest mainline (6.17-rc6+)
>
> Thanks for working on this.
>
> It is better to rebase on top of Andrew’s mm-new tree.
>
> I have a version at: https://github.com/x-y-z/linux-dev/tree/batched_page_migration_copy_amd_v3-mm-everything-2025-09-23-00-13.
>
> The difference is that I changed Patch 6 to use padata_do_multithreaded()
> instead of my own implementation, since padata is a nice framework
> for doing multithreaded jobs. The downside is that your patch 9
> no longer applies and you will need to hack kernel/padata.c to
> achieve the same thing.

This looks good. For now, I'll hack padata.c locally.

Currently, with numa_aware=true, padata round-robins work items across
NUMA nodes using queue_work_node(). For an upstreamable solution, I think
we need a similar mechanism to spread work across CCDs.
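
As a rough sketch of the direction I mean (the struct and function names
below are made up, and padata_do_multithreaded() is __init in mainline as
far as I can see, so calling it at migration time already needs changes in
kernel/padata.c):

  #include <linux/mm.h>
  #include <linux/padata.h>

  /* Hypothetical per-batch context; not from the actual patches. */
  struct folio_copy_job {
          struct folio **dst;
          struct folio **src;
  };

  static void folio_copy_thread_fn(unsigned long start, unsigned long end,
                                   void *arg)
  {
          struct folio_copy_job *job = arg;
          unsigned long i;

          for (i = start; i < end; i++)
                  folio_copy(job->dst[i], job->src[i]);
  }

  static void folios_copy_padata(struct folio_copy_job *copy_job,
                                 unsigned long nr_folios)
  {
          struct padata_mt_job job = {
                  .thread_fn   = folio_copy_thread_fn,
                  .fn_arg      = copy_job,
                  .start       = 0,
                  .size        = nr_folios,
                  .align       = 1,
                  .min_chunk   = 1,
                  .max_threads = 16,
                  /* round-robins chunks across nodes via queue_work_node() */
                  .numa_aware  = true,
          };

          padata_do_multithreaded(&job);
  }

A CCD-aware variant would need padata (or the caller) to pick the target
CPU/die per chunk rather than just the node, which is the part that does
not exist upstream today.
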
> I also tried to attribute back page copy kthread time to the initiating
> thread so that page copy time does not disappear when it is parallelized
> using CPU threads. It is currently a hack in the last patch from
> the above repo. With the patch, I can see system time of a page migration
> process with multithreaded page copy looks almost the same as without it,
> while wall clock time is smaller. But I have not found time to ask
> scheduler people about a proper implementation yet.
>
>
>>
>> MOTIVATION:
>> -----------
>>
>> Current Migration Flow:
>> [ move_pages(), Compaction, Tiering, etc. ]
>> |
>> v
>> [ migrate_pages() ] // Common entry point
>> |
>> v
>> [ migrate_pages_batch() ] // NR_MAX_BATCHED_MIGRATION (512) folios at a time
>> |
>> |--> [ migrate_folio_unmap() ]
>> |
>> |--> [ try_to_unmap_flush() ] // Perform a single, batched TLB flush
>> |
>> |--> [ migrate_folios_move() ] // Bottleneck: Interleaved copy
>> - For each folio:
>> - Metadata prep: Copy flags, mappings, etc.
>> - folio_copy() <-- Single-threaded, serial data copy.
>> - Update PTEs & finalize for that single folio.
>>
>> Understanding overheads in page migration (move_pages() syscall):
>>
>> Total move_pages() overheads = folio_copy() + Other overheads
>> 1. folio_copy() is the core copy operation that interests us.
>> 2. The remaining overhead comes from user/kernel transitions, page table
>>    walks, locking, folio unmap, dst folio allocation, TLB flushes, copying
>>    flags, and updating mappings and PTEs.
>>
>> Percentage of move_pages(N pages) syscall time spent in folio_copy(),
>> by number of pages being migrated and folio size:
>>
>>                     4KB       2MB
>>   1 page            <1%      ~66%
>>   512 pages        ~35%      ~97%
>>
>> Based on Amdahl's Law, optimizing folio_copy() for large pages offers a
>> substantial performance opportunity.
>>
>> move_pages() syscall speedup = 1 / ((1 - F) + (F / S))
>> Where F is the fraction of time spent in folio_copy() and S is the speedup of
>> folio_copy().
>>
>> For 4KB folios, the folio copy overhead is too small in single-page
>> migrations to impact overall speedup; even for 512 pages, the maximum
>> theoretical speedup is limited to ~1.54x with infinite folio_copy() speedup.
>>
>> For 2MB THPs, the folio copy overhead is significant even in single-page
>> migrations, giving a theoretical speedup of ~3x with infinite folio_copy()
>> speedup and up to ~33x for 512 pages.
>>
>> A realistic value of S (the folio_copy() speedup) is 7.5x for DMA offload,
>> based on my measurements for copying 512 2MB pages. This gives move_pages()
>> a practical speedup of 6.3x for 512 2MB pages (also observed in the
>> experiments below).
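>>
>> Plugging the numbers above into the formula gives a quick check of these
>> figures:
>>
>>   512 x 4KB:  F ~= 0.35, S -> inf:  1 / (1 - 0.35)            ~= 1.54x
>>   1 x 2MB:    F ~= 0.66, S -> inf:  1 / (1 - 0.66)            ~= 3x
>>   512 x 2MB:  F ~= 0.97, S -> inf:  1 / (1 - 0.97)            ~= 33x
>>   512 x 2MB:  F ~= 0.97, S = 7.5:   1 / (0.03 + 0.97 / 7.5)   ~= 6.3x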
>>
>> DESIGN: A Pluggable Migrator Framework
>> ---------------------------------------
>>
>> Introduce migrate_folios_batch_move():
>>
>> [ migrate_pages_batch() ]
>> |
>> |--> migrate_folio_unmap()
>> |
>> |--> try_to_unmap_flush()
>> |
>> +--> [ migrate_folios_batch_move() ] // new batched design
>> |
>> |--> Metadata migration
>> | - Metadata prep: Copy flags, mappings, etc.
>> | - Use MIGRATE_NO_COPY to skip the actual data copy.
>> |
>> |--> Batch copy folio data
>> | - Migrator is configurable at runtime via sysfs.
>> |
>> | static_call(_folios_copy) // Pluggable migrators
>> | / | \
>> | v v v
>> | [ Default ] [ MT CPU copy ] [ DMA Offload ]
>> |
>> +--> Update PTEs to point to dst folios and complete migration.
>>
>>
>> User Control of Migrator:
>>
>> # echo 1 > /sys/kernel/dcbm/offloading
>> |
>> +--> Driver's sysfs handler
>> |
>> +--> calls start_offloading(&cpu_migrator)
>> |
>> +--> calls offc_update_migrator()
>> |
>> +--> static_call_update(_folios_copy, mig->migrate_offc)
>>
>> Later, During Migration ...
>> migrate_folios_batch_move()
>> |
>> +--> static_call(_folios_copy) // Now dispatches to the selected migrator
>> |
>> +-> [ mtcopy | dcbm | kernel_default ]
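>>
>> As a minimal sketch of the static_call plumbing implied above (the
>> _folios_copy signature and struct layout here are illustrative, not the
>> actual patch code):
>>
>>   #include <linux/list.h>
>>   #include <linux/static_call.h>
>>
>>   struct migrator {
>>           int (*migrate_offc)(struct list_head *dst_folios,
>>                               struct list_head *src_folios,
>>                               unsigned int nr_folios);
>>   };
>>
>>   /* Default kernel copy path: folio-by-folio copy in the real code. */
>>   static int kernel_folios_copy(struct list_head *dst_folios,
>>                                 struct list_head *src_folios,
>>                                 unsigned int nr_folios)
>>   {
>>           return 0;
>>   }
>>
>>   DEFINE_STATIC_CALL(_folios_copy, kernel_folios_copy);
>>
>>   /* Called when a driver starts or stops offloading via its sysfs file. */
>>   void offc_update_migrator(struct migrator *mig)
>>   {
>>           static_call_update(_folios_copy,
>>                              mig ? mig->migrate_offc : kernel_folios_copy);
>>   }
>>
>>   /* Copy phase of migrate_folios_batch_move(): the call site is patched
>>    * in place, so there is no indirect branch once a migrator is set. */
>>   static int folios_copy_dispatch(struct list_head *dst_folios,
>>                                   struct list_head *src_folios,
>>                                   unsigned int nr_folios)
>>   {
>>           return static_call(_folios_copy)(dst_folios, src_folios,
>>                                            nr_folios);
>>   }
>>
>> In this sketch, a driver passes its struct migrator to
>> offc_update_migrator() from its sysfs store handler, and NULL on teardown
>> to restore the kernel default.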
>>
>>
>> PERFORMANCE RESULTS:
>> --------------------
>>
>> System Info: AMD Zen 3 EPYC server (2 sockets, 32 cores, SMT enabled),
>> 1 NUMA node per socket, Linux Kernel 6.16.0-rc6, DVFS set to Performance,
>> PTDMA hardware.
>>
>> Benchmark: Use move_pages() syscall to move pages between two NUMA nodes.
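>>
>> For reference, a minimal user-space skeleton of this kind of benchmark
>> (the actual harness, THP-backed buffer setup and timing are omitted;
>> link with -lnuma):
>>
>>   #include <numaif.h>   /* move_pages(), MPOL_MF_MOVE */
>>   #include <stdlib.h>
>>
>>   /* Ask the kernel to move 'count' base pages starting at 'buf' to
>>    * 'dst_node' and report per-page status. */
>>   static long migrate_buf(void *buf, unsigned long count, int dst_node,
>>                           long page_sz)
>>   {
>>           void **pages = malloc(count * sizeof(*pages));
>>           int *nodes   = malloc(count * sizeof(*nodes));
>>           int *status  = malloc(count * sizeof(*status));
>>           long ret;
>>
>>           for (unsigned long i = 0; i < count; i++) {
>>                   pages[i] = (char *)buf + i * page_sz;
>>                   nodes[i] = dst_node;
>>           }
>>           ret = move_pages(0 /* current process */, count, pages, nodes,
>>                            status, MPOL_MF_MOVE);
>>           free(pages);
>>           free(nodes);
>>           free(status);
>>           return ret;
>>   }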
>>
>> 1. Moving different-sized folios (4KB, 16KB, ..., 2MB) such that the total
>>    transfer size is constant (1GB), with different numbers of parallel
>>    threads/channels.
>> Metric: Throughput is reported in GB/s.
>>
>> a. Baseline (Vanilla kernel, single-threaded, folio-by-folio migration):
>>
>> Folio size|4K | 16K | 64K | 128K | 256K | 512K | 1M | 2M |
>> ===============================================================================================================
>> Tput(GB/s)|3.73±0.33| 5.53±0.36 | 5.90±0.56 | 6.34±0.08 | 6.50±0.05 | 6.86±0.61 | 6.92±0.71 | 10.67±0.36 |
>>
>> b. Multi-threaded CPU copy offload (mtcopy driver, use N Parallel Threads=1,2,4,8,12,16):
>>
>> Thread | 4K | 16K | 64K | 128K | 256K | 512K | 1M | 2M |
>> ===============================================================================================================
>> 1 | 3.84±0.10 | 5.23±0.31 | 6.01±0.55 | 6.34±0.60 | 7.16±1.00 | 7.12±0.78 | 7.10±0.86 | 10.94±0.13 |
>> 2 | 4.04±0.19 | 6.72±0.38 | 7.68±0.12 | 8.15±0.06 | 8.45±0.09 | 9.29±0.17 | 9.87±1.01 | 17.80±0.12 |
>> 4 | 4.72±0.21 | 8.41±0.70 | 10.08±1.67 | 11.44±2.42 | 10.45±0.17 | 12.60±1.97 | 12.38±1.73 | 31.41±1.14 |
>> 8 | 4.91±0.28 | 8.62±0.13 | 10.40±0.20 | 13.94±3.75 | 11.03±0.61 | 14.96±3.29 | 12.84±0.63 | 33.50±3.29 |
>> 12 | 4.84±0.24 | 8.75±0.08 | 10.16±0.26 | 10.92±0.22 | 11.72±0.14 | 14.02±2.51 | 14.09±2.65 | 34.70±2.38 |
>> 16 | 4.77±0.22 | 8.95±0.69 | 10.36±0.26 | 11.03±0.22 | 11.58±0.30 | 13.88±2.71 | 13.00±0.75 | 35.89±2.07 |
>>
>> c. DMA offload (dcbm driver, use N DMA Channels=1,2,4,8,12,16):
>>
>> Chan Cnt| 4K | 16K | 64K | 128K | 256K | 512K | 1M | 2M |
>> ===============================================================================================================
>> 1 | 2.75±0.19 | 2.86±0.13 | 3.28±0.20 | 4.57±0.72 | 5.03±0.62 | 4.69±0.25 | 4.78±0.34 | 12.50±0.24 |
>> 2 | 3.35±0.19 | 4.57±0.19 | 5.35±0.55 | 6.71±0.71 | 7.40±1.07 | 7.38±0.61 | 7.21±0.73 | 14.23±0.34 |
>> 4 | 4.01±0.17 | 6.36±0.26 | 7.71±0.89 | 9.40±1.35 | 10.27±1.96 | 10.60±1.42 | 12.35±2.64 | 26.84±0.91 |
>> 8 | 4.46±0.16 | 7.74±0.13 | 9.72±1.29 | 10.88±0.16 | 12.12±2.54 | 15.62±3.96 | 13.29±2.65 | 45.27±2.60 |
>> 12 | 4.60±0.22 | 8.90±0.84 | 11.26±2.19 | 16.00±4.41 | 14.90±4.38 | 14.57±2.84 | 13.79±3.18 | 59.94±4.19 |
>> 16 | 4.61±0.25 | 9.08±0.79 | 11.14±1.75 | 13.95±3.85 | 13.69±3.39 | 15.47±3.44 | 15.44±4.65 | 63.69±5.01 |
>>
>> - Throughput increases with folio size. Larger folios benefit more from DMA.
>> - Scaling shows diminishing returns beyond 8-12 threads/channels.
>> - Multi-threading and DMA offloading both provide significant gains - up to 3.4x and 6x respectively.
>>
>> 2. Varying the total move size (folio count = 1, 8, ..., 8192) for a fixed
>>    folio size of 2MB, using only a single thread/channel.
>>
>> folio_cnt | Baseline | MTCPU | DMA
>> ====================================================
>> 1 | 7.96±2.22 | 6.43±0.66 | 6.52±0.45 |
>> 8 | 8.20±0.75 | 8.82±1.10 | 8.88±0.54 |
>> 16 | 7.54±0.61 | 9.06±0.95 | 9.03±0.62 |
>> 32 | 8.68±0.77 | 10.11±0.42 | 10.17±0.50 |
>> 64 | 9.08±1.03 | 10.12±0.44 | 11.21±0.24 |
>> 256 | 10.53±0.39 | 10.77±0.28 | 12.43±0.12 |
>> 512 | 10.59±0.29 | 10.81±0.19 | 12.61±0.07 |
>> 2048 | 10.86±0.26 | 11.05±0.05 | 12.75±0.03 |
>> 8192 | 10.84±0.18 | 11.12±0.05 | 12.81±0.02 |
>>
>> - Throughput increases with folio count but plateaus after a threshold
>>   (migrate_pages() uses a folio batch size of 512, NR_MAX_BATCHED_MIGRATION).
>>
>> Performance Analysis (V2 vs V3):
>>
>> The new SG-based DMA driver dramatically reduces software overhead. By
>> switching from per-folio dma_map_page() to batch dma_map_sgtable(), setup
>> time improves by an order of magnitude for large batches.
>> This is most visible with 4KB folios, making DMA viable even for smaller
>> page sizes. For 2MB THP migrations, where hardware transfer time dominates,
>> the gains are more modest.
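>>
>> For illustration, the shape of the batched SG path (simplified: names and
>> error handling are not the actual dcbm code, and it assumes both tables
>> map to the same number of equally sized DMA segments, which a real driver
>> must verify):
>>
>>   #include <linux/completion.h>
>>   #include <linux/dma-mapping.h>
>>   #include <linux/dmaengine.h>
>>   #include <linux/scatterlist.h>
>>
>>   static void batch_done_cb(void *arg)
>>   {
>>           complete(arg);
>>   }
>>
>>   /* src_sgt/dst_sgt were filled with one sg_set_page() per folio. */
>>   static int copy_batch_sg(struct dma_chan *chan, struct device *dev,
>>                            struct sg_table *src_sgt,
>>                            struct sg_table *dst_sgt,
>>                            struct completion *done)
>>   {
>>           struct dma_async_tx_descriptor *tx;
>>           struct scatterlist *s, *d = dst_sgt->sgl;
>>           int i;
>>
>>           /* One mapping call per side, not one dma_map_page() per folio. */
>>           if (dma_map_sgtable(dev, src_sgt, DMA_TO_DEVICE, 0))
>>                   return -EIO;
>>           if (dma_map_sgtable(dev, dst_sgt, DMA_FROM_DEVICE, 0)) {
>>                   dma_unmap_sgtable(dev, src_sgt, DMA_TO_DEVICE, 0);
>>                   return -EIO;
>>           }
>>
>>           for_each_sgtable_dma_sg(src_sgt, s, i) {
>>                   bool last = (i == src_sgt->nents - 1);
>>
>>                   /* Only the final descriptor requests an interrupt, so
>>                    * the whole batch completes with a single IRQ. */
>>                   tx = dmaengine_prep_dma_memcpy(chan, sg_dma_address(d),
>>                                                  sg_dma_address(s),
>>                                                  sg_dma_len(s),
>>                                                  last ? DMA_PREP_INTERRUPT : 0);
>>                   if (!tx)
>>                           return -EIO;  /* real code unmaps and falls back to CPU copy */
>>                   if (last) {
>>                           tx->callback = batch_done_cb;
>>                           tx->callback_param = done;
>>                   }
>>                   dmaengine_submit(tx);
>>                   d = sg_next(d);
>>           }
>>           dma_async_issue_pending(chan);
>>           wait_for_completion(done);
>>           /* dma_unmap_sgtable() both tables here before returning. */
>>           return 0;
>>   }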
>>
>> OPEN QUESTIONS:
>> ---------------
>>
>> User-Interface:
>>
>> 1. Control Interface Design:
>>    The current interface creates separate sysfs files for each driver,
>>    which can be confusing. Should we implement a unified interface
>>    (/sys/kernel/mm/migration/offload_migrator) that accepts the name of
>>    the desired migrator ("kernel", "mtcopy", "dcbm")? This would ensure
>>    only one migrator is active at a time. Is this the right approach?
>>
>> 2. Dynamic Migrator Selection:
>>    Currently, the active migrator is global state, and only one can be
>>    active at a time. A more flexible approach might be for the caller of
>>    migrate_pages() to specify or hint which offload mechanism to use, if
>>    any. This would allow a CXL driver to explicitly request DMA while a
>>    GPU driver might prefer multi-threaded CPU copy.
>>
>> 3. Tuning Parameters:
>>    Expose parameters like the number of threads/channels, batch size, and
>>    thresholds for using migrators. Who should own these parameters?
>>
>> 4. Resource Accounting [3]:
>> a. CPU cgroups accounting and fairness
>> b. Migration cost attribution
>>
>> FUTURE WORK:
>> ------------
>>
>> 1. Enhance DMA drivers for bulk copying (e.g., SDXi Engine).
>> 2. Enhance multi-threaded CPU copying with platform-specific scheduling of
>>    worker threads to optimize bandwidth utilization. Explore sched-ext for
>>    this. [2]
>> 3. Enable kpromoted [4] to use the migration offload infrastructure.
>>
>> EARLIER POSTINGS:
>> -----------------
>>
>> - RFC V2: https://lore.kernel.org/all/20250319192211.10092-1-shivankg@amd.com
>> - RFC V1: https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
>>
>> REFERENCES:
>> -----------
>>
>> [1] RFC from Zi Yan: https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
>> [2] LSFMM: https://lore.kernel.org/all/cf6fc05d-c0b0-4de3-985e-5403977aa3aa@amd.com
>> [3] https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com
>> [4] https://lore.kernel.org/all/20250910144653.212066-1-bharata@amd.com
>>
>> Mike Day (1):
>> mm: add support for copy offload for folio Migration
>>
>> Shivank Garg (4):
>> mm: Introduce folios_mc_copy() for batch copying folios
>> mm/migrate: add migrate_folios_batch_move to batch the folio move
>> operations
>> dcbm: add dma core batch migrator for batch page offloading
>> mtcopy: spread threads across die for testing
>>
>> Zi Yan (4):
>> mm/migrate: factor out code in move_to_new_folio() and
>> migrate_folio_move()
>> mm/migrate: revive MIGRATE_NO_COPY in migrate_mode
>> mtcopy: introduce multi-threaded page copy routine
>> adjust NR_MAX_BATCHED_MIGRATION for testing
>>
>> drivers/Kconfig | 2 +
>> drivers/Makefile | 3 +
>> drivers/migoffcopy/Kconfig | 17 +
>> drivers/migoffcopy/Makefile | 2 +
>> drivers/migoffcopy/dcbm/Makefile | 1 +
>> drivers/migoffcopy/dcbm/dcbm.c | 415 +++++++++++++++++++++++++
>> drivers/migoffcopy/mtcopy/Makefile | 1 +
>> drivers/migoffcopy/mtcopy/copy_pages.c | 397 +++++++++++++++++++++++
>> include/linux/migrate_mode.h | 2 +
>> include/linux/migrate_offc.h | 34 ++
>> include/linux/mm.h | 2 +
>> mm/Kconfig | 8 +
>> mm/Makefile | 1 +
>> mm/migrate.c | 358 ++++++++++++++++++---
>> mm/migrate_offc.c | 58 ++++
>> mm/util.c | 29 ++
>> 16 files changed, 1284 insertions(+), 46 deletions(-)
>> create mode 100644 drivers/migoffcopy/Kconfig
>> create mode 100644 drivers/migoffcopy/Makefile
>> create mode 100644 drivers/migoffcopy/dcbm/Makefile
>> create mode 100644 drivers/migoffcopy/dcbm/dcbm.c
>> create mode 100644 drivers/migoffcopy/mtcopy/Makefile
>> create mode 100644 drivers/migoffcopy/mtcopy/copy_pages.c
>> create mode 100644 include/linux/migrate_offc.h
>> create mode 100644 mm/migrate_offc.c
>>
>> --
>> 2.43.0
>
>
> Best Regards,
> Yan, Zi