From: Zi Yan <ziy@nvidia.com>
To: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Shivank Garg <shivankg@amd.com>,
akpm@linux-foundation.org, david@redhat.com, willy@infradead.org,
matthew.brost@intel.com, joshua.hahnjy@gmail.com,
rakie.kim@sk.com, byungchul@sk.com, gourry@gourry.net,
apopple@nvidia.com, lorenzo.stoakes@oracle.com,
Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
surenb@google.com, mhocko@suse.com, vkoul@kernel.org,
lucas.demarchi@intel.com, rdunlap@infradead.org, jgg@ziepe.ca,
kuba@kernel.org, justonli@chromium.org, ivecera@redhat.com,
dave.jiang@intel.com, Jonathan.Cameron@huawei.com,
dan.j.williams@intel.com, rientjes@google.com,
Raghavendra.KodsaraThimmappa@amd.com, bharata@amd.com,
alirad.malek@zptcorp.com, yiannis@zptcorp.com,
weixugc@google.com, linux-kernel@vger.kernel.org,
linux-mm@kvack.org
Subject: Re: [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload
Date: Tue, 23 Sep 2025 22:03:18 -0400
Message-ID: <C8E561B3-B9DB-4F58-A2C7-4EE17E08A993@nvidia.com>
In-Reply-To: <87plbghb66.fsf@DESKTOP-5N7EMDA>
On 23 Sep 2025, at 21:49, Huang, Ying wrote:
> Hi, Shivank,
>
> Thanks for working on this!
>
> Shivank Garg <shivankg@amd.com> writes:
>
>> This is the third RFC of the patchset to enhance page migration by batching
>> folio-copy operations and enabling acceleration via multi-threaded CPU or
>> DMA offload.
>>
>> Single-threaded, folio-by-folio copying bottlenecks page migration
>> in modern systems with deep memory hierarchies, especially for large
>> folios where copy overhead dominates, leaving significant hardware
>> potential untapped.
>>
>> By batching the copy phase, we create an opportunity for significant
>> hardware acceleration. This series builds a framework for this acceleration
>> and provides two initial offload driver implementations: one using multiple
>> CPU threads (mtcopy) and another leveraging the DMAEngine subsystem (dcbm).
>>
>> This version incorporates significant feedback to improve correctness,
>> robustness, and the efficiency of the DMA offload path.
>>
>> Changelog since V2:
>>
>> 1. DMA Engine Rewrite:
>> - Switched from per-folio dma_map_page() to batch dma_map_sgtable()
>> - Single completion interrupt per batch (reduced overhead)
>> - Order of magnitude improvement in setup time for large batches
>> 2. Code cleanups and refactoring
>> 3. Rebased on latest mainline (6.17-rc6+)
>>
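>> As a rough illustration of the dma_map_sgtable() rework in item 1 above
>> (a sketch only, not the dcbm driver code; the function name and the
>> one-descriptor-per-segment pairing are assumptions), the batched path
>> looks roughly like:
>>
>> #include <linux/dma-mapping.h>
>> #include <linux/dmaengine.h>
>> #include <linux/scatterlist.h>
>>
>> /*
>>  * Map the whole batch of src/dst folios once, submit one memcpy
>>  * descriptor per DMA segment, and request a completion interrupt only
>>  * on the last descriptor. Assumes src and dst map to the same number
>>  * of DMA segments; error unwinding is simplified.
>>  */
>> static int dma_copy_sg_batch(struct device *dev, struct dma_chan *chan,
>>                              struct sg_table *src_sgt,
>>                              struct sg_table *dst_sgt)
>> {
>>         struct scatterlist *src_sg, *dst_sg;
>>         struct dma_async_tx_descriptor *tx;
>>         dma_cookie_t cookie = 0;
>>         int i, ret;
>>
>>         ret = dma_map_sgtable(dev, src_sgt, DMA_TO_DEVICE, 0);
>>         if (ret)
>>                 return ret;
>>         ret = dma_map_sgtable(dev, dst_sgt, DMA_FROM_DEVICE, 0);
>>         if (ret)
>>                 goto unmap_src;
>>
>>         dst_sg = dst_sgt->sgl;
>>         for_each_sgtable_dma_sg(src_sgt, src_sg, i) {
>>                 unsigned long flags = (i == src_sgt->nents - 1) ?
>>                                       DMA_PREP_INTERRUPT : 0;
>>
>>                 tx = dmaengine_prep_dma_memcpy(chan, sg_dma_address(dst_sg),
>>                                                sg_dma_address(src_sg),
>>                                                sg_dma_len(src_sg), flags);
>>                 if (!tx) {
>>                         ret = -EIO;
>>                         goto unmap_dst;
>>                 }
>>                 cookie = dmaengine_submit(tx);
>>                 dst_sg = sg_next(dst_sg);
>>         }
>>
>>         dma_async_issue_pending(chan);
>>         /* A real driver would use a callback + completion, not polling. */
>>         ret = dma_sync_wait(chan, cookie) == DMA_COMPLETE ? 0 : -EIO;
>>
>> unmap_dst:
>>         dma_unmap_sgtable(dev, dst_sgt, DMA_FROM_DEVICE, 0);
>> unmap_src:
>>         dma_unmap_sgtable(dev, src_sgt, DMA_TO_DEVICE, 0);
>>         return ret;
>> }
>>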
>> MOTIVATION:
>> -----------
>>
>> Current Migration Flow:
>> [ move_pages(), Compaction, Tiering, etc. ]
>> |
>> v
>> [ migrate_pages() ] // Common entry point
>> |
>> v
>> [ migrate_pages_batch() ] // NR_MAX_BATCHED_MIGRATION (512) folios at a time
>> |
>> |--> [ migrate_folio_unmap() ]
>> |
>> |--> [ try_to_unmap_flush() ] // Perform a single, batched TLB flush
>> |
>> |--> [ migrate_folios_move() ] // Bottleneck: Interleaved copy
>> - For each folio:
>> - Metadata prep: Copy flags, mappings, etc.
>> - folio_copy() <-- Single-threaded, serial data copy.
>> - Update PTEs & finalize for that single folio.
>>
>> Understanding overheads in page migration (move_pages() syscall):
>>
>> Total move_pages() overheads = folio_copy() + Other overheads
>> 1. folio_copy() is the core copy operation that interests us.
>> 2. The remaining operations are user/kernel transitions, page table walks,
>> locking, folio unmap, dst folio alloc, TLB flush, copying flags, updating
>> mappings and PTEs etc. that contribute to the remaining overheads.
>>
>> Percentage of folio_copy() overheads in move_pages(N pages) syscall time:
>> Number of pages being migrated and folio size:
>>                          4KB        2MB
>>     1 page               <1%        ~66%
>>     512 pages            ~35%       ~97%
>>
>> Based on Amdahl's Law, optimizing folio_copy() for large pages offers a
>> substantial performance opportunity.
>>
>> move_pages() syscall speedup = 1 / ((1 - F) + (F / S))
>> Where F is the fraction of time spent in folio_copy() and S is the speedup of
>> folio_copy().
>>
>> For 4KB folios, the folio-copy overhead in single-page migrations is too
>> small to meaningfully affect overall speedup; even for 512 pages, the
>> maximum theoretical speedup is limited to ~1.54x with an infinite
>> folio_copy() speedup.
>>
>> For 2MB THPs, the folio-copy overhead is significant even in single-page
>> migrations, giving a theoretical speedup of ~3x with an infinite
>> folio_copy() speedup, and up to ~33x for 512 pages.
>>
>> A realistic value of S (the folio_copy() speedup) is 7.5x for DMA offload,
>> based on my measurements for copying 512 2MB pages. This gives move_pages()
>> a practical speedup of 6.3x for 512 2MB pages (also observed in the
>> experiments below).
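>>
>> Plugging the measured numbers into the formula above (F ~= 0.97 for
>> 512 2MB pages, S = 7.5):
>>
>>   speedup = 1 / ((1 - 0.97) + (0.97 / 7.5))
>>           = 1 / (0.03 + 0.129)
>>           ~= 6.3x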
>>
>> DESIGN: A Pluggable Migrator Framework
>> ---------------------------------------
>>
>> Introduce migrate_folios_batch_move():
>>
>> [ migrate_pages_batch() ]
>> |
>> |--> migrate_folio_unmap()
>> |
>> |--> try_to_unmap_flush()
>> |
>> +--> [ migrate_folios_batch_move() ] // new batched design
>> |
>> |--> Metadata migration
>> | - Metadata prep: Copy flags, mappings, etc.
>> | - Use MIGRATE_NO_COPY to skip the actual data copy.
>> |
>> |--> Batch copy folio data
>> | - Migrator is configurable at runtime via sysfs.
>> |
>> | static_call(_folios_copy) // Pluggable migrators
>> | / | \
>> | v v v
>> | [ Default ] [ MT CPU copy ] [ DMA Offload ]
>> |
>> +--> Update PTEs to point to dst folios and complete migration.
>>
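>> The runtime switch is ordinary static_call plumbing; a minimal sketch
>> (the callback signature and helper bodies here are assumptions; only the
>> _folios_copy, offc_update_migrator() and migrate_offc names come from
>> the series):
>>
>> #include <linux/list.h>
>> #include <linux/static_call.h>
>>
>> struct migrator {
>>         int (*migrate_offc)(struct list_head *src_folios,
>>                             struct list_head *dst_folios);
>> };
>>
>> /* Kernel default: the serial folio_copy() loop (defined elsewhere). */
>> static int kernel_default_copy(struct list_head *src_folios,
>>                                struct list_head *dst_folios);
>>
>> DEFINE_STATIC_CALL(_folios_copy, kernel_default_copy);
>>
>> /* Retarget the static call when a driver (un)registers as the migrator. */
>> static void offc_update_migrator(struct migrator *mig)
>> {
>>         if (mig)
>>                 static_call_update(_folios_copy, mig->migrate_offc);
>>         else
>>                 static_call_update(_folios_copy, kernel_default_copy);
>> }
>>
>> /* In migrate_folios_batch_move(): one direct call, no indirect branch. */
>> static int copy_folio_batch(struct list_head *src, struct list_head *dst)
>> {
>>         return static_call(_folios_copy)(src, dst);
>> }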
>
> I have just jumped into the discussion, so this may have been discussed
> before; sorry if so. Why not
>
> migrate_folios_unmap()
> try_to_unmap_flush()
> copy folios in parallel if possible
> migrate_folios_move(): with MIGRATE_NO_COPY?

Because move_to_new_folio() performs various migration preparation steps,
any of which can fail, copying folios up front regardless might lead to
unnecessary work. What is your take on this?
>
>> User Control of Migrator:
>>
>> # echo 1 > /sys/kernel/dcbm/offloading
>> |
>> +--> Driver's sysfs handler
>> |
>> +--> calls start_offloading(&cpu_migrator)
>> |
>> +--> calls offc_update_migrator()
>> |
>> +--> static_call_update(_folios_copy, mig->migrate_offc)
>>
>> Later, During Migration ...
>> migrate_folios_batch_move()
>> |
>> +--> static_call(_folios_copy) // Now dispatches to the selected migrator
>> |
>> +-> [ mtcopy | dcbm | kernel_default ]
>>
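>> On the driver side this is plain sysfs plumbing; a rough sketch (the
>> migrator object, stop_offloading(), and the attribute wiring here are
>> assumptions; start_offloading() and the /sys/kernel/dcbm/offloading path
>> are from the series):
>>
>> #include <linux/kernel.h>
>> #include <linux/kobject.h>
>> #include <linux/sysfs.h>
>>
>> static ssize_t offloading_store(struct kobject *kobj,
>>                                 struct kobj_attribute *attr,
>>                                 const char *buf, size_t count)
>> {
>>         bool enable;
>>         int ret;
>>
>>         ret = kstrtobool(buf, &enable);
>>         if (ret)
>>                 return ret;
>>
>>         if (enable)
>>                 start_offloading(&dcbm_migrator); /* _folios_copy -> dcbm */
>>         else
>>                 stop_offloading();                /* back to kernel default */
>>
>>         return count;
>> }
>> static struct kobj_attribute offloading_attr = __ATTR_WO(offloading);
>>
>> /*
>>  * At driver init: kobject_create_and_add("dcbm", kernel_kobj) followed by
>>  * sysfs_create_file(kobj, &offloading_attr.attr).
>>  */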
>
> [snip]
>
> ---
> Best Regards,
> Huang, Ying
Best Regards,
Yan, Zi