* [LSF/MM/BPF TOPIC] Enhancements to Page Migration with Multi-threading and Batch Offloading to DMA
From: Shivank Garg @ 2025-01-23 5:55 UTC
To: akpm, lsf-pc, linux-mm, ziy
Cc: AneeshKumar.KizhakeVeetil, baolin.wang, bharata, david,
gregory.price, honggyu.kim, jane.chu, jhubbard, jon.grimm,
k.shutemov, leesuyeon0506, leillc, liam.howlett, linux-kernel,
mel.gorman, Michael.Day, Raghavendra.KodsaraThimmappa, riel,
rientjes, santosh.shukla, shivankg, shy828301, sj,
wangkefeng.wang, weixugc, willy, ying.huang
Hi all,
Zi Yan and I would like to propose the topic: Enhancements to Page
Migration with Multi-threading and Batch Offloading to DMA.
Page migration is a critical operation in NUMA systems that can incur
significant overheads, affecting memory management performance across
various workloads. For example, copying folios between DRAM NUMA nodes
can account for ~25% of the total cost of migrating 256MB of data.
Modern systems are equipped with powerful DMA engines for bulk data
copying, GPUs, and high CPU core counts. Leveraging these hardware
capabilities becomes essential for systems where frequent page promotion
and demotion occur - from large-scale tiered-memory systems with CXL nodes
to CPU-GPU coherent systems with GPU memory exposed as NUMA nodes.
Existing page migration performs sequential page copying, underutilizing
modern CPU architectures and high-bandwidth memory subsystems.
We have proposed and posted RFCs to enhance page migration through three
key techniques:
1. Batching migration operations for bulk copying data [1]
2. Multi-threaded folio copying [2]
3. DMA offloading to hardware accelerators [1]
By employing batching and multi-threaded folio copying, we are able to
achieve significant improvements in page migration throughput for large
pages.
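
As a rough illustration of the multi-threaded copy path (a sketch under
assumptions, not the code posted in the RFCs), one large folio copy could be
fanned out over the unbound workqueue as below; copy_folio_range() and
MAX_COPY_THREADS are hypothetical placeholders for a chunked
kmap-and-memcpy helper and a worker cap:

/* Illustrative sketch only, not the RFC code.
 * Caller is assumed to pass nr_threads <= MAX_COPY_THREADS. */
#include <linux/workqueue.h>
#include <linux/mm.h>

#define MAX_COPY_THREADS	8

struct folio_copy_chunk {
	struct work_struct work;
	struct folio *dst, *src;
	size_t offset, len;		/* byte range within the folio */
};

static void folio_copy_chunk_fn(struct work_struct *work)
{
	struct folio_copy_chunk *c =
		container_of(work, struct folio_copy_chunk, work);

	/* copy_folio_range() is a hypothetical helper that kmaps and
	 * memcpy()s the given byte range, page by page. */
	copy_folio_range(c->dst, c->src, c->offset, c->len);
}

/* Fan the copy of one large folio out over nr_threads workers and wait. */
static void folio_copy_mt(struct folio *dst, struct folio *src, int nr_threads)
{
	struct folio_copy_chunk c[MAX_COPY_THREADS];
	size_t size = folio_size(src), chunk = size / nr_threads;
	int i;

	for (i = 0; i < nr_threads; i++) {
		c[i].dst = dst;
		c[i].src = src;
		c[i].offset = i * chunk;
		c[i].len = (i == nr_threads - 1) ? size - c[i].offset : chunk;
		INIT_WORK_ONSTACK(&c[i].work, folio_copy_chunk_fn);
		queue_work(system_unbound_wq, &c[i].work);
	}
	for (i = 0; i < nr_threads; i++) {
		flush_work(&c[i].work);
		destroy_work_on_stack(&c[i].work);
	}
}

The posted RFCs additionally batch many folios per work item and add
thread-count policy and error handling; the sketch only shows the
fan-out/fan-in shape of the copy.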
Discussion points:
1. Performance:
   a. Policy decision for DMA and CPU selection
   b. Platform-specific scheduling of folio-copy worker threads for better
      bandwidth utilization
   c. Using non-temporal instructions for CPU-based memcpy (see the sketch
      after this list)
   d. Upscaling/downscaling worker threads based on migration size, CPU
      availability (system load), bandwidth saturation, etc.
2. Interface requirements with DMA hardware:
   a. Standardizing APIs for DMA drivers and support for different DMA
      drivers
   b. Enhancing DMA drivers for bulk copying (e.g., SDXi Engine)
3. Resource accounting:
   a. CPU cgroups accounting and fairness [3]
   b. Who bears the migration cost? (Migration cost attribution)
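
For discussion point 1c, the following user-space sketch models what a
non-temporal copy loop looks like with SSE2 streaming stores (illustrative
only; the kernel would use its own copy primitives). The point of streaming
stores is that the bulk copy bypasses the cache, so it does not evict the
workload's working set:

/* User-space model of a non-temporal bulk copy; assumes 16-byte-aligned
 * buffers and a length that is a multiple of 64 bytes. */
#include <emmintrin.h>
#include <stddef.h>

static void memcpy_nontemporal(void *dst, const void *src, size_t len)
{
	__m128i *d = dst;
	const __m128i *s = src;
	size_t i;

	for (i = 0; i < len / sizeof(__m128i); i += 4) {
		__m128i a = _mm_load_si128(s + i);
		__m128i b = _mm_load_si128(s + i + 1);
		__m128i c = _mm_load_si128(s + i + 2);
		__m128i e = _mm_load_si128(s + i + 3);

		/* streaming stores write around the cache */
		_mm_stream_si128(d + i, a);
		_mm_stream_si128(d + i + 1, b);
		_mm_stream_si128(d + i + 2, c);
		_mm_stream_si128(d + i + 3, e);
	}
	_mm_sfence();	/* order the streaming stores before completion */
}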
References:
[1] https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
[2] https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
[3] https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com
Best Regards,
Shivank

* Re: [LSF/MM/BPF TOPIC] Enhancements to Page Migration with Multi-threading and Batch Offloading to DMA
From: David Rientjes @ 2025-01-27 6:55 UTC
To: Shivank Garg
Cc: akpm, lsf-pc, linux-mm, ziy, AneeshKumar.KizhakeVeetil, baolin.wang,
    bharata, david, gregory.price, honggyu.kim, jane.chu, jhubbard,
    jon.grimm, k.shutemov, leesuyeon0506, leillc, liam.howlett,
    linux-kernel, mel.gorman, Michael.Day, Raghavendra.KodsaraThimmappa,
    riel, santosh.shukla, shy828301, sj, wangkefeng.wang, weixugc, willy,
    ying.huang

On Thu, 23 Jan 2025, Shivank Garg wrote:

> Hi all,
>
> Zi Yan and I would like to propose the topic: Enhancements to Page
> Migration with Multi-threading and Batch Offloading to DMA.

I think this would be a very useful topic to discuss, thanks for proposing
it.

> Page migration is a critical operation in NUMA systems that can incur
> significant overheads, affecting memory management performance across
> various workloads. For example, copying folios between DRAM NUMA nodes
> can take ~25% of the total migration cost for migrating 256MB of data.
>
> Modern systems are equipped with powerful DMA engines for bulk data
> copying, GPUs, and high CPU core counts. Leveraging these hardware
> capabilities becomes essential for systems where frequent page promotion
> and demotion occur - from large-scale tiered-memory systems with CXL nodes
> to CPU-GPU coherent system with GPU memory exposed as NUMA nodes.

Indeed, there are multiple use cases for optimizations in this area. With
the ramp of memory tiered systems, I think there will be an even greater
reliance on memory migration going forward.

Do you have numbers to share on how offloading, even as a proof of
concept, moves the needle compared to traditional and sequential memory
migration?

> Existing page migration performs sequential page copying, underutilizing
> modern CPU architectures and high-bandwidth memory subsystems.
>
> We have proposed and posted RFCs to enhance page migration through three
> key techniques:
> 1. Batching migration operations for bulk copying data [1]
> 2. Multi-threaded folio copying [2]
> 3. DMA offloading to hardware accelerators [1]

Curious: does memory migration of pages that are actively undergoing DMA
with hardware assist fit into any of these?

> By employing batching and multi-threaded folio copying, we are able to
> achieve significant improvements in page migration throughput for large
> pages.
>
> Discussion points:
> 1. Performance:
>    a. Policy decision for DMA and CPU selection
>    b. Platform-specific scheduling of folio-copy worker threads for better
>       bandwidth utilization

Why platform specific? I *assume* this means a generic framework that can
optimize for scheduling based on the underlying hardware and not specific
implementations that can only be used on AMD, for example. Is that the
case?

>    c. Using Non-temporal instructions for CPU-based memcpy
[... rest of the quoted proposal and references snipped ...]

* Re: [LSF/MM/BPF TOPIC] Enhancements to Page Migration with Multi-threading and Batch Offloading to DMA
From: Zi Yan @ 2025-01-27 12:37 UTC
To: David Rientjes, Shivank Garg
Cc: akpm, lsf-pc, linux-mm, AneeshKumar.KizhakeVeetil, baolin.wang,
    bharata, david, gregory.price, honggyu.kim, jane.chu, jhubbard,
    jon.grimm, k.shutemov, leesuyeon0506, leillc, liam.howlett,
    linux-kernel, mel.gorman, Michael.Day, Raghavendra.KodsaraThimmappa,
    riel, santosh.shukla, shy828301, sj, wangkefeng.wang, weixugc, willy,
    ying.huang

On 27 Jan 2025, at 1:55, David Rientjes wrote:

[...]

> Do you have numbers to share on how offloading, even as a proof of
> concept, moves the needle compared to traditional and sequential memory
> migration?

For multithreaded page migration, you can see my RFC patchset [1].

On NVIDIA Grace, the 32-thread copy throughput can be up to 10x that of the
single-threaded serial folio copy. Batching folio copies benefits not only
huge pages but also base pages.

64KB (GB/s):

        vanilla   mt_1    mt_2    mt_4    mt_8    mt_16   mt_32
32      5.43      4.90    5.65    7.31    7.60    8.61    6.43
256     6.95      6.89    9.28    14.67   22.41   23.39   23.93
512     7.88      7.26    10.15   17.53   27.82   27.88   33.93
768     7.65      7.42    10.46   18.59   28.65   29.67   30.76
1024    7.46      8.01    10.90   17.77   27.04   32.18   38.80

2MB mTHP (GB/s):

        vanilla   mt_1    mt_2    mt_4    mt_8    mt_16   mt_32
1       5.94      2.90    6.90    8.56    11.16   8.76    6.41
2       7.67      5.57    7.11    12.48   17.37   15.68   14.10
4       8.01      6.04    10.25   20.14   22.52   27.79   25.28
8       8.42      7.00    11.41   24.73   33.96   32.62   39.55
16      9.41      6.91    12.23   27.51   43.95   49.15   51.38
32      10.23     7.15    13.03   29.52   49.49   69.98   71.51
64      9.40      7.37    13.88   30.38   52.00   76.89   79.41
128     8.59      7.23    14.20   28.39   49.98   78.27   90.18
256     8.43      7.16    14.59   28.14   48.78   76.88   92.28
512     8.31      7.78    14.40   26.20   43.31   63.91   75.21
768     8.30      7.86    14.83   27.41   46.25   69.85   81.31
1024    8.31      7.90    14.96   27.62   46.75   71.76   83.84

I also ran it on a two-socket Xeon E5-2650 v4:

4KB (GB/s)

|      | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
| ---- | ------- | ---- | ---- | ---- | ---- | ----- |
| 512  | 1.12    | 1.19 | 1.20 | 1.26 | 1.27 | 1.35  |
| 768  | 1.29    | 1.14 | 1.28 | 1.40 | 1.39 | 1.46  |
| 1024 | 1.19    | 1.25 | 1.34 | 1.51 | 1.52 | 1.53  |
| 2048 | 1.14    | 1.12 | 1.44 | 1.61 | 1.73 | 1.71  |
| 4096 | 1.09    | 1.14 | 1.46 | 1.64 | 1.81 | 1.78  |

2MB (GB/s)

|      | vanilla | mt_1 | mt_2 | mt_4  | mt_8  | mt_16 |
| ---- | ------- | ---- | ---- | ----- | ----- | ----- |
| 1    | 2.03    | 2.21 | 2.69 | 2.93  | 3.17  | 3.14  |
| 2    | 2.28    | 2.13 | 3.54 | 4.50  | 4.72  | 4.72  |
| 4    | 2.92    | 2.93 | 4.44 | 6.50  | 7.24  | 7.06  |
| 8    | 2.29    | 2.37 | 3.21 | 6.86  | 8.83  | 8.44  |
| 16   | 2.10    | 2.09 | 4.57 | 8.06  | 8.32  | 9.70  |
| 32   | 2.22    | 2.21 | 4.43 | 8.96  | 9.37  | 11.54 |
| 64   | 2.35    | 2.35 | 3.15 | 7.77  | 10.77 | 13.61 |
| 128  | 2.48    | 2.53 | 5.12 | 8.18  | 11.01 | 15.62 |
| 256  | 2.55    | 2.53 | 5.44 | 8.25  | 12.73 | 16.49 |
| 512  | 2.61    | 2.52 | 5.73 | 11.26 | 17.18 | 16.97 |
| 768  | 2.55    | 2.53 | 5.90 | 11.41 | 14.86 | 17.15 |
| 1024 | 2.56    | 2.52 | 5.99 | 11.46 | 16.77 | 17.25 |

Shivank ran it on AMD EPYC Zen 5, after some tuning (spreading threads
across different CCDs):

2MB pages (GB/s):

nr_pages  vanilla  mt:0   mt:1   mt:2   mt:4   mt:8   mt:16  mt:32
1         10.74    11.04  4.68   8.17   6.47   6.09   3.97   6.20
2         12.44    4.90   11.19  14.10  15.33  8.45   10.09  9.97
4         14.82    9.80   11.93  18.35  21.82  17.09  10.53  7.51
8         16.13    9.91   15.26  11.85  26.53  13.09  12.71  13.75
16        15.99    8.81   13.84  22.43  33.89  11.91  12.30  13.26
32        14.03    11.37  17.54  23.96  57.07  18.78  19.51  21.29
64        15.79    9.55   22.19  33.17  57.18  65.51  55.39  62.53
128       18.22    16.65  21.49  30.73  52.99  61.05  58.44  60.38
256       19.78    20.56  24.72  34.94  56.73  71.11  61.83  62.77
512       20.27    21.40  27.47  39.23  65.72  67.97  70.48  71.39
1024      20.48    21.48  27.48  38.30  68.62  77.94  78.00  78.95

>> We have proposed and posted RFCs to enhance page migration through three
>> key techniques:
>> 1. Batching migration operations for bulk copying data [1]
>> 2. Multi-threaded folio copying [2]
>> 3. DMA offloading to hardware accelerators [1]
>
> Curious: does memory migration of pages that are actively undergoing DMA
> with hardware assist fit into any of these?

It should be similar to 3, but in this case, DMA is used to copy pages
between NUMA nodes, whereas traditional DMA page migration is used to copy
pages between host and devices.

>> b. Platform-specific scheduling of folio-copy worker threads for better
>> bandwidth utilization
>
> Why platform specific? I *assume* this means a generic framework that can
> optimize for scheduling based on the underlying hardware and not specific
> implementations that can only be used on AMD, for example. Is that the
> case?

I think the framework will be generic, but the CPU scheduling (which cores
to choose for page copying) will differ from vendor to vendor.

Due to existing CPU structure, like chiplet design, a single CPU scheduling
algorithm does not fit CPUs from different vendors. For example, on
NVIDIA Grace, you can use any CPUs to copy pages and always achieve high
page copy throughput, but on AMD CPUs with multiple CCDs, spreading copy
threads across different CCDs can achieve much higher page copy throughput
than putting all threads in a single CCD. I assume Intel CPUs with a chiplet
design would see the same result.

[...]

[1] https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com/

--
Best Regards,
Yan, Zi

* Re: [LSF/MM/BPF TOPIC] Enhancements to Page Migration with Multi-threading and Batch Offloading to DMA
From: Jonathan Cameron @ 2025-01-27 13:55 UTC
To: Zi Yan
Cc: David Rientjes, Shivank Garg, akpm, lsf-pc, linux-mm,
    AneeshKumar.KizhakeVeetil, baolin.wang, bharata, david, gregory.price,
    honggyu.kim, jane.chu, jhubbard, jon.grimm, k.shutemov, leesuyeon0506,
    leillc, liam.howlett, linux-kernel, mel.gorman, Michael.Day,
    Raghavendra.KodsaraThimmappa, riel, santosh.shukla, shy828301, sj,
    wangkefeng.wang, weixugc, willy, ying.huang

On Mon, 27 Jan 2025 07:37:19 -0500 Zi Yan <ziy@nvidia.com> wrote:

>>> Modern systems are equipped with powerful DMA engines for bulk data
>>> copying, GPUs, and high CPU core counts. Leveraging these hardware
>>> capabilities becomes essential for systems where frequent page promotion
>>> and demotion occur - from large-scale tiered-memory systems with CXL nodes
>>> to CPU-GPU coherent system with GPU memory exposed as NUMA nodes.

Hi,

With the potential usecase of tiered memory (CXL) migration mentioned above,
I'm curious what application scenario is such that we are willing to burn
lots of CPU cores at once on that migration? I'm very interested in the
DMA offload aspect, but multithreading for that usecase seems like a less
good fit as presumably there is something running that is seeing the poor
effects of memory latency that is making the move look like a good idea?

Or are we looking at some sort of demotion when the system is idle?

[... benchmark numbers and earlier discussion snipped ...]

> I think the framework will be generic but the CPU scheduling (which core
> to choose for page copying) will be different from vendor to vendor.
>
> Due to existing CPU structure, like chiplet design, a single CPU scheduling
> algorithm does not fit for CPUs from different vendors. For example, on
> NVIDIA Grace, you can use any CPUs to copy pages and always achieve high
> page copy throughput, but on AMD CPUs with multiple CCDs, spreading copy
> threads across different CCDs can achieve much higher page copy throughput
> than putting all threads in a single CCD. I assume Intel CPUs with chiplet
> design would see the same result.

On this I'd hope we could build something topology aware enough to make
the right decisions. All the information should be available to do this
without having per uarch specific code.

Jonathan

* Re: [LSF/MM/BPF TOPIC] Enhancements to Page Migration with Multi-threading and Batch Offloading to DMA
From: Zi Yan @ 2025-01-27 16:30 UTC
To: Jonathan Cameron
Cc: David Rientjes, Shivank Garg, akpm, lsf-pc, linux-mm,
    AneeshKumar.KizhakeVeetil, baolin.wang, bharata, david, gregory.price,
    honggyu.kim, jane.chu, jhubbard, jon.grimm, k.shutemov, leesuyeon0506,
    leillc, liam.howlett, linux-kernel, mel.gorman, Michael.Day,
    Raghavendra.KodsaraThimmappa, riel, santosh.shukla, shy828301, sj,
    wangkefeng.wang, weixugc, willy, ying.huang

On 27 Jan 2025, at 8:55, Jonathan Cameron wrote:

> With the potential usecase of tiered memory (CXL) migration mentioned above,
> I'm curious what application scenario is such that we are willing to burn
> lots of CPU cores at once on that migration? I'm very interested in the
> DMA offload aspect, but multithreading for that usecase seems
> like a less good fit as presumably there is something running that is
> seeing the poor effects of memory latency that is making the move look
> like a good idea?

There are two scenarios: 1) all CPUs are busy and 2) some CPUs are idle.

For the first one, it is a trade-off between a) spending CPU cycles on the
workloads, whose performance is limited by the long memory latency and low
memory bandwidth of accessing data on remote nodes, and b) spending CPU
cycles on moving hot data from remote to local much more quickly so that
the workloads run faster. Without multithreading, we are still "burning"
one CPU core to migrate data slowly from remote to local; with
multithreading, we use more CPU resources to move the data more quickly.
Admittedly, not all workloads would benefit from it, but if a workload sees
a significant performance boost when accessing data from local memory,
using more CPU resources to move it over should justify the use.

> Or are we looking at some sort of demotion when the system is idle?

For the second scenario, many GPU/accelerator-intensive workloads spend
most of their time in the GPU/accelerator, leaving CPUs idle. It is a great
use of these idle CPUs to shuffle hot and cold data to the right place, so
that the GPU/accelerator can always access hot data.

[... benchmark numbers snipped ...]

>> Due to existing CPU structure, like chiplet design, a single CPU scheduling
>> algorithm does not fit for CPUs from different vendors. For example, on
>> NVIDIA Grace, you can use any CPUs to copy pages and always achieve high
>> page copy throughput, but on AMD CPUs with multiple CCDs, spreading copy
>> threads across different CCDs can achieve much higher page copy throughput
>> than putting all threads in a single CCD. I assume Intel CPUs with chiplet
>> design would see the same result.
>
> On this I'd hope we could build something topology aware enough to make
> the right decisions. All the information should be available to do this
> without having per uarch specific code.

I agree. Currently, I am using a workqueue to copy data, but the workqueue
is not aware of CPU topology or CPU idleness. I think either the workqueue
needs to be enhanced, or something else with better CPU resource management
and topology awareness should be used, to achieve optimal multithreaded
page migration results.

Best Regards,
Yan, Zi
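
As a small illustration of the topology-aware placement discussed above (a
sketch under assumptions, not code from the RFCs; pick_copy_cpus() is a
made-up helper), the copy workers could be limited to one online CPU per
last-level-cache domain, e.g. one per CCD on AMD, using only generic
topology helpers:

#include <linux/cpumask.h>
#include <linux/sched/topology.h>

/* Select at most max_workers online CPUs, no two sharing an LLC. */
static int pick_copy_cpus(struct cpumask *picked, int max_workers)
{
	int cpu, prev, nr = 0;

	cpumask_clear(picked);
	for_each_online_cpu(cpu) {
		bool shares_llc = false;

		for_each_cpu(prev, picked) {
			if (cpus_share_cache(cpu, prev)) {
				shares_llc = true;
				break;
			}
		}
		if (shares_llc)
			continue;
		cpumask_set_cpu(cpu, picked);
		if (++nr == max_workers)
			break;
	}
	return nr;	/* number of distinct-LLC CPUs selected */
}

Each copy chunk's work item could then be queued with queue_work_on() on one
of the selected CPUs, while the worker count could still be scaled with
migration size and system load as listed in the proposal's discussion
points.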

* Re: [LSF/MM/BPF TOPIC] Enhancements to Page Migration with Multi-threading and Batch Offloading to DMA
From: Shivank Garg @ 2025-01-28 6:54 UTC
To: Zi Yan, David Rientjes
Cc: akpm, lsf-pc, linux-mm, AneeshKumar.KizhakeVeetil, baolin.wang,
    bharata, david, gregory.price, honggyu.kim, jane.chu, jhubbard,
    jon.grimm, k.shutemov, leesuyeon0506, leillc, liam.howlett,
    linux-kernel, mel.gorman, Michael.Day, Raghavendra.KodsaraThimmappa,
    riel, santosh.shukla, shy828301, sj, wangkefeng.wang, weixugc, willy,
    ying.huang, Jonathan.Cameron

Hi David, Zi,

On 1/27/2025 6:07 PM, Zi Yan wrote:

>> I think this would be a very useful topic to discuss, thanks for proposing
>> it.

Thanks for your interest in our proposal.

[... quoted benchmark numbers snipped ...]

>> Curious: does memory migration of pages that are actively undergoing DMA
>> with hardware assist fit into any of these?
>
> It should be similar to 3, but in this case, DMA is used to copy pages
> between NUMA nodes, whereas traditional DMA page migration is used to copy
> pages between host and devices.

I'm planning to test using SDXi as the DMA engine for offload and it
doesn't support migrating pages that are actively undergoing DMA AFAIU.

> I think the framework will be generic but the CPU scheduling (which core
> to choose for page copying) will be different from vendor to vendor.
[...]

Thank you Zi for helping with results and queries.
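
For the DMA-offload side mentioned above (SDXi being one example of a
memcpy-capable engine), a rough synchronous sketch of driving a generic
dmaengine channel for a single folio copy is shown below. This is
illustrative only and not the interface proposed in the RFCs; real code
would batch descriptors, check mapping errors, and complete asynchronously:

#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>
#include <linux/err.h>
#include <linux/mm.h>

static int folio_copy_dma(struct folio *dst, struct folio *src)
{
	struct dma_async_tx_descriptor *tx;
	struct dma_chan *chan;
	struct device *dev;
	dma_addr_t daddr, saddr;
	dma_cookie_t cookie;
	dma_cap_mask_t mask;
	size_t len = folio_size(src);
	int ret = 0;

	dma_cap_zero(mask);
	dma_cap_set(DMA_MEMCPY, mask);
	chan = dma_request_chan_by_mask(&mask);	/* any memcpy-capable engine */
	if (IS_ERR(chan))
		return PTR_ERR(chan);
	dev = chan->device->dev;

	saddr = dma_map_page(dev, folio_page(src, 0), 0, len, DMA_TO_DEVICE);
	daddr = dma_map_page(dev, folio_page(dst, 0), 0, len, DMA_FROM_DEVICE);

	tx = dmaengine_prep_dma_memcpy(chan, daddr, saddr, len, DMA_CTRL_ACK);
	if (tx) {
		cookie = dmaengine_submit(tx);
		dma_async_issue_pending(chan);
		/* A real implementation would use completion callbacks
		 * rather than waiting synchronously per folio. */
		dma_sync_wait(chan, cookie);
	} else {
		ret = -EIO;
	}

	dma_unmap_page(dev, daddr, len, DMA_FROM_DEVICE);
	dma_unmap_page(dev, saddr, len, DMA_TO_DEVICE);
	dma_release_channel(chan);
	return ret;
}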

* Re: [LSF/MM/BPF TOPIC] Enhancements to Page Migration with Multi-threading and Batch Offloading to DMA
From: Shivank Garg @ 2025-03-24 6:01 UTC
To: akpm, lsf-pc, linux-mm, ziy
Cc: AneeshKumar.KizhakeVeetil, baolin.wang, bharata, david, gregory.price,
    honggyu.kim, jane.chu, jhubbard, jon.grimm, k.shutemov, leesuyeon0506,
    leillc, liam.howlett, linux-kernel, mel.gorman, Michael.Day,
    Raghavendra.KodsaraThimmappa, riel, rientjes, santosh.shukla, shy828301,
    sj, wangkefeng.wang, weixugc, willy, ying.huang, wei.huang2,
    Jonathan.Cameron, byungchul

On 1/23/2025 11:25 AM, Shivank Garg wrote:

> Zi Yan and I would like to propose the topic: Enhancements to Page
> Migration with Multi-threading and Batch Offloading to DMA.
[...]

Hi all,

For reference, here is the link to the latest RFC v2:

https://lore.kernel.org/linux-mm/20250319192211.10092-1-shivankg@amd.com

This version combines the ideas discussed in [1] and [2] and includes details
on performance improvements and experimental findings to provide more context
for discussion.

> References:
> [1] https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
> [2] https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
> [3] https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com

Looking forward to your feedback!

Thanks,
Shivank

* Re: [LSF/MM/BPF TOPIC] Enhancements to Page Migration with Multi-threading and Batch Offloading to DMA
From: Shivank Garg @ 2025-03-25 5:20 UTC
To: akpm, lsf-pc, linux-mm, ziy
Cc: AneeshKumar.KizhakeVeetil, baolin.wang, bharata, david, gregory.price,
    honggyu.kim, jane.chu, jhubbard, jon.grimm, k.shutemov, leesuyeon0506,
    leillc, liam.howlett, linux-kernel, mel.gorman, Michael.Day,
    Raghavendra.KodsaraThimmappa, riel, rientjes, santosh.shukla, shy828301,
    sj, wangkefeng.wang, weixugc, willy, ying.huang, wei.huang2,
    Jonathan.Cameron, byungchul

On 3/24/2025 11:31 AM, Shivank Garg wrote:

> For reference, here is the link to the latest RFC v2:
>
> https://lore.kernel.org/linux-mm/20250319192211.10092-1-shivankg@amd.com
>
> This version combines the ideas discussed in [1] and [2] and includes details
> on performance improvements and experimental findings to provide more context
> for discussion.

Sharing the slides from today's presentation:

Main Slide Deck:
https://docs.google.com/presentation/d/1mjl5-jiz-TMVRK9bQcQ_IsSXrIP82CqWS8Q6em3mJi0/edit?usp=sharing

Multi-threading Slide Deck:
https://docs.google.com/presentation/d/10czypcUbRMOUn6knp340Cwv4bf83Ha2gUX8TwNXUwCs/edit#slide=id.p6

Thanks,
Shivank

Thread overview: 8 messages
2025-01-23  5:55 [LSF/MM/BPF TOPIC] Enhancements to Page Migration with Multi-threading and Batch Offloading to DMA  Shivank Garg
2025-01-27  6:55 ` David Rientjes
2025-01-27 12:37   ` Zi Yan
2025-01-27 13:55     ` Jonathan Cameron
2025-01-27 16:30       ` Zi Yan
2025-01-28  6:54     ` Shivank Garg
2025-03-24  6:01 ` Shivank Garg
2025-03-25  5:20   ` Shivank Garg