* [LSF/MM/BPF TOPIC] Enhancements to Page Migration with Multi-threading and Batch Offloading to DMA
@ 2025-01-23 5:55 Shivank Garg
2025-01-27 6:55 ` David Rientjes
2025-03-24 6:01 ` Shivank Garg
0 siblings, 2 replies; 8+ messages in thread
From: Shivank Garg @ 2025-01-23 5:55 UTC (permalink / raw)
To: akpm, lsf-pc, linux-mm, ziy
Cc: AneeshKumar.KizhakeVeetil, baolin.wang, bharata, david,
gregory.price, honggyu.kim, jane.chu, jhubbard, jon.grimm,
k.shutemov, leesuyeon0506, leillc, liam.howlett, linux-kernel,
mel.gorman, Michael.Day, Raghavendra.KodsaraThimmappa, riel,
rientjes, santosh.shukla, shivankg, shy828301, sj,
wangkefeng.wang, weixugc, willy, ying.huang
Hi all,
Zi Yan and I would like to propose the topic: Enhancements to Page
Migration with Multi-threading and Batch Offloading to DMA.
Page migration is a critical operation in NUMA systems that can incur
significant overheads, affecting memory management performance across
various workloads. For example, copying folios between DRAM NUMA nodes
can account for ~25% of the total migration cost when migrating 256MB of data.
Modern systems are equipped with powerful DMA engines for bulk data
copying, GPUs, and high CPU core counts. Leveraging these hardware
capabilities becomes essential for systems where frequent page promotion
and demotion occur - from large-scale tiered-memory systems with CXL nodes
to CPU-GPU coherent systems with GPU memory exposed as NUMA nodes.
Existing page migration performs sequential page copying, underutilizing
modern CPU architectures and high-bandwidth memory subsystems.
We have proposed and posted RFCs to enhance page migration through three
key techniques:
1. Batching migration operations for bulk copying data [1]
2. Multi-threaded folio copying [2]
3. DMA offloading to hardware accelerators [1]
By employing batching and multi-threaded folio copying, we are able to
achieve significant improvements in page migration throughput for large
pages.
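As a purely illustrative userspace sketch of the multi-threaded copy idea
(assuming POSIX threads; the fixed thread count and even chunking below are
placeholders for discussion, not the kernel implementation from the RFCs):

#include <pthread.h>
#include <string.h>

#define NR_COPY_THREADS 4	/* placeholder; the RFCs scale this */

struct copy_arg {
	char *dst;
	const char *src;
	size_t len;
};

static void *copy_worker(void *p)
{
	struct copy_arg *a = p;

	memcpy(a->dst, a->src, a->len);
	return NULL;
}

/* Split one large copy into per-thread chunks copied in parallel. */
static void parallel_copy(char *dst, const char *src, size_t len)
{
	pthread_t tid[NR_COPY_THREADS];
	struct copy_arg arg[NR_COPY_THREADS];
	size_t chunk = len / NR_COPY_THREADS;
	int i;

	for (i = 0; i < NR_COPY_THREADS; i++) {
		arg[i].dst = dst + i * chunk;
		arg[i].src = src + i * chunk;
		arg[i].len = (i == NR_COPY_THREADS - 1) ?
			     len - i * chunk : chunk;
		pthread_create(&tid[i], NULL, copy_worker, &arg[i]);
	}
	for (i = 0; i < NR_COPY_THREADS; i++)
		pthread_join(tid[i], NULL);
}

Batching plays a complementary role: collecting many folios per migration
call gives each copy enough data to amortize the thread (or DMA) setup cost.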
Discussion points:
1. Performance:
a. Policy decision for DMA and CPU selection
b. Platform-specific scheduling of folio-copy worker threads for better
bandwidth utilization
c. Using non-temporal instructions for CPU-based memcpy (see the sketch after this list)
d. Upscaling/downscaling worker threads based on migration size, CPU
availability (system load), bandwidth saturation, etc.
2. Interface requirements with DMA hardware:
a. Standardizing APIs for DMA drivers and support for different DMA
drivers
b. Enhancing DMA drivers for bulk copying (e.g., SDXi Engine)
3. Resource accounting:
a. CPU cgroups accounting and fairness [3]
b. Who bears the migration cost? (migration cost attribution)
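On point 1.c, a minimal userspace sketch of a non-temporal copy (assuming
x86-64 with SSE2, a 16-byte-aligned destination, and a length that is a
multiple of 16; this is only an illustration, not the memcpy variant the
kernel would actually use):

#include <emmintrin.h>
#include <stddef.h>

static void nt_copy(void *dst, const void *src, size_t len)
{
	__m128i *d = dst;
	const __m128i *s = src;
	size_t i;

	for (i = 0; i < len / 16; i++) {
		__m128i v = _mm_loadu_si128(&s[i]);	/* regular load */
		_mm_stream_si128(&d[i], v);		/* bypass caches on store */
	}
	_mm_sfence();	/* order streaming stores before later accesses */
}

The appeal for migration is that the destination pages are usually not about
to be read by the copying CPU, so filling the cache with them is wasted work
that also evicts the workload's own data.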
References:
[1] https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
[2] https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
[3] https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com
Best Regards,
Shivank
* Re: [LSF/MM/BPF TOPIC] Enhancements to Page Migration with Multi-threading and Batch Offloading to DMA
2025-01-23 5:55 [LSF/MM/BPF TOPIC] Enhancements to Page Migration with Multi-threading and Batch Offloading to DMA Shivank Garg
@ 2025-01-27 6:55 ` David Rientjes
2025-01-27 12:37 ` Zi Yan
2025-03-24 6:01 ` Shivank Garg
1 sibling, 1 reply; 8+ messages in thread
From: David Rientjes @ 2025-01-27 6:55 UTC (permalink / raw)
To: Shivank Garg
Cc: akpm, lsf-pc, linux-mm, ziy, AneeshKumar.KizhakeVeetil,
baolin.wang, bharata, david, gregory.price, honggyu.kim,
jane.chu, jhubbard, jon.grimm, k.shutemov, leesuyeon0506, leillc,
liam.howlett, linux-kernel, mel.gorman, Michael.Day,
Raghavendra.KodsaraThimmappa, riel, santosh.shukla, shy828301,
sj, wangkefeng.wang, weixugc, willy, ying.huang
On Thu, 23 Jan 2025, Shivank Garg wrote:
> Hi all,
>
> Zi Yan and I would like to propose the topic: Enhancements to Page
> Migration with Multi-threading and Batch Offloading to DMA.
>
I think this would be a very useful topic to discuss, thanks for proposing
it.
> Page migration is a critical operation in NUMA systems that can incur
> significant overheads, affecting memory management performance across
> various workloads. For example, copying folios between DRAM NUMA nodes
> can take ~25% of the total migration cost for migrating 256MB of data.
>
> Modern systems are equipped with powerful DMA engines for bulk data
> copying, GPUs, and high CPU core counts. Leveraging these hardware
> capabilities becomes essential for systems where frequent page promotion
> and demotion occur - from large-scale tiered-memory systems with CXL nodes
> to CPU-GPU coherent system with GPU memory exposed as NUMA nodes.
>
Indeed, there are multiple use cases for optimizations in this area. With
the ramp of memory tiered systems, I think there will be an even greater
reliance on memory migration going forward.
Do you have numbers to share on how offloading, even as a proof of
concept, moves the needle compared to traditional and sequential memory
migration?
> Existing page migration performs sequential page copying, underutilizing
> modern CPU architectures and high-bandwidth memory subsystems.
>
> We have proposed and posted RFCs to enhance page migration through three
> key techniques:
> 1. Batching migration operations for bulk copying data [1]
> 2. Multi-threaded folio copying [2]
> 3. DMA offloading to hardware accelerators [1]
>
Curious: does memory migration of pages that are actively undergoing DMA
with hardware assist fit into any of these?
> By employing batching and multi-threaded folio copying, we are able to
> achieve significant improvements in page migration throughput for large
> pages.
>
> Discussion points:
> 1. Performance:
> a. Policy decision for DMA and CPU selection
> b. Platform-specific scheduling of folio-copy worker threads for better
> bandwidth utilization
Why platform specific? I *assume* this means a generic framework that can
optimize for scheduling based on the underlying hardware and not specific
implementations that can only be used on AMD, for example. Is that the
case?
> c. Using Non-temporal instructions for CPU-based memcpy
> d. Upscaling/downscaling worker threads based on migration size, CPU
> availability (system load), bandwidth saturation, etc.
> 2. Interface requirements with DMA hardware:
> a. Standardizing APIs for DMA drivers and support for different DMA
> drivers
> b. Enhancing DMA drivers for bulk copying (e.g., SDXi Engine)
> 3. Resources Accounting:
> a. CPU cgroups accounting and fairness [3]
> b. Who bears migration cost? - (Migration cost attribution)
>
> References:
> [1] https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
> [2] https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
> [3] https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com
>
* Re: [LSF/MM/BPF TOPIC] Enhancements to Page Migration with Multi-threading and Batch Offloading to DMA
2025-01-27 6:55 ` David Rientjes
@ 2025-01-27 12:37 ` Zi Yan
2025-01-27 13:55 ` Jonathan Cameron
2025-01-28 6:54 ` Shivank Garg
0 siblings, 2 replies; 8+ messages in thread
From: Zi Yan @ 2025-01-27 12:37 UTC (permalink / raw)
To: David Rientjes, Shivank Garg
Cc: akpm, lsf-pc, linux-mm, AneeshKumar.KizhakeVeetil, baolin.wang,
bharata, david, gregory.price, honggyu.kim, jane.chu, jhubbard,
jon.grimm, k.shutemov, leesuyeon0506, leillc, liam.howlett,
linux-kernel, mel.gorman, Michael.Day,
Raghavendra.KodsaraThimmappa, riel, santosh.shukla, shy828301,
sj, wangkefeng.wang, weixugc, willy, ying.huang
On 27 Jan 2025, at 1:55, David Rientjes wrote:
> On Thu, 23 Jan 2025, Shivank Garg wrote:
>
>> Hi all,
>>
>> Zi Yan and I would like to propose the topic: Enhancements to Page
>> Migration with Multi-threading and Batch Offloading to DMA.
>>
>
> I think this would be a very useful topic to discuss, thanks for proposing
> it.
>
>> Page migration is a critical operation in NUMA systems that can incur
>> significant overheads, affecting memory management performance across
>> various workloads. For example, copying folios between DRAM NUMA nodes
>> can take ~25% of the total migration cost for migrating 256MB of data.
>>
>> Modern systems are equipped with powerful DMA engines for bulk data
>> copying, GPUs, and high CPU core counts. Leveraging these hardware
>> capabilities becomes essential for systems where frequent page promotion
>> and demotion occur - from large-scale tiered-memory systems with CXL nodes
>> to CPU-GPU coherent system with GPU memory exposed as NUMA nodes.
>>
>
> Indeed, there are multiple use cases for optimizations in this area. With
> the ramp of memory tiered systems, I think there will be an even greater
> reliance on memory migration going forward.
>
> Do you have numbers to share on how offloading, even as a proof of
> concept, moves the needle compared to traditional and sequential memory
> migration?
For multi-threaded page migration, see my RFC patchset [1].
On NVIDIA Grace:
The 32-thread copy throughput can be up to 10x that of a single-threaded
serial folio copy. Batching folio copies benefits not only huge pages but
also base pages.
64KB (GB/s):

nr_pages  vanilla   mt_1   mt_2   mt_4   mt_8  mt_16  mt_32
      32     5.43   4.90   5.65   7.31   7.60   8.61   6.43
     256     6.95   6.89   9.28  14.67  22.41  23.39  23.93
     512     7.88   7.26  10.15  17.53  27.82  27.88  33.93
     768     7.65   7.42  10.46  18.59  28.65  29.67  30.76
    1024     7.46   8.01  10.90  17.77  27.04  32.18  38.80
2MB mTHP (GB/s):

nr_pages  vanilla   mt_1   mt_2   mt_4   mt_8  mt_16  mt_32
       1     5.94   2.90   6.90   8.56  11.16   8.76   6.41
       2     7.67   5.57   7.11  12.48  17.37  15.68  14.10
       4     8.01   6.04  10.25  20.14  22.52  27.79  25.28
       8     8.42   7.00  11.41  24.73  33.96  32.62  39.55
      16     9.41   6.91  12.23  27.51  43.95  49.15  51.38
      32    10.23   7.15  13.03  29.52  49.49  69.98  71.51
      64     9.40   7.37  13.88  30.38  52.00  76.89  79.41
     128     8.59   7.23  14.20  28.39  49.98  78.27  90.18
     256     8.43   7.16  14.59  28.14  48.78  76.88  92.28
     512     8.31   7.78  14.40  26.20  43.31  63.91  75.21
     768     8.30   7.86  14.83  27.41  46.25  69.85  81.31
    1024     8.31   7.90  14.96  27.62  46.75  71.76  83.84
I also ran it on a two-socket Xeon E5-2650 v4:
4KB (GB/s)
| ---- | ------- | ---- | ---- | ---- | ---- | ----- |
| | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
| ---- | ------- | ---- | ---- | ---- | ---- | ----- |
| 512 | 1.12 | 1.19 | 1.20 | 1.26 | 1.27 | 1.35 |
| 768 | 1.29 | 1.14 | 1.28 | 1.40 | 1.39 | 1.46 |
| 1024 | 1.19 | 1.25 | 1.34 | 1.51 | 1.52 | 1.53 |
| 2048 | 1.14 | 1.12 | 1.44 | 1.61 | 1.73 | 1.71 |
| 4096 | 1.09 | 1.14 | 1.46 | 1.64 | 1.81 | 1.78 |
2MB (GB/s)
| ---- | ------- | ---- | ---- | ----- | ----- | ----- |
| | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
| ---- | ------- | ---- | ---- | ----- | ----- | ----- |
| 1 | 2.03 | 2.21 | 2.69 | 2.93 | 3.17 | 3.14 |
| 2 | 2.28 | 2.13 | 3.54 | 4.50 | 4.72 | 4.72 |
| 4 | 2.92 | 2.93 | 4.44 | 6.50 | 7.24 | 7.06 |
| 8 | 2.29 | 2.37 | 3.21 | 6.86 | 8.83 | 8.44 |
| 16 | 2.10 | 2.09 | 4.57 | 8.06 | 8.32 | 9.70 |
| 32 | 2.22 | 2.21 | 4.43 | 8.96 | 9.37 | 11.54 |
| 64 | 2.35 | 2.35 | 3.15 | 7.77 | 10.77 | 13.61 |
| 128 | 2.48 | 2.53 | 5.12 | 8.18 | 11.01 | 15.62 |
| 256 | 2.55 | 2.53 | 5.44 | 8.25 | 12.73 | 16.49 |
| 512 | 2.61 | 2.52 | 5.73 | 11.26 | 17.18 | 16.97 |
| 768 | 2.55 | 2.53 | 5.90 | 11.41 | 14.86 | 17.15 |
| 1024 | 2.56 | 2.52 | 5.99 | 11.46 | 16.77 | 17.25 |
Shivank ran it on AMD EPYC Zen 5, after some tuning (spreading threads across different CCDs):
2MB pages (GB/s):

nr_pages  vanilla   mt:0   mt:1   mt:2   mt:4   mt:8  mt:16  mt:32
       1    10.74  11.04   4.68   8.17   6.47   6.09   3.97   6.20
       2    12.44   4.90  11.19  14.10  15.33   8.45  10.09   9.97
       4    14.82   9.80  11.93  18.35  21.82  17.09  10.53   7.51
       8    16.13   9.91  15.26  11.85  26.53  13.09  12.71  13.75
      16    15.99   8.81  13.84  22.43  33.89  11.91  12.30  13.26
      32    14.03  11.37  17.54  23.96  57.07  18.78  19.51  21.29
      64    15.79   9.55  22.19  33.17  57.18  65.51  55.39  62.53
     128    18.22  16.65  21.49  30.73  52.99  61.05  58.44  60.38
     256    19.78  20.56  24.72  34.94  56.73  71.11  61.83  62.77
     512    20.27  21.40  27.47  39.23  65.72  67.97  70.48  71.39
    1024    20.48  21.48  27.48  38.30  68.62  77.94  78.00  78.95
>
>> Existing page migration performs sequential page copying, underutilizing
>> modern CPU architectures and high-bandwidth memory subsystems.
>>
>> We have proposed and posted RFCs to enhance page migration through three
>> key techniques:
>> 1. Batching migration operations for bulk copying data [1]
>> 2. Multi-threaded folio copying [2]
>> 3. DMA offloading to hardware accelerators [1]
>>
>
> Curious: does memory migration of pages that are actively undergoing DMA
> with hardware assist fit into any of these?
It should be similar to 3, but in this case, DMA is used to copy pages
between NUMA nodes, whereas traditional DMA page migration is used to copy
pages between the host and devices.
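As a rough sketch of what such an offload could look like on top of the
generic dmaengine API (assumptions: a DMA_MEMCPY-capable channel exists,
src_dma/dst_dma are already DMA-mapped bus addresses, and the synchronous
wait is only for illustration; the interfaces actually proposed in [1] may
differ):

#include <linux/dmaengine.h>

static int dma_offload_copy(dma_addr_t dst_dma, dma_addr_t src_dma,
			    size_t len)
{
	struct dma_async_tx_descriptor *tx;
	struct dma_chan *chan;
	dma_cap_mask_t mask;
	dma_cookie_t cookie;
	int ret = 0;

	dma_cap_zero(mask);
	dma_cap_set(DMA_MEMCPY, mask);
	chan = dma_request_channel(mask, NULL, NULL);	/* any memcpy engine */
	if (!chan)
		return -ENODEV;

	tx = dmaengine_prep_dma_memcpy(chan, dst_dma, src_dma, len,
				       DMA_PREP_INTERRUPT);
	if (!tx) {
		ret = -EIO;
		goto out;
	}

	cookie = dmaengine_submit(tx);
	dma_async_issue_pending(chan);
	if (dma_sync_wait(chan, cookie) != DMA_COMPLETE)
		ret = -EIO;
out:
	dma_release_channel(chan);
	return ret;
}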
>
>> By employing batching and multi-threaded folio copying, we are able to
>> achieve significant improvements in page migration throughput for large
>> pages.
>>
>> Discussion points:
>> 1. Performance:
>> a. Policy decision for DMA and CPU selection
>> b. Platform-specific scheduling of folio-copy worker threads for better
>> bandwidth utilization
>
> Why platform specific? I *assume* this means a generic framework that can
> optimize for scheduling based on the underlying hardware and not specific
> implementations that can only be used on AMD, for example. Is that the
> case?
I think the framework will be generic, but the CPU scheduling (which cores
to choose for page copying) will differ from vendor to vendor.
Due to differences in CPU structure, such as chiplet designs, a single CPU
scheduling algorithm does not fit CPUs from different vendors. For example,
on NVIDIA Grace you can use any CPUs to copy pages and always achieve high
page-copy throughput, but on AMD CPUs with multiple CCDs, spreading copy
threads across different CCDs achieves much higher page-copy throughput
than putting all threads in a single CCD. I assume Intel CPUs with a
chiplet design would see the same result.
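Purely as an illustration of that placement policy, a userspace sketch that
pins each copy worker to a core on a different CCD (the ccd_first_cpu[]
mapping is an assumption about one particular machine; real code would
derive it from the cache/LLC topology rather than hard-code it):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Assumed layout: 8 cores per CCD, so CPUs 0, 8, 16, 24 sit on distinct CCDs. */
static const int ccd_first_cpu[] = { 0, 8, 16, 24 };

static int pin_worker_to_ccd(pthread_t tid, int worker)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(ccd_first_cpu[worker % 4], &set);
	return pthread_setaffinity_np(tid, sizeof(set), &set);
}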
>
>> c. Using Non-temporal instructions for CPU-based memcpy
>> d. Upscaling/downscaling worker threads based on migration size, CPU
>> availability (system load), bandwidth saturation, etc.
>> 2. Interface requirements with DMA hardware:
>> a. Standardizing APIs for DMA drivers and support for different DMA
>> drivers
>> b. Enhancing DMA drivers for bulk copying (e.g., SDXi Engine)
>> 3. Resources Accounting:
>> a. CPU cgroups accounting and fairness [3]
>> b. Who bears migration cost? - (Migration cost attribution)
>>
>> References:
>> [1] https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
>> [2] https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
>> [3] https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com
>>
[1] https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com/
--
Best Regards,
Yan, Zi
* Re: [LSF/MM/BPF TOPIC] Enhancements to Page Migration with Multi-threading and Batch Offloading to DMA
2025-01-27 12:37 ` Zi Yan
@ 2025-01-27 13:55 ` Jonathan Cameron
2025-01-27 16:30 ` Zi Yan
2025-01-28 6:54 ` Shivank Garg
1 sibling, 1 reply; 8+ messages in thread
From: Jonathan Cameron @ 2025-01-27 13:55 UTC (permalink / raw)
To: Zi Yan
Cc: David Rientjes, Shivank Garg, akpm, lsf-pc, linux-mm,
AneeshKumar.KizhakeVeetil, baolin.wang, bharata, david,
gregory.price, honggyu.kim, jane.chu, jhubbard, jon.grimm,
k.shutemov, leesuyeon0506, leillc, liam.howlett, linux-kernel,
mel.gorman, Michael.Day, Raghavendra.KodsaraThimmappa, riel,
santosh.shukla, shy828301, sj, wangkefeng.wang, weixugc, willy,
ying.huang
On Mon, 27 Jan 2025 07:37:19 -0500
Zi Yan <ziy@nvidia.com> wrote:
> On 27 Jan 2025, at 1:55, David Rientjes wrote:
>
> > On Thu, 23 Jan 2025, Shivank Garg wrote:
> >
> >> Hi all,
> >>
> >> Zi Yan and I would like to propose the topic: Enhancements to Page
> >> Migration with Multi-threading and Batch Offloading to DMA.
> >>
> >
> > I think this would be a very useful topic to discuss, thanks for proposing
> > it.
> >
> >> Page migration is a critical operation in NUMA systems that can incur
> >> significant overheads, affecting memory management performance across
> >> various workloads. For example, copying folios between DRAM NUMA nodes
> >> can take ~25% of the total migration cost for migrating 256MB of data.
> >>
> >> Modern systems are equipped with powerful DMA engines for bulk data
> >> copying, GPUs, and high CPU core counts. Leveraging these hardware
> >> capabilities becomes essential for systems where frequent page promotion
> >> and demotion occur - from large-scale tiered-memory systems with CXL nodes
> >> to CPU-GPU coherent system with GPU memory exposed as NUMA nodes.
Hi,
With the potential use case of tiered-memory (CXL) migration mentioned above,
I'm curious what application scenario makes us willing to burn lots of CPU
cores at once on that migration. I'm very interested in the DMA offload
aspect, but multithreading seems like a less good fit for that use case:
presumably there is something running that is seeing the poor effects of
memory latency, which is what makes the move look like a good idea in the
first place?
Or are we looking at some sort of demotion when the system is idle?
> >>
> >
> > Indeed, there are multiple use cases for optimizations in this area. With
> > the ramp of memory tiered systems, I think there will be an even greater
> > reliance on memory migration going forward.
> >
> > Do you have numbers to share on how offloading, even as a proof of
> > concept, moves the needle compared to traditional and sequential memory
> > migration?
>
> For multithreaded page migration, you can see my RFC patchset[1]:
>
> on NVIDIA Grace:
>
> The 32-thread copy throughput can be up to 10x of single thread serial folio
> copy. Batching folio copy not only benefits huge page but also base
> page.
>
> 64KB (GB/s):
>
> vanilla mt_1 mt_2 mt_4 mt_8 mt_16 mt_32
> 32 5.43 4.90 5.65 7.31 7.60 8.61 6.43
> 256 6.95 6.89 9.28 14.67 22.41 23.39 23.93
> 512 7.88 7.26 10.15 17.53 27.82 27.88 33.93
> 768 7.65 7.42 10.46 18.59 28.65 29.67 30.76
> 1024 7.46 8.01 10.90 17.77 27.04 32.18 38.80
>
> 2MB mTHP (GB/s):
>
> vanilla mt_1 mt_2 mt_4 mt_8 mt_16 mt_32
> 1 5.94 2.90 6.90 8.56 11.16 8.76 6.41
> 2 7.67 5.57 7.11 12.48 17.37 15.68 14.10
> 4 8.01 6.04 10.25 20.14 22.52 27.79 25.28
> 8 8.42 7.00 11.41 24.73 33.96 32.62 39.55
> 16 9.41 6.91 12.23 27.51 43.95 49.15 51.38
> 32 10.23 7.15 13.03 29.52 49.49 69.98 71.51
> 64 9.40 7.37 13.88 30.38 52.00 76.89 79.41
> 128 8.59 7.23 14.20 28.39 49.98 78.27 90.18
> 256 8.43 7.16 14.59 28.14 48.78 76.88 92.28
> 512 8.31 7.78 14.40 26.20 43.31 63.91 75.21
> 768 8.30 7.86 14.83 27.41 46.25 69.85 81.31
> 1024 8.31 7.90 14.96 27.62 46.75 71.76 83.84
>
>
> I also ran it on on a two socket Xeon E5-2650 v4:
>
>
> 4KB (GB/s)
>
> | ---- | ------- | ---- | ---- | ---- | ---- | ----- |
> | | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
> | ---- | ------- | ---- | ---- | ---- | ---- | ----- |
> | 512 | 1.12 | 1.19 | 1.20 | 1.26 | 1.27 | 1.35 |
> | 768 | 1.29 | 1.14 | 1.28 | 1.40 | 1.39 | 1.46 |
> | 1024 | 1.19 | 1.25 | 1.34 | 1.51 | 1.52 | 1.53 |
> | 2048 | 1.14 | 1.12 | 1.44 | 1.61 | 1.73 | 1.71 |
> | 4096 | 1.09 | 1.14 | 1.46 | 1.64 | 1.81 | 1.78 |
>
>
>
> 2MB (GB/s)
> | ---- | ------- | ---- | ---- | ----- | ----- | ----- |
> | | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
> | ---- | ------- | ---- | ---- | ----- | ----- | ----- |
> | 1 | 2.03 | 2.21 | 2.69 | 2.93 | 3.17 | 3.14 |
> | 2 | 2.28 | 2.13 | 3.54 | 4.50 | 4.72 | 4.72 |
> | 4 | 2.92 | 2.93 | 4.44 | 6.50 | 7.24 | 7.06 |
> | 8 | 2.29 | 2.37 | 3.21 | 6.86 | 8.83 | 8.44 |
> | 16 | 2.10 | 2.09 | 4.57 | 8.06 | 8.32 | 9.70 |
> | 32 | 2.22 | 2.21 | 4.43 | 8.96 | 9.37 | 11.54 |
> | 64 | 2.35 | 2.35 | 3.15 | 7.77 | 10.77 | 13.61 |
> | 128 | 2.48 | 2.53 | 5.12 | 8.18 | 11.01 | 15.62 |
> | 256 | 2.55 | 2.53 | 5.44 | 8.25 | 12.73 | 16.49 |
> | 512 | 2.61 | 2.52 | 5.73 | 11.26 | 17.18 | 16.97 |
> | 768 | 2.55 | 2.53 | 5.90 | 11.41 | 14.86 | 17.15 |
> | 1024 | 2.56 | 2.52 | 5.99 | 11.46 | 16.77 | 17.25 |
>
>
>
> Shivank ran it on AMD EPYC Zen 5, after some tuning (spread threads on different CCDs):
>
> 2MB pages (GB/s):
> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
> 1 10.74 11.04 4.68 8.17 6.47 6.09 3.97 6.20
> 2 12.44 4.90 11.19 14.10 15.33 8.45 10.09 9.97
> 4 14.82 9.80 11.93 18.35 21.82 17.09 10.53 7.51
> 8 16.13 9.91 15.26 11.85 26.53 13.09 12.71 13.75
> 16 15.99 8.81 13.84 22.43 33.89 11.91 12.30 13.26
> 32 14.03 11.37 17.54 23.96 57.07 18.78 19.51 21.29
> 64 15.79 9.55 22.19 33.17 57.18 65.51 55.39 62.53
> 128 18.22 16.65 21.49 30.73 52.99 61.05 58.44 60.38
> 256 19.78 20.56 24.72 34.94 56.73 71.11 61.83 62.77
> 512 20.27 21.40 27.47 39.23 65.72 67.97 70.48 71.39
> 1024 20.48 21.48 27.48 38.30 68.62 77.94 78.00 78.95
>
>
>
> >
> >> Existing page migration performs sequential page copying, underutilizing
> >> modern CPU architectures and high-bandwidth memory subsystems.
> >>
> >> We have proposed and posted RFCs to enhance page migration through three
> >> key techniques:
> >> 1. Batching migration operations for bulk copying data [1]
> >> 2. Multi-threaded folio copying [2]
> >> 3. DMA offloading to hardware accelerators [1]
> >>
> >
> > Curious: does memory migration of pages that are actively undergoing DMA
> > with hardware assist fit into any of these?
>
> It should be similar to 3, but in this case, DMA is used to copy pages
> between NUMA nodes, whereas traditional DMA page migration is used to copy
> pages between host and devices.
>
> >
> >> By employing batching and multi-threaded folio copying, we are able to
> >> achieve significant improvements in page migration throughput for large
> >> pages.
> >>
> >> Discussion points:
> >> 1. Performance:
> >> a. Policy decision for DMA and CPU selection
> >> b. Platform-specific scheduling of folio-copy worker threads for better
> >> bandwidth utilization
> >
> > Why platform specific? I *assume* this means a generic framework that can
> > optimize for scheduling based on the underlying hardware and not specific
> > implementations that can only be used on AMD, for example. Is that the
> > case?
>
> I think the framework will be generic but the CPU scheduling (which core
> to choose for page copying) will be different from vendor to vendor.
>
> Due to existing CPU structure, like chiplet design, a single CPU scheduling
> algorithm does not fit for CPUs from different vendors. For example, on
> NVIDIA Grace, you can use any CPUs to copy pages and always achieve high
> page copy throughput, but on AMD CPUs with multiple CCDs, spreading copy
> threads across different CCDs can achieve much higher page copy throughput
> than putting all threads in a single CCD. I assume Intel CPUs with chiplet
> design would see the same result.
On this I'd hope we could build something topology-aware enough to make
the right decisions. All the information should be available to do this
without having per-uarch-specific code.
Jonathan
* Re: [LSF/MM/BPF TOPIC] Enhancements to Page Migration with Multi-threading and Batch Offloading to DMA
2025-01-27 13:55 ` Jonathan Cameron
@ 2025-01-27 16:30 ` Zi Yan
0 siblings, 0 replies; 8+ messages in thread
From: Zi Yan @ 2025-01-27 16:30 UTC (permalink / raw)
To: Jonathan Cameron
Cc: David Rientjes, Shivank Garg, akpm, lsf-pc, linux-mm,
AneeshKumar.KizhakeVeetil, baolin.wang, bharata, david,
gregory.price, honggyu.kim, jane.chu, jhubbard, jon.grimm,
k.shutemov, leesuyeon0506, leillc, liam.howlett, linux-kernel,
mel.gorman, Michael.Day, Raghavendra.KodsaraThimmappa, riel,
santosh.shukla, shy828301, sj, wangkefeng.wang, weixugc, willy,
ying.huang
On 27 Jan 2025, at 8:55, Jonathan Cameron wrote:
> On Mon, 27 Jan 2025 07:37:19 -0500
> Zi Yan <ziy@nvidia.com> wrote:
>
>> On 27 Jan 2025, at 1:55, David Rientjes wrote:
>>
>>> On Thu, 23 Jan 2025, Shivank Garg wrote:
>>>
>>>> Hi all,
>>>>
>>>> Zi Yan and I would like to propose the topic: Enhancements to Page
>>>> Migration with Multi-threading and Batch Offloading to DMA.
>>>>
>>>
>>> I think this would be a very useful topic to discuss, thanks for proposing
>>> it.
>>>
>>>> Page migration is a critical operation in NUMA systems that can incur
>>>> significant overheads, affecting memory management performance across
>>>> various workloads. For example, copying folios between DRAM NUMA nodes
>>>> can take ~25% of the total migration cost for migrating 256MB of data.
>>>>
>>>> Modern systems are equipped with powerful DMA engines for bulk data
>>>> copying, GPUs, and high CPU core counts. Leveraging these hardware
>>>> capabilities becomes essential for systems where frequent page promotion
>>>> and demotion occur - from large-scale tiered-memory systems with CXL nodes
>>>> to CPU-GPU coherent system with GPU memory exposed as NUMA nodes.
> Hi,
>
>
> With the potential usecase of tiered memory (CXL) migration mentioned above,
> I'm curious what application scenario is such that we are willing to burn
> lots of CPU cores at once on that migration? I'm very interested in the
> DMA offload aspect, but multithreading for that usecase seems
> like a less good fit as presumably there is something running that is
> seeing the poor effects of memory latency that is making the move look
> like a good idea?
There are two scenarios: 1) all CPUs are busy, and 2) some CPUs are idle.
For the first one, it is a trade-off between a) spending CPU cycles on the
workload, whose performance is then limited by the long memory latency and
low memory bandwidth of accessing data on remote nodes, and b) spending CPU
cycles on moving hot data from remote to local much more quickly. Without
multithreading, we still “burn” one CPU core to migrate data slowly from
remote to local; with multithreading, we use more CPU resources to move the
data faster so that workloads run faster. Admittedly, not all workloads
would benefit from it, but if a workload sees a significant performance
boost when accessing data from local memory, using more CPU resources to
move it over should justify the cost.
>
> Or are we looking at some sort of demotion when the system is idle?
For the second scenario, many GPU/accelerator-intensive workloads spend
most of their time on the GPU/accelerator, leaving CPUs idle. It would be
a great use of these idle CPUs to shuffle hot and cold data to the right
place, so that the GPU/accelerator can always access hot data.
>
>>>>
>>>
>>> Indeed, there are multiple use cases for optimizations in this area. With
>>> the ramp of memory tiered systems, I think there will be an even greater
>>> reliance on memory migration going forward.
>>>
>>> Do you have numbers to share on how offloading, even as a proof of
>>> concept, moves the needle compared to traditional and sequential memory
>>> migration?
>>
>> For multithreaded page migration, you can see my RFC patchset[1]:
>>
>> on NVIDIA Grace:
>>
>> The 32-thread copy throughput can be up to 10x of single thread serial folio
>> copy. Batching folio copy not only benefits huge page but also base
>> page.
>>
>> 64KB (GB/s):
>>
>> vanilla mt_1 mt_2 mt_4 mt_8 mt_16 mt_32
>> 32 5.43 4.90 5.65 7.31 7.60 8.61 6.43
>> 256 6.95 6.89 9.28 14.67 22.41 23.39 23.93
>> 512 7.88 7.26 10.15 17.53 27.82 27.88 33.93
>> 768 7.65 7.42 10.46 18.59 28.65 29.67 30.76
>> 1024 7.46 8.01 10.90 17.77 27.04 32.18 38.80
>>
>> 2MB mTHP (GB/s):
>>
>> vanilla mt_1 mt_2 mt_4 mt_8 mt_16 mt_32
>> 1 5.94 2.90 6.90 8.56 11.16 8.76 6.41
>> 2 7.67 5.57 7.11 12.48 17.37 15.68 14.10
>> 4 8.01 6.04 10.25 20.14 22.52 27.79 25.28
>> 8 8.42 7.00 11.41 24.73 33.96 32.62 39.55
>> 16 9.41 6.91 12.23 27.51 43.95 49.15 51.38
>> 32 10.23 7.15 13.03 29.52 49.49 69.98 71.51
>> 64 9.40 7.37 13.88 30.38 52.00 76.89 79.41
>> 128 8.59 7.23 14.20 28.39 49.98 78.27 90.18
>> 256 8.43 7.16 14.59 28.14 48.78 76.88 92.28
>> 512 8.31 7.78 14.40 26.20 43.31 63.91 75.21
>> 768 8.30 7.86 14.83 27.41 46.25 69.85 81.31
>> 1024 8.31 7.90 14.96 27.62 46.75 71.76 83.84
>>
>>
>> I also ran it on on a two socket Xeon E5-2650 v4:
>>
>>
>> 4KB (GB/s)
>>
>> | ---- | ------- | ---- | ---- | ---- | ---- | ----- |
>> | | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
>> | ---- | ------- | ---- | ---- | ---- | ---- | ----- |
>> | 512 | 1.12 | 1.19 | 1.20 | 1.26 | 1.27 | 1.35 |
>> | 768 | 1.29 | 1.14 | 1.28 | 1.40 | 1.39 | 1.46 |
>> | 1024 | 1.19 | 1.25 | 1.34 | 1.51 | 1.52 | 1.53 |
>> | 2048 | 1.14 | 1.12 | 1.44 | 1.61 | 1.73 | 1.71 |
>> | 4096 | 1.09 | 1.14 | 1.46 | 1.64 | 1.81 | 1.78 |
>>
>>
>>
>> 2MB (GB/s)
>> | ---- | ------- | ---- | ---- | ----- | ----- | ----- |
>> | | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
>> | ---- | ------- | ---- | ---- | ----- | ----- | ----- |
>> | 1 | 2.03 | 2.21 | 2.69 | 2.93 | 3.17 | 3.14 |
>> | 2 | 2.28 | 2.13 | 3.54 | 4.50 | 4.72 | 4.72 |
>> | 4 | 2.92 | 2.93 | 4.44 | 6.50 | 7.24 | 7.06 |
>> | 8 | 2.29 | 2.37 | 3.21 | 6.86 | 8.83 | 8.44 |
>> | 16 | 2.10 | 2.09 | 4.57 | 8.06 | 8.32 | 9.70 |
>> | 32 | 2.22 | 2.21 | 4.43 | 8.96 | 9.37 | 11.54 |
>> | 64 | 2.35 | 2.35 | 3.15 | 7.77 | 10.77 | 13.61 |
>> | 128 | 2.48 | 2.53 | 5.12 | 8.18 | 11.01 | 15.62 |
>> | 256 | 2.55 | 2.53 | 5.44 | 8.25 | 12.73 | 16.49 |
>> | 512 | 2.61 | 2.52 | 5.73 | 11.26 | 17.18 | 16.97 |
>> | 768 | 2.55 | 2.53 | 5.90 | 11.41 | 14.86 | 17.15 |
>> | 1024 | 2.56 | 2.52 | 5.99 | 11.46 | 16.77 | 17.25 |
>>
>>
>>
>> Shivank ran it on AMD EPYC Zen 5, after some tuning (spread threads on different CCDs):
>>
>> 2MB pages (GB/s):
>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>> 1 10.74 11.04 4.68 8.17 6.47 6.09 3.97 6.20
>> 2 12.44 4.90 11.19 14.10 15.33 8.45 10.09 9.97
>> 4 14.82 9.80 11.93 18.35 21.82 17.09 10.53 7.51
>> 8 16.13 9.91 15.26 11.85 26.53 13.09 12.71 13.75
>> 16 15.99 8.81 13.84 22.43 33.89 11.91 12.30 13.26
>> 32 14.03 11.37 17.54 23.96 57.07 18.78 19.51 21.29
>> 64 15.79 9.55 22.19 33.17 57.18 65.51 55.39 62.53
>> 128 18.22 16.65 21.49 30.73 52.99 61.05 58.44 60.38
>> 256 19.78 20.56 24.72 34.94 56.73 71.11 61.83 62.77
>> 512 20.27 21.40 27.47 39.23 65.72 67.97 70.48 71.39
>> 1024 20.48 21.48 27.48 38.30 68.62 77.94 78.00 78.95
>>
>>
>>
>>>
>>>> Existing page migration performs sequential page copying, underutilizing
>>>> modern CPU architectures and high-bandwidth memory subsystems.
>>>>
>>>> We have proposed and posted RFCs to enhance page migration through three
>>>> key techniques:
>>>> 1. Batching migration operations for bulk copying data [1]
>>>> 2. Multi-threaded folio copying [2]
>>>> 3. DMA offloading to hardware accelerators [1]
>>>>
>>>
>>> Curious: does memory migration of pages that are actively undergoing DMA
>>> with hardware assist fit into any of these?
>>
>> It should be similar to 3, but in this case, DMA is used to copy pages
>> between NUMA nodes, whereas traditional DMA page migration is used to copy
>> pages between host and devices.
>>
>>>
>>>> By employing batching and multi-threaded folio copying, we are able to
>>>> achieve significant improvements in page migration throughput for large
>>>> pages.
>>>>
>>>> Discussion points:
>>>> 1. Performance:
>>>> a. Policy decision for DMA and CPU selection
>>>> b. Platform-specific scheduling of folio-copy worker threads for better
>>>> bandwidth utilization
>>>
>>> Why platform specific? I *assume* this means a generic framework that can
>>> optimize for scheduling based on the underlying hardware and not specific
>>> implementations that can only be used on AMD, for example. Is that the
>>> case?
>>
>> I think the framework will be generic but the CPU scheduling (which core
>> to choose for page copying) will be different from vendor to vendor.
>>
>> Due to existing CPU structure, like chiplet design, a single CPU scheduling
>> algorithm does not fit for CPUs from different vendors. For example, on
>> NVIDIA Grace, you can use any CPUs to copy pages and always achieve high
>> page copy throughput, but on AMD CPUs with multiple CCDs, spreading copy
>> threads across different CCDs can achieve much higher page copy throughput
>> than putting all threads in a single CCD. I assume Intel CPUs with chiplet
>> design would see the same result.
>
> On this I'd hope we could build something topology aware enough to make
> the right decisions. All the information should be available to do this
> without having per uarch specific code.
I agree. Currently, I am using a workqueue to copy data, but workqueues are
not aware of CPU topology or CPU idleness. I think either workqueues need to
be enhanced, or something else with better CPU resource management and
topology awareness should be used, to achieve optimal multi-threaded page
migration results.
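As a rough sketch of the kind of dispatch that implies (not code from the
RFC), per-chunk copy work could be queued on explicitly chosen CPUs with
queue_work_on(), keeping the topology/idleness policy in the caller's hands;
pick_cpu_for_worker() below is a hypothetical hook for that policy:

#include <linux/workqueue.h>
#include <linux/cpumask.h>
#include <linux/string.h>

struct copy_chunk {
	struct work_struct work;
	void *dst;
	const void *src;
	size_t len;
};

static void copy_chunk_fn(struct work_struct *work)
{
	struct copy_chunk *c = container_of(work, struct copy_chunk, work);

	memcpy(c->dst, c->src, c->len);	/* or a non-temporal variant */
}

static int pick_cpu_for_worker(int worker)
{
	/* Hypothetical policy hook, e.g. one CPU per LLC/CCD domain. */
	return worker % num_online_cpus();
}

static void dispatch_copy(struct workqueue_struct *wq,
			  struct copy_chunk *chunks, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		INIT_WORK(&chunks[i].work, copy_chunk_fn);
		queue_work_on(pick_cpu_for_worker(i), wq, &chunks[i].work);
	}
	for (i = 0; i < nr; i++)
		flush_work(&chunks[i].work);
}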
Best Regards,
Yan, Zi
* Re: [LSF/MM/BPF TOPIC] Enhancements to Page Migration with Multi-threading and Batch Offloading to DMA
2025-01-27 12:37 ` Zi Yan
2025-01-27 13:55 ` Jonathan Cameron
@ 2025-01-28 6:54 ` Shivank Garg
1 sibling, 0 replies; 8+ messages in thread
From: Shivank Garg @ 2025-01-28 6:54 UTC (permalink / raw)
To: Zi Yan, David Rientjes
Cc: akpm, lsf-pc, linux-mm, AneeshKumar.KizhakeVeetil, baolin.wang,
bharata, david, gregory.price, honggyu.kim, jane.chu, jhubbard,
jon.grimm, k.shutemov, leesuyeon0506, leillc, liam.howlett,
linux-kernel, mel.gorman, Michael.Day,
Raghavendra.KodsaraThimmappa, riel, santosh.shukla, shy828301,
sj, wangkefeng.wang, weixugc, willy, ying.huang,
Jonathan.Cameron
Hi David, Zi,
On 1/27/2025 6:07 PM, Zi Yan wrote:
> On 27 Jan 2025, at 1:55, David Rientjes wrote:
>
>> On Thu, 23 Jan 2025, Shivank Garg wrote:
>>
>>> Hi all,
>>>
>>> Zi Yan and I would like to propose the topic: Enhancements to Page
>>> Migration with Multi-threading and Batch Offloading to DMA.
>>>
>>
>> I think this would be a very useful topic to discuss, thanks for proposing
>> it.
Thanks for your interest in our proposal.
>>
>>> Page migration is a critical operation in NUMA systems that can incur
>>> significant overheads, affecting memory management performance across
>>> various workloads. For example, copying folios between DRAM NUMA nodes
>>> can take ~25% of the total migration cost for migrating 256MB of data.
>>>
>>> Modern systems are equipped with powerful DMA engines for bulk data
>>> copying, GPUs, and high CPU core counts. Leveraging these hardware
>>> capabilities becomes essential for systems where frequent page promotion
>>> and demotion occur - from large-scale tiered-memory systems with CXL nodes
>>> to CPU-GPU coherent system with GPU memory exposed as NUMA nodes.
>>>
>>
>> Indeed, there are multiple use cases for optimizations in this area. With
>> the ramp of memory tiered systems, I think there will be an even greater
>> reliance on memory migration going forward.
>>
>> Do you have numbers to share on how offloading, even as a proof of
>> concept, moves the needle compared to traditional and sequential memory
>> migration?
>
> For multithreaded page migration, you can see my RFC patchset[1]:
>
> on NVIDIA Grace:
>
> The 32-thread copy throughput can be up to 10x of single thread serial folio
> copy. Batching folio copy not only benefits huge page but also base
> page.
>
> 64KB (GB/s):
>
> vanilla mt_1 mt_2 mt_4 mt_8 mt_16 mt_32
> 32 5.43 4.90 5.65 7.31 7.60 8.61 6.43
> 256 6.95 6.89 9.28 14.67 22.41 23.39 23.93
> 512 7.88 7.26 10.15 17.53 27.82 27.88 33.93
> 768 7.65 7.42 10.46 18.59 28.65 29.67 30.76
> 1024 7.46 8.01 10.90 17.77 27.04 32.18 38.80
>
> 2MB mTHP (GB/s):
>
> vanilla mt_1 mt_2 mt_4 mt_8 mt_16 mt_32
> 1 5.94 2.90 6.90 8.56 11.16 8.76 6.41
> 2 7.67 5.57 7.11 12.48 17.37 15.68 14.10
> 4 8.01 6.04 10.25 20.14 22.52 27.79 25.28
> 8 8.42 7.00 11.41 24.73 33.96 32.62 39.55
> 16 9.41 6.91 12.23 27.51 43.95 49.15 51.38
> 32 10.23 7.15 13.03 29.52 49.49 69.98 71.51
> 64 9.40 7.37 13.88 30.38 52.00 76.89 79.41
> 128 8.59 7.23 14.20 28.39 49.98 78.27 90.18
> 256 8.43 7.16 14.59 28.14 48.78 76.88 92.28
> 512 8.31 7.78 14.40 26.20 43.31 63.91 75.21
> 768 8.30 7.86 14.83 27.41 46.25 69.85 81.31
> 1024 8.31 7.90 14.96 27.62 46.75 71.76 83.84
>
>
> I also ran it on on a two socket Xeon E5-2650 v4:
>
>
> 4KB (GB/s)
>
> | ---- | ------- | ---- | ---- | ---- | ---- | ----- |
> | | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
> | ---- | ------- | ---- | ---- | ---- | ---- | ----- |
> | 512 | 1.12 | 1.19 | 1.20 | 1.26 | 1.27 | 1.35 |
> | 768 | 1.29 | 1.14 | 1.28 | 1.40 | 1.39 | 1.46 |
> | 1024 | 1.19 | 1.25 | 1.34 | 1.51 | 1.52 | 1.53 |
> | 2048 | 1.14 | 1.12 | 1.44 | 1.61 | 1.73 | 1.71 |
> | 4096 | 1.09 | 1.14 | 1.46 | 1.64 | 1.81 | 1.78 |
>
>
>
> 2MB (GB/s)
> | ---- | ------- | ---- | ---- | ----- | ----- | ----- |
> | | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
> | ---- | ------- | ---- | ---- | ----- | ----- | ----- |
> | 1 | 2.03 | 2.21 | 2.69 | 2.93 | 3.17 | 3.14 |
> | 2 | 2.28 | 2.13 | 3.54 | 4.50 | 4.72 | 4.72 |
> | 4 | 2.92 | 2.93 | 4.44 | 6.50 | 7.24 | 7.06 |
> | 8 | 2.29 | 2.37 | 3.21 | 6.86 | 8.83 | 8.44 |
> | 16 | 2.10 | 2.09 | 4.57 | 8.06 | 8.32 | 9.70 |
> | 32 | 2.22 | 2.21 | 4.43 | 8.96 | 9.37 | 11.54 |
> | 64 | 2.35 | 2.35 | 3.15 | 7.77 | 10.77 | 13.61 |
> | 128 | 2.48 | 2.53 | 5.12 | 8.18 | 11.01 | 15.62 |
> | 256 | 2.55 | 2.53 | 5.44 | 8.25 | 12.73 | 16.49 |
> | 512 | 2.61 | 2.52 | 5.73 | 11.26 | 17.18 | 16.97 |
> | 768 | 2.55 | 2.53 | 5.90 | 11.41 | 14.86 | 17.15 |
> | 1024 | 2.56 | 2.52 | 5.99 | 11.46 | 16.77 | 17.25 |
>
>
>
> Shivank ran it on AMD EPYC Zen 5, after some tuning (spread threads on different CCDs):
>
> 2MB pages (GB/s):
> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
> 1 10.74 11.04 4.68 8.17 6.47 6.09 3.97 6.20
> 2 12.44 4.90 11.19 14.10 15.33 8.45 10.09 9.97
> 4 14.82 9.80 11.93 18.35 21.82 17.09 10.53 7.51
> 8 16.13 9.91 15.26 11.85 26.53 13.09 12.71 13.75
> 16 15.99 8.81 13.84 22.43 33.89 11.91 12.30 13.26
> 32 14.03 11.37 17.54 23.96 57.07 18.78 19.51 21.29
> 64 15.79 9.55 22.19 33.17 57.18 65.51 55.39 62.53
> 128 18.22 16.65 21.49 30.73 52.99 61.05 58.44 60.38
> 256 19.78 20.56 24.72 34.94 56.73 71.11 61.83 62.77
> 512 20.27 21.40 27.47 39.23 65.72 67.97 70.48 71.39
> 1024 20.48 21.48 27.48 38.30 68.62 77.94 78.00 78.95
>
>
>
>>
>>> Existing page migration performs sequential page copying, underutilizing
>>> modern CPU architectures and high-bandwidth memory subsystems.
>>>
>>> We have proposed and posted RFCs to enhance page migration through three
>>> key techniques:
>>> 1. Batching migration operations for bulk copying data [1]
>>> 2. Multi-threaded folio copying [2]
>>> 3. DMA offloading to hardware accelerators [1]
>>>
>>
>> Curious: does memory migration of pages that are actively undergoing DMA
>> with hardware assist fit into any of these?
>
> It should be similar to 3, but in this case, DMA is used to copy pages
> between NUMA nodes, whereas traditional DMA page migration is used to copy
> pages between host and devices.
>
I'm planning to test using SDXi as the DMA engine for offload, and AFAIU it
doesn't support migrating pages that are actively undergoing DMA.
>>
>>> By employing batching and multi-threaded folio copying, we are able to
>>> achieve significant improvements in page migration throughput for large
>>> pages.
>>>
>>> Discussion points:
>>> 1. Performance:
>>> a. Policy decision for DMA and CPU selection
>>> b. Platform-specific scheduling of folio-copy worker threads for better
>>> bandwidth utilization
>>
>> Why platform specific? I *assume* this means a generic framework that can
>> optimize for scheduling based on the underlying hardware and not specific
>> implementations that can only be used on AMD, for example. Is that the
>> case?
>
> I think the framework will be generic but the CPU scheduling (which core
> to choose for page copying) will be different from vendor to vendor.
>
> Due to existing CPU structure, like chiplet design, a single CPU scheduling
> algorithm does not fit for CPUs from different vendors. For example, on
> NVIDIA Grace, you can use any CPUs to copy pages and always achieve high
> page copy throughput, but on AMD CPUs with multiple CCDs, spreading copy
> threads across different CCDs can achieve much higher page copy throughput
> than putting all threads in a single CCD. I assume Intel CPUs with chiplet
> design would see the same result.
Thank you, Zi, for helping with the results and queries.
>
>>
>>> c. Using Non-temporal instructions for CPU-based memcpy
>>> d. Upscaling/downscaling worker threads based on migration size, CPU
>>> availability (system load), bandwidth saturation, etc.
>>> 2. Interface requirements with DMA hardware:
>>> a. Standardizing APIs for DMA drivers and support for different DMA
>>> drivers
>>> b. Enhancing DMA drivers for bulk copying (e.g., SDXi Engine)
>>> 3. Resources Accounting:
>>> a. CPU cgroups accounting and fairness [3]
>>> b. Who bears migration cost? - (Migration cost attribution)
>>>
>>> References:
>>> [1] https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
>>> [2] https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
>>> [3] https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com
>>>
>
> [1] https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com/
> --
> Best Regards,
> Yan, Zi
>
* Re: [LSF/MM/BPF TOPIC] Enhancements to Page Migration with Multi-threading and Batch Offloading to DMA
2025-01-23 5:55 [LSF/MM/BPF TOPIC] Enhancements to Page Migration with Multi-threading and Batch Offloading to DMA Shivank Garg
2025-01-27 6:55 ` David Rientjes
@ 2025-03-24 6:01 ` Shivank Garg
2025-03-25 5:20 ` Shivank Garg
1 sibling, 1 reply; 8+ messages in thread
From: Shivank Garg @ 2025-03-24 6:01 UTC (permalink / raw)
To: akpm, lsf-pc, linux-mm, ziy
Cc: AneeshKumar.KizhakeVeetil, baolin.wang, bharata, david,
gregory.price, honggyu.kim, jane.chu, jhubbard, jon.grimm,
k.shutemov, leesuyeon0506, leillc, liam.howlett, linux-kernel,
mel.gorman, Michael.Day, Raghavendra.KodsaraThimmappa, riel,
rientjes, santosh.shukla, shy828301, sj, wangkefeng.wang,
weixugc, willy, ying.huang, wei.huang2, Jonathan.Cameron,
byungchul
On 1/23/2025 11:25 AM, Shivank Garg wrote:
> Hi all,
>
> Zi Yan and I would like to propose the topic: Enhancements to Page
> Migration with Multi-threading and Batch Offloading to DMA.
>
> Page migration is a critical operation in NUMA systems that can incur
> significant overheads, affecting memory management performance across
> various workloads. For example, copying folios between DRAM NUMA nodes
> can take ~25% of the total migration cost for migrating 256MB of data.
>
> Modern systems are equipped with powerful DMA engines for bulk data
> copying, GPUs, and high CPU core counts. Leveraging these hardware
> capabilities becomes essential for systems where frequent page promotion
> and demotion occur - from large-scale tiered-memory systems with CXL nodes
> to CPU-GPU coherent system with GPU memory exposed as NUMA nodes.
>
> Existing page migration performs sequential page copying, underutilizing
> modern CPU architectures and high-bandwidth memory subsystems.
>
> We have proposed and posted RFCs to enhance page migration through three
> key techniques:
> 1. Batching migration operations for bulk copying data [1]
> 2. Multi-threaded folio copying [2]
> 3. DMA offloading to hardware accelerators [1]
>
> By employing batching and multi-threaded folio copying, we are able to
> achieve significant improvements in page migration throughput for large
> pages.
>
> Discussion points:
> 1. Performance:
> a. Policy decision for DMA and CPU selection
> b. Platform-specific scheduling of folio-copy worker threads for better
> bandwidth utilization
> c. Using Non-temporal instructions for CPU-based memcpy
> d. Upscaling/downscaling worker threads based on migration size, CPU
> availability (system load), bandwidth saturation, etc.
> 2. Interface requirements with DMA hardware:
> a. Standardizing APIs for DMA drivers and support for different DMA
> drivers
> b. Enhancing DMA drivers for bulk copying (e.g., SDXi Engine)
> 3. Resources Accounting:
> a. CPU cgroups accounting and fairness [3]
> b. Who bears migration cost? - (Migration cost attribution)
>
Hi all,
For reference, here is the link to the latest RFC v2:
https://lore.kernel.org/linux-mm/20250319192211.10092-1-shivankg@amd.com
This version combines the ideas discussed in [1] and [2] and includes details
on performance improvements and experimental findings to provide more context
for discussion.
> References:
> [1] https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
> [2] https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
> [3] https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com
Looking forward to your feedback!
Thanks,
Shivank
* Re: [LSF/MM/BPF TOPIC] Enhancements to Page Migration with Multi-threading and Batch Offloading to DMA
2025-03-24 6:01 ` Shivank Garg
@ 2025-03-25 5:20 ` Shivank Garg
0 siblings, 0 replies; 8+ messages in thread
From: Shivank Garg @ 2025-03-25 5:20 UTC (permalink / raw)
To: akpm, lsf-pc, linux-mm, ziy
Cc: AneeshKumar.KizhakeVeetil, baolin.wang, bharata, david,
gregory.price, honggyu.kim, jane.chu, jhubbard, jon.grimm,
k.shutemov, leesuyeon0506, leillc, liam.howlett, linux-kernel,
mel.gorman, Michael.Day, Raghavendra.KodsaraThimmappa, riel,
rientjes, santosh.shukla, shy828301, sj, wangkefeng.wang,
weixugc, willy, ying.huang, wei.huang2, Jonathan.Cameron,
byungchul
On 3/24/2025 11:31 AM, Shivank Garg wrote:
>
>
> On 1/23/2025 11:25 AM, Shivank Garg wrote:
>> Hi all,
>>
>> Zi Yan and I would like to propose the topic: Enhancements to Page
>> Migration with Multi-threading and Batch Offloading to DMA.
>>
>> Page migration is a critical operation in NUMA systems that can incur
>> significant overheads, affecting memory management performance across
>> various workloads. For example, copying folios between DRAM NUMA nodes
>> can take ~25% of the total migration cost for migrating 256MB of data.
>>
>> Modern systems are equipped with powerful DMA engines for bulk data
>> copying, GPUs, and high CPU core counts. Leveraging these hardware
>> capabilities becomes essential for systems where frequent page promotion
>> and demotion occur - from large-scale tiered-memory systems with CXL nodes
>> to CPU-GPU coherent system with GPU memory exposed as NUMA nodes.
>>
>> Existing page migration performs sequential page copying, underutilizing
>> modern CPU architectures and high-bandwidth memory subsystems.
>>
>> We have proposed and posted RFCs to enhance page migration through three
>> key techniques:
>> 1. Batching migration operations for bulk copying data [1]
>> 2. Multi-threaded folio copying [2]
>> 3. DMA offloading to hardware accelerators [1]
>>
>> By employing batching and multi-threaded folio copying, we are able to
>> achieve significant improvements in page migration throughput for large
>> pages.
>>
>> Discussion points:
>> 1. Performance:
>> a. Policy decision for DMA and CPU selection
>> b. Platform-specific scheduling of folio-copy worker threads for better
>> bandwidth utilization
>> c. Using Non-temporal instructions for CPU-based memcpy
>> d. Upscaling/downscaling worker threads based on migration size, CPU
>> availability (system load), bandwidth saturation, etc.
>> 2. Interface requirements with DMA hardware:
>> a. Standardizing APIs for DMA drivers and support for different DMA
>> drivers
>> b. Enhancing DMA drivers for bulk copying (e.g., SDXi Engine)
>> 3. Resources Accounting:
>> a. CPU cgroups accounting and fairness [3]
>> b. Who bears migration cost? - (Migration cost attribution)
>>
>
> Hi all,
>
> For reference, here is the link to the latest RFC v2:
>
> https://lore.kernel.org/linux-mm/20250319192211.10092-1-shivankg@amd.com
>
> This version combines the ideas discussed in [1] and [2] and includes details
> on performance improvements and experimental findings to provide more context
> for discussion.
Sharing the slides from today’s presentation:
Main Slide Deck: https://docs.google.com/presentation/d/1mjl5-jiz-TMVRK9bQcQ_IsSXrIP82CqWS8Q6em3mJi0/edit?usp=sharing
Multi-threading Slide Deck: https://docs.google.com/presentation/d/10czypcUbRMOUn6knp340Cwv4bf83Ha2gUX8TwNXUwCs/edit#slide=id.p6
Thanks,
Shivank
>
>> References:
>> [1] https://lore.kernel.org/all/20240614221525.19170-1-shivankg@amd.com
>> [2] https://lore.kernel.org/all/20250103172419.4148674-1-ziy@nvidia.com
>> [3] https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com
>
> Looking forward to your feedback!
>
> Thanks,
> Shivank
>