From: Shivank Garg <shivankg@amd.com>
To: Zi Yan <ziy@nvidia.com>
Cc: linux-mm@kvack.org, David Rientjes <rientjes@google.com>,
Aneesh Kumar <AneeshKumar.KizhakeVeetil@arm.com>,
David Hildenbrand <david@redhat.com>,
John Hubbard <jhubbard@nvidia.com>,
Kirill Shutemov <k.shutemov@gmail.com>,
Matthew Wilcox <willy@infradead.org>,
Mel Gorman <mel.gorman@gmail.com>,
"Rao, Bharata Bhasker" <bharata@amd.com>,
Rik van Riel <riel@surriel.com>,
RaghavendraKT <Raghavendra.KodsaraThimmappa@amd.com>,
Wei Xu <weixugc@google.com>, Suyeon Lee <leesuyeon0506@gmail.com>,
Lei Chen <leillc@google.com>,
"Shukla, Santosh" <santosh.shukla@amd.com>,
"Grimm, Jon" <jon.grimm@amd.com>,
sj@kernel.org, shy828301@gmail.com,
Liam Howlett <liam.howlett@oracle.com>,
Gregory Price <gregory.price@memverge.com>,
"Huang, Ying" <ying.huang@linux.alibaba.com>
Subject: Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
Date: Thu, 16 Jan 2025 10:27:49 +0530
Message-ID: <3212f4d5-afdb-47fe-a2ea-ad61c69836af@amd.com>
In-Reply-To: <334B7551-7834-44E7-91E6-4AE4C0B382AF@nvidia.com>
On 1/11/2025 1:21 AM, Zi Yan wrote:
<snip>
>>> BTW, I notice that you called dmaengine_get_dma_device() in folios_copy_dma(),
>>> which would incur a huge overhead, based on my past experience using a DMA engine
>>> for page copy. I know it is needed to make sure DMA is still present, but
>>> its cost needs to be minimized to make DMA folio copy usable. Otherwise,
>>> the 768MB/s DMA copy throughput from your cover letter cannot convince people
>>> to use it for page migration, since a single CPU can achieve more than that,
>>> as you showed in the table below.
Thank you for pointing this out.
I'm learning about the DMA engine and will look more into the DMA driver part.
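One direction that looks promising is to resolve the DMA channel and its
struct device once and reuse them for every batch, instead of calling
dmaengine_get_dma_device() inside folios_copy_dma(). A rough, untested
sketch of the idea (the mig_dma_* names are only illustrative, not from
the actual patch):

#include <linux/dmaengine.h>

static struct dma_chan *mig_dma_chan;
static struct device *mig_dma_dev;

static int mig_dma_init(void)
{
	dma_cap_mask_t mask;

	dma_cap_zero(mask);
	dma_cap_set(DMA_MEMCPY, mask);

	/* Grab a memcpy-capable channel once at setup time. */
	mig_dma_chan = dma_request_channel(mask, NULL, NULL);
	if (!mig_dma_chan)
		return -ENODEV;

	/* Cache the device so the per-folio copy path can use it directly. */
	mig_dma_dev = dmaengine_get_dma_device(mig_dma_chan);
	return 0;
}

static void mig_dma_exit(void)
{
	if (mig_dma_chan)
		dma_release_channel(mig_dma_chan);
}

folios_copy_dma() would then map and submit against the cached
mig_dma_dev/mig_dma_chan, so the lookup cost is paid once rather than
per call.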
>>>> Using init_on_alloc=0 gave a significant performance gain over the last experiment,
>>>> but I'm still missing the performance scaling you observed.
>>>
>>> It might be the difference between x86 and ARM64, but I am not 100% sure.
>>> Based on your data below, 2 or 4 threads seem to be the sweet spot for
>>> the multi-threaded method on AMD CPUs. BTW, what is the bandwidth between
>>> two sockets in your system? From Figure 10 in [1], I see the InfiniBand
>>> between two AMD EPYC 7601 @ 2.2GHz was measured at ~12GB/s unidirectional,
>>> ~25GB/s bidirectional. I wonder if your results below are cross-socket
>>> link bandwidth limited.
I tested the cross-socket bandwidth on my EPYC Zen 5 system and am easily getting
more than 10x that bandwidth. I don't think bandwidth is an issue here.
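For reference, a rough single-threaded estimate can be obtained from userspace
with libnuma along these lines (just a sketch, not the exact methodology I used;
a single thread will not saturate the link, so the real peak is higher):

#include <numa.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define SZ (1UL << 30)	/* 1 GiB per buffer */

int main(void)
{
	struct timespec t0, t1;
	void *src, *dst;
	double secs;

	if (numa_available() < 0)
		return 1;

	numa_run_on_node(0);			/* run on a node-0 CPU */
	src = numa_alloc_onnode(SZ, 0);		/* local source buffer */
	dst = numa_alloc_onnode(SZ, 1);		/* buffer on the remote socket's node */
	if (!src || !dst)
		return 1;

	memset(src, 1, SZ);			/* fault all pages in first */
	memset(dst, 0, SZ);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	memcpy(dst, src, SZ);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("cross-socket copy: %.2f GB/s\n", SZ / secs / 1e9);

	numa_free(src, SZ);
	numa_free(dst, SZ);
	return 0;
}

(Build with gcc -O2 -lnuma; multiple concurrent streams are needed to approach
the link's peak.)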
>>>
>>> From my results, NVIDIA Grace CPU can achieve high copy throughput
>>> with more threads between two sockets, maybe part of the reason is that
>>> its cross-socket link theoretical bandwidth is 900GB/s bidirectional.
>>
>> I talked to my colleague about this and he mentioned the CCD architecture
>> on AMD CPUs. IIUC, one or two cores from one CCD can already saturate
>> the CCD's outgoing bandwidth, and all CPUs are enumerated from one CCD to
>> another. This means my naive scheduling algorithm, which uses CPUs from
>> 0 to N threads, uses all cores from one CCD first, then moves to another
>> CCD. It is not able to saturate the cross-socket bandwidth. Does it make
>> sense to you?
>>
>> If yes, can you please change my CPU selection code in mm/copy_pages.c:
This makes sense.
I first tried distributing the worker threads across different CCDs, which yielded
better results.
I also switched my system to the NPS-2 config (2 nodes per socket), which lets me
compare intra-socket page migrations against cross-socket ones and take the
cross-socket link out of the picture as a variable.
Cross-Socket (Node 0 -> Node 2), THP Always (2 MB pages), throughput in GB/s:
nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
262144 12.37 12.52 15.72 24.94 30.40 33.23 34.68 29.67
524288 12.44 12.19 15.70 24.96 32.72 33.40 35.40 29.18
Intra-Socket (Node 0 -> Node 1), THP Always (2 MB pages), throughput in GB/s:
nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
262144 12.37 17.10 18.65 26.05 35.56 37.80 33.73 29.29
524288 12.44 16.73 18.87 24.34 35.63 37.49 33.79 29.76
I have temporarily hardcoded the CPU assignments and will work on improving the
CPU selection code.
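One possible shape for that (a rough, untested sketch for mm/copy_pages.c;
pick_spread_cpus() is a hypothetical helper, and it assumes CPUs within a node
are enumerated CCD by CCD as you described) is to stride through the node's
cpumask instead of taking the first N CPUs:

/*
 * Stride through the node's CPUs so that consecutive workers land
 * on different CCDs, instead of filling one CCD first.
 */
static void pick_spread_cpus(const struct cpumask *per_node_cpumask,
			     int *cpu_id_list, int total_mt_num)
{
	unsigned int nr_node_cpus = cpumask_weight(per_node_cpumask);
	unsigned int stride = max(1U, nr_node_cpus / total_mt_num);
	int cpu = cpumask_first(per_node_cpumask);
	int i;

	for (i = 0; i < total_mt_num && cpu < nr_cpu_ids; i++) {
		unsigned int s;

		cpu_id_list[i] = cpu;
		/* Skip ahead so the next pick is (likely) on another CCD. */
		for (s = 0; s < stride && cpu < nr_cpu_ids; s++)
			cpu = cpumask_next(cpu, per_node_cpumask);
	}
}

Grouping explicitly by LLC/CCD topology would be more robust than relying on
the enumeration order; I'll look into that for the proper CPU selection code.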
>>
>> + /* TODO: need a better cpu selection method */
>> + for_each_cpu(cpu, per_node_cpumask) {
>> + if (i >= total_mt_num)
>> + break;
>> + cpu_id_list[i] = cpu;
>> + ++i;
>> + }
>>
>> to select CPUs from as many CCDs as possible and rerun the tests.
>> That might boost the page migration throughput on AMD CPUs more.
>>
>> Thanks.
>>
>>>>
>>>> THP Never
>>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>>> 512 1.40 1.43 2.79 3.48 3.63 3.73 3.63 3.57
>>>> 4096 2.54 3.32 3.18 4.65 4.83 5.11 5.39 5.78
>>>> 8192 3.35 4.40 4.39 4.71 3.63 5.04 5.33 6.00
>>>> 16348 3.76 4.50 4.44 5.33 5.41 5.41 6.47 6.41
>>>>
>>>> THP Always
>>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>>> 512 5.21 5.47 5.77 6.92 3.71 2.75 7.54 7.44
>>>> 1024 6.10 7.65 8.12 8.41 8.87 8.55 9.13 11.36
>>>> 2048 6.39 6.66 9.58 8.92 10.75 12.99 13.33 12.23
>>>> 4096 7.33 10.85 8.22 13.57 11.43 10.93 12.53 16.86
>>>> 8192 7.26 7.46 8.88 11.82 10.55 10.94 13.27 14.11
>>>> 16348 9.07 8.53 11.82 14.89 12.97 13.22 16.14 18.10
>>>> 32768 10.45 10.55 11.79 19.19 16.85 17.56 20.58 26.57
>>>> 65536 11.00 11.12 13.25 18.27 16.18 16.11 19.61 27.73
>>>> 262144 12.37 12.40 15.65 20.00 19.25 19.38 22.60 31.95
>>>> 524288 12.44 12.33 15.66 19.78 19.06 18.96 23.31 32.29
>>>
>>> [1] https://www.dell.com/support/kbdoc/en-us/000143393/amd-epyc-stream-hpl-infiniband-and-wrf-performance-study
>
>
> BTW, I reran the experiments on a two-socket Xeon E5-2650 v4 @ 2.20GHz system with the pull method.
> The 4KB result is not very impressive, at most 60% more throughput, but 2MB can get ~6.5x the
> vanilla kernel throughput using 8 or 16 threads.
>
>
> 4KB (GB/s)
>
> | -------- | ------- | ---- | ---- | ---- | ---- | ----- |
> | nr_pages | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
> | -------- | ------- | ---- | ---- | ---- | ---- | ----- |
> | 512 | 1.12 | 1.19 | 1.20 | 1.26 | 1.27 | 1.35 |
> | 768 | 1.29 | 1.14 | 1.28 | 1.40 | 1.39 | 1.46 |
> | 1024 | 1.19 | 1.25 | 1.34 | 1.51 | 1.52 | 1.53 |
> | 2048 | 1.14 | 1.12 | 1.44 | 1.61 | 1.73 | 1.71 |
> | 4096 | 1.09 | 1.14 | 1.46 | 1.64 | 1.81 | 1.78 |
>
>
>
> 2MB (GB/s)
> | -------- | ------- | ---- | ---- | ----- | ----- | ----- |
> | nr_pages | vanilla | mt_1 | mt_2 | mt_4  | mt_8  | mt_16 |
> | -------- | ------- | ---- | ---- | ----- | ----- | ----- |
> | 1 | 2.03 | 2.21 | 2.69 | 2.93 | 3.17 | 3.14 |
> | 2 | 2.28 | 2.13 | 3.54 | 4.50 | 4.72 | 4.72 |
> | 4 | 2.92 | 2.93 | 4.44 | 6.50 | 7.24 | 7.06 |
> | 8 | 2.29 | 2.37 | 3.21 | 6.86 | 8.83 | 8.44 |
> | 16 | 2.10 | 2.09 | 4.57 | 8.06 | 8.32 | 9.70 |
> | 32 | 2.22 | 2.21 | 4.43 | 8.96 | 9.37 | 11.54 |
> | 64 | 2.35 | 2.35 | 3.15 | 7.77 | 10.77 | 13.61 |
> | 128 | 2.48 | 2.53 | 5.12 | 8.18 | 11.01 | 15.62 |
> | 256 | 2.55 | 2.53 | 5.44 | 8.25 | 12.73 | 16.49 |
> | 512 | 2.61 | 2.52 | 5.73 | 11.26 | 17.18 | 16.97 |
> | 768 | 2.55 | 2.53 | 5.90 | 11.41 | 14.86 | 17.15 |
> | 1024 | 2.56 | 2.52 | 5.99 | 11.46 | 16.77 | 17.25 |
>
I see, thank you for checking.
Meanwhile, I'll continue to explore performance optimization avenues.
Best Regards,
Shivank
>
>
> Best Regards,
> Yan, Zi
>