From: Shivank Garg <shivankg@amd.com>
To: Zi Yan <ziy@nvidia.com>
Cc: linux-mm@kvack.org, David Rientjes <rientjes@google.com>,
Aneesh Kumar <AneeshKumar.KizhakeVeetil@arm.com>,
David Hildenbrand <david@redhat.com>,
John Hubbard <jhubbard@nvidia.com>,
Kirill Shutemov <k.shutemov@gmail.com>,
Matthew Wilcox <willy@infradead.org>,
Mel Gorman <mel.gorman@gmail.com>,
"Rao, Bharata Bhasker" <bharata@amd.com>,
Rik van Riel <riel@surriel.com>,
RaghavendraKT <Raghavendra.KodsaraThimmappa@amd.com>,
Wei Xu <weixugc@google.com>, Suyeon Lee <leesuyeon0506@gmail.com>,
Lei Chen <leillc@google.com>,
"Shukla, Santosh" <santosh.shukla@amd.com>,
"Grimm, Jon" <jon.grimm@amd.com>,
sj@kernel.org, shy828301@gmail.com,
Liam Howlett <liam.howlett@oracle.com>,
Gregory Price <gregory.price@memverge.com>,
"Huang, Ying" <ying.huang@linux.alibaba.com>
Subject: Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
Date: Thu, 16 Jan 2025 10:27:49 +0530
Message-ID: <3212f4d5-afdb-47fe-a2ea-ad61c69836af@amd.com>
In-Reply-To: <334B7551-7834-44E7-91E6-4AE4C0B382AF@nvidia.com>
On 1/11/2025 1:21 AM, Zi Yan wrote:
<snip>
>>> BTW, I notice that you called dmaengine_get_dma_device() in folios_copy_dma(),
>>> which would incur a huge overhead, based on my past experience using a DMA engine
>>> for page copy. I know it is needed to make sure DMA is still present, but
>>> its cost needs to be minimized to make DMA folio copy usable. Otherwise,
>>> the 768MB/s DMA copy throughput from your cover letter cannot convince people
>>> to use it for page migration, since a single CPU can achieve more than that,
>>> as you showed in the table below.
Thank you for pointing this out.
I'm learning about the DMA engine and will look more into the DMA driver part.
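One direction that looks promising is to resolve the DMA channel and its
struct device once and reuse them for every batch, instead of calling
dmaengine_get_dma_device() inside folios_copy_dma(). A rough, untested
sketch of the idea (the mig_dma_* names are only illustrative, not from
the actual patch):

#include <linux/dmaengine.h>

static struct dma_chan *mig_dma_chan;
static struct device *mig_dma_dev;

static int mig_dma_init(void)
{
	dma_cap_mask_t mask;

	dma_cap_zero(mask);
	dma_cap_set(DMA_MEMCPY, mask);

	/* Grab a memcpy-capable channel once at setup time. */
	mig_dma_chan = dma_request_channel(mask, NULL, NULL);
	if (!mig_dma_chan)
		return -ENODEV;

	/* Cache the device so the per-folio copy path can use it directly. */
	mig_dma_dev = dmaengine_get_dma_device(mig_dma_chan);
	return 0;
}

static void mig_dma_exit(void)
{
	if (mig_dma_chan)
		dma_release_channel(mig_dma_chan);
}

folios_copy_dma() would then map and submit against the cached
mig_dma_dev/mig_dma_chan, so the lookup cost is paid once rather than
per call.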
>>>> Using init_on_alloc=0 gave a significant performance gain over the last experiment,
>>>> but I'm still missing the performance scaling you observed.
>>>
>>> It might be the difference between x86 and ARM64, but I am not 100% sure.
>>> Based on your data below, 2 or 4 threads seem to be the sweet spot for
>>> the multi-threaded method on AMD CPUs. BTW, what is the bandwidth between
>>> two sockets in your system? From Figure 10 in [1], I see the InfiniBand
>>> between two AMD EPYC 7601 @ 2.2GHz was measured at ~12GB/s unidirectional,
>>> ~25GB/s bidirectional. I wonder if your results below are cross-socket
>>> link bandwidth limited.
I tested the cross-socket bandwidth on my EPYC Zen 5 system and am easily getting
more than 10x that bandwidth. I don't think bandwidth is an issue here.
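For reference, a rough single-threaded estimate can be obtained from userspace
with libnuma along these lines (just a sketch, not the exact methodology I used;
a single thread will not saturate the link, so the real peak is higher):

#include <numa.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define SZ (1UL << 30)	/* 1 GiB per buffer */

int main(void)
{
	struct timespec t0, t1;
	void *src, *dst;
	double secs;

	if (numa_available() < 0)
		return 1;

	numa_run_on_node(0);			/* run on a node-0 CPU */
	src = numa_alloc_onnode(SZ, 0);		/* local source buffer */
	dst = numa_alloc_onnode(SZ, 1);		/* buffer on the remote socket's node */
	if (!src || !dst)
		return 1;

	memset(src, 1, SZ);			/* fault all pages in first */
	memset(dst, 0, SZ);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	memcpy(dst, src, SZ);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("cross-socket copy: %.2f GB/s\n", SZ / secs / 1e9);

	numa_free(src, SZ);
	numa_free(dst, SZ);
	return 0;
}

(Build with gcc -O2 -lnuma; multiple concurrent streams are needed to approach
the link's peak.)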
>>>
>>> From my results, NVIDIA Grace CPU can achieve high copy throughput
>>> with more threads between two sockets, maybe part of the reason is that
>>> its cross-socket link theoretical bandwidth is 900GB/s bidirectional.
>>
>> I talked to my colleague about this and he mentioned the CCD architecture
>> on AMD CPUs. IIUC, one or two cores from one CCD can already saturate
>> the CCD's outgoing bandwidth, and all CPUs are enumerated from one CCD to
>> another. This means my naive scheduling algorithm, which uses CPUs from
>> 0 to N threads, uses all cores from one CCD first, then moves to another
>> CCD. It is not able to saturate the cross-socket bandwidth. Does it make
>> sense to you?
>>
>> If yes, can you please change my CPU selection code in mm/copy_pages.c:
This makes sense.
I first tried distributing the worker threads across different CCDs, which yielded
better results.
I also switched my system to the NPS-2 config (2 nodes per socket), which lets me
compare intra-socket page migrations against cross-socket ones and take the
cross-socket link out of the picture as a variable.
Cross-Socket (Node 0 -> Node 2), THP Always (2 MB pages), throughput in GB/s:
nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
262144 12.37 12.52 15.72 24.94 30.40 33.23 34.68 29.67
524288 12.44 12.19 15.70 24.96 32.72 33.40 35.40 29.18
Intra-Socket (Node 0 -> Node 1), THP Always (2 MB pages), throughput in GB/s:
nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
262144 12.37 17.10 18.65 26.05 35.56 37.80 33.73 29.29
524288 12.44 16.73 18.87 24.34 35.63 37.49 33.79 29.76
I have temporarily hardcoded the CPU assignments and will work on improving the
CPU selection code.
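One possible shape for that (a rough, untested sketch for mm/copy_pages.c;
pick_spread_cpus() is a hypothetical helper, and it assumes CPUs within a node
are enumerated CCD by CCD as you described) is to stride through the node's
cpumask instead of taking the first N CPUs:

/*
 * Stride through the node's CPUs so that consecutive workers land
 * on different CCDs, instead of filling one CCD first.
 */
static void pick_spread_cpus(const struct cpumask *per_node_cpumask,
			     int *cpu_id_list, int total_mt_num)
{
	unsigned int nr_node_cpus = cpumask_weight(per_node_cpumask);
	unsigned int stride = max(1U, nr_node_cpus / total_mt_num);
	int cpu = cpumask_first(per_node_cpumask);
	int i;

	for (i = 0; i < total_mt_num && cpu < nr_cpu_ids; i++) {
		unsigned int s;

		cpu_id_list[i] = cpu;
		/* Skip ahead so the next pick is (likely) on another CCD. */
		for (s = 0; s < stride && cpu < nr_cpu_ids; s++)
			cpu = cpumask_next(cpu, per_node_cpumask);
	}
}

Grouping explicitly by LLC/CCD topology would be more robust than relying on
the enumeration order; I'll look into that for the proper CPU selection code.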
>>
>> + /* TODO: need a better cpu selection method */
>> + for_each_cpu(cpu, per_node_cpumask) {
>> + if (i >= total_mt_num)
>> + break;
>> + cpu_id_list[i] = cpu;
>> + ++i;
>> + }
>>
>> to select CPUs from as many CCDs as possible and rerun the tests.
>> That might boost the page migration throughput on AMD CPUs more.
>>
>> Thanks.
>>
>>>>
>>>> THP Never
>>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>>> 512 1.40 1.43 2.79 3.48 3.63 3.73 3.63 3.57
>>>> 4096 2.54 3.32 3.18 4.65 4.83 5.11 5.39 5.78
>>>> 8192 3.35 4.40 4.39 4.71 3.63 5.04 5.33 6.00
>>>> 16348 3.76 4.50 4.44 5.33 5.41 5.41 6.47 6.41
>>>>
>>>> THP Always
>>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>>> 512 5.21 5.47 5.77 6.92 3.71 2.75 7.54 7.44
>>>> 1024 6.10 7.65 8.12 8.41 8.87 8.55 9.13 11.36
>>>> 2048 6.39 6.66 9.58 8.92 10.75 12.99 13.33 12.23
>>>> 4096 7.33 10.85 8.22 13.57 11.43 10.93 12.53 16.86
>>>> 8192 7.26 7.46 8.88 11.82 10.55 10.94 13.27 14.11
>>>> 16348 9.07 8.53 11.82 14.89 12.97 13.22 16.14 18.10
>>>> 32768 10.45 10.55 11.79 19.19 16.85 17.56 20.58 26.57
>>>> 65536 11.00 11.12 13.25 18.27 16.18 16.11 19.61 27.73
>>>> 262144 12.37 12.40 15.65 20.00 19.25 19.38 22.60 31.95
>>>> 524288 12.44 12.33 15.66 19.78 19.06 18.96 23.31 32.29
>>>
>>> [1] https://www.dell.com/support/kbdoc/en-us/000143393/amd-epyc-stream-hpl-infiniband-and-wrf-performance-study
>
>
> BTW, I reran the experiments on a two-socket Xeon E5-2650 v4 @ 2.20GHz system with the pull method.
> The 4KB result is not very impressive, at most 60% more throughput, but 2MB can get ~6.5x the
> vanilla kernel throughput using 8 or 16 threads.
>
>
> 4KB (GB/s)
>
> | -------- | ------- | ---- | ---- | ---- | ---- | ----- |
> | nr_pages | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
> | -------- | ------- | ---- | ---- | ---- | ---- | ----- |
> | 512 | 1.12 | 1.19 | 1.20 | 1.26 | 1.27 | 1.35 |
> | 768 | 1.29 | 1.14 | 1.28 | 1.40 | 1.39 | 1.46 |
> | 1024 | 1.19 | 1.25 | 1.34 | 1.51 | 1.52 | 1.53 |
> | 2048 | 1.14 | 1.12 | 1.44 | 1.61 | 1.73 | 1.71 |
> | 4096 | 1.09 | 1.14 | 1.46 | 1.64 | 1.81 | 1.78 |
>
>
>
> 2MB (GB/s)
> | -------- | ------- | ---- | ---- | ----- | ----- | ----- |
> | nr_pages | vanilla | mt_1 | mt_2 | mt_4  | mt_8  | mt_16 |
> | -------- | ------- | ---- | ---- | ----- | ----- | ----- |
> | 1 | 2.03 | 2.21 | 2.69 | 2.93 | 3.17 | 3.14 |
> | 2 | 2.28 | 2.13 | 3.54 | 4.50 | 4.72 | 4.72 |
> | 4 | 2.92 | 2.93 | 4.44 | 6.50 | 7.24 | 7.06 |
> | 8 | 2.29 | 2.37 | 3.21 | 6.86 | 8.83 | 8.44 |
> | 16 | 2.10 | 2.09 | 4.57 | 8.06 | 8.32 | 9.70 |
> | 32 | 2.22 | 2.21 | 4.43 | 8.96 | 9.37 | 11.54 |
> | 64 | 2.35 | 2.35 | 3.15 | 7.77 | 10.77 | 13.61 |
> | 128 | 2.48 | 2.53 | 5.12 | 8.18 | 11.01 | 15.62 |
> | 256 | 2.55 | 2.53 | 5.44 | 8.25 | 12.73 | 16.49 |
> | 512 | 2.61 | 2.52 | 5.73 | 11.26 | 17.18 | 16.97 |
> | 768 | 2.55 | 2.53 | 5.90 | 11.41 | 14.86 | 17.15 |
> | 1024 | 2.56 | 2.52 | 5.99 | 11.46 | 16.77 | 17.25 |
>
I see, thank you for checking.
Meanwhile, I'll continue to explore performance optimization avenues.
Best Regards,
Shivank
>
>
> Best Regards,
> Yan, Zi
>