From: Zi Yan <ziy@nvidia.com>
To: Shivank Garg <shivankg@amd.com>
Cc: <linux-mm@kvack.org>, David Rientjes <rientjes@google.com>,
Aneesh Kumar <AneeshKumar.KizhakeVeetil@arm.com>,
David Hildenbrand <david@redhat.com>,
John Hubbard <jhubbard@nvidia.com>,
Kirill Shutemov <k.shutemov@gmail.com>,
Matthew Wilcox <willy@infradead.org>,
Mel Gorman <mel.gorman@gmail.com>,
"Rao, Bharata Bhasker" <bharata@amd.com>,
Rik van Riel <riel@surriel.com>,
RaghavendraKT <Raghavendra.KodsaraThimmappa@amd.com>,
Wei Xu <weixugc@google.com>, Suyeon Lee <leesuyeon0506@gmail.com>,
Lei Chen <leillc@google.com>,
"Shukla, Santosh" <santosh.shukla@amd.com>,
"Grimm, Jon" <jon.grimm@amd.com>, <sj@kernel.org>,
<shy828301@gmail.com>, Liam Howlett <liam.howlett@oracle.com>,
Gregory Price <gregory.price@memverge.com>,
"Huang, Ying" <ying.huang@linux.alibaba.com>
Subject: Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
Date: Fri, 10 Jan 2025 14:51:01 -0500
Message-ID: <334B7551-7834-44E7-91E6-4AE4C0B382AF@nvidia.com>
In-Reply-To: <D969919C-A241-432E-A0E3-353CCD8AC7E8@nvidia.com>
On 10 Jan 2025, at 12:05, Zi Yan wrote:
> <snip>
>>>
>>>>> main() {
>>>>> ...
>>>>>
>>>>> // code snippet to measure throughput
>>>>> clock_gettime(CLOCK_MONOTONIC, &t1);
>>>>> retcode = move_pages(getpid(), num_pages, pages, nodesArray, statusArray, MPOL_MF_MOVE);
>>>>> clock_gettime(CLOCK_MONOTONIC, &t2);
>>>>>
>>>>> // tput = num_pages*PAGE_SIZE/(t2-t1)
>>>>>
>>>>> ...
>>>>> }
>>>>>
>>>>>
>>>>> Measurements:
>>>>> ============
>>>>> vanilla: base kernel without patchset
>>>>> mt:0 = MT kernel with use_mt_copy=0
>>>>> mt:1..mt:32 = MT kernel with use_mt_copy=1 and thread cnt = 1,2,...,32
>>>>>
>>>>> Measured for both configurations, push_0_pull_1=0 and push_0_pull_1=1,
>>>>> for 4KB migration and THP migration.
>>>>>
>>>>> --------------------
>>>>> #1 push_0_pull_1 = 0 (src node CPUs are used)
>>>>>
>>>>> #1.1 THP=Never, 4KB (GB/s):
>>>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>>>> 512 1.28 1.28 1.92 1.80 2.24 2.35 2.22 2.17
>>>>> 4096 2.40 2.40 2.51 2.58 2.83 2.72 2.99 3.25
>>>>> 8192 3.18 2.88 2.83 2.69 3.49 3.46 3.57 3.80
>>>>> 16348 3.17 2.94 2.96 3.17 3.63 3.68 4.06 4.15
>>>>>
>>>>> #1.2 THP=Always, 2MB (GB/s):
>>>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>>>> 512 4.31 5.02 3.39 3.40 3.33 3.51 3.91 4.03
>>>>> 1024 7.13 4.49 3.58 3.56 3.91 3.87 4.39 4.57
>>>>> 2048 5.26 6.47 3.91 4.00 3.71 3.85 4.97 6.83
>>>>> 4096 9.93 7.77 4.58 3.79 3.93 3.53 6.41 4.77
>>>>> 8192 6.47 6.33 4.37 4.67 4.52 4.39 5.30 5.37
>>>>> 16348 7.66 8.00 5.20 5.22 5.24 5.28 6.41 7.02
>>>>> 32768 8.56 8.62 6.34 6.20 6.20 6.19 7.18 8.10
>>>>> 65536 9.41 9.40 7.14 7.15 7.15 7.19 7.96 8.89
>>>>> 262144 10.17 10.19 7.26 7.90 7.98 8.05 9.46 10.30
>>>>> 524288 10.40 9.95 7.25 7.93 8.02 8.76 9.55 10.30
>>>>>
>>>>> --------------------
>>>>> #2 push_0_pull_1 = 1 (dst node CPUs are used):
>>>>>
>>>>> #2.1 THP=Never 4KB (GB/s):
>>>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>>>> 512 1.28 1.36 2.01 2.74 2.33 2.31 2.53 2.96
>>>>> 4096 2.40 2.84 2.94 3.04 3.40 3.23 3.31 4.16
>>>>> 8192 3.18 3.27 3.34 3.94 3.77 3.68 4.23 4.76
>>>>> 16348 3.17 3.42 3.66 3.21 3.82 4.40 4.76 4.89
>>>>>
>>>>> #2.2 THP=Always 2MB (GB/s):
>>>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>>>> 512 4.31 5.91 4.03 3.73 4.26 4.13 4.78 3.44
>>>>> 1024 7.13 6.83 4.60 5.13 5.03 5.19 5.94 7.25
>>>>> 2048 5.26 7.09 5.20 5.69 5.83 5.73 6.85 8.13
>>>>> 4096 9.93 9.31 4.90 4.82 4.82 5.26 8.46 8.52
>>>>> 8192 6.47 7.63 5.66 5.85 5.75 6.14 7.45 8.63
>>>>> 16348 7.66 10.00 6.35 6.54 6.66 6.99 8.18 10.21
>>>>> 32768 8.56 9.78 7.06 7.41 7.76 9.02 9.55 11.92
>>>>> 65536 9.41 10.00 8.19 9.20 9.32 8.68 11.00 13.31
>>>>> 262144 10.17 11.17 9.01 9.96 9.99 10.00 11.70 14.27
>>>>> 524288 10.40 11.38 9.07 9.98 10.01 10.09 11.95 14.48
>>>>>
>>>>> Note:
>>>>> 1. For THP = Never: I'm using 16x the number of pages to keep the total size
>>>>> the same as your experiment with 64KB page size.
>>>>> 2. For THP = Always: nr_pages = number of 4KB pages moved
>>>>> (nr_pages=512 => 512 4KB pages => 1 2MB page).
>>>>>
>>>>>
>>>>> I'm seeing little (1.5x in some cases) to no benefit. The performance scaling is
>>>>> relatively flat across thread counts.
>>>>>
>>>>> Is it possible I'm missing something in my testing?
>>>>>
>>>>> Could the base page size difference (4KB vs 64KB) be playing a role in
>>>>> the scaling behavior? How does the performance vary with 4KB pages on your system?
>>>>>
>>>>> I'd be happy to work with you on investigating these differences.
>>>>> Let me know if you'd like any additional test data or if there are specific
>>>>> configurations I should try.
>>>>
>>>> The results surprise me, since I was able to achieve ~9GB/s when migrating
>>>> 16 2MB THPs with 16 threads on a two socket system with Xeon E5-2650 v3 @ 2.30GHz
>>>> (a 19.2GB/s bandwidth QPI link between two sockets) back in 2019[1].
>>>> These are 10-year-old Haswell CPUs. And your results above show that EPYC 5 can
>>>> only achieve ~4GB/s when migrating 512 2MB THPs with 16 threads. It just does
>>>> not make sense.
>>>>
>>>> One thing you might want to try is to set init_on_alloc=0 in your boot
>>>> parameters to use folio_zero_user() instead of GFP_ZERO to zero pages. That
>>>> might reduce the time spent on page zeroing.
>>>>
>>>> I am also going to rerun the experiments locally on x86_64 boxes to see if your
>>>> results can be replicated.
>>>>
>>>> Thank you for the review and for running these experiments. I really appreciate
>>>> it.
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
>>>>
>>>
>>> Using init_on_alloc=0 gave a significant performance gain over the last experiment,
>>> but I'm still missing the performance scaling you observed.
>>
>> It might be the difference between x86 and ARM64, but I am not 100% sure.
>> Based on your data below, 2 or 4 threads seem to be the sweet spot for
>> the multi-threaded method on AMD CPUs. BTW, what is the bandwidth between
>> two sockets in your system? From Figure 10 in [1], I see the InfiniBand
>> between two AMD EPYC 7601 @ 2.2GHz was measured at ~12GB/s unidirectional,
>> ~25GB/s bidirectional. I wonder if your results below are cross-socket
>> link bandwidth limited.
>>
>> From my results, the NVIDIA Grace CPU can achieve high copy throughput
>> with more threads between two sockets; part of the reason may be that
>> its cross-socket link's theoretical bandwidth is 900GB/s bidirectional.
>
> I talked to my colleague about this and he mentioned the CCD architecture
> on AMD CPUs. IIUC, one or two cores from one CCD can already saturate
> the CCD’s outgoing bandwidth, and all CPUs are enumerated from one CCD to
> another. This means my naive scheduling algorithm, which uses CPUs 0
> through N, uses all cores from one CCD first, then moves to another
> CCD. It is not able to saturate the cross-socket bandwidth. Does that make
> sense to you?
>
> If yes, can you please change my CPU selection code in mm/copy_pages.c:
>
> + /* TODO: need a better cpu selection method */
> + for_each_cpu(cpu, per_node_cpumask) {
> + if (i >= total_mt_num)
> + break;
> + cpu_id_list[i] = cpu;
> + ++i;
> + }
>
> to select CPUs from as many CCDs as possible and rerun the tests (a rough
> sketch of what I mean is included below, after the quoted tables).
> That might boost the page migration throughput on AMD CPUs more.
>
> Thanks.
>
>>>
>>> THP Never
>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>> 512 1.40 1.43 2.79 3.48 3.63 3.73 3.63 3.57
>>> 4096 2.54 3.32 3.18 4.65 4.83 5.11 5.39 5.78
>>> 8192 3.35 4.40 4.39 4.71 3.63 5.04 5.33 6.00
>>> 16348 3.76 4.50 4.44 5.33 5.41 5.41 6.47 6.41
>>>
>>> THP Always
>>> nr_pages vanilla mt:0 mt:1 mt:2 mt:4 mt:8 mt:16 mt:32
>>> 512 5.21 5.47 5.77 6.92 3.71 2.75 7.54 7.44
>>> 1024 6.10 7.65 8.12 8.41 8.87 8.55 9.13 11.36
>>> 2048 6.39 6.66 9.58 8.92 10.75 12.99 13.33 12.23
>>> 4096 7.33 10.85 8.22 13.57 11.43 10.93 12.53 16.86
>>> 8192 7.26 7.46 8.88 11.82 10.55 10.94 13.27 14.11
>>> 16348 9.07 8.53 11.82 14.89 12.97 13.22 16.14 18.10
>>> 32768 10.45 10.55 11.79 19.19 16.85 17.56 20.58 26.57
>>> 65536 11.00 11.12 13.25 18.27 16.18 16.11 19.61 27.73
>>> 262144 12.37 12.40 15.65 20.00 19.25 19.38 22.60 31.95
>>> 524288 12.44 12.33 15.66 19.78 19.06 18.96 23.31 32.29
>>
>> [1] https://www.dell.com/support/kbdoc/en-us/000143393/amd-epyc-stream-hpl-infiniband-and-wrf-performance-study
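To make the CCD-aware selection concrete, here is a rough, untested sketch of
what I have in mind. cpu_ccd_mask() is a hypothetical helper standing in for
whatever gives us the set of CPUs sharing a CCD (on x86, an LLC-sharing mask
could serve as an approximation); the function name and signature are
illustrative, not the actual code in the patchset:

/*
 * Hypothetical CCD-aware CPU selection (untested sketch).
 * cpu_ccd_mask(cpu) is assumed to return the cpumask of CPUs sharing
 * a CCD with @cpu; on x86 an LLC-sharing mask could approximate it.
 */
static int fill_cpu_id_list(const struct cpumask *per_node_cpumask,
			    int *cpu_id_list, int total_mt_num)
{
	cpumask_var_t candidates, this_pass;
	int cpu, i = 0;

	if (!alloc_cpumask_var(&candidates, GFP_KERNEL))
		return -ENOMEM;
	if (!alloc_cpumask_var(&this_pass, GFP_KERNEL)) {
		free_cpumask_var(candidates);
		return -ENOMEM;
	}

	cpumask_copy(candidates, per_node_cpumask);
	while (i < total_mt_num && !cpumask_empty(candidates)) {
		/* One pass: pick at most one unused CPU from each CCD. */
		cpumask_copy(this_pass, candidates);
		for_each_cpu(cpu, this_pass) {
			if (i >= total_mt_num)
				break;
			cpu_id_list[i++] = cpu;
			/* Do not pick this CPU again in later passes. */
			cpumask_clear_cpu(cpu, candidates);
			/* Skip the rest of this CPU's CCD for this pass. */
			cpumask_andnot(this_pass, this_pass, cpu_ccd_mask(cpu));
		}
	}

	free_cpumask_var(this_pass);
	free_cpumask_var(candidates);
	return 0;
}

The idea is simply round-robin across CCDs: the first pass takes one CPU from
each CCD on the node, the second pass takes the next CPU from each CCD, and so
on, instead of filling one CCD before moving to the next.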
BTW, I reran the experiments on a two-socket Xeon E5-2650 v4 @ 2.20GHz system with the pull method.
The 4KB results are not very impressive, at most 60% more throughput, but 2MB can reach ~6.5x the
vanilla kernel throughput using 8 or 16 threads.
4KB (GB/s); the first column is nr_pages (number of 4KB pages):
| ---- | ------- | ---- | ---- | ---- | ---- | ----- |
| | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
| ---- | ------- | ---- | ---- | ---- | ---- | ----- |
| 512 | 1.12 | 1.19 | 1.20 | 1.26 | 1.27 | 1.35 |
| 768 | 1.29 | 1.14 | 1.28 | 1.40 | 1.39 | 1.46 |
| 1024 | 1.19 | 1.25 | 1.34 | 1.51 | 1.52 | 1.53 |
| 2048 | 1.14 | 1.12 | 1.44 | 1.61 | 1.73 | 1.71 |
| 4096 | 1.09 | 1.14 | 1.46 | 1.64 | 1.81 | 1.78 |
2MB (GB/s); the first column is the number of 2MB THPs:
| ---- | ------- | ---- | ---- | ----- | ----- | ----- |
| | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
| ---- | ------- | ---- | ---- | ----- | ----- | ----- |
| 1 | 2.03 | 2.21 | 2.69 | 2.93 | 3.17 | 3.14 |
| 2 | 2.28 | 2.13 | 3.54 | 4.50 | 4.72 | 4.72 |
| 4 | 2.92 | 2.93 | 4.44 | 6.50 | 7.24 | 7.06 |
| 8 | 2.29 | 2.37 | 3.21 | 6.86 | 8.83 | 8.44 |
| 16 | 2.10 | 2.09 | 4.57 | 8.06 | 8.32 | 9.70 |
| 32 | 2.22 | 2.21 | 4.43 | 8.96 | 9.37 | 11.54 |
| 64 | 2.35 | 2.35 | 3.15 | 7.77 | 10.77 | 13.61 |
| 128 | 2.48 | 2.53 | 5.12 | 8.18 | 11.01 | 15.62 |
| 256 | 2.55 | 2.53 | 5.44 | 8.25 | 12.73 | 16.49 |
| 512 | 2.61 | 2.52 | 5.73 | 11.26 | 17.18 | 16.97 |
| 768 | 2.55 | 2.53 | 5.90 | 11.41 | 14.86 | 17.15 |
| 1024 | 2.56 | 2.52 | 5.99 | 11.46 | 16.77 | 17.25 |
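
In case it helps with reproducing the numbers, below is a self-contained sketch
of the same move_pages() timing loop as the snippet quoted at the top of this
thread. NR_PAGES and DST_NODE are illustrative placeholders and error handling
is minimal; build with "gcc -O2 -o movetput movetput.c -lnuma" and run it with
the buffer resident on the source node (e.g. under numactl --membind):

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define NR_PAGES	4096	/* illustrative: number of base pages to move */
#define DST_NODE	1	/* illustrative: destination NUMA node */

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	size_t len = (size_t)NR_PAGES * page_size;
	struct timespec t1, t2;
	void **pages;
	int *nodes, *status;
	char *buf;
	long i;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* Fault the pages in; they should land on the source node. */
	memset(buf, 1, len);

	pages = calloc(NR_PAGES, sizeof(*pages));
	nodes = calloc(NR_PAGES, sizeof(*nodes));
	status = calloc(NR_PAGES, sizeof(*status));
	for (i = 0; i < NR_PAGES; i++) {
		pages[i] = buf + i * page_size;
		nodes[i] = DST_NODE;
	}

	clock_gettime(CLOCK_MONOTONIC, &t1);
	if (move_pages(getpid(), NR_PAGES, pages, nodes, status, MPOL_MF_MOVE))
		perror("move_pages");
	clock_gettime(CLOCK_MONOTONIC, &t2);

	double secs = (t2.tv_sec - t1.tv_sec) + (t2.tv_nsec - t1.tv_nsec) / 1e9;
	printf("moved %zu bytes in %.6f s: %.2f GB/s\n", len, secs,
	       len / secs / 1e9);
	return 0;
}

Throughput is computed exactly as in the quoted snippet:
num_pages * page_size / (t2 - t1).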
Best Regards,
Yan, Zi