From: Zi Yan <ziy@nvidia.com>
To: Yang Shi <shy828301@gmail.com>
Cc: linux-mm@kvack.org, David Rientjes <rientjes@google.com>,
Shivank Garg <shivankg@amd.com>,
Aneesh Kumar <AneeshKumar.KizhakeVeetil@arm.com>,
David Hildenbrand <david@redhat.com>,
John Hubbard <jhubbard@nvidia.com>,
Kirill Shutemov <k.shutemov@gmail.com>,
Matthew Wilcox <willy@infradead.org>,
Mel Gorman <mel.gorman@gmail.com>,
"Rao, Bharata Bhasker" <bharata@amd.com>,
Rik van Riel <riel@surriel.com>,
RaghavendraKT <Raghavendra.KodsaraThimmappa@amd.com>,
Wei Xu <weixugc@google.com>, Suyeon Lee <leesuyeon0506@gmail.com>,
Lei Chen <leillc@google.com>,
"Shukla, Santosh" <santosh.shukla@amd.com>,
"Grimm, Jon" <jon.grimm@amd.com>,
sj@kernel.org, Liam Howlett <liam.howlett@oracle.com>,
Gregory Price <gregory.price@memverge.com>,
"Huang, Ying" <ying.huang@linux.alibaba.com>
Subject: Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
Date: Sun, 05 Jan 2025 21:33:21 -0500 [thread overview]
Message-ID: <E0AE4707-A31D-413A-99A6-422BCF57225C@nvidia.com> (raw)
In-Reply-To: <CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com>
On 3 Jan 2025, at 17:09, Yang Shi wrote:
> On Fri, Jan 3, 2025 at 9:24 AM Zi Yan <ziy@nvidia.com> wrote:
>>
>> Hi all,
>>
>> This patchset accelerates page migration by batching folio copy operations and
>> using multiple CPU threads. It is based on Shivank's "Enhancements to Page
>> Migration with Batch Offloading via DMA" patchset[1] and my original accelerated
>> page migration patchset[2], and applies on top of mm-everything-2025-01-03-05-59.
>> The last patch is for testing purposes and should not be considered for
>> inclusion.
>>
>> The motivations are:
>>
>> 1. Batching folio copies increases copy throughput. This especially helps base
>> page migrations, where copy throughput is low because kernel activities like
>> moving folio metadata and updating page table entries sit between consecutive
>> folio copies, and base page sizes are relatively small: 4KB on x86_64 and
>> ARM64, and 64KB on ARM64 when configured with 64KB base pages.
>>
>> 2. A single CPU thread has limited copy throughput. Using multiple threads is
>> a natural extension to speed up folio copies when a DMA engine is NOT
>> available in the system.
>>
>>
>> Design
>> ===
>>
>> This patchset is based on Shivank's patchset and revises MIGRATE_SYNC_NO_COPY
>> (renamed to MIGRATE_NO_COPY) to skip the folio copy operation inside
>> migrate_folio_move() and perform all copies in one shot afterwards. A
>> copy_page_lists_mt() function is added that uses multiple threads to copy
>> folios from the src list to the dst list.
>>
>> Changes compared to Shivank's patchset (mainly rewrote batching folio
>> copy code)
>> ===
>>
>> 1. mig_info is removed, so no memory allocation is needed during
>> batching folio copies. src->private is used to store old page state and
>> anon_vma after folio metadata is copied from src to dst.
>>
>> 2. move_to_new_folio() and migrate_folio_move() are refactored to remove
>> redundant code in migrate_folios_batch_move().
>>
>> 3. folio_mc_copy() is used for the single threaded copy code to keep the
>> original kernel behavior.
>>
>>
>> Performance
>> ===
>>
>> I benchmarked move_pages() throughput on a two socket NUMA system with two
>> NVIDIA Grace CPUs. The base page size is 64KB. Both 64KB page migration and 2MB
>> mTHP page migration are measured.
>>
>> The tables below show move_pages() throughput with different
>> configurations and different numbers of copied pages. The columns are the
>> configurations, from the vanilla Linux kernel to 1, 2, 4, 8, 16, and 32
>> threads with this patchset applied. The unit is GB/s.
>>
>> The 32-thread copy throughput can be up to 10x that of single-threaded serial
>> folio copy. Batching folio copies benefits not only huge pages but also base
>> pages.
>>
>> 64KB (GB/s):
>>
>> vanilla mt_1 mt_2 mt_4 mt_8 mt_16 mt_32
>> 32 5.43 4.90 5.65 7.31 7.60 8.61 6.43
>> 256 6.95 6.89 9.28 14.67 22.41 23.39 23.93
>> 512 7.88 7.26 10.15 17.53 27.82 27.88 33.93
>> 768 7.65 7.42 10.46 18.59 28.65 29.67 30.76
>> 1024 7.46 8.01 10.90 17.77 27.04 32.18 38.80
>>
>> 2MB mTHP (GB/s):
>>
>> vanilla mt_1 mt_2 mt_4 mt_8 mt_16 mt_32
>> 1 5.94 2.90 6.90 8.56 11.16 8.76 6.41
>> 2 7.67 5.57 7.11 12.48 17.37 15.68 14.10
>> 4 8.01 6.04 10.25 20.14 22.52 27.79 25.28
>> 8 8.42 7.00 11.41 24.73 33.96 32.62 39.55
>> 16 9.41 6.91 12.23 27.51 43.95 49.15 51.38
>> 32 10.23 7.15 13.03 29.52 49.49 69.98 71.51
>> 64 9.40 7.37 13.88 30.38 52.00 76.89 79.41
>> 128 8.59 7.23 14.20 28.39 49.98 78.27 90.18
>> 256 8.43 7.16 14.59 28.14 48.78 76.88 92.28
>> 512 8.31 7.78 14.40 26.20 43.31 63.91 75.21
>> 768 8.30 7.86 14.83 27.41 46.25 69.85 81.31
>> 1024 8.31 7.90 14.96 27.62 46.75 71.76 83.84
>
> Is this done on an idle system or a busy system? For real production
> workloads, all the CPUs are likely busy. It would be great to have the
> performance data collected from a busy system too.
Yes, it was done on an idle system.

I redid the experiments on a busy system by running stress on all CPU
cores. With system_unbound_wq, the results are not as good, since all
CPUs are occupied. After switching to system_highpri_wq, the throughput
got better, almost on par with the results on an idle machine. The
numbers are below.

It becomes a trade-off between page migration throughput and user
application performance on _a busy system_. If a page migration is badly
needed, system_highpri_wq can be used to retain high copy throughput;
otherwise, multiple threads should not be used.
64KB with system_unbound_wq on a busy system (GB/s):
| ---- | -------- | ---- | ---- | ---- | ---- | ----- | ----- |
| | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 | mt_32 |
| ---- | -------- | ---- | ---- | ---- | ---- | ----- | ----- |
| 32 | 4.05 | 1.51 | 1.32 | 1.20 | 4.31 | 1.05 | 0.02 |
| 256 | 6.91 | 3.93 | 4.61 | 0.08 | 4.46 | 4.30 | 3.89 |
| 512 | 7.28 | 4.87 | 1.81 | 6.18 | 4.38 | 5.58 | 6.10 |
| 768 | 4.57 | 5.72 | 5.35 | 5.24 | 5.94 | 5.66 | 0.20 |
| 1024 | 7.88 | 5.73 | 5.81 | 6.52 | 7.29 | 6.06 | 5.62 |
2MB with system_unbound_wq on a busy system (GB/s):
| ---- | ------- | ---- | ---- | ---- | ----- | ----- | ----- |
| | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 | mt_32 |
| ---- | ------- | ---- | ---- | ---- | ----- | ----- | ----- |
| 1 | 1.38 | 0.59 | 1.45 | 1.99 | 1.59 | 2.18 | 1.48 |
| 2 | 1.13 | 3.08 | 3.11 | 1.85 | 0.32 | 1.46 | 2.53 |
| 4 | 8.31 | 4.02 | 5.68 | 3.22 | 2.96 | 5.77 | 2.91 |
| 8 | 8.16 | 5.09 | 1.19 | 4.96 | 4.50 | 3.36 | 4.99 |
| 16 | 3.47 | 5.13 | 5.72 | 7.06 | 5.90 | 6.49 | 5.34 |
| 32 | 8.42 | 6.97 | 0.13 | 6.77 | 7.69 | 7.56 | 2.87 |
| 64 | 7.45 | 8.06 | 7.22 | 8.60 | 8.07 | 7.16 | 0.57 |
| 128 | 7.77 | 7.93 | 7.29 | 8.31 | 7.77 | 9.05 | 0.92 |
| 256 | 6.91 | 7.20 | 6.80 | 8.56 | 7.81 | 10.13 | 11.21 |
| 512 | 6.72 | 7.22 | 7.77 | 9.71 | 10.68 | 10.35 | 10.40 |
| 768 | 6.87 | 7.18 | 7.98 | 9.28 | 10.85 | 10.83 | 14.17 |
| 1024 | 6.95 | 7.23 | 8.03 | 9.59 | 10.88 | 10.22 | 20.27 |
64KB with system_highpri_wq on a busy system (GB/s):
| ---- | ------- | ---- | ---- | ----- | ----- | ----- | ----- |
| | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 | mt_32 |
| ---- | ------- | ---- | ---- | ----- | ----- | ----- | ----- |
| 32 | 4.05 | 2.63 | 1.62 | 1.90 | 3.34 | 3.71 | 3.40 |
| 256 | 6.91 | 5.16 | 4.33 | 8.07 | 6.81 | 10.31 | 13.51 |
| 512 | 7.28 | 4.89 | 6.43 | 15.72 | 11.31 | 18.03 | 32.69 |
| 768 | 4.57 | 6.27 | 6.42 | 11.06 | 8.56 | 14.91 | 9.24 |
| 1024 | 7.88 | 6.73 | 0.49 | 17.09 | 19.34 | 23.60 | 18.12 |
2MB with system_highpri_wq on a busy system (GB/s):
| ---- | ------- | ---- | ----- | ----- | ----- | ----- | ----- |
| | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 | mt_32 |
| ---- | ------- | ---- | ----- | ----- | ----- | ----- | ----- |
| 1 | 1.38 | 1.18 | 1.17 | 5.00 | 1.68 | 3.86 | 2.46 |
| 2 | 1.13 | 1.78 | 1.05 | 0.01 | 3.52 | 1.84 | 1.80 |
| 4 | 8.31 | 3.91 | 5.24 | 4.30 | 4.12 | 2.93 | 3.44 |
| 8 | 8.16 | 6.09 | 3.67 | 7.81 | 11.10 | 8.47 | 15.21 |
| 16 | 3.47 | 6.02 | 8.44 | 11.80 | 9.56 | 12.84 | 9.81 |
| 32 | 8.42 | 7.34 | 10.10 | 13.79 | 23.03 | 26.68 | 45.24 |
| 64 | 7.45 | 7.90 | 12.27 | 19.99 | 36.08 | 35.11 | 60.26 |
| 128 | 7.77 | 7.57 | 13.35 | 24.67 | 35.03 | 41.40 | 51.68 |
| 256 | 6.91 | 7.40 | 14.13 | 25.37 | 38.83 | 62.18 | 51.37 |
| 512 | 6.72 | 7.26 | 14.72 | 27.37 | 43.99 | 66.84 | 69.63 |
| 768 | 6.87 | 7.29 | 14.84 | 26.34 | 47.21 | 67.51 | 80.32 |
| 1024 | 6.95 | 7.26 | 14.88 | 26.98 | 47.75 | 74.99 | 85.00 |
>
>>
>>
>> TODOs
>> ===
>> 1. The multi-threaded folio copy routine needs to consult the CPU scheduler
>> and use only idle CPUs to avoid interfering with userspace workloads. Of
>> course, more complicated policies could be applied based on the migration
>> issuing thread's priority.
>
> The other potential problem is it is hard to attribute cpu time
> consumed by the migration work threads to cpu cgroups. In a
> multi-tenant environment this may result in unfair cpu time counting.
> However, it is a chronic problem to properly count cpu time for kernel
> threads. I'm not sure whether it has been solved or not.
>
>>
>> 2. Eliminate memory allocation during the multi-threaded folio copy routine
>> if possible.
>>
>> 3. A runtime check to decide when to use the multi-threaded folio copy,
>> e.g. based on the cache hotness issue mentioned by Matthew[3].
>>
>> 4. Use non-temporal CPU instructions to avoid cache pollution issues.
>
> AFAICT, arm64 already uses non-temporal instructions for copy page.
Right. My current implementation uses memcpy(), which does not use non-temporal
instructions on ARM64, since a huge page can be copied by multiple threads. A
non-temporal memcpy could be added for this use case.

Thank you for the input.
>
>>
>> 5. Explicitly make multi-threaded folio copy available only to
>> !HIGHMEM configurations, since kmap_local_page() would be needed in each
>> kernel folio copy worker thread and is expensive.
>>
>> 6. A better interface than copy_page_lists_mt() to allow DMA data copy
>> to be used as well.
>>
>> Let me know your thoughts. Thanks.
>>
>>
>> [1] https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@amd.com/
>> [2] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
>> [3] https://lore.kernel.org/linux-mm/Zm0SWZKcRrngCUUW@casper.infradead.org/
>>
>> Byungchul Park (1):
>> mm: separate move/undo doing on folio list from migrate_pages_batch()
>>
>> Zi Yan (4):
>> mm/migrate: factor out code in move_to_new_folio() and
>> migrate_folio_move()
>> mm/migrate: add migrate_folios_batch_move to batch the folio move
>> operations
>> mm/migrate: introduce multi-threaded page copy routine
>> test: add sysctl for folio copy tests and adjust
>> NR_MAX_BATCHED_MIGRATION
>>
>> include/linux/migrate.h | 3 +
>> include/linux/migrate_mode.h | 2 +
>> include/linux/mm.h | 4 +
>> include/linux/sysctl.h | 1 +
>> kernel/sysctl.c | 29 ++-
>> mm/Makefile | 2 +-
>> mm/copy_pages.c | 190 +++++++++++++++
>> mm/migrate.c | 443 +++++++++++++++++++++++++++--------
>> 8 files changed, 577 insertions(+), 97 deletions(-)
>> create mode 100644 mm/copy_pages.c
>>
>> --
>> 2.45.2
>>
--
Best Regards,
Yan, Zi