From: Zi Yan <ziy@nvidia.com>
To: Shivank Garg <shivankg@amd.com>
Cc: linux-mm@kvack.org, David Rientjes <rientjes@google.com>,
Aneesh Kumar <AneeshKumar.KizhakeVeetil@arm.com>,
David Hildenbrand <david@redhat.com>,
John Hubbard <jhubbard@nvidia.com>,
Kirill Shutemov <k.shutemov@gmail.com>,
Matthew Wilcox <willy@infradead.org>,
Mel Gorman <mel.gorman@gmail.com>,
"Rao, Bharata Bhasker" <bharata@amd.com>,
Rik van Riel <riel@surriel.com>,
RaghavendraKT <Raghavendra.KodsaraThimmappa@amd.com>,
Wei Xu <weixugc@google.com>, Suyeon Lee <leesuyeon0506@gmail.com>,
Lei Chen <leillc@google.com>,
"Shukla, Santosh" <santosh.shukla@amd.com>,
"Grimm, Jon" <jon.grimm@amd.com>,
sj@kernel.org, shy828301@gmail.com,
Liam Howlett <liam.howlett@oracle.com>,
Gregory Price <gregory.price@memverge.com>,
"Huang, Ying" <ying.huang@linux.alibaba.com>
Subject: Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
Date: Thu, 09 Jan 2025 10:04:15 -0500
Message-ID: <567FDE63-E84E-4B1E-85F4-4E1EB0C2CD26@nvidia.com>
In-Reply-To: <600a57ff-a462-4997-a621-f919c2c4fa84@amd.com>
On 9 Jan 2025, at 6:47, Shivank Garg wrote:
> On 1/3/2025 10:54 PM, Zi Yan wrote:
>
> Hi Zi,
>
> It's interesting to see my batch page migration patchset evolve with
> multi-threading support. Thanks for sharing this.
>
>> Hi all,
>>
>> This patchset accelerates page migration by batching folio copy operations and
>> using multiple CPU threads. It is based on Shivank's Enhancements to Page
>> Migration with Batch Offloading via DMA patchset[1] and my original accelerate
>> page migration patchset[2], and it applies on top of mm-everything-2025-01-03-05-59.
>> The last patch is for testing purposes only and should not be considered.
>>
>> The motivations are:
>>
>> 1. Batching folio copies increases copy throughput. This matters especially for
>> base page migrations, where folio copy throughput is low because kernel activities
>> like moving folio metadata and updating page table entries sit between two folio
>> copies, and base page sizes are relatively small (4KB on x86_64 and either 4KB
>> or 64KB on ARM64).
>>
>> 2. A single CPU thread has limited copy throughput. Using multiple threads is
>> a natural extension to speed up folio copy when a DMA engine is NOT
>> available in a system.
>>
>>
>> Design
>> ===
>>
>> This patchset is based on Shivank's patchset and revises MIGRATE_SYNC_NO_COPY
>> (renamed to MIGRATE_NO_COPY) to skip the folio copy operation inside
>> migrate_folio_move() and perform all the copies in one shot afterwards. A
>> copy_page_lists_mt() function is added, which uses multiple threads to copy
>> folios from the src list to the dst list.
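(To make the design easier to follow for anyone just joining the thread: conceptually,
the multi-threaded part carves the src/dst folio pairs into chunks and copies each
chunk from an unbound workqueue worker, then waits for all of them before folio
metadata and page table entries are restored. The rough sketch below only illustrates
that idea and is NOT the patch code; names like copy_chunk/copy_folios_mt are made up,
and the real routine is in patch 4.)

#include <linux/workqueue.h>
#include <linux/mm.h>

/* One chunk of src/dst folio pairs handed to a worker. */
struct copy_chunk {
        struct work_struct work;
        struct folio **dst;
        struct folio **src;
        int nr;
};

static void copy_chunk_fn(struct work_struct *work)
{
        struct copy_chunk *c = container_of(work, struct copy_chunk, work);
        int i;

        for (i = 0; i < c->nr; i++)
                folio_copy(c->dst[i], c->src[i]);       /* plain CPU copy */
}

/* Caller has already carved the folio pairs into nr_threads chunks. */
static void copy_folios_mt(struct copy_chunk *chunks, int nr_threads)
{
        int i;

        for (i = 0; i < nr_threads; i++) {
                INIT_WORK(&chunks[i].work, copy_chunk_fn);
                queue_work(system_unbound_wq, &chunks[i].work);
        }
        /* Wait for every worker before proceeding with metadata/PTE fixups. */
        for (i = 0; i < nr_threads; i++)
                flush_work(&chunks[i].work);
}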
>>
>> Changes compared to Shivank's patchset (mainly a rewrite of the batching folio
>> copy code)
>> ===
>>
>> 1. mig_info is removed, so no memory allocation is needed during the
>> batched folio copies. src->private is used to store the old page state and
>> anon_vma after the folio metadata is copied from src to dst.
>>
>> 2. move_to_new_folio() and migrate_folio_move() are refactored to remove
>> redundant code in migrate_folios_batch_move().
>>
>> 3. folio_mc_copy() is used for the single-threaded copy code to keep the
>> original kernel behavior.
>>
>>
>
>
>>
>> TODOs
>> ===
>> 1. The multi-threaded folio copy routine needs to look at the CPU scheduler and
>> only use idle CPUs to avoid interfering with userspace workloads. Of course,
>> more complicated policies can be used based on the priority of the thread
>> issuing the migration.
>>
>> 2. Eliminate memory allocation during the multi-threaded folio copy routine
>> if possible.
>>
>> 3. A runtime check to decide when to use the multi-threaded folio copy,
>> based on something like the cache hotness issue mentioned by Matthew[3].
>>
>> 4. Use non-temporal CPU instructions to avoid cache pollution issues.
>
>>
>> 5. Explicitly make the multi-threaded folio copy available only on
>> !HIGHMEM configs, since kmap_local_page() would be needed in each kernel
>> folio copy worker thread and is expensive.
>>
>> 6. A better interface than copy_page_lists_mt() to allow DMA data copy
>> to be used as well.
>
> I think Static Calls could be a better option for this.
This is the first time I have heard about it. Based on the information I found, I agree
it is a great mechanism for switching between two methods globally.
>
> This will give a flexible copy interface to support both CPU and various DMA-based
> folio copies. A DMA-capable driver can override the default CPU copy path without any
> additional runtime overhead.
Yes, supporting DMA-based folio copy is my intention too. I am happy to work
with you on that. Things to note are:
1. The DMA engine should have higher copy throughput than a single CPU thread, otherwise
the scatter-gather setup overhead will eliminate the benefit of using the DMA engine.
2. Unless the DMA engine is really beefy and can handle all possible page migration
requests, CPU-based migration (single or multiple threads) should remain the fallback.
In terms of 2, I wonder how much overhead Static Calls add when switching
between functions. Also, a lock might be needed, since falling back to the CPU might
happen per migrate_pages() call. Considering these two points, Static Calls might not work
as you intended if switching between CPU and DMA is needed.
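For reference, this is roughly what I understand a static-call-based copy hook would
look like (just a sketch, not an existing interface; the key and function names below
are made up, and a DMA driver would install its own backend with static_call_update()
at probe time):

#include <linux/static_call.h>
#include <linux/list.h>
#include <linux/mm.h>

/* Default backend: copy each folio on @src_list to its pair on @dst_list. */
static int cpu_copy_folio_list(struct list_head *dst_list,
                               struct list_head *src_list)
{
        struct folio *src, *dst;

        dst = list_first_entry(dst_list, struct folio, lru);
        list_for_each_entry(src, src_list, lru) {
                folio_copy(dst, src);
                dst = list_next_entry(dst, lru);
        }
        return 0;
}

DEFINE_STATIC_CALL(folio_copy_backend, cpu_copy_folio_list);

/* Every caller goes through the static call ... */
static int copy_folio_lists(struct list_head *dst_list,
                            struct list_head *src_list)
{
        return static_call(folio_copy_backend)(dst_list, src_list);
}

/*
 * ... and a DMA-capable driver would patch in its implementation once:
 *      static_call_update(folio_copy_backend, &dma_copy_folio_list);
 * Falling back to the CPU path per migrate_pages() call is where the
 * locking question above comes in.
 */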
>
>
>> Performance
>> ===
>>
>> I benchmarked move_pages() throughput on a two socket NUMA system with two
>> NVIDIA Grace CPUs. The base page size is 64KB. Both 64KB page migration and 2MB
>> mTHP page migration are measured.
>>
>> The tables below show move_pages() throughput with different
>> configurations and different numbers of copied pages. The columns are the
>> configurations, from the vanilla Linux kernel to using 1, 2, 4, 8, 16, or 32
>> threads with this patchset applied; the rows are the number of pages (or mTHPs)
>> copied; and the unit is GB/s.
>>
>> The 32-thread copy throughput can be up to 10x that of the single-threaded serial
>> folio copy. Batching folio copy benefits not only huge pages but also base
>> pages.
>>
>> 64KB (GB/s):
>>
>> nr_pages  vanilla   mt_1   mt_2   mt_4   mt_8  mt_16  mt_32
>> 32           5.43   4.90   5.65   7.31   7.60   8.61   6.43
>> 256          6.95   6.89   9.28  14.67  22.41  23.39  23.93
>> 512          7.88   7.26  10.15  17.53  27.82  27.88  33.93
>> 768          7.65   7.42  10.46  18.59  28.65  29.67  30.76
>> 1024         7.46   8.01  10.90  17.77  27.04  32.18  38.80
>>
>> 2MB mTHP (GB/s):
>>
>> nr_mthps  vanilla   mt_1   mt_2   mt_4   mt_8  mt_16  mt_32
>> 1            5.94   2.90   6.90   8.56  11.16   8.76   6.41
>> 2            7.67   5.57   7.11  12.48  17.37  15.68  14.10
>> 4            8.01   6.04  10.25  20.14  22.52  27.79  25.28
>> 8            8.42   7.00  11.41  24.73  33.96  32.62  39.55
>> 16           9.41   6.91  12.23  27.51  43.95  49.15  51.38
>> 32          10.23   7.15  13.03  29.52  49.49  69.98  71.51
>> 64           9.40   7.37  13.88  30.38  52.00  76.89  79.41
>> 128          8.59   7.23  14.20  28.39  49.98  78.27  90.18
>> 256          8.43   7.16  14.59  28.14  48.78  76.88  92.28
>> 512          8.31   7.78  14.40  26.20  43.31  63.91  75.21
>> 768          8.30   7.86  14.83  27.41  46.25  69.85  81.31
>> 1024         8.31   7.90  14.96  27.62  46.75  71.76  83.84
>
> I'm measuring the throughput (in GB/s) on our AMD EPYC Zen 5 system
> (2-socket, 64 cores per socket with SMT enabled, 2 NUMA nodes) with a base
> page size of 4KB, using mm-everything-2025-01-04-04-41 as the base
> kernel.
>
> Method:
> ======
> main() {
> ...
>
> // code snippet to measure throughput
> clock_gettime(CLOCK_MONOTONIC, &t1);
> retcode = move_pages(getpid(), num_pages, pages, nodesArray , statusArray, MPOL_MF_MOVE);
> clock_gettime(CLOCK_MONOTONIC, &t2);
>
> // tput = num_pages*PAGE_SIZE/(t2-t1)
>
> ...
> }
>
>
> Measurements:
> ============
> vanilla: base kernel without patchset
> mt:0 = MT kernel with use_mt_copy=0
> mt:1..mt:32 = MT kernel with use_mt_copy=1 and thread cnt = 1,2,...,32
>
> Measured for both configurations, push_0_pull_1=0 and push_0_pull_1=1, and
> for both 4KB migration and THP migration.
>
> --------------------
> #1 push_0_pull_1 = 0 (src node CPUs are used)
>
> #1.1 THP=Never, 4KB (GB/s):
> nr_pages  vanilla   mt:0   mt:1   mt:2   mt:4   mt:8  mt:16  mt:32
> 512          1.28   1.28   1.92   1.80   2.24   2.35   2.22   2.17
> 4096         2.40   2.40   2.51   2.58   2.83   2.72   2.99   3.25
> 8192         3.18   2.88   2.83   2.69   3.49   3.46   3.57   3.80
> 16348        3.17   2.94   2.96   3.17   3.63   3.68   4.06   4.15
>
> #1.2 THP=Always, 2MB (GB/s):
> nr_pages  vanilla   mt:0   mt:1   mt:2   mt:4   mt:8  mt:16  mt:32
> 512          4.31   5.02   3.39   3.40   3.33   3.51   3.91   4.03
> 1024         7.13   4.49   3.58   3.56   3.91   3.87   4.39   4.57
> 2048         5.26   6.47   3.91   4.00   3.71   3.85   4.97   6.83
> 4096         9.93   7.77   4.58   3.79   3.93   3.53   6.41   4.77
> 8192         6.47   6.33   4.37   4.67   4.52   4.39   5.30   5.37
> 16348        7.66   8.00   5.20   5.22   5.24   5.28   6.41   7.02
> 32768        8.56   8.62   6.34   6.20   6.20   6.19   7.18   8.10
> 65536        9.41   9.40   7.14   7.15   7.15   7.19   7.96   8.89
> 262144      10.17  10.19   7.26   7.90   7.98   8.05   9.46  10.30
> 524288      10.40   9.95   7.25   7.93   8.02   8.76   9.55  10.30
>
> --------------------
> #2 push_0_pull_1 = 1 (dst node CPUs are used):
>
> #2.1 THP=Never 4KB (GB/s):
> nr_pages  vanilla   mt:0   mt:1   mt:2   mt:4   mt:8  mt:16  mt:32
> 512          1.28   1.36   2.01   2.74   2.33   2.31   2.53   2.96
> 4096         2.40   2.84   2.94   3.04   3.40   3.23   3.31   4.16
> 8192         3.18   3.27   3.34   3.94   3.77   3.68   4.23   4.76
> 16348        3.17   3.42   3.66   3.21   3.82   4.40   4.76   4.89
>
> #2.2 THP=Always 2MB (GB/s):
> nr_pages  vanilla   mt:0   mt:1   mt:2   mt:4   mt:8  mt:16  mt:32
> 512          4.31   5.91   4.03   3.73   4.26   4.13   4.78   3.44
> 1024         7.13   6.83   4.60   5.13   5.03   5.19   5.94   7.25
> 2048         5.26   7.09   5.20   5.69   5.83   5.73   6.85   8.13
> 4096         9.93   9.31   4.90   4.82   4.82   5.26   8.46   8.52
> 8192         6.47   7.63   5.66   5.85   5.75   6.14   7.45   8.63
> 16348        7.66  10.00   6.35   6.54   6.66   6.99   8.18  10.21
> 32768        8.56   9.78   7.06   7.41   7.76   9.02   9.55  11.92
> 65536        9.41  10.00   8.19   9.20   9.32   8.68  11.00  13.31
> 262144      10.17  11.17   9.01   9.96   9.99  10.00  11.70  14.27
> 524288      10.40  11.38   9.07   9.98  10.01  10.09  11.95  14.48
>
> Note:
> 1. For THP = Never: I'm moving 16x as many pages to keep the total size the same
> as your experiment with a 64KB page size.
> 2. For THP = Always: nr_pages = number of 4KB pages moved
> (nr_pages=512 => 512 4KB pages => one 2MB page).
>
>
> I'm seeing little (1.5x in some cases) to no benefit. The performance scaling is
> relatively flat across thread counts.
>
> Is it possible I'm missing something in my testing?
>
> Could the base page size difference (4KB vs 64KB) be playing a role in
> the scaling behavior? How does the performance vary with 4KB pages on your system?
>
> I'd be happy to work with you on investigating these differences.
> Let me know if you'd like any additional test data or if there are specific
> configurations I should try.
The results surprise me, since I was able to achieve ~9GB/s when migrating
16 2MB THPs with 16 threads on a two-socket system with Xeon E5-2650 v3 @ 2.30GHz
(a 19.2GB/s bandwidth QPI link between the two sockets) back in 2019[1].
Those are 10-year-old Haswell CPUs, and your results above show that EPYC Zen 5 can
only achieve ~4GB/s when migrating 512 2MB THPs with 16 threads. It just does
not make sense.
One thing you might want to try is setting init_on_alloc=0 in your boot
parameters, so that folio_zero_user() is used instead of GFP_ZERO to zero pages. That
might reduce the time spent on page zeroing.
I am also going to rerun the experiments locally on x86_64 boxes to see if your
results can be replicated.
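To make sure we end up timing the same thing, here is a minimal standalone sketch of
such a move_pages() throughput measurement (not the exact tool either of us used; the
source/destination nodes and page count are placeholders; build with
"gcc -O2 tput.c -lnuma"):

#define _GNU_SOURCE
#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        const unsigned long num_pages = 512;
        const long page_size = sysconf(_SC_PAGESIZE);
        void **pages = malloc(num_pages * sizeof(void *));
        int *nodes = malloc(num_pages * sizeof(int));
        int *status = malloc(num_pages * sizeof(int));
        struct timespec t1, t2;

        /* Allocate and touch the pages on node 0, then ask to move them to node 1. */
        for (unsigned long i = 0; i < num_pages; i++) {
                pages[i] = numa_alloc_onnode(page_size, 0);
                if (!pages[i]) {
                        fprintf(stderr, "numa_alloc_onnode failed\n");
                        return 1;
                }
                memset(pages[i], 1, page_size);
                nodes[i] = 1;
        }

        clock_gettime(CLOCK_MONOTONIC, &t1);
        if (move_pages(getpid(), num_pages, pages, nodes, status,
                       MPOL_MF_MOVE) < 0)
                perror("move_pages");
        clock_gettime(CLOCK_MONOTONIC, &t2);

        double secs = (t2.tv_sec - t1.tv_sec) +
                      (t2.tv_nsec - t1.tv_nsec) / 1e9;
        printf("moved %lu pages in %.6f s: %.2f GB/s\n",
               num_pages, secs, num_pages * page_size / secs / 1e9);
        return 0;
}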
Thank you for the review and for running these experiments. I really appreciate
it.
[1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
Best Regards,
Yan, Zi
Thread overview: 29+ messages
2025-01-03 17:24 Zi Yan
2025-01-03 17:24 ` [RFC PATCH 1/5] mm: separate move/undo doing on folio list from migrate_pages_batch() Zi Yan
2025-01-03 17:24 ` [RFC PATCH 2/5] mm/migrate: factor out code in move_to_new_folio() and migrate_folio_move() Zi Yan
2025-01-03 17:24 ` [RFC PATCH 3/5] mm/migrate: add migrate_folios_batch_move to batch the folio move operations Zi Yan
2025-01-09 11:47 ` Shivank Garg
2025-01-09 14:08 ` Zi Yan
2025-01-03 17:24 ` [RFC PATCH 4/5] mm/migrate: introduce multi-threaded page copy routine Zi Yan
2025-01-06 1:18 ` Hyeonggon Yoo
2025-01-06 2:01 ` Zi Yan
2025-02-13 12:44 ` Byungchul Park
2025-02-13 15:34 ` Zi Yan
2025-02-13 21:34 ` Byungchul Park
2025-01-03 17:24 ` [RFC PATCH 5/5] test: add sysctl for folio copy tests and adjust NR_MAX_BATCHED_MIGRATION Zi Yan
2025-01-03 22:21 ` Gregory Price
2025-01-03 22:56 ` Zi Yan
2025-01-03 19:17 ` [RFC PATCH 0/5] Accelerate page migration with batching and multi threads Gregory Price
2025-01-03 19:32 ` Zi Yan
2025-01-03 22:09 ` Yang Shi
2025-01-06 2:33 ` Zi Yan
2025-01-09 11:47 ` Shivank Garg
2025-01-09 15:04 ` Zi Yan [this message]
2025-01-09 18:03 ` Shivank Garg
2025-01-09 19:32 ` Zi Yan
2025-01-10 17:05 ` Zi Yan
2025-01-10 19:51 ` Zi Yan
2025-01-16 4:57 ` Shivank Garg
2025-01-21 6:15 ` Shivank Garg
2025-02-13 8:17 ` Byungchul Park
2025-02-13 15:36 ` Zi Yan