linux-mm.kvack.org archive mirror
From: Zi Yan <ziy@nvidia.com>
To: Shivank Garg <shivankg@amd.com>
Cc: <linux-mm@kvack.org>, David Rientjes <rientjes@google.com>,
	Aneesh Kumar <AneeshKumar.KizhakeVeetil@arm.com>,
	David Hildenbrand <david@redhat.com>,
	John Hubbard <jhubbard@nvidia.com>,
	Kirill Shutemov <k.shutemov@gmail.com>,
	Matthew Wilcox <willy@infradead.org>,
	Mel Gorman <mel.gorman@gmail.com>,
	"Rao, Bharata Bhasker" <bharata@amd.com>,
	Rik van Riel <riel@surriel.com>,
	RaghavendraKT <Raghavendra.KodsaraThimmappa@amd.com>,
	Wei Xu <weixugc@google.com>, Suyeon Lee <leesuyeon0506@gmail.com>,
	Lei Chen <leillc@google.com>,
	"Shukla, Santosh" <santosh.shukla@amd.com>,
	"Grimm, Jon" <jon.grimm@amd.com>, <sj@kernel.org>,
	<shy828301@gmail.com>, Liam Howlett <liam.howlett@oracle.com>,
	Gregory Price <gregory.price@memverge.com>,
	"Huang, Ying" <ying.huang@linux.alibaba.com>
Subject: Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
Date: Fri, 10 Jan 2025 14:51:01 -0500	[thread overview]
Message-ID: <334B7551-7834-44E7-91E6-4AE4C0B382AF@nvidia.com> (raw)
In-Reply-To: <D969919C-A241-432E-A0E3-353CCD8AC7E8@nvidia.com>

On 10 Jan 2025, at 12:05, Zi Yan wrote:

> <snip>
>>>
>>>>> main() {
>>>>> ...
>>>>>
>>>>>     // code snippet to measure throughput
>>>>>     clock_gettime(CLOCK_MONOTONIC, &t1);
>>>>>     retcode = move_pages(getpid(), num_pages, pages, nodesArray, statusArray, MPOL_MF_MOVE);
>>>>>     clock_gettime(CLOCK_MONOTONIC, &t2);
>>>>>
>>>>>     // tput = num_pages*PAGE_SIZE/(t2-t1)
>>>>>
>>>>> ...
>>>>> }
>>>>>
>>>>>
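[Editor's note: the quoted snippet above can be fleshed out into a minimal self-contained sketch. This is an illustration only, not the poster's actual test program: it invokes the raw move_pages(2) syscall to avoid a libnuma build dependency, and the destination node (1) and page count are placeholder assumptions.]

```c
/*
 * Sketch of the quoted move_pages() throughput measurement.
 * Assumptions: raw syscall instead of libnuma's move_pages() wrapper;
 * dst_node is a placeholder; MPOL_MF_MOVE value from <linux/mempolicy.h>.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#define MPOL_MF_MOVE (1 << 1)	/* move pages owned by this process */

/* elapsed wall-clock seconds between two CLOCK_MONOTONIC samples */
static double elapsed_sec(const struct timespec *t1, const struct timespec *t2)
{
	return (double)(t2->tv_sec - t1->tv_sec) +
	       (double)(t2->tv_nsec - t1->tv_nsec) / 1e9;
}

/* Migrate num_pages anonymous 4KB pages to dst_node; return GB/s, or -1.0
 * if the migration fails (e.g. on a single-node machine). */
static double measure_move_pages_tput(unsigned long num_pages, int dst_node)
{
	long psz = sysconf(_SC_PAGESIZE);
	void **pages = malloc(num_pages * sizeof(*pages));
	int *nodes = malloc(num_pages * sizeof(*nodes));
	int *status = malloc(num_pages * sizeof(*status));
	char *buf = aligned_alloc(psz, num_pages * psz);
	struct timespec t1, t2;

	memset(buf, 1, num_pages * psz);	/* fault the pages in */
	for (unsigned long i = 0; i < num_pages; i++) {
		pages[i] = buf + i * psz;
		nodes[i] = dst_node;
	}

	clock_gettime(CLOCK_MONOTONIC, &t1);
	long ret = syscall(SYS_move_pages, getpid(), num_pages, pages,
			   nodes, status, MPOL_MF_MOVE);
	clock_gettime(CLOCK_MONOTONIC, &t2);

	free(pages); free(nodes); free(status); free(buf);
	if (ret < 0)
		return -1.0;
	return num_pages * psz / elapsed_sec(&t1, &t2) / 1e9;
}
```

Note that the timing window covers only the move_pages() call itself, matching the `tput = num_pages*PAGE_SIZE/(t2-t1)` formula in the quoted snippet.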
>>>>> Measurements:
>>>>> ============
>>>>> vanilla: base kernel without patchset
>>>>> mt:0 = MT kernel with use_mt_copy=0
>>>>> mt:1..mt:32 = MT kernel with use_mt_copy=1 and thread cnt = 1,2,...,32
>>>>>
>>>>> Measured for both configurations, push_0_pull_1=0 and push_0_pull_1=1,
>>>>> for both 4KB migration and THP migration.
>>>>>
>>>>> --------------------
>>>>> #1 push_0_pull_1 = 0 (src node CPUs are used)
>>>>>
>>>>> #1.1 THP=Never, 4KB (GB/s):
>>>>> nr_pages            vanilla   mt:0      mt:1      mt:2      mt:4      mt:8      mt:16     mt:32
>>>>> 512                 1.28      1.28      1.92      1.80      2.24      2.35      2.22      2.17
>>>>> 4096                2.40      2.40      2.51      2.58      2.83      2.72      2.99      3.25
>>>>> 8192                3.18      2.88      2.83      2.69      3.49      3.46      3.57      3.80
>>>>> 16348               3.17      2.94      2.96      3.17      3.63      3.68      4.06      4.15
>>>>>
>>>>> #1.2 THP=Always, 2MB (GB/s):
>>>>> nr_pages            vanilla   mt:0      mt:1      mt:2      mt:4      mt:8      mt:16     mt:32
>>>>> 512                 4.31      5.02      3.39      3.40      3.33      3.51      3.91      4.03
>>>>> 1024                7.13      4.49      3.58      3.56      3.91      3.87      4.39      4.57
>>>>> 2048                5.26      6.47      3.91      4.00      3.71      3.85      4.97      6.83
>>>>> 4096                9.93      7.77      4.58      3.79      3.93      3.53      6.41      4.77
>>>>> 8192                6.47      6.33      4.37      4.67      4.52      4.39      5.30      5.37
>>>>> 16348               7.66      8.00      5.20      5.22      5.24      5.28      6.41      7.02
>>>>> 32768               8.56      8.62      6.34      6.20      6.20      6.19      7.18      8.10
>>>>> 65536               9.41      9.40      7.14      7.15      7.15      7.19      7.96      8.89
>>>>> 262144              10.17     10.19     7.26      7.90      7.98      8.05      9.46      10.30
>>>>> 524288              10.40     9.95      7.25      7.93      8.02      8.76      9.55      10.30
>>>>>
>>>>> --------------------
>>>>> #2 push_0_pull_1 = 1 (dst node CPUs are used):
>>>>>
>>>>> #2.1 THP=Never 4KB (GB/s):
>>>>> nr_pages            vanilla   mt:0      mt:1      mt:2      mt:4      mt:8      mt:16     mt:32
>>>>> 512                 1.28      1.36      2.01      2.74      2.33      2.31      2.53      2.96
>>>>> 4096                2.40      2.84      2.94      3.04      3.40      3.23      3.31      4.16
>>>>> 8192                3.18      3.27      3.34      3.94      3.77      3.68      4.23      4.76
>>>>> 16348               3.17      3.42      3.66      3.21      3.82      4.40      4.76      4.89
>>>>>
>>>>> #2.2 THP=Always 2MB (GB/s):
>>>>> nr_pages            vanilla   mt:0      mt:1      mt:2      mt:4      mt:8      mt:16     mt:32
>>>>> 512                 4.31      5.91      4.03      3.73      4.26      4.13      4.78      3.44
>>>>> 1024                7.13      6.83      4.60      5.13      5.03      5.19      5.94      7.25
>>>>> 2048                5.26      7.09      5.20      5.69      5.83      5.73      6.85      8.13
>>>>> 4096                9.93      9.31      4.90      4.82      4.82      5.26      8.46      8.52
>>>>> 8192                6.47      7.63      5.66      5.85      5.75      6.14      7.45      8.63
>>>>> 16348               7.66      10.00     6.35      6.54      6.66      6.99      8.18      10.21
>>>>> 32768               8.56      9.78      7.06      7.41      7.76      9.02      9.55      11.92
>>>>> 65536               9.41      10.00     8.19      9.20      9.32      8.68      11.00     13.31
>>>>> 262144              10.17     11.17     9.01      9.96      9.99      10.00     11.70     14.27
>>>>> 524288              10.40     11.38     9.07      9.98      10.01     10.09     11.95     14.48
>>>>>
>>>>> Note:
>>>>> 1. For THP = Never: I use 16x the number of pages to keep the total
>>>>>    size the same as in your experiment with 64KB page size.
>>>>> 2. For THP = Always: nr_pages = number of 4KB pages moved
>>>>>    (nr_pages=512 => 512 4KB pages => one 2MB page).
>>>>>
>>>>>
>>>>> I'm seeing little (1.5x in some cases) to no benefit. The performance
>>>>> scaling is relatively flat across thread counts.
>>>>>
>>>>> Is it possible I'm missing something in my testing?
>>>>>
>>>>> Could the base page size difference (4KB vs 64KB) be playing a role in
>>>>> the scaling behavior? How does the performance vary with 4KB pages on
>>>>> your system?
>>>>>
>>>>> I'd be happy to work with you on investigating these differences.
>>>>> Let me know if you'd like any additional test data or if there are specific
>>>>> configurations I should try.
>>>>
>>>> The results surprise me, since I was able to achieve ~9GB/s when migrating
>>>> 16 2MB THPs with 16 threads on a two socket system with Xeon E5-2650 v3 @ 2.30GHz
>>>> (a 19.2GB/s bandwidth QPI link between two sockets) back in 2019[1].
>>>> These are 10-year-old Haswell CPUs. And your results above show that EPYC 5 can
>>>> only achieve ~4GB/s when migrating 512 2MB THPs with 16 threads. It just does
>>>> not make sense.
>>>>
>>>> One thing you might want to try is setting init_on_alloc=0 in your boot
>>>> parameters so that folio_zero_user() is used instead of GFP_ZERO to zero
>>>> pages. That might reduce the time spent on page zeroing.
>>>>
>>>> I am also going to rerun the experiments locally on x86_64 boxes to see if your
>>>> results can be replicated.
>>>>
>>>> Thank you for the review and for running these experiments. I really
>>>> appreciate it.
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
>>>>
>>>
>>> Using init_on_alloc=0 gave a significant performance gain over the last
>>> experiment, but I'm still missing the performance scaling you observed.
>>
>> It might be the difference between x86 and ARM64, but I am not 100% sure.
>> Based on your data below, 2 or 4 threads seem to be the sweet spot for
>> the multi-threaded method on AMD CPUs. BTW, what is the bandwidth between
>> the two sockets in your system? From Figure 10 in [1], I see the InfiniBand
>> between two AMD EPYC 7601 @ 2.2GHz was measured at ~12GB/s unidirectional,
>> ~25GB/s bidirectional. I wonder if your results below are cross-socket
>> link bandwidth limited.
>>
>> From my results, the NVIDIA Grace CPU can achieve high copy throughput
>> with more threads between two sockets; part of the reason may be that
>> its cross-socket link's theoretical bandwidth is 900GB/s bidirectional.
>
> I talked to my colleague about this and he mentioned the CCD architecture
> on AMD CPUs. IIUC, one or two cores from one CCD can already saturate
> the CCD’s outgoing bandwidth, and all CPUs are enumerated from one CCD to
> another. This means my naive scheduling algorithm, which uses CPUs from
> 0 to N, uses all cores from one CCD first, then moves to another
> CCD, so it is not able to saturate the cross-socket bandwidth. Does that
> make sense to you?
>
> If yes, can you please change my CPU selection code in mm/copy_pages.c:
>
> +	/* TODO: need a better cpu selection method */
> +	for_each_cpu(cpu, per_node_cpumask) {
> +		if (i >= total_mt_num)
> +			break;
> +		cpu_id_list[i] = cpu;
> +		++i;
> +	}
>
> to select CPUs from as many CCDs as possible and rerun the tests.
> That might boost the page migration throughput on AMD CPUs more.
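[Editor's note: the CCD-striding idea suggested above can be sketched in userspace. This is an illustration of the scheduling change, not the patchset's code: CORES_PER_CCD = 8 is an assumption (actual CCD sizes vary by EPYC generation), and `pick_cpus_striped` is a hypothetical helper name.]

```c
#include <stdio.h>

/*
 * CCD-aware CPU selection sketch. Instead of taking CPUs 0..N-1, which
 * fills one CCD completely before touching the next, stride across CCDs
 * so that worker threads land on as many CCDs as possible and can use
 * each CCD's outgoing bandwidth.
 */
#define CORES_PER_CCD 8		/* assumption; varies by EPYC generation */

static void pick_cpus_striped(int node_first_cpu, int node_nr_cpus,
			      int nr_threads, int *cpu_id_list)
{
	int nr_ccds = node_nr_cpus / CORES_PER_CCD;

	if (nr_ccds == 0)	/* fewer CPUs than one CCD: fall back */
		nr_ccds = 1;

	for (int i = 0; i < nr_threads; i++) {
		int ccd = i % nr_ccds;		/* rotate over CCDs first */
		int core_in_ccd = i / nr_ccds;	/* then fill within a CCD */

		cpu_id_list[i] = node_first_cpu +
				 ccd * CORES_PER_CCD + core_in_ccd;
	}
}
```

For example, with 32 CPUs on the node (4 CCDs of 8 cores) and 4 threads, this picks CPUs 0, 8, 16, 24, one per CCD, instead of 0-3 from a single CCD as the sequential loop would.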
>
> Thanks.
>
>>>
>>> THP Never
>>> nr_pages            vanilla   mt:0      mt:1      mt:2      mt:4      mt:8      mt:16     mt:32
>>> 512                 1.40      1.43      2.79      3.48      3.63      3.73      3.63      3.57
>>> 4096                2.54      3.32      3.18      4.65      4.83      5.11      5.39      5.78
>>> 8192                3.35      4.40      4.39      4.71      3.63      5.04      5.33      6.00
>>> 16348               3.76      4.50      4.44      5.33      5.41      5.41      6.47      6.41
>>>
>>> THP Always
>>> nr_pages            vanilla   mt:0      mt:1      mt:2      mt:4      mt:8      mt:16     mt:32
>>> 512                 5.21      5.47      5.77      6.92      3.71      2.75      7.54      7.44
>>> 1024                6.10      7.65      8.12      8.41      8.87      8.55      9.13      11.36
>>> 2048                6.39      6.66      9.58      8.92      10.75     12.99     13.33     12.23
>>> 4096                7.33      10.85     8.22      13.57     11.43     10.93     12.53     16.86
>>> 8192                7.26      7.46      8.88      11.82     10.55     10.94     13.27     14.11
>>> 16348               9.07      8.53      11.82     14.89     12.97     13.22     16.14     18.10
>>> 32768               10.45     10.55     11.79     19.19     16.85     17.56     20.58     26.57
>>> 65536               11.00     11.12     13.25     18.27     16.18     16.11     19.61     27.73
>>> 262144              12.37     12.40     15.65     20.00     19.25     19.38     22.60     31.95
>>> 524288              12.44     12.33     15.66     19.78     19.06     18.96     23.31     32.29
>>
>> [1] https://www.dell.com/support/kbdoc/en-us/000143393/amd-epyc-stream-hpl-infiniband-and-wrf-performance-study


BTW, I reran the experiments on a two-socket Xeon E5-2650 v4 @ 2.20GHz system with the pull method.
The 4KB results are not very impressive, at most 60% more throughput, but 2MB can reach ~6.5x of
the vanilla kernel throughput using 8 or 16 threads.


4KB (GB/s)

| ---- | ------- | ---- | ---- | ---- | ---- | ----- |
|      | vanilla | mt_1 | mt_2 | mt_4 | mt_8 | mt_16 |
| ---- | ------- | ---- | ---- | ---- | ---- | ----- |
| 512  | 1.12    | 1.19 | 1.20 | 1.26 | 1.27 | 1.35  |
| 768  | 1.29    | 1.14 | 1.28 | 1.40 | 1.39 | 1.46  |
| 1024 | 1.19    | 1.25 | 1.34 | 1.51 | 1.52 | 1.53  |
| 2048 | 1.14    | 1.12 | 1.44 | 1.61 | 1.73 | 1.71  |
| 4096 | 1.09    | 1.14 | 1.46 | 1.64 | 1.81 | 1.78  |



2MB (GB/s)
| ---- | ------- | ---- | ---- | ----- | ----- | ----- |
|      | vanilla | mt_1 | mt_2 | mt_4  | mt_8  | mt_16 |
| ---- | ------- | ---- | ---- | ----- | ----- | ----- |
| 1    | 2.03    | 2.21 | 2.69 | 2.93  | 3.17  | 3.14  |
| 2    | 2.28    | 2.13 | 3.54 | 4.50  | 4.72  | 4.72  |
| 4    | 2.92    | 2.93 | 4.44 | 6.50  | 7.24  | 7.06  |
| 8    | 2.29    | 2.37 | 3.21 | 6.86  | 8.83  | 8.44  |
| 16   | 2.10    | 2.09 | 4.57 | 8.06  | 8.32  | 9.70  |
| 32   | 2.22    | 2.21 | 4.43 | 8.96  | 9.37  | 11.54 |
| 64   | 2.35    | 2.35 | 3.15 | 7.77  | 10.77 | 13.61 |
| 128  | 2.48    | 2.53 | 5.12 | 8.18  | 11.01 | 15.62 |
| 256  | 2.55    | 2.53 | 5.44 | 8.25  | 12.73 | 16.49 |
| 512  | 2.61    | 2.52 | 5.73 | 11.26 | 17.18 | 16.97 |
| 768  | 2.55    | 2.53 | 5.90 | 11.41 | 14.86 | 17.15 |
| 1024 | 2.56    | 2.52 | 5.99 | 11.46 | 16.77 | 17.25 |



Best Regards,
Yan, Zi



Thread overview: 29+ messages
2025-01-03 17:24 Zi Yan
2025-01-03 17:24 ` [RFC PATCH 1/5] mm: separate move/undo doing on folio list from migrate_pages_batch() Zi Yan
2025-01-03 17:24 ` [RFC PATCH 2/5] mm/migrate: factor out code in move_to_new_folio() and migrate_folio_move() Zi Yan
2025-01-03 17:24 ` [RFC PATCH 3/5] mm/migrate: add migrate_folios_batch_move to batch the folio move operations Zi Yan
2025-01-09 11:47   ` Shivank Garg
2025-01-09 14:08     ` Zi Yan
2025-01-03 17:24 ` [RFC PATCH 4/5] mm/migrate: introduce multi-threaded page copy routine Zi Yan
2025-01-06  1:18   ` Hyeonggon Yoo
2025-01-06  2:01     ` Zi Yan
2025-02-13 12:44   ` Byungchul Park
2025-02-13 15:34     ` Zi Yan
2025-02-13 21:34       ` Byungchul Park
2025-01-03 17:24 ` [RFC PATCH 5/5] test: add sysctl for folio copy tests and adjust NR_MAX_BATCHED_MIGRATION Zi Yan
2025-01-03 22:21   ` Gregory Price
2025-01-03 22:56     ` Zi Yan
2025-01-03 19:17 ` [RFC PATCH 0/5] Accelerate page migration with batching and multi threads Gregory Price
2025-01-03 19:32   ` Zi Yan
2025-01-03 22:09 ` Yang Shi
2025-01-06  2:33   ` Zi Yan
2025-01-09 11:47 ` Shivank Garg
2025-01-09 15:04   ` Zi Yan
2025-01-09 18:03     ` Shivank Garg
2025-01-09 19:32       ` Zi Yan
2025-01-10 17:05         ` Zi Yan
2025-01-10 19:51           ` Zi Yan [this message]
2025-01-16  4:57             ` Shivank Garg
2025-01-21  6:15               ` Shivank Garg
2025-02-13  8:17 ` Byungchul Park
2025-02-13 15:36   ` Zi Yan
