From: Yang Shi <shy828301@gmail.com>
To: Zi Yan <ziy@nvidia.com>
Cc: linux-mm@kvack.org, David Rientjes <rientjes@google.com>,
	 Shivank Garg <shivankg@amd.com>,
	Aneesh Kumar <AneeshKumar.KizhakeVeetil@arm.com>,
	 David Hildenbrand <david@redhat.com>,
	John Hubbard <jhubbard@nvidia.com>,
	 Kirill Shutemov <k.shutemov@gmail.com>,
	Matthew Wilcox <willy@infradead.org>,
	 Mel Gorman <mel.gorman@gmail.com>,
	"Rao, Bharata Bhasker" <bharata@amd.com>,
	Rik van Riel <riel@surriel.com>,
	 RaghavendraKT <Raghavendra.KodsaraThimmappa@amd.com>,
	Wei Xu <weixugc@google.com>,
	 Suyeon Lee <leesuyeon0506@gmail.com>,
	Lei Chen <leillc@google.com>,
	 "Shukla, Santosh" <santosh.shukla@amd.com>,
	"Grimm, Jon" <jon.grimm@amd.com>,
	sj@kernel.org,  Liam Howlett <liam.howlett@oracle.com>,
	Gregory Price <gregory.price@memverge.com>,
	 "Huang, Ying" <ying.huang@linux.alibaba.com>
Subject: Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
Date: Fri, 3 Jan 2025 14:09:39 -0800
Message-ID: <CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@mail.gmail.com>
In-Reply-To: <20250103172419.4148674-1-ziy@nvidia.com>

On Fri, Jan 3, 2025 at 9:24 AM Zi Yan <ziy@nvidia.com> wrote:
>
> Hi all,
>
> This patchset accelerates page migration by batching folio copy operations
> and using multiple CPU threads. It is based on Shivank's "Enhancements to
> Page Migration with Batch Offloading via DMA" patchset[1] and my original
> accelerated page migration patchset[2], and applies on top of
> mm-everything-2025-01-03-05-59. The last patch is for testing purposes only
> and should not be considered for merging.
>
> The motivations are:
>
> 1. Batching folio copies increases copy throughput. Especially for base page
> migrations, folio copy throughput is low because kernel activities, like
> moving folio metadata and updating page table entries, sit between
> consecutive folio copies. And base page sizes are relatively small: 4KB on
> x86_64 and either 4KB or 64KB on ARM64.
>
> 2. A single CPU thread has limited copy throughput. Using multiple threads
> is a natural extension to speed up folio copies when a DMA engine is not
> available in the system.
>
>
> Design
> ===
>
> This patchset builds on Shivank's patchset and revises MIGRATE_SYNC_NO_COPY
> (renamed to MIGRATE_NO_COPY) to skip the folio copy operation inside
> migrate_folio_move() and perform all copies in one shot afterwards. A
> copy_page_lists_mt() function is added to copy folios from the src list to
> the dst list using multiple threads.
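>
> To make the shape concrete, here is a minimal sketch of the fan-out step.
> The names (struct copy_work, copy_work_fn) and the work-item layout are
> hypothetical; the real routine is defined in patch 4.
>
>     /*
>      * Illustrative only: split a batch of folio copies into chunks,
>      * run one chunk per kernel worker, and wait for completion.
>      */
>     struct copy_work {
>             struct work_struct work;
>             struct folio **src, **dst;      /* paired src/dst folios */
>             int nr;                         /* folios in this chunk */
>     };
>
>     static void copy_work_fn(struct work_struct *work)
>     {
>             struct copy_work *cw = container_of(work, struct copy_work, work);
>             int i;
>
>             for (i = 0; i < cw->nr; i++)
>                     folio_copy(cw->dst[i], cw->src[i]);
>     }
>
>     static void copy_folio_chunks_mt(struct copy_work *cw, int nthreads)
>     {
>             int t;
>
>             for (t = 0; t < nthreads; t++) {
>                     INIT_WORK(&cw[t].work, copy_work_fn);
>                     queue_work(system_unbound_wq, &cw[t].work);
>             }
>             /* all chunks must finish before the dst folios go live */
>             for (t = 0; t < nthreads; t++)
>                     flush_work(&cw[t].work);
>     }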
>
> Changes compared to Shivank's patchset (mainly a rewrite of the batched
> folio copy code)
> ===
>
> 1. mig_info is removed, so no memory allocation is needed while batching
> folio copies. src->private is used to store the old page state and the
> anon_vma after folio metadata has been copied from src to dst (see the
> sketch after this list).
>
> 2. move_to_new_folio() and migrate_folio_move() are refactored to remove
> redundant code in migrate_folios_batch_move().
>
> 3. folio_mc_copy() is used for the single-threaded copy path to keep the
> original kernel behavior.
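>
> For reference, mainline already packs migration bookkeeping into the low
> bits of the anon_vma pointer (see __migrate_folio_record() in
> mm/migrate.c). Below is a sketch of the same idea keyed off src->private;
> the helper names are hypothetical:
>
>     /* the low pointer bits are free because anon_vma is aligned */
>     static void batch_folio_record(struct folio *src, int old_page_state,
>                                    struct anon_vma *anon_vma)
>     {
>             src->private = (void *)anon_vma + old_page_state;
>     }
>
>     static void batch_folio_extract(struct folio *src, int *old_page_state,
>                                     struct anon_vma **anon_vmap)
>     {
>             unsigned long private = (unsigned long)src->private;
>
>             /* PAGE_OLD_STATES masks the flag bits, as in mm/migrate.c */
>             *anon_vmap = (struct anon_vma *)(private & ~PAGE_OLD_STATES);
>             *old_page_state = private & PAGE_OLD_STATES;
>             src->private = NULL;
>     }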
>
>
> Performance
> ===
>
> I benchmarked move_pages() throughput on a two-socket NUMA system with two
> NVIDIA Grace CPUs. The base page size is 64KB. Both 64KB page migration and
> 2MB mTHP migration are measured.
>
> The tables below show move_pages() throughput for different configurations
> and different numbers of copied pages. The columns are the configurations,
> from the vanilla Linux kernel to this patchset with 1, 2, 4, 8, 16, and 32
> threads; the rows are the number of migrated pages. The unit is GB/s.
>
> The 32-thread copy throughput can be up to 10x that of the single-threaded
> serial folio copy. Batching folio copies benefits not only huge pages but
> also base pages.
>
> 64KB (GB/s):
>
> nr_pages        vanilla mt_1    mt_2    mt_4    mt_8    mt_16   mt_32
> 32              5.43    4.90    5.65    7.31    7.60    8.61    6.43
> 256             6.95    6.89    9.28    14.67   22.41   23.39   23.93
> 512             7.88    7.26    10.15   17.53   27.82   27.88   33.93
> 768             7.65    7.42    10.46   18.59   28.65   29.67   30.76
> 1024            7.46    8.01    10.90   17.77   27.04   32.18   38.80
>
> 2MB mTHP (GB/s):
>
> nr_pages        vanilla mt_1    mt_2    mt_4    mt_8    mt_16   mt_32
> 1               5.94    2.90    6.90    8.56    11.16   8.76    6.41
> 2               7.67    5.57    7.11    12.48   17.37   15.68   14.10
> 4               8.01    6.04    10.25   20.14   22.52   27.79   25.28
> 8               8.42    7.00    11.41   24.73   33.96   32.62   39.55
> 16              9.41    6.91    12.23   27.51   43.95   49.15   51.38
> 32              10.23   7.15    13.03   29.52   49.49   69.98   71.51
> 64              9.40    7.37    13.88   30.38   52.00   76.89   79.41
> 128             8.59    7.23    14.20   28.39   49.98   78.27   90.18
> 256             8.43    7.16    14.59   28.14   48.78   76.88   92.28
> 512             8.31    7.78    14.40   26.20   43.31   63.91   75.21
> 768             8.30    7.86    14.83   27.41   46.25   69.85   81.31
> 1024            8.31    7.90    14.96   27.62   46.75   71.76   83.84

Was this done on an idle system or a busy one? For real production
workloads, all the CPUs are likely busy. It would be great to have
performance data collected from a busy system too.

>
>
> TODOs
> ===
> 1. The multi-threaded folio copy routine needs to consult the CPU scheduler
> and use only idle CPUs, to avoid interfering with userspace workloads. Of
> course, more complicated policies could be applied based on the priority of
> the thread issuing the migration.

The other potential problem is that it is hard to attribute the CPU
time consumed by the migration worker threads to CPU cgroups. In a
multi-tenant environment this may result in unfair CPU time accounting.
However, properly accounting CPU time for kernel threads is a chronic
problem, and I'm not sure whether it has been solved.

>
> 2. Eliminate memory allocation during the multi-threaded folio copy routine
> if possible.
>
> 3. Add a runtime check to decide when to use the multi-threaded folio copy,
> e.g., based on the cache hotness issue mentioned by Matthew[3].
>
> 4. Use non-temporal CPU instructions to avoid cache pollution issues.

AFAICT, arm64 already uses non-temporal instructions in its copy_page()
implementation.

>
> 5. Explicitly make the multi-threaded folio copy available only on !HIGHMEM
> configurations, since each kernel folio copy worker thread would need
> kmap_local_page(), which is expensive (see the folio_copy() reference after
> this list).
>
> 6. A better interface than copy_page_lists_mt() to allow DMA data copy
> to be used as well.
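>
> For reference, mainline's folio_copy() in mm/util.c is the per-page loop
> in question; copy_highpage() hides the kmap work that every copy worker
> thread would otherwise have to repeat on HIGHMEM:
>
>     void folio_copy(struct folio *dst, struct folio *src)
>     {
>             long i = 0;
>             long nr = folio_nr_pages(src);
>
>             for (;;) {
>                     copy_highpage(folio_page(dst, i), folio_page(src, i));
>                     if (++i == nr)
>                             break;
>                     cond_resched();
>             }
>     }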
>
> Let me know your thoughts. Thanks.
>
>
> [1] https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@amd.com/
> [2] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
> [3] https://lore.kernel.org/linux-mm/Zm0SWZKcRrngCUUW@casper.infradead.org/
>
> Byungchul Park (1):
>   mm: separate move/undo doing on folio list from migrate_pages_batch()
>
> Zi Yan (4):
>   mm/migrate: factor out code in move_to_new_folio() and
>     migrate_folio_move()
>   mm/migrate: add migrate_folios_batch_move to batch the folio move
>     operations
>   mm/migrate: introduce multi-threaded page copy routine
>   test: add sysctl for folio copy tests and adjust
>     NR_MAX_BATCHED_MIGRATION
>
>  include/linux/migrate.h      |   3 +
>  include/linux/migrate_mode.h |   2 +
>  include/linux/mm.h           |   4 +
>  include/linux/sysctl.h       |   1 +
>  kernel/sysctl.c              |  29 ++-
>  mm/Makefile                  |   2 +-
>  mm/copy_pages.c              | 190 +++++++++++++++
>  mm/migrate.c                 | 443 +++++++++++++++++++++++++++--------
>  8 files changed, 577 insertions(+), 97 deletions(-)
>  create mode 100644 mm/copy_pages.c
>
> --
> 2.45.2
>

