Re: [RFC PATCH 4/5] mm/migrate: introduce multi-threaded page copy routine

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Zi Yan <ziy@nvidia.com>
To: Byungchul Park <byungchul@sk.com>
Cc: linux-mm@kvack.org, David Rientjes <rientjes@google.com>,
	Shivank Garg <shivankg@amd.com>,
	Aneesh Kumar <AneeshKumar.KizhakeVeetil@arm.com>,
	David Hildenbrand <david@redhat.com>,
	John Hubbard <jhubbard@nvidia.com>,
	Kirill Shutemov <k.shutemov@gmail.com>,
	Matthew Wilcox <willy@infradead.org>,
	Mel Gorman <mel.gorman@gmail.com>,
	"Rao, Bharata Bhasker" <bharata@amd.com>,
	Rik van Riel <riel@surriel.com>,
	RaghavendraKT <Raghavendra.KodsaraThimmappa@amd.com>,
	Wei Xu <weixugc@google.com>, Suyeon Lee <leesuyeon0506@gmail.com>,
	Lei Chen <leillc@google.com>,
	"Shukla, Santosh" <santosh.shukla@amd.com>,
	"Grimm, Jon" <jon.grimm@amd.com>,
	sj@kernel.org, shy828301@gmail.com,
	Liam Howlett <liam.howlett@oracle.com>,
	Gregory Price <gregory.price@memverge.com>,
	"Huang, Ying" <ying.huang@linux.alibaba.com>,
	kernel_team@skhynix.com
Subject: Re: [RFC PATCH 4/5] mm/migrate: introduce multi-threaded page copy routine
Date: Thu, 13 Feb 2025 10:34:05 -0500	[thread overview]
Message-ID: <D0E47A0F-3651-463C-AEDD-56CB0A5543E1@nvidia.com> (raw)
In-Reply-To: <20250213124401.GA29526@system.software.com>

On 13 Feb 2025, at 7:44, Byungchul Park wrote:

> On Fri, Jan 03, 2025 at 12:24:18PM -0500, Zi Yan wrote:
>> Now page copies are batched, multi-threaded page copy can be used to
>> increase page copy throughput. Add copy_page_lists_mt() to copy pages in
>> multi-threaded manners. Empirical data show more than 32 base pages are
>> needed to show the benefit of using multi-threaded page copy, so use 32 as
>> the threshold.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>  include/linux/migrate.h |   3 +
>>  mm/Makefile             |   2 +-
>>  mm/copy_pages.c         | 186 ++++++++++++++++++++++++++++++++++++++++
>>  mm/migrate.c            |  19 ++--
>>  4 files changed, 199 insertions(+), 11 deletions(-)
>>  create mode 100644 mm/copy_pages.c
>>
>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>> index 29919faea2f1..a0124f4893b0 100644
>> --- a/include/linux/migrate.h
>> +++ b/include/linux/migrate.h
>> @@ -80,6 +80,9 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio);
>>  int folio_migrate_mapping(struct address_space *mapping,
>>  		struct folio *newfolio, struct folio *folio, int extra_count);
>>
>> +int copy_page_lists_mt(struct list_head *dst_folios,
>> +		struct list_head *src_folios, int nr_items);
>> +
>>  #else
>>
>>  static inline void putback_movable_pages(struct list_head *l) {}
>> diff --git a/mm/Makefile b/mm/Makefile
>> index 850386a67b3e..f8c7f6b4cebb 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -92,7 +92,7 @@ obj-$(CONFIG_KMSAN)	+= kmsan/
>>  obj-$(CONFIG_FAILSLAB) += failslab.o
>>  obj-$(CONFIG_FAIL_PAGE_ALLOC) += fail_page_alloc.o
>>  obj-$(CONFIG_MEMTEST)		+= memtest.o
>> -obj-$(CONFIG_MIGRATION) += migrate.o
>> +obj-$(CONFIG_MIGRATION) += migrate.o copy_pages.o
>>  obj-$(CONFIG_NUMA) += memory-tiers.o
>>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>> diff --git a/mm/copy_pages.c b/mm/copy_pages.c
>> new file mode 100644
>> index 000000000000..0e2231199f66
>> --- /dev/null
>> +++ b/mm/copy_pages.c
>> @@ -0,0 +1,186 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Parallel page copy routine.
>> + */
>> +
>> +#include <linux/sysctl.h>
>> +#include <linux/highmem.h>
>> +#include <linux/workqueue.h>
>> +#include <linux/slab.h>
>> +#include <linux/migrate.h>
>> +
>> +
>> +unsigned int limit_mt_num = 4;
>> +
>> +struct copy_item {
>> +	char *to;
>> +	char *from;
>> +	unsigned long chunk_size;
>> +};
>> +
>> +struct copy_page_info {
>> +	struct work_struct copy_page_work;
>> +	unsigned long num_items;
>> +	struct copy_item item_list[];
>> +};
>> +
>> +static void copy_page_routine(char *vto, char *vfrom,
>> +	unsigned long chunk_size)
>> +{
>> +	memcpy(vto, vfrom, chunk_size);
>> +}
>> +
>> +static void copy_page_work_queue_thread(struct work_struct *work)
>> +{
>> +	struct copy_page_info *my_work = (struct copy_page_info *)work;
>> +	int i;
>> +
>> +	for (i = 0; i < my_work->num_items; ++i)
>> +		copy_page_routine(my_work->item_list[i].to,
>> +						  my_work->item_list[i].from,
>> +						  my_work->item_list[i].chunk_size);
>> +}
>> +
>> +int copy_page_lists_mt(struct list_head *dst_folios,
>> +		struct list_head *src_folios, int nr_items)
>> +{
>> +	int err = 0;
>> +	unsigned int total_mt_num = limit_mt_num;
>> +	int to_node = folio_nid(list_first_entry(dst_folios, struct folio, lru));
>> +	int i;
>> +	struct copy_page_info *work_items[32] = {0};
>> +	const struct cpumask *per_node_cpumask = cpumask_of_node(to_node);
>
> Hi,
>
> Why do you use the cpumask of dst's node than src for where queueing
> the works on?  Is it for utilizing CPU cache?  Isn't it better to use
> src's node than dst where nothing has been loaded to CPU cache?  Or why

Because some vendor’s CPU achieves higher copy throughput by pushing
data from src to dst, whereas other vendor’s CPU get higher by pulling
data from dst to src. More in [1].


> don't you avoid specifying cpus to queue on but let system_unbound_wq
> select the appropriate CPUs e.g. idlest CPUs, when the system is not
> that idle?

Based on wq_select_unbound_cpu()[2], a round robin method is used to
select target CPUs, not the idlest CPUs. queue_work_node(), which queue jobs
on a NUMA node, uses select_numa_node_cpu() and it just chooses the
random (or known as first) CPU from the NUMA node[3]. There is no idleness
detection in workqueue implementation yet.

In addition, based on CPU topology, not all idle CPUs are equal. For example,
AMD CPUs have CCDs and two cores from one CCD would saturate the CCD bandwidth.
This means if you want to achieve high copy throughput, even if all cores
in a CCD are idle, other idle CPUs from another CCD should be chosen first[4].

I am planning to reach out to scheduling folks to learn more about CPU scheduling
and come up with a better workqueue or an alternative for multithreading page
migration.


[1] https://lore.kernel.org/linux-mm/8B66C7BA-96D6-4E04-89F7-13829BF480D7@nvidia.com/
[2] https://elixir.bootlin.com/linux/v6.13.2/source/kernel/workqueue.c#L2212
[3] https://elixir.bootlin.com/linux/v6.13.2/source/kernel/workqueue.c#L2408
[4] https://lore.kernel.org/linux-mm/D969919C-A241-432E-A0E3-353CCD8AC7E8@nvidia.com/



>
> 	Byungchul
>
>> +	int cpu_id_list[32] = {0};
>> +	int cpu;
>> +	int max_items_per_thread;
>> +	int item_idx;
>> +	struct folio *src, *src2, *dst, *dst2;
>> +
>> +	total_mt_num = min_t(unsigned int, total_mt_num,
>> +			cpumask_weight(per_node_cpumask));
>> +
>> +	if (total_mt_num > 32)
>> +		total_mt_num = 32;
>> +
>> +	/* Each threads get part of each page, if nr_items < totla_mt_num */
>> +	if (nr_items < total_mt_num)
>> +		max_items_per_thread = nr_items;
>> +	else
>> +		max_items_per_thread = (nr_items / total_mt_num) +
>> +				((nr_items % total_mt_num) ? 1 : 0);
>> +
>> +
>> +	for (cpu = 0; cpu < total_mt_num; ++cpu) {
>> +		work_items[cpu] = kzalloc(sizeof(struct copy_page_info) +
>> +					sizeof(struct copy_item) * max_items_per_thread,
>> +					GFP_NOWAIT);
>> +		if (!work_items[cpu]) {
>> +			err = -ENOMEM;
>> +			goto free_work_items;
>> +		}
>> +	}
>> +
>> +	i = 0;
>> +	/* TODO: need a better cpu selection method */
>> +	for_each_cpu(cpu, per_node_cpumask) {
>> +		if (i >= total_mt_num)
>> +			break;
>> +		cpu_id_list[i] = cpu;
>> +		++i;
>> +	}
>> +
>> +	if (nr_items < total_mt_num) {
>> +		for (cpu = 0; cpu < total_mt_num; ++cpu) {
>> +			INIT_WORK((struct work_struct *)work_items[cpu],
>> +					  copy_page_work_queue_thread);
>> +			work_items[cpu]->num_items = max_items_per_thread;
>> +		}
>> +
>> +		item_idx = 0;
>> +		dst = list_first_entry(dst_folios, struct folio, lru);
>> +		dst2 = list_next_entry(dst, lru);
>> +		list_for_each_entry_safe(src, src2, src_folios, lru) {
>> +			unsigned long chunk_size = PAGE_SIZE * folio_nr_pages(src) / total_mt_num;
>> +			/* XXX: not working in HIGHMEM */
>> +			char *vfrom = page_address(&src->page);
>> +			char *vto = page_address(&dst->page);
>> +
>> +			VM_WARN_ON(PAGE_SIZE * folio_nr_pages(src) % total_mt_num);
>> +			VM_WARN_ON(folio_nr_pages(dst) != folio_nr_pages(src));
>> +
>> +			for (cpu = 0; cpu < total_mt_num; ++cpu) {
>> +				work_items[cpu]->item_list[item_idx].to =
>> +					vto + chunk_size * cpu;
>> +				work_items[cpu]->item_list[item_idx].from =
>> +					vfrom + chunk_size * cpu;
>> +				work_items[cpu]->item_list[item_idx].chunk_size =
>> +					chunk_size;
>> +			}
>> +
>> +			item_idx++;
>> +			dst = dst2;
>> +			dst2 = list_next_entry(dst, lru);
>> +		}
>> +
>> +		for (cpu = 0; cpu < total_mt_num; ++cpu)
>> +			queue_work_on(cpu_id_list[cpu],
>> +						  system_unbound_wq,
>> +						  (struct work_struct *)work_items[cpu]);
>> +	} else {
>> +		int num_xfer_per_thread = nr_items / total_mt_num;
>> +		int per_cpu_item_idx;
>> +
>> +
>> +		for (cpu = 0; cpu < total_mt_num; ++cpu) {
>> +			INIT_WORK((struct work_struct *)work_items[cpu],
>> +					  copy_page_work_queue_thread);
>> +
>> +			work_items[cpu]->num_items = num_xfer_per_thread +
>> +					(cpu < (nr_items % total_mt_num));
>> +		}
>> +
>> +		cpu = 0;
>> +		per_cpu_item_idx = 0;
>> +		item_idx = 0;
>> +		dst = list_first_entry(dst_folios, struct folio, lru);
>> +		dst2 = list_next_entry(dst, lru);
>> +		list_for_each_entry_safe(src, src2, src_folios, lru) {
>> +			/* XXX: not working in HIGHMEM */
>> +			work_items[cpu]->item_list[per_cpu_item_idx].to =
>> +				page_address(&dst->page);
>> +			work_items[cpu]->item_list[per_cpu_item_idx].from =
>> +				page_address(&src->page);
>> +			work_items[cpu]->item_list[per_cpu_item_idx].chunk_size =
>> +				PAGE_SIZE * folio_nr_pages(src);
>> +
>> +			VM_WARN_ON(folio_nr_pages(dst) !=
>> +				   folio_nr_pages(src));
>> +
>> +			per_cpu_item_idx++;
>> +			item_idx++;
>> +			dst = dst2;
>> +			dst2 = list_next_entry(dst, lru);
>> +
>> +			if (per_cpu_item_idx == work_items[cpu]->num_items) {
>> +				queue_work_on(cpu_id_list[cpu],
>> +					system_unbound_wq,
>> +					(struct work_struct *)work_items[cpu]);
>> +				per_cpu_item_idx = 0;
>> +				cpu++;
>> +			}
>> +		}
>> +		if (item_idx != nr_items)
>> +			pr_warn("%s: only %d out of %d pages are transferred\n",
>> +				__func__, item_idx - 1, nr_items);
>> +	}
>> +
>> +	/* Wait until it finishes  */
>> +	for (i = 0; i < total_mt_num; ++i)
>> +		flush_work((struct work_struct *)work_items[i]);
>> +
>> +free_work_items:
>> +	for (cpu = 0; cpu < total_mt_num; ++cpu)
>> +		kfree(work_items[cpu]);
>> +
>> +	return err;
>> +}
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 95c4cc4a7823..18440180d747 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1799,7 +1799,7 @@ static void migrate_folios_batch_move(struct list_head *src_folios,
>>  		int *nr_retry_pages)
>>  {
>>  	struct folio *folio, *folio2, *dst, *dst2;
>> -	int rc, nr_pages = 0, nr_mig_folios = 0;
>> +	int rc, nr_pages = 0, total_nr_pages = 0, total_nr_folios = 0;
>>  	int old_page_state = 0;
>>  	struct anon_vma *anon_vma = NULL;
>>  	bool is_lru;
>> @@ -1807,11 +1807,6 @@ static void migrate_folios_batch_move(struct list_head *src_folios,
>>  	LIST_HEAD(err_src);
>>  	LIST_HEAD(err_dst);
>>
>> -	if (mode != MIGRATE_ASYNC) {
>> -		*retry += 1;
>> -		return;
>> -	}
>> -
>>  	/*
>>  	 * Iterate over the list of locked src/dst folios to copy the metadata
>>  	 */
>> @@ -1859,19 +1854,23 @@ static void migrate_folios_batch_move(struct list_head *src_folios,
>>  			migrate_folio_undo_src(folio, old_page_state & PAGE_WAS_MAPPED,
>>  					anon_vma, true, ret_folios);
>>  			migrate_folio_undo_dst(dst, true, put_new_folio, private);
>> -		} else /* MIGRATEPAGE_SUCCESS */
>> -			nr_mig_folios++;
>> +		} else { /* MIGRATEPAGE_SUCCESS */
>> +			total_nr_pages += nr_pages;
>> +			total_nr_folios++;
>> +		}
>>
>>  		dst = dst2;
>>  		dst2 = list_next_entry(dst, lru);
>>  	}
>>
>>  	/* Exit if folio list for batch migration is empty */
>> -	if (!nr_mig_folios)
>> +	if (!total_nr_pages)
>>  		goto out;
>>
>>  	/* Batch copy the folios */
>> -	{
>> +	if (total_nr_pages > 32) {
>> +		copy_page_lists_mt(dst_folios, src_folios, total_nr_folios);
>> +	} else {
>>  		dst = list_first_entry(dst_folios, struct folio, lru);
>>  		dst2 = list_next_entry(dst, lru);
>>  		list_for_each_entry_safe(folio, folio2, src_folios, lru) {
>> -- 
>> 2.45.2
>>


Best Regards,
Yan, Zi

next prev parent reply	other threads:[~2025-02-13 15:34 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-01-03 17:24 [RFC PATCH 0/5] Accelerate page migration with batching and multi threads Zi Yan
2025-01-03 17:24 ` [RFC PATCH 1/5] mm: separate move/undo doing on folio list from migrate_pages_batch() Zi Yan
2025-01-03 17:24 ` [RFC PATCH 2/5] mm/migrate: factor out code in move_to_new_folio() and migrate_folio_move() Zi Yan
2025-01-03 17:24 ` [RFC PATCH 3/5] mm/migrate: add migrate_folios_batch_move to batch the folio move operations Zi Yan
2025-01-09 11:47   ` Shivank Garg
2025-01-09 14:08     ` Zi Yan
2025-01-03 17:24 ` [RFC PATCH 4/5] mm/migrate: introduce multi-threaded page copy routine Zi Yan
2025-01-06  1:18   ` Hyeonggon Yoo
2025-01-06  2:01     ` Zi Yan
2025-02-13 12:44   ` Byungchul Park
2025-02-13 15:34     ` Zi Yan [this message]
2025-02-13 21:34       ` Byungchul Park
2025-01-03 17:24 ` [RFC PATCH 5/5] test: add sysctl for folio copy tests and adjust NR_MAX_BATCHED_MIGRATION Zi Yan
2025-01-03 22:21   ` Gregory Price
2025-01-03 22:56     ` Zi Yan
2025-01-03 19:17 ` [RFC PATCH 0/5] Accelerate page migration with batching and multi threads Gregory Price
2025-01-03 19:32   ` Zi Yan
2025-01-03 22:09 ` Yang Shi
2025-01-06  2:33   ` Zi Yan
2025-01-09 11:47 ` Shivank Garg
2025-01-09 15:04   ` Zi Yan
2025-01-09 18:03     ` Shivank Garg
2025-01-09 19:32       ` Zi Yan
2025-01-10 17:05         ` Zi Yan
2025-01-10 19:51           ` Zi Yan
2025-01-16  4:57             ` Shivank Garg
2025-01-21  6:15               ` Shivank Garg
2025-02-13  8:17 ` Byungchul Park
2025-02-13 15:36   ` Zi Yan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=D0E47A0F-3651-463C-AEDD-56CB0A5543E1@nvidia.com \
    --to=ziy@nvidia.com \
    --cc=AneeshKumar.KizhakeVeetil@arm.com \
    --cc=Raghavendra.KodsaraThimmappa@amd.com \
    --cc=bharata@amd.com \
    --cc=byungchul@sk.com \
    --cc=david@redhat.com \
    --cc=gregory.price@memverge.com \
    --cc=jhubbard@nvidia.com \
    --cc=jon.grimm@amd.com \
    --cc=k.shutemov@gmail.com \
    --cc=kernel_team@skhynix.com \
    --cc=leesuyeon0506@gmail.com \
    --cc=leillc@google.com \
    --cc=liam.howlett@oracle.com \
    --cc=linux-mm@kvack.org \
    --cc=mel.gorman@gmail.com \
    --cc=riel@surriel.com \
    --cc=rientjes@google.com \
    --cc=santosh.shukla@amd.com \
    --cc=shivankg@amd.com \
    --cc=shy828301@gmail.com \
    --cc=sj@kernel.org \
    --cc=weixugc@google.com \
    --cc=willy@infradead.org \
    --cc=ying.huang@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox