On my arm64 server with 128 cores and 2 NUMA nodes, I used memhog as the
benchmark:

    numactl -m -C 5 memhog -r100000 1G

The test results are as below.

With this patch:

    # time migratepages 8490 0 1

    real    0m1.161s
    user    0m0.000s
    sys     0m1.161s

Without this patch:

    # time migratepages 8460 0 1

    real    0m2.068s
    user    0m0.001s
    sys     0m2.068s

So you can see a page migration performance improvement of about *+78%*.

This is the perf record info.

w/o this patch:

+   51.07%     0.09%  migratepages  [kernel.kallsyms]  [k] migrate_folio_extra
+   42.43%     0.04%  migratepages  [kernel.kallsyms]  [k] folio_copy
+   42.34%    42.34%  migratepages  [kernel.kallsyms]  [k] __pi_copy_page
+   33.99%     0.09%  migratepages  [kernel.kallsyms]  [k] rmap_walk_anon
+   32.35%     0.04%  migratepages  [kernel.kallsyms]  [k] try_to_migrate
*+   27.78%    27.78%  migratepages  [kernel.kallsyms]  [k] ptep_clear_flush*
+    8.19%     6.64%  migratepages  [kernel.kallsyms]  [k] folio_migrate_flags

w/ this patch:

+   18.57%     0.13%  migratepages  [kernel.kallsyms]  [k] migrate_pages
+   18.23%     0.07%  migratepages  [kernel.kallsyms]  [k] migrate_pages_batch
+   16.29%     0.13%  migratepages  [kernel.kallsyms]  [k] migrate_folio_move
+   12.73%     0.10%  migratepages  [kernel.kallsyms]  [k] move_to_new_folio
+   12.52%     0.06%  migratepages  [kernel.kallsyms]  [k] migrate_folio_extra

Therefore, this patch helps improve page migration performance (a toy
sketch of the batched flow it implements is appended after the quoted
cover letter below), so you can add:

Tested-by: Xin Hao

On 2023/2/6 2:33 PM, Huang Ying wrote:
> From: "Huang, Ying"
>
> Now, migrate_pages() migrates folios one by one, like the pseudo code
> below:
>
>   for each folio
>     unmap
>     flush TLB
>     copy
>     restore map
>
> If multiple folios are passed to migrate_pages(), there are
> opportunities to batch the TLB flushing and copying.  That is, we can
> change the code to something like:
>
>   for each folio
>     unmap
>   for each folio
>     flush TLB
>   for each folio
>     copy
>   for each folio
>     restore map
>
> The total number of TLB flushing IPIs can be reduced considerably, and
> we may use some hardware accelerator such as DSA to accelerate the
> folio copying.
>
> So in this patch, we refactor the migrate_pages() implementation and
> batch the TLB flushing.  Based on this, hardware-accelerated folio
> copying can be implemented.
>
> If too many folios are passed to migrate_pages(), the naive batched
> implementation may unmap too many folios at the same time.  The
> possibility that a task has to wait for the migrated folios to be
> mapped again increases, so latency may be hurt.  To deal with this
> issue, the maximum number of folios unmapped in one batch is restricted
> to no more than HPAGE_PMD_NR pages.  That is, the influence is at the
> same level as THP migration.
>
> We use the following test to measure the performance impact of the
> patchset.
>
> On a 2-socket Intel server:
>
> - Run the pmbench memory accessing benchmark.
>
> - Run `migratepages` to migrate pages of pmbench between node 0 and
>   node 1 back and forth.
>
> With the patch, the TLB flushing IPIs are reduced by 99.1% during the
> test, and the number of pages migrated successfully per second
> increases by 291.7%.
>
> This patchset is based on v6.2-rc4.
>
> Changes:
>
> v4:
>
> - Fixed another bug about non-LRU folio migration.  Thanks Hyeonggon!
>
> v3:
>
> - Rebased on v6.2-rc4
>
> - Fixed a bug about non-LRU folio migration.  Thanks Mike!
>
> - Fixed some comments.  Thanks Baolin!
>
> - Collected reviewed-by.
>
> v2:
>
> - Rebased on v6.2-rc3
>
> - Fixed type force cast warning.  Thanks Kees!
>
> - Added more comments and cleaned up the code.  Thanks Andrew, Zi,
>   Alistair, Dan!
>
> - Collected reviewed-by.
>
> from rfc to v1:
>
> - Rebased on v6.2-rc1
>
> - Fix the deadlock issue caused by locking multiple pages synchronously
>   per Alistair's comments.  Thanks!
>
> - Fix the autonumabench panic per Rao's comments and fix.  Thanks!
>
> - Other minor fixes per comments.  Thanks!
>
> Best Regards,
> Huang, Ying
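
As promised above, here is a minimal userspace toy sketch, not kernel
code, that only mirrors the control-flow change described in the cover
letter; every name in it (struct toy_folio, unmap_one(), flush_tlb(),
NR_FOLIOS and so on) is made up for illustration.  It contrasts the old
one-cycle-per-folio flow with the batched flow in which each phase runs
over the whole batch, so a single flush covers all the folios:

/* toy_batch_migrate.c - NOT kernel code; a standalone illustration only. */
#include <stdio.h>

#define NR_FOLIOS 512        /* illustrative batch size (HPAGE_PMD_NR is 512
                              * with 4K base pages and 2M PMD huge pages) */

struct toy_folio { int id; };

static int nr_tlb_flushes;

/* Stubs standing in for the real unmap/copy/remap work. */
static void unmap_one(struct toy_folio *f)       { (void)f; }
static void copy_one(struct toy_folio *f)        { (void)f; }
static void restore_map_one(struct toy_folio *f) { (void)f; }
static void flush_tlb(void)                      { nr_tlb_flushes++; }

/* Old flow: a full unmap/flush/copy/remap cycle per folio. */
static void migrate_one_by_one(struct toy_folio *folios, int n)
{
    for (int i = 0; i < n; i++) {
        unmap_one(&folios[i]);
        flush_tlb();                 /* one TLB flush per folio */
        copy_one(&folios[i]);
        restore_map_one(&folios[i]);
    }
}

/* New flow: run each phase over the whole batch, flush the TLB once. */
static void migrate_batched(struct toy_folio *folios, int n)
{
    for (int i = 0; i < n; i++)
        unmap_one(&folios[i]);
    flush_tlb();                     /* single deferred flush for the batch */
    for (int i = 0; i < n; i++)
        copy_one(&folios[i]);
    for (int i = 0; i < n; i++)
        restore_map_one(&folios[i]);
}

int main(void)
{
    struct toy_folio folios[NR_FOLIOS];

    for (int i = 0; i < NR_FOLIOS; i++)
        folios[i].id = i;

    nr_tlb_flushes = 0;
    migrate_one_by_one(folios, NR_FOLIOS);
    printf("one by one: %d TLB flushes\n", nr_tlb_flushes);

    nr_tlb_flushes = 0;
    migrate_batched(folios, NR_FOLIOS);
    printf("batched:    %d TLB flushes\n", nr_tlb_flushes);
    return 0;
}

Built with something like `gcc -O2 toy_batch_migrate.c`, it prints 512
flushes for the one-by-one flow versus 1 for the batched flow, which is
the effect behind the 99.1% TLB flushing IPI reduction reported in the
cover letter and the drop of ptep_clear_flush from the hot path in my
perf profile above.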