From: Yang Shi
Date: Fri, 3 Jan 2025 14:09:39 -0800
Subject: Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
To: Zi Yan
Cc: linux-mm@kvack.org, David Rientjes, Shivank Garg, Aneesh Kumar,
 David Hildenbrand, John Hubbard, Kirill Shutemov, Matthew Wilcox,
 Mel Gorman, "Rao, Bharata Bhasker", Rik van Riel, RaghavendraKT, Wei Xu,
 Suyeon Lee, Lei Chen, "Shukla, Santosh", "Grimm, Jon", sj@kernel.org,
 Liam Howlett, Gregory Price, "Huang, Ying"
In-Reply-To: <20250103172419.4148674-1-ziy@nvidia.com>
References: <20250103172419.4148674-1-ziy@nvidia.com>
On Fri, Jan 3, 2025 at 9:24 AM Zi Yan wrote:
>
> Hi all,
>
> This patchset accelerates page migration by batching folio copy operations
> and using multiple CPU threads, and is based on Shivank's Enhancements to
> Page Migration with Batch Offloading via DMA patchset[1] and my original
> accelerate page migration patchset[2]. It is on top of
> mm-everything-2025-01-03-05-59. The last patch is for testing purposes only
> and should not be considered.
>
> The motivations are:
>
> 1. Batching folio copies increases copy throughput. Especially for base
> page migrations, folio copy throughput is low, because kernel activities
> like moving folio metadata and updating page table entries sit between two
> folio copies, and base page sizes are relatively small: 4KB on x86_64 and
> ARM64 (with 64KB also possible on ARM64).
>
> 2. A single CPU thread has limited copy throughput. Using multiple threads
> is a natural extension to speed up folio copies when a DMA engine is NOT
> available in a system.
>
>
> Design
> ===
>
> It is based on Shivank's patchset and revises MIGRATE_SYNC_NO_COPY
> (renamed to MIGRATE_NO_COPY) to avoid the folio copy operation inside
> migrate_folio_move() and perform all copies in one shot afterwards. A
> copy_page_lists_mt() function is added that uses multiple threads to copy
> folios from the src list to the dst list.
>
> Changes compared to Shivank's patchset (mainly a rewrite of the batching
> folio copy code)
> ===
>
> 1. mig_info is removed, so no memory allocation is needed during batched
> folio copies. src->private is used to store the old page state and
> anon_vma after folio metadata is copied from src to dst.
>
> 2. move_to_new_folio() and migrate_folio_move() are refactored to remove
> redundant code in migrate_folios_batch_move().
>
> 3. folio_mc_copy() is used for the single-threaded copy code to keep the
> original kernel behavior.
>
>
> Performance
> ===
>
> I benchmarked move_pages() throughput on a two-socket NUMA system with two
> NVIDIA Grace CPUs. The base page size is 64KB. Both 64KB page migration
> and 2MB mTHP page migration are measured.
>
> The tables below show move_pages() throughput with different
> configurations and different numbers of copied pages.
> The x-axis is the
> configuration, from the vanilla Linux kernel to using 1, 2, 4, 8, 16, or
> 32 threads with this patchset applied. The unit is GB/s.
>
> The 32-thread copy throughput can be up to 10x that of single-threaded
> serial folio copy. Batching folio copies benefits not only huge pages but
> also base pages.
>
> 64KB (GB/s):
>
>        vanilla   mt_1   mt_2   mt_4   mt_8  mt_16  mt_32
> 32        5.43   4.90   5.65   7.31   7.60   8.61   6.43
> 256       6.95   6.89   9.28  14.67  22.41  23.39  23.93
> 512       7.88   7.26  10.15  17.53  27.82  27.88  33.93
> 768       7.65   7.42  10.46  18.59  28.65  29.67  30.76
> 1024      7.46   8.01  10.90  17.77  27.04  32.18  38.80
>
> 2MB mTHP (GB/s):
>
>        vanilla   mt_1   mt_2   mt_4   mt_8  mt_16  mt_32
> 1         5.94   2.90   6.90   8.56  11.16   8.76   6.41
> 2         7.67   5.57   7.11  12.48  17.37  15.68  14.10
> 4         8.01   6.04  10.25  20.14  22.52  27.79  25.28
> 8         8.42   7.00  11.41  24.73  33.96  32.62  39.55
> 16        9.41   6.91  12.23  27.51  43.95  49.15  51.38
> 32       10.23   7.15  13.03  29.52  49.49  69.98  71.51
> 64        9.40   7.37  13.88  30.38  52.00  76.89  79.41
> 128       8.59   7.23  14.20  28.39  49.98  78.27  90.18
> 256       8.43   7.16  14.59  28.14  48.78  76.88  92.28
> 512       8.31   7.78  14.40  26.20  43.31  63.91  75.21
> 768       8.30   7.86  14.83  27.41  46.25  69.85  81.31
> 1024      8.31   7.90  14.96  27.62  46.75  71.76  83.84

Was this done on an idle system or a busy system? For real production
workloads, all the CPUs are likely busy. It would be great to have
performance data collected from a busy system too.

>
>
> TODOs
> ===
> 1. The multi-threaded folio copy routine needs to look at the CPU
> scheduler and only use idle CPUs to avoid interfering with userspace
> workloads. Of course, more sophisticated policies can be applied based on
> the priority of the thread issuing the migration.

The other potential problem is that it is hard to attribute the CPU time
consumed by the migration worker threads to CPU cgroups. In a multi-tenant
environment this may result in unfair CPU time accounting. However,
properly accounting CPU time for kernel threads is a chronic problem; I'm
not sure whether it has been solved yet.

>
> 2. Eliminate memory allocation during the multi-threaded folio copy
> routine if possible.
>
> 3. Add a runtime check to decide when to use the multi-threaded folio
> copy, something like the cache hotness issue mentioned by Matthew[3].
>
> 4. Use non-temporal CPU instructions to avoid cache pollution issues.

AFAICT, arm64 already uses non-temporal instructions for copying pages.

>
> 5. Explicitly make the multi-threaded folio copy available only on
> !HIGHMEM, since kmap_local_page() would be needed in each kernel folio
> copy worker thread and is expensive.
>
> 6. Provide a better interface than copy_page_lists_mt() to allow DMA data
> copy to be used as well.
>
> Let me know your thoughts. Thanks.
>
>
> [1] https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@amd.com/
> [2] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
> [3] https://lore.kernel.org/linux-mm/Zm0SWZKcRrngCUUW@casper.infradead.org/
>
> Byungchul Park (1):
>   mm: separate move/undo doing on folio list from migrate_pages_batch()
>
> Zi Yan (4):
>   mm/migrate: factor out code in move_to_new_folio() and
>     migrate_folio_move()
>   mm/migrate: add migrate_folios_batch_move to batch the folio move
>     operations
>   mm/migrate: introduce multi-threaded page copy routine
>   test: add sysctl for folio copy tests and adjust
>     NR_MAX_BATCHED_MIGRATION
>
>  include/linux/migrate.h      |   3 +
>  include/linux/migrate_mode.h |   2 +
>  include/linux/mm.h           |   4 +
>  include/linux/sysctl.h       |   1 +
>  kernel/sysctl.c              |  29 ++-
>  mm/Makefile                  |   2 +-
>  mm/copy_pages.c              | 190 +++++++++++++++
>  mm/migrate.c                 | 443 +++++++++++++++++++++++++++--------
>  8 files changed, 577 insertions(+), 97 deletions(-)
>  create mode 100644 mm/copy_pages.c
>
> --
> 2.45.2
>