From: Zi Yan <ziy@nvidia.com>
To: Shivank Garg
Cc: linux-mm@kvack.org, David Rientjes, Aneesh Kumar, David Hildenbrand,
 John Hubbard, Kirill Shutemov, Matthew Wilcox, Mel Gorman,
 "Rao, Bharata Bhasker", Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
 Lei Chen, "Shukla, Santosh", "Grimm, Jon", sj@kernel.org,
 shy828301@gmail.com, Liam Howlett, Gregory Price, "Huang, Ying"
Subject: Re: [RFC PATCH 0/5] Accelerate page migration with batching and
 multi threads
Date: Thu, 09 Jan 2025 10:04:15 -0500
Message-ID: <567FDE63-E84E-4B1E-85F4-4E1EB0C2CD26@nvidia.com>
In-Reply-To: <600a57ff-a462-4997-a621-f919c2c4fa84@amd.com>
References: <20250103172419.4148674-1-ziy@nvidia.com>
 <600a57ff-a462-4997-a621-f919c2c4fa84@amd.com>

On 9 Jan 2025, at 6:47, Shivank Garg wrote:

> On 1/3/2025 10:54 PM, Zi Yan wrote:
>
> Hi Zi,
>
> It's interesting to see my batch page migration patchset evolving with
> multi-threading support. Thanks for sharing this.
>
>> Hi all,
>>
>> This patchset accelerates page migration by batching folio copy
>> operations and using multiple CPU threads. It is based on Shivank's
>> Enhancements to Page Migration with Batch Offloading via DMA patchset[1]
>> and my original accelerate page migration patchset[2], and is on top of
>> mm-everything-2025-01-03-05-59. The last patch is for testing purposes
>> and should not be considered.
>>
>> The motivations are:
>>
>> 1. Batching folio copy increases copy throughput.
>> Especially for base page migrations, folio copy throughput is low, since
>> kernel activities like moving folio metadata and updating page table
>> entries sit between two folio copies. And base page sizes are relatively
>> small: 4KB on x86_64 and ARM64, or 64KB on ARM64.
>>
>> 2. A single CPU thread has limited copy throughput. Using multiple
>> threads is a natural extension to speed up folio copy when a DMA engine
>> is NOT available in a system.
>>
>>
>> Design
>> ===
>>
>> It is based on Shivank's patchset and revises MIGRATE_SYNC_NO_COPY
>> (renamed to MIGRATE_NO_COPY) to avoid the folio copy operation inside
>> migrate_folio_move() and perform the copies in one shot afterwards. A
>> copy_page_lists_mt() function is added to use multiple threads to copy
>> folios from the src list to the dst list.
>>
>> Changes compared to Shivank's patchset (mainly rewrote the batching folio
>> copy code)
>> ===
>>
>> 1. mig_info is removed, so no memory allocation is needed during
>> batched folio copies. src->private is used to store the old page state
>> and anon_vma after folio metadata is copied from src to dst.
>>
>> 2. move_to_new_folio() and migrate_folio_move() are refactored to remove
>> redundant code in migrate_folios_batch_move().
>>
>> 3. folio_mc_copy() is used for the single-threaded copy code to keep the
>> original kernel behavior.
>>
>> TODOs
>> ===
>> 1. The multi-threaded folio copy routine needs to look at the CPU
>> scheduler and only use idle CPUs to avoid interfering with userspace
>> workloads. Of course, more complicated policies can be used based on the
>> priority of the thread issuing the migration.
>>
>> 2. Eliminate memory allocation during the multi-threaded folio copy
>> routine if possible.
>>
>> 3. A runtime check to decide when to use the multi-threaded folio copy.
>> Something like the cache hotness issue mentioned by Matthew[3].
>>
>> 4. Use non-temporal CPU instructions to avoid cache pollution issues.
>>
>> 5. Explicitly make the multi-threaded folio copy only available to
>> !HIGHMEM, since kmap_local_page() would be needed in each kernel folio
>> copy worker thread and is expensive.
>>
>> 6. A better interface than copy_page_lists_mt() to allow DMA data copy
>> to be used as well.
>
> I think Static Calls can be a better option for this.

This is the first time I have heard about it. Based on the info I found, I
agree it is a great mechanism to switch between two methods globally.

> This will give a flexible copy interface to support both CPU and various
> DMA-based folio copies. A DMA-capable driver can override the default CPU
> copy path without any additional runtime overhead.

Yes, supporting DMA-based folio copy is my intention too. I am happy to
work with you on that. Things to note are:

1. The DMA engine should have more copy throughput than a single CPU
thread; otherwise the scatter-gather setup overheads will eliminate the
benefit of using the DMA engine.

2. Unless the DMA engine is really beefy and can handle all possible page
migration requests, CPU-based migration (single or multiple threads) should
be a fallback.

Regarding 2, I wonder how much overhead Static Calls have when switching
between functions. Also, a lock might be needed, since falling back to the
CPU might happen per migrate_pages() call. Considering these two, Static
Calls might not work as you intended if switching between CPU and DMA is
needed.
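
To make the discussion concrete, here is a rough, untested sketch of how I
understand the Static Call idea. The names folio_copy_backend,
copy_folios_cpu(), copy_folio_batch() and register_folio_copy_backend() are
made up for illustration; only DEFINE_STATIC_CALL(), static_call() and
static_call_update() are the existing kernel API, and the CPU backend would
really end up in folio_mc_copy() or copy_page_lists_mt():

    /* Illustration only -- not from either patchset. */
    #include <linux/list.h>
    #include <linux/static_call.h>

    /* Default backend: CPU (single- or multi-threaded) folio copy. */
    static int copy_folios_cpu(struct list_head *dst_list,
                               struct list_head *src_list, int nr_folios)
    {
            /* would call folio_mc_copy() or copy_page_lists_mt() */
            return 0;
    }

    DEFINE_STATIC_CALL(folio_copy_backend, copy_folios_cpu);

    /* Single call site in the batched migration path. */
    static int copy_folio_batch(struct list_head *dst_list,
                                struct list_head *src_list, int nr_folios)
    {
            return static_call(folio_copy_backend)(dst_list, src_list,
                                                   nr_folios);
    }

    /* A DMA-capable driver switches the backend once, e.g. at probe time. */
    void register_folio_copy_backend(int (*dma_copy)(struct list_head *,
                                                     struct list_head *, int))
    {
            static_call_update(folio_copy_backend, dma_copy);
    }

This works nicely as a global switch, but it does not by itself cover the
per-call fallback in point 2 above: when the DMA engine is busy, the call
site would still need to branch back to the CPU path, which is where a
static call alone may not be enough.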
>> Performance
>> ===
>>
>> I benchmarked move_pages() throughput on a two socket NUMA system with
>> two NVIDIA Grace CPUs. The base page size is 64KB. Both 64KB page
>> migration and 2MB mTHP page migration are measured.
>>
>> The tables below show move_pages() throughput with different
>> configurations and different numbers of copied pages. The x-axis is the
>> configuration, from the vanilla Linux kernel to using 1, 2, 4, 8, 16, or
>> 32 threads with this patchset applied. The unit is GB/s.
>>
>> The 32-thread copy throughput can be up to 10x that of single-threaded
>> serial folio copy. Batching folio copies benefits not only huge pages but
>> also base pages.
>>
>> 64KB (GB/s):
>>
>> nr_pages  vanilla   mt_1   mt_2   mt_4   mt_8   mt_16   mt_32
>> 32           5.43   4.90   5.65   7.31   7.60    8.61    6.43
>> 256          6.95   6.89   9.28  14.67  22.41   23.39   23.93
>> 512          7.88   7.26  10.15  17.53  27.82   27.88   33.93
>> 768          7.65   7.42  10.46  18.59  28.65   29.67   30.76
>> 1024         7.46   8.01  10.90  17.77  27.04   32.18   38.80
>>
>> 2MB mTHP (GB/s):
>>
>> nr_pages  vanilla   mt_1   mt_2   mt_4   mt_8   mt_16   mt_32
>> 1            5.94   2.90   6.90   8.56  11.16    8.76    6.41
>> 2            7.67   5.57   7.11  12.48  17.37   15.68   14.10
>> 4            8.01   6.04  10.25  20.14  22.52   27.79   25.28
>> 8            8.42   7.00  11.41  24.73  33.96   32.62   39.55
>> 16           9.41   6.91  12.23  27.51  43.95   49.15   51.38
>> 32          10.23   7.15  13.03  29.52  49.49   69.98   71.51
>> 64           9.40   7.37  13.88  30.38  52.00   76.89   79.41
>> 128          8.59   7.23  14.20  28.39  49.98   78.27   90.18
>> 256          8.43   7.16  14.59  28.14  48.78   76.88   92.28
>> 512          8.31   7.78  14.40  26.20  43.31   63.91   75.21
>> 768          8.30   7.86  14.83  27.41  46.25   69.85   81.31
>> 1024         8.31   7.90  14.96  27.62  46.75   71.76   83.84
>
> I'm measuring the throughput (in GB/s) on our AMD EPYC Zen 5 system
> (2-socket, 64 cores per socket with SMT enabled, 2 NUMA nodes) with a base
> page size of 4KB, using mm-everything-2025-01-04-04-41 as the base kernel.
>
> Method:
> ======
> main() {
> ...
>
>     // code snippet to measure throughput
>     clock_gettime(CLOCK_MONOTONIC, &t1);
>     retcode = move_pages(getpid(), num_pages, pages, nodesArray,
>                          statusArray, MPOL_MF_MOVE);
>     clock_gettime(CLOCK_MONOTONIC, &t2);
>
>     // tput = num_pages*PAGE_SIZE/(t2-t1)
>
> ...
> }
>
> Measurements:
> ============
> vanilla: base kernel without the patchset
> mt:0 = MT kernel with use_mt_copy=0
> mt:1..mt:32 = MT kernel with use_mt_copy=1 and thread cnt = 1,2,...,32
>
> Measured for both configurations push_0_pull_1=0 and push_0_pull_1=1, and
> for both 4KB migration and THP migration.
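
Your method above looks reasonable to me. Just so we are measuring the same
thing, a self-contained version of such a measurement could look roughly
like the program below. This is only a sketch: the 4096-page batch size,
destination node 1, and the anonymous mmap buffer are my assumptions, not
your benchmark (build with gcc -o move_tput move_tput.c -lnuma):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>
    #include <numaif.h>
    #include <sys/mman.h>

    int main(void)
    {
            long page_size = sysconf(_SC_PAGESIZE);
            unsigned long num_pages = 4096;   /* assumed batch size */
            void **pages = malloc(num_pages * sizeof(void *));
            int *nodes = malloc(num_pages * sizeof(int));
            int *status = malloc(num_pages * sizeof(int));
            struct timespec t1, t2;
            double secs;
            long ret;

            /* Anonymous buffer, faulted in on the local node. */
            char *buf = mmap(NULL, num_pages * page_size,
                             PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (buf == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            memset(buf, 1, num_pages * page_size);

            for (unsigned long i = 0; i < num_pages; i++) {
                    pages[i] = buf + i * page_size;
                    nodes[i] = 1;             /* assumed destination node */
            }

            /* Time only the move_pages() call, as in the snippet above. */
            clock_gettime(CLOCK_MONOTONIC, &t1);
            ret = move_pages(getpid(), num_pages, pages, nodes, status,
                             MPOL_MF_MOVE);
            clock_gettime(CLOCK_MONOTONIC, &t2);
            if (ret < 0) {
                    perror("move_pages");
                    return 1;
            }

            secs = (t2.tv_sec - t1.tv_sec) + (t2.tv_nsec - t1.tv_nsec) / 1e9;
            printf("throughput: %.2f GB/s\n",
                   num_pages * page_size / secs / 1e9);

            free(pages); free(nodes); free(status);
            return 0;
    }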
>
> --------------------
> #1 push_0_pull_1 = 0 (src node CPUs are used)
>
> #1.1 THP=Never, 4KB (GB/s):
> nr_pages   vanilla   mt:0   mt:1   mt:2   mt:4   mt:8   mt:16   mt:32
> 512           1.28   1.28   1.92   1.80   2.24   2.35    2.22    2.17
> 4096          2.40   2.40   2.51   2.58   2.83   2.72    2.99    3.25
> 8192          3.18   2.88   2.83   2.69   3.49   3.46    3.57    3.80
> 16348         3.17   2.94   2.96   3.17   3.63   3.68    4.06    4.15
>
> #1.2 THP=Always, 2MB (GB/s):
> nr_pages   vanilla   mt:0   mt:1   mt:2   mt:4   mt:8   mt:16   mt:32
> 512           4.31   5.02   3.39   3.40   3.33   3.51    3.91    4.03
> 1024          7.13   4.49   3.58   3.56   3.91   3.87    4.39    4.57
> 2048          5.26   6.47   3.91   4.00   3.71   3.85    4.97    6.83
> 4096          9.93   7.77   4.58   3.79   3.93   3.53    6.41    4.77
> 8192          6.47   6.33   4.37   4.67   4.52   4.39    5.30    5.37
> 16348         7.66   8.00   5.20   5.22   5.24   5.28    6.41    7.02
> 32768         8.56   8.62   6.34   6.20   6.20   6.19    7.18    8.10
> 65536         9.41   9.40   7.14   7.15   7.15   7.19    7.96    8.89
> 262144       10.17  10.19   7.26   7.90   7.98   8.05    9.46   10.30
> 524288       10.40   9.95   7.25   7.93   8.02   8.76    9.55   10.30
>
> --------------------
> #2 push_0_pull_1 = 1 (dst node CPUs are used):
>
> #2.1 THP=Never, 4KB (GB/s):
> nr_pages   vanilla   mt:0   mt:1   mt:2   mt:4   mt:8   mt:16   mt:32
> 512           1.28   1.36   2.01   2.74   2.33   2.31    2.53    2.96
> 4096          2.40   2.84   2.94   3.04   3.40   3.23    3.31    4.16
> 8192          3.18   3.27   3.34   3.94   3.77   3.68    4.23    4.76
> 16348         3.17   3.42   3.66   3.21   3.82   4.40    4.76    4.89
>
> #2.2 THP=Always, 2MB (GB/s):
> nr_pages   vanilla   mt:0   mt:1   mt:2   mt:4   mt:8   mt:16   mt:32
> 512           4.31   5.91   4.03   3.73   4.26   4.13    4.78    3.44
> 1024          7.13   6.83   4.60   5.13   5.03   5.19    5.94    7.25
> 2048          5.26   7.09   5.20   5.69   5.83   5.73    6.85    8.13
> 4096          9.93   9.31   4.90   4.82   4.82   5.26    8.46    8.52
> 8192          6.47   7.63   5.66   5.85   5.75   6.14    7.45    8.63
> 16348         7.66  10.00   6.35   6.54   6.66   6.99    8.18   10.21
> 32768         8.56   9.78   7.06   7.41   7.76   9.02    9.55   11.92
> 65536         9.41  10.00   8.19   9.20   9.32   8.68   11.00   13.31
> 262144       10.17  11.17   9.01   9.96   9.99  10.00   11.70   14.27
> 524288       10.40  11.38   9.07   9.98  10.01  10.09   11.95   14.48
>
> Note:
> 1. For THP=Never: I'm moving 16x as many 4KB pages to keep the total size
>    the same as in your experiment with a 64KB page size.
> 2. For THP=Always: nr_pages = number of 4KB pages moved
>    (nr_pages=512 => 512 4KB pages => 1 2MB page).
>
> I'm seeing little (1.5x in some cases) to no benefit. The performance
> scaling is relatively flat across thread counts.
>
> Is it possible I'm missing something in my testing?
>
> Could the base page size difference (4KB vs 64KB) be playing a role in
> the scaling behavior? How does the performance vary with 4KB pages on your
> system?
>
> I'd be happy to work with you on investigating these differences.
> Let me know if you'd like any additional test data or if there are
> specific configurations I should try.

The results surprise me, since I was able to achieve ~9GB/s when migrating
16 2MB THPs with 16 threads on a two socket system with Xeon E5-2650 v3 @
2.30GHz (a 19.2GB/s bandwidth QPI link between the two sockets) back in
2019[1]. These are 10-year-old Haswell CPUs. And your results above show
that EPYC Zen 5 can only achieve ~4GB/s when migrating 512 2MB THPs with 16
threads. It just does not make sense.

One thing you might want to try is to set init_on_alloc=0 in your boot
parameters to use folio_zero_user() instead of GFP_ZERO to zero pages. That
might reduce the time spent on page zeroing.

I am also going to rerun the experiments locally on x86_64 boxes to see if
your results can be replicated.

Thank you for the review and for running these experiments. I really
appreciate it.
[1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/

Best Regards,
Yan, Zi