From mboxrd@z Thu Jan 1 00:00:00 1970
From: Zi Yan
To: Shivank Garg
Cc: linux-mm@kvack.org, David Rientjes, Aneesh Kumar, David Hildenbrand,
 John Hubbard, Kirill Shutemov, Matthew Wilcox, Mel Gorman,
 "Rao, Bharata Bhasker", Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
 Lei Chen, "Shukla, Santosh", "Grimm, Jon", sj@kernel.org,
 shy828301@gmail.com, Liam Howlett, Gregory Price, "Huang, Ying"
Subject: Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
Date: Thu, 09 Jan 2025 14:32:17 -0500
Message-ID: <8E1D6790-8A44-48C2-9FA5-66C7AB6CE531@nvidia.com>
In-Reply-To: <003b0818-a35e-429c-9408-5e7344e981f2@amd.com>
References: <20250103172419.4148674-1-ziy@nvidia.com>
 <600a57ff-a462-4997-a621-f919c2c4fa84@amd.com>
 <567FDE63-E84E-4B1E-85F4-4E1EB0C2CD26@nvidia.com>
 <003b0818-a35e-429c-9408-5e7344e981f2@amd.com>
MIME-Version: 1.0
Content-Type: text/plain

On 9 Jan 2025, at 13:03, Shivank Garg wrote:

> On 1/9/2025 8:34 PM, Zi Yan wrote:
>> On 9 Jan 2025, at 6:47, Shivank Garg wrote:
>>
>>> On 1/3/2025 10:54 PM, Zi Yan wrote:
>>>
>>>> 6. A better interface than copy_page_lists_mt() to allow DMA data copy
>>>> to be used as well.
>>>
>>> I think Static Calls can be a better option for this.
>>
>> This is the first time I hear about it. Based on the info I find, I agree
>> it is a great mechanism to switch between two methods globally.
>>>
>>> This will give a flexible copy interface to support both CPU and various
>>> DMA-based folio copy. A DMA-capable driver can override the default CPU
>>> copy path without any additional runtime overheads.
>>
>> Yes, supporting DMA-based folio copy is my intention too. I am happy to
>> work with you on that. Things to note are:
>> 1. The DMA engine should have more copy throughput than a single CPU
>> thread, otherwise the scatter-gather setup overheads will eliminate the
>> benefit of using the DMA engine.
>
> I agree on this.
>
>> 2. Unless the DMA engine is really beefy and can handle all possible page
>> migration requests, CPU-based migration (single- or multi-threaded) should
>> be a fallback.
>>
>> In terms of 2, I wonder how much overhead Static Calls have when switching
>> between functions. Also, a lock might be needed, since falling back to the
>> CPU might be per migrate_pages() call. Considering these two, Static Calls
>> might not work as you intended if switching between CPU and DMA is needed.
>
> You can check patches 4/5 and 5/5 for the static call implementation using
> the DMA driver:
> https://lore.kernel.org/linux-mm/20240614221525.19170-5-shivankg@amd.com
>
> There are no run-time overheads with this static call approach, as the
> update happens only during DMA driver registration/un-registration
> (dma_update_migrator()). The SRCU synchronization will ensure safety during
> updates.

I understand this part.

> It'll use static_call(_folios_copy)() for the copy path. A wrapper inside
> the DMA driver can ensure it falls back to folios_copy().
>
> Does this address your concern regarding 2?

The DMA driver will need to fall back to folios_copy() (using CPUs to copy
folios) when it thinks the DMA engine is overloaded. In my mind, the kernel
should decide whether to use a single CPU, multiple CPUs, or the DMA engine
based on CPU usage and DMA usage. As I am writing this, I realize that might
be an overhead we want to avoid, since it takes time to get CPU-usage and
DMA-usage information, and that should not be on the critical path of page
migration. A better approach might be that the CPU scheduler and the DMA
engine call dma_update_migrator() to change _folios_copy in the static call,
based on CPU usage and DMA usage.

Yes, I think Static Calls should be able to help us choose the right folio
copy method, single CPU, multiple CPUs, or DMA engine, to perform folio
copies.
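To make the discussion concrete, here is a minimal sketch of a static-call
based folio-copy dispatch with a CPU fallback. All names and signatures below
(folio_copy_backend, folios_copy_cpu, folios_copy_dma, set_folio_copy_backend)
are hypothetical, not the ones used in the patches, and a real implementation
would additionally need the SRCU synchronization mentioned above:

/* Illustrative sketch only; not the interface from the patch series. */
#include <linux/static_call.h>
#include <linux/list.h>

/* CPU-based copy, the default target and the fallback path. */
static void folios_copy_cpu(struct list_head *dst_list,
			    struct list_head *src_list)
{
	/* ... walk both lists and copy each folio with the CPU ... */
}

/*
 * DMA-based copy; a real driver would fall back to folios_copy_cpu()
 * internally when the DMA engine is overloaded or returns an error.
 */
static void folios_copy_dma(struct list_head *dst_list,
			    struct list_head *src_list)
{
	/* ... set up scatter-gather descriptors and wait for completion ... */
}

/* Default target is the CPU copy. */
DEFINE_STATIC_CALL(folio_copy_backend, folios_copy_cpu);

/* Migration fast path: compiles down to a direct call, no indirect branch. */
static void migrate_folios_copy(struct list_head *dst_list,
				struct list_head *src_list)
{
	static_call(folio_copy_backend)(dst_list, src_list);
}

/* Called at DMA driver (un)registration, off the migration fast path. */
static void set_folio_copy_backend(bool use_dma)
{
	if (use_dma)
		static_call_update(folio_copy_backend, folios_copy_dma);
	else
		static_call_update(folio_copy_backend, folios_copy_cpu);
}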
BTW, I notice that you called dmaengine_get_dma_device() in folios_copy_dma(),
which would incur a huge overhead, based on my past experience using a DMA
engine for page copy. I know it is needed to make sure the DMA device is still
present, but its cost needs to be minimized to make DMA folio copy usable.
Otherwise, the 768MB/s DMA copy throughput from your cover letter cannot
convince people to use it for page migration, since a single CPU can achieve
more than that, as you showed in the table below.

>>> main() {
>>> ...
>>>
>>>     // code snippet to measure throughput
>>>     clock_gettime(CLOCK_MONOTONIC, &t1);
>>>     retcode = move_pages(getpid(), num_pages, pages, nodesArray, statusArray, MPOL_MF_MOVE);
>>>     clock_gettime(CLOCK_MONOTONIC, &t2);
>>>
>>>     // tput = num_pages*PAGE_SIZE/(t2-t1)
>>>
>>> ...
>>> }
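For reference, a fleshed-out, self-contained version of that snippet could
look like the sketch below. It is only an approximation of the methodology
described here: the allocation pattern, node numbers, and error handling are
assumptions, and a THP=Always run would allocate one large 2MB-aligned region
rather than individual 4KB pages. Build with "gcc -O2 tput.c -lnuma".

/* tput.c: rough move_pages() throughput measurement (illustrative only). */
#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned long num_pages = (argc > 1) ? strtoul(argv[1], NULL, 0) : 512;
	int target_node = 1;	/* assumed destination NUMA node */

	void **pages = malloc(num_pages * sizeof(void *));
	int *nodes = malloc(num_pages * sizeof(int));
	int *status = malloc(num_pages * sizeof(int));
	if (!pages || !nodes || !status)
		return 1;

	/* Allocate and touch the pages so they are actually populated. */
	for (unsigned long i = 0; i < num_pages; i++) {
		pages[i] = aligned_alloc(page_size, page_size);
		if (!pages[i])
			return 1;
		memset(pages[i], 0xaa, page_size);
		nodes[i] = target_node;
	}

	struct timespec t1, t2;
	clock_gettime(CLOCK_MONOTONIC, &t1);
	long ret = move_pages(getpid(), num_pages, pages, nodes, status,
			      MPOL_MF_MOVE);
	clock_gettime(CLOCK_MONOTONIC, &t2);
	if (ret < 0) {
		perror("move_pages");
		return 1;
	}

	double secs = (t2.tv_sec - t1.tv_sec) + (t2.tv_nsec - t1.tv_nsec) / 1e9;
	double gbps = num_pages * page_size / secs / (1024.0 * 1024 * 1024);
	printf("moved %lu pages in %.6f s: %.2f GB/s\n", num_pages, secs, gbps);
	return 0;
}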
>>>
>>> Measurements:
>>> ============
>>> vanilla: base kernel without patchset
>>> mt:0 = MT kernel with use_mt_copy=0
>>> mt:1..mt:32 = MT kernel with use_mt_copy=1 and thread cnt = 1,2,...,32
>>>
>>> Measured for both configurations push_0_pull_1=0 and push_0_pull_1=1, and
>>> for 4KB migration and THP migration.
>>>
>>> --------------------
>>> #1 push_0_pull_1 = 0 (src node CPUs are used)
>>>
>>> #1.1 THP=Never, 4KB (GB/s):
>>> nr_pages  vanilla   mt:0    mt:1    mt:2    mt:4    mt:8    mt:16   mt:32
>>> 512       1.28      1.28    1.92    1.80    2.24    2.35    2.22    2.17
>>> 4096      2.40      2.40    2.51    2.58    2.83    2.72    2.99    3.25
>>> 8192      3.18      2.88    2.83    2.69    3.49    3.46    3.57    3.80
>>> 16348     3.17      2.94    2.96    3.17    3.63    3.68    4.06    4.15
>>>
>>> #1.2 THP=Always, 2MB (GB/s):
>>> nr_pages  vanilla   mt:0    mt:1    mt:2    mt:4    mt:8    mt:16   mt:32
>>> 512       4.31      5.02    3.39    3.40    3.33    3.51    3.91    4.03
>>> 1024      7.13      4.49    3.58    3.56    3.91    3.87    4.39    4.57
>>> 2048      5.26      6.47    3.91    4.00    3.71    3.85    4.97    6.83
>>> 4096      9.93      7.77    4.58    3.79    3.93    3.53    6.41    4.77
>>> 8192      6.47      6.33    4.37    4.67    4.52    4.39    5.30    5.37
>>> 16348     7.66      8.00    5.20    5.22    5.24    5.28    6.41    7.02
>>> 32768     8.56      8.62    6.34    6.20    6.20    6.19    7.18    8.10
>>> 65536     9.41      9.40    7.14    7.15    7.15    7.19    7.96    8.89
>>> 262144    10.17     10.19   7.26    7.90    7.98    8.05    9.46    10.30
>>> 524288    10.40     9.95    7.25    7.93    8.02    8.76    9.55    10.30
>>>
>>> --------------------
>>> #2 push_0_pull_1 = 1 (dst node CPUs are used):
>>>
>>> #2.1 THP=Never, 4KB (GB/s):
>>> nr_pages  vanilla   mt:0    mt:1    mt:2    mt:4    mt:8    mt:16   mt:32
>>> 512       1.28      1.36    2.01    2.74    2.33    2.31    2.53    2.96
>>> 4096      2.40      2.84    2.94    3.04    3.40    3.23    3.31    4.16
>>> 8192      3.18      3.27    3.34    3.94    3.77    3.68    4.23    4.76
>>> 16348     3.17      3.42    3.66    3.21    3.82    4.40    4.76    4.89
>>>
>>> #2.2 THP=Always, 2MB (GB/s):
>>> nr_pages  vanilla   mt:0    mt:1    mt:2    mt:4    mt:8    mt:16   mt:32
>>> 512       4.31      5.91    4.03    3.73    4.26    4.13    4.78    3.44
>>> 1024      7.13      6.83    4.60    5.13    5.03    5.19    5.94    7.25
>>> 2048      5.26      7.09    5.20    5.69    5.83    5.73    6.85    8.13
>>> 4096      9.93      9.31    4.90    4.82    4.82    5.26    8.46    8.52
>>> 8192      6.47      7.63    5.66    5.85    5.75    6.14    7.45    8.63
>>> 16348     7.66      10.00   6.35    6.54    6.66    6.99    8.18    10.21
>>> 32768     8.56      9.78    7.06    7.41    7.76    9.02    9.55    11.92
>>> 65536     9.41      10.00   8.19    9.20    9.32    8.68    11.00   13.31
>>> 262144    10.17     11.17   9.01    9.96    9.99    10.00   11.70   14.27
>>> 524288    10.40     11.38   9.07    9.98    10.01   10.09   11.95   14.48
>>>
>>> Note:
>>> 1. For THP=Never: I'm using 16X as many pages to keep the total size the
>>>    same as in your experiment with 64KB page size.
>>> 2. For THP=Always: nr_pages = number of 4KB pages moved
>>>    (nr_pages=512 => 512 4KB pages => 1 2MB page).
>>>
>>> I'm seeing little (1.5X in some cases) to no benefit. The performance
>>> scaling is relatively flat across thread counts.
>>>
>>> Is it possible I'm missing something in my testing?
>>>
>>> Could the base page size difference (4KB vs 64KB) be playing a role in
>>> the scaling behavior? How does the performance vary with 4KB pages on
>>> your system?
>>>
>>> I'd be happy to work with you on investigating these differences.
>>> Let me know if you'd like any additional test data or if there are
>>> specific configurations I should try.
>>
>> The results surprise me, since I was able to achieve ~9GB/s when migrating
>> 16 2MB THPs with 16 threads on a two-socket system with Xeon E5-2650 v3 @
>> 2.30GHz (a 19.2GB/s QPI link between the two sockets) back in 2019 [1].
>> These are 10-year-old Haswell CPUs. And your results above show that EPYC 5
>> can only achieve ~4GB/s when migrating 512 2MB THPs with 16 threads. It
>> just does not make sense.
>>
>> One thing you might want to try is to set init_on_alloc=0 in your boot
>> parameters to use folio_zero_user() instead of GFP_ZERO to zero pages. That
>> might reduce the time spent on page zeroing.
>>
>> I am also going to rerun the experiments locally on x86_64 boxes to see if
>> your results can be replicated.
>>
>> Thank you for the review and for running these experiments. I really
>> appreciate it.
>>
>> [1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
>
> Using init_on_alloc=0 gave a significant performance gain over the last
> experiment, but I'm still missing the performance scaling you observed.

It might be the difference between x86 and ARM64, but I am not 100% sure.
Based on your data below, 2 or 4 threads seem to be the sweet spot for the
multi-threaded method on AMD CPUs. BTW, what is the bandwidth between the two
sockets in your system? From Figure 10 in [1], I see the InfiniBand between
two AMD EPYC 7601 @ 2.2GHz was measured at ~12GB/s unidirectional, ~25GB/s
bidirectional. I wonder if your results below are limited by the cross-socket
link bandwidth. From my results, the NVIDIA Grace CPU can achieve high copy
throughput with more threads between two sockets; maybe part of the reason is
that its cross-socket link theoretical bandwidth is 900GB/s bidirectional.

> THP Never
> nr_pages  vanilla   mt:0    mt:1    mt:2    mt:4    mt:8    mt:16   mt:32
> 512       1.40      1.43    2.79    3.48    3.63    3.73    3.63    3.57
> 4096      2.54      3.32    3.18    4.65    4.83    5.11    5.39    5.78
> 8192      3.35      4.40    4.39    4.71    3.63    5.04    5.33    6.00
> 16348     3.76      4.50    4.44    5.33    5.41    5.41    6.47    6.41
>
> THP Always
> nr_pages  vanilla   mt:0    mt:1    mt:2    mt:4    mt:8    mt:16   mt:32
> 512       5.21      5.47    5.77    6.92    3.71    2.75    7.54    7.44
> 1024      6.10      7.65    8.12    8.41    8.87    8.55    9.13    11.36
> 2048      6.39      6.66    9.58    8.92    10.75   12.99   13.33   12.23
> 4096      7.33      10.85   8.22    13.57   11.43   10.93   12.53   16.86
> 8192      7.26      7.46    8.88    11.82   10.55   10.94   13.27   14.11
> 16348     9.07      8.53    11.82   14.89   12.97   13.22   16.14   18.10
> 32768     10.45     10.55   11.79   19.19   16.85   17.56   20.58   26.57
> 65536     11.00     11.12   13.25   18.27   16.18   16.11   19.61   27.73
> 262144    12.37     12.40   15.65   20.00   19.25   19.38   22.60   31.95
> 524288    12.44     12.33   15.66   19.78   19.06   18.96   23.31   32.29

[1] https://www.dell.com/support/kbdoc/en-us/000143393/amd-epyc-stream-hpl-infiniband-and-wrf-performance-study

Best Regards,
Yan, Zi