From mboxrd@z Thu Jan 1 00:00:00 1970
From: Zi Yan
To: Shivank Garg
Cc: linux-mm@kvack.org, David Rientjes, Aneesh Kumar, David Hildenbrand,
 John Hubbard, Kirill Shutemov, Matthew Wilcox, Mel Gorman,
 "Rao, Bharata Bhasker", Rik van Riel, RaghavendraKT, Wei Xu, Suyeon Lee,
 Lei Chen, "Shukla, Santosh", "Grimm, Jon", sj@kernel.org,
 shy828301@gmail.com, Liam Howlett, Gregory Price, "Huang, Ying"
Subject: Re: [RFC PATCH 0/5] Accelerate page migration with batching and multi threads
Date: Thu, 09 Jan 2025 14:32:17 -0500
Message-ID: <8E1D6790-8A44-48C2-9FA5-66C7AB6CE531@nvidia.com>
In-Reply-To: <003b0818-a35e-429c-9408-5e7344e981f2@amd.com>
References: <20250103172419.4148674-1-ziy@nvidia.com>
 <600a57ff-a462-4997-a621-f919c2c4fa84@amd.com>
 <567FDE63-E84E-4B1E-85F4-4E1EB0C2CD26@nvidia.com>
 <003b0818-a35e-429c-9408-5e7344e981f2@amd.com>
MIME-Version: 1.0
Content-Type: text/plain

On 9 Jan 2025, at 13:03, Shivank Garg wrote:

> On 1/9/2025 8:34 PM, Zi Yan wrote:
>> On 9 Jan 2025, at 6:47, Shivank Garg wrote:
>>
>>> On 1/3/2025 10:54 PM, Zi Yan wrote:
>>>
>>>> 6. A better interface than copy_page_lists_mt() to allow DMA data copy
>>>> to be used as well.
>>>
>>> I think Static Calls can be a better option for this.
>>
>> This is the first time I hear about it. Based on the info I find, I agree
>> it is a great mechanism to switch between two methods globally.
>>>
>>> This will give a flexible copy interface to support both CPU and various
>>> DMA-based folio copy. A DMA-capable driver can override the default CPU
>>> copy path without any additional runtime overheads.
>>
>> Yes, supporting DMA-based folio copy is my intention too. I am happy to
>> work with you on that. Things to note are:
>> 1. The DMA engine should have more copy throughput than a single CPU
>> thread, otherwise the scatter-gather setup overheads will eliminate the
>> benefit of using the DMA engine.
>
> I agree on this.
>
>> 2. Unless the DMA engine is really beefy and can handle all possible page
>> migration requests, CPU-based migration (single- or multi-threaded) should
>> be a fallback.
>>
>> In terms of 2, I wonder how much overhead Static Calls have when switching
>> between functions. Also, a lock might be needed, since falling back to the
>> CPU might be per migrate_pages() call. Considering these two, Static Calls
>> might not work as you intended if switching between CPU and DMA is needed.
>
> You can check patches 4/5 and 5/5 for the static call implementation using
> the DMA driver:
> https://lore.kernel.org/linux-mm/20240614221525.19170-5-shivankg@amd.com
>
> There are no run-time overheads with this static call approach, as the
> update happens only during DMA driver registration/un-registration
> (dma_update_migrator()). The SRCU synchronization will ensure safety during
> updates.

I understand this part.

> It'll use static_call(_folios_copy)() for the copy path. A wrapper inside
> the DMA driver can ensure it falls back to folios_copy().
>
> Does this address your concern regarding 2?

The DMA driver will need to fall back to folios_copy() (using CPUs to copy
folios) when it thinks the DMA engine is overloaded. In my mind, the kernel
should decide whether to use a single CPU, multiple CPUs, or the DMA engine
based on CPU usage and DMA usage. As I am writing this, I realize that might
be an overhead we want to avoid, since it takes time to get CPU-usage and
DMA-usage information, and that should not be on the critical path of page
migration. A better approach might be that the CPU scheduler and the DMA
engine call dma_update_migrator() to change _folios_copy in the static call,
based on CPU usage and DMA usage.

Yes, I think Static Calls should be able to help us choose the right folio
copy method, single CPU, multiple CPUs, or DMA engine, to perform folio
copies.
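To make the discussion concrete, here is a minimal sketch of a static-call
based folio-copy dispatch with a CPU fallback. All names and signatures below
(folio_copy_backend, folios_copy_cpu, folios_copy_dma, set_folio_copy_backend)
are hypothetical, not the ones used in the patches, and a real implementation
would additionally need the SRCU synchronization mentioned above:

/* Illustrative sketch only; not the interface from the patch series. */
#include <linux/static_call.h>
#include <linux/list.h>

/* CPU-based copy, the default target and the fallback path. */
static void folios_copy_cpu(struct list_head *dst_list,
			    struct list_head *src_list)
{
	/* ... walk both lists and copy each folio with the CPU ... */
}

/*
 * DMA-based copy; a real driver would fall back to folios_copy_cpu()
 * internally when the DMA engine is overloaded or returns an error.
 */
static void folios_copy_dma(struct list_head *dst_list,
			    struct list_head *src_list)
{
	/* ... set up scatter-gather descriptors and wait for completion ... */
}

/* Default target is the CPU copy. */
DEFINE_STATIC_CALL(folio_copy_backend, folios_copy_cpu);

/* Migration fast path: compiles down to a direct call, no indirect branch. */
static void migrate_folios_copy(struct list_head *dst_list,
				struct list_head *src_list)
{
	static_call(folio_copy_backend)(dst_list, src_list);
}

/* Called at DMA driver (un)registration, off the migration fast path. */
static void set_folio_copy_backend(bool use_dma)
{
	if (use_dma)
		static_call_update(folio_copy_backend, folios_copy_dma);
	else
		static_call_update(folio_copy_backend, folios_copy_cpu);
}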
BTW, I notice that you called dmaengine_get_dma_device() in folios_copy_dma(),
which would incur a huge overhead, based on my past experience using a DMA
engine for page copy. I know it is needed to make sure the DMA device is still
present, but its cost needs to be minimized to make DMA folio copy usable.
Otherwise, the 768MB/s DMA copy throughput from your cover letter cannot
convince people to use it for page migration, since a single CPU can achieve
more than that, as you showed in the table below.

>>> main() {
>>> ...
>>>
>>>     // code snippet to measure throughput
>>>     clock_gettime(CLOCK_MONOTONIC, &t1);
>>>     retcode = move_pages(getpid(), num_pages, pages, nodesArray, statusArray, MPOL_MF_MOVE);
>>>     clock_gettime(CLOCK_MONOTONIC, &t2);
>>>
>>>     // tput = num_pages*PAGE_SIZE/(t2-t1)
>>>
>>> ...
>>> }
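For reference, a fleshed-out, self-contained version of that snippet could
look like the sketch below. It is only an approximation of the methodology
described here: the allocation pattern, node numbers, and error handling are
assumptions, and a THP=Always run would allocate one large 2MB-aligned region
rather than individual 4KB pages. Build with "gcc -O2 tput.c -lnuma".

/* tput.c: rough move_pages() throughput measurement (illustrative only). */
#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned long num_pages = (argc > 1) ? strtoul(argv[1], NULL, 0) : 512;
	int target_node = 1;	/* assumed destination NUMA node */

	void **pages = malloc(num_pages * sizeof(void *));
	int *nodes = malloc(num_pages * sizeof(int));
	int *status = malloc(num_pages * sizeof(int));
	if (!pages || !nodes || !status)
		return 1;

	/* Allocate and touch the pages so they are actually populated. */
	for (unsigned long i = 0; i < num_pages; i++) {
		pages[i] = aligned_alloc(page_size, page_size);
		if (!pages[i])
			return 1;
		memset(pages[i], 0xaa, page_size);
		nodes[i] = target_node;
	}

	struct timespec t1, t2;
	clock_gettime(CLOCK_MONOTONIC, &t1);
	long ret = move_pages(getpid(), num_pages, pages, nodes, status,
			      MPOL_MF_MOVE);
	clock_gettime(CLOCK_MONOTONIC, &t2);
	if (ret < 0) {
		perror("move_pages");
		return 1;
	}

	double secs = (t2.tv_sec - t1.tv_sec) + (t2.tv_nsec - t1.tv_nsec) / 1e9;
	double gbps = num_pages * page_size / secs / (1024.0 * 1024 * 1024);
	printf("moved %lu pages in %.6f s: %.2f GB/s\n", num_pages, secs, gbps);
	return 0;
}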
>>>
>>> Measurements:
>>> ============
>>> vanilla: base kernel without patchset
>>> mt:0 = MT kernel with use_mt_copy=0
>>> mt:1..mt:32 = MT kernel with use_mt_copy=1 and thread cnt = 1,2,...,32
>>>
>>> Measured for both configurations push_0_pull_1=0 and push_0_pull_1=1, and
>>> for 4KB migration and THP migration.
>>>
>>> --------------------
>>> #1 push_0_pull_1 = 0 (src node CPUs are used)
>>>
>>> #1.1 THP=Never, 4KB (GB/s):
>>> nr_pages  vanilla   mt:0    mt:1    mt:2    mt:4    mt:8    mt:16   mt:32
>>> 512       1.28      1.28    1.92    1.80    2.24    2.35    2.22    2.17
>>> 4096      2.40      2.40    2.51    2.58    2.83    2.72    2.99    3.25
>>> 8192      3.18      2.88    2.83    2.69    3.49    3.46    3.57    3.80
>>> 16348     3.17      2.94    2.96    3.17    3.63    3.68    4.06    4.15
>>>
>>> #1.2 THP=Always, 2MB (GB/s):
>>> nr_pages  vanilla   mt:0    mt:1    mt:2    mt:4    mt:8    mt:16   mt:32
>>> 512       4.31      5.02    3.39    3.40    3.33    3.51    3.91    4.03
>>> 1024      7.13      4.49    3.58    3.56    3.91    3.87    4.39    4.57
>>> 2048      5.26      6.47    3.91    4.00    3.71    3.85    4.97    6.83
>>> 4096      9.93      7.77    4.58    3.79    3.93    3.53    6.41    4.77
>>> 8192      6.47      6.33    4.37    4.67    4.52    4.39    5.30    5.37
>>> 16348     7.66      8.00    5.20    5.22    5.24    5.28    6.41    7.02
>>> 32768     8.56      8.62    6.34    6.20    6.20    6.19    7.18    8.10
>>> 65536     9.41      9.40    7.14    7.15    7.15    7.19    7.96    8.89
>>> 262144    10.17     10.19   7.26    7.90    7.98    8.05    9.46    10.30
>>> 524288    10.40     9.95    7.25    7.93    8.02    8.76    9.55    10.30
>>>
>>> --------------------
>>> #2 push_0_pull_1 = 1 (dst node CPUs are used):
>>>
>>> #2.1 THP=Never, 4KB (GB/s):
>>> nr_pages  vanilla   mt:0    mt:1    mt:2    mt:4    mt:8    mt:16   mt:32
>>> 512       1.28      1.36    2.01    2.74    2.33    2.31    2.53    2.96
>>> 4096      2.40      2.84    2.94    3.04    3.40    3.23    3.31    4.16
>>> 8192      3.18      3.27    3.34    3.94    3.77    3.68    4.23    4.76
>>> 16348     3.17      3.42    3.66    3.21    3.82    4.40    4.76    4.89
>>>
>>> #2.2 THP=Always, 2MB (GB/s):
>>> nr_pages  vanilla   mt:0    mt:1    mt:2    mt:4    mt:8    mt:16   mt:32
>>> 512       4.31      5.91    4.03    3.73    4.26    4.13    4.78    3.44
>>> 1024      7.13      6.83    4.60    5.13    5.03    5.19    5.94    7.25
>>> 2048      5.26      7.09    5.20    5.69    5.83    5.73    6.85    8.13
>>> 4096      9.93      9.31    4.90    4.82    4.82    5.26    8.46    8.52
>>> 8192      6.47      7.63    5.66    5.85    5.75    6.14    7.45    8.63
>>> 16348     7.66      10.00   6.35    6.54    6.66    6.99    8.18    10.21
>>> 32768     8.56      9.78    7.06    7.41    7.76    9.02    9.55    11.92
>>> 65536     9.41      10.00   8.19    9.20    9.32    8.68    11.00   13.31
>>> 262144    10.17     11.17   9.01    9.96    9.99    10.00   11.70   14.27
>>> 524288    10.40     11.38   9.07    9.98    10.01   10.09   11.95   14.48
>>>
>>> Note:
>>> 1. For THP=Never: I'm using 16X as many pages to keep the total size the
>>>    same as in your experiment with 64KB page size.
>>> 2. For THP=Always: nr_pages = number of 4KB pages moved
>>>    (nr_pages=512 => 512 4KB pages => 1 2MB page).
>>>
>>> I'm seeing little (1.5X in some cases) to no benefit. The performance
>>> scaling is relatively flat across thread counts.
>>>
>>> Is it possible I'm missing something in my testing?
>>>
>>> Could the base page size difference (4KB vs 64KB) be playing a role in
>>> the scaling behavior? How does the performance vary with 4KB pages on
>>> your system?
>>>
>>> I'd be happy to work with you on investigating these differences.
>>> Let me know if you'd like any additional test data or if there are
>>> specific configurations I should try.
>>
>> The results surprise me, since I was able to achieve ~9GB/s when migrating
>> 16 2MB THPs with 16 threads on a two-socket system with Xeon E5-2650 v3 @
>> 2.30GHz (a 19.2GB/s QPI link between the two sockets) back in 2019 [1].
>> These are 10-year-old Haswell CPUs. And your results above show that EPYC 5
>> can only achieve ~4GB/s when migrating 512 2MB THPs with 16 threads. It
>> just does not make sense.
>>
>> One thing you might want to try is to set init_on_alloc=0 in your boot
>> parameters to use folio_zero_user() instead of GFP_ZERO to zero pages. That
>> might reduce the time spent on page zeroing.
>>
>> I am also going to rerun the experiments locally on x86_64 boxes to see if
>> your results can be replicated.
>>
>> Thank you for the review and for running these experiments. I really
>> appreciate it.
>>
>> [1] https://lore.kernel.org/linux-mm/20190404020046.32741-1-zi.yan@sent.com/
>
> Using init_on_alloc=0 gave a significant performance gain over the last
> experiment, but I'm still missing the performance scaling you observed.

It might be the difference between x86 and ARM64, but I am not 100% sure.
Based on your data below, 2 or 4 threads seem to be the sweet spot for the
multi-threaded method on AMD CPUs. BTW, what is the bandwidth between the two
sockets in your system? From Figure 10 in [1], I see the InfiniBand between
two AMD EPYC 7601 @ 2.2GHz was measured at ~12GB/s unidirectional, ~25GB/s
bidirectional. I wonder if your results below are limited by the cross-socket
link bandwidth. From my results, the NVIDIA Grace CPU can achieve high copy
throughput with more threads between two sockets; maybe part of the reason is
that its cross-socket link theoretical bandwidth is 900GB/s bidirectional.

> THP Never
> nr_pages  vanilla   mt:0    mt:1    mt:2    mt:4    mt:8    mt:16   mt:32
> 512       1.40      1.43    2.79    3.48    3.63    3.73    3.63    3.57
> 4096      2.54      3.32    3.18    4.65    4.83    5.11    5.39    5.78
> 8192      3.35      4.40    4.39    4.71    3.63    5.04    5.33    6.00
> 16348     3.76      4.50    4.44    5.33    5.41    5.41    6.47    6.41
>
> THP Always
> nr_pages  vanilla   mt:0    mt:1    mt:2    mt:4    mt:8    mt:16   mt:32
> 512       5.21      5.47    5.77    6.92    3.71    2.75    7.54    7.44
> 1024      6.10      7.65    8.12    8.41    8.87    8.55    9.13    11.36
> 2048      6.39      6.66    9.58    8.92    10.75   12.99   13.33   12.23
> 4096      7.33      10.85   8.22    13.57   11.43   10.93   12.53   16.86
> 8192      7.26      7.46    8.88    11.82   10.55   10.94   13.27   14.11
> 16348     9.07      8.53    11.82   14.89   12.97   13.22   16.14   18.10
> 32768     10.45     10.55   11.79   19.19   16.85   17.56   20.58   26.57
> 65536     11.00     11.12   13.25   18.27   16.18   16.11   19.61   27.73
> 262144    12.37     12.40   15.65   20.00   19.25   19.38   22.60   31.95
> 524288    12.44     12.33   15.66   19.78   19.06   18.96   23.31   32.29

[1] https://www.dell.com/support/kbdoc/en-us/000143393/amd-epyc-stream-hpl-infiniband-and-wrf-performance-study

Best Regards,
Yan, Zi