From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org,
	yosry.ahmed@linux.dev, nphamcs@gmail.com, chengming.zhou@linux.dev,
	usamaarif642@gmail.com, ryan.roberts@arm.com, 21cnbao@gmail.com,
	ying.huang@linux.alibaba.com, akpm@linux-foundation.org,
	senozhatsky@chromium.org, sj@kernel.org, kasong@tencent.com,
	linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com,
	"Accardi, Kristen C", "Gomes, Vinicius"
Cc: "Feghali, Wajdi K", "Gopal, Vinodh", "Sridhar, Kanchana P"
Subject: RE: [PATCH v13 00/22] zswap compression batching with optimized
	iaa_crypto driver
Date: Thu, 13 Nov 2025 18:14:27 +0000
In-Reply-To: <20251104091235.8793-1-kanchana.p.sridhar@intel.com>
References: <20251104091235.8793-1-kanchana.p.sridhar@intel.com>
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0

> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Tuesday, November 4, 2025 1:12 AM
> To: linux-kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> yosry.ahmed@linux.dev; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com;
> linux-crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C; Gomes, Vinicius
> Cc: Feghali, Wajdi K; Gopal, Vinodh; Sridhar, Kanchana P
> Subject: [PATCH v13 00/22] zswap compression batching with optimized
> iaa_crypto driver
>
> v13: zswap compression batching with optimized iaa_crypto driver
> =================================================================
> This updated patch-series further generalizes the batching implementation
> of zswap_compress() for non-batching and batching compressors. It makes
> sure that the bulk allocation of zswap entries preserves the current
> behavior of adding each entry to the LRU list of the page's nid.
>
> Based on Herbert's suggestions, the batching interfaces from zswap to
> crypto, from crypto to iaa_crypto, and the batching implementation within
> iaa_crypto now use the folio directly as the source (sg_page_iter for
> retrieving pages), and destination SG lists. A unit_size has been added to
> struct acomp_req, with kernel users such as zswap using the new
> acomp_request_set_unit_size() API to set the unit size to use while
> breaking down the request's src/dst scatterlists. zswap sets the unit_size
> to PAGE_SIZE.

Hi Nhat, Yosry, Herbert,

I just wanted to follow up on whether there are other code review comments
or suggestions you have on this latest patch set. Thanks very much for your
time in reviewing and improving the patch-series.

Nhat, I will make the change to the struct zswap_entry bit-fields to be
macro-defined constants, either as an update to this series, or submit a
separate patch with this change if that's OK with you.

Thanks,
Kanchana

>
> Following Andrew's suggestion, the next two paragraphs emphasize
> generality and alignment with current kernel efforts.
>
> Architectural considerations for the zswap batching framework:
> ===============================================================
> We have designed the zswap batching framework to be hardware-agnostic. It
> has no dependencies on Intel-specific features and can be leveraged by any
> hardware accelerator or software-based compressor. In other words, the
> framework is open and inclusive by design.
>
> Other ongoing work that can use batching:
> =========================================
> This patch-series demonstrates the performance benefits of compress
> batching when used in zswap_store() of large folios. shrink_folio_list()
> "reclaim batching" of any-order folios is the next major work that uses
> this zswap compress batching framework: our testing of kernel_compilation
> with writeback and the zswap shrinker indicates 10X fewer pages get
> written back when we reclaim 32 folios as a batch, as compared to one
> folio at a time; this is with deflate-iaa and with zstd. We expect to
> submit a patch-series with this data and the resulting performance
> improvements shortly. Reclaim batching relieves memory pressure faster
> than reclaiming one folio at a time, and hence alleviates the need to
> scan slab memory for writeback.
>
> Many thanks to Nhat for suggesting ideas on using batching with the
> ongoing kcompressd work, as well as beneficially using decompression
> batching & block IO batching to improve zswap writeback efficiency.
>
> Experiments with the kernel compilation benchmark (allmod config) that
> combine zswap compress batching, reclaim batching, swapin_readahead()
> decompression batching of prefetched pages, and writeback batching show
> that 0 pages are written back to disk with deflate-iaa and zstd. For
> comparison, the baselines for these compressors see 200K-800K pages
> written to disk.
>
> To summarize, these are future clients of the batching framework:
>
> - shrink_folio_list() reclaim batching of multiple folios:
>   Implemented, will submit patch-series.
> - zswap writeback with decompress batching:
>   Implemented, will submit patch-series.
> - zram:
>   Implemented, will submit patch-series.
> - kcompressd:
>   Not yet implemented.
> - file systems:
>   Not yet implemented.
> - swapin_readahead() decompression batching of prefetched pages:
>   Implemented, will submit patch-series.
>
>
> iaa_crypto Driver Rearchitecting and Optimizations:
> ===================================================
>
> The most significant highlight of v13 is a new, lightweight and highly
> optimized iaa_crypto driver, resulting directly in the latency and
> throughput improvements noted later in this cover letter.
>
> 1) Better stability and functional versatility, to support zswap with
>    better performance on different Intel platforms.
>
>    a) Patches 0002, 0005 and 0011 together resolve a race condition in
>       mainline v6.15, reported from internal validation, when IAA
>       wqs/devices are disabled while workloads are using IAA.
>
>    b) Patch 0002 introduces a new architecture for mapping cores to
>       IAAs based on packages instead of NUMA nodes, and generalizes
>       how WQs are used: as package-level shared resources for all
>       same-package cores (default for compress WQs), or dedicated to
>       mapped cores (default for decompress WQs). Further, users are
>       able to configure multiple WQs and specify how many of those are
>       for compress jobs only vs. decompress jobs only. sysfs iaa_crypto
>       driver parameters can be used to change the default settings for
>       performance tuning.
>
>    c) idxd descriptor allocation moved from blocking to non-blocking
>       with retry limits and mitigations if limits are exceeded.
>
>    d) Code cleanup for readability and clearer code flow.
>
>    e) Fixes IAA re-registration errors upon disabling/enabling IAA wqs
>       and devices that exist in mainline v6.15.
>
>    f) Addition of a layer that encapsulates iaa_crypto's core
>       functionality to rely only on idxd, dma and scatterlists, to
>       provide clean interfaces to crypto_acomp.
>
>    g) New Dynamic compression mode for Granite Rapids to get a better
>       compression ratio, by echo-ing 'deflate-iaa-dynamic' as the zswap
>       compressor.
>
>    h) New crypto_acomp API crypto_acomp_batch_size() that returns the
>       driver's max batch size if the driver has registered a batch_size
>       greater than 1; or 1 if there is no driver-specific definition of
>       batch_size.
>
>       Accordingly, iaa_crypto sets the acomp_alg batch_size to its
>       internal IAA_CRYPTO_MAX_BATCH_SIZE for fixed and dynamic modes.
>
> 2) Performance optimizations (please refer to the latency data per
>    optimization in the commit logs):
>
>    a) Distributing [de]compress jobs in a round-robin manner to available
>       IAAs on the package.
>
>    b) Replacing the compute-intensive iaa_wq_get()/iaa_wq_put() with a
>       percpu_ref in struct iaa_wq, thereby eliminating acquiring a
>       spinlock in the fast path, while using a combination of the
>       iaa_crypto_enabled atomic with spinlocks in the slow path to
>       ensure the compress/decompress code sees a consistent state of
>       the wq tables.
>
>    c) Directly call movdir64b for non-irq use cases, i.e., the most
>       common usage. Avoid the overhead of irq-specific computes in
>       idxd_submit_desc() to gain latency.
>
>    d) Batching of compressions/decompressions using an async submit-poll
>       mechanism to derive the benefits of hardware parallelism.
>
>    e) Batching compressors need to manage their own "requests"
>       abstraction, removing this driver-specific aspect from being
>       managed by kernel users such as zswap. iaa_crypto maintains
>       per-CPU "struct iaa_req **reqs" to submit multiple jobs to the
>       hardware accelerator to run in parallel.
>
>    f) Modifies the iaa_crypto batching API and its implementation to
>       expect a src SG list that contains the batch's pages and a dst SG
>       list that has multiple scatterlists for the batch's output buffers.
>
>    g) Submit the two largest data buffers first for decompression
>       batching, so that the longest-running jobs get a head start,
>       reducing latency for the batch.
>
> 3) Compress/decompress batching is implemented using SG lists as the
>    batching interface.
>
>
> Main Changes in Zswap Compression Batching:
> ===========================================
>
> Note to zswap maintainers:
> --------------------------
> Patches 19 and 20 can be reviewed and improved/merged independently
> of this series, since they are zswap centric. These 2 patches help
> batching, but the crypto_acomp_batch_size() from the iaa_crypto commits
> in this series is not a requirement, unlike patches 21-22.
>
> 1) v13 preserves the pool acomp_ctx resources creation/deletion
>    simplification of v11, namely, lasting from pool creation to deletion,
>    persisting through CPU hot[un]plug operations. Further, zswap no
>    longer needs to create multiple "struct acomp_req" in the per-CPU
>    acomp_ctx. zswap only needs to manage multiple "u8 **buffers".
>
> 2) We store the compressor's batch-size (@pool->compr_batch_size)
>    directly in struct zswap_pool for quick retrieval in the
>    zswap_store() fast path.
>
> 3) Optimizations to not cause regressions in software compressors with
>    the introduction of the new unified zswap_compress() framework that
>    implements compression batching for all compressors. These
>    optimizations help recover the performance for non-batching
>    compressors:
>
>    a) kmem_cache_alloc_bulk(), kmem_cache_free_bulk() to allocate/free
>       batches of zswap_entry-s. These kmem_cache APIs allow allocator
>       optimizations with internal locks for multiple allocations.
>
>    b) The page's nid is stored in a new nid field added to zswap_entry,
>       so that zswap_lru_add()/zswap_lru_del() will add/delete the entry
>       from the LRU list of the page's nid. This preserves the current
>       behavior with respect to the shrinker.
>
>    c) The zswap_entry is written right after it is allocated, without
>       modifying the publishing order. This avoids different code blocks
>       in zswap_store_pages() having to bring the zswap_entries to the
>       cache for writing, potentially evicting other working-set
>       structures and impacting performance.
>
>    d) ZSWAP_MAX_BATCH_SIZE is used as the batch-size for software
>       compressors, since this gives the best performance with zstd.
>
>    e) Minimize branches in zswap_compress().
>
> 4) During pool creation, these key additions are allocated as part of the
>    per-CPU acomp_ctx so as to recover performance with the new,
>    generalized SG-lists-based zswap_compress() batching interface:
>
>    a) An sg_table "acomp_ctx->sg_outputs" is allocated to contain the
>       compressor's batch-size number of SG lists that will contain the
>       destination buffers/lengths after batch compression.
>    b) The per-CPU destination buffers are mapped to the per-CPU SG
>       lists: this needs to be done only once, and optimizes performance.
>
> 5) A unified zswap_compress() API is added to compress multiple pages.
>    Thanks to Nhat, Yosry and Johannes for their helpful suggestions to
>    accomplish this.
>
> 6) Finally, zswap_compress() has been re-written to incorporate Herbert's
>    suggestions to use source folios and output SG lists for batching. The
>    new zswap_compress() code has been made as generic to software and
>    batching compressors as possible, so that it is easy to read and
>    maintain. The recent changes related to PAGE_SIZE dst buffers, zsmalloc
>    and incompressible pages have been incorporated into the batched
>    zswap_compress() as well. To resolve regressions with zstd, I took the
>    liberty of not explicitly checking for dlen == 0 and dlen > PAGE_SIZE
>    (as in the mainline); instead, expecting that a negative err value will
>    be returned by the software compressor in such cases.
>
>
> Compression Batching:
> =====================
>
> This patch-series introduces batch compression of pages in large folios
> to improve zswap swapout latency. It preserves the existing zswap
> protocols for non-batching software compressors by calling crypto_acomp
> sequentially per page in the batch. Additionally, in support of hardware
> accelerators that can process a batch as an integral unit, the
> patch-series allows zswap to call crypto_acomp without API changes, for
> compressors that intrinsically support batching. The zswap_compress()
> code has very minimal special-casing for software/batching compressors.
>
> The patch series provides a proof point by using the Intel Analytics
> Accelerator (IAA) for implementing the compress/decompress batching API
> using hardware parallelism in the iaa_crypto driver, and another proof
> point with a sequential software compressor, zstd.
>
> SUMMARY:
> ========
>
>   The first proof point is to test with IAA using a sequential call
>   (fully synchronous, compress one page at a time) vs. a batching call
>   (fully asynchronous, submit a batch to IAA for parallel compression,
>   then poll for completion statuses).
>
>   The performance testing data with 30 usemem processes/64K folios
>   shows 62% throughput gains and 28% elapsed/sys time reductions with
>   deflate-iaa; and a 5% sys time reduction with zstd for a small
>   throughput increase. For PMD folios, a 67% throughput gain and a 23%
>   elapsed/sys time reduction is seen.
>
>   The kernel compilation test with 64K folios, using 32 threads and the
>   zswap shrinker_enabled set to "N", demonstrates similar improvements:
>   zswap_store() of large folios using IAA compress batching improves
>   the workload performance by 3.5% and reduces sys time by 6% as
>   compared to IAA sequential. For zstd, compress batching improves
>   workload performance by 3.4% and reduces sys time by 1.8% as compared
>   to sequentially calling zswap_compress() per page in a folio.
>
>   The main takeaway from usemem, a workload that is mostly compression
>   dominated (very few swapins), is that the higher the number of
>   batches, such as with larger folios, the more the benefit of batching
>   cost amortization, as shown by the PMD usemem data. This aligns well
>   with the future direction for batching.
>
>   The second proof point is to make sure that software algorithms such
>   as zstd do not regress. The data indicates that for sequential
>   software algorithms a performance gain is achieved.
>
>   With the performance optimizations implemented in patches 21-22 of v13:
>
>   * zstd usemem metrics with 64K folios are within the range of
>     variation, with a slight sys time improvement. zstd usemem30
>     workload performance with PMD folios improves by 6% and sys time
>     reduces by 8%, for throughput comparable to the baseline.
>
>   * With kernel compilation, I used zstd without the zswap shrinker to
>     enable more direct comparisons with the changes in this series.
>     Subsequent patch series, which I expect to submit in collaboration
>     with Nhat, will enable the zswap shrinker to quantify the benefits
>     of decompression batching during writeback. With this series'
>     compression batching within large folios, we get a 6%/1.8%
>     reduction in sys time and a 3.5%/3.4% improvement in workload
>     performance with 64K folios for deflate-iaa/zstd respectively.
>
>   These optimizations pertain to ensuring common code paths and removing
>   redundant branches/computes. Additionally, using the batching code for
>   non-batching compressors to sequentially compress/store batches of up
>   to ZSWAP_MAX_BATCH_SIZE pages seems to help, most likely due to the
>   cache locality of working-set structures such as the array of
>   zswap_entry-s for the batch.
>
>   Our internal validation of zstd with the batching interface vs. IAA
>   with the batching interface on Emerald Rapids has shown that IAA
>   compress/decompress batching gives 21.3% more memory savings as
>   compared to zstd, for a 5% performance loss as compared to the
>   baseline without any memory pressure. IAA batching demonstrates more
>   than 2X the memory savings obtained by zstd at this 95% performance
>   KPI. The compression ratio with IAA is 2.23, and with zstd 2.96. Even
>   with this compression ratio deficit for IAA, batching is extremely
>   beneficial. As we improve the compression ratio of the IAA
>   accelerator, we expect to see even better memory savings with IAA as
>   compared to software compressors.
>
>
> Batching Roadmap:
> =================
>
> 1) Compression batching within large folios (this series).
>
> 2) zswap writeback decompression batching:
>
>    This is being co-developed with Nhat Pham, and shows promising
>    results. We plan to submit an RFC shortly.
>
> 3) Reclaim batching of hybrid folios:
>
>    We can expect to see even more significant performance and throughput
>    improvements if we use the parallelism offered by IAA to do reclaim
>    batching of 4K/large folios (really any-order folios), and use the
>    zswap_store() high-throughput compression pipeline to batch-compress
>    pages comprising these folios, not just batching within large folios.
>    This is the reclaim batching patch 13 in v1, which we expect to
>    submit in a separate patch-series. As mentioned earlier, reclaim
>    batching reduces the # of writeback pages by 10X for zstd and
>    deflate-iaa.
>
> 4) swapin_readahead() decompression batching:
>
>    We have developed a zswap load batching interface to be used
>    for parallel decompression batching, using swapin_readahead().
>
>    These capabilities are architected so as to be useful to zswap and
>    zram. We have integrated these components with zram and expect to
>    submit an RFC soon.
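>
> To make the store-side flow described above concrete, here is a
> simplified, illustrative sketch of the batching loop this cover letter
> describes (and which Patch 21 below details): folios are stored in
> batches of pool->compr_batch_size for batching compressors, or
> ZSWAP_MAX_BATCH_SIZE for sequential ones. This is an outline only, not
> the actual patch code; the helper and field names are stand-ins that
> mirror the description:
>
>         /* Illustrative sketch only -- not the code added by this series. */
>         #define ZSWAP_MAX_BATCH_SIZE 8U
>
>         static bool zswap_store_folio_in_batches(struct folio *folio,
>                                                  struct zswap_pool *pool)
>         {
>                 long nr_pages = folio_nr_pages(folio);
>                 /* Batching compressors advertise > 1; software ones use 1. */
>                 unsigned int batch = pool->compr_batch_size > 1 ?
>                                      pool->compr_batch_size :
>                                      ZSWAP_MAX_BATCH_SIZE;
>                 long start;
>
>                 for (start = 0; start < nr_pages; start += batch) {
>                         long nr = min_t(long, batch, nr_pages - start);
>
>                         /*
>                          * zswap_store_pages(): bulk-allocate entries for
>                          * the batch, compress pages [start, start + nr),
>                          * then publish the entries to the xarray/LRU.
>                          */
>                         if (!zswap_store_pages(folio, start, nr, pool))
>                                 return false;
>                 }
>                 return true;
>         }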
>
>
> v13 Performance Summary:
> ========================
>
> This is a performance testing summary of results with usemem30
> (30 usemem processes running in a cgroup limited at 150G, each trying
> to allocate 10G).
>
> usemem30 with 64K folios:
> =========================
>
> zswap shrinker_enabled = N.
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             deflate-iaa       deflate-iaa  IAA Batching
>                                                              vs.
>                                                              IAA Sequential
>  ------------------------------------------------------------------------
>  Total throughput (KB/s)        6,118,675         9,901,216     62%
>  Average throughput (KB/s)        203,955           330,040     62%
>  elapsed time (sec)                  98.94             70.90    -28%
>  sys time (sec)                   2,379.29          1,686.18    -29%
>  ------------------------------------------------------------------------
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             zstd              zstd         v13 zstd
>                                                              improvement
>  ------------------------------------------------------------------------
>  Total throughput (KB/s)        5,983,561         6,003,851     0.3%
>  Average throughput (KB/s)        199,452           200,128     0.3%
>  elapsed time (sec)                 100.93             96.62    -4.3%
>  sys time (sec)                   2,532.49          2,395.83    -5%
>  ------------------------------------------------------------------------
>
> usemem30 with 2M folios:
> ========================
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             deflate-iaa       deflate-iaa  IAA Batching
>                                                              vs.
>                                                              IAA Sequential
>  ------------------------------------------------------------------------
>  Total throughput (KB/s)        6,309,635        10,558,225     67%
>  Average throughput (KB/s)        210,321           351,940     67%
>  elapsed time (sec)                  88.70             67.84    -24%
>  sys time (sec)                   2,059.83          1,581.07    -23%
>  ------------------------------------------------------------------------
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             zstd              zstd         v13 zstd
>                                                              improvement
>  ------------------------------------------------------------------------
>  Total throughput (KB/s)        6,562,687         6,567,946     0.1%
>  Average throughput (KB/s)        218,756           218,931     0.1%
>  elapsed time (sec)                  94.69             88.79    -6%
>  sys time (sec)                   2,253.97          2,083.43    -8%
>  ------------------------------------------------------------------------
>
>
> This is a performance testing summary of results with the
> kernel_compilation test (allmod config, 32 cores, cgroup limited to 2G).
>
> zswap shrinker_enabled = N.
>
> kernel_compilation with 64K folios:
> ===================================
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             deflate-iaa       deflate-iaa  IAA Batching
>                                                              vs.
>                                                              IAA Sequential
>  ------------------------------------------------------------------------
>  real_sec                           836.64            806.94    -3.5%
>  sys_sec                          3,897.57          3,661.83    -6%
>  ------------------------------------------------------------------------
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             zstd              zstd         Improvement
>  ------------------------------------------------------------------------
>  real_sec                           880.62            850.41    -3.4%
>  sys_sec                          5,171.90          5,076.51    -1.8%
>  ------------------------------------------------------------------------
>
>
> kernel_compilation with PMD folios:
> ===================================
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             deflate-iaa       deflate-iaa  IAA Batching
>                                                              vs.
>                                                              IAA Sequential
>  ------------------------------------------------------------------------
>  real_sec                           818.48            779.67    -4.7%
>  sys_sec                          4,226.52          4,245.18     0.4%
>  ------------------------------------------------------------------------
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             zstd              zstd         Improvement
>  ------------------------------------------------------------------------
>  real_sec                           888.45            849.54    -4.4%
>  sys_sec                          5,866.72          5,847.17    -0.3%
>  ------------------------------------------------------------------------
>
>
>
> The patch-series is organized as follows:
> =========================================
>
> 1) crypto acomp & iaa_crypto driver enablers for batching: Relevant
>    patches are tagged with "crypto:" in the subject:
>
>    Patch 1) Reorganizes the iaa_crypto driver code into logically related
>             sections and avoids forward declarations, in order to
>             facilitate subsequent iaa_crypto patches. This patch makes no
>             functional changes.
>
>    Patch 2) Makes an infrastructure change in the iaa_crypto driver
>             to map IAA devices/work-queues to cores based on packages
>             instead of NUMA nodes. This doesn't impact performance on
>             the Sapphire Rapids system used for performance testing.
>             However, this change fixes functional problems we found on
>             Granite Rapids during internal validation, where the number
>             of NUMA nodes is greater than the number of packages, which
>             was resulting in over-utilization of some IAA devices and
>             non-usage of other IAA devices under the current NUMA-based
>             mapping infrastructure.
>
>             This patch also develops a new architecture that generalizes
>             how IAA device WQs are used. It enables designating IAA
>             device WQs as either compress-only or decompress-only or
>             generic. Once IAA device WQ types are thus defined, it also
>             allows the configuration of whether device WQs will be
>             shared by all cores on the package, or used only by "mapped
>             cores" obtained by a simple allocation of available IAAs to
>             cores on the package.
>
>             As a result of the overhaul of wq_table definition,
>             allocation and rebalancing, this patch eliminates
>             duplication of device WQs in per-CPU wq_tables, thereby
>             saving 140MiB on a 384-core dual-socket Granite Rapids
>             server with 8 IAAs.
>
>             Regardless of how the user has configured the WQs' usage,
>             the next WQ to use is obtained through a direct look-up in
>             per-CPU "cpu_comp_wqs" and "cpu_decomp_wqs" structures, so
>             as to minimize latency in the critical-path driver compress
>             and decompress routines.
>
>    Patch 3) Code cleanup, consistency of function parameters.
>
>    Patch 4) Makes a change to the iaa_crypto driver's descriptor
>             allocation, from blocking to non-blocking with
>             retries/timeouts and mitigations in case of timeouts during
>             compress/decompress ops. This prevents tasks getting blocked
>             indefinitely, which was observed when testing 30 cores
>             running workloads, with only 1 IAA enabled on Sapphire
>             Rapids (out of 4). These timeouts are typically encountered,
>             and the associated mitigations exercised, only in
>             configurations with 1 IAA device shared by 30+ cores.
>
>    Patch 5) Optimize iaa_wq refcounts using a percpu_ref instead of
>             spinlocks and "int refcount".
>
>    Patch 6) Code simplification and restructuring for understandability
>             in the core iaa_compress() and iaa_decompress() routines.
>
>    Patch 7) Refactor hardware descriptor setup into separate procedures
>             to reduce code clutter.
>
>    Patch 8) Simplify and optimize job submission for the most commonly
>             used non-irq async mode by directly calling movdir64b.
>
>    Patch 9) Deprecate exporting symbols for adding IAA compression
>             modes.
>
>    Patch 10) All dma_map_sg() calls will pass in 1 for the nents instead
>              of sg_nents(), for these main reasons: performance; no
>              existing iaa_crypto use cases map multiple SG lists for DMA
>              at once; and it facilitates the new SG-lists batching
>              interface through crypto.
>
>    Patch 11) Move iaa_crypto core functionality to a layer that relies
>              only on the idxd driver, dma, and scatterlists. Implement
>              clean interfaces to crypto_acomp.
>
>    Patch 12) Define a unit_size in struct acomp_req to enable batching,
>              and provide acomp_request_set_unit_size() for use by kernel
>              modules. zswap_cpu_comp_prepare() calls this API to set the
>              unit_size for zswap as PAGE_SIZE.
>
>    Patch 13) Implement asynchronous descriptor submit and polling
>              mechanisms, enablers for batching. Develop IAA batching of
>              compressions and decompressions for deriving hardware
>              parallelism.
>
>    Patch 14) Enables the "async" mode, sets it as the default.
>
>    Patch 15) Disables verify_compress by default.
>
>    Patch 16) Decompress batching optimization: Find the two largest
>              buffers in the batch and submit them first.
>
>    Patch 17) Add a new Dynamic compression mode that can be used on
>              Granite Rapids.
>
>    Patch 18) Add a batch_size data member to struct acomp_alg and
>              a crypto_acomp_batch_size() API that returns the
>              compressor's batch-size, if it has defined one; 1 otherwise.
>
> 2) zswap modifications to enable compress batching in zswap_store()
>    of large folios (including pmd-mappable folios):
>
>    Patch 19) Simplifies the zswap_pool's per-CPU acomp_ctx resource
>              management and lifetime to be from pool creation to pool
>              deletion.
>
>    Patch 20) Uses IS_ERR_OR_NULL() in zswap_cpu_comp_prepare() to check
>              for valid acomp/req, thereby making it consistent with the
>              resource de-allocation code.
>    Patch 21) Defines a zswap-specific ZSWAP_MAX_BATCH_SIZE (currently
>              set as 8U) to denote the maximum number of acomp_ctx
>              batching resources to allocate, thus limiting the amount of
>              extra memory used for batching. Further, the "struct
>              crypto_acomp_ctx" is modified to contain multiple buffers.
>              A new "u8 compr_batch_size" member is added to "struct
>              zswap_pool" to track the number of dst buffers associated
>              with the compressor (more than 1 if the compressor supports
>              batching).
>
>              Modifies zswap_store() to store the folio in batches of
>              pool->compr_batch_size (batching compressors) or
>              ZSWAP_MAX_BATCH_SIZE (sequential compressors) by calling a
>              new zswap_store_pages() that takes a range of indices in
>              the folio to be stored.
>
>              zswap_store_pages() bulk-allocates zswap entries for the
>              batch, calls zswap_compress() for each page in this range,
>              and stores the entries in the xarray/LRU.
>
>    Patch 22) Introduces a new unified batching implementation of
>              zswap_compress() for compressors that do and do not support
>              batching. This eliminates code duplication and facilitates
>              code maintainability with the introduction of compress
>              batching. Further, there are many optimizations to this
>              common code that result in workload throughput and
>              performance improvements with software compressors and
>              hardware accelerators such as IAA.
>
>              zstd performance is better than or on par with mm-unstable.
>              We see impressive throughput/performance improvements with
>              IAA, and workload performance/sys time improvement with
>              zstd batching vs. no-batching.
>
>
> With v13 of this patch series, the IAA compress batching feature will be
> enabled seamlessly on Intel platforms that have IAA by selecting
> 'deflate-iaa' as the zswap compressor, and using the iaa_crypto 'async'
> sync_mode driver attribute (the default).
>
>
> System setup for testing:
> =========================
> Testing of this patch-series was done with mm-unstable as of 10-24-2025,
> commit 813c0fa931ce, without and with this patch-series. Data was
> gathered on an Intel Sapphire Rapids (SPR) server, dual-socket, 56 cores
> per socket, 4 IAA devices per socket, each IAA with a total of 128 WQ
> entries, 503 GiB RAM and a 525G SSD disk partition for swap. Core
> frequency was fixed at 2500MHz.
>
> Other kernel configuration parameters:
>
>   zswap compressor  : zstd, deflate-iaa
>   zswap allocator   : zsmalloc
>   vm.page-cluster   : 0
>
> IAA "compression verification" is disabled and IAA is run in the async
> mode (the defaults with this series).
>
> I ran experiments with these workloads:
>
> 1) usemem 30 processes with zswap shrinker_enabled=N. Two sets of
>    experiments, one with 64K folios, another with PMD folios.
>
> 2) Kernel compilation allmodconfig with 2G max memory, 32 threads, with
>    zswap shrinker_enabled=N to test batching performance impact in
>    isolation. Two sets of experiments, one with 64K folios, another with
>    PMD folios.
>
> IAA configuration is done by a CLI script, included at the end of the
> cover letter.
>
>
> Performance testing (usemem30):
> ===============================
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 150G. There is no swap limit set for the cgroup.
> 30 usemem processes were run, each allocating and writing 10G of memory,
> and sleeping for 10 sec before exiting:
>
>   usemem --init-time -w -O -b 1 -s 10 -n 30 10g
>   echo 0 > /sys/module/zswap/parameters/shrinker_enabled
>
> IAA WQ Configuration (script is included at the end of the cover
> letter):
>
>   ./enable_iaa.sh -d 4 -q 1
>
> This enables all 4 IAAs on the socket, and configures 1 WQ per IAA
> device, each containing 128 entries. The driver distributes compress
> jobs from each core to wqX.0 of all same-package IAAs in a round-robin
> manner. Decompress jobs are sent to the wqX.0 of the mapped IAA device.
>
> Since usemem has significantly more swapouts than swapins, this
> configuration is optimal.
>
> 64K folios: usemem30: deflate-iaa:
> ==================================
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             deflate-iaa       deflate-iaa  IAA Batching
>                                                              vs.
>                                                              IAA Sequential
>  ------------------------------------------------------------------------
>  Total throughput (KB/s)        6,118,675         9,901,216     62%
>  Avg throughput (KB/s)            203,955           330,040     62%
>  elapsed time (sec)                  98.94             70.90    -28%
>  sys time (sec)                   2,379.29          1,686.18    -29%
>  ------------------------------------------------------------------------
>  memcg_high                     1,263,467         1,404,068
>  memcg_swap_fail                    1,728             1,377
>  64kB_swpout_fallback               1,728             1,377
>  zswpout                       58,174,008        64,508,622
>  zswpin                                43               138
>  pswpout                                0                 0
>  pswpin                                 0                 0
>  ZSWPOUT-64kB                   3,634,162         4,030,643
>  SWPOUT-64kB                            0                 0
>  pgmajfault                         2,398             2,488
>  zswap_reject_compress_fail             0                 0
>  zswap_reject_reclaim_fail              0                 0
>  IAA incompressible pages               0                 0
>  ------------------------------------------------------------------------
>
>
> 2M folios: usemem30: deflate-iaa:
> =================================
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             deflate-iaa       deflate-iaa  IAA Batching
>                                                              vs.
>                                                              IAA Sequential
>  ------------------------------------------------------------------------
>  Total throughput (KB/s)        6,309,635        10,558,225     67%
>  Avg throughput (KB/s)            210,321           351,940     67%
>  elapsed time (sec)                  88.70             67.84    -24%
>  sys time (sec)                   2,059.83          1,581.07    -23%
>  ------------------------------------------------------------------------
>  memcg_high                       116,246           125,218
>  memcg_swap_fail                       41               177
>  thp_swpout_fallback                   41               177
>  zswpout                       59,880,021        64,509,854
>  zswpin                                69               425
>  pswpout                                0                 0
>  pswpin                                 0                 0
>  ZSWPOUT-2048kB                   116,912           125,822
>  thp_swpout                             0                 0
>  pgmajfault                         2,408             4,026
>  zswap_reject_compress_fail             0                 0
>  zswap_reject_reclaim_fail              0                 0
>  IAA incompressible pages               0                 0
>  ------------------------------------------------------------------------
>
>
> 64K folios: usemem30: zstd:
> ===========================
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             zstd              zstd         v13 zstd
>                                                              improvement
>  ------------------------------------------------------------------------
>  Total throughput (KB/s)        5,983,561         6,003,851     0.3%
>  Avg throughput (KB/s)            199,452           200,128     0.3%
>  elapsed time (sec)                 100.93             96.62    -4.3%
>  sys time (sec)                   2,532.49          2,395.83    -5%
>  ------------------------------------------------------------------------
>  memcg_high                     1,122,198         1,113,384
>  memcg_swap_fail                      192                55
>  64kB_swpout_fallback                 192                55
>  zswpout                       48,766,907        48,799,863
>  zswpin                                89                68
>  pswpout                                0                 0
>  pswpin                                 0                 0
>  ZSWPOUT-64kB                   3,047,702         3,049,908
>  SWPOUT-64kB                            0                 0
>  pgmajfault                         2,428             2,390
>  zswap_reject_compress_fail             0                 0
>  zswap_reject_reclaim_fail              0                 0
>  ------------------------------------------------------------------------
>
>
> 2M folios: usemem30: zstd:
> ==========================
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             zstd              zstd         v13 zstd
>                                                              improvement
>  ------------------------------------------------------------------------
>  Total throughput (KB/s)        6,562,687         6,567,946     0.1%
>  Avg throughput (KB/s)            218,756           218,931     0.1%
>  elapsed time (sec)                  94.69             88.79    -6%
>  sys time (sec)                   2,253.97          2,083.43    -8%
>  ------------------------------------------------------------------------
>  memcg_high                        92,709            92,686
>  memcg_swap_fail                       33               226
>  thp_swpout_fallback                   33               226
>  zswpout                       47,851,601        47,847,171
>  zswpin                                65               441
>  pswpout                                0                 0
>  pswpin                                 0                 0
>  ZSWPOUT-2048kB                    93,427            93,238
>  thp_swpout                             0                 0
>  pgmajfault                         2,382             2,767
>  zswap_reject_compress_fail             0                 0
>  zswap_reject_reclaim_fail              0                 0
>  ------------------------------------------------------------------------
>
>
> Performance testing (Kernel compilation, allmodconfig):
> =======================================================
>
> The kernel compilation experiments use 32 threads and build
> "allmodconfig", which takes ~14 minutes and has considerable
> swapout/swapin activity. The cgroup's memory.max is set to 2G. zswap
> writeback is not enabled, so as to isolate the performance impact of
> only large-folio batch compression.
>
>   echo 0 > /sys/module/zswap/parameters/shrinker_enabled
>
> IAA WQ Configuration (script is at the end of the cover letter):
>
>   ./enable_iaa.sh -d 4 -q 2
>
> This enables all 4 IAAs on the socket, and configures 2 WQs per IAA,
> each containing 64 entries. The driver sends decompresses to wqX.0 of
> the mapped IAA device, and distributes compresses to wqX.1 of all
> same-package IAAs in a round-robin manner.
>
> 64K folios: Kernel compilation/allmodconfig: deflate-iaa:
> =========================================================
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             deflate-iaa       deflate-iaa  IAA Batching
>                                                              vs.
>                                                              IAA Sequential
>  ------------------------------------------------------------------------
>  real_sec                           836.64            806.94    -3.5%
>  user_sec                        15,702.26         15,695.13
>  sys_sec                          3,897.57          3,661.83    -6%
>  ------------------------------------------------------------------------
>  Max_Res_Set_Size_KB            1,872,500         1,873,144
>  ------------------------------------------------------------------------
>  memcg_high                             0                 0
>  memcg_swap_fail                        0                 0
>  64kB_swpout_fallback                   0                 0
>  zswpout                       94,890,390        93,332,527
>  zswpin                        28,305,656        28,111,525
>  pswpout                                0                 0
>  pswpin                                 0                 0
>  ZSWPOUT-64kB                   3,088,473         3,018,341
>  SWPOUT-64kB                            0                 0
>  pgmajfault                    29,958,141        29,776,102
>  zswap_reject_compress_fail             0                 0
>  zswap_reject_reclaim_fail              0                 0
>  IAA incompressible pages             684               442
>  ------------------------------------------------------------------------
>
>
> 2M folios: Kernel compilation/allmodconfig: deflate-iaa:
> ========================================================
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             deflate-iaa       deflate-iaa  IAA Batching
>                                                              vs.
>                                                              IAA Sequential
>  ------------------------------------------------------------------------
>  real_sec                           818.48            779.67    -4.7%
>  user_sec                        15,798.78         15,807.93
>  sys_sec                          4,226.52          4,245.18     0.4%
>  ------------------------------------------------------------------------
>  Max_Res_Set_Size_KB            1,871,096         1,871,100
>  ------------------------------------------------------------------------
>  memcg_high                             0                 0
>  memcg_swap_fail                        0                 0
>  thp_swpout_fallback                    0                 0
>  zswpout                      105,675,621       109,930,550
>  zswpin                        36,537,688        38,205,575
>  pswpout                                0                 0
>  pswpin                                 0                 0
>  ZSWPOUT-2048kB                    15,600            15,800
>  thp_swpout                             0                 0
>  pgmajfault                    37,843,091        39,540,387
>  zswap_reject_compress_fail             0                 0
>  zswap_reject_reclaim_fail              0                 0
>  IAA incompressible pages             188               349
>  ------------------------------------------------------------------------
>
>
> With the iaa_crypto driver changes for non-blocking descriptor
> allocations, no timeouts-with-mitigations were seen in
> compress/decompress jobs, for all of the above experiments.
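>
> The deflate-iaa gains in the tables above come from the asynchronous
> submit-then-poll batching described earlier (optimization 2d, Patch 13):
> all jobs in a batch are submitted to the hardware first, and only then
> are their completions polled. The outline below is purely illustrative,
> with hypothetical helper names; it is not taken from the driver:
>
>         /* Illustrative outline of submit-then-poll batching (not driver code). */
>         int submit_desc_nonblocking(struct iaa_req *req);  /* hypothetical */
>         int poll_for_completion(struct iaa_req *req);      /* hypothetical */
>
>         static int iaa_batch_compress_async(struct iaa_req **reqs, int nr_reqs)
>         {
>                 int i, err = 0;
>
>                 /* Phase 1: submit all descriptors so the jobs run in parallel. */
>                 for (i = 0; i < nr_reqs; i++)
>                         err |= submit_desc_nonblocking(reqs[i]);
>
>                 /* Phase 2: poll each job's completion record for its status. */
>                 for (i = 0; i < nr_reqs; i++)
>                         err |= poll_for_completion(reqs[i]);
>
>                 return err ? -EIO : 0;
>         }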
>
>
> 64K folios: Kernel compilation/allmodconfig: zstd:
> ===================================================
>
> -------------------------------------------------------------------------------
>                                mm-unstable-10-24-2025      v13
> -------------------------------------------------------------------------------
> zswap compressor                          zstd            zstd    Improvement
> -------------------------------------------------------------------------------
> real_sec                                880.62          850.41          -3.4%
> user_sec                             15,717.23       15,683.17
> sys_sec                               5,171.90        5,076.51          -1.8%
> -------------------------------------------------------------------------------
> Max_Res_Set_Size_KB                  1,871,276       1,874,744
> -------------------------------------------------------------------------------
> memcg_high                                   0               0
> memcg_swap_fail                              0               0
> 64kB_swpout_fallback                         0               0
> zswpout                             76,599,637      76,472,392
> zswpin                              21,833,178      22,538,969
> pswpout                                      0               0
> pswpin                                       0               0
> ZSWPOUT-64kB                         2,462,404       2,446,549
> SWPOUT-64kB                                  0               0
> pgmajfault                          23,027,211      23,830,391
> zswap_reject_compress_fail                   0               0
> zswap_reject_reclaim_fail                    0               0
> -------------------------------------------------------------------------------
>
>
> 2M folios: Kernel compilation/allmodconfig: zstd:
> ==================================================
>
> -------------------------------------------------------------------------------
>                                mm-unstable-10-24-2025      v13
> -------------------------------------------------------------------------------
> zswap compressor                          zstd            zstd    Improvement
> -------------------------------------------------------------------------------
> real_sec                                888.45          849.54          -4.4%
> user_sec                             15,841.87       15,828.10
> sys_sec                               5,866.72        5,847.17          -0.3%
> -------------------------------------------------------------------------------
> Max_Res_Set_Size_KB                  1,871,096       1,872,892
> -------------------------------------------------------------------------------
> memcg_high                                   0               0
> memcg_swap_fail                              0               0
> thp_swpout_fallback                          0               0
> zswpout                             89,891,328      90,847,761
> zswpin                              29,249,656      29,999,617
> pswpout                                      0               0
> pswpin                                       0               0
> ZSWPOUT-2048kB                          12,198          12,481
> thp_swpout                                   0               0
> pgmajfault                          30,077,425      30,915,945
> zswap_reject_compress_fail                   0               0
> zswap_reject_reclaim_fail                    0               0
> -------------------------------------------------------------------------------
>
>
>
> Changes since v12:
> ==================
> 1) Rebased to mm-unstable as of 10-24-2025, commit 813c0fa931ce.
> 2) Added "int nid" to zswap_entry to store the page's nid, to preserve zswap
>    LRU list/shrinker behavior with bulk allocation, as suggested by Nhat and
>    Yosry. No change in the memory footprint of struct zswap_entry.
> 3) Added a WARN_ON() if kmem_cache_alloc_bulk() returns 0 or a number that is
>    different from nr_entries, as suggested by Yosry.
> 4) Confirmed that kmem_cache_free_bulk() works for both bulk and non-bulk
>    allocated entries, to follow up on Yosry's comment.
> 5) Moved the call to cpuhp_state_remove_instance() to zswap_pool_destroy(),
>    as suggested by Yosry.
> 6) Variable names changed to "nid" and "wb_enabled", per Yosry's suggestion.
> 7) Concise comments in zswap.c, and summarized commit logs, as suggested by
>    Yosry.
> 8) Minimized branches in zswap_compress().
> 9) Deleted allocating extra memory in acomp_req->__ctx[] to statically store
>    addresses of the SG lists' lengths, as suggested by Herbert.
> 10) Deleted the iaa_comp API and export symbols, as suggested by Herbert.
> 11) Deleted @batch_size in struct crypto_acomp. Instead, the value is
>     returned from struct acomp_alg directly, as suggested by Herbert.
> 12) Addressed checkpatch.pl warnings and coding style suggestions in the
>     iaa_crypto patches, provided by Vinicius Gomes in internal code reviews.
>     Thanks Vinicius!
>
>
> Changes since v11:
> ==================
> 1) Rebased to mm-unstable as of 9-18-2025, commit 1f98191f08b4.
> 2) Incorporated Herbert's suggestions on submitting the folio as the source
>    and SG lists for the destination, to create the compress batching
>    interface from zswap to crypto.
> 3) As per Herbert's suggestion, added a new unit_size member to struct
>    acomp_req, along with an acomp_request_set_unit_size() API for kernel
>    modules to set the unit size to use while breaking down the request's
>    src/dst scatterlists.
> 4) Implemented iaa_crypto batching using the new SG-lists based architecture
>    and crypto interfaces.
> 5) To make the SG-lists based approach functional and performant for IAA, I
>    have changed all the calls to dma_map_sg() to use nents of 1. This should
>    not be a concern, since it eliminates redundant computes to scan an SG
>    list with only one scatterlist for existing kernel users, i.e. zswap with
>    the zswap_compress() modifications in this series. This will continue to
>    hold true with the zram IAA batching support I am developing. There are
>    no kernel use cases for the iaa_crypto driver that will break this
>    assumption.
> 6) Addressed Herbert's comment about batch_size being a statically defined
>    data member in struct acomp_alg and struct crypto_acomp.
> 7) Addressed Nhat's comment about VM_WARN_ON_ONCE(nr_pages >
>    ZSWAP_MAX_BATCH_SIZE) in zswap_store_pages().
> 8) Nhat's comment about deleting struct swap_batch_decomp_data is
>    automatically addressed by the SG-lists based rewrite of the crypto
>    batching interface.
> 9) Addressed Barry's comment about renaming pool->batch_size to
>    pool->store_batch_size.
> 10) Incorporated Barry's suggestion to merge patches that introduce data
>     members to structures and/or APIs with the patches that use them.
> 11) Added performance data to patch 0023's commit log, as suggested by Barry.
>
> Changes since v10:
> ==================
> 1) Rebased to mm-unstable as of 7-30-2025, commit 01da54f10fdd.
> 2) Added change logging in patch 0024 on there being no Intel-specific
>    dependencies in the batching framework, as suggested by Andrew Morton.
>    Thanks Andrew!
> 3) Added change logging in patch 0024 on other ongoing work that can use
>    batching, as per Andrew's suggestion. Thanks Andrew!
> 4) Added the IAA configuration script to the cover letter, as suggested by
>    Nhat Pham. Thanks Nhat!
> 5) As suggested by Nhat, dropped patch 0020 from v10, which moved CPU
>    hotplug procedures to pool functions.
> 6) Gathered kernel_compilation 'allmod' config performance data with
>    writeback and zswap shrinker_enabled=Y.
> 7) Changed the pool->batch_size for software compressors to be
>    ZSWAP_MAX_BATCH_SIZE, since this gives better performance with the zswap
>    shrinker enabled.
> 8) Was unable to replicate in v11 the issue seen in v10 with higher
>    memcg_swap_fail than in the baseline, with usemem30/zstd.
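As a concrete illustration of items (2) and (3) in the "Changes since v12"
list above, the sketch below shows one way bulk entry allocation for a batch
of pages could look. The helper name and locals are hypothetical and only
meant to illustrate the described behavior (bulk allocation, a WARN_ON() on a
short return, and recording the nid added to zswap_entry by this series); it
is not the actual patch:

	/*
	 * Sketch: bulk-allocate one zswap_entry per page of the batch.
	 * kmem_cache_alloc_bulk() returns the number of objects allocated;
	 * anything other than nr_entries is treated as a failure, per the
	 * v12 changelog.
	 */
	static int zswap_alloc_entries_sketch(struct zswap_entry **entries,
					      unsigned int nr_entries, int nid)
	{
		unsigned int i, nr;

		nr = kmem_cache_alloc_bulk(zswap_entry_cache, GFP_KERNEL,
					   nr_entries, (void **)entries);
		if (WARN_ON(nr != nr_entries)) {
			kmem_cache_free_bulk(zswap_entry_cache, nr,
					     (void **)entries);
			return -ENOMEM;
		}

		for (i = 0; i < nr_entries; i++)
			entries[i]->nid = nid; /* preserves per-node LRU/shrinker behavior */

		return 0;
	}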
>
> Changes since v9:
> =================
> 1) Rebased to mm-unstable as of 6-24-2025, commit 23b9c0472ea3.
> 2) iaa_crypto rearchitecting, mainline race condition fix, performance
>    optimizations, code cleanup.
> 3) Addressed Herbert's comments in v9 patch 10, that an array based
>    crypto_acomp interface is not acceptable.
> 4) Optimized the implementation of the batching zswap_compress() and
>    zswap_store_pages() added in v9, to recover performance when integrated
>    with the changes in commit 56e5a103a721 ("zsmalloc: prefer the the
>    original page's node for compressed data").
>
> Changes since v8:
> =================
> 1) Rebased to mm-unstable as of 4-21-2025, commit 2c01d9f3c611.
> 2) Backported commits for reverting request chaining, since these are in
>    cryptodev-2.6 but not yet in mm-unstable: without these backports,
>    deflate-iaa is non-functional in mm-unstable:
>      commit 64929fe8c0a4 ("crypto: acomp - Remove request chaining")
>      commit 5976fe19e240 ("Revert "crypto: testmgr - Add multibuffer acomp
>      testing"")
>    Backported this hotfix as well:
>      commit 002ba346e3d7 ("crypto: scomp - Fix off-by-one bug when
>      calculating last page").
> 3) crypto_acomp_[de]compress() restored to non-request-chained
>    implementations, since request chaining has been removed from acomp in
>    commit 64929fe8c0a4 ("crypto: acomp - Remove request chaining").
> 4) New IAA WQ architecture to denote WQ type and whether a WQ should be
>    shared among all package cores, or only with the "mapped" ones from an
>    even cores-to-IAA distribution scheme.
> 5) Compress/decompress batching are implemented in iaa_crypto using the new
>    crypto_acomp_batch_compress()/crypto_acomp_batch_decompress() API.
> 6) Defines a "void *data" in struct acomp_req, based on Herbert advising
>    against using req->base.data in the driver. This is needed for async
>    submit-poll to work.
> 7) In zswap.c, moved the CPU hotplug callbacks to reside in "pool
>    functions", per Yosry's suggestion to move procedures in a distinct patch
>    before the refactoring patches.
> 8) A new "u8 nr_reqs" member is added to "struct zswap_pool" to track the
>    number of requests/buffers associated with the per-cpu acomp_ctx, as per
>    Yosry's suggestion.
> 9) Simplifications to the acomp_ctx resources allocation, deletion and
>    locking, and for these to exist from pool creation to pool deletion,
>    based on v8 code review discussions with Yosry.
> 10) Use IS_ERR_OR_NULL() consistently in zswap_cpu_comp_prepare() and
>     acomp_ctx_dealloc(), as per Yosry's v8 comment.
> 11) zswap_store_folio() is deleted, and instead, the loop over
>     zswap_store_pages() is moved inline in zswap_store(), per Yosry's
>     suggestion.
> 12) Better structure in zswap_compress(): a unified procedure that
>     compresses/stores a batch of pages for both non-batching and batching
>     compressors. Renamed from zswap_batch_compress() to zswap_compress().
>     Thanks Yosry for these suggestions.
>
>
> Changes since v7:
> =================
> 1) Rebased to mm-unstable as of 3-3-2025, commit 5f089a9aa987.
> 2) Changed the acomp_ctx->nr_reqs to be u8, since ZSWAP_MAX_BATCH_SIZE is
>    defined as 8U, for saving memory in this per-cpu structure.
> 3) Fixed a typo in code comments in acomp_ctx_get_cpu_lock():
>    acomp_ctx->initialized to acomp_ctx->__online.
> 4) Incorporated suggestions from Yosry, Chengming, Nhat and Johannes,
>    thanks to all!
>    a) zswap_batch_compress() replaces zswap_compress().
>       Thanks Yosry for this suggestion!
>    b) Process the folio in sub-batches of ZSWAP_MAX_BATCH_SIZE, regardless
>       of whether or not the compressor supports batching. This gets rid of
>       the kmalloc(entries), and allows us to allocate an array of
>       ZSWAP_MAX_BATCH_SIZE entries on the stack. This is implemented in
>       zswap_store_pages().
>    c) Use of a common structure and code paths for compressing a folio in
>       batches, either as a request chain (in parallel in IAA hardware) or
>       sequentially. No code duplication since zswap_compress() has been
>       replaced with zswap_batch_compress(), simplifying maintainability.
> 5) A key difference between compressors that support batching and those
>    that do not, is that for the latter, the acomp_ctx mutex is
>    locked/unlocked per ZSWAP_MAX_BATCH_SIZE batch, so that decompressions
>    to handle page-faults can make progress. This fixes the zstd kernel
>    compilation regression seen in v7. For compressors that support
>    batching, e.g. IAA, the mutex is locked/released once for storing the
>    folio.
> 6) Used likely/unlikely compiler directives and prefetchw to restore
>    performance with the common code paths.
>
> Changes since v6:
> =================
> 1) Rebased to mm-unstable as of 2-27-2025, commit d58172d128ac.
>
> 2) Deleted the crypto_acomp_batch_compress() and
>    crypto_acomp_batch_decompress() interfaces, as per Herbert's suggestion.
>    Batching is instead enabled by chaining the requests. For non-batching
>    compressors, there is no request chaining involved. Both batching and
>    non-batching compressions are accomplished by zswap by calling:
>
>      crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]),
>                      &acomp_ctx->wait);
>
> 3) iaa_crypto implementation of batch compressions/decompressions using
>    request chaining, as per Herbert's suggestions.
> 4) Simplification of the acomp_ctx resource allocation/deletion with
>    respect to CPU hot[un]plug, to address Yosry's suggestions to explore
>    the mutex options in zswap_cpu_comp_prepare(). Yosry, please let me know
>    if the per-cpu memory cost of this proposed change is acceptable (IAA:
>    64.8KB, software compressors: 8.2KB). On the positive side, I believe
>    restarting reclaim on a CPU after it has been through an offline-online
>    transition will be much faster by not deleting the acomp_ctx resources
>    when the CPU gets offlined.
> 5) Use of lockdep assertions rather than comments for internal locking
>    rules, as per Yosry's suggestion.
> 6) No specific references to IAA in zswap.c, as suggested by Yosry.
> 7) Explored various solutions other than the v6 zswap_store_folio()
>    implementation, to fix the zstd regression seen in v5, to attempt to
>    unify common code paths, and to allocate smaller arrays for the zswap
>    entries on the stack. All of these options were found to cause a
>    usemem30 latency regression with zstd. The v6 version of
>    zswap_store_folio() is the only implementation that does not cause a
>    zstd regression, confirmed by 10 consecutive runs, each giving quite
>    consistent latency numbers. Hence, the v6 implementation is carried
>    forward to v7, with changes for branching between batching vs.
>    sequential compression API calls.
>
>
> Changes since v5:
> =================
> 1) Rebased to mm-unstable as of 2-1-2025, commit 7de6fd8ab650.
>
> Several improvements, regression fixes and bug fixes, based on Yosry's
> v5 comments (Thanks Yosry!):
>
> 2) Fix for the zstd performance regression in v5.
> 3) Performance debug and fix for marginal improvements with IAA batching
>    vs. sequential.
> 4) Performance testing data compares IAA with and without batching, instead
>    of IAA batching against zstd.
> 5) Commit logs/zswap comments no longer mention crypto_acomp implementation
>    details.
> 6) Deleted the pr_info_once() when batching resources are allocated in
>    zswap_cpu_comp_prepare().
> 7) Use kcalloc_node() for the multiple acomp_ctx buffers/reqs in
>    zswap_cpu_comp_prepare().
> 8) Simplify and consolidate error handling cleanup code in
>    zswap_cpu_comp_prepare().
> 9) Introduce zswap_compress_folio() in a separate patch.
> 10) Bug fix in zswap_store_folio(): an xa_store() failure could cause all
>     compressed objects and entries to be freed, followed by a UAF when
>     zswap_store() tries to free the entries that were already added to the
>     xarray prior to the failure.
> 11) Deleted compressed_bytes/bytes. zswap_store_folio() also comprehends
>     the recent fixes in commit bf5eaaaf7941 ("mm/zswap: fix inconsistency
>     when zswap_store_page() fails") by Hyeonggon Yoo.
>
> iaa_crypto improvements/fixes/changes:
>
> 12) Enables asynchronous mode and makes it the default. With commit
>     4ebd9a5ca478 ("crypto: iaa - Fix IAA disabling that occurs when
>     sync_mode is set to 'async'"), async mode was previously just sync. We
>     now have true async support.
> 13) Change idxd descriptor allocations from blocking to non-blocking with
>     timeouts, and mitigations for compress/decompress ops that fail to
>     obtain a descriptor. This is a fix for "task blocked" errors seen in
>     configurations where 30+ cores are running workloads under high memory
>     pressure and sending comps/decomps to 1 IAA device.
> 14) Fixes a bug with unprotected access of "deflate_generic_tfm" in
>     deflate_generic_decompress(), which can cause data corruption and a
>     zswap_decompress() kernel crash.
> 15) zswap uses crypto_acomp_batch_compress() with async polling instead of
>     request chaining, for slightly better latency. However, the request
>     chaining framework itself is unchanged, preserved from v5.
>
>
> Changes since v4:
> =================
> 1) Rebased to mm-unstable as of 12-20-2024, commit 5555a83c82d6.
> 2) Added acomp request chaining, as suggested by Herbert. Thanks Herbert!
> 3) Implemented IAA compress batching using request chaining.
> 4) zswap_store() batching simplifications suggested by Chengming, Yosry and
>    Nhat, thanks to all!
>    - New zswap_compress_folio() that is called by zswap_store().
>    - Move the loop over the folio's pages out of zswap_store() and into a
>      zswap_store_folio() that stores all pages.
>    - Allocate all zswap entries for the folio upfront.
>    - Added zswap_batch_compress().
>    - Branch to call zswap_compress() or zswap_batch_compress() inside
>      zswap_compress_folio().
>    - All iterations over pages kept at the same function level.
>    - No helpers other than the newly added zswap_store_folio() and
>      zswap_compress_folio().
>
>
> Changes since v3:
> =================
> 1) Rebased to mm-unstable as of 11-18-2024, commit 5a7056135bb6.
> 2) Major re-write of the iaa_crypto driver's mapping of IAA devices to
>    cores, based on packages instead of NUMA nodes.
> 3) Added an acomp_has_async_batching() API to crypto acomp, that allows
>    zswap/zram to query if a crypto_acomp has registered batch_compress and
>    batch_decompress interfaces.
> 4) Clear the poll bits on the acomp_reqs passed to
>    iaa_comp_a[de]compress_batch() so that a module like zswap can be
>    confident about the acomp_reqs[0] not having the poll bit set before
>    calling the fully synchronous API crypto_acomp_[de]compress().
>    Herbert, I would appreciate it if you can review changes 2-4, in patches
>    1-8 in v4. I did not want to introduce too many iaa_crypto changes in
>    v4, given that patch 7 is already making a major change. I plan to work
>    on incorporating the request chaining using the ahash interface in v5
>    (I need to understand the basic crypto ahash better). Thanks Herbert!
> 5) Incorporated Johannes' suggestion to not have a sysctl to enable
>    compress batching.
> 6) Incorporated Yosry's suggestion to allocate batching resources in the
>    cpu hotplug onlining code, since there is no longer a sysctl to control
>    batching. Thanks Yosry!
> 7) Incorporated Johannes' suggestions related to making the overall
>    sequence of events between zswap_store() and zswap_batch_store() as
>    similar as possible for readability and control flow, better naming of
>    procedures, avoiding forward declarations, not inlining error path
>    procedures, deleting zswap internal details from zswap.h, etc. Thanks
>    Johannes, really appreciate the direction!
>    I have tried to explain the minimal future-proofing in terms of the
>    zswap_batch_store() signature and the definition of "struct
>    zswap_batch_store_sub_batch" in the comments for this struct. I hope the
>    new code explains the control flow a bit better.
>
>
> Changes since v2:
> =================
> 1) Rebased to mm-unstable as of 11-5-2024, commit 7994b7ea6ac8.
> 2) Fixed an issue in zswap_create_acomp_ctx() with checking for NULL
>    returned by kmalloc_node() for acomp_ctx->buffers and for
>    acomp_ctx->reqs.
> 3) Fixed a bug in zswap_pool_can_batch() for returning true if
>    pool->can_batch_comp is found to be equal to BATCH_COMP_ENABLED, and if
>    the per-cpu acomp_batch_ctx tests true for batching resources having
>    been allocated on this cpu. Also, changed from per_cpu_ptr() to
>    raw_cpu_ptr().
> 4) Incorporated the zswap_store_propagate_errors() compilation warning fix
>    suggested by Dan Carpenter. Thanks Dan!
> 5) Replaced the references to SWAP_CRYPTO_SUB_BATCH_SIZE in comments in
>    zswap.h, with SWAP_CRYPTO_BATCH_SIZE.
>
> Changes since v1:
> =================
> 1) Rebased to mm-unstable as of 11-1-2024, commit 5c4cf96cd702.
> 2) Incorporated Herbert's suggestions to use an acomp_req flag to indicate
>    async/poll mode, and to encapsulate the polling functionality in the
>    iaa_crypto driver. Thanks Herbert!
> 3) Incorporated Herbert's and Yosry's suggestions to implement the batching
>    API in iaa_crypto and to make its use seamless from zswap's perspective.
>    Thanks Herbert and Yosry!
> 4) Incorporated Yosry's suggestion to make it more convenient for the user
>    to enable compress batching, while minimizing the memory footprint cost.
>    Thanks Yosry!
> 5) Incorporated Yosry's suggestion to de-couple the shrink_folio_list()
>    reclaim batching patch from this series, since it requires a broader
>    discussion.
>
>
> IAA configuration script "enable_iaa.sh":
> =========================================
>
> Acknowledgements: Binuraj Ravindran and Rakib Al-Fahad.
>
> Usage:
> ------
>
>   ./enable_iaa.sh -d <num_devices_per_socket> -q <num_wqs_per_device>
>
>
> #--------------------------------------------------------------------
> #!/usr/bin/env bash
> # SPDX-License-Identifier: BSD-3-Clause
> # Copyright (c) 2025, Intel Corporation
> # Description: Configure IAA devices
>
> VERIFY_COMPRESS_PATH="/sys/bus/dsa/drivers/crypto/verify_compress"
>
> iax_dev_id="0cfe"
> num_iaa=$(lspci -d:${iax_dev_id} | wc -l)
> sockets=$(lscpu | grep Socket | awk '{print $2}')
> echo "Found ${num_iaa} instances in ${sockets} socket(s)"
>
> # The same number of devices will be configured in each socket, if there
> # is more than one socket.
> # Normalize with respect to the number of sockets.
> device_num_per_socket=$(( num_iaa/sockets ))
> num_iaa_per_socket=$(( num_iaa / sockets ))
>
> iaa_wqs=2
> verbose=0
> iaa_engines=8
> mode="dedicated"
> wq_type="kernel"
> iaa_crypto_mode="async"
> verify_compress=0
>
>
> # Function to handle errors
> handle_error() {
>     echo "Error: $1"
>     exit 1
> }
>
> # Process arguments
>
> while getopts "d:hm:q:vD" opt; do
>     case $opt in
>     d)
>         device_num_per_socket=$OPTARG
>         ;;
>     m)
>         iaa_crypto_mode=$OPTARG
>         ;;
>     q)
>         iaa_wqs=$OPTARG
>         ;;
>     D)
>         verbose=1
>         ;;
>     v)
>         verify_compress=1
>         ;;
>     h)
>         echo "Usage: $0 [-d <num_devices>] [-q <num_wqs>] [-m <mode>] [-v] [-D]"
>         echo "  -d - number of devices per socket"
>         echo "  -q - number of WQs per device"
>         echo "  -m - iaa_crypto sync_mode"
>         echo "  -v - enable verify_compress"
>         echo "  -D - verbose mode"
>         echo "  -h - help"
>         exit
>         ;;
>     \?)
>         echo "Invalid option: -$OPTARG" >&2
>         exit
>         ;;
>     esac
> done
>
> LOG="configure_iaa.log"
>
> # Update wq_size based on the number of wqs
> wq_size=$(( 128 / iaa_wqs ))
>
> # Take care of the enumeration, if DSA is enabled.
> dsa=`lspci | grep -c 0b25`
> # Set first,step counters to correctly enumerate iax devices based on
> # whether we are running on a guest or a host, with or without DSA.
> first=0
> step=1
> [[ $dsa -gt 0 && -d /sys/bus/dsa/devices/dsa0 ]] && first=1 && step=2
> echo "first index: ${first}, step: ${step}"
>
>
> #
> # Switch to software compressors and disable IAAs to have a clean start.
> #
> COMPRESSOR=/sys/module/zswap/parameters/compressor
> last_comp=`cat ${COMPRESSOR}`
> echo lzo > ${COMPRESSOR}
>
> echo "Disable IAA devices before configuring"
>
> for ((i = ${first}; i < ${step} * ${num_iaa}; i += ${step})); do
>     for ((j = 0; j < ${iaa_wqs}; j += 1)); do
>         cmd="accel-config disable-wq iax${i}/wq${i}.${j} >& /dev/null"
>         [[ $verbose == 1 ]] && echo $cmd; eval $cmd
>     done
>     cmd="accel-config disable-device iax${i} >& /dev/null"
>     [[ $verbose == 1 ]] && echo $cmd; eval $cmd
> done
>
> rmmod iaa_crypto
> modprobe iaa_crypto
>
> # Apply crypto parameters
> echo $verify_compress > ${VERIFY_COMPRESS_PATH} || handle_error "did not change verify_compress"
>
> # Note: This is a temporary solution during the kernel transition.
> if [ -f /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa ]; then
>     echo 1 > /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa || handle_error "did not set g_comp_wqs_per_iaa"
> elif [ -f /sys/bus/dsa/drivers/crypto/g_wqs_per_iaa ]; then
>     echo 1 > /sys/bus/dsa/drivers/crypto/g_wqs_per_iaa || handle_error "did not set g_wqs_per_iaa"
> fi
> if [ -f /sys/bus/dsa/drivers/crypto/g_consec_descs_per_gwq ]; then
>     echo 1 > /sys/bus/dsa/drivers/crypto/g_consec_descs_per_gwq || handle_error "did not set g_consec_descs_per_gwq"
> fi
> echo ${iaa_crypto_mode} > /sys/bus/dsa/drivers/crypto/sync_mode || handle_error "could not set sync_mode"
>
>
> echo "Configuring ${device_num_per_socket} device(s) out of ${num_iaa_per_socket} per socket"
> if [ "${device_num_per_socket}" -le "${num_iaa_per_socket}" ]; then
>     echo "Configuring all devices"
>     start=${first}
>     end=$(( ${step} * ${device_num_per_socket} ))
> else
>     echo "ERROR: Not enough devices"
>     exit
> fi
>
>
> #
> # Enable all iax devices and wqs
> #
> for (( socket = 0; socket < ${sockets}; socket += 1 )); do
>     for ((i = ${start}; i < ${end}; i += ${step})); do
>
>         echo "Configuring iaa$i on socket ${socket}"
>
>         for ((j = 0; j < ${iaa_engines}; j += 1)); do
>             cmd="accel-config config-engine iax${i}/engine${i}.${j} --group-id=0"
>             [[ $verbose == 1 ]] && echo $cmd; eval $cmd
>         done
>
>         # Config WQs
>         for ((j = 0; j < ${iaa_wqs}; j += 1)); do
>             # Config WQ: group 0, priority=10, mode=shared, type=kernel,
>             # name=kernel, driver_name=crypto
>             cmd="accel-config config-wq iax${i}/wq${i}.${j} -g 0 -s ${wq_size} -p 10 -m ${mode} -y ${wq_type} -n iaa_crypto${i}${j} -d crypto"
>             [[ $verbose == 1 ]] && echo $cmd; eval $cmd
>         done
>
>         # Enable device and WQs
>         cmd="accel-config enable-device iax${i}"
>         [[ $verbose == 1 ]] && echo $cmd; eval $cmd
>
>         for ((j = 0; j < ${iaa_wqs}; j += 1)); do
>             cmd="accel-config enable-wq iax${i}/wq${i}.${j}"
>             [[ $verbose == 1 ]] && echo $cmd; eval $cmd
>         done
>
>     done
>     start=$(( start + ${step} * ${num_iaa_per_socket} ))
>     end=$(( start + (${step} * ${device_num_per_socket}) ))
> done
>
> # Restore the last compressor
> echo "$last_comp" > ${COMPRESSOR}
>
> # Check if the configuration is correct
> echo "Configured IAA devices:"
> accel-config list | grep iax
>
> #--------------------------------------------------------------------
>
>
> I would greatly appreciate code review comments for the iaa_crypto driver
> and mm patches included in this series!
>
> Thanks,
> Kanchana
>
>
>
> Kanchana P Sridhar (22):
>   crypto: iaa - Reorganize the iaa_crypto driver code.
>   crypto: iaa - New architecture for IAA device WQ comp/decomp usage &
>     core mapping.
>   crypto: iaa - Simplify, consistency of function parameters, minor
>     stats bug fix.
>   crypto: iaa - Descriptor allocation timeouts with mitigations.
>   crypto: iaa - iaa_wq uses percpu_refs for get/put reference counting.
>   crypto: iaa - Simplify the code flow in iaa_compress() and
>     iaa_decompress().
>   crypto: iaa - Refactor hardware descriptor setup into separate
>     procedures.
>   crypto: iaa - Simplified, efficient job submissions for non-irq mode.
>   crypto: iaa - Deprecate exporting add/remove IAA compression modes.
>   crypto: iaa - Expect a single scatterlist for a [de]compress request's
>     src/dst.
>   crypto: iaa - Rearchitect iaa_crypto to have clean interfaces with
>     crypto_acomp.
>   crypto: acomp - Define a unit_size in struct acomp_req to enable
>     batching.
>   crypto: iaa - IAA Batching for parallel compressions/decompressions.
>   crypto: iaa - Enable async mode and make it the default.
>   crypto: iaa - Disable iaa_verify_compress by default.
>   crypto: iaa - Submit the two largest source buffers first in
>     decompress batching.
>   crypto: iaa - Add deflate-iaa-dynamic compression mode.
>   crypto: acomp - Add crypto_acomp_batch_size() to get an algorithm's
>     batch-size.
>   mm: zswap: Per-CPU acomp_ctx resources exist from pool creation to
>     deletion.
>   mm: zswap: Consistently use IS_ERR_OR_NULL() to check acomp_ctx
>     resources.
>   mm: zswap: zswap_store() will process a large folio in batches.
>   mm: zswap: Batched zswap_compress() with compress batching of large
>     folios.
>
>  .../driver-api/crypto/iaa/iaa-crypto.rst      |  168 +-
>  crypto/acompress.c                            |   14 +
>  crypto/testmgr.c                              |   10 +
>  crypto/testmgr.h                              |   74 +
>  drivers/crypto/intel/iaa/Makefile             |    4 +-
>  drivers/crypto/intel/iaa/iaa_crypto.h         |   87 +-
>  .../intel/iaa/iaa_crypto_comp_dynamic.c       |   22 +
>  drivers/crypto/intel/iaa/iaa_crypto_main.c    | 2836 ++++++++++++-----
>  drivers/crypto/intel/iaa/iaa_crypto_stats.c   |    8 +
>  drivers/crypto/intel/iaa/iaa_crypto_stats.h   |    2 +
>  include/crypto/acompress.h                    |   48 +
>  include/crypto/internal/acompress.h           |    3 +
>  mm/zswap.c                                    |  700 ++--
>  13 files changed, 2905 insertions(+), 1071 deletions(-)
>  create mode 100644 drivers/crypto/intel/iaa/iaa_crypto_comp_dynamic.c
>
> --
> 2.27.0