From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org,
	yosry.ahmed@linux.dev, nphamcs@gmail.com, chengming.zhou@linux.dev,
	usamaarif642@gmail.com, ryan.roberts@arm.com, 21cnbao@gmail.com,
	ying.huang@linux.alibaba.com, akpm@linux-foundation.org,
	senozhatsky@chromium.org, sj@kernel.org, kasong@tencent.com,
	linux-crypto@vger.kernel.org, herbert@gondor.apana.org.au,
	davem@davemloft.net, clabbe@baylibre.com, ardb@kernel.org,
	ebiggers@google.com, surenb@google.com,
	"Accardi, Kristen C", "Gomes, Vinicius"
Cc: "Feghali, Wajdi K", "Gopal, Vinodh", "Sridhar, Kanchana P"
Subject: RE: [PATCH v13 00/22] zswap compression batching with optimized
	iaa_crypto driver
Date: Thu, 13 Nov 2025 18:14:27 +0000
In-Reply-To: <20251104091235.8793-1-kanchana.p.sridhar@intel.com>
References: <20251104091235.8793-1-kanchana.p.sridhar@intel.com>
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0

> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Tuesday, November 4, 2025 1:12 AM
> To: linux-kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> yosry.ahmed@linux.dev; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com;
> linux-crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C; Gomes, Vinicius
> Cc: Feghali, Wajdi K; Gopal, Vinodh; Sridhar, Kanchana P
> Subject: [PATCH v13 00/22] zswap compression batching with optimized
> iaa_crypto driver
>
> v13: zswap compression batching with optimized iaa_crypto driver
> =================================================================
> This updated patch-series further generalizes the batching implementation
> of zswap_compress() for non-batching and batching compressors. It makes
> sure that the bulk allocation of zswap entries preserves the current
> behavior of adding each entry to the LRU list of the page's nid.
>
> Based on Herbert's suggestions, the batching interfaces from zswap to
> crypto, from crypto to iaa_crypto, and the batching implementation within
> iaa_crypto now use the folio directly as the source (sg_page_iter for
> retrieving pages), and destination SG lists. A unit_size has been added to
> struct acomp_req, with kernel users such as zswap using the new
> acomp_request_set_unit_size() API to set the unit size to use while
> breaking down the request's src/dst scatterlists. zswap sets the unit_size
> to PAGE_SIZE.

Hi Nhat, Yosry, Herbert,

I just wanted to follow up on whether there are other code review comments
or suggestions you have on this latest patch set. Thanks very much for your
time in reviewing and improving the patch-series.

Nhat, I will make the change to the struct zswap_entry bit-fields to be
macro-defined constants, either as an update to this series, or submit a
separate patch with this change if that's OK with you.

Thanks,
Kanchana

>
> Following Andrew's suggestion, the next two paragraphs emphasize
> generality and alignment with current kernel efforts.
>
> Architectural considerations for the zswap batching framework:
> ===============================================================
> We have designed the zswap batching framework to be hardware-agnostic. It
> has no dependencies on Intel-specific features and can be leveraged by any
> hardware accelerator or software-based compressor. In other words, the
> framework is open and inclusive by design.
>
> Other ongoing work that can use batching:
> =========================================
> This patch-series demonstrates the performance benefits of compress
> batching when used in zswap_store() of large folios. shrink_folio_list()
> "reclaim batching" of any-order folios is the next major work that uses
> this zswap compress batching framework: our testing of kernel_compilation
> with writeback and the zswap shrinker indicates 10X fewer pages get
> written back when we reclaim 32 folios as a batch, as compared to one
> folio at a time; this is with deflate-iaa and with zstd. We expect to
> submit a patch-series with this data and the resulting performance
> improvements shortly. Reclaim batching relieves memory pressure faster
> than reclaiming one folio at a time, and hence alleviates the need to
> scan slab memory for writeback.
>
> Many thanks to Nhat for suggesting ideas on using batching with the
> ongoing kcompressd work, as well as beneficially using decompression
> batching & block IO batching to improve zswap writeback efficiency.
>
> Experiments with the kernel compilation benchmark (allmod config) that
> combine zswap compress batching, reclaim batching, swapin_readahead()
> decompression batching of prefetched pages, and writeback batching show
> that 0 pages are written back to disk with deflate-iaa and zstd. For
> comparison, the baselines for these compressors see 200K-800K pages
> written to disk.
>
> To summarize, these are future clients of the batching framework:
>
> - shrink_folio_list() reclaim batching of multiple folios:
>   Implemented, will submit patch-series.
> - zswap writeback with decompress batching:
>   Implemented, will submit patch-series.
> - zram:
>   Implemented, will submit patch-series.
> - kcompressd:
>   Not yet implemented.
> - file systems:
>   Not yet implemented.
> - swapin_readahead() decompression batching of prefetched pages:
>   Implemented, will submit patch-series.
>
>
> iaa_crypto Driver Rearchitecting and Optimizations:
> ===================================================
>
> The most significant highlight of v13 is a new, lightweight and highly
> optimized iaa_crypto driver, resulting directly in the latency and
> throughput improvements noted later in this cover letter.
>
> 1) Better stability and functional versatility, to support zswap with
>    better performance on different Intel platforms.
>
>    a) Patches 0002, 0005 and 0011 together resolve a race condition in
>       mainline v6.15, reported from internal validation, when IAA
>       wqs/devices are disabled while workloads are using IAA.
>
>    b) Patch 0002 introduces a new architecture for mapping cores to
>       IAAs based on packages instead of NUMA nodes, and generalizes
>       how WQs are used: as package-level shared resources for all
>       same-package cores (default for compress WQs), or dedicated to
>       mapped cores (default for decompress WQs). Further, users are
>       able to configure multiple WQs and specify how many of those are
>       for compress jobs only vs. decompress jobs only. sysfs iaa_crypto
>       driver parameters can be used to change the default settings for
>       performance tuning.
>
>    c) idxd descriptor allocation moved from blocking to non-blocking
>       with retry limits and mitigations if limits are exceeded.
>
>    d) Code cleanup for readability and clearer code flow.
>
>    e) Fixes IAA re-registration errors upon disabling/enabling IAA wqs
>       and devices that exist in mainline v6.15.
>
>    f) Addition of a layer that encapsulates iaa_crypto's core
>       functionality to rely only on idxd, dma and scatterlists, to
>       provide clean interfaces to crypto_acomp.
>
>    g) New Dynamic compression mode for Granite Rapids to get a better
>       compression ratio, by echo-ing 'deflate-iaa-dynamic' as the zswap
>       compressor.
>
>    h) New crypto_acomp API crypto_acomp_batch_size() that returns the
>       driver's max batch size if the driver has registered a batch_size
>       greater than 1; or 1 if there is no driver-specific definition of
>       batch_size.
>
>       Accordingly, iaa_crypto sets the acomp_alg batch_size to its
>       internal IAA_CRYPTO_MAX_BATCH_SIZE for fixed and dynamic modes.
>
> 2) Performance optimizations (please refer to the latency data per
>    optimization in the commit logs):
>
>    a) Distributing [de]compress jobs in a round-robin manner to available
>       IAAs on the package.
>
>    b) Replacing the compute-intensive iaa_wq_get()/iaa_wq_put() with a
>       percpu_ref in struct iaa_wq, thereby eliminating acquiring a
>       spinlock in the fast path, while using a combination of the
>       iaa_crypto_enabled atomic with spinlocks in the slow path to
>       ensure the compress/decompress code sees a consistent state of
>       the wq tables.
>
>    c) Directly call movdir64b for non-irq use cases, i.e., the most
>       common usage. Avoid the overhead of irq-specific computes in
>       idxd_submit_desc() to gain latency.
>
>    d) Batching of compressions/decompressions using an async submit-poll
>       mechanism to derive the benefits of hardware parallelism.
>
>    e) Batching compressors need to manage their own "requests"
>       abstraction, removing this driver-specific aspect from being
>       managed by kernel users such as zswap. iaa_crypto maintains
>       per-CPU "struct iaa_req **reqs" to submit multiple jobs to the
>       hardware accelerator to run in parallel.
>
>    f) Modifies the iaa_crypto batching API and its implementation to
>       expect a src SG list that contains the batch's pages and a dst SG
>       list that has multiple scatterlists for the batch's output buffers.
>
>    g) Submit the two largest data buffers first for decompression
>       batching, so that the longest-running jobs get a head start,
>       reducing latency for the batch.
>
> 3) Compress/decompress batching is implemented using SG lists as the
>    batching interface.
>
>
> Main Changes in Zswap Compression Batching:
> ===========================================
>
> Note to zswap maintainers:
> --------------------------
> Patches 19 and 20 can be reviewed and improved/merged independently
> of this series, since they are zswap centric. These 2 patches help
> batching, but the crypto_acomp_batch_size() from the iaa_crypto commits
> in this series is not a requirement, unlike patches 21-22.
>
> 1) v13 preserves the pool acomp_ctx resources creation/deletion
>    simplification of v11, namely, lasting from pool creation to deletion,
>    persisting through CPU hot[un]plug operations. Further, zswap no
>    longer needs to create multiple "struct acomp_req" in the per-CPU
>    acomp_ctx. zswap only needs to manage multiple "u8 **buffers".
>
> 2) We store the compressor's batch-size (@pool->compr_batch_size)
>    directly in struct zswap_pool for quick retrieval in the
>    zswap_store() fast path.
>
> 3) Optimizations to not cause regressions in software compressors with
>    the introduction of the new unified zswap_compress() framework that
>    implements compression batching for all compressors. These
>    optimizations help recover the performance for non-batching
>    compressors:
>
>    a) kmem_cache_alloc_bulk(), kmem_cache_free_bulk() to allocate/free
>       batches of zswap_entry-s. These kmem_cache APIs allow allocator
>       optimizations with internal locks for multiple allocations.
>
>    b) The page's nid is stored in a new nid field added to zswap_entry,
>       so that zswap_lru_add()/zswap_lru_del() will add/delete the entry
>       from the LRU list of the page's nid. This preserves the current
>       behavior with respect to the shrinker.
>
>    c) The zswap_entry is written right after it is allocated, without
>       modifying the publishing order. This avoids different code blocks
>       in zswap_store_pages() having to bring the zswap_entries to the
>       cache for writing, potentially evicting other working-set
>       structures and impacting performance.
>
>    d) ZSWAP_MAX_BATCH_SIZE is used as the batch-size for software
>       compressors, since this gives the best performance with zstd.
>
>    e) Minimize branches in zswap_compress().
>
> 4) During pool creation, these key additions are allocated as part of the
>    per-CPU acomp_ctx so as to recover performance with the new,
>    generalized SG-lists-based zswap_compress() batching interface:
>
>    a) An sg_table "acomp_ctx->sg_outputs" is allocated to contain the
>       compressor's batch-size number of SG lists that will contain the
>       destination buffers/lengths after batch compression.
>    b) The per-CPU destination buffers are mapped to the per-CPU SG
>       lists: this needs to be done only once, and optimizes performance.
>
> 5) A unified zswap_compress() API is added to compress multiple pages.
>    Thanks to Nhat, Yosry and Johannes for their helpful suggestions to
>    accomplish this.
>
> 6) Finally, zswap_compress() has been re-written to incorporate Herbert's
>    suggestions to use source folios and output SG lists for batching. The
>    new zswap_compress() code has been made as generic to software and
>    batching compressors as possible, so that it is easy to read and
>    maintain. The recent changes related to PAGE_SIZE dst buffers, zsmalloc
>    and incompressible pages have been incorporated into the batched
>    zswap_compress() as well. To resolve regressions with zstd, I took the
>    liberty of not explicitly checking for dlen == 0 and dlen > PAGE_SIZE
>    (as in the mainline); instead, expecting that a negative err value will
>    be returned by the software compressor in such cases.
>
>
> Compression Batching:
> =====================
>
> This patch-series introduces batch compression of pages in large folios
> to improve zswap swapout latency. It preserves the existing zswap
> protocols for non-batching software compressors by calling crypto_acomp
> sequentially per page in the batch. Additionally, in support of hardware
> accelerators that can process a batch as an integral unit, the
> patch-series allows zswap to call crypto_acomp without API changes, for
> compressors that intrinsically support batching. The zswap_compress()
> code has very minimal special-casing for software/batching compressors.
>
> The patch series provides a proof point by using the Intel Analytics
> Accelerator (IAA) for implementing the compress/decompress batching API
> using hardware parallelism in the iaa_crypto driver, and another proof
> point with a sequential software compressor, zstd.
>
> SUMMARY:
> ========
>
>   The first proof point is to test with IAA using a sequential call
>   (fully synchronous, compress one page at a time) vs. a batching call
>   (fully asynchronous, submit a batch to IAA for parallel compression,
>   then poll for completion statuses).
>
>   The performance testing data with 30 usemem processes/64K folios
>   shows 62% throughput gains and 28% elapsed/sys time reductions with
>   deflate-iaa; and a 5% sys time reduction with zstd for a small
>   throughput increase. For PMD folios, a 67% throughput gain and a 23%
>   elapsed/sys time reduction is seen.
>
>   The kernel compilation test with 64K folios, using 32 threads and the
>   zswap shrinker_enabled set to "N", demonstrates similar improvements:
>   zswap_store() of large folios using IAA compress batching improves
>   the workload performance by 3.5% and reduces sys time by 6% as
>   compared to IAA sequential. For zstd, compress batching improves
>   workload performance by 3.4% and reduces sys time by 1.8% as compared
>   to sequentially calling zswap_compress() per page in a folio.
>
>   The main takeaway from usemem, a workload that is mostly compression
>   dominated (very few swapins), is that the higher the number of
>   batches, such as with larger folios, the more the benefit of batching
>   cost amortization, as shown by the PMD usemem data. This aligns well
>   with the future direction for batching.
>
>   The second proof point is to make sure that software algorithms such
>   as zstd do not regress. The data indicates that for sequential
>   software algorithms a performance gain is achieved.
>
>   With the performance optimizations implemented in patches 21-22 of v13:
>
>   * zstd usemem metrics with 64K folios are within the range of
>     variation, with a slight sys time improvement. zstd usemem30
>     workload performance with PMD folios improves by 6% and sys time
>     reduces by 8%, for throughput comparable to the baseline.
>
>   * With kernel compilation, I used zstd without the zswap shrinker to
>     enable more direct comparisons with the changes in this series.
>     Subsequent patch series, which I expect to submit in collaboration
>     with Nhat, will enable the zswap shrinker to quantify the benefits
>     of decompression batching during writeback. With this series'
>     compression batching within large folios, we get a 6%/1.8%
>     reduction in sys time and a 3.5%/3.4% improvement in workload
>     performance with 64K folios for deflate-iaa/zstd respectively.
>
>   These optimizations pertain to ensuring common code paths and removing
>   redundant branches/computes. Additionally, using the batching code for
>   non-batching compressors to sequentially compress/store batches of up
>   to ZSWAP_MAX_BATCH_SIZE pages seems to help, most likely due to the
>   cache locality of working-set structures such as the array of
>   zswap_entry-s for the batch.
>
>   Our internal validation of zstd with the batching interface vs. IAA
>   with the batching interface on Emerald Rapids has shown that IAA
>   compress/decompress batching gives 21.3% more memory savings as
>   compared to zstd, for a 5% performance loss as compared to the
>   baseline without any memory pressure. IAA batching demonstrates more
>   than 2X the memory savings obtained by zstd at this 95% performance
>   KPI. The compression ratio with IAA is 2.23, and with zstd 2.96. Even
>   with this compression ratio deficit for IAA, batching is extremely
>   beneficial. As we improve the compression ratio of the IAA
>   accelerator, we expect to see even better memory savings with IAA as
>   compared to software compressors.
>
>
> Batching Roadmap:
> =================
>
> 1) Compression batching within large folios (this series).
>
> 2) zswap writeback decompression batching:
>
>    This is being co-developed with Nhat Pham, and shows promising
>    results. We plan to submit an RFC shortly.
>
> 3) Reclaim batching of hybrid folios:
>
>    We can expect to see even more significant performance and throughput
>    improvements if we use the parallelism offered by IAA to do reclaim
>    batching of 4K/large folios (really any-order folios), and use the
>    zswap_store() high-throughput compression pipeline to batch-compress
>    pages comprising these folios, not just batching within large folios.
>    This is the reclaim batching patch 13 in v1, which we expect to
>    submit in a separate patch-series. As mentioned earlier, reclaim
>    batching reduces the # of writeback pages by 10X for zstd and
>    deflate-iaa.
>
> 4) swapin_readahead() decompression batching:
>
>    We have developed a zswap load batching interface to be used
>    for parallel decompression batching, using swapin_readahead().
>
>    These capabilities are architected so as to be useful to zswap and
>    zram. We have integrated these components with zram and expect to
>    submit an RFC soon.
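>
> To make the store-side flow described above concrete, here is a
> simplified, illustrative sketch of the batching loop this cover letter
> describes (and which Patch 21 below details): folios are stored in
> batches of pool->compr_batch_size for batching compressors, or
> ZSWAP_MAX_BATCH_SIZE for sequential ones. This is an outline only, not
> the actual patch code; the helper and field names are stand-ins that
> mirror the description:
>
>         /* Illustrative sketch only -- not the code added by this series. */
>         #define ZSWAP_MAX_BATCH_SIZE 8U
>
>         static bool zswap_store_folio_in_batches(struct folio *folio,
>                                                  struct zswap_pool *pool)
>         {
>                 long nr_pages = folio_nr_pages(folio);
>                 /* Batching compressors advertise > 1; software ones use 1. */
>                 unsigned int batch = pool->compr_batch_size > 1 ?
>                                      pool->compr_batch_size :
>                                      ZSWAP_MAX_BATCH_SIZE;
>                 long start;
>
>                 for (start = 0; start < nr_pages; start += batch) {
>                         long nr = min_t(long, batch, nr_pages - start);
>
>                         /*
>                          * zswap_store_pages(): bulk-allocate entries for
>                          * the batch, compress pages [start, start + nr),
>                          * then publish the entries to the xarray/LRU.
>                          */
>                         if (!zswap_store_pages(folio, start, nr, pool))
>                                 return false;
>                 }
>                 return true;
>         }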
>
>
> v13 Performance Summary:
> ========================
>
> This is a performance testing summary of results with usemem30
> (30 usemem processes running in a cgroup limited at 150G, each trying
> to allocate 10G).
>
> usemem30 with 64K folios:
> =========================
>
> zswap shrinker_enabled = N.
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             deflate-iaa       deflate-iaa  IAA Batching
>                                                              vs.
>                                                              IAA Sequential
>  ------------------------------------------------------------------------
>  Total throughput (KB/s)        6,118,675         9,901,216     62%
>  Average throughput (KB/s)        203,955           330,040     62%
>  elapsed time (sec)                  98.94             70.90    -28%
>  sys time (sec)                   2,379.29          1,686.18    -29%
>  ------------------------------------------------------------------------
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             zstd              zstd         v13 zstd
>                                                              improvement
>  ------------------------------------------------------------------------
>  Total throughput (KB/s)        5,983,561         6,003,851     0.3%
>  Average throughput (KB/s)        199,452           200,128     0.3%
>  elapsed time (sec)                 100.93             96.62    -4.3%
>  sys time (sec)                   2,532.49          2,395.83    -5%
>  ------------------------------------------------------------------------
>
> usemem30 with 2M folios:
> ========================
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             deflate-iaa       deflate-iaa  IAA Batching
>                                                              vs.
>                                                              IAA Sequential
>  ------------------------------------------------------------------------
>  Total throughput (KB/s)        6,309,635        10,558,225     67%
>  Average throughput (KB/s)        210,321           351,940     67%
>  elapsed time (sec)                  88.70             67.84    -24%
>  sys time (sec)                   2,059.83          1,581.07    -23%
>  ------------------------------------------------------------------------
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             zstd              zstd         v13 zstd
>                                                              improvement
>  ------------------------------------------------------------------------
>  Total throughput (KB/s)        6,562,687         6,567,946     0.1%
>  Average throughput (KB/s)        218,756           218,931     0.1%
>  elapsed time (sec)                  94.69             88.79    -6%
>  sys time (sec)                   2,253.97          2,083.43    -8%
>  ------------------------------------------------------------------------
>
>
> This is a performance testing summary of results with the
> kernel_compilation test (allmod config, 32 cores, cgroup limited to 2G).
>
> zswap shrinker_enabled = N.
>
> kernel_compilation with 64K folios:
> ===================================
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             deflate-iaa       deflate-iaa  IAA Batching
>                                                              vs.
>                                                              IAA Sequential
>  ------------------------------------------------------------------------
>  real_sec                           836.64            806.94    -3.5%
>  sys_sec                          3,897.57          3,661.83    -6%
>  ------------------------------------------------------------------------
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             zstd              zstd         Improvement
>  ------------------------------------------------------------------------
>  real_sec                           880.62            850.41    -3.4%
>  sys_sec                          5,171.90          5,076.51    -1.8%
>  ------------------------------------------------------------------------
>
>
> kernel_compilation with PMD folios:
> ===================================
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             deflate-iaa       deflate-iaa  IAA Batching
>                                                              vs.
>                                                              IAA Sequential
>  ------------------------------------------------------------------------
>  real_sec                           818.48            779.67    -4.7%
>  sys_sec                          4,226.52          4,245.18     0.4%
>  ------------------------------------------------------------------------
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             zstd              zstd         Improvement
>  ------------------------------------------------------------------------
>  real_sec                           888.45            849.54    -4.4%
>  sys_sec                          5,866.72          5,847.17    -0.3%
>  ------------------------------------------------------------------------
>
>
>
> The patch-series is organized as follows:
> =========================================
>
> 1) crypto acomp & iaa_crypto driver enablers for batching: Relevant
>    patches are tagged with "crypto:" in the subject:
>
>    Patch 1) Reorganizes the iaa_crypto driver code into logically related
>             sections and avoids forward declarations, in order to
>             facilitate subsequent iaa_crypto patches. This patch makes no
>             functional changes.
>
>    Patch 2) Makes an infrastructure change in the iaa_crypto driver
>             to map IAA devices/work-queues to cores based on packages
>             instead of NUMA nodes. This doesn't impact performance on
>             the Sapphire Rapids system used for performance testing.
>             However, this change fixes functional problems we found on
>             Granite Rapids during internal validation, where the number
>             of NUMA nodes is greater than the number of packages, which
>             was resulting in over-utilization of some IAA devices and
>             non-usage of other IAA devices under the current NUMA-based
>             mapping infrastructure.
>
>             This patch also develops a new architecture that generalizes
>             how IAA device WQs are used. It enables designating IAA
>             device WQs as either compress-only or decompress-only or
>             generic. Once IAA device WQ types are thus defined, it also
>             allows the configuration of whether device WQs will be
>             shared by all cores on the package, or used only by "mapped
>             cores" obtained by a simple allocation of available IAAs to
>             cores on the package.
>
>             As a result of the overhaul of wq_table definition,
>             allocation and rebalancing, this patch eliminates
>             duplication of device WQs in per-CPU wq_tables, thereby
>             saving 140MiB on a 384-core dual-socket Granite Rapids
>             server with 8 IAAs.
>
>             Regardless of how the user has configured the WQs' usage,
>             the next WQ to use is obtained through a direct look-up in
>             per-CPU "cpu_comp_wqs" and "cpu_decomp_wqs" structures, so
>             as to minimize latency in the critical-path driver compress
>             and decompress routines.
>
>    Patch 3) Code cleanup, consistency of function parameters.
>
>    Patch 4) Makes a change to the iaa_crypto driver's descriptor
>             allocation, from blocking to non-blocking with
>             retries/timeouts and mitigations in case of timeouts during
>             compress/decompress ops. This prevents tasks getting blocked
>             indefinitely, which was observed when testing 30 cores
>             running workloads, with only 1 IAA enabled on Sapphire
>             Rapids (out of 4). These timeouts are typically encountered,
>             and the associated mitigations exercised, only in
>             configurations with 1 IAA device shared by 30+ cores.
>
>    Patch 5) Optimize iaa_wq refcounts using a percpu_ref instead of
>             spinlocks and "int refcount".
>
>    Patch 6) Code simplification and restructuring for understandability
>             in the core iaa_compress() and iaa_decompress() routines.
>
>    Patch 7) Refactor hardware descriptor setup into separate procedures
>             to reduce code clutter.
>
>    Patch 8) Simplify and optimize job submission for the most commonly
>             used non-irq async mode by directly calling movdir64b.
>
>    Patch 9) Deprecate exporting symbols for adding IAA compression
>             modes.
>
>    Patch 10) All dma_map_sg() calls will pass in 1 for the nents instead
>              of sg_nents(), for these main reasons: performance; no
>              existing iaa_crypto use cases map multiple SG lists for DMA
>              at once; and it facilitates the new SG-lists batching
>              interface through crypto.
>
>    Patch 11) Move iaa_crypto core functionality to a layer that relies
>              only on the idxd driver, dma, and scatterlists. Implement
>              clean interfaces to crypto_acomp.
>
>    Patch 12) Define a unit_size in struct acomp_req to enable batching,
>              and provide acomp_request_set_unit_size() for use by kernel
>              modules. zswap_cpu_comp_prepare() calls this API to set the
>              unit_size for zswap as PAGE_SIZE.
>
>    Patch 13) Implement asynchronous descriptor submit and polling
>              mechanisms, enablers for batching. Develop IAA batching of
>              compressions and decompressions for deriving hardware
>              parallelism.
>
>    Patch 14) Enables the "async" mode, sets it as the default.
>
>    Patch 15) Disables verify_compress by default.
>
>    Patch 16) Decompress batching optimization: Find the two largest
>              buffers in the batch and submit them first.
>
>    Patch 17) Add a new Dynamic compression mode that can be used on
>              Granite Rapids.
>
>    Patch 18) Add a batch_size data member to struct acomp_alg and
>              a crypto_acomp_batch_size() API that returns the
>              compressor's batch-size, if it has defined one; 1 otherwise.
>
> 2) zswap modifications to enable compress batching in zswap_store()
>    of large folios (including pmd-mappable folios):
>
>    Patch 19) Simplifies the zswap_pool's per-CPU acomp_ctx resource
>              management and lifetime to be from pool creation to pool
>              deletion.
>
>    Patch 20) Uses IS_ERR_OR_NULL() in zswap_cpu_comp_prepare() to check
>              for valid acomp/req, thereby making it consistent with the
>              resource de-allocation code.
>    Patch 21) Defines a zswap-specific ZSWAP_MAX_BATCH_SIZE (currently
>              set as 8U) to denote the maximum number of acomp_ctx
>              batching resources to allocate, thus limiting the amount of
>              extra memory used for batching. Further, the "struct
>              crypto_acomp_ctx" is modified to contain multiple buffers.
>              A new "u8 compr_batch_size" member is added to "struct
>              zswap_pool" to track the number of dst buffers associated
>              with the compressor (more than 1 if the compressor supports
>              batching).
>
>              Modifies zswap_store() to store the folio in batches of
>              pool->compr_batch_size (batching compressors) or
>              ZSWAP_MAX_BATCH_SIZE (sequential compressors) by calling a
>              new zswap_store_pages() that takes a range of indices in
>              the folio to be stored.
>
>              zswap_store_pages() bulk-allocates zswap entries for the
>              batch, calls zswap_compress() for each page in this range,
>              and stores the entries in the xarray/LRU.
>
>    Patch 22) Introduces a new unified batching implementation of
>              zswap_compress() for compressors that do and do not support
>              batching. This eliminates code duplication and facilitates
>              code maintainability with the introduction of compress
>              batching. Further, there are many optimizations to this
>              common code that result in workload throughput and
>              performance improvements with software compressors and
>              hardware accelerators such as IAA.
>
>              zstd performance is better than or on par with mm-unstable.
>              We see impressive throughput/performance improvements with
>              IAA, and workload performance/sys time improvement with
>              zstd batching vs. no-batching.
>
>
> With v13 of this patch series, the IAA compress batching feature will be
> enabled seamlessly on Intel platforms that have IAA by selecting
> 'deflate-iaa' as the zswap compressor, and using the iaa_crypto 'async'
> sync_mode driver attribute (the default).
>
>
> System setup for testing:
> =========================
> Testing of this patch-series was done with mm-unstable as of 10-24-2025,
> commit 813c0fa931ce, without and with this patch-series. Data was
> gathered on an Intel Sapphire Rapids (SPR) server, dual-socket, 56 cores
> per socket, 4 IAA devices per socket, each IAA with a total of 128 WQ
> entries, 503 GiB RAM and a 525G SSD disk partition for swap. Core
> frequency was fixed at 2500MHz.
>
> Other kernel configuration parameters:
>
>   zswap compressor  : zstd, deflate-iaa
>   zswap allocator   : zsmalloc
>   vm.page-cluster   : 0
>
> IAA "compression verification" is disabled and IAA is run in the async
> mode (the defaults with this series).
>
> I ran experiments with these workloads:
>
> 1) usemem 30 processes with zswap shrinker_enabled=N. Two sets of
>    experiments, one with 64K folios, another with PMD folios.
>
> 2) Kernel compilation allmodconfig with 2G max memory, 32 threads, with
>    zswap shrinker_enabled=N to test batching performance impact in
>    isolation. Two sets of experiments, one with 64K folios, another with
>    PMD folios.
>
> IAA configuration is done by a CLI script, included at the end of the
> cover letter.
>
>
> Performance testing (usemem30):
> ===============================
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 150G. There is no swap limit set for the cgroup.
> 30 usemem processes were run, each allocating and writing 10G of memory,
> and sleeping for 10 sec before exiting:
>
>   usemem --init-time -w -O -b 1 -s 10 -n 30 10g
>   echo 0 > /sys/module/zswap/parameters/shrinker_enabled
>
> IAA WQ Configuration (script is included at the end of the cover
> letter):
>
>   ./enable_iaa.sh -d 4 -q 1
>
> This enables all 4 IAAs on the socket, and configures 1 WQ per IAA
> device, each containing 128 entries. The driver distributes compress
> jobs from each core to wqX.0 of all same-package IAAs in a round-robin
> manner. Decompress jobs are sent to the wqX.0 of the mapped IAA device.
>
> Since usemem has significantly more swapouts than swapins, this
> configuration is optimal.
>
> 64K folios: usemem30: deflate-iaa:
> ==================================
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             deflate-iaa       deflate-iaa  IAA Batching
>                                                              vs.
>                                                              IAA Sequential
>  ------------------------------------------------------------------------
>  Total throughput (KB/s)        6,118,675         9,901,216     62%
>  Avg throughput (KB/s)            203,955           330,040     62%
>  elapsed time (sec)                  98.94             70.90    -28%
>  sys time (sec)                   2,379.29          1,686.18    -29%
>  ------------------------------------------------------------------------
>  memcg_high                     1,263,467         1,404,068
>  memcg_swap_fail                    1,728             1,377
>  64kB_swpout_fallback               1,728             1,377
>  zswpout                       58,174,008        64,508,622
>  zswpin                                43               138
>  pswpout                                0                 0
>  pswpin                                 0                 0
>  ZSWPOUT-64kB                   3,634,162         4,030,643
>  SWPOUT-64kB                            0                 0
>  pgmajfault                         2,398             2,488
>  zswap_reject_compress_fail             0                 0
>  zswap_reject_reclaim_fail              0                 0
>  IAA incompressible pages               0                 0
>  ------------------------------------------------------------------------
>
>
> 2M folios: usemem30: deflate-iaa:
> =================================
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             deflate-iaa       deflate-iaa  IAA Batching
>                                                              vs.
>                                                              IAA Sequential
>  ------------------------------------------------------------------------
>  Total throughput (KB/s)        6,309,635        10,558,225     67%
>  Avg throughput (KB/s)            210,321           351,940     67%
>  elapsed time (sec)                  88.70             67.84    -24%
>  sys time (sec)                   2,059.83          1,581.07    -23%
>  ------------------------------------------------------------------------
>  memcg_high                       116,246           125,218
>  memcg_swap_fail                       41               177
>  thp_swpout_fallback                   41               177
>  zswpout                       59,880,021        64,509,854
>  zswpin                                69               425
>  pswpout                                0                 0
>  pswpin                                 0                 0
>  ZSWPOUT-2048kB                   116,912           125,822
>  thp_swpout                             0                 0
>  pgmajfault                         2,408             4,026
>  zswap_reject_compress_fail             0                 0
>  zswap_reject_reclaim_fail              0                 0
>  IAA incompressible pages               0                 0
>  ------------------------------------------------------------------------
>
>
> 64K folios: usemem30: zstd:
> ===========================
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             zstd              zstd         v13 zstd
>                                                              improvement
>  ------------------------------------------------------------------------
>  Total throughput (KB/s)        5,983,561         6,003,851     0.3%
>  Avg throughput (KB/s)            199,452           200,128     0.3%
>  elapsed time (sec)                 100.93             96.62    -4.3%
>  sys time (sec)                   2,532.49          2,395.83    -5%
>  ------------------------------------------------------------------------
>  memcg_high                     1,122,198         1,113,384
>  memcg_swap_fail                      192                55
>  64kB_swpout_fallback                 192                55
>  zswpout                       48,766,907        48,799,863
>  zswpin                                89                68
>  pswpout                                0                 0
>  pswpin                                 0                 0
>  ZSWPOUT-64kB                   3,047,702         3,049,908
>  SWPOUT-64kB                            0                 0
>  pgmajfault                         2,428             2,390
>  zswap_reject_compress_fail             0                 0
>  zswap_reject_reclaim_fail              0                 0
>  ------------------------------------------------------------------------
>
>
> 2M folios: usemem30: zstd:
> ==========================
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             zstd              zstd         v13 zstd
>                                                              improvement
>  ------------------------------------------------------------------------
>  Total throughput (KB/s)        6,562,687         6,567,946     0.1%
>  Avg throughput (KB/s)            218,756           218,931     0.1%
>  elapsed time (sec)                  94.69             88.79    -6%
>  sys time (sec)                   2,253.97          2,083.43    -8%
>  ------------------------------------------------------------------------
>  memcg_high                        92,709            92,686
>  memcg_swap_fail                       33               226
>  thp_swpout_fallback                   33               226
>  zswpout                       47,851,601        47,847,171
>  zswpin                                65               441
>  pswpout                                0                 0
>  pswpin                                 0                 0
>  ZSWPOUT-2048kB                    93,427            93,238
>  thp_swpout                             0                 0
>  pgmajfault                         2,382             2,767
>  zswap_reject_compress_fail             0                 0
>  zswap_reject_reclaim_fail              0                 0
>  ------------------------------------------------------------------------
>
>
> Performance testing (Kernel compilation, allmodconfig):
> =======================================================
>
> The kernel compilation experiments use 32 threads and build
> "allmodconfig", which takes ~14 minutes and has considerable
> swapout/swapin activity. The cgroup's memory.max is set to 2G. zswap
> writeback is not enabled, so as to isolate the performance impact of
> only large-folio batch compression.
>
>   echo 0 > /sys/module/zswap/parameters/shrinker_enabled
>
> IAA WQ Configuration (script is at the end of the cover letter):
>
>   ./enable_iaa.sh -d 4 -q 2
>
> This enables all 4 IAAs on the socket, and configures 2 WQs per IAA,
> each containing 64 entries. The driver sends decompresses to wqX.0 of
> the mapped IAA device, and distributes compresses to wqX.1 of all
> same-package IAAs in a round-robin manner.
>
> 64K folios: Kernel compilation/allmodconfig: deflate-iaa:
> =========================================================
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             deflate-iaa       deflate-iaa  IAA Batching
>                                                              vs.
>                                                              IAA Sequential
>  ------------------------------------------------------------------------
>  real_sec                           836.64            806.94    -3.5%
>  user_sec                        15,702.26         15,695.13
>  sys_sec                          3,897.57          3,661.83    -6%
>  ------------------------------------------------------------------------
>  Max_Res_Set_Size_KB            1,872,500         1,873,144
>  ------------------------------------------------------------------------
>  memcg_high                             0                 0
>  memcg_swap_fail                        0                 0
>  64kB_swpout_fallback                   0                 0
>  zswpout                       94,890,390        93,332,527
>  zswpin                        28,305,656        28,111,525
>  pswpout                                0                 0
>  pswpin                                 0                 0
>  ZSWPOUT-64kB                   3,088,473         3,018,341
>  SWPOUT-64kB                            0                 0
>  pgmajfault                    29,958,141        29,776,102
>  zswap_reject_compress_fail             0                 0
>  zswap_reject_reclaim_fail              0                 0
>  IAA incompressible pages             684               442
>  ------------------------------------------------------------------------
>
>
> 2M folios: Kernel compilation/allmodconfig: deflate-iaa:
> ========================================================
>
>  ------------------------------------------------------------------------
>                              mm-unstable-10-24-2025  v13
>  ------------------------------------------------------------------------
>  zswap compressor             deflate-iaa       deflate-iaa  IAA Batching
>                                                              vs.
>                                                              IAA Sequential
>  ------------------------------------------------------------------------
>  real_sec                           818.48            779.67    -4.7%
>  user_sec                        15,798.78         15,807.93
>  sys_sec                          4,226.52          4,245.18     0.4%
>  ------------------------------------------------------------------------
>  Max_Res_Set_Size_KB            1,871,096         1,871,100
>  ------------------------------------------------------------------------
>  memcg_high                             0                 0
>  memcg_swap_fail                        0                 0
>  thp_swpout_fallback                    0                 0
>  zswpout                      105,675,621       109,930,550
>  zswpin                        36,537,688        38,205,575
>  pswpout                                0                 0
>  pswpin                                 0                 0
>  ZSWPOUT-2048kB                    15,600            15,800
>  thp_swpout                             0                 0
>  pgmajfault                    37,843,091        39,540,387
>  zswap_reject_compress_fail             0                 0
>  zswap_reject_reclaim_fail              0                 0
>  IAA incompressible pages             188               349
>  ------------------------------------------------------------------------
>
>
> With the iaa_crypto driver changes for non-blocking descriptor
> allocations, no timeouts-with-mitigations were seen in
> compress/decompress jobs, for all of the above experiments.
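>
> The deflate-iaa gains in the tables above come from the asynchronous
> submit-then-poll batching described earlier (optimization 2d, Patch 13):
> all jobs in a batch are submitted to the hardware first, and only then
> are their completions polled. The outline below is purely illustrative,
> with hypothetical helper names; it is not taken from the driver:
>
>         /* Illustrative outline of submit-then-poll batching (not driver code). */
>         int submit_desc_nonblocking(struct iaa_req *req);  /* hypothetical */
>         int poll_for_completion(struct iaa_req *req);      /* hypothetical */
>
>         static int iaa_batch_compress_async(struct iaa_req **reqs, int nr_reqs)
>         {
>                 int i, err = 0;
>
>                 /* Phase 1: submit all descriptors so the jobs run in parallel. */
>                 for (i = 0; i < nr_reqs; i++)
>                         err |= submit_desc_nonblocking(reqs[i]);
>
>                 /* Phase 2: poll each job's completion record for its status. */
>                 for (i = 0; i < nr_reqs; i++)
>                         err |= poll_for_completion(reqs[i]);
>
>                 return err ? -EIO : 0;
>         }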
>
>
> 64K folios: Kernel compilation/allmodconfig: zstd:
> ===================================================
>
> -------------------------------------------------------------------------------
>                                mm-unstable-10-24-2025      v13
> -------------------------------------------------------------------------------
> zswap compressor                          zstd            zstd    Improvement
> -------------------------------------------------------------------------------
> real_sec                                880.62          850.41          -3.4%
> user_sec                             15,717.23       15,683.17
> sys_sec                               5,171.90        5,076.51          -1.8%
> -------------------------------------------------------------------------------
> Max_Res_Set_Size_KB                  1,871,276       1,874,744
> -------------------------------------------------------------------------------
> memcg_high                                   0               0
> memcg_swap_fail                              0               0
> 64kB_swpout_fallback                         0               0
> zswpout                             76,599,637      76,472,392
> zswpin                              21,833,178      22,538,969
> pswpout                                      0               0
> pswpin                                       0               0
> ZSWPOUT-64kB                         2,462,404       2,446,549
> SWPOUT-64kB                                  0               0
> pgmajfault                          23,027,211      23,830,391
> zswap_reject_compress_fail                   0               0
> zswap_reject_reclaim_fail                    0               0
> -------------------------------------------------------------------------------
>
>
> 2M folios: Kernel compilation/allmodconfig: zstd:
> ==================================================
>
> -------------------------------------------------------------------------------
>                                mm-unstable-10-24-2025      v13
> -------------------------------------------------------------------------------
> zswap compressor                          zstd            zstd    Improvement
> -------------------------------------------------------------------------------
> real_sec                                888.45          849.54          -4.4%
> user_sec                             15,841.87       15,828.10
> sys_sec                               5,866.72        5,847.17          -0.3%
> -------------------------------------------------------------------------------
> Max_Res_Set_Size_KB                  1,871,096       1,872,892
> -------------------------------------------------------------------------------
> memcg_high                                   0               0
> memcg_swap_fail                              0               0
> thp_swpout_fallback                          0               0
> zswpout                             89,891,328      90,847,761
> zswpin                              29,249,656      29,999,617
> pswpout                                      0               0
> pswpin                                       0               0
> ZSWPOUT-2048kB                          12,198          12,481
> thp_swpout                                   0               0
> pgmajfault                          30,077,425      30,915,945
> zswap_reject_compress_fail                   0               0
> zswap_reject_reclaim_fail                    0               0
> -------------------------------------------------------------------------------
>
>
>
> Changes since v12:
> ==================
> 1) Rebased to mm-unstable as of 10-24-2025, commit 813c0fa931ce.
> 2) Added "int nid" to zswap_entry to store the page's nid, to preserve zswap
>    LRU list/shrinker behavior with bulk allocation, as suggested by Nhat and
>    Yosry. No change in the memory footprint of struct zswap_entry.
> 3) Added a WARN_ON() if kmem_cache_alloc_bulk() returns 0 or a number that is
>    different from nr_entries, as suggested by Yosry.
> 4) Confirmed that kmem_cache_free_bulk() works for both bulk and non-bulk
>    allocated entries, to follow up on Yosry's comment.
> 5) Moved the call to cpuhp_state_remove_instance() to zswap_pool_destroy(),
>    as suggested by Yosry.
> 6) Variable names changed to "nid" and "wb_enabled", per Yosry's suggestion.
> 7) Concise comments in zswap.c, and summarized commit logs, as suggested by
>    Yosry.
> 8) Minimized branches in zswap_compress().
> 9) Deleted allocating extra memory in acomp_req->__ctx[] to statically store
>    addresses of the SG lists' lengths, as suggested by Herbert.
> 10) Deleted the iaa_comp API and export symbols, as suggested by Herbert.
> 11) Deleted @batch_size in struct crypto_acomp. Instead, the value is
>     returned from struct acomp_alg directly, as suggested by Herbert.
> 12) Addressed checkpatch.pl warnings and coding style suggestions in the
>     iaa_crypto patches, provided by Vinicius Gomes in internal code reviews.
>     Thanks Vinicius!
>
>
> Changes since v11:
> ==================
> 1) Rebased to mm-unstable as of 9-18-2025, commit 1f98191f08b4.
> 2) Incorporated Herbert's suggestions on submitting the folio as the source
>    and SG lists for the destination, to create the compress batching
>    interface from zswap to crypto.
> 3) As per Herbert's suggestion, added a new unit_size member to struct
>    acomp_req, along with an acomp_request_set_unit_size() API for kernel
>    modules to set the unit size to use while breaking down the request's
>    src/dst scatterlists.
> 4) Implemented iaa_crypto batching using the new SG-lists based architecture
>    and crypto interfaces.
> 5) To make the SG-lists based approach functional and performant for IAA, I
>    have changed all the calls to dma_map_sg() to use nents of 1. This should
>    not be a concern, since it eliminates redundant computes to scan an SG
>    list with only one scatterlist for existing kernel users, i.e. zswap with
>    the zswap_compress() modifications in this series. This will continue to
>    hold true with the zram IAA batching support I am developing. There are
>    no kernel use cases for the iaa_crypto driver that will break this
>    assumption.
> 6) Addressed Herbert's comment about batch_size being a statically defined
>    data member in struct acomp_alg and struct crypto_acomp.
> 7) Addressed Nhat's comment about VM_WARN_ON_ONCE(nr_pages >
>    ZSWAP_MAX_BATCH_SIZE) in zswap_store_pages().
> 8) Nhat's comment about deleting struct swap_batch_decomp_data is
>    automatically addressed by the SG-lists based rewrite of the crypto
>    batching interface.
> 9) Addressed Barry's comment about renaming pool->batch_size to
>    pool->store_batch_size.
> 10) Incorporated Barry's suggestion to merge patches that introduce data
>     members to structures and/or APIs with the patches that use them.
> 11) Added performance data to patch 0023's commit log, as suggested by Barry.
>
> Changes since v10:
> ==================
> 1) Rebased to mm-unstable as of 7-30-2025, commit 01da54f10fdd.
> 2) Added change logging in patch 0024 on there being no Intel-specific
>    dependencies in the batching framework, as suggested by Andrew Morton.
>    Thanks Andrew!
> 3) Added change logging in patch 0024 on other ongoing work that can use
>    batching, as per Andrew's suggestion. Thanks Andrew!
> 4) Added the IAA configuration script to the cover letter, as suggested by
>    Nhat Pham. Thanks Nhat!
> 5) As suggested by Nhat, dropped patch 0020 from v10, which moved CPU
>    hotplug procedures to pool functions.
> 6) Gathered kernel_compilation 'allmod' config performance data with
>    writeback and zswap shrinker_enabled=Y.
> 7) Changed the pool->batch_size for software compressors to be
>    ZSWAP_MAX_BATCH_SIZE, since this gives better performance with the zswap
>    shrinker enabled.
> 8) Was unable to replicate in v11 the issue seen in v10 with higher
>    memcg_swap_fail than in the baseline, with usemem30/zstd.
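As a concrete illustration of items (2) and (3) in the "Changes since v12"
list above, the sketch below shows one way bulk entry allocation for a batch
of pages could look. The helper name and locals are hypothetical and only
meant to illustrate the described behavior (bulk allocation, a WARN_ON() on a
short return, and recording the nid added to zswap_entry by this series); it
is not the actual patch:

	/*
	 * Sketch: bulk-allocate one zswap_entry per page of the batch.
	 * kmem_cache_alloc_bulk() returns the number of objects allocated;
	 * anything other than nr_entries is treated as a failure, per the
	 * v12 changelog.
	 */
	static int zswap_alloc_entries_sketch(struct zswap_entry **entries,
					      unsigned int nr_entries, int nid)
	{
		unsigned int i, nr;

		nr = kmem_cache_alloc_bulk(zswap_entry_cache, GFP_KERNEL,
					   nr_entries, (void **)entries);
		if (WARN_ON(nr != nr_entries)) {
			kmem_cache_free_bulk(zswap_entry_cache, nr,
					     (void **)entries);
			return -ENOMEM;
		}

		for (i = 0; i < nr_entries; i++)
			entries[i]->nid = nid; /* preserves per-node LRU/shrinker behavior */

		return 0;
	}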
>
> Changes since v9:
> =================
> 1) Rebased to mm-unstable as of 6-24-2025, commit 23b9c0472ea3.
> 2) iaa_crypto rearchitecting, mainline race condition fix, performance
>    optimizations, code cleanup.
> 3) Addressed Herbert's comments in v9 patch 10, that an array based
>    crypto_acomp interface is not acceptable.
> 4) Optimized the implementation of the batching zswap_compress() and
>    zswap_store_pages() added in v9, to recover performance when integrated
>    with the changes in commit 56e5a103a721 ("zsmalloc: prefer the the
>    original page's node for compressed data").
>
> Changes since v8:
> =================
> 1) Rebased to mm-unstable as of 4-21-2025, commit 2c01d9f3c611.
> 2) Backported commits for reverting request chaining, since these are in
>    cryptodev-2.6 but not yet in mm-unstable: without these backports,
>    deflate-iaa is non-functional in mm-unstable:
>      commit 64929fe8c0a4 ("crypto: acomp - Remove request chaining")
>      commit 5976fe19e240 ("Revert "crypto: testmgr - Add multibuffer acomp
>      testing"")
>    Backported this hotfix as well:
>      commit 002ba346e3d7 ("crypto: scomp - Fix off-by-one bug when
>      calculating last page").
> 3) crypto_acomp_[de]compress() restored to non-request-chained
>    implementations, since request chaining has been removed from acomp in
>    commit 64929fe8c0a4 ("crypto: acomp - Remove request chaining").
> 4) New IAA WQ architecture to denote WQ type and whether a WQ should be
>    shared among all package cores, or only with the "mapped" ones from an
>    even cores-to-IAA distribution scheme.
> 5) Compress/decompress batching are implemented in iaa_crypto using the new
>    crypto_acomp_batch_compress()/crypto_acomp_batch_decompress() API.
> 6) Defines a "void *data" in struct acomp_req, based on Herbert advising
>    against using req->base.data in the driver. This is needed for async
>    submit-poll to work.
> 7) In zswap.c, moved the CPU hotplug callbacks to reside in "pool
>    functions", per Yosry's suggestion to move procedures in a distinct patch
>    before the refactoring patches.
> 8) A new "u8 nr_reqs" member is added to "struct zswap_pool" to track the
>    number of requests/buffers associated with the per-cpu acomp_ctx, as per
>    Yosry's suggestion.
> 9) Simplifications to the acomp_ctx resources allocation, deletion and
>    locking, and for these to exist from pool creation to pool deletion,
>    based on v8 code review discussions with Yosry.
> 10) Use IS_ERR_OR_NULL() consistently in zswap_cpu_comp_prepare() and
>     acomp_ctx_dealloc(), as per Yosry's v8 comment.
> 11) zswap_store_folio() is deleted, and instead, the loop over
>     zswap_store_pages() is moved inline in zswap_store(), per Yosry's
>     suggestion.
> 12) Better structure in zswap_compress(): a unified procedure that
>     compresses/stores a batch of pages for both non-batching and batching
>     compressors. Renamed from zswap_batch_compress() to zswap_compress().
>     Thanks Yosry for these suggestions.
>
>
> Changes since v7:
> =================
> 1) Rebased to mm-unstable as of 3-3-2025, commit 5f089a9aa987.
> 2) Changed the acomp_ctx->nr_reqs to be u8, since ZSWAP_MAX_BATCH_SIZE is
>    defined as 8U, for saving memory in this per-cpu structure.
> 3) Fixed a typo in code comments in acomp_ctx_get_cpu_lock():
>    acomp_ctx->initialized to acomp_ctx->__online.
> 4) Incorporated suggestions from Yosry, Chengming, Nhat and Johannes,
>    thanks to all!
>    a) zswap_batch_compress() replaces zswap_compress().
>       Thanks Yosry for this suggestion!
>    b) Process the folio in sub-batches of ZSWAP_MAX_BATCH_SIZE, regardless
>       of whether or not the compressor supports batching. This gets rid of
>       the kmalloc(entries), and allows us to allocate an array of
>       ZSWAP_MAX_BATCH_SIZE entries on the stack. This is implemented in
>       zswap_store_pages().
>    c) Use of a common structure and code paths for compressing a folio in
>       batches, either as a request chain (in parallel in IAA hardware) or
>       sequentially. No code duplication since zswap_compress() has been
>       replaced with zswap_batch_compress(), simplifying maintainability.
> 5) A key difference between compressors that support batching and those
>    that do not, is that for the latter, the acomp_ctx mutex is
>    locked/unlocked per ZSWAP_MAX_BATCH_SIZE batch, so that decompressions
>    to handle page-faults can make progress. This fixes the zstd kernel
>    compilation regression seen in v7. For compressors that support
>    batching, e.g. IAA, the mutex is locked/released once for storing the
>    folio.
> 6) Used likely/unlikely compiler directives and prefetchw to restore
>    performance with the common code paths.
>
> Changes since v6:
> =================
> 1) Rebased to mm-unstable as of 2-27-2025, commit d58172d128ac.
>
> 2) Deleted the crypto_acomp_batch_compress() and
>    crypto_acomp_batch_decompress() interfaces, as per Herbert's suggestion.
>    Batching is instead enabled by chaining the requests. For non-batching
>    compressors, there is no request chaining involved. Both batching and
>    non-batching compressions are accomplished by zswap by calling:
>
>      crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]),
>                      &acomp_ctx->wait);
>
> 3) iaa_crypto implementation of batch compressions/decompressions using
>    request chaining, as per Herbert's suggestions.
> 4) Simplification of the acomp_ctx resource allocation/deletion with
>    respect to CPU hot[un]plug, to address Yosry's suggestions to explore
>    the mutex options in zswap_cpu_comp_prepare(). Yosry, please let me know
>    if the per-cpu memory cost of this proposed change is acceptable (IAA:
>    64.8KB, software compressors: 8.2KB). On the positive side, I believe
>    restarting reclaim on a CPU after it has been through an offline-online
>    transition will be much faster by not deleting the acomp_ctx resources
>    when the CPU gets offlined.
> 5) Use of lockdep assertions rather than comments for internal locking
>    rules, as per Yosry's suggestion.
> 6) No specific references to IAA in zswap.c, as suggested by Yosry.
> 7) Explored various solutions other than the v6 zswap_store_folio()
>    implementation, to fix the zstd regression seen in v5, to attempt to
>    unify common code paths, and to allocate smaller arrays for the zswap
>    entries on the stack. All of these options were found to cause a
>    usemem30 latency regression with zstd. The v6 version of
>    zswap_store_folio() is the only implementation that does not cause a
>    zstd regression, confirmed by 10 consecutive runs, each giving quite
>    consistent latency numbers. Hence, the v6 implementation is carried
>    forward to v7, with changes for branching between batching vs.
>    sequential compression API calls.
>
>
> Changes since v5:
> =================
> 1) Rebased to mm-unstable as of 2-1-2025, commit 7de6fd8ab650.
>
> Several improvements, regression fixes and bug fixes, based on Yosry's
> v5 comments (Thanks Yosry!):
>
> 2) Fix for the zstd performance regression in v5.
> 3) Performance debug and fix for marginal improvements with IAA batching
>    vs. sequential.
> 4) Performance testing data compares IAA with and without batching, instead
>    of IAA batching against zstd.
> 5) Commit logs/zswap comments no longer mention crypto_acomp implementation
>    details.
> 6) Deleted the pr_info_once() when batching resources are allocated in
>    zswap_cpu_comp_prepare().
> 7) Use kcalloc_node() for the multiple acomp_ctx buffers/reqs in
>    zswap_cpu_comp_prepare().
> 8) Simplify and consolidate error handling cleanup code in
>    zswap_cpu_comp_prepare().
> 9) Introduce zswap_compress_folio() in a separate patch.
> 10) Bug fix in zswap_store_folio(): an xa_store() failure could cause all
>     compressed objects and entries to be freed, followed by a UAF when
>     zswap_store() tries to free the entries that were already added to the
>     xarray prior to the failure.
> 11) Deleted compressed_bytes/bytes. zswap_store_folio() also comprehends
>     the recent fixes in commit bf5eaaaf7941 ("mm/zswap: fix inconsistency
>     when zswap_store_page() fails") by Hyeonggon Yoo.
>
> iaa_crypto improvements/fixes/changes:
>
> 12) Enables asynchronous mode and makes it the default. With commit
>     4ebd9a5ca478 ("crypto: iaa - Fix IAA disabling that occurs when
>     sync_mode is set to 'async'"), async mode was previously just sync. We
>     now have true async support.
> 13) Change idxd descriptor allocations from blocking to non-blocking with
>     timeouts, and mitigations for compress/decompress ops that fail to
>     obtain a descriptor. This is a fix for "task blocked" errors seen in
>     configurations where 30+ cores are running workloads under high memory
>     pressure and sending comps/decomps to 1 IAA device.
> 14) Fixes a bug with unprotected access of "deflate_generic_tfm" in
>     deflate_generic_decompress(), which can cause data corruption and a
>     zswap_decompress() kernel crash.
> 15) zswap uses crypto_acomp_batch_compress() with async polling instead of
>     request chaining, for slightly better latency. However, the request
>     chaining framework itself is unchanged, preserved from v5.
>
>
> Changes since v4:
> =================
> 1) Rebased to mm-unstable as of 12-20-2024, commit 5555a83c82d6.
> 2) Added acomp request chaining, as suggested by Herbert. Thanks Herbert!
> 3) Implemented IAA compress batching using request chaining.
> 4) zswap_store() batching simplifications suggested by Chengming, Yosry and
>    Nhat, thanks to all!
>    - New zswap_compress_folio() that is called by zswap_store().
>    - Move the loop over the folio's pages out of zswap_store() and into a
>      zswap_store_folio() that stores all pages.
>    - Allocate all zswap entries for the folio upfront.
>    - Added zswap_batch_compress().
>    - Branch to call zswap_compress() or zswap_batch_compress() inside
>      zswap_compress_folio().
>    - All iterations over pages kept at the same function level.
>    - No helpers other than the newly added zswap_store_folio() and
>      zswap_compress_folio().
>
>
> Changes since v3:
> =================
> 1) Rebased to mm-unstable as of 11-18-2024, commit 5a7056135bb6.
> 2) Major re-write of the iaa_crypto driver's mapping of IAA devices to
>    cores, based on packages instead of NUMA nodes.
> 3) Added an acomp_has_async_batching() API to crypto acomp, that allows
>    zswap/zram to query if a crypto_acomp has registered batch_compress and
>    batch_decompress interfaces.
> 4) Clear the poll bits on the acomp_reqs passed to
>    iaa_comp_a[de]compress_batch() so that a module like zswap can be
>    confident about the acomp_reqs[0] not having the poll bit set before
>    calling the fully synchronous API crypto_acomp_[de]compress().
>    Herbert, I would appreciate it if you can review changes 2-4, in patches
>    1-8 in v4. I did not want to introduce too many iaa_crypto changes in
>    v4, given that patch 7 is already making a major change. I plan to work
>    on incorporating the request chaining using the ahash interface in v5
>    (I need to understand the basic crypto ahash better). Thanks Herbert!
> 5) Incorporated Johannes' suggestion to not have a sysctl to enable
>    compress batching.
> 6) Incorporated Yosry's suggestion to allocate batching resources in the
>    cpu hotplug onlining code, since there is no longer a sysctl to control
>    batching. Thanks Yosry!
> 7) Incorporated Johannes' suggestions related to making the overall
>    sequence of events between zswap_store() and zswap_batch_store() as
>    similar as possible for readability and control flow, better naming of
>    procedures, avoiding forward declarations, not inlining error path
>    procedures, deleting zswap internal details from zswap.h, etc. Thanks
>    Johannes, really appreciate the direction!
>    I have tried to explain the minimal future-proofing in terms of the
>    zswap_batch_store() signature and the definition of "struct
>    zswap_batch_store_sub_batch" in the comments for this struct. I hope the
>    new code explains the control flow a bit better.
>
>
> Changes since v2:
> =================
> 1) Rebased to mm-unstable as of 11-5-2024, commit 7994b7ea6ac8.
> 2) Fixed an issue in zswap_create_acomp_ctx() with checking for NULL
>    returned by kmalloc_node() for acomp_ctx->buffers and for
>    acomp_ctx->reqs.
> 3) Fixed a bug in zswap_pool_can_batch() for returning true if
>    pool->can_batch_comp is found to be equal to BATCH_COMP_ENABLED, and if
>    the per-cpu acomp_batch_ctx tests true for batching resources having
>    been allocated on this cpu. Also, changed from per_cpu_ptr() to
>    raw_cpu_ptr().
> 4) Incorporated the zswap_store_propagate_errors() compilation warning fix
>    suggested by Dan Carpenter. Thanks Dan!
> 5) Replaced the references to SWAP_CRYPTO_SUB_BATCH_SIZE in comments in
>    zswap.h, with SWAP_CRYPTO_BATCH_SIZE.
>
> Changes since v1:
> =================
> 1) Rebased to mm-unstable as of 11-1-2024, commit 5c4cf96cd702.
> 2) Incorporated Herbert's suggestions to use an acomp_req flag to indicate
>    async/poll mode, and to encapsulate the polling functionality in the
>    iaa_crypto driver. Thanks Herbert!
> 3) Incorporated Herbert's and Yosry's suggestions to implement the batching
>    API in iaa_crypto and to make its use seamless from zswap's perspective.
>    Thanks Herbert and Yosry!
> 4) Incorporated Yosry's suggestion to make it more convenient for the user
>    to enable compress batching, while minimizing the memory footprint cost.
>    Thanks Yosry!
> 5) Incorporated Yosry's suggestion to de-couple the shrink_folio_list()
>    reclaim batching patch from this series, since it requires a broader
>    discussion.
>
>
> IAA configuration script "enable_iaa.sh":
> =========================================
>
> Acknowledgements: Binuraj Ravindran and Rakib Al-Fahad.
>
> Usage:
> ------
>
>   ./enable_iaa.sh -d <num_devices_per_socket> -q <num_wqs_per_device>
>
>
> #--------------------------------------------------------------------
> #!/usr/bin/env bash
> # SPDX-License-Identifier: BSD-3-Clause
> # Copyright (c) 2025, Intel Corporation
> # Description: Configure IAA devices
>
> VERIFY_COMPRESS_PATH="/sys/bus/dsa/drivers/crypto/verify_compress"
>
> iax_dev_id="0cfe"
> num_iaa=$(lspci -d:${iax_dev_id} | wc -l)
> sockets=$(lscpu | grep Socket | awk '{print $2}')
> echo "Found ${num_iaa} instances in ${sockets} socket(s)"
>
> # The same number of devices will be configured in each socket, if there
> # is more than one socket.
> # Normalize with respect to the number of sockets.
> device_num_per_socket=$(( num_iaa/sockets ))
> num_iaa_per_socket=$(( num_iaa / sockets ))
>
> iaa_wqs=2
> verbose=0
> iaa_engines=8
> mode="dedicated"
> wq_type="kernel"
> iaa_crypto_mode="async"
> verify_compress=0
>
>
> # Function to handle errors
> handle_error() {
>     echo "Error: $1"
>     exit 1
> }
>
> # Process arguments
>
> while getopts "d:hm:q:vD" opt; do
>     case $opt in
>     d)
>         device_num_per_socket=$OPTARG
>         ;;
>     m)
>         iaa_crypto_mode=$OPTARG
>         ;;
>     q)
>         iaa_wqs=$OPTARG
>         ;;
>     D)
>         verbose=1
>         ;;
>     v)
>         verify_compress=1
>         ;;
>     h)
>         echo "Usage: $0 [-d <num_devices>] [-q <num_wqs>] [-m <mode>] [-v] [-D]"
>         echo "  -d - number of devices per socket"
>         echo "  -q - number of WQs per device"
>         echo "  -m - iaa_crypto sync_mode"
>         echo "  -v - enable verify_compress"
>         echo "  -D - verbose mode"
>         echo "  -h - help"
>         exit
>         ;;
>     \?)
>         echo "Invalid option: -$OPTARG" >&2
>         exit
>         ;;
>     esac
> done
>
> LOG="configure_iaa.log"
>
> # Update wq_size based on the number of wqs
> wq_size=$(( 128 / iaa_wqs ))
>
> # Take care of the enumeration, if DSA is enabled.
> dsa=`lspci | grep -c 0b25`
> # Set first,step counters to correctly enumerate iax devices based on
> # whether we are running on a guest or a host, with or without DSA.
> first=0
> step=1
> [[ $dsa -gt 0 && -d /sys/bus/dsa/devices/dsa0 ]] && first=1 && step=2
> echo "first index: ${first}, step: ${step}"
>
>
> #
> # Switch to software compressors and disable IAAs to have a clean start.
> #
> COMPRESSOR=/sys/module/zswap/parameters/compressor
> last_comp=`cat ${COMPRESSOR}`
> echo lzo > ${COMPRESSOR}
>
> echo "Disable IAA devices before configuring"
>
> for ((i = ${first}; i < ${step} * ${num_iaa}; i += ${step})); do
>     for ((j = 0; j < ${iaa_wqs}; j += 1)); do
>         cmd="accel-config disable-wq iax${i}/wq${i}.${j} >& /dev/null"
>         [[ $verbose == 1 ]] && echo $cmd; eval $cmd
>     done
>     cmd="accel-config disable-device iax${i} >& /dev/null"
>     [[ $verbose == 1 ]] && echo $cmd; eval $cmd
> done
>
> rmmod iaa_crypto
> modprobe iaa_crypto
>
> # Apply crypto parameters
> echo $verify_compress > ${VERIFY_COMPRESS_PATH} || handle_error "did not change verify_compress"
>
> # Note: This is a temporary solution during the kernel transition.
> if [ -f /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa ]; then
>     echo 1 > /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa || handle_error "did not set g_comp_wqs_per_iaa"
> elif [ -f /sys/bus/dsa/drivers/crypto/g_wqs_per_iaa ]; then
>     echo 1 > /sys/bus/dsa/drivers/crypto/g_wqs_per_iaa || handle_error "did not set g_wqs_per_iaa"
> fi
> if [ -f /sys/bus/dsa/drivers/crypto/g_consec_descs_per_gwq ]; then
>     echo 1 > /sys/bus/dsa/drivers/crypto/g_consec_descs_per_gwq || handle_error "did not set g_consec_descs_per_gwq"
> fi
> echo ${iaa_crypto_mode} > /sys/bus/dsa/drivers/crypto/sync_mode || handle_error "could not set sync_mode"
>
>
> echo "Configuring ${device_num_per_socket} device(s) out of ${num_iaa_per_socket} per socket"
> if [ "${device_num_per_socket}" -le "${num_iaa_per_socket}" ]; then
>     echo "Configuring all devices"
>     start=${first}
>     end=$(( ${step} * ${device_num_per_socket} ))
> else
>     echo "ERROR: Not enough devices"
>     exit
> fi
>
>
> #
> # Enable all iax devices and wqs
> #
> for (( socket = 0; socket < ${sockets}; socket += 1 )); do
>     for ((i = ${start}; i < ${end}; i += ${step})); do
>
>         echo "Configuring iaa$i on socket ${socket}"
>
>         for ((j = 0; j < ${iaa_engines}; j += 1)); do
>             cmd="accel-config config-engine iax${i}/engine${i}.${j} --group-id=0"
>             [[ $verbose == 1 ]] && echo $cmd; eval $cmd
>         done
>
>         # Config WQs
>         for ((j = 0; j < ${iaa_wqs}; j += 1)); do
>             # Config WQ: group 0, priority=10, mode=shared, type=kernel,
>             # name=kernel, driver_name=crypto
>             cmd="accel-config config-wq iax${i}/wq${i}.${j} -g 0 -s ${wq_size} -p 10 -m ${mode} -y ${wq_type} -n iaa_crypto${i}${j} -d crypto"
>             [[ $verbose == 1 ]] && echo $cmd; eval $cmd
>         done
>
>         # Enable device and WQs
>         cmd="accel-config enable-device iax${i}"
>         [[ $verbose == 1 ]] && echo $cmd; eval $cmd
>
>         for ((j = 0; j < ${iaa_wqs}; j += 1)); do
>             cmd="accel-config enable-wq iax${i}/wq${i}.${j}"
>             [[ $verbose == 1 ]] && echo $cmd; eval $cmd
>         done
>
>     done
>     start=$(( start + ${step} * ${num_iaa_per_socket} ))
>     end=$(( start + (${step} * ${device_num_per_socket}) ))
> done
>
> # Restore the last compressor
> echo "$last_comp" > ${COMPRESSOR}
>
> # Check if the configuration is correct
> echo "Configured IAA devices:"
> accel-config list | grep iax
>
> #--------------------------------------------------------------------
>
>
> I would greatly appreciate code review comments for the iaa_crypto driver
> and mm patches included in this series!
>
> Thanks,
> Kanchana
>
>
>
> Kanchana P Sridhar (22):
>   crypto: iaa - Reorganize the iaa_crypto driver code.
>   crypto: iaa - New architecture for IAA device WQ comp/decomp usage &
>     core mapping.
>   crypto: iaa - Simplify, consistency of function parameters, minor
>     stats bug fix.
>   crypto: iaa - Descriptor allocation timeouts with mitigations.
>   crypto: iaa - iaa_wq uses percpu_refs for get/put reference counting.
>   crypto: iaa - Simplify the code flow in iaa_compress() and
>     iaa_decompress().
>   crypto: iaa - Refactor hardware descriptor setup into separate
>     procedures.
>   crypto: iaa - Simplified, efficient job submissions for non-irq mode.
>   crypto: iaa - Deprecate exporting add/remove IAA compression modes.
>   crypto: iaa - Expect a single scatterlist for a [de]compress request's
>     src/dst.
>   crypto: iaa - Rearchitect iaa_crypto to have clean interfaces with
>     crypto_acomp.
>   crypto: acomp - Define a unit_size in struct acomp_req to enable
>     batching.
>   crypto: iaa - IAA Batching for parallel compressions/decompressions.
>   crypto: iaa - Enable async mode and make it the default.
>   crypto: iaa - Disable iaa_verify_compress by default.
>   crypto: iaa - Submit the two largest source buffers first in
>     decompress batching.
>   crypto: iaa - Add deflate-iaa-dynamic compression mode.
>   crypto: acomp - Add crypto_acomp_batch_size() to get an algorithm's
>     batch-size.
>   mm: zswap: Per-CPU acomp_ctx resources exist from pool creation to
>     deletion.
>   mm: zswap: Consistently use IS_ERR_OR_NULL() to check acomp_ctx
>     resources.
>   mm: zswap: zswap_store() will process a large folio in batches.
>   mm: zswap: Batched zswap_compress() with compress batching of large
>     folios.
>
>  .../driver-api/crypto/iaa/iaa-crypto.rst      |  168 +-
>  crypto/acompress.c                            |   14 +
>  crypto/testmgr.c                              |   10 +
>  crypto/testmgr.h                              |   74 +
>  drivers/crypto/intel/iaa/Makefile             |    4 +-
>  drivers/crypto/intel/iaa/iaa_crypto.h         |   87 +-
>  .../intel/iaa/iaa_crypto_comp_dynamic.c       |   22 +
>  drivers/crypto/intel/iaa/iaa_crypto_main.c    | 2836 ++++++++++++-----
>  drivers/crypto/intel/iaa/iaa_crypto_stats.c   |    8 +
>  drivers/crypto/intel/iaa/iaa_crypto_stats.h   |    2 +
>  include/crypto/acompress.h                    |   48 +
>  include/crypto/internal/acompress.h           |    3 +
>  mm/zswap.c                                    |  700 ++--
>  13 files changed, 2905 insertions(+), 1071 deletions(-)
>  create mode 100644 drivers/crypto/intel/iaa/iaa_crypto_comp_dynamic.c
>
> --
> 2.27.0