[PATCH v4 00/10] zswap IAA compress batching

linux-mm.kvack.org archive mirror

* [PATCH v4 00/10] zswap IAA compress batching
@ 2024-11-23  7:01 Kanchana P Sridhar
  2024-11-23  7:01 ` [PATCH v4 01/10] crypto: acomp - Define two new interfaces for compress/decompress batching Kanchana P Sridhar
                   ` (9 more replies)
  0 siblings, 10 replies; 39+ messages in thread
From: Kanchana P Sridhar @ 2024-11-23  7:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

IAA Compression Batching:
=========================

This patch-series introduces the use of the Intel Analytics Accelerator
(IAA) for parallel batch compression of pages in large folios to improve
zswap swapout latency, resulting in sys time reduction by 41% (usemem 30
processes) and by 24% (kernel compilation); as well as a 39% increase in
usemem30 throughput with IAA batching as compared to zstd.

The patch-series is organized as follows:

 1) crypto acomp & iaa_crypto driver enablers for batching: Relevant
    patches are tagged with "crypto:" in the subject:

    Patch 1) Adds acomp_alg/crypto_acomp batch_compress() and
             batch_decompress() interfaces, that swap modules can invoke
             using the new batching API crypto_acomp_batch_compress() and
             crypto_acomp_batch_decompress(). Additionally, crypto acomp
             provides a new acomp_has_async_batching() interface to query
             for these APIs before allocating batching resources for a given
             compressor in zswap/zram.
    Patch 2) New CRYPTO_ACOMP_REQ_POLL acomp_req flag to act as a gate for
             async poll mode in iaa_crypto.
    Patch 3) iaa-crypto driver implementations for async polling,
             crypto_acomp_batch_compress() and crypto_acomp_batch_decompress().
             The "iaa_acomp_fixed_deflate" algorithm registers these
             implementations for its batch_compress and batch_decompress
             interfaces respectively.
    Patch 4) Modifies the default iaa_crypto driver mode to async.
    Patch 5) Disables verify_compress by default, so that users can easily
             run IAA for comparison with software compressors.
    Patch 6) Reorganizes the iaa_crypto driver code into logically related
             sections and avoids forward declarations, in order to facilitate
             Patch 7. This patch makes no functional changes.
    Patch 7) Makes a major infrastructure change in the iaa_crypto driver,
             to map IAA devices/work-queues to cores based on packages
             instead of NUMA nodes. This doesn't impact performance on
             the Sapphire Rapids system used for performance
             testing. However, this change fixes problems found on Granite
             Rapids in internal validation, where the number of NUMA nodes
             is greater than the number of packages, which was resulting in
             over-utilization of some IAA devices and non-usage of other
             IAA devices as per the current NUMA based mapping
             infrastructure. This patch also eliminates duplication of device
             wqs in per-cpu wq_tables, thereby saving 140MiB on a 384-core
             Granite Rapids server with 8 IAAs. Submitting this change now
             so that it can go through code reviews before it can be merged.
    Patch 8) Builds upon the new infrastructure for mapping IAAs to cores
             based on packages, and enables configuring a "global_wq" per
             IAA, which can be used as a global resource for compress jobs
             for the package. If the user configures 2WQs per IAA device,
             the driver will distribute compress jobs from all cores on the
             package to the "global_wqs" of all the IAA devices on that
             package, in a round-robin manner. This can be used to improve
             compression throughput for workloads that see a lot of swapout
             activity.

 2) zswap modifications to enable compress batching in zswap_batch_store()
    of large folios (including pmd-mappable folios):

    Patch 9) Changes the "struct crypto_acomp_ctx" to contain a configurable
             number of acomp_reqs and buffers. Subsequently, the cpu
             hotplug onlining code will query acomp_has_async_batching() to
             allocate up to SWAP_CRYPTO_BATCH_SIZE (i.e. 8)
             acomp_reqs/buffers if the acomp supports batching, and 1
             acomp_req/buffer if not.
    Patch 10) zswap_batch_store() IAA compress batching implementation
              using the new crypto_acomp_batch_compress() iaa_crypto driver
              API. swap_writepage() will call zswap_batch_store() for large
              folios if zswap_can_batch().
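
The resource sizing in Patch 9 and the batching in Patch 10 can be modelled
in user space roughly as follows. This is only a sketch: the mock_* names are
hypothetical stand-ins rather than the kernel structures, and it assumes that
a folio is simply processed in consecutive groups of at most
SWAP_CRYPTO_BATCH_SIZE pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the cover letter's SWAP_CRYPTO_BATCH_SIZE (i.e. 8). */
#define SWAP_CRYPTO_BATCH_SIZE 8

/* Hypothetical stand-in for a compressor; not the kernel's crypto_acomp. */
struct mock_compressor {
	bool is_async;
	bool has_batch_compress;
	bool has_batch_decompress;
};

/* Models the acomp_has_async_batching() query described in Patch 1. */
static bool mock_has_async_batching(const struct mock_compressor *c)
{
	return c->is_async && c->has_batch_compress && c->has_batch_decompress;
}

/* Number of reqs/buffers to allocate per CPU: 8 with batching, else 1. */
static size_t mock_nr_reqs(const struct mock_compressor *c)
{
	return mock_has_async_batching(c) ? SWAP_CRYPTO_BATCH_SIZE : 1;
}

/*
 * Number of compress calls needed for a folio of nr_pages pages when each
 * call handles at most batch pages (ceiling division).
 */
static size_t mock_nr_compress_calls(size_t nr_pages, size_t batch)
{
	assert(batch > 0);
	return (nr_pages + batch - 1) / batch;
}
```

Under these assumptions, and with 4K base pages, a 64k folio (16 pages) needs
2 calls with a batching compressor and 16 calls without one.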

With v4 of this patch series, the IAA compress batching feature will be
enabled seamlessly on Intel platforms that have IAA by selecting
'deflate-iaa' as the zswap compressor.

System setup for testing:
=========================
Testing of this patch-series was done with mm-unstable as of 11-18-2024,
commit 5a7056135bb6, without and with this patch-series.
Data was gathered on an Intel Sapphire Rapids server, dual-socket 56 cores
per socket, 4 IAA devices per socket, 503 GiB RAM and 525G SSD disk
partition swap. Core frequency was fixed at 2500MHz.

Other kernel configuration parameters:

    zswap compressor  : zstd, deflate-iaa
    zswap allocator   : zsmalloc
    vm.page-cluster   : 0, 2

IAA "compression verification" is disabled and IAA is run in the async
poll mode (the defaults with this series). 2WQs are configured per IAA
device. Compress jobs from all cores on a socket are distributed among all
4 IAA devices on the same socket.
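
The round-robin distribution described above can be illustrated with a small
user-space model. It is a sketch only: struct mock_package and the mock_*
helpers are hypothetical, the model is single-threaded, and the driver's
actual data structures and locking are not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Matches the 4 IAA devices per socket in this test setup. */
#define MOCK_IAAS_PER_PACKAGE 4

/* Hypothetical per-package state; the real driver's layout differs. */
struct mock_package {
	unsigned int next;                          /* device for the next job */
	unsigned long jobs[MOCK_IAAS_PER_PACKAGE];  /* compress jobs per device */
};

/* Pick the device whose global wq receives the next compress job. */
static unsigned int mock_pick_global_wq(struct mock_package *pkg)
{
	unsigned int dev;

	assert(pkg != NULL);
	dev = pkg->next;
	pkg->next = (pkg->next + 1) % MOCK_IAAS_PER_PACKAGE;
	pkg->jobs[dev]++;
	return dev;
}

/* Submit nr_jobs compress jobs originating from any core on the package. */
static void mock_submit(struct mock_package *pkg, unsigned long nr_jobs)
{
	for (unsigned long i = 0; i < nr_jobs; i++)
		mock_pick_global_wq(pkg);
}
```

In this model the per-device job counts never differ by more than one, which
is the behaviour one would expect to see reflected in nearly equal compress
call counts across the IAA devices on a socket.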

I ran experiments with these workloads:

1) usemem 30 processes with these large folios enabled to "always":
   - 16k/32k/64k
   - 2048k

2) Kernel compilation allmodconfig with 2G max memory, 32 threads, run in
   tmpfs with these large folios enabled to "always":
   - 16k/32k/64k

Performance testing (usemem30):
===============================
The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
processes were run, each allocating and writing 10G of memory, and sleeping
for 10 sec before exiting:

usemem --init-time -w -O -s 10 -n 30 10g

 16k/32k/64k folios: usemem30:
 =============================

 -------------------------------------------------------------------------------
                            mm-unstable-11-18-2024   v4 of this patch-series
 -------------------------------------------------------------------------------
 zswap compressor               zstd   deflate-iaa   deflate-iaa   IAA Batching          
 vm.page-cluster                   2             2             2    vs.     vs.
                                                                    Seq    zstd
 -------------------------------------------------------------------------------
 Total throughput (KB/s)   6,284,634     7,149,906     8,392,830    17%     39%
 Avg throughput (KB/s)       209,487       238,330       279,761    17%     39%
 elapsed time (sec)           107.64         84.38         79.88    -5%    -29%
 sys time (sec)             2,566.69      1,844.32      1,592.02   -14%    -41%

 -------------------------------------------------------------------------------
 memcg_high                  477,219       616,897       683,170      
 memcg_swap_fail               1,040         2,734         2,330      
 zswpout                  48,931,670    55,520,017    57,467,487      
 zswpin                          384           491           415      
 pswpout                           0             0             0      
 pswpin                            0             0             0      
 thp_swpout                        0             0             0      
 thp_swpout_fallback               0             0             0      
 16kB_swpout_fallback              0             0             0
 32kB_swpout_fallback              0             0             0      
 64kB_swpout_fallback          1,040         2,734         2,330      
 pgmajfault                    3,258         3,314         3,251      
 swap_ra                          95           128           112      
 swap_ra_hit                      46            49            61      
 ZSWPOUT-16kB                      2             4             3      
 ZSWPOUT-32kB                      0             2             0      
 ZSWPOUT-64kB              3,057,203     3,467,400     3,589,487      
 SWPOUT-16kB                       0             0             0      
 SWPOUT-32kB                       0             0             0      
 SWPOUT-64kB                       0             0             0      
 -------------------------------------------------------------------------------

 2M folios: usemem30:
 ====================

 -------------------------------------------------------------------------------
                            mm-unstable-11-18-2024   v4 of this patch-series
 -------------------------------------------------------------------------------
 zswap compressor               zstd   deflate-iaa   deflate-iaa   IAA Batching          
 vm.page-cluster                   2             2             2     vs.     vs.
                                                                     Seq    zstd
 -------------------------------------------------------------------------------
 Total throughput (KB/s)   6,466,700     7,245,936     9,107,731     26%     39%     
 Avg throughput (KB/s)       215,556       241,531       303,591     26%     39%     
 elapsed time (sec)           106.80         84.44         74.37    -12%    -30%     
 sys time (sec)             2,420.88      1,753.41      1,450.21    -17%    -41%     

 -------------------------------------------------------------------------------
 memcg_high                   60,926        79,259        90,314        
 memcg_swap_fail                  44           139           182        
 zswpout                  48,892,828    57,701,156    59,051,023        
 zswpin                          391           419           411        
 pswpout                           0             0             0        
 pswpin                            0             0             0        
 thp_swpout                        0             0             0        
 thp_swpout_fallback              44           139           182        
 pgmajfault                    4,907        11,542        30,492        
 swap_ra                       5,070        24,613        80,933        
 swap_ra_hit                   5,024        24,555        80,856        
 ZSWPOUT-2048kB               95,442       112,515       114,996        
 SWPOUT-2048kB                     0             0             0        
 -------------------------------------------------------------------------------

Performance testing (Kernel compilation, allmodconfig):
=======================================================

The kernel compilation experiments (32 threads, in tmpfs) use
"allmodconfig", which takes ~12 minutes and has considerable swapout
activity. The cgroup's memory.max is set to 2G.

 16k/32k/64k folios: Kernel compilation/allmodconfig:
 ====================================================

 -------------------------------------------------------------------------------
                           mm-unstable-11-18-2024    v4 of this patch-series
 -------------------------------------------------------------------------------
 zswap compressor             zstd    deflate-iaa    deflate-iaa   IAA Batching          
 vm.page-cluster                 0              0              0     vs.     vs.
                                                                    Seq    zstd
 -------------------------------------------------------------------------------
 real_sec                   783.15         792.78         789.65
 user_sec                15,763.86      15,779.60      15,775.48
 sys_sec                  5,198.29       4,215.74       3,930.92    -7%    -24%
 -------------------------------------------------------------------------------
 Max_Res_Set_Size_KB     1,872,932      1,873,444      1,872,896
 -------------------------------------------------------------------------------
 memcg_high                      0              0              0
 memcg_swap_fail                 0              0              0
 zswpout                88,824,270    109,828,718    109,402,157
 zswpin                 25,371,781     32,647,096     32,174,520
 pswpout                       121            360            297
 pswpin                        122            337            288
 thp_swpout                      0              0              0
 thp_swpout_fallback             0              0              0
 16kB_swpout_fallback            0              0              0                         
 32kB_swpout_fallback            0              0              0
 64kB_swpout_fallback          924         19,203          5,206
 pgmajfault             27,124,258     35,120,147     34,545,319
 swap_ra                         0              0              0
 swap_ra_hit                 2,561          3,131          2,380
 ZSWPOUT-16kB            1,246,641      1,499,293      1,469,160
 ZSWPOUT-32kB              675,242        865,310        827,968
 ZSWPOUT-64kB            2,886,860      3,596,899      3,638,188
 SWPOUT-16kB                     0              0              0
 SWPOUT-32kB                     1              0              0
 SWPOUT-64kB                     7             19             18
 -------------------------------------------------------------------------------
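
For reference, the percentages in the sys_sec row above can be reproduced as
a relative change against the baseline column. The helper below is a minimal
sketch; pct_change() is a name chosen for this example and is not part of the
series.

```c
#include <assert.h>

/* Relative change of value against baseline, in percent. */
static double pct_change(double baseline, double value)
{
	assert(baseline != 0.0);
	return (value - baseline) / baseline * 100.0;
}
```

With the sys_sec values from the table, pct_change(4215.74, 3930.92) is about
-6.8 (reported as -7% vs. sequential deflate-iaa) and
pct_change(5198.29, 3930.92) is about -24.4 (reported as -24% vs. zstd).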

Summary:
========
The performance testing data with usemem 30 processes and kernel
compilation test show 39% throughput gains and 41% sys time reduction
(usemem30) and 24% sys time reduction (kernel compilation) with
zswap_batch_store() large folios using IAA compress batching as compared to
zstd.

The iaa_crypto wq stats will show almost the same number of compress calls
for wq.1 of all IAA devices. wq.0 will handle decompress calls exclusively.
We see a latency reduction of 2.5% by distributing compress jobs among all
IAA devices on the socket (based on v1 data).

We can expect to see even more significant performance and throughput
improvements if we use the parallelism offered by IAA to do reclaim
batching of 4K/large folios (really any-order folios), and using the
zswap_batch_store() high throughput compression to batch-compress pages
comprising these folios, not just batching within large folios. This is the
reclaim batching patch 13 in v1, which will be submitted in a separate
patch-series.

Our internal validation of IAA compress/decompress batching in highly
contended Sapphire Rapids server setups with workloads running on 72 cores
for ~25 minutes under stringent memory limit constraints has shown up to
50% reduction in sys time and 3.5% reduction in workload run time as
compared to software compressors.

Changes since v3:
=================
1) Rebased to mm-unstable as of 11-18-2024, commit 5a7056135bb6.
2) Major re-write of iaa_crypto driver's mapping of IAA devices to cores,
   based on packages instead of NUMA nodes.
3) Added acomp_has_async_batching() API to crypto acomp, that allows
   zswap/zram to query if a crypto_acomp has registered batch_compress and
   batch_decompress interfaces.
4) Clear the poll bits on the acomp_reqs passed to
   iaa_comp_a[de]compress_batch() so that a module like zswap can be
   confident about the acomp_reqs[0] not having the poll bit set before
   calling the fully synchronous API crypto_acomp_[de]compress().
   Herbert, I would appreciate it if you can review changes 2-4, in patches
   1-8 in v4. I did not want to introduce too many iaa_crypto changes in
   v4, given that patch 7 is already making a major change. I plan to work
   on incorporating the request chaining using the ahash interface in v5
   (I need to understand the basic crypto ahash better). Thanks Herbert!
5) Incorporated Johannes' suggestion to not have a sysctl to enable
   compress batching.
6) Incorporated Yosry's suggestion to allocate batching resources in the
   cpu hotplug onlining code, since there is no longer a sysctl to control
   batching. Thanks Yosry!
7) Incorporated Johannes' suggestions related to making the overall
   sequence of events between zswap_store() and zswap_batch_store() similar
   as much as possible for readability and control flow, better naming of
   procedures, avoiding forward declarations, not inlining error path
   procedures, deleting zswap internal details from zswap.h, etc. Thanks
   Johannes, really appreciate the direction!
   I have tried to explain the minimal future-proofing in terms of the
   zswap_batch_store() signature and the definition of "struct
   zswap_batch_store_sub_batch" in the comments for this struct. I hope the
   new code explains the control flow a bit better.

Changes since v2:
=================
1) Rebased to mm-unstable as of 11-5-2024, commit 7994b7ea6ac8.
2) Fixed an issue in zswap_create_acomp_ctx() with checking for NULL
   returned by kmalloc_node() for acomp_ctx->buffers and for
   acomp_ctx->reqs.
3) Fixed a bug in zswap_pool_can_batch() for returning true if
   pool->can_batch_comp is found to be equal to BATCH_COMP_ENABLED, and if
   the per-cpu acomp_batch_ctx tests true for batching resources having
   been allocated on this cpu. Also, changed from per_cpu_ptr() to
   raw_cpu_ptr().
4) Incorporated the zswap_store_propagate_errors() compilation warning fix
   suggested by Dan Carpenter. Thanks Dan!
5) Replaced the references to SWAP_CRYPTO_SUB_BATCH_SIZE in comments in
   zswap.h, with SWAP_CRYPTO_BATCH_SIZE.

Changes since v1:
=================
1) Rebased to mm-unstable as of 11-1-2024, commit 5c4cf96cd702.
2) Incorporated Herbert's suggestions to use an acomp_req flag to indicate
   async/poll mode, and to encapsulate the polling functionality in the
   iaa_crypto driver. Thanks Herbert!
3) Incorporated Herbert's and Yosry's suggestions to implement the batching
   API in iaa_crypto and to make its use seamless from zswap's
   perspective. Thanks Herbert and Yosry!
4) Incorporated Yosry's suggestion to make it more convenient for the user
   to enable compress batching, while minimizing the memory footprint
   cost. Thanks Yosry!
5) Incorporated Yosry's suggestion to de-couple the shrink_folio_list()
   reclaim batching patch from this series, since it requires a broader
   discussion.

I would greatly appreciate code review comments for the iaa_crypto driver
and mm patches included in this series!

Thanks,
Kanchana

Kanchana P Sridhar (10):
  crypto: acomp - Define two new interfaces for compress/decompress
    batching.
  crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable
    async mode.
  crypto: iaa - Implement batch_compress(), batch_decompress() API in
    iaa_crypto.
  crypto: iaa - Make async mode the default.
  crypto: iaa - Disable iaa_verify_compress by default.
  crypto: iaa - Re-organize the iaa_crypto driver code.
  crypto: iaa - Map IAA devices/wqs to cores based on packages instead
    of NUMA.
  crypto: iaa - Distribute compress jobs from all cores to all IAAs on a
    package.
  mm: zswap: Allocate pool batching resources if the crypto_alg supports
    batching.
  mm: zswap: Compress batching with Intel IAA in zswap_batch_store() of
    large folios.

 crypto/acompress.c                         |    2 +
 drivers/crypto/intel/iaa/iaa_crypto.h      |   18 +-
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 1637 ++++++++++++++------
 include/crypto/acompress.h                 |   96 ++
 include/crypto/internal/acompress.h        |   16 +
 include/linux/zswap.h                      |   19 +
 mm/page_io.c                               |   16 +-
 mm/zswap.c                                 |  759 ++++++++-
 8 files changed, 2090 insertions(+), 473 deletions(-)

base-commit: 5a7056135bb69da2ce0a42eb8c07968c1331777b
-- 
2.27.0


* [PATCH v4 01/10] crypto: acomp - Define two new interfaces for compress/decompress batching.
  2024-11-23  7:01 [PATCH v4 00/10] zswap IAA compress batching Kanchana P Sridhar
@ 2024-11-23  7:01 ` Kanchana P Sridhar
  2024-11-25  9:35   ` Herbert Xu
  2024-11-23  7:01 ` [PATCH v4 02/10] crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable async mode Kanchana P Sridhar
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 39+ messages in thread
From: Kanchana P Sridhar @ 2024-11-23  7:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This commit adds batch_compress() and batch_decompress() interfaces to:

  struct acomp_alg
  struct crypto_acomp

This allows the iaa_crypto Intel IAA driver to register implementations for
the batch_compress() and batch_decompress() API, that can subsequently be
invoked from the kernel zswap/zram swap modules to compress/decompress
up to CRYPTO_BATCH_SIZE (i.e. 8) pages in parallel in the IAA hardware
accelerator to improve swapout/swapin performance.

A new helper function acomp_has_async_batching() can be invoked to query
if a crypto_acomp has registered these batch_compress and batch_decompress
interfaces.
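
The calling convention, where one call covers several pages and reports a
status per page through parallel arrays, can be illustrated with the
user-space model below. It is a sketch under stated assumptions: struct
mock_tfm and the mock_* functions are hypothetical, the "compression" is
faked with a fixed 2:1 ratio, and none of this is the kernel API itself.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_BATCH_SIZE 8  /* mirrors CRYPTO_BATCH_SIZE in this patch */

/* Hypothetical transform; batch_compress is NULL without batching support. */
struct mock_tfm {
	void (*batch_compress)(const unsigned int slens[], unsigned int dlens[],
			       int errors[], int nr_pages);
};

/* Fake batch compressor: 2:1 ratio, error for empty input. */
static void mock_batch_compress(const unsigned int slens[],
				unsigned int dlens[], int errors[],
				int nr_pages)
{
	for (int i = 0; i < nr_pages; i++) {
		if (slens[i] == 0) {
			errors[i] = -22;  /* stands in for -EINVAL */
			dlens[i] = 0;
		} else {
			errors[i] = 0;
			dlens[i] = slens[i] / 2;
		}
	}
}

/*
 * Compress one batch. Returns the number of pages compressed successfully,
 * or -1 if the transform registered no batch_compress implementation.
 */
static int mock_compress_batch(const struct mock_tfm *tfm,
			       const unsigned int slens[],
			       unsigned int dlens[], int errors[],
			       int nr_pages)
{
	int ok = 0;

	assert(nr_pages > 0 && nr_pages <= MOCK_BATCH_SIZE);
	if (!tfm->batch_compress)
		return -1;

	tfm->batch_compress(slens, dlens, errors, nr_pages);
	for (int i = 0; i < nr_pages; i++)
		if (!errors[i])
			ok++;
	return ok;
}
```

A caller in this model checks errors[i] for each page individually, so one
failed page does not hide the results of the other pages in the batch.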

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 crypto/acompress.c                  |  2 +
 include/crypto/acompress.h          | 91 +++++++++++++++++++++++++++++
 include/crypto/internal/acompress.h | 16 +++++
 3 files changed, 109 insertions(+)

diff --git a/crypto/acompress.c b/crypto/acompress.c
index 6fdf0ff9f3c0..a506db499a37 100644
--- a/crypto/acompress.c
+++ b/crypto/acompress.c
@@ -71,6 +71,8 @@ static int crypto_acomp_init_tfm(struct crypto_tfm *tfm)
 
 	acomp->compress = alg->compress;
 	acomp->decompress = alg->decompress;
+	acomp->batch_compress = alg->batch_compress;
+	acomp->batch_decompress = alg->batch_decompress;
 	acomp->dst_free = alg->dst_free;
 	acomp->reqsize = alg->reqsize;
 
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index 54937b615239..4252bab3d0e1 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -37,12 +37,20 @@ struct acomp_req {
 	void *__ctx[] CRYPTO_MINALIGN_ATTR;
 };
 
+/*
+ * The max compress/decompress batch size, for crypto algorithms
+ * that support batch_compress and batch_decompress API.
+ */
+#define CRYPTO_BATCH_SIZE 8UL
+
 /**
  * struct crypto_acomp - user-instantiated objects which encapsulate
  * algorithms and core processing logic
  *
  * @compress:		Function performs a compress operation
  * @decompress:		Function performs a de-compress operation
+ * @batch_compress:	Function performs a batch compress operation
+ * @batch_decompress:	Function performs a batch decompress operation
  * @dst_free:		Frees destination buffer if allocated inside the
  *			algorithm
  * @reqsize:		Context size for (de)compression requests
@@ -51,6 +59,20 @@ struct acomp_req {
 struct crypto_acomp {
 	int (*compress)(struct acomp_req *req);
 	int (*decompress)(struct acomp_req *req);
+	void (*batch_compress)(struct acomp_req *reqs[],
+			       struct crypto_wait *wait,
+			       struct page *pages[],
+			       u8 *dsts[],
+			       unsigned int dlens[],
+			       int errors[],
+			       int nr_pages);
+	void (*batch_decompress)(struct acomp_req *reqs[],
+				 struct crypto_wait *wait,
+				 u8 *srcs[],
+				 struct page *pages[],
+				 unsigned int slens[],
+				 int errors[],
+				 int nr_pages);
 	void (*dst_free)(struct scatterlist *dst);
 	unsigned int reqsize;
 	struct crypto_tfm base;
@@ -142,6 +164,13 @@ static inline bool acomp_is_async(struct crypto_acomp *tfm)
 	       CRYPTO_ALG_ASYNC;
 }
 
+static inline bool acomp_has_async_batching(struct crypto_acomp *tfm)
+{
+	return (acomp_is_async(tfm) &&
+		(crypto_comp_alg_common(tfm)->base.cra_flags & CRYPTO_ALG_TYPE_ACOMPRESS) &&
+		tfm->batch_compress && tfm->batch_decompress);
+}
+
 static inline struct crypto_acomp *crypto_acomp_reqtfm(struct acomp_req *req)
 {
 	return __crypto_acomp_tfm(req->base.tfm);
@@ -265,4 +294,66 @@ static inline int crypto_acomp_decompress(struct acomp_req *req)
 	return crypto_acomp_reqtfm(req)->decompress(req);
 }
 
+/**
+ * crypto_acomp_batch_compress() -- Invoke asynchronous compress of
+ *                                  a batch of requests
+ *
+ * Function invokes the asynchronous batch compress operation
+ *
+ * @reqs: @nr_pages asynchronous compress requests.
+ * @wait: crypto_wait for synchronous acomp batch compress. If NULL, the
+ *        driver must provide a way to process completions asynchronously.
+ * @pages: Pages to be compressed.
+ * @dsts: Pre-allocated destination buffers to store results of compression.
+ * @dlens: Will contain the compressed lengths.
+ * @errors: zero on successful compression of the corresponding
+ *          req, or error code in case of error.
+ * @nr_pages: The number of pages, up to CRYPTO_BATCH_SIZE,
+ *            to be compressed.
+ */
+static inline void crypto_acomp_batch_compress(struct acomp_req *reqs[],
+					       struct crypto_wait *wait,
+					       struct page *pages[],
+					       u8 *dsts[],
+					       unsigned int dlens[],
+					       int errors[],
+					       int nr_pages)
+{
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(reqs[0]);
+
+	tfm->batch_compress(reqs, wait, pages, dsts,
+			    dlens, errors, nr_pages);
+}
+
+/**
+ * crypto_acomp_batch_decompress() -- Invoke asynchronous decompress of
+ *                                    a batch of requests
+ *
+ * Function invokes the asynchronous batch decompress operation
+ *
+ * @reqs: @nr_pages asynchronous decompress requests.
+ * @wait: crypto_wait for synchronous acomp batch decompress. If NULL, the
+ *        driver must provide a way to process completions asynchronously.
+ * @srcs: The src buffers to be decompressed.
+ * @pages: The pages to store the decompressed buffers.
+ * @slens: Compressed lengths of @srcs.
+ * @errors: zero on successful decompression of the corresponding
+ *          req, or error code in case of error.
+ * @nr_pages: The number of pages, up to CRYPTO_BATCH_SIZE,
+ *            to be decompressed.
+ */
+static inline void crypto_acomp_batch_decompress(struct acomp_req *reqs[],
+						 struct crypto_wait *wait,
+						 u8 *srcs[],
+						 struct page *pages[],
+						 unsigned int slens[],
+						 int errors[],
+						 int nr_pages)
+{
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(reqs[0]);
+
+	tfm->batch_decompress(reqs, wait, srcs, pages,
+			      slens, errors, nr_pages);
+}
+
 #endif
diff --git a/include/crypto/internal/acompress.h b/include/crypto/internal/acompress.h
index 8831edaafc05..acfe2d9d5a83 100644
--- a/include/crypto/internal/acompress.h
+++ b/include/crypto/internal/acompress.h
@@ -17,6 +17,8 @@
  *
  * @compress:	Function performs a compress operation
  * @decompress:	Function performs a de-compress operation
+ * @batch_compress:	Function performs a batch compress operation
+ * @batch_decompress:	Function performs a batch decompress operation
  * @dst_free:	Frees destination buffer if allocated inside the algorithm
  * @init:	Initialize the cryptographic transformation object.
  *		This function is used to initialize the cryptographic
@@ -37,6 +39,20 @@
 struct acomp_alg {
 	int (*compress)(struct acomp_req *req);
 	int (*decompress)(struct acomp_req *req);
+	void (*batch_compress)(struct acomp_req *reqs[],
+			       struct crypto_wait *wait,
+			       struct page *pages[],
+			       u8 *dsts[],
+			       unsigned int dlens[],
+			       int errors[],
+			       int nr_pages);
+	void (*batch_decompress)(struct acomp_req *reqs[],
+				 struct crypto_wait *wait,
+				 u8 *srcs[],
+				 struct page *pages[],
+				 unsigned int slens[],
+				 int errors[],
+				 int nr_pages);
 	void (*dst_free)(struct scatterlist *dst);
 	int (*init)(struct crypto_acomp *tfm);
 	void (*exit)(struct crypto_acomp *tfm);
-- 
2.27.0




* Re: [PATCH v4 01/10] crypto: acomp - Define two new interfaces for compress/decompress batching.
  2024-11-23  7:01 ` [PATCH v4 01/10] crypto: acomp - Define two new interfaces for compress/decompress batching Kanchana P Sridhar
@ 2024-11-25  9:35   ` Herbert Xu
  2024-11-25 20:03     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 39+ messages in thread
From: Herbert Xu @ 2024-11-25  9:35 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi, wajdi.k.feghali, vinodh.gopal

On Fri, Nov 22, 2024 at 11:01:18PM -0800, Kanchana P Sridhar wrote:
> This commit adds batch_compress() and batch_decompress() interfaces to:
> 
>   struct acomp_alg
>   struct crypto_acomp
> 
> This allows the iaa_crypto Intel IAA driver to register implementations for
> the batch_compress() and batch_decompress() API, that can subsequently be
> invoked from the kernel zswap/zram swap modules to compress/decompress
> up to CRYPTO_BATCH_SIZE (i.e. 8) pages in parallel in the IAA hardware
> accelerator to improve swapout/swapin performance.
> 
> A new helper function acomp_has_async_batching() can be invoked to query
> if a crypto_acomp has registered these batch_compress and batch_decompress
> interfaces.
> 
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  crypto/acompress.c                  |  2 +
>  include/crypto/acompress.h          | 91 +++++++++++++++++++++++++++++
>  include/crypto/internal/acompress.h | 16 +++++
>  3 files changed, 109 insertions(+)

This should be rebased on top of my request chaining patch:

https://lore.kernel.org/linux-crypto/677614fbdc70b31df2e26483c8d2cd1510c8af91.1730021644.git.herbert@gondor.apana.org.au/

Request chaining provides a perfect fit for batching.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt



* RE: [PATCH v4 01/10] crypto: acomp - Define two new interfaces for compress/decompress batching.
  2024-11-25  9:35   ` Herbert Xu
@ 2024-11-25 20:03     ` Sridhar, Kanchana P
  2024-11-26  2:13       ` Sridhar, Kanchana P
  0 siblings, 1 reply; 39+ messages in thread
From: Sridhar, Kanchana P @ 2024-11-25 20:03 UTC (permalink / raw)
  To: Herbert Xu
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P


> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Monday, November 25, 2024 1:35 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; ying.huang@intel.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 01/10] crypto: acomp - Define two new interfaces for
> compress/decompress batching.
> 
> On Fri, Nov 22, 2024 at 11:01:18PM -0800, Kanchana P Sridhar wrote:
> > This commit adds batch_compress() and batch_decompress() interfaces to:
> >
> >   struct acomp_alg
> >   struct crypto_acomp
> >
> > This allows the iaa_crypto Intel IAA driver to register implementations for
> > the batch_compress() and batch_decompress() API, that can subsequently be
> > invoked from the kernel zswap/zram swap modules to compress/decompress
> > up to CRYPTO_BATCH_SIZE (i.e. 8) pages in parallel in the IAA hardware
> > accelerator to improve swapout/swapin performance.
> >
> > A new helper function acomp_has_async_batching() can be invoked to query
> > if a crypto_acomp has registered these batch_compress and batch_decompress
> > interfaces.
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  crypto/acompress.c                  |  2 +
> >  include/crypto/acompress.h          | 91 +++++++++++++++++++++++++++++
> >  include/crypto/internal/acompress.h | 16 +++++
> >  3 files changed, 109 insertions(+)
> 
> This should be rebased on top of my request chaining patch:
> 
> https://lore.kernel.org/linux-crypto/677614fbdc70b31df2e26483c8d2cd1510c8af91.1730021644.git.herbert@gondor.apana.org.au/
> 
> Request chaining provides a perfect fit for batching.

Thanks Herbert. I am working on integrating the request chaining with
the iaa_crypto driver, expecting to have this ready for v5.

Thanks,
Kanchana

> 
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* RE: [PATCH v4 01/10] crypto: acomp - Define two new interfaces for compress/decompress batching.
  2024-11-25 20:03     ` Sridhar, Kanchana P
@ 2024-11-26  2:13       ` Sridhar, Kanchana P
  2024-11-26  2:14         ` Herbert Xu
  0 siblings, 1 reply; 39+ messages in thread
From: Sridhar, Kanchana P @ 2024-11-26  2:13 UTC (permalink / raw)
  To: Herbert Xu
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, davem, clabbe, ardb, ebiggers, surenb, Accardi,
	Kristen C, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P

Hi Herbert,

> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Monday, November 25, 2024 12:03 PM
> To: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; ying.huang@intel.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>; Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com>
> Subject: RE: [PATCH v4 01/10] crypto: acomp - Define two new interfaces for
> compress/decompress batching.
> 
> 
> > -----Original Message-----
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Sent: Monday, November 25, 2024 1:35 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > ryan.roberts@arm.com; ying.huang@intel.com; 21cnbao@gmail.com;
> > akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> > Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v4 01/10] crypto: acomp - Define two new interfaces for
> > compress/decompress batching.
> >
> > On Fri, Nov 22, 2024 at 11:01:18PM -0800, Kanchana P Sridhar wrote:
> > > This commit adds batch_compress() and batch_decompress() interfaces to:
> > >
> > >   struct acomp_alg
> > >   struct crypto_acomp
> > >
> > > This allows the iaa_crypto Intel IAA driver to register implementations for
> > > the batch_compress() and batch_decompress() API, that can subsequently be
> > > invoked from the kernel zswap/zram swap modules to compress/decompress
> > > up to CRYPTO_BATCH_SIZE (i.e. 8) pages in parallel in the IAA hardware
> > > accelerator to improve swapout/swapin performance.
> > >
> > > A new helper function acomp_has_async_batching() can be invoked to query
> > > if a crypto_acomp has registered these batch_compress and batch_decompress
> > > interfaces.
> > >
> > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > > ---
> > >  crypto/acompress.c                  |  2 +
> > >  include/crypto/acompress.h          | 91 +++++++++++++++++++++++++++++
> > >  include/crypto/internal/acompress.h | 16 +++++
> > >  3 files changed, 109 insertions(+)
> >
> > This should be rebased on top of my request chaining patch:
> >
> > https://lore.kernel.org/linux-crypto/677614fbdc70b31df2e26483c8d2cd1510c8af91.1730021644.git.herbert@gondor.apana.org.au/
> >
> > Request chaining provides a perfect fit for batching.

I wanted to make sure I understand your suggestion: Are you suggesting we
implement request chaining for "struct acomp_req" similar to how this is being
done for "struct ahash_request" in your patch?

I guess I was a bit confused by your comment about rebasing, which would
imply a direct use of the request chaining API you've provided for "crypto hash".
I would appreciate it if you could clarify.

Thanks,
Kanchana

> 
> Thanks Herbert. I am working on integrating the request chaining with
> the iaa_crypto driver, expecting to have this ready for v5.
> 
> Thanks,
> Kanchana
> 
> >
> > Cheers,
> > --
> > Email: Herbert Xu <herbert@gondor.apana.org.au>
> > Home Page: http://gondor.apana.org.au/~herbert/
> > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH v4 01/10] crypto: acomp - Define two new interfaces for compress/decompress batching.
  2024-11-26  2:13       ` Sridhar, Kanchana P
@ 2024-11-26  2:14         ` Herbert Xu
  2024-11-26  2:37           ` Sridhar, Kanchana P
  0 siblings, 1 reply; 39+ messages in thread
From: Herbert Xu @ 2024-11-26  2:14 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, davem, clabbe, ardb, ebiggers, surenb, Accardi,
	Kristen C, Feghali, Wajdi K, Gopal, Vinodh

On Tue, Nov 26, 2024 at 02:13:00AM +0000, Sridhar, Kanchana P wrote:
>
> I wanted to make sure I understand your suggestion: Are you suggesting we
> implement request chaining for "struct acomp_req" similar to how this is being
> done for "struct ahash_request" in your patch?
> 
> I guess I was a bit confused by your comment about rebasing, which would
> imply a direct use of the request chaining API you've provided for "crypto hash".
> I would appreciate it if you could clarify.

Yes I was referring to the generic part of request chaining,
and not rebasing acomp on top of ahash.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* RE: [PATCH v4 01/10] crypto: acomp - Define two new interfaces for compress/decompress batching.
  2024-11-26  2:14         ` Herbert Xu
@ 2024-11-26  2:37           ` Sridhar, Kanchana P
  2024-11-27  1:22             ` Sridhar, Kanchana P
  0 siblings, 1 reply; 39+ messages in thread
From: Sridhar, Kanchana P @ 2024-11-26  2:37 UTC (permalink / raw)
  To: Herbert Xu
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, davem, clabbe, ardb, ebiggers, surenb, Accardi,
	Kristen C, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Monday, November 25, 2024 6:14 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com; akpm@linux-foundation.org;
> linux-crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> Kristen C <kristen.c.accardi@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 01/10] crypto: acomp - Define two new interfaces for
> compress/decompress batching.
> 
> On Tue, Nov 26, 2024 at 02:13:00AM +0000, Sridhar, Kanchana P wrote:
> >
> > I wanted to make sure I understand your suggestion: Are you suggesting we
> > implement request chaining for "struct acomp_req" similar to how this is being
> > done for "struct ahash_request" in your patch?
> >
> > I guess I was a bit confused by your comment about rebasing, which would
> > imply a direct use of the request chaining API you've provided for "crypto hash".
> > I would appreciate it if you could clarify.
> 
> Yes I was referring to the generic part of request chaining,
> and not rebasing acomp on top of ahash.

Ok, thanks for the clarification! Would it be simpler if you could submit a
crypto_acomp request chaining patch that I can then use in iaa_crypto?
I would greatly appreciate this.

Thanks,
Kanchana


> 
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* RE: [PATCH v4 01/10] crypto: acomp - Define two new interfaces for compress/decompress batching.
  2024-11-26  2:37           ` Sridhar, Kanchana P
@ 2024-11-27  1:22             ` Sridhar, Kanchana P
  2024-11-27  5:04               ` Herbert Xu
  0 siblings, 1 reply; 39+ messages in thread
From: Sridhar, Kanchana P @ 2024-11-27  1:22 UTC (permalink / raw)
  To: Herbert Xu
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, davem, clabbe, ardb, ebiggers, surenb, Accardi,
	Kristen C, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Monday, November 25, 2024 6:37 PM
> To: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com; akpm@linux-foundation.org;
> linux-crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> Kristen C <kristen.c.accardi@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>;
> Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Subject: RE: [PATCH v4 01/10] crypto: acomp - Define two new interfaces for
> compress/decompress batching.
> 
> 
> > -----Original Message-----
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Sent: Monday, November 25, 2024 6:14 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > ryan.roberts@arm.com; 21cnbao@gmail.com; akpm@linux-foundation.org;
> > linux-crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> > ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> > Kristen C <kristen.c.accardi@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v4 01/10] crypto: acomp - Define two new interfaces for
> > compress/decompress batching.
> >
> > On Tue, Nov 26, 2024 at 02:13:00AM +0000, Sridhar, Kanchana P wrote:
> > >
> > > I wanted to make sure I understand your suggestion: Are you suggesting we
> > > implement request chaining for "struct acomp_req" similar to how this is being
> > > done for "struct ahash_request" in your patch?
> > >
> > > I guess I was a bit confused by your comment about rebasing, which would
> > > imply a direct use of the request chaining API you've provided for "crypto hash".
> > > I would appreciate it if you could clarify.
> >
> > Yes I was referring to the generic part of request chaining,
> > and not rebasing acomp on top of ahash.
> 
> Ok, thanks for the clarification! Would it be simpler if you could submit a
> crypto_acomp request chaining patch that I can then use in iaa_crypto?
> I would greatly appreciate this.

Hi Herbert,

I was able to take a more in-depth look at the request chaining you have
implemented in crypto ahash, and I think I have a good understanding of
what needs to be done in crypto acomp for request chaining. I will go ahead
and try to implement this and reach out if I have any questions.

I would be interested to know the performance impact if we kept the
crypto_wait based chaining of the requests (which makes the request chaining
synchronous IIUC), wrt the asynchronous polling that's currently used for
batching in the iaa_crypto driver. If you have any ideas on introducing
polling to the chaining concept, please do share, I would greatly appreciate
it.

Besides this, some questions that came up as far as applying request chaining
to crypto_acomp_batch_[de]compress were:

1) It appears a calling module like zswap would only be able to get 1 error
     status for all the requests that are chained, as against individual error
     statuses for each of the [de]compress jobs. Is this understanding correct?
2) The request chaining makes use of the req->base.complete and req->base.data,
     which are also used for internal data by the iaa_crypto driver. I can add another
     "void *data1" member to struct crypto_async_request to work around this,
     such that iaa_crypto uses "data1" instead of "data".

Please let me know if you have any suggestions. Also, if you have already begun
working on acomp request chaining, just let me know. I will wait for your patch
in this case (rather than implementing it myself).

Thanks,
Kanchana


> 
> Thanks,
> Kanchana
> 
> 
> >
> > Cheers,
> > --
> > Email: Herbert Xu <herbert@gondor.apana.org.au>
> > Home Page: http://gondor.apana.org.au/~herbert/
> > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH v4 01/10] crypto: acomp - Define two new interfaces for compress/decompress batching.
  2024-11-27  1:22             ` Sridhar, Kanchana P
@ 2024-11-27  5:04               ` Herbert Xu
  0 siblings, 0 replies; 39+ messages in thread
From: Herbert Xu @ 2024-11-27  5:04 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, davem, clabbe, ardb, ebiggers, surenb, Accardi,
	Kristen C, Feghali, Wajdi K, Gopal, Vinodh

On Wed, Nov 27, 2024 at 01:22:40AM +0000, Sridhar, Kanchana P wrote:
>
> 1) It appears a calling module like zswap would only be able to get 1 error
>      status for all the requests that are chained, as against individual error
>      statuses for each of the [de]compress jobs. Is this understanding correct?

No, each request gets its own error in req->base.err.
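
[Editorial sketch, not part of the original message.] One way to picture
per-request error reporting is a caller walking a chain after completion and
reading each request's own status. The code below is a user-space model under
stated assumptions: "struct sketch_req", its "next" pointer and
sketch_count_failures() are hypothetical names made up for illustration and
are not the kernel's request chaining API. Only the idea that the error lives
in req->base.err comes from the reply above.

```c
#include <stddef.h>

/* Hypothetical stand-in for the "base" part of a crypto request. */
struct sketch_base {
	int err;			/* 0 on success, negative errno on failure */
};

/* Hypothetical chained request; "next" models the chain link. */
struct sketch_req {
	struct sketch_base base;
	struct sketch_req *next;
};

/*
 * Walk the chain and count the requests that failed. Each request
 * carries its own status, so one failure does not hide the others.
 */
static int sketch_count_failures(const struct sketch_req *head)
{
	int failed = 0;

	for (const struct sketch_req *r = head; r; r = r->next)
		if (r->base.err)
			failed++;

	return failed;
}
```

A caller such as zswap could, under this model, inspect the status of each
page's request individually instead of treating the batch as all-or-nothing.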

> 2) The request chaining makes use of the req->base.complete and req->base.data,
>      which are also used for internal data by the iaa_crypto driver. I can add another
>      "void *data1" member to struct crypto_async_request to work around this,
>      such that iaa_crypto uses "data1" instead of "data".

These fields are meant for the user.  It's best not to use them
to store driver data, but if you really wanted to, then the API
ahash code provides an example of doing it (please only do this
as a last resort as it's rather fragile).

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* [PATCH v4 02/10] crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable async mode.
  2024-11-23  7:01 [PATCH v4 00/10] zswap IAA compress batching Kanchana P Sridhar
  2024-11-23  7:01 ` [PATCH v4 01/10] crypto: acomp - Define two new interfaces for compress/decompress batching Kanchana P Sridhar
@ 2024-11-23  7:01 ` Kanchana P Sridhar
  2024-11-23  7:01 ` [PATCH v4 03/10] crypto: iaa - Implement batch_compress(), batch_decompress() API in iaa_crypto Kanchana P Sridhar
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 39+ messages in thread
From: Kanchana P Sridhar @ 2024-11-23  7:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

If the iaa_crypto driver has async_mode set to true, and use_irq set to
false, it can still be forced to use synchronous mode by turning off the
CRYPTO_ACOMP_REQ_POLL flag in req->flags.

All three of the following need to be true for a request to be processed in
fully async poll mode:

 1) async_mode should be "true"
 2) use_irq should be "false"
 3) req->flags & CRYPTO_ACOMP_REQ_POLL should be "true"
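
[Editorial sketch, not part of the original commit message.] The three
conditions above reduce to a single predicate. The code below is a
user-space illustration only: "struct sketch_req" and
sketch_use_async_poll() are hypothetical names, and the driver itself
expresses this through its disable_async variable rather than a helper like
this one. The flag value is the one defined by this patch.

```c
#include <stdbool.h>

/* Flag value as defined in this patch (include/crypto/acompress.h). */
#define CRYPTO_ACOMP_REQ_POLL	0x00000002

/* Hypothetical stand-in for struct acomp_req; only the flags matter here. */
struct sketch_req {
	unsigned int flags;
};

/*
 * A request is handled in fully async poll mode only when all three
 * conditions hold: async_mode is on, use_irq is off, and the caller
 * set CRYPTO_ACOMP_REQ_POLL on the request.
 */
static bool sketch_use_async_poll(bool async_mode, bool use_irq,
				  const struct sketch_req *req)
{
	return async_mode && !use_irq &&
	       (req->flags & CRYPTO_ACOMP_REQ_POLL) != 0;
}
```

Clearing the flag on a request therefore forces the synchronous path even
when the driver-wide settings would otherwise allow polling.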

Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 11 ++++++++++-
 include/crypto/acompress.h                 |  5 +++++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 237f87000070..2edaecd42cc6 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1510,6 +1510,10 @@ static int iaa_comp_acompress(struct acomp_req *req)
 		return -EINVAL;
 	}
 
+	/* If the caller has requested no polling, disable async. */
+	if (!(req->flags & CRYPTO_ACOMP_REQ_POLL))
+		disable_async = true;
+
 	cpu = get_cpu();
 	wq = wq_table_next_wq(cpu);
 	put_cpu();
@@ -1702,6 +1706,7 @@ static int iaa_comp_adecompress(struct acomp_req *req)
 {
 	struct crypto_tfm *tfm = req->base.tfm;
 	dma_addr_t src_addr, dst_addr;
+	bool disable_async = false;
 	int nr_sgs, cpu, ret = 0;
 	struct iaa_wq *iaa_wq;
 	struct device *dev;
@@ -1717,6 +1722,10 @@ static int iaa_comp_adecompress(struct acomp_req *req)
 		return -EINVAL;
 	}
 
+	/* If the caller has requested no polling, disable async. */
+	if (!(req->flags & CRYPTO_ACOMP_REQ_POLL))
+		disable_async = true;
+
 	if (!req->dst)
 		return iaa_comp_adecompress_alloc_dest(req);
 
@@ -1765,7 +1774,7 @@ static int iaa_comp_adecompress(struct acomp_req *req)
 		req->dst, req->dlen, sg_dma_len(req->dst));
 
 	ret = iaa_decompress(tfm, req, wq, src_addr, req->slen,
-			     dst_addr, &req->dlen, false);
+			     dst_addr, &req->dlen, disable_async);
 	if (ret == -EINPROGRESS)
 		return ret;
 
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index 4252bab3d0e1..c1ed47405557 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -14,6 +14,11 @@
 #include <linux/crypto.h>
 
 #define CRYPTO_ACOMP_ALLOC_OUTPUT	0x00000001
+/*
+ * If set, the driver must have a way to submit the req, then
+ * poll its completion status for success/error.
+ */
+#define CRYPTO_ACOMP_REQ_POLL		0x00000002
 #define CRYPTO_ACOMP_DST_MAX		131072
 
 /**
-- 
2.27.0



* [PATCH v4 03/10] crypto: iaa - Implement batch_compress(), batch_decompress() API in iaa_crypto.
  2024-11-23  7:01 [PATCH v4 00/10] zswap IAA compress batching Kanchana P Sridhar
  2024-11-23  7:01 ` [PATCH v4 01/10] crypto: acomp - Define two new interfaces for compress/decompress batching Kanchana P Sridhar
  2024-11-23  7:01 ` [PATCH v4 02/10] crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable async mode Kanchana P Sridhar
@ 2024-11-23  7:01 ` Kanchana P Sridhar
  2024-11-26  7:05   ` kernel test robot
  2024-11-23  7:01 ` [PATCH v4 04/10] crypto: iaa - Make async mode the default Kanchana P Sridhar
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 39+ messages in thread
From: Kanchana P Sridhar @ 2024-11-23  7:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch provides iaa_crypto driver implementations for the newly added
crypto_acomp batch_compress() and batch_decompress() interfaces.

This allows swap modules such as zswap/zram to invoke batch parallel
compression/decompression of pages on systems with Intel IAA, by invoking
these API, respectively:

 crypto_acomp_batch_compress(...);
 crypto_acomp_batch_decompress(...);

This enables zswap_batch_store() compress batching code to be developed in
a manner similar to the current single-page synchronous calls to:

 crypto_acomp_compress(...);
 crypto_acomp_decompress(...);

thereby facilitating an encapsulated, modular hand-off between the kernel
zswap/zram code and the crypto_acomp layer.
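
[Editorial sketch, not part of the original commit message.] The batching
pattern in the diff below is "submit every job first, then poll until none is
still pending". The code here is a user-space simulation of that control
flow, not the driver code: "struct sketch_job", sketch_submit(),
sketch_poll() and sketch_batch() are hypothetical, and the hardware is
modelled as a counter of polls remaining before a job finishes. As in the
patch, -EAGAIN in errors[] marks a job that is still in flight, so a job's
final result must not itself be -EAGAIN.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_BATCH_SIZE 8	/* mirrors CRYPTO_BATCH_SIZE (8) in the series */

/* Hypothetical job: finishes after "polls_left" polls with "result". */
struct sketch_job {
	int polls_left;
	int result;		/* 0 or a negative errno other than -EAGAIN */
};

/* Model of an async submit: the job is queued and still in progress. */
static int sketch_submit(struct sketch_job *job)
{
	(void)job;
	return -EINPROGRESS;
}

/* Model of a completion poll: -EAGAIN until the job is done. */
static int sketch_poll(struct sketch_job *job)
{
	if (job->polls_left > 0) {
		job->polls_left--;
		return -EAGAIN;
	}
	return job->result;
}

/* Submit all jobs, then reap completions in whatever order they finish. */
static void sketch_batch(struct sketch_job jobs[], int errors[], int nr)
{
	int i, done = 0;

	assert(nr <= SKETCH_BATCH_SIZE);

	for (i = 0; i < nr; ++i)
		errors[i] = (sketch_submit(&jobs[i]) == -EINPROGRESS) ?
			    -EAGAIN : -EINVAL;

	while (!done) {
		done = 1;
		for (i = 0; i < nr; ++i) {
			if (errors[i] != -EAGAIN)
				continue;	/* already completed or failed */
			errors[i] = sketch_poll(&jobs[i]);
			if (errors[i] == -EAGAIN)
				done = 0;	/* at least one job still pending */
		}
	}
}
```

Because every job is submitted before the first poll, slow jobs overlap with
fast ones, which is the source of the parallelism described above.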

Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 337 +++++++++++++++++++++
 1 file changed, 337 insertions(+)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 2edaecd42cc6..cbf147a3c3cb 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1797,6 +1797,341 @@ static void compression_ctx_init(struct iaa_compression_ctx *ctx)
 	ctx->use_irq = use_irq;
 }
 
+static int iaa_comp_poll(struct acomp_req *req)
+{
+	struct idxd_desc *idxd_desc;
+	struct idxd_device *idxd;
+	struct iaa_wq *iaa_wq;
+	struct pci_dev *pdev;
+	struct device *dev;
+	struct idxd_wq *wq;
+	bool compress_op;
+	int ret;
+
+	idxd_desc = req->base.data;
+	if (!idxd_desc)
+		return -EAGAIN;
+
+	compress_op = (idxd_desc->iax_hw->opcode == IAX_OPCODE_COMPRESS);
+	wq = idxd_desc->wq;
+	iaa_wq = idxd_wq_get_private(wq);
+	idxd = iaa_wq->iaa_device->idxd;
+	pdev = idxd->pdev;
+	dev = &pdev->dev;
+
+	ret = check_completion(dev, idxd_desc->iax_completion, true, true);
+	if (ret == -EAGAIN)
+		return ret;
+	if (ret)
+		goto out;
+
+	req->dlen = idxd_desc->iax_completion->output_size;
+
+	/* Update stats */
+	if (compress_op) {
+		update_total_comp_bytes_out(req->dlen);
+		update_wq_comp_bytes(wq, req->dlen);
+	} else {
+		update_total_decomp_bytes_in(req->slen);
+		update_wq_decomp_bytes(wq, req->slen);
+	}
+
+	if (iaa_verify_compress && (idxd_desc->iax_hw->opcode == IAX_OPCODE_COMPRESS)) {
+		struct crypto_tfm *tfm = req->base.tfm;
+		dma_addr_t src_addr, dst_addr;
+		u32 compression_crc;
+
+		compression_crc = idxd_desc->iax_completion->crc;
+
+		dma_sync_sg_for_device(dev, req->dst, 1, DMA_FROM_DEVICE);
+		dma_sync_sg_for_device(dev, req->src, 1, DMA_TO_DEVICE);
+
+		src_addr = sg_dma_address(req->src);
+		dst_addr = sg_dma_address(req->dst);
+
+		ret = iaa_compress_verify(tfm, req, wq, src_addr, req->slen,
+					  dst_addr, &req->dlen, compression_crc);
+	}
+out:
+	/* caller doesn't call crypto_wait_req, so no acomp_request_complete() */
+
+	dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
+	dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
+
+	idxd_free_desc(idxd_desc->wq, idxd_desc);
+
+	dev_dbg(dev, "%s: returning ret=%d\n", __func__, ret);
+
+	return ret;
+}
+
+static void iaa_set_req_poll(
+	struct acomp_req *reqs[],
+	int nr_reqs,
+	bool set_flag)
+{
+	int i;
+
+	for (i = 0; i < nr_reqs; ++i) {
+		set_flag ? (reqs[i]->flags |= CRYPTO_ACOMP_REQ_POLL) :
+			   (reqs[i]->flags &= ~CRYPTO_ACOMP_REQ_POLL);
+	}
+}
+
+/**
+ * This API provides IAA compress batching functionality for use by swap
+ * modules.
+ *
+ * @reqs: @nr_pages asynchronous compress requests.
+ * @wait: crypto_wait for synchronous acomp batch compress. If NULL, the
+ *        completions will be processed asynchronously.
+ * @pages: Pages to be compressed by IAA in parallel.
+ * @dsts: Pre-allocated destination buffers to store results of IAA
+ *        compression. Each element of @dsts must be of size "PAGE_SIZE * 2".
+ * @dlens: Will contain the compressed lengths.
+ * @errors: zero on successful compression of the corresponding
+ *          req, or error code in case of error.
+ * @nr_pages: The number of pages, up to CRYPTO_BATCH_SIZE,
+ *            to be compressed.
+ */
+static void iaa_comp_acompress_batch(
+	struct acomp_req *reqs[],
+	struct crypto_wait *wait,
+	struct page *pages[],
+	u8 *dsts[],
+	unsigned int dlens[],
+	int errors[],
+	int nr_pages)
+{
+	struct scatterlist inputs[CRYPTO_BATCH_SIZE];
+	struct scatterlist outputs[CRYPTO_BATCH_SIZE];
+	bool compressions_done = false;
+	bool poll = (async_mode && !use_irq);
+	int i;
+
+	BUG_ON(nr_pages > CRYPTO_BATCH_SIZE);
+	BUG_ON(!poll && !wait);
+
+	if (poll)
+		iaa_set_req_poll(reqs, nr_pages, true);
+	else
+		iaa_set_req_poll(reqs, nr_pages, false);
+
+	/*
+	 * Prepare and submit acomp_reqs to IAA. IAA will process these
+	 * compress jobs in parallel if async-poll mode is enabled.
+	 * If IAA is used in sync mode, the jobs will be processed sequentially
+	 * using "wait".
+	 */
+	for (i = 0; i < nr_pages; ++i) {
+		sg_init_table(&inputs[i], 1);
+		sg_set_page(&inputs[i], pages[i], PAGE_SIZE, 0);
+
+		/*
+		 * Each dst buffer should be of size (PAGE_SIZE * 2).
+		 * Reflect same in sg_list.
+		 */
+		sg_init_one(&outputs[i], dsts[i], PAGE_SIZE * 2);
+		acomp_request_set_params(reqs[i], &inputs[i],
+					 &outputs[i], PAGE_SIZE, dlens[i]);
+
+		/*
+		 * If poll is in effect, submit the request now, and poll for
+		 * a completion status later, after all descriptors have been
+		 * submitted. If polling is not enabled, submit the request
+		 * and wait for it to complete, i.e., synchronously, before
+		 * moving on to the next request.
+		 */
+		if (poll) {
+			errors[i] = iaa_comp_acompress(reqs[i]);
+
+			if (errors[i] != -EINPROGRESS)
+				errors[i] = -EINVAL;
+			else
+				errors[i] = -EAGAIN;
+		} else {
+			acomp_request_set_callback(reqs[i],
+						   CRYPTO_TFM_REQ_MAY_BACKLOG,
+						   crypto_req_done, wait);
+			errors[i] = crypto_wait_req(iaa_comp_acompress(reqs[i]),
+						    wait);
+			if (!errors[i])
+				dlens[i] = reqs[i]->dlen;
+		}
+	}
+
+	/*
+	 * If not doing async compressions, the batch has been processed at
+	 * this point and we can return.
+	 */
+	if (!poll)
+		goto reset_reqs_wait;
+
+	/*
+	 * Poll for and process IAA compress job completions
+	 * in out-of-order manner.
+	 */
+	while (!compressions_done) {
+		compressions_done = true;
+
+		for (i = 0; i < nr_pages; ++i) {
+			/*
+			 * Skip, if the compression has already completed
+			 * successfully or with an error.
+			 */
+			if (errors[i] != -EAGAIN)
+				continue;
+
+			errors[i] = iaa_comp_poll(reqs[i]);
+
+			if (errors[i]) {
+				if (errors[i] == -EAGAIN)
+					compressions_done = false;
+			} else {
+				dlens[i] = reqs[i]->dlen;
+			}
+		}
+	}
+
+reset_reqs_wait:
+	/*
+	 * For the same 'reqs[]' and 'wait' to be usable by
+	 * iaa_comp_acompress()/iaa_comp_adecompress():
+	 * Clear the CRYPTO_ACOMP_REQ_POLL bit on the acomp_reqs.
+	 * Reset the crypto_wait "wait" callback to reqs[0].
+	 */
+	iaa_set_req_poll(reqs, nr_pages, false);
+	acomp_request_set_callback(reqs[0],
+				   CRYPTO_TFM_REQ_MAY_BACKLOG,
+				   crypto_req_done, wait);
+}
+
+/**
+ * This API provides IAA decompress batching functionality for use by swap
+ * modules.
+ *
+ * @reqs: @nr_pages asynchronous decompress requests.
+ * @wait: crypto_wait for synchronous acomp batch decompress. If NULL, the
+ *        driver must provide a way to process completions asynchronously.
+ * @srcs: The src buffers to be decompressed by IAA in parallel.
+ * @pages: The pages to store the decompressed buffers.
+ * @slens: Compressed lengths of @srcs.
+ * @errors: zero on successful decompression of the corresponding
+ *          req, or error code in case of error.
+ * @nr_pages: The number of pages, up to CRYPTO_BATCH_SIZE,
+ *            to be decompressed.
+ */
+static void iaa_comp_adecompress_batch(
+	struct acomp_req *reqs[],
+	struct crypto_wait *wait,
+	u8 *srcs[],
+	struct page *pages[],
+	unsigned int slens[],
+	int errors[],
+	int nr_pages)
+{
+	struct scatterlist inputs[CRYPTO_BATCH_SIZE];
+	struct scatterlist outputs[CRYPTO_BATCH_SIZE];
+	unsigned int dlens[CRYPTO_BATCH_SIZE];
+	bool decompressions_done = false;
+	bool poll = (async_mode && !use_irq);
+	int i;
+
+	BUG_ON(nr_pages > CRYPTO_BATCH_SIZE);
+	BUG_ON(!poll && !wait);
+
+	if (poll)
+		iaa_set_req_poll(reqs, nr_pages, true);
+	else
+		iaa_set_req_poll(reqs, nr_pages, false);
+
+	/*
+	 * Prepare and submit acomp_reqs to IAA. IAA will process these
+	 * decompress jobs in parallel if async-poll mode is enabled.
+	 * If IAA is used in sync mode, the jobs will be processed sequentially
+	 * using "wait".
+	 */
+	for (i = 0; i < nr_pages; ++i) {
+		dlens[i] = PAGE_SIZE;
+		sg_init_one(&inputs[i], srcs[i], slens[i]);
+		sg_init_table(&outputs[i], 1);
+		sg_set_page(&outputs[i], pages[i], PAGE_SIZE, 0);
+		acomp_request_set_params(reqs[i], &inputs[i],
+					&outputs[i], slens[i], dlens[i]);
+		/*
+		 * If poll is in effect, submit the request now, and poll for
+		 * a completion status later, after all descriptors have been
+		 * submitted. If polling is not enabled, submit the request
+		 * and wait for it to complete, i.e., synchronously, before
+		 * moving on to the next request.
+		 */
+		if (poll) {
+			errors[i] = iaa_comp_adecompress(reqs[i]);
+
+			if (errors[i] != -EINPROGRESS)
+				errors[i] = -EINVAL;
+			else
+				errors[i] = -EAGAIN;
+		} else {
+			acomp_request_set_callback(reqs[i],
+						   CRYPTO_TFM_REQ_MAY_BACKLOG,
+						   crypto_req_done, wait);
+			errors[i] = crypto_wait_req(iaa_comp_adecompress(reqs[i]),
+						    wait);
+			if (!errors[i]) {
+				dlens[i] = reqs[i]->dlen;
+				BUG_ON(dlens[i] != PAGE_SIZE);
+			}
+		}
+	}
+
+	/*
+	 * If not doing async decompressions, the batch has been processed at
+	 * this point and we can return.
+	 */
+	if (!poll)
+		goto reset_reqs_wait;
+
+	/*
+	 * Poll for and process IAA decompress job completions
+	 * in out-of-order manner.
+	 */
+	while (!decompressions_done) {
+		decompressions_done = true;
+
+		for (i = 0; i < nr_pages; ++i) {
+			/*
+			 * Skip, if the decompression has already completed
+			 * successfully or with an error.
+			 */
+			if (errors[i] != -EAGAIN)
+				continue;
+
+			errors[i] = iaa_comp_poll(reqs[i]);
+
+			if (errors[i]) {
+				if (errors[i] == -EAGAIN)
+					decompressions_done = false;
+			} else {
+				dlens[i] = reqs[i]->dlen;
+				BUG_ON(dlens[i] != PAGE_SIZE);
+			}
+		}
+	}
+
+reset_reqs_wait:
+	/*
+	 * For the same 'reqs[]' and 'wait' to be usable by
+	 * iaa_comp_acompress()/iaa_comp_adecompress():
+	 * Clear the CRYPTO_ACOMP_REQ_POLL bit on the acomp_reqs.
+	 * Reset the crypto_wait "wait" callback to reqs[0].
+	 */
+	iaa_set_req_poll(reqs, nr_pages, false);
+	acomp_request_set_callback(reqs[0],
+				   CRYPTO_TFM_REQ_MAY_BACKLOG,
+				   crypto_req_done, wait);
+}
+
 static int iaa_comp_init_fixed(struct crypto_acomp *acomp_tfm)
 {
 	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
@@ -1822,6 +2157,8 @@ static struct acomp_alg iaa_acomp_fixed_deflate = {
 	.compress		= iaa_comp_acompress,
 	.decompress		= iaa_comp_adecompress,
 	.dst_free               = dst_free,
+	.batch_compress		= iaa_comp_acompress_batch,
+	.batch_decompress	= iaa_comp_adecompress_batch,
 	.base			= {
 		.cra_name		= "deflate",
 		.cra_driver_name	= "deflate-iaa",
-- 
2.27.0
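
The submit-then-poll flow used by the batch functions in this patch can be
sketched in userspace as follows. This is only an illustration under stated
assumptions: fake_job, fake_submit(), fake_poll(), run_batch() and BATCH_SIZE
are hypothetical stand-ins and not iaa_crypto symbols; only the status
conventions (-EINPROGRESS on submit, -EAGAIN while pending) follow the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define BATCH_SIZE 8	/* hypothetical stand-in for CRYPTO_BATCH_SIZE */

/* Hypothetical simulated job: pending until polls_left reaches zero. */
struct fake_job {
	int polls_left;	/* polls remaining before the job completes */
	int result;	/* final status: 0 or a negative errno */
};

/* Simulated submit: always accepted, reported as -EINPROGRESS. */
static int fake_submit(struct fake_job *job)
{
	(void)job;
	return -EINPROGRESS;
}

/* Simulated poll: -EAGAIN while pending, the final status once done. */
static int fake_poll(struct fake_job *job)
{
	if (job->polls_left > 0) {
		job->polls_left--;
		return -EAGAIN;
	}
	return job->result;
}

/* Submit every job first, then poll until no job is still pending. */
static void run_batch(struct fake_job jobs[], int errors[], int nr)
{
	bool done = false;
	int i;

	assert(nr >= 0 && nr <= BATCH_SIZE);

	/* Phase 1: submit all jobs and mark accepted ones as pending. */
	for (i = 0; i < nr; i++)
		errors[i] = (fake_submit(&jobs[i]) == -EINPROGRESS) ?
			    -EAGAIN : -EINVAL;

	/* Phase 2: collect completions in whatever order they arrive. */
	while (!done) {
		done = true;
		for (i = 0; i < nr; i++) {
			if (errors[i] != -EAGAIN)
				continue;	/* finished earlier */
			errors[i] = fake_poll(&jobs[i]);
			if (errors[i] == -EAGAIN)
				done = false;
		}
	}
}
```

A job that fails keeps its error code in errors[] and is skipped on later
passes, so one failed page does not stop the rest of the batch from being
collected.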



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 03/10] crypto: iaa - Implement batch_compress(), batch_decompress() API in iaa_crypto.
  2024-11-23  7:01 ` [PATCH v4 03/10] crypto: iaa - Implement batch_compress(), batch_decompress() API in iaa_crypto Kanchana P Sridhar
@ 2024-11-26  7:05   ` kernel test robot
  0 siblings, 0 replies; 39+ messages in thread
From: kernel test robot @ 2024-11-26  7:05 UTC (permalink / raw)
  To: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, yosryahmed,
	nphamcs, chengming.zhou, usamaarif642, ryan.roberts, ying.huang,
	21cnbao, akpm, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi
  Cc: oe-kbuild-all, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Hi Kanchana,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 5a7056135bb69da2ce0a42eb8c07968c1331777b]

url:    https://github.com/intel-lab-lkp/linux/commits/Kanchana-P-Sridhar/crypto-acomp-Define-two-new-interfaces-for-compress-decompress-batching/20241125-110412
base:   5a7056135bb69da2ce0a42eb8c07968c1331777b
patch link:    https://lore.kernel.org/r/20241123070127.332773-4-kanchana.p.sridhar%40intel.com
patch subject: [PATCH v4 03/10] crypto: iaa - Implement batch_compress(), batch_decompress() API in iaa_crypto.
config: x86_64-rhel-9.4 (https://download.01.org/0day-ci/archive/20241126/202411261737.ozFff8Ym-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241126/202411261737.ozFff8Ym-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202411261737.ozFff8Ym-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> drivers/crypto/intel/iaa/iaa_crypto_main.c:1882: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
    * This API provides IAA compress batching functionality for use by swap
   drivers/crypto/intel/iaa/iaa_crypto_main.c:2010: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
    * This API provides IAA decompress batching functionality for use by swap
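
For reference, a comment opened with '/**' is parsed as kernel-doc, which
expects the first line to name the function followed by " - " and a short
description. A minimal sketch of that layout is shown below; sum_lengths() is
a hypothetical helper used only for illustration and is not part of
iaa_crypto. The other common way to silence this warning is to open the
comment with a plain '/*'.

```c
#include <assert.h>
#include <stddef.h>

/**
 * sum_lengths() - Add up the lengths of a batch of buffers.
 * @lens: Array of @nr buffer lengths.
 * @nr: Number of entries in @lens.
 *
 * Hypothetical helper, shown only to illustrate the comment layout.
 *
 * Return: The total of all @nr lengths.
 */
static unsigned int sum_lengths(const unsigned int lens[], size_t nr)
{
	unsigned int total = 0;
	size_t i;

	assert(lens != NULL || nr == 0);

	for (i = 0; i < nr; i++)
		total += lens[i];

	return total;
}
```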


vim +1882 drivers/crypto/intel/iaa/iaa_crypto_main.c

  1880	
  1881	/**
> 1882	 * This API provides IAA compress batching functionality for use by swap
  1883	 * modules.
  1884	 *
  1885	 * @reqs: @nr_pages asynchronous compress requests.
  1886	 * @wait: crypto_wait for synchronous acomp batch compress. If NULL, the
  1887	 *        completions will be processed asynchronously.
  1888	 * @pages: Pages to be compressed by IAA in parallel.
  1889	 * @dsts: Pre-allocated destination buffers to store results of IAA
  1890	 *        compression. Each element of @dsts must be of size "PAGE_SIZE * 2".
  1891	 * @dlens: Will contain the compressed lengths.
  1892	 * @errors: zero on successful compression of the corresponding
  1893	 *          req, or error code in case of error.
  1894	 * @nr_pages: The number of pages, up to CRYPTO_BATCH_SIZE,
  1895	 *            to be compressed.
  1896	 */
  1897	static void iaa_comp_acompress_batch(
  1898		struct acomp_req *reqs[],
  1899		struct crypto_wait *wait,
  1900		struct page *pages[],
  1901		u8 *dsts[],
  1902		unsigned int dlens[],
  1903		int errors[],
  1904		int nr_pages)
  1905	{
  1906		struct scatterlist inputs[CRYPTO_BATCH_SIZE];
  1907		struct scatterlist outputs[CRYPTO_BATCH_SIZE];
  1908		bool compressions_done = false;
  1909		bool poll = (async_mode && !use_irq);
  1910		int i;
  1911	
  1912		BUG_ON(nr_pages > CRYPTO_BATCH_SIZE);
  1913		BUG_ON(!poll && !wait);
  1914	
  1915		if (poll)
  1916			iaa_set_req_poll(reqs, nr_pages, true);
  1917		else
  1918			iaa_set_req_poll(reqs, nr_pages, false);
  1919	
  1920		/*
  1921		 * Prepare and submit acomp_reqs to IAA. IAA will process these
  1922		 * compress jobs in parallel if async-poll mode is enabled.
  1923		 * If IAA is used in sync mode, the jobs will be processed sequentially
  1924		 * using "wait".
  1925		 */
  1926		for (i = 0; i < nr_pages; ++i) {
  1927			sg_init_table(&inputs[i], 1);
  1928			sg_set_page(&inputs[i], pages[i], PAGE_SIZE, 0);
  1929	
  1930			/*
  1931			 * Each dst buffer should be of size (PAGE_SIZE * 2).
  1932			 * Reflect same in sg_list.
  1933			 */
  1934			sg_init_one(&outputs[i], dsts[i], PAGE_SIZE * 2);
  1935			acomp_request_set_params(reqs[i], &inputs[i],
  1936						 &outputs[i], PAGE_SIZE, dlens[i]);
  1937	
  1938			/*
  1939			 * If poll is in effect, submit the request now, and poll for
  1940			 * a completion status later, after all descriptors have been
  1941			 * submitted. If polling is not enabled, submit the request
  1942			 * and wait for it to complete, i.e., synchronously, before
  1943			 * moving on to the next request.
  1944			 */
  1945			if (poll) {
  1946				errors[i] = iaa_comp_acompress(reqs[i]);
  1947	
  1948				if (errors[i] != -EINPROGRESS)
  1949					errors[i] = -EINVAL;
  1950				else
  1951					errors[i] = -EAGAIN;
  1952			} else {
  1953				acomp_request_set_callback(reqs[i],
  1954							   CRYPTO_TFM_REQ_MAY_BACKLOG,
  1955							   crypto_req_done, wait);
  1956				errors[i] = crypto_wait_req(iaa_comp_acompress(reqs[i]),
  1957							    wait);
  1958				if (!errors[i])
  1959					dlens[i] = reqs[i]->dlen;
  1960			}
  1961		}
  1962	
  1963		/*
  1964		 * If not doing async compressions, the batch has been processed at
  1965		 * this point and we can return.
  1966		 */
  1967		if (!poll)
  1968			goto reset_reqs_wait;
  1969	
  1970		/*
  1971		 * Poll for and process IAA compress job completions
  1972		 * in out-of-order manner.
  1973		 */
  1974		while (!compressions_done) {
  1975			compressions_done = true;
  1976	
  1977			for (i = 0; i < nr_pages; ++i) {
  1978				/*
  1979				 * Skip, if the compression has already completed
  1980				 * successfully or with an error.
  1981				 */
  1982				if (errors[i] != -EAGAIN)
  1983					continue;
  1984	
  1985				errors[i] = iaa_comp_poll(reqs[i]);
  1986	
  1987				if (errors[i]) {
  1988					if (errors[i] == -EAGAIN)
  1989						compressions_done = false;
  1990				} else {
  1991					dlens[i] = reqs[i]->dlen;
  1992				}
  1993			}
  1994		}
  1995	
  1996	reset_reqs_wait:
  1997		/*
  1998		 * For the same 'reqs[]' and 'wait' to be usable by
  1999		 * iaa_comp_acompress()/iaa_comp_deacompress():
  2000		 * Clear the CRYPTO_ACOMP_REQ_POLL bit on the acomp_reqs.
  2001		 * Reset the crypto_wait "wait" callback to reqs[0].
  2002		 */
  2003		iaa_set_req_poll(reqs, nr_pages, false);
  2004		acomp_request_set_callback(reqs[0],
  2005					   CRYPTO_TFM_REQ_MAY_BACKLOG,
  2006					   crypto_req_done, wait);
  2007	}
  2008	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



* [PATCH v4 04/10] crypto: iaa - Make async mode the default.
  2024-11-23  7:01 [PATCH v4 00/10] zswap IAA compress batching Kanchana P Sridhar
                   ` (2 preceding siblings ...)
  2024-11-23  7:01 ` [PATCH v4 03/10] crypto: iaa - Implement batch_compress(), batch_decompress() API in iaa_crypto Kanchana P Sridhar
@ 2024-11-23  7:01 ` Kanchana P Sridhar
  2024-11-23  7:01 ` [PATCH v4 05/10] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 39+ messages in thread
From: Kanchana P Sridhar @ 2024-11-23  7:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

By default, load the iaa_crypto driver in the most efficient/recommended
"async" mode for parallel compressions/decompressions, namely,
asynchronous submission of descriptors followed by polling for job
completions. Previously, "sync" mode was the default.

With this change, anyone who wants to use IAA in async poll mode can do so
right after building the kernel, without having to go through these steps:

  1) disable all the IAA device/wq bindings that happen at boot time
  2) rmmod iaa_crypto
  3) modprobe iaa_crypto
  4) echo async > /sys/bus/dsa/drivers/crypto/sync_mode
  5) re-run initialization of the IAA devices and wqs
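
As a rough illustration of how the mode name relates to the batching code,
the sketch below maps a sync_mode string to the two driver flags and derives
the "poll" condition used in patch 03 (async_mode && !use_irq). The mapping
in parse_sync_mode() is an assumption inferred from the mode descriptions in
the driver comment; the struct and both function names are hypothetical and
do not reproduce the driver's own set_iaa_sync_mode().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical container for the driver's two mode flags. */
struct mode_flags {
	bool async_mode;
	bool use_irq;
};

/* Assumed mapping from mode name to flags; returns 0 or -EINVAL. */
static int parse_sync_mode(const char *name, struct mode_flags *m)
{
	assert(name && m);

	if (!strcmp(name, "sync")) {
		m->async_mode = false;
		m->use_irq = false;
	} else if (!strcmp(name, "async")) {
		m->async_mode = true;
		m->use_irq = false;
	} else if (!strcmp(name, "async_irq")) {
		m->async_mode = true;
		m->use_irq = true;
	} else {
		return -EINVAL;
	}
	return 0;
}

/* Same condition the batch functions use to choose submit-then-poll. */
static bool uses_polling(const struct mode_flags *m)
{
	return m->async_mode && !m->use_irq;
}
```

Under this assumed mapping, only "async" selects the polling path; "sync" and
"async_irq" both fall back to waiting on each request in turn.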

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index cbf147a3c3cb..bd2db0b6f145 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -153,7 +153,7 @@ static DRIVER_ATTR_RW(verify_compress);
  */
 
 /* Use async mode */
-static bool async_mode;
+static bool async_mode = true;
 /* Use interrupts */
 static bool use_irq;
 
-- 
2.27.0




* [PATCH v4 05/10] crypto: iaa - Disable iaa_verify_compress by default.
  2024-11-23  7:01 [PATCH v4 00/10] zswap IAA compress batching Kanchana P Sridhar
                   ` (3 preceding siblings ...)
  2024-11-23  7:01 ` [PATCH v4 04/10] crypto: iaa - Make async mode the default Kanchana P Sridhar
@ 2024-11-23  7:01 ` Kanchana P Sridhar
  2024-11-23  7:01 ` [PATCH v4 06/10] crypto: iaa - Re-organize the iaa_crypto driver code Kanchana P Sridhar
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 39+ messages in thread
From: Kanchana P Sridhar @ 2024-11-23  7:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

By default, load the iaa_crypto driver with "iaa_verify_compress"
disabled, to facilitate performance comparisons with software compressors
(which also do not run compress verification by default). Previously,
iaa_crypto compress verification was enabled by default.

With this patch, users who want to enable compress verification can do so
with these steps:

  1) disable all the IAA device/wq bindings that happen at boot time
  2) rmmod iaa_crypto
  3) modprobe iaa_crypto
  4) echo 1 > /sys/bus/dsa/drivers/crypto/verify_compress
  5) re-run initialization of the IAA devices and wqs
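
To make the cost of verification concrete: the driver's verify step
decompresses the freshly compressed data with the destination write
suppressed and compares the resulting CRC with the CRC recorded at compress
time. The userspace sketch below mimics that idea with a toy run-length
codec. All names here (crc32_update, rle_compress, rle_decode_crc,
verify_compress) are hypothetical, and the codec and CRC routine are
stand-ins, not what the IAA hardware implements.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), one byte at a time. */
static uint32_t crc32_update(uint32_t crc, uint8_t byte)
{
	int bit;

	crc ^= byte;
	for (bit = 0; bit < 8; bit++)
		crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	return crc;
}

/*
 * Toy run-length encoder emitting (count, value) pairs. Stores the CRC of
 * the source in *crc_out. Returns bytes written, or 0 if dst is too small
 * (an empty input also yields 0).
 */
static size_t rle_compress(const uint8_t *src, size_t slen,
			   uint8_t *dst, size_t dcap, uint32_t *crc_out)
{
	uint32_t crc = 0xFFFFFFFFu;
	size_t i = 0, out = 0;

	assert(src && dst && crc_out);

	while (i < slen) {
		uint8_t val = src[i];
		size_t run = 1, k;

		while (i + run < slen && src[i + run] == val && run < 255)
			run++;
		if (out + 2 > dcap)
			return 0;
		dst[out++] = (uint8_t)run;
		dst[out++] = val;
		for (k = 0; k < run; k++)
			crc = crc32_update(crc, val);
		i += run;
	}
	*crc_out = ~crc;
	return out;
}

/* Decode without storing output; only the CRC of decoded bytes is kept. */
static uint32_t rle_decode_crc(const uint8_t *comp, size_t clen)
{
	uint32_t crc = 0xFFFFFFFFu;
	size_t i;

	for (i = 0; i + 1 < clen; i += 2) {
		uint8_t run = comp[i];

		while (run--)
			crc = crc32_update(crc, comp[i + 1]);
	}
	return ~crc;
}

/* Return 0 if the decode-side CRC matches the compress-side CRC. */
static int verify_compress(const uint8_t *comp, size_t clen,
			   uint32_t compression_crc)
{
	return rle_decode_crc(comp, clen) == compression_crc ? 0 : -EINVAL;
}
```

The verify pass costs a second walk over the compressed data for every
compress, which is the overhead the new default avoids.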

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index bd2db0b6f145..a572803a53d0 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -94,7 +94,7 @@ static bool iaa_crypto_enabled;
 static bool iaa_crypto_registered;
 
 /* Verify results of IAA compress or not */
-static bool iaa_verify_compress = true;
+static bool iaa_verify_compress = false;
 
 static ssize_t verify_compress_show(struct device_driver *driver, char *buf)
 {
-- 
2.27.0




* [PATCH v4 06/10] crypto: iaa - Re-organize the iaa_crypto driver code.
  2024-11-23  7:01 [PATCH v4 00/10] zswap IAA compress batching Kanchana P Sridhar
                   ` (4 preceding siblings ...)
  2024-11-23  7:01 ` [PATCH v4 05/10] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
@ 2024-11-23  7:01 ` Kanchana P Sridhar
  2024-11-23  7:01 ` [PATCH v4 07/10] crypto: iaa - Map IAA devices/wqs to cores based on packages instead of NUMA Kanchana P Sridhar
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 39+ messages in thread
From: Kanchana P Sridhar @ 2024-11-23  7:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch merely reorganizes the code in iaa_crypto_main.c, so that
the functions are consolidated into logically related sub-sections of
code.

This is expected to make the code more maintainable, and to make it easier
to replace functional layers and/or add new features.
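
As an example of one of the regrouped helpers, wq_table_next_wq() picks the
next work queue for a CPU by advancing a per-entry cursor with wraparound.
A minimal userspace sketch of that selection is given below; the struct and
function names are hypothetical, and plain integer ids stand in for the
struct idxd_wq pointers held in the driver's per-cpu table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one per-cpu wq_table entry. */
struct wq_entry_sketch {
	const int *wqs;	/* ids standing in for work-queue pointers */
	int n_wqs;	/* number of valid slots in wqs[] */
	int cur_wq;	/* index of the most recently returned slot */
};

/* Advance the cursor, wrapping at n_wqs, and return the selected id. */
static int next_wq(struct wq_entry_sketch *e)
{
	assert(e != NULL && e->n_wqs > 0);

	if (++e->cur_wq >= e->n_wqs)
		e->cur_wq = 0;

	return e->wqs[e->cur_wq];
}
```

Because the cursor is incremented before use, the first call on a fresh
entry returns slot 1 rather than slot 0, and slot 0 is reached only after
the cursor wraps.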

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 540 +++++++++++----------
 1 file changed, 275 insertions(+), 265 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index a572803a53d0..c2362e4525bd 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -24,6 +24,9 @@
 
 #define IAA_ALG_PRIORITY               300
 
+/**************************************
+ * Driver internal global variables.
+ **************************************/
 /* number of iaa instances probed */
 static unsigned int nr_iaa;
 static unsigned int nr_cpus;
@@ -36,55 +39,46 @@ static unsigned int cpus_per_iaa;
 static struct crypto_comp *deflate_generic_tfm;
 
 /* Per-cpu lookup table for balanced wqs */
-static struct wq_table_entry __percpu *wq_table;
+static struct wq_table_entry __percpu *wq_table = NULL;
 
-static struct idxd_wq *wq_table_next_wq(int cpu)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
-	if (++entry->cur_wq >= entry->n_wqs)
-		entry->cur_wq = 0;
-
-	if (!entry->wqs[entry->cur_wq])
-		return NULL;
-
-	pr_debug("%s: returning wq at idx %d (iaa wq %d.%d) from cpu %d\n", __func__,
-		 entry->cur_wq, entry->wqs[entry->cur_wq]->idxd->id,
-		 entry->wqs[entry->cur_wq]->id, cpu);
-
-	return entry->wqs[entry->cur_wq];
-}
-
-static void wq_table_add(int cpu, struct idxd_wq *wq)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
-	if (WARN_ON(entry->n_wqs == entry->max_wqs))
-		return;
-
-	entry->wqs[entry->n_wqs++] = wq;
-
-	pr_debug("%s: added iaa wq %d.%d to idx %d of cpu %d\n", __func__,
-		 entry->wqs[entry->n_wqs - 1]->idxd->id,
-		 entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
-}
-
-static void wq_table_free_entry(int cpu)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+/* Verify results of IAA compress or not */
+static bool iaa_verify_compress = false;
 
-	kfree(entry->wqs);
-	memset(entry, 0, sizeof(*entry));
-}
+/*
+ * The iaa crypto driver supports three 'sync' methods determining how
+ * compressions and decompressions are performed:
+ *
+ * - sync:      the compression or decompression completes before
+ *              returning.  This is the mode used by the async crypto
+ *              interface when the sync mode is set to 'sync' and by
+ *              the sync crypto interface regardless of setting.
+ *
+ * - async:     the compression or decompression is submitted and returns
+ *              immediately.  Completion interrupts are not used so
+ *              the caller is responsible for polling the descriptor
+ *              for completion.  This mode is applicable to only the
+ *              async crypto interface and is ignored for anything
+ *              else.
+ *
+ * - async_irq: the compression or decompression is submitted and
+ *              returns immediately.  Completion interrupts are
+ *              enabled so the caller can wait for the completion and
+ *              yield to other threads.  When the compression or
+ *              decompression completes, the completion is signaled
+ *              and the caller awakened.  This mode is applicable to
+ *              only the async crypto interface and is ignored for
+ *              anything else.
+ *
+ * These modes can be set using the iaa_crypto sync_mode driver
+ * attribute.
+ */
 
-static void wq_table_clear_entry(int cpu)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+/* Use async mode */
+static bool async_mode = true;
+/* Use interrupts */
+static bool use_irq;
 
-	entry->n_wqs = 0;
-	entry->cur_wq = 0;
-	memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
-}
+static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
 
 LIST_HEAD(iaa_devices);
 DEFINE_MUTEX(iaa_devices_lock);
@@ -93,9 +87,9 @@ DEFINE_MUTEX(iaa_devices_lock);
 static bool iaa_crypto_enabled;
 static bool iaa_crypto_registered;
 
-/* Verify results of IAA compress or not */
-static bool iaa_verify_compress = false;
-
+/**************************************************
+ * Driver attributes along with get/set functions.
+ **************************************************/
 static ssize_t verify_compress_show(struct device_driver *driver, char *buf)
 {
 	return sprintf(buf, "%d\n", iaa_verify_compress);
@@ -123,40 +117,6 @@ static ssize_t verify_compress_store(struct device_driver *driver,
 }
 static DRIVER_ATTR_RW(verify_compress);
 
-/*
- * The iaa crypto driver supports three 'sync' methods determining how
- * compressions and decompressions are performed:
- *
- * - sync:      the compression or decompression completes before
- *              returning.  This is the mode used by the async crypto
- *              interface when the sync mode is set to 'sync' and by
- *              the sync crypto interface regardless of setting.
- *
- * - async:     the compression or decompression is submitted and returns
- *              immediately.  Completion interrupts are not used so
- *              the caller is responsible for polling the descriptor
- *              for completion.  This mode is applicable to only the
- *              async crypto interface and is ignored for anything
- *              else.
- *
- * - async_irq: the compression or decompression is submitted and
- *              returns immediately.  Completion interrupts are
- *              enabled so the caller can wait for the completion and
- *              yield to other threads.  When the compression or
- *              decompression completes, the completion is signaled
- *              and the caller awakened.  This mode is applicable to
- *              only the async crypto interface and is ignored for
- *              anything else.
- *
- * These modes can be set using the iaa_crypto sync_mode driver
- * attribute.
- */
-
-/* Use async mode */
-static bool async_mode = true;
-/* Use interrupts */
-static bool use_irq;
-
 /**
  * set_iaa_sync_mode - Set IAA sync mode
  * @name: The name of the sync mode
@@ -219,8 +179,9 @@ static ssize_t sync_mode_store(struct device_driver *driver,
 }
 static DRIVER_ATTR_RW(sync_mode);
 
-static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
-
+/****************************
+ * Driver compression modes.
+ ****************************/
 static int find_empty_iaa_compression_mode(void)
 {
 	int i = -EINVAL;
@@ -411,11 +372,6 @@ static void free_device_compression_mode(struct iaa_device *iaa_device,
 						IDXD_OP_FLAG_WR_SRC2_AECS_COMP | \
 						IDXD_OP_FLAG_AECS_RW_TGLS)
 
-static int check_completion(struct device *dev,
-			    struct iax_completion_record *comp,
-			    bool compress,
-			    bool only_once);
-
 static int init_device_compression_mode(struct iaa_device *iaa_device,
 					struct iaa_compression_mode *mode,
 					int idx, struct idxd_wq *wq)
@@ -502,6 +458,10 @@ static void remove_device_compression_modes(struct iaa_device *iaa_device)
 	}
 }
 
+/***********************************************************
+ * Functions for use in crypto probe and remove interfaces:
+ * allocate/init/query/deallocate devices/wqs.
+ ***********************************************************/
 static struct iaa_device *iaa_device_alloc(void)
 {
 	struct iaa_device *iaa_device;
@@ -614,16 +574,6 @@ static void del_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
 	}
 }
 
-static void clear_wq_table(void)
-{
-	int cpu;
-
-	for (cpu = 0; cpu < nr_cpus; cpu++)
-		wq_table_clear_entry(cpu);
-
-	pr_debug("cleared wq table\n");
-}
-
 static void free_iaa_device(struct iaa_device *iaa_device)
 {
 	if (!iaa_device)
@@ -704,43 +654,6 @@ static int iaa_wq_put(struct idxd_wq *wq)
 	return ret;
 }
 
-static void free_wq_table(void)
-{
-	int cpu;
-
-	for (cpu = 0; cpu < nr_cpus; cpu++)
-		wq_table_free_entry(cpu);
-
-	free_percpu(wq_table);
-
-	pr_debug("freed wq table\n");
-}
-
-static int alloc_wq_table(int max_wqs)
-{
-	struct wq_table_entry *entry;
-	int cpu;
-
-	wq_table = alloc_percpu(struct wq_table_entry);
-	if (!wq_table)
-		return -ENOMEM;
-
-	for (cpu = 0; cpu < nr_cpus; cpu++) {
-		entry = per_cpu_ptr(wq_table, cpu);
-		entry->wqs = kcalloc(max_wqs, sizeof(struct wq *), GFP_KERNEL);
-		if (!entry->wqs) {
-			free_wq_table();
-			return -ENOMEM;
-		}
-
-		entry->max_wqs = max_wqs;
-	}
-
-	pr_debug("initialized wq table\n");
-
-	return 0;
-}
-
 static int save_iaa_wq(struct idxd_wq *wq)
 {
 	struct iaa_device *iaa_device, *found = NULL;
@@ -829,6 +742,87 @@ static void remove_iaa_wq(struct idxd_wq *wq)
 		cpus_per_iaa = 1;
 }
 
+/***************************************************************
+ * Mapping IAA devices and wqs to cores with per-cpu wq_tables.
+ ***************************************************************/
+static void wq_table_free_entry(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+	kfree(entry->wqs);
+	memset(entry, 0, sizeof(*entry));
+}
+
+static void wq_table_clear_entry(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+	entry->n_wqs = 0;
+	entry->cur_wq = 0;
+	memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
+}
+
+static void clear_wq_table(void)
+{
+	int cpu;
+
+	for (cpu = 0; cpu < nr_cpus; cpu++)
+		wq_table_clear_entry(cpu);
+
+	pr_debug("cleared wq table\n");
+}
+
+static void free_wq_table(void)
+{
+	int cpu;
+
+	for (cpu = 0; cpu < nr_cpus; cpu++)
+		wq_table_free_entry(cpu);
+
+	free_percpu(wq_table);
+
+	pr_debug("freed wq table\n");
+}
+
+static int alloc_wq_table(int max_wqs)
+{
+	struct wq_table_entry *entry;
+	int cpu;
+
+	wq_table = alloc_percpu(struct wq_table_entry);
+	if (!wq_table)
+		return -ENOMEM;
+
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		entry = per_cpu_ptr(wq_table, cpu);
+		entry->wqs = kcalloc(max_wqs, sizeof(struct wq *), GFP_KERNEL);
+		if (!entry->wqs) {
+			free_wq_table();
+			return -ENOMEM;
+		}
+
+		entry->max_wqs = max_wqs;
+	}
+
+	pr_debug("initialized wq table\n");
+
+	return 0;
+}
+
+static void wq_table_add(int cpu, struct idxd_wq *wq)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+	if (WARN_ON(entry->n_wqs == entry->max_wqs))
+		return;
+
+	entry->wqs[entry->n_wqs++] = wq;
+
+	pr_debug("%s: added iaa wq %d.%d to idx %d of cpu %d\n", __func__,
+		 entry->wqs[entry->n_wqs - 1]->idxd->id,
+		 entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
+}
+
 static int wq_table_add_wqs(int iaa, int cpu)
 {
 	struct iaa_device *iaa_device, *found_device = NULL;
@@ -939,6 +933,29 @@ static void rebalance_wq_table(void)
 	}
 }
 
+/***************************************************************
+ * Assign work-queues for driver ops using per-cpu wq_tables.
+ ***************************************************************/
+static struct idxd_wq *wq_table_next_wq(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+	if (++entry->cur_wq >= entry->n_wqs)
+		entry->cur_wq = 0;
+
+	if (!entry->wqs[entry->cur_wq])
+		return NULL;
+
+	pr_debug("%s: returning wq at idx %d (iaa wq %d.%d) from cpu %d\n", __func__,
+		 entry->cur_wq, entry->wqs[entry->cur_wq]->idxd->id,
+		 entry->wqs[entry->cur_wq]->id, cpu);
+
+	return entry->wqs[entry->cur_wq];
+}
+
+/*************************************************
+ * Core iaa_crypto compress/decompress functions.
+ *************************************************/
 static inline int check_completion(struct device *dev,
 				   struct iax_completion_record *comp,
 				   bool compress,
@@ -1010,13 +1027,130 @@ static int deflate_generic_decompress(struct acomp_req *req)
 
 static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
 				struct acomp_req *req,
-				dma_addr_t *src_addr, dma_addr_t *dst_addr);
+				dma_addr_t *src_addr, dma_addr_t *dst_addr)
+{
+	int ret = 0;
+	int nr_sgs;
+
+	dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
+	dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
+
+	nr_sgs = dma_map_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
+	if (nr_sgs <= 0 || nr_sgs > 1) {
+		dev_dbg(dev, "verify: couldn't map src sg for iaa device %d,"
+			" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
+			iaa_wq->wq->id, ret);
+		ret = -EIO;
+		goto out;
+	}
+	*src_addr = sg_dma_address(req->src);
+	dev_dbg(dev, "verify: dma_map_sg, src_addr %llx, nr_sgs %d, req->src %p,"
+		" req->slen %d, sg_dma_len(sg) %d\n", *src_addr, nr_sgs,
+		req->src, req->slen, sg_dma_len(req->src));
+
+	nr_sgs = dma_map_sg(dev, req->dst, sg_nents(req->dst), DMA_TO_DEVICE);
+	if (nr_sgs <= 0 || nr_sgs > 1) {
+		dev_dbg(dev, "verify: couldn't map dst sg for iaa device %d,"
+			" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
+			iaa_wq->wq->id, ret);
+		ret = -EIO;
+		dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
+		goto out;
+	}
+	*dst_addr = sg_dma_address(req->dst);
+	dev_dbg(dev, "verify: dma_map_sg, dst_addr %llx, nr_sgs %d, req->dst %p,"
+		" req->dlen %d, sg_dma_len(sg) %d\n", *dst_addr, nr_sgs,
+		req->dst, req->dlen, sg_dma_len(req->dst));
+out:
+	return ret;
+}
 
 static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
 			       struct idxd_wq *wq,
 			       dma_addr_t src_addr, unsigned int slen,
 			       dma_addr_t dst_addr, unsigned int *dlen,
-			       u32 compression_crc);
+			       u32 compression_crc)
+{
+	struct iaa_device_compression_mode *active_compression_mode;
+	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
+	struct iaa_device *iaa_device;
+	struct idxd_desc *idxd_desc;
+	struct iax_hw_desc *desc;
+	struct idxd_device *idxd;
+	struct iaa_wq *iaa_wq;
+	struct pci_dev *pdev;
+	struct device *dev;
+	int ret = 0;
+
+	iaa_wq = idxd_wq_get_private(wq);
+	iaa_device = iaa_wq->iaa_device;
+	idxd = iaa_device->idxd;
+	pdev = idxd->pdev;
+	dev = &pdev->dev;
+
+	active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
+
+	idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
+	if (IS_ERR(idxd_desc)) {
+		dev_dbg(dev, "idxd descriptor allocation failed\n");
+		dev_dbg(dev, "iaa compress failed: ret=%ld\n",
+			PTR_ERR(idxd_desc));
+		return PTR_ERR(idxd_desc);
+	}
+	desc = idxd_desc->iax_hw;
+
+	/* Verify (optional) - decompress and check crc, suppress dest write */
+
+	desc->flags = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CC;
+	desc->opcode = IAX_OPCODE_DECOMPRESS;
+	desc->decompr_flags = IAA_DECOMP_FLAGS | IAA_DECOMP_SUPPRESS_OUTPUT;
+	desc->priv = 0;
+
+	desc->src1_addr = (u64)dst_addr;
+	desc->src1_size = *dlen;
+	desc->dst_addr = (u64)src_addr;
+	desc->max_dst_size = slen;
+	desc->completion_addr = idxd_desc->compl_dma;
+
+	dev_dbg(dev, "(verify) compression mode %s,"
+		" desc->src1_addr %llx, desc->src1_size %d,"
+		" desc->dst_addr %llx, desc->max_dst_size %d,"
+		" desc->src2_addr %llx, desc->src2_size %d\n",
+		active_compression_mode->name,
+		desc->src1_addr, desc->src1_size, desc->dst_addr,
+		desc->max_dst_size, desc->src2_addr, desc->src2_size);
+
+	ret = idxd_submit_desc(wq, idxd_desc);
+	if (ret) {
+		dev_dbg(dev, "submit_desc (verify) failed ret=%d\n", ret);
+		goto err;
+	}
+
+	ret = check_completion(dev, idxd_desc->iax_completion, false, false);
+	if (ret) {
+		dev_dbg(dev, "(verify) check_completion failed ret=%d\n", ret);
+		goto err;
+	}
+
+	if (compression_crc != idxd_desc->iax_completion->crc) {
+		ret = -EINVAL;
+		dev_dbg(dev, "(verify) iaa comp/decomp crc mismatch:"
+			" comp=0x%x, decomp=0x%x\n", compression_crc,
+			idxd_desc->iax_completion->crc);
+		print_hex_dump(KERN_INFO, "cmp-rec: ", DUMP_PREFIX_OFFSET,
+			       8, 1, idxd_desc->iax_completion, 64, 0);
+		goto err;
+	}
+
+	idxd_free_desc(wq, idxd_desc);
+out:
+	return ret;
+err:
+	idxd_free_desc(wq, idxd_desc);
+	dev_dbg(dev, "iaa compress failed: ret=%d\n", ret);
+
+	goto out;
+}
 
 static void iaa_desc_complete(struct idxd_desc *idxd_desc,
 			      enum idxd_complete_type comp_type,
@@ -1235,133 +1369,6 @@ static int iaa_compress(struct crypto_tfm *tfm,	struct acomp_req *req,
 	goto out;
 }
 
-static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
-				struct acomp_req *req,
-				dma_addr_t *src_addr, dma_addr_t *dst_addr)
-{
-	int ret = 0;
-	int nr_sgs;
-
-	dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
-	dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
-
-	nr_sgs = dma_map_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
-	if (nr_sgs <= 0 || nr_sgs > 1) {
-		dev_dbg(dev, "verify: couldn't map src sg for iaa device %d,"
-			" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
-			iaa_wq->wq->id, ret);
-		ret = -EIO;
-		goto out;
-	}
-	*src_addr = sg_dma_address(req->src);
-	dev_dbg(dev, "verify: dma_map_sg, src_addr %llx, nr_sgs %d, req->src %p,"
-		" req->slen %d, sg_dma_len(sg) %d\n", *src_addr, nr_sgs,
-		req->src, req->slen, sg_dma_len(req->src));
-
-	nr_sgs = dma_map_sg(dev, req->dst, sg_nents(req->dst), DMA_TO_DEVICE);
-	if (nr_sgs <= 0 || nr_sgs > 1) {
-		dev_dbg(dev, "verify: couldn't map dst sg for iaa device %d,"
-			" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
-			iaa_wq->wq->id, ret);
-		ret = -EIO;
-		dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
-		goto out;
-	}
-	*dst_addr = sg_dma_address(req->dst);
-	dev_dbg(dev, "verify: dma_map_sg, dst_addr %llx, nr_sgs %d, req->dst %p,"
-		" req->dlen %d, sg_dma_len(sg) %d\n", *dst_addr, nr_sgs,
-		req->dst, req->dlen, sg_dma_len(req->dst));
-out:
-	return ret;
-}
-
-static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
-			       struct idxd_wq *wq,
-			       dma_addr_t src_addr, unsigned int slen,
-			       dma_addr_t dst_addr, unsigned int *dlen,
-			       u32 compression_crc)
-{
-	struct iaa_device_compression_mode *active_compression_mode;
-	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
-	struct iaa_device *iaa_device;
-	struct idxd_desc *idxd_desc;
-	struct iax_hw_desc *desc;
-	struct idxd_device *idxd;
-	struct iaa_wq *iaa_wq;
-	struct pci_dev *pdev;
-	struct device *dev;
-	int ret = 0;
-
-	iaa_wq = idxd_wq_get_private(wq);
-	iaa_device = iaa_wq->iaa_device;
-	idxd = iaa_device->idxd;
-	pdev = idxd->pdev;
-	dev = &pdev->dev;
-
-	active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
-
-	idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
-	if (IS_ERR(idxd_desc)) {
-		dev_dbg(dev, "idxd descriptor allocation failed\n");
-		dev_dbg(dev, "iaa compress failed: ret=%ld\n",
-			PTR_ERR(idxd_desc));
-		return PTR_ERR(idxd_desc);
-	}
-	desc = idxd_desc->iax_hw;
-
-	/* Verify (optional) - decompress and check crc, suppress dest write */
-
-	desc->flags = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CC;
-	desc->opcode = IAX_OPCODE_DECOMPRESS;
-	desc->decompr_flags = IAA_DECOMP_FLAGS | IAA_DECOMP_SUPPRESS_OUTPUT;
-	desc->priv = 0;
-
-	desc->src1_addr = (u64)dst_addr;
-	desc->src1_size = *dlen;
-	desc->dst_addr = (u64)src_addr;
-	desc->max_dst_size = slen;
-	desc->completion_addr = idxd_desc->compl_dma;
-
-	dev_dbg(dev, "(verify) compression mode %s,"
-		" desc->src1_addr %llx, desc->src1_size %d,"
-		" desc->dst_addr %llx, desc->max_dst_size %d,"
-		" desc->src2_addr %llx, desc->src2_size %d\n",
-		active_compression_mode->name,
-		desc->src1_addr, desc->src1_size, desc->dst_addr,
-		desc->max_dst_size, desc->src2_addr, desc->src2_size);
-
-	ret = idxd_submit_desc(wq, idxd_desc);
-	if (ret) {
-		dev_dbg(dev, "submit_desc (verify) failed ret=%d\n", ret);
-		goto err;
-	}
-
-	ret = check_completion(dev, idxd_desc->iax_completion, false, false);
-	if (ret) {
-		dev_dbg(dev, "(verify) check_completion failed ret=%d\n", ret);
-		goto err;
-	}
-
-	if (compression_crc != idxd_desc->iax_completion->crc) {
-		ret = -EINVAL;
-		dev_dbg(dev, "(verify) iaa comp/decomp crc mismatch:"
-			" comp=0x%x, decomp=0x%x\n", compression_crc,
-			idxd_desc->iax_completion->crc);
-		print_hex_dump(KERN_INFO, "cmp-rec: ", DUMP_PREFIX_OFFSET,
-			       8, 1, idxd_desc->iax_completion, 64, 0);
-		goto err;
-	}
-
-	idxd_free_desc(wq, idxd_desc);
-out:
-	return ret;
-err:
-	idxd_free_desc(wq, idxd_desc);
-	dev_dbg(dev, "iaa compress failed: ret=%d\n", ret);
-
-	goto out;
-}
-
 static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 			  struct idxd_wq *wq,
 			  dma_addr_t src_addr, unsigned int slen,
@@ -2132,6 +2139,9 @@ static void iaa_comp_adecompress_batch(
 				   crypto_req_done, wait);
 }
 
+/*********************************************
+ * Interfaces to crypto_alg and crypto_acomp.
+ *********************************************/
 static int iaa_comp_init_fixed(struct crypto_acomp *acomp_tfm)
 {
 	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
-- 
2.27.0




* [PATCH v4 07/10] crypto: iaa - Map IAA devices/wqs to cores based on packages instead of NUMA.
  2024-11-23  7:01 [PATCH v4 00/10] zswap IAA compress batching Kanchana P Sridhar
                   ` (5 preceding siblings ...)
  2024-11-23  7:01 ` [PATCH v4 06/10] crypto: iaa - Re-organize the iaa_crypto driver code Kanchana P Sridhar
@ 2024-11-23  7:01 ` Kanchana P Sridhar
  2024-11-23  7:01 ` [PATCH v4 08/10] crypto: iaa - Distribute compress jobs from all cores to all IAAs on a package Kanchana P Sridhar
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 39+ messages in thread
From: Kanchana P Sridhar @ 2024-11-23  7:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch modifies the algorithm for mapping available IAA devices and
wqs to cores, as they are being discovered, based on packages instead of
NUMA nodes. This leads to a more realistic mapping of IAA devices as
compression/decompression resources for a package, rather than for a NUMA
node. This also resolves problems that were observed during internal
validation on Intel platforms with many more NUMA nodes than packages: for
such cases, the earlier NUMA-based allocation caused some IAAs to be
over-subscribed and some to not be utilized at all.
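
To make the package-based mapping concrete, here is a minimal userspace
sketch of the arithmetic, not the driver code itself. The struct
iaa_topology and model_cpu_to_iaa() names are hypothetical, and the
package id is passed in by the caller, whereas the driver obtains it from
the kernel topology API.

```c
/*
 * Userspace model of the package-based cpu -> IAA mapping.
 * All names here are illustrative; the topology values are
 * supplied by the caller rather than queried from the kernel.
 */
struct iaa_topology {
	int nr_packages;
	int nr_cpus_per_package;	/* cores per package */
	int nr_iaa;			/* IAA devices probed */
};

static int model_cpu_to_iaa(const struct iaa_topology *t, int cpu,
			    int package_id)
{
	int cpus_per_iaa, nr_iaa_per_package, iaa;

	if (t->nr_packages <= 0 || t->nr_iaa <= 0 ||
	    t->nr_cpus_per_package <= 0)
		return 0;

	/* Number of cores that share one IAA device. */
	cpus_per_iaa = (t->nr_packages * t->nr_cpus_per_package) / t->nr_iaa;
	if (!cpus_per_iaa)
		cpus_per_iaa = 1;

	nr_iaa_per_package = t->nr_iaa / t->nr_packages;
	if (!nr_iaa_per_package)
		nr_iaa_per_package = 1;

	/* First IAA of the cpu's package, plus the cpu's slot within it. */
	iaa = package_id * nr_iaa_per_package +
	      (cpu % t->nr_cpus_per_package) / cpus_per_iaa;

	/* Out-of-range results are clamped to the last device. */
	if (iaa >= 0 && iaa < t->nr_iaa)
		return iaa;

	return t->nr_iaa - 1;
}
```

For an assumed system with 2 packages, 56 cores per package and 8 IAA
devices, each IAA serves 14 cores, so cpus 0-13 on package 0 map to IAA
index 0 and cpus 14-27 map to index 1, independent of how many NUMA nodes
the package is split into.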

As a result of this change from NUMA to packages, some of the core
functions used by the iaa_crypto driver's "probe" and "remove" API
have been re-written. The new infrastructure maintains a static/global
mapping of "local wqs" per IAA device, in the "struct iaa_device" itself.
The earlier implementation would allocate memory per-cpu for this data,
which never changes once the IAA devices/wqs have been initialized.

Two main outcomes from this new iaa_crypto driver infrastructure are:

1) Resolves "task blocked for more than x seconds" errors observed during
   internal validation on Intel systems with the earlier NUMA node based
   mappings, which were root-caused to the non-optimal IAA-to-core mappings
   described earlier.

2) Results in a NUM_THREADS factor reduction in memory footprint cost of
   initializing IAA devices/wqs, due to eliminating the per-cpu copies of
   each IAA device's wqs. On a 384-core Intel Granite Rapids server with
   8 IAA devices, this saves 140MiB.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto.h      |  17 +-
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 276 ++++++++++++---------
 2 files changed, 171 insertions(+), 122 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index 56985e395263..ca317c5aaf27 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -46,6 +46,7 @@ struct iaa_wq {
 	struct idxd_wq		*wq;
 	int			ref;
 	bool			remove;
+	bool			mapped;
 
 	struct iaa_device	*iaa_device;
 
@@ -63,6 +64,13 @@ struct iaa_device_compression_mode {
 	dma_addr_t			aecs_comp_table_dma_addr;
 };
 
+struct wq_table_entry {
+	struct idxd_wq **wqs;
+	int	max_wqs;
+	int	n_wqs;
+	int	cur_wq;
+};
+
 /* Representation of IAA device with wqs, populated by probe */
 struct iaa_device {
 	struct list_head		list;
@@ -73,19 +81,14 @@ struct iaa_device {
 	int				n_wq;
 	struct list_head		wqs;
 
+	struct wq_table_entry		*iaa_local_wqs;
+
 	atomic64_t			comp_calls;
 	atomic64_t			comp_bytes;
 	atomic64_t			decomp_calls;
 	atomic64_t			decomp_bytes;
 };
 
-struct wq_table_entry {
-	struct idxd_wq **wqs;
-	int	max_wqs;
-	int	n_wqs;
-	int	cur_wq;
-};
-
 #define IAA_AECS_ALIGN			32
 
 /*
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index c2362e4525bd..28f2f5617bf0 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -30,8 +30,9 @@
 /* number of iaa instances probed */
 static unsigned int nr_iaa;
 static unsigned int nr_cpus;
-static unsigned int nr_nodes;
-static unsigned int nr_cpus_per_node;
+static unsigned int nr_packages;
+static unsigned int nr_cpus_per_package;
+static unsigned int nr_iaa_per_package;
 
 /* Number of physical cpus sharing each iaa instance */
 static unsigned int cpus_per_iaa;
@@ -462,17 +463,46 @@ static void remove_device_compression_modes(struct iaa_device *iaa_device)
  * Functions for use in crypto probe and remove interfaces:
  * allocate/init/query/deallocate devices/wqs.
  ***********************************************************/
-static struct iaa_device *iaa_device_alloc(void)
+static struct iaa_device *iaa_device_alloc(struct idxd_device *idxd)
 {
+	struct wq_table_entry *local;
 	struct iaa_device *iaa_device;
 
 	iaa_device = kzalloc(sizeof(*iaa_device), GFP_KERNEL);
 	if (!iaa_device)
-		return NULL;
+		goto err;
+
+	iaa_device->idxd = idxd;
+
+	/* IAA device's local wqs. */
+	iaa_device->iaa_local_wqs = kzalloc(sizeof(struct wq_table_entry), GFP_KERNEL);
+	if (!iaa_device->iaa_local_wqs)
+		goto err;
+
+	local = iaa_device->iaa_local_wqs;
+
+	local->wqs = kzalloc(iaa_device->idxd->max_wqs * sizeof(struct wq *), GFP_KERNEL);
+	if (!local->wqs)
+		goto err;
+
+	local->max_wqs = iaa_device->idxd->max_wqs;
+	local->n_wqs = 0;
 
 	INIT_LIST_HEAD(&iaa_device->wqs);
 
 	return iaa_device;
+
+err:
+	if (iaa_device) {
+		if (iaa_device->iaa_local_wqs) {
+			if (iaa_device->iaa_local_wqs->wqs)
+				kfree(iaa_device->iaa_local_wqs->wqs);
+			kfree(iaa_device->iaa_local_wqs);
+		}
+		kfree(iaa_device);
+	}
+
+	return NULL;
 }
 
 static bool iaa_has_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
@@ -491,12 +521,10 @@ static struct iaa_device *add_iaa_device(struct idxd_device *idxd)
 {
 	struct iaa_device *iaa_device;
 
-	iaa_device = iaa_device_alloc();
+	iaa_device = iaa_device_alloc(idxd);
 	if (!iaa_device)
 		return NULL;
 
-	iaa_device->idxd = idxd;
-
 	list_add_tail(&iaa_device->list, &iaa_devices);
 
 	nr_iaa++;
@@ -537,6 +565,7 @@ static int add_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq,
 	iaa_wq->wq = wq;
 	iaa_wq->iaa_device = iaa_device;
 	idxd_wq_set_private(wq, iaa_wq);
+	iaa_wq->mapped = false;
 
 	list_add_tail(&iaa_wq->list, &iaa_device->wqs);
 
@@ -580,6 +609,13 @@ static void free_iaa_device(struct iaa_device *iaa_device)
 		return;
 
 	remove_device_compression_modes(iaa_device);
+
+	if (iaa_device->iaa_local_wqs) {
+		if (iaa_device->iaa_local_wqs->wqs)
+			kfree(iaa_device->iaa_local_wqs->wqs);
+		kfree(iaa_device->iaa_local_wqs);
+	}
+
 	kfree(iaa_device);
 }
 
@@ -716,9 +752,14 @@ static int save_iaa_wq(struct idxd_wq *wq)
 	if (WARN_ON(nr_iaa == 0))
 		return -EINVAL;
 
-	cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
+	cpus_per_iaa = (nr_packages * nr_cpus_per_package) / nr_iaa;
 	if (!cpus_per_iaa)
 		cpus_per_iaa = 1;
+
+	nr_iaa_per_package = nr_iaa / nr_packages;
+	if (!nr_iaa_per_package)
+		nr_iaa_per_package = 1;
+
 out:
 	return 0;
 }
@@ -735,53 +776,45 @@ static void remove_iaa_wq(struct idxd_wq *wq)
 	}
 
 	if (nr_iaa) {
-		cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
+		cpus_per_iaa = (nr_packages * nr_cpus_per_package) / nr_iaa;
 		if (!cpus_per_iaa)
 			cpus_per_iaa = 1;
-	} else
+
+		nr_iaa_per_package = nr_iaa / nr_packages;
+		if (!nr_iaa_per_package)
+			nr_iaa_per_package = 1;
+	} else {
 		cpus_per_iaa = 1;
+		nr_iaa_per_package = 1;
+	}
 }
 
 /***************************************************************
  * Mapping IAA devices and wqs to cores with per-cpu wq_tables.
  ***************************************************************/
-static void wq_table_free_entry(int cpu)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
-	kfree(entry->wqs);
-	memset(entry, 0, sizeof(*entry));
-}
-
-static void wq_table_clear_entry(int cpu)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
-	entry->n_wqs = 0;
-	entry->cur_wq = 0;
-	memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
-}
-
-static void clear_wq_table(void)
+/*
+ * Given a cpu, find the closest IAA instance.  The idea is to try to
+ * choose the most appropriate IAA instance for a caller and spread
+ * available workqueues around to clients.
+ */
+static inline int cpu_to_iaa(int cpu)
 {
-	int cpu;
-
-	for (cpu = 0; cpu < nr_cpus; cpu++)
-		wq_table_clear_entry(cpu);
+	int package_id, base_iaa, iaa = 0;
 
-	pr_debug("cleared wq table\n");
-}
+	if (!nr_packages || !nr_iaa_per_package)
+		return 0;
 
-static void free_wq_table(void)
-{
-	int cpu;
+	package_id = topology_logical_package_id(cpu);
+	base_iaa = package_id * nr_iaa_per_package;
+	iaa = base_iaa + ((cpu % nr_cpus_per_package) / cpus_per_iaa);
 
-	for (cpu = 0; cpu < nr_cpus; cpu++)
-		wq_table_free_entry(cpu);
+	pr_debug("cpu = %d, package_id = %d, base_iaa = %d, iaa = %d\n",
+		 cpu, package_id, base_iaa, iaa);
 
-	free_percpu(wq_table);
+	if (iaa >= 0 && iaa < nr_iaa)
+		return iaa;
 
-	pr_debug("freed wq table\n");
+	return (nr_iaa - 1);
 }
 
 static int alloc_wq_table(int max_wqs)
@@ -795,13 +828,11 @@ static int alloc_wq_table(int max_wqs)
 
 	for (cpu = 0; cpu < nr_cpus; cpu++) {
 		entry = per_cpu_ptr(wq_table, cpu);
-		entry->wqs = kcalloc(max_wqs, sizeof(struct wq *), GFP_KERNEL);
-		if (!entry->wqs) {
-			free_wq_table();
-			return -ENOMEM;
-		}
 
+		entry->wqs = NULL;
 		entry->max_wqs = max_wqs;
+		entry->n_wqs = 0;
+		entry->cur_wq = 0;
 	}
 
 	pr_debug("initialized wq table\n");
@@ -809,33 +840,27 @@ static int alloc_wq_table(int max_wqs)
 	return 0;
 }
 
-static void wq_table_add(int cpu, struct idxd_wq *wq)
+static void wq_table_add(int cpu, struct wq_table_entry *iaa_local_wqs)
 {
 	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
 
-	if (WARN_ON(entry->n_wqs == entry->max_wqs))
-		return;
-
-	entry->wqs[entry->n_wqs++] = wq;
+	entry->wqs = iaa_local_wqs->wqs;
+	entry->max_wqs = iaa_local_wqs->max_wqs;
+	entry->n_wqs = iaa_local_wqs->n_wqs;
+	entry->cur_wq = 0;
 
-	pr_debug("%s: added iaa wq %d.%d to idx %d of cpu %d\n", __func__,
+	pr_debug("%s: cpu %d: added %d iaa local wqs up to wq %d.%d\n", __func__,
+		 cpu, entry->n_wqs,
 		 entry->wqs[entry->n_wqs - 1]->idxd->id,
-		 entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
+		 entry->wqs[entry->n_wqs - 1]->id);
 }
 
 static int wq_table_add_wqs(int iaa, int cpu)
 {
 	struct iaa_device *iaa_device, *found_device = NULL;
-	int ret = 0, cur_iaa = 0, n_wqs_added = 0;
-	struct idxd_device *idxd;
-	struct iaa_wq *iaa_wq;
-	struct pci_dev *pdev;
-	struct device *dev;
+	int ret = 0, cur_iaa = 0;
 
 	list_for_each_entry(iaa_device, &iaa_devices, list) {
-		idxd = iaa_device->idxd;
-		pdev = idxd->pdev;
-		dev = &pdev->dev;
 
 		if (cur_iaa != iaa) {
 			cur_iaa++;
@@ -843,7 +868,8 @@ static int wq_table_add_wqs(int iaa, int cpu)
 		}
 
 		found_device = iaa_device;
-		dev_dbg(dev, "getting wq from iaa_device %d, cur_iaa %d\n",
+		dev_dbg(&found_device->idxd->pdev->dev,
+			"getting wq from iaa_device %d, cur_iaa %d\n",
 			found_device->idxd->id, cur_iaa);
 		break;
 	}
@@ -858,29 +884,58 @@ static int wq_table_add_wqs(int iaa, int cpu)
 		}
 		cur_iaa = 0;
 
-		idxd = found_device->idxd;
-		pdev = idxd->pdev;
-		dev = &pdev->dev;
-		dev_dbg(dev, "getting wq from only iaa_device %d, cur_iaa %d\n",
+		dev_dbg(&found_device->idxd->pdev->dev,
+			"getting wq from only iaa_device %d, cur_iaa %d\n",
 			found_device->idxd->id, cur_iaa);
 	}
 
-	list_for_each_entry(iaa_wq, &found_device->wqs, list) {
-		wq_table_add(cpu, iaa_wq->wq);
-		pr_debug("rebalance: added wq for cpu=%d: iaa wq %d.%d\n",
-			 cpu, iaa_wq->wq->idxd->id, iaa_wq->wq->id);
-		n_wqs_added++;
+	wq_table_add(cpu, found_device->iaa_local_wqs);
+
+out:
+	return ret;
+}
+
+static int map_iaa_device_wqs(struct iaa_device *iaa_device)
+{
+	struct wq_table_entry *local;
+	int ret = 0, n_wqs_added = 0;
+	struct iaa_wq *iaa_wq;
+
+	local = iaa_device->iaa_local_wqs;
+
+	list_for_each_entry(iaa_wq, &iaa_device->wqs, list) {
+		if (iaa_wq->mapped && ++n_wqs_added)
+			continue;
+
+		pr_debug("iaa_device %px: processing wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+
+		if (WARN_ON(local->n_wqs == local->max_wqs))
+			break;
+
+		local->wqs[local->n_wqs++] = iaa_wq->wq;
+		pr_debug("iaa_device %px: added local wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+
+		iaa_wq->mapped = true;
+		++n_wqs_added;
 	}
 
-	if (!n_wqs_added) {
-		pr_debug("couldn't find any iaa wqs!\n");
+	if (!n_wqs_added && !iaa_device->n_wq) {
+		pr_debug("iaa_device %d: couldn't find any iaa wqs!\n", iaa_device->idxd->id);
 		ret = -EINVAL;
-		goto out;
 	}
-out:
+
 	return ret;
 }
 
+static void map_iaa_devices(void)
+{
+	struct iaa_device *iaa_device;
+
+	list_for_each_entry(iaa_device, &iaa_devices, list) {
+		BUG_ON(map_iaa_device_wqs(iaa_device));
+	}
+}
+
 /*
  * Rebalance the wq table so that given a cpu, it's easy to find the
  * closest IAA instance.  The idea is to try to choose the most
@@ -889,48 +944,42 @@ static int wq_table_add_wqs(int iaa, int cpu)
  */
 static void rebalance_wq_table(void)
 {
-	const struct cpumask *node_cpus;
-	int node, cpu, iaa = -1;
+	int cpu, iaa;
 
 	if (nr_iaa == 0)
 		return;
 
-	pr_debug("rebalance: nr_nodes=%d, nr_cpus %d, nr_iaa %d, cpus_per_iaa %d\n",
-		 nr_nodes, nr_cpus, nr_iaa, cpus_per_iaa);
+	map_iaa_devices();
 
-	clear_wq_table();
+	pr_debug("rebalance: nr_packages=%d, nr_cpus %d, nr_iaa %d, cpus_per_iaa %d\n",
+		 nr_packages, nr_cpus, nr_iaa, cpus_per_iaa);
 
-	if (nr_iaa == 1) {
-		for (cpu = 0; cpu < nr_cpus; cpu++) {
-			if (WARN_ON(wq_table_add_wqs(0, cpu))) {
-				pr_debug("could not add any wqs for iaa 0 to cpu %d!\n", cpu);
-				return;
-			}
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		iaa = cpu_to_iaa(cpu);
+		pr_debug("rebalance: cpu=%d iaa=%d\n", cpu, iaa);
+
+		if (WARN_ON(iaa == -1)) {
+			pr_debug("rebalance (cpu_to_iaa(%d)) failed!\n", cpu);
+			return;
 		}
 
-		return;
+		if (WARN_ON(wq_table_add_wqs(iaa, cpu))) {
+			pr_debug("could not add any wqs for iaa %d to cpu %d!\n", iaa, cpu);
+			return;
+		}
 	}
 
-	for_each_node_with_cpus(node) {
-		node_cpus = cpumask_of_node(node);
-
-		for (cpu = 0; cpu <  cpumask_weight(node_cpus); cpu++) {
-			int node_cpu = cpumask_nth(cpu, node_cpus);
-
-			if (WARN_ON(node_cpu >= nr_cpu_ids)) {
-				pr_debug("node_cpu %d doesn't exist!\n", node_cpu);
-				return;
-			}
-
-			if ((cpu % cpus_per_iaa) == 0)
-				iaa++;
+	pr_debug("Finished rebalance local wqs.\n");
+}
 
-			if (WARN_ON(wq_table_add_wqs(iaa, node_cpu))) {
-				pr_debug("could not add any wqs for iaa %d to cpu %d!\n", iaa, cpu);
-				return;
-			}
-		}
+static void free_wq_tables(void)
+{
+	if (wq_table) {
+		free_percpu(wq_table);
+		wq_table = NULL;
 	}
+
+	pr_debug("freed local wq table\n");
 }
 
 /***************************************************************
@@ -2281,7 +2330,7 @@ static int iaa_crypto_probe(struct idxd_dev *idxd_dev)
 	free_iaa_wq(idxd_wq_get_private(wq));
 err_save:
 	if (first_wq)
-		free_wq_table();
+		free_wq_tables();
 err_alloc:
 	mutex_unlock(&iaa_devices_lock);
 	idxd_drv_disable_wq(wq);
@@ -2331,7 +2380,9 @@ static void iaa_crypto_remove(struct idxd_dev *idxd_dev)
 
 	if (nr_iaa == 0) {
 		iaa_crypto_enabled = false;
-		free_wq_table();
+		free_wq_tables();
+		BUG_ON(!list_empty(&iaa_devices));
+		INIT_LIST_HEAD(&iaa_devices);
 		module_put(THIS_MODULE);
 
 		pr_info("iaa_crypto now DISABLED\n");
@@ -2357,16 +2408,11 @@ static struct idxd_device_driver iaa_crypto_driver = {
 static int __init iaa_crypto_init_module(void)
 {
 	int ret = 0;
-	int node;
+	INIT_LIST_HEAD(&iaa_devices);
 
 	nr_cpus = num_possible_cpus();
-	for_each_node_with_cpus(node)
-		nr_nodes++;
-	if (!nr_nodes) {
-		pr_err("IAA couldn't find any nodes with cpus\n");
-		return -ENODEV;
-	}
-	nr_cpus_per_node = nr_cpus / nr_nodes;
+	nr_cpus_per_package = topology_num_cores_per_package();
+	nr_packages = topology_max_packages();
 
 	if (crypto_has_comp("deflate-generic", 0, 0))
 		deflate_generic_tfm = crypto_alloc_comp("deflate-generic", 0, 0);
-- 
2.27.0




* [PATCH v4 08/10] crypto: iaa - Distribute compress jobs from all cores to all IAAs on a package.
  2024-11-23  7:01 [PATCH v4 00/10] zswap IAA compress batching Kanchana P Sridhar
                   ` (6 preceding siblings ...)
  2024-11-23  7:01 ` [PATCH v4 07/10] crypto: iaa - Map IAA devices/wqs to cores based on packages instead of NUMA Kanchana P Sridhar
@ 2024-11-23  7:01 ` Kanchana P Sridhar
  2024-11-23  7:01 ` [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching Kanchana P Sridhar
  2024-11-23  7:01 ` [PATCH v4 10/10] mm: zswap: Compress batching with Intel IAA in zswap_batch_store() of large folios Kanchana P Sridhar
  9 siblings, 0 replies; 39+ messages in thread
From: Kanchana P Sridhar @ 2024-11-23  7:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This change enables processes running on any logical core on a package to
use all the IAA devices enabled on that package for compress jobs. In
other words, compressions originating from any process in a package will be
distributed in a round-robin manner to the available IAA devices on the same
package.

The main premise behind this change is to make sure that no compress
engines on any IAA device are un-utilized/under-utilized/over-utilized.
In other words, the compress engines on all IAA devices are considered a
global resource for that package, thus maximizing compression throughput.

This allows the use of all IAA devices present in a given package for
(batched) compressions originating from zswap/zram, from all cores
on this package.

A new per-cpu "global_wq_table" implements this in the iaa_crypto driver.
We can think of the global WQ per IAA as a WQ to which all cores on
that package can submit compress jobs.

To use this feature, the user must configure 2 WQs per IAA, which enables
distribution of compress jobs to multiple IAA devices.

Each IAA will have 2 WQs:
 wq.0 (local WQ):
   Used for decompress jobs from cores mapped by the cpu_to_iaa() "even
   balancing of logical cores to IAA devices" algorithm.

 wq.1 (global WQ):
   Used for compress jobs from *all* logical cores on that package.

The iaa_crypto driver will place all global WQs from all same-package IAA
devices in the global_wq_table per cpu on that package. When the driver
receives a compress job, it will look up the "next" global WQ in the cpu's
global_wq_table to submit the descriptor.

The starting wq in the global_wq_table for each cpu is the global wq
associated with the IAA nearest to it, so that we stagger the starting
global wq for each process. This results in very uniform usage of all IAAs
for compress jobs.
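
The staggered starting point can be illustrated with a small userspace
sketch. It follows the start_wq computation in this patch, but
model_start_wq() is a hypothetical helper and local_iaa stands for the
index that the cpu_to_iaa() mapping returns for the cpu.

```c
/*
 * Model of the per-cpu starting index into a package's global wq table.
 * local_iaa: IAA index that the cpu -> IAA mapping assigns to this cpu.
 * n_wqs:     number of global wqs in the package's table.
 */
static int model_start_wq(int local_iaa, int nr_iaa_per_package,
			  int g_wqs_per_iaa, int n_wqs)
{
	int start_wq;

	if (nr_iaa_per_package <= 0)
		return 0;

	/* Slot of the cpu's nearest IAA within its package. */
	start_wq = g_wqs_per_iaa * (local_iaa % nr_iaa_per_package);

	/* Keep the default slot 0 if the computed slot is out of range. */
	if (start_wq >= 0 && start_wq < n_wqs)
		return start_wq;

	return 0;
}
```

With 4 IAAs per package and g_wqs_per_iaa set to 1, cpus whose nearest IAA
is the second device on the package start at slot 1, those nearest the
third start at slot 2, and so on, so cpus do not all begin on the same
device.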

Two new driver module parameters are added for this feature:

g_wqs_per_iaa (default 0):

 /sys/bus/dsa/drivers/crypto/g_wqs_per_iaa

 This represents the number of global WQs that can be configured per IAA
 device. The recommended setting is 1 to enable the use of this feature
 once the user configures 2 WQs per IAA using higher level scripts as
 described in Documentation/driver-api/crypto/iaa/iaa-crypto.rst.

g_consec_descs_per_gwq (default 1):

 /sys/bus/dsa/drivers/crypto/g_consec_descs_per_gwq

 This represents the number of consecutive compress jobs that will be
 submitted to the same global WQ (i.e. to the same IAA device) from a given
 core, before moving to the next global WQ. The default is 1, which is also
 the recommended setting to avail of this feature.
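
As a rough illustration of how g_consec_descs_per_gwq could shape the
round-robin order, here is a userspace sketch. It is an assumption about
the selection logic, not a copy of the driver: struct gwq_cursor and
model_next_global_wq() are hypothetical names, and the cursor stands in
for the per-cpu table entry and per-cpu descriptor counter.

```c
/* Per-cpu state in this model (hypothetical layout). */
struct gwq_cursor {
	int cur_wq;		/* index of the global wq currently in use */
	int n_wqs;		/* number of global wqs in this cpu's table */
	int consec_descs;	/* jobs already submitted to cur_wq */
};

/* Returns the wq index for the next compress job, or -1 if none exist. */
static int model_next_global_wq(struct gwq_cursor *c,
				int g_consec_descs_per_gwq)
{
	if (c->n_wqs <= 0)
		return -1;

	/* Treat non-positive settings as 1 in this model. */
	if (g_consec_descs_per_gwq < 1)
		g_consec_descs_per_gwq = 1;

	/* Advance once the current wq has received its share of jobs. */
	if (c->consec_descs >= g_consec_descs_per_gwq) {
		c->cur_wq = (c->cur_wq + 1) % c->n_wqs;
		c->consec_descs = 0;
	}

	c->consec_descs++;
	return c->cur_wq;
}
```

In this model, a setting of 1 and four global wqs makes a cpu that starts
at slot 1 visit slots 1, 2, 3, 0, 1 on successive compress jobs, while a
setting of 2 sends two jobs to each slot before moving on.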

The decompress jobs from any core will be sent to the "local" IAA, namely
the one that the driver assigns with the cpu_to_iaa() mapping algorithm
that evenly balances the assignment of logical cores to IAA devices on a
package.

On a 2-package Sapphire Rapids server where each package has 56 cores and
4 IAA devices, this is how the compress/decompress jobs will be mapped
when the user configures 2 WQs per IAA device (which implies wq.1 will
be added to the global WQ table for each logical core on that package):

 package(s):        2
 package0 CPU(s):   0-55,112-167
 package1 CPU(s):   56-111,168-223

 Compress jobs:
 --------------
 package 0:
 iaa_crypto will send compress jobs from all cpus (0-55,112-167) to all IAA
 devices on the package (iax1/iax3/iax5/iax7) in round-robin manner:
 iaa:   iax1           iax3           iax5           iax7

 package 1:
 iaa_crypto will send compress jobs from all cpus (56-111,168-223) to all
 IAA devices on the package (iax9/iax11/iax13/iax15) in round-robin manner:
 iaa:   iax9           iax11          iax13           iax15

 Decompress jobs:
 ----------------
 package 0:
 cpu   0-13,112-125   14-27,126-139  28-41,140-153  42-55,154-167
 iaa:  iax1           iax3           iax5           iax7

 package 1:
 cpu   56-69,168-181  70-83,182-195  84-97,196-209   98-111,210-223
 iaa:  iax9           iax11          iax13           iax15

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto.h      |   1 +
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 385 ++++++++++++++++++++-
 2 files changed, 378 insertions(+), 8 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index ca317c5aaf27..ca7326d6e9bf 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -82,6 +82,7 @@ struct iaa_device {
 	struct list_head		wqs;
 
 	struct wq_table_entry		*iaa_local_wqs;
+	struct wq_table_entry		*iaa_global_wqs;
 
 	atomic64_t			comp_calls;
 	atomic64_t			comp_bytes;
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 28f2f5617bf0..1cbf92d1b3e5 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -42,6 +42,18 @@ static struct crypto_comp *deflate_generic_tfm;
 /* Per-cpu lookup table for balanced wqs */
 static struct wq_table_entry __percpu *wq_table = NULL;
 
+static struct wq_table_entry **pkg_global_wq_tables = NULL;
+
+/* Per-cpu lookup table for global wqs shared by all cpus. */
+static struct wq_table_entry __percpu *global_wq_table = NULL;
+
+/*
+ * Per-cpu counter of consecutive descriptors allocated to
+ * the same wq in the global_wq_table, so that we know
+ * when to switch to the next wq in the global_wq_table.
+ */
+static int __percpu *num_consec_descs_per_wq = NULL;
+
 /* Verify results of IAA compress or not */
 static bool iaa_verify_compress = false;
 
@@ -79,6 +91,16 @@ static bool async_mode = true;
 /* Use interrupts */
 static bool use_irq;
 
+/* Number of global wqs per iaa */
+static int g_wqs_per_iaa = 0;
+
+/*
+ * Number of consecutive descriptors to allocate from a
+ * given global wq before switching to the next wq in
+ * the global_wq_table.
+ */
+static int g_consec_descs_per_gwq = 1;
+
 static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
 
 LIST_HEAD(iaa_devices);
@@ -180,6 +202,60 @@ static ssize_t sync_mode_store(struct device_driver *driver,
 }
 static DRIVER_ATTR_RW(sync_mode);
 
+static ssize_t g_wqs_per_iaa_show(struct device_driver *driver, char *buf)
+{
+	return sprintf(buf, "%d\n", g_wqs_per_iaa);
+}
+
+static ssize_t g_wqs_per_iaa_store(struct device_driver *driver,
+				   const char *buf, size_t count)
+{
+	int ret = -EBUSY;
+
+	mutex_lock(&iaa_devices_lock);
+
+	if (iaa_crypto_enabled)
+		goto out;
+
+	ret = kstrtoint(buf, 10, &g_wqs_per_iaa);
+	if (ret)
+		goto out;
+
+	ret = count;
+out:
+	mutex_unlock(&iaa_devices_lock);
+
+	return ret;
+}
+static DRIVER_ATTR_RW(g_wqs_per_iaa);
+
+static ssize_t g_consec_descs_per_gwq_show(struct device_driver *driver, char *buf)
+{
+	return sprintf(buf, "%d\n", g_consec_descs_per_gwq);
+}
+
+static ssize_t g_consec_descs_per_gwq_store(struct device_driver *driver,
+					    const char *buf, size_t count)
+{
+	int ret = -EBUSY;
+
+	mutex_lock(&iaa_devices_lock);
+
+	if (iaa_crypto_enabled)
+		goto out;
+
+	ret = kstrtoint(buf, 10, &g_consec_descs_per_gwq);
+	if (ret)
+		goto out;
+
+	ret = count;
+out:
+	mutex_unlock(&iaa_devices_lock);
+
+	return ret;
+}
+static DRIVER_ATTR_RW(g_consec_descs_per_gwq);
+
 /****************************
  * Driver compression modes.
  ****************************/
@@ -465,7 +541,7 @@ static void remove_device_compression_modes(struct iaa_device *iaa_device)
  ***********************************************************/
 static struct iaa_device *iaa_device_alloc(struct idxd_device *idxd)
 {
-	struct wq_table_entry *local;
+	struct wq_table_entry *local, *global;
 	struct iaa_device *iaa_device;
 
 	iaa_device = kzalloc(sizeof(*iaa_device), GFP_KERNEL);
@@ -488,6 +564,20 @@ static struct iaa_device *iaa_device_alloc(struct idxd_device *idxd)
 	local->max_wqs = iaa_device->idxd->max_wqs;
 	local->n_wqs = 0;
 
+	/* IAA device's global wqs. */
+	iaa_device->iaa_global_wqs = kzalloc(sizeof(struct wq_table_entry), GFP_KERNEL);
+	if (!iaa_device->iaa_global_wqs)
+		goto err;
+
+	global = iaa_device->iaa_global_wqs;
+
+	global->wqs = kzalloc(iaa_device->idxd->max_wqs * sizeof(struct wq *), GFP_KERNEL);
+	if (!global->wqs)
+		goto err;
+
+	global->max_wqs = iaa_device->idxd->max_wqs;
+	global->n_wqs = 0;
+
 	INIT_LIST_HEAD(&iaa_device->wqs);
 
 	return iaa_device;
@@ -499,6 +589,8 @@ static struct iaa_device *iaa_device_alloc(struct idxd_device *idxd)
 				kfree(iaa_device->iaa_local_wqs->wqs);
 			kfree(iaa_device->iaa_local_wqs);
 		}
+		if (iaa_device->iaa_global_wqs)
+			kfree(iaa_device->iaa_global_wqs);
 		kfree(iaa_device);
 	}
 
@@ -616,6 +708,12 @@ static void free_iaa_device(struct iaa_device *iaa_device)
 		kfree(iaa_device->iaa_local_wqs);
 	}
 
+	if (iaa_device->iaa_global_wqs) {
+		if (iaa_device->iaa_global_wqs->wqs)
+			kfree(iaa_device->iaa_global_wqs->wqs);
+		kfree(iaa_device->iaa_global_wqs);
+	}
+
 	kfree(iaa_device);
 }
 
@@ -817,6 +915,58 @@ static inline int cpu_to_iaa(int cpu)
 	return (nr_iaa - 1);
 }
 
+static void free_global_wq_table(void)
+{
+	if (global_wq_table) {
+		free_percpu(global_wq_table);
+		global_wq_table = NULL;
+	}
+
+	if (num_consec_descs_per_wq) {
+		free_percpu(num_consec_descs_per_wq);
+		num_consec_descs_per_wq = NULL;
+	}
+
+	pr_debug("freed global wq table\n");
+}
+
+static int pkg_global_wq_tables_alloc(void)
+{
+	int i, j;
+
+	pkg_global_wq_tables = kzalloc(nr_packages * sizeof(*pkg_global_wq_tables), GFP_KERNEL);
+	if (!pkg_global_wq_tables)
+		return -ENOMEM;
+
+	for (i = 0; i < nr_packages; ++i) {
+		pkg_global_wq_tables[i] = kzalloc(sizeof(struct wq_table_entry), GFP_KERNEL);
+
+		if (!pkg_global_wq_tables[i]) {
+			for (j = 0; j < i; ++j)
+				kfree(pkg_global_wq_tables[j]);
+			kfree(pkg_global_wq_tables);
+			pkg_global_wq_tables = NULL;
+			return -ENOMEM;
+		}
+		pkg_global_wq_tables[i]->wqs = NULL;
+	}
+
+	return 0;
+}
+
+static void pkg_global_wq_tables_dealloc(void)
+{
+	int i;
+
+	for (i = 0; i < nr_packages; ++i) {
+		if (pkg_global_wq_tables[i]->wqs)
+			kfree(pkg_global_wq_tables[i]->wqs);
+		kfree(pkg_global_wq_tables[i]);
+	}
+	kfree(pkg_global_wq_tables);
+	pkg_global_wq_tables = NULL;
+}
+
 static int alloc_wq_table(int max_wqs)
 {
 	struct wq_table_entry *entry;
@@ -835,6 +985,35 @@ static int alloc_wq_table(int max_wqs)
 		entry->cur_wq = 0;
 	}
 
+	global_wq_table = alloc_percpu(struct wq_table_entry);
+	if (!global_wq_table)
+		return 0;
+
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		entry = per_cpu_ptr(global_wq_table, cpu);
+
+		entry->wqs = NULL;
+		entry->max_wqs = max_wqs;
+		entry->n_wqs = 0;
+		entry->cur_wq = 0;
+	}
+
+	num_consec_descs_per_wq = alloc_percpu(int);
+	if (!num_consec_descs_per_wq) {
+		free_global_wq_table();
+		return 0;
+	}
+
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		int *num_consec_descs = per_cpu_ptr(num_consec_descs_per_wq, cpu);
+		*num_consec_descs = 0;
+	}
+
+	if (pkg_global_wq_tables_alloc()) {
+		free_global_wq_table();
+		return 0;
+	}
+
 	pr_debug("initialized wq table\n");
 
 	return 0;
@@ -895,13 +1074,120 @@ static int wq_table_add_wqs(int iaa, int cpu)
 	return ret;
 }
 
+static void pkg_global_wq_tables_reinit(void)
+{
+	int i, cur_iaa = 0, pkg = 0, nr_pkg_wqs = 0;
+	struct iaa_device *iaa_device;
+	struct wq_table_entry *global;
+
+	if (!pkg_global_wq_tables)
+		return;
+
+	/* Reallocate per-package wqs. */
+	list_for_each_entry(iaa_device, &iaa_devices, list) {
+		global = iaa_device->iaa_global_wqs;
+		nr_pkg_wqs += global->n_wqs;
+
+		if (++cur_iaa == nr_iaa_per_package) {
+			nr_pkg_wqs = nr_pkg_wqs ? max_t(int, iaa_device->idxd->max_wqs, nr_pkg_wqs) : 0;
+
+			if (pkg_global_wq_tables[pkg]->wqs) {
+				kfree(pkg_global_wq_tables[pkg]->wqs);
+				pkg_global_wq_tables[pkg]->wqs = NULL;
+			}
+
+			if (nr_pkg_wqs)
+				pkg_global_wq_tables[pkg]->wqs = kzalloc(nr_pkg_wqs *
+									 sizeof(struct wq *),
+									 GFP_KERNEL);
+
+			pkg_global_wq_tables[pkg]->n_wqs = 0;
+			pkg_global_wq_tables[pkg]->cur_wq = 0;
+			pkg_global_wq_tables[pkg]->max_wqs = nr_pkg_wqs;
+
+			if (++pkg == nr_packages)
+				break;
+			cur_iaa = 0;
+			nr_pkg_wqs = 0;
+		}
+	}
+
+	pkg = 0;
+	cur_iaa = 0;
+
+	/* Re-initialize per-package wqs. */
+	list_for_each_entry(iaa_device, &iaa_devices, list) {
+		global = iaa_device->iaa_global_wqs;
+
+		if (pkg_global_wq_tables[pkg]->wqs)
+			for (i = 0; i < global->n_wqs; ++i)
+				pkg_global_wq_tables[pkg]->wqs[pkg_global_wq_tables[pkg]->n_wqs++] = global->wqs[i];
+
+		pr_debug("pkg_global_wq_tables[%d] has %d wqs\n", pkg, pkg_global_wq_tables[pkg]->n_wqs);
+
+		if (++cur_iaa == nr_iaa_per_package) {
+			if (++pkg == nr_packages)
+				break;
+			cur_iaa = 0;
+		}
+	}
+}
+
+static void global_wq_table_add(int cpu, struct wq_table_entry *pkg_global_wq_table)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(global_wq_table, cpu);
+
+	/* This could be NULL. */
+	entry->wqs = pkg_global_wq_table->wqs;
+	entry->max_wqs = pkg_global_wq_table->max_wqs;
+	entry->n_wqs = pkg_global_wq_table->n_wqs;
+	entry->cur_wq = 0;
+
+	if (entry->wqs)
+		pr_debug("%s: cpu %d: added %d iaa global wqs up to wq %d.%d\n", __func__,
+			 cpu, entry->n_wqs,
+			 entry->wqs[entry->n_wqs - 1]->idxd->id,
+			 entry->wqs[entry->n_wqs - 1]->id);
+}
+
+static void global_wq_table_set_start_wq(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(global_wq_table, cpu);
+	int start_wq = g_wqs_per_iaa * (cpu_to_iaa(cpu) % nr_iaa_per_package);
+
+	if ((start_wq >= 0) && (start_wq < entry->n_wqs))
+		entry->cur_wq = start_wq;
+}
+
+static void global_wq_table_add_wqs(void)
+{
+	int cpu;
+
+	if (!pkg_global_wq_tables)
+		return;
+
+	for (cpu = 0; cpu < nr_cpus; cpu += nr_cpus_per_package) {
+		/* cpus on the same package get the same global_wq_table. */
+		int package_id = topology_logical_package_id(cpu);
+		int pkg_cpu;
+
+		for (pkg_cpu = cpu; pkg_cpu < cpu + nr_cpus_per_package; ++pkg_cpu) {
+			if (pkg_global_wq_tables[package_id]->n_wqs > 0) {
+				global_wq_table_add(pkg_cpu, pkg_global_wq_tables[package_id]);
+				global_wq_table_set_start_wq(pkg_cpu);
+			}
+		}
+	}
+}
+
 static int map_iaa_device_wqs(struct iaa_device *iaa_device)
 {
-	struct wq_table_entry *local;
+	struct wq_table_entry *local, *global;
 	int ret = 0, n_wqs_added = 0;
 	struct iaa_wq *iaa_wq;
 
 	local = iaa_device->iaa_local_wqs;
+	global = iaa_device->iaa_global_wqs;
 
 	list_for_each_entry(iaa_wq, &iaa_device->wqs, list) {
 		if (iaa_wq->mapped && ++n_wqs_added)
@@ -909,11 +1195,18 @@ static int map_iaa_device_wqs(struct iaa_device *iaa_device)
 
 		pr_debug("iaa_device %px: processing wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
 
-		if (WARN_ON(local->n_wqs == local->max_wqs))
-			break;
+		if ((!n_wqs_added || ((n_wqs_added + g_wqs_per_iaa) < iaa_device->n_wq)) &&
+			(local->n_wqs < local->max_wqs)) {
+
+			local->wqs[local->n_wqs++] = iaa_wq->wq;
+			pr_debug("iaa_device %px: added local wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+		} else {
+			if (WARN_ON(global->n_wqs == global->max_wqs))
+				break;
 
-		local->wqs[local->n_wqs++] = iaa_wq->wq;
-		pr_debug("iaa_device %px: added local wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+			global->wqs[global->n_wqs++] = iaa_wq->wq;
+			pr_debug("iaa_device %px: added global wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+		}
 
 		iaa_wq->mapped = true;
 		++n_wqs_added;
@@ -969,6 +1262,10 @@ static void rebalance_wq_table(void)
 		}
 	}
 
+	if (iaa_crypto_enabled && pkg_global_wq_tables) {
+		pkg_global_wq_tables_reinit();
+		global_wq_table_add_wqs();
+	}
 	pr_debug("Finished rebalance local wqs.");
 }
 
@@ -979,7 +1276,17 @@ static void free_wq_tables(void)
 		wq_table = NULL;
 	}
 
-	pr_debug("freed local wq table\n");
+	if (global_wq_table) {
+		free_percpu(global_wq_table);
+		global_wq_table = NULL;
+	}
+
+	if (num_consec_descs_per_wq) {
+		free_percpu(num_consec_descs_per_wq);
+		num_consec_descs_per_wq = NULL;
+	}
+
+	pr_debug("freed wq tables\n");
 }
 
 /***************************************************************
@@ -1002,6 +1309,35 @@ static struct idxd_wq *wq_table_next_wq(int cpu)
 	return entry->wqs[entry->cur_wq];
 }
 
+/*
+ * Caller should make sure to call only if the
+ * per_cpu_ptr "global_wq_table" is non-NULL
+ * and has at least one wq configured.
+ */
+static struct idxd_wq *global_wq_table_next_wq(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(global_wq_table, cpu);
+	int *num_consec_descs = per_cpu_ptr(num_consec_descs_per_wq, cpu);
+
+	/*
+	 * Fall-back to local IAA's wq if there were no global wqs configured
+	 * for any IAA device, or if there were problems in setting up global
+	 * wqs for this cpu's package.
+	 */
+	if (!entry->wqs)
+		return wq_table_next_wq(cpu);
+
+	if ((*num_consec_descs) == g_consec_descs_per_gwq) {
+		if (++entry->cur_wq >= entry->n_wqs)
+			entry->cur_wq = 0;
+		*num_consec_descs = 0;
+	}
+
+	++(*num_consec_descs);
+
+	return entry->wqs[entry->cur_wq];
+}
+
 /*************************************************
  * Core iaa_crypto compress/decompress functions.
  *************************************************/
@@ -1553,6 +1889,7 @@ static int iaa_comp_acompress(struct acomp_req *req)
 	struct idxd_wq *wq;
 	struct device *dev;
 	int order = -1;
+	struct wq_table_entry *entry;
 
 	compression_ctx = crypto_tfm_ctx(tfm);
 
@@ -1571,8 +1908,15 @@ static int iaa_comp_acompress(struct acomp_req *req)
 		disable_async = true;
 
 	cpu = get_cpu();
-	wq = wq_table_next_wq(cpu);
+	entry = per_cpu_ptr(global_wq_table, cpu);
+
+	if (!entry || !entry->wqs || entry->n_wqs == 0) {
+		wq = wq_table_next_wq(cpu);
+	} else {
+		wq = global_wq_table_next_wq(cpu);
+	}
 	put_cpu();
+
 	if (!wq) {
 		pr_debug("no wq configured for cpu=%d\n", cpu);
 		return -ENODEV;
@@ -2380,6 +2724,7 @@ static void iaa_crypto_remove(struct idxd_dev *idxd_dev)
 
 	if (nr_iaa == 0) {
 		iaa_crypto_enabled = false;
+		pkg_global_wq_tables_dealloc();
 		free_wq_tables();
 		BUG_ON(!list_empty(&iaa_devices));
 		INIT_LIST_HEAD(&iaa_devices);
@@ -2449,6 +2794,20 @@ static int __init iaa_crypto_init_module(void)
 		goto err_sync_attr_create;
 	}
 
+	ret = driver_create_file(&iaa_crypto_driver.drv,
+				&driver_attr_g_wqs_per_iaa);
+	if (ret) {
+		pr_debug("IAA g_wqs_per_iaa attr creation failed\n");
+		goto err_g_wqs_per_iaa_attr_create;
+	}
+
+	ret = driver_create_file(&iaa_crypto_driver.drv,
+				&driver_attr_g_consec_descs_per_gwq);
+	if (ret) {
+		pr_debug("IAA g_consec_descs_per_gwq attr creation failed\n");
+		goto err_g_consec_descs_per_gwq_attr_create;
+	}
+
 	if (iaa_crypto_debugfs_init())
 		pr_warn("debugfs init failed, stats not available\n");
 
@@ -2456,6 +2815,12 @@ static int __init iaa_crypto_init_module(void)
 out:
 	return ret;
 
+err_g_consec_descs_per_gwq_attr_create:
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_g_wqs_per_iaa);
+err_g_wqs_per_iaa_attr_create:
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_sync_mode);
 err_sync_attr_create:
 	driver_remove_file(&iaa_crypto_driver.drv,
 			   &driver_attr_verify_compress);
@@ -2479,6 +2844,10 @@ static void __exit iaa_crypto_cleanup_module(void)
 			   &driver_attr_sync_mode);
 	driver_remove_file(&iaa_crypto_driver.drv,
 			   &driver_attr_verify_compress);
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_g_wqs_per_iaa);
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_g_consec_descs_per_gwq);
 	idxd_driver_unregister(&iaa_crypto_driver);
 	iaa_aecs_cleanup_fixed();
 	crypto_free_comp(deflate_generic_tfm);
-- 
2.27.0




* [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching.
  2024-11-23  7:01 [PATCH v4 00/10] zswap IAA compress batching Kanchana P Sridhar
                   ` (7 preceding siblings ...)
  2024-11-23  7:01 ` [PATCH v4 08/10] crypto: iaa - Distribute compress jobs from all cores to all IAAs on a package Kanchana P Sridhar
@ 2024-11-23  7:01 ` Kanchana P Sridhar
  2024-12-02 19:15   ` Nhat Pham
  2024-11-23  7:01 ` [PATCH v4 10/10] mm: zswap: Compress batching with Intel IAA in zswap_batch_store() of large folios Kanchana P Sridhar
  9 siblings, 1 reply; 39+ messages in thread
From: Kanchana P Sridhar @ 2024-11-23  7:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch does the following:

1) Modifies the definition of "struct crypto_acomp_ctx" to represent a
   configurable number of acomp_reqs and buffers. Adds an "nr_reqs" field
   to "struct crypto_acomp_ctx" that holds the number of resources
   allocated in the cpu onlining code.

2) The zswap_cpu_comp_prepare() cpu onlining code detects whether the
   crypto_acomp created for the pool (in other words, the zswap compression
   algorithm) has registered an implementation for batch_compress() and
   batch_decompress(). If so, it sets "nr_reqs" to SWAP_CRYPTO_BATCH_SIZE,
   allocates that many reqs/buffers, and sets acomp_ctx->nr_reqs
   accordingly. If the crypto_acomp does not support batching, "nr_reqs"
   defaults to 1.

3) Adds a "bool can_batch" to "struct zswap_pool", which step (2) sets to
   true if the batching APIs are present for the crypto_acomp.

SWAP_CRYPTO_BATCH_SIZE is set to 8, which is the IAA compress batching
"sub-batch" size when zswap_batch_store() is processing a large folio. This
represents the number of buffers that can be compressed/decompressed in
parallel by Intel IAA hardware.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 include/linux/zswap.h |   7 +++
 mm/zswap.c            | 120 +++++++++++++++++++++++++++++++-----------
 2 files changed, 95 insertions(+), 32 deletions(-)

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index d961ead91bf1..9ad27ab3d222 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -7,6 +7,13 @@
 
 struct lruvec;
 
+/*
+ * For IAA compression batching:
+ * Maximum number of IAA acomp compress requests that will be processed
+ * in a batch: in parallel, if iaa_crypto async/no irq mode is enabled
+ * (the default); else sequentially, if iaa_crypto sync mode is in effect.
+ */
+#define SWAP_CRYPTO_BATCH_SIZE 8UL
 extern atomic_long_t zswap_stored_pages;
 
 #ifdef CONFIG_ZSWAP
diff --git a/mm/zswap.c b/mm/zswap.c
index f6316b66fb23..173f7632990e 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -143,9 +143,10 @@ bool zswap_never_enabled(void)
 
 struct crypto_acomp_ctx {
 	struct crypto_acomp *acomp;
-	struct acomp_req *req;
+	struct acomp_req **reqs;
+	u8 **buffers;
+	unsigned int nr_reqs;
 	struct crypto_wait wait;
-	u8 *buffer;
 	struct mutex mutex;
 	bool is_sleepable;
 };
@@ -158,6 +159,7 @@ struct crypto_acomp_ctx {
  */
 struct zswap_pool {
 	struct zpool *zpool;
+	bool can_batch;
 	struct crypto_acomp_ctx __percpu *acomp_ctx;
 	struct percpu_ref ref;
 	struct list_head list;
@@ -285,6 +287,8 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 		goto error;
 	}
 
+	pool->can_batch = false;
+
 	ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
 				       &pool->node);
 	if (ret)
@@ -818,49 +822,90 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
 	struct crypto_acomp *acomp;
-	struct acomp_req *req;
-	int ret;
+	unsigned int nr_reqs = 1;
+	int ret = -ENOMEM;
+	int i, j;
 
 	mutex_init(&acomp_ctx->mutex);
-
-	acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
-	if (!acomp_ctx->buffer)
-		return -ENOMEM;
+	acomp_ctx->nr_reqs = 0;
 
 	acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
 	if (IS_ERR(acomp)) {
 		pr_err("could not alloc crypto acomp %s : %ld\n",
 				pool->tfm_name, PTR_ERR(acomp));
-		ret = PTR_ERR(acomp);
-		goto acomp_fail;
+		return PTR_ERR(acomp);
 	}
 	acomp_ctx->acomp = acomp;
 	acomp_ctx->is_sleepable = acomp_is_async(acomp);
 
-	req = acomp_request_alloc(acomp_ctx->acomp);
-	if (!req) {
-		pr_err("could not alloc crypto acomp_request %s\n",
-		       pool->tfm_name);
-		ret = -ENOMEM;
+	/*
+	 * Create the necessary batching resources if the crypto acomp alg
+	 * implements the batch_compress and batch_decompress API.
+	 */
+	if (acomp_has_async_batching(acomp)) {
+		pool->can_batch = true;
+		nr_reqs = SWAP_CRYPTO_BATCH_SIZE;
+		pr_info_once("Creating acomp_ctx with %u reqs for batching since crypto acomp %s\nhas registered batch_compress() and batch_decompress()\n",
+			nr_reqs, pool->tfm_name);
+	}
+
+	acomp_ctx->buffers = kmalloc_node(nr_reqs * sizeof(u8 *), GFP_KERNEL, cpu_to_node(cpu));
+	if (!acomp_ctx->buffers)
+		goto buf_fail;
+
+	for (i = 0; i < nr_reqs; ++i) {
+		acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
+		if (!acomp_ctx->buffers[i]) {
+			for (j = 0; j < i; ++j)
+				kfree(acomp_ctx->buffers[j]);
+			kfree(acomp_ctx->buffers);
+			ret = -ENOMEM;
+			goto buf_fail;
+		}
+	}
+
+	acomp_ctx->reqs = kmalloc_node(nr_reqs * sizeof(struct acomp_req *), GFP_KERNEL, cpu_to_node(cpu));
+	if (!acomp_ctx->reqs)
 		goto req_fail;
+
+	for (i = 0; i < nr_reqs; ++i) {
+		acomp_ctx->reqs[i] = acomp_request_alloc(acomp_ctx->acomp);
+		if (!acomp_ctx->reqs[i]) {
+			pr_err("could not alloc crypto acomp_request reqs[%d] %s\n",
+			       i, pool->tfm_name);
+			for (j = 0; j < i; ++j)
+				acomp_request_free(acomp_ctx->reqs[j]);
+			kfree(acomp_ctx->reqs);
+			ret = -ENOMEM;
+			goto req_fail;
+		}
 	}
-	acomp_ctx->req = req;
 
+	/*
+	 * The crypto_wait is used only in fully synchronous mode, i.e., with
+	 * scomp or the non-poll mode of acomp; hence there is only one "wait"
+	 * per acomp_ctx, with the callback set on reqs[0], under the assumption
+	 * that there is at least 1 request per acomp_ctx.
+	 */
 	crypto_init_wait(&acomp_ctx->wait);
 	/*
 	 * if the backend of acomp is async zip, crypto_req_done() will wakeup
 	 * crypto_wait_req(); if the backend of acomp is scomp, the callback
 	 * won't be called, crypto_wait_req() will return without blocking.
 	 */
-	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+	acomp_request_set_callback(acomp_ctx->reqs[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
 				   crypto_req_done, &acomp_ctx->wait);
 
+	acomp_ctx->nr_reqs = nr_reqs;
 	return 0;
 
 req_fail:
+	for (i = 0; i < nr_reqs; ++i)
+		kfree(acomp_ctx->buffers[i]);
+	kfree(acomp_ctx->buffers);
+buf_fail:
 	crypto_free_acomp(acomp_ctx->acomp);
-acomp_fail:
-	kfree(acomp_ctx->buffer);
+	pool->can_batch = false;
 	return ret;
 }
 
@@ -870,11 +915,22 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
 
 	if (!IS_ERR_OR_NULL(acomp_ctx)) {
-		if (!IS_ERR_OR_NULL(acomp_ctx->req))
-			acomp_request_free(acomp_ctx->req);
+		int i;
+
+		for (i = 0; i < acomp_ctx->nr_reqs; ++i)
+			if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
+				acomp_request_free(acomp_ctx->reqs[i]);
+		kfree(acomp_ctx->reqs);
+
+		for (i = 0; i < acomp_ctx->nr_reqs; ++i)
+			kfree(acomp_ctx->buffers[i]);
+		kfree(acomp_ctx->buffers);
+
 		if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
 			crypto_free_acomp(acomp_ctx->acomp);
-		kfree(acomp_ctx->buffer);
+
+		acomp_ctx->nr_reqs = 0;
+		acomp_ctx = NULL;
 	}
 
 	return 0;
@@ -897,7 +953,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 
 	mutex_lock(&acomp_ctx->mutex);
 
-	dst = acomp_ctx->buffer;
+	dst = acomp_ctx->buffers[0];
 	sg_init_table(&input, 1);
 	sg_set_page(&input, page, PAGE_SIZE, 0);
 
@@ -907,7 +963,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	 * giving the dst buffer with enough length to avoid buffer overflow.
 	 */
 	sg_init_one(&output, dst, PAGE_SIZE * 2);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
+	acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, PAGE_SIZE, dlen);
 
 	/*
 	 * it maybe looks a little bit silly that we send an asynchronous request,
@@ -921,8 +977,8 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	 * but in different threads running on different cpu, we have different
 	 * acomp instance, so multiple threads can do (de)compression in parallel.
 	 */
-	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
-	dlen = acomp_ctx->req->dlen;
+	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
+	dlen = acomp_ctx->reqs[0]->dlen;
 	if (comp_ret)
 		goto unlock;
 
@@ -975,20 +1031,20 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	 */
 	if ((acomp_ctx->is_sleepable && !zpool_can_sleep_mapped(zpool)) ||
 	    !virt_addr_valid(src)) {
-		memcpy(acomp_ctx->buffer, src, entry->length);
-		src = acomp_ctx->buffer;
+		memcpy(acomp_ctx->buffers[0], src, entry->length);
+		src = acomp_ctx->buffers[0];
 		zpool_unmap_handle(zpool, entry->handle);
 	}
 
 	sg_init_one(&input, src, entry->length);
 	sg_init_table(&output, 1);
 	sg_set_folio(&output, folio, PAGE_SIZE, 0);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
-	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
-	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
+	acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, entry->length, PAGE_SIZE);
+	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->reqs[0]), &acomp_ctx->wait));
+	BUG_ON(acomp_ctx->reqs[0]->dlen != PAGE_SIZE);
 	mutex_unlock(&acomp_ctx->mutex);
 
-	if (src != acomp_ctx->buffer)
+	if (src != acomp_ctx->buffers[0])
 		zpool_unmap_handle(zpool, entry->handle);
 }
 
-- 
2.27.0




* Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching.
  2024-11-23  7:01 ` [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching Kanchana P Sridhar
@ 2024-12-02 19:15   ` Nhat Pham
  2024-12-03  0:30     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 39+ messages in thread
From: Nhat Pham @ 2024-12-02 19:15 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
	usamaarif642, ryan.roberts, ying.huang, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi, wajdi.k.feghali, vinodh.gopal

On Fri, Nov 22, 2024 at 11:01 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> This patch does the following:
>
> 1) Modifies the definition of "struct crypto_acomp_ctx" to represent a
>    configurable number of acomp_reqs and buffers. Adds a "nr_reqs" to
>    "struct crypto_acomp_ctx" to contain the nr of resources that will be
>    allocated in the cpu onlining code.
>
> 2) The zswap_cpu_comp_prepare() cpu onlining code will detect if the
>    crypto_acomp created for the pool (in other words, the zswap compression
>    algorithm) has registered an implementation for batch_compress() and
>    batch_decompress(). If so, it will set "nr_reqs" to
>    SWAP_CRYPTO_BATCH_SIZE and allocate these many reqs/buffers, and set
>    the acomp_ctx->nr_reqs accordingly. If the crypto_acomp does not support
>    batching, "nr_reqs" defaults to 1.
>
> 3) Adds a "bool can_batch" to "struct zswap_pool" that step (2) will set to
>    true if the batching API are present for the crypto_acomp.

Why do we need this "can_batch" field? IIUC, this can be determined
from the compressor's internal fields, no?

acomp_has_async_batching(acomp);

Is this just for convenience, or is this actually an expensive thing to compute?

>
> SWAP_CRYPTO_BATCH_SIZE is set to 8, which will be the IAA compress batching

I like a sane default value as much as the next guy, but this seems a
bit odd to me:

1. The placement of this constant/default value seems strange to me.
This is a compressor-specific value, no? Why are we enforcing this
batching size at the zswap level, and uniformly at that? What if we
introduce a new batch compression algorithm...? Or am I missing
something, and this is a sane default for other compressors too?

2. Why is this value set to 8? Experimentation? Could you add some
justification in documentation?



* RE: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching.
  2024-12-02 19:15   ` Nhat Pham
@ 2024-12-03  0:30     ` Sridhar, Kanchana P
  2024-12-03  8:00       ` Herbert Xu
  2024-12-21  6:30       ` Sridhar, Kanchana P
  0 siblings, 2 replies; 39+ messages in thread
From: Sridhar, Kanchana P @ 2024-12-03  0:30 UTC (permalink / raw)
  To: Nhat Pham
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
	usamaarif642, ryan.roberts, ying.huang, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P

Hi Nhat,

> On Fri, Nov 22, 2024 at 11:01 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > This patch does the following:
> >
> > 1) Modifies the definition of "struct crypto_acomp_ctx" to represent a
> >    configurable number of acomp_reqs and buffers. Adds a "nr_reqs" to
> >    "struct crypto_acomp_ctx" to contain the nr of resources that will be
> >    allocated in the cpu onlining code.
> >
> > 2) The zswap_cpu_comp_prepare() cpu onlining code will detect if the
> >    crypto_acomp created for the pool (in other words, the zswap compression
> >    algorithm) has registered an implementation for batch_compress() and
> >    batch_decompress(). If so, it will set "nr_reqs" to
> >    SWAP_CRYPTO_BATCH_SIZE and allocate these many reqs/buffers, and set
> >    the acomp_ctx->nr_reqs accordingly. If the crypto_acomp does not support
> >    batching, "nr_reqs" defaults to 1.
> >
> > 3) Adds a "bool can_batch" to "struct zswap_pool" that step (2) will set to
> >    true if the batching API are present for the crypto_acomp.
> 
> Why do we need this "can_batch" field? IIUC, this can be determined
> from the compressor internal fields itself, no?
> 
> acomp_has_async_batching(acomp);
> 
> Is this just for convenience, or is this actually an expensive thing to compute?

Thanks for your comments. This is a good question. I did not want to assume
that batching resources have been allocated for a cpu based only on what
acomp_has_async_batching() returns: the cpu onlining code may have run into
an -ENOMEM error on any particular cpu. In that case, I set
pool->can_batch to "false", mainly for convenience, so that zswap
can be somewhat insulated from migration. I agree that this may not be
the best solution; whether or not batching is enabled can be determined
directly, just before the call to crypto_acomp_batch_compress(),
based on:

acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE;

I currently have a BUG_ON() that fires if this condition is not met; it relies
on pool->can_batch gating the flow into zswap_batch_compress().

I think a better solution would be to check that the acomp_ctx has
SWAP_CRYPTO_BATCH_SIZE resources right after we acquire the acomp_ctx->mutex
and before the call to crypto_acomp_batch_compress(). If it does, we proceed
with batching; if not, we call crypto_acomp_compress(). This might be the only
way to know for sure whether the crypto batching API can be called, given that
migration is possible at any point in zswap_store(). Once we hold the mutex,
it seems we can proceed with batching based on this check (although the UAF
situation remains a larger issue, beyond the scope of this patch). I would
appreciate other ideas as well.
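
To illustrate the check-under-lock idea, here is a minimal userspace sketch.
It is not zswap code: the "model_" names are hypothetical, MODEL_BATCH_SIZE
stands in for SWAP_CRYPTO_BATCH_SIZE, and a plain flag stands in for
acomp_ctx->mutex. It only shows that the path is chosen from the nr_reqs of
the ctx that was actually locked, not from a pool-wide flag.

```c
#include <stdbool.h>

#define MODEL_BATCH_SIZE 8U	/* assumption: mirrors SWAP_CRYPTO_BATCH_SIZE */

enum model_path { MODEL_PATH_SINGLE, MODEL_PATH_BATCH };

/* Hypothetical stand-in for the per-cpu struct crypto_acomp_ctx. */
struct model_acomp_ctx {
	unsigned int nr_reqs;	/* 1, or the batch size if onlining succeeded */
	bool locked;		/* stands in for acomp_ctx->mutex */
};

static void model_lock(struct model_acomp_ctx *ctx)
{
	ctx->locked = true;
}

static void model_unlock(struct model_acomp_ctx *ctx)
{
	ctx->locked = false;
}

/*
 * Pick the compress path for the ctx we ended up on. The decision is
 * made only while the lock is held, so it reflects the resources of
 * this particular ctx.
 */
enum model_path model_choose_path(struct model_acomp_ctx *ctx)
{
	enum model_path path;

	model_lock(ctx);
	if (ctx->nr_reqs == MODEL_BATCH_SIZE)
		path = MODEL_PATH_BATCH;	/* would call the batch API */
	else
		path = MODEL_PATH_SINGLE;	/* would compress one page */
	/* The real code would run the compression here, still locked. */
	model_unlock(ctx);

	return path;
}
```

A ctx whose onlining allocated only one request takes the single-page path
even if another cpu's ctx has the full batch.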

Also, I have submitted a patch-series [1] that incorporates Yosry's and
Johannes' suggestions on this series. It sets up a consolidated
zswap_store()/zswap_store_pages() code path for batching and non-batching
compressors. My goal is for [1] to go through code review, so that the
transition to batching reduces to a simple check:

if (acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE)
         zswap_batch_compress();
else
         zswap_compress();

Please feel free to provide code review comments in [1]. Thanks!

[1]: https://patchwork.kernel.org/project/linux-mm/list/?series=912937

> 
> >
> > SWAP_CRYPTO_BATCH_SIZE is set to 8, which will be the IAA compress batching
> 
> I like a sane default value as much as the next guy, but this seems a
> bit odd to me:
> 
> 1. The placement of this constant/default value seems strange to me.
> This is a compressor-specific value no? Why are we enforcing this
> batching size at the zswap level, and uniformly at that? What if we
> introduce a new batch compression algorithm...? Or am I missing
> something, and this is a sane default for other compressors too?

You bring up an excellent point. This is a compressor-specific value.
Instead of setting this up as a constant, which, as you correctly observe,
may not make sense for a non-IAA compressor, one option is to query the
compressor for it, say:

int acomp_get_max_batchsize(struct crypto_acomp *tfm) {...};

and then allocate sufficient acomp_reqs/buffers/etc. in the zswap
cpu onlining code.
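
As a rough userspace sketch of that idea (none of this is an existing crypto
API: "struct model_compressor", model_get_max_batchsize() and the other
"model_" names are hypothetical), the onlining code would size its
allocations from what the compressor reports, with 1 as the fallback for
compressors that do not batch:

```c
#include <stdlib.h>

/* Hypothetical compressor descriptor; not the real struct crypto_acomp. */
struct model_compressor {
	const char *name;
	unsigned int max_batch_size;	/* 0 or 1 means "no batching" */
};

/* Hypothetical stand-in for the per-cpu acomp_ctx resources. */
struct model_ctx {
	unsigned char **buffers;
	unsigned int nr_reqs;
};

/* Models the proposed query: never returns less than 1. */
unsigned int model_get_max_batchsize(const struct model_compressor *c)
{
	return c->max_batch_size > 1 ? c->max_batch_size : 1;
}

/* Allocate as many buffers as the compressor says it can batch. */
int model_ctx_prepare(struct model_ctx *ctx, const struct model_compressor *c,
		      size_t buf_size)
{
	unsigned int nr = model_get_max_batchsize(c);
	unsigned int i;

	ctx->nr_reqs = 0;
	ctx->buffers = calloc(nr, sizeof(*ctx->buffers));
	if (!ctx->buffers)
		return -1;

	for (i = 0; i < nr; i++) {
		ctx->buffers[i] = malloc(buf_size);
		if (!ctx->buffers[i])
			goto fail;
	}

	/* Publish nr_reqs only after every allocation has succeeded. */
	ctx->nr_reqs = nr;
	return 0;

fail:
	while (i--)
		free(ctx->buffers[i]);
	free(ctx->buffers);
	ctx->buffers = NULL;
	return -1;
}

void model_ctx_release(struct model_ctx *ctx)
{
	unsigned int i;

	for (i = 0; i < ctx->nr_reqs; i++)
		free(ctx->buffers[i]);
	free(ctx->buffers);
	ctx->buffers = NULL;
	ctx->nr_reqs = 0;
}
```

With this shape, zswap would not need a compressor-specific constant; the
batch size would come from whichever compressor backs the pool.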

> 
> 2. Why is this value set to 8? Experimentation? Could you add some
> justification in documentation?

Can I get back to you later this week with a proposal for this? We plan
to have a team discussion on how best to approach this for current
and future hardware.

Thanks,
Kanchana


* Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching.
  2024-12-03  0:30     ` Sridhar, Kanchana P
@ 2024-12-03  8:00       ` Herbert Xu
  2024-12-03 21:37         ` Sridhar, Kanchana P
  2024-12-21  6:30       ` Sridhar, Kanchana P
  1 sibling, 1 reply; 39+ messages in thread
From: Herbert Xu @ 2024-12-03  8:00 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: Nhat Pham, linux-kernel, linux-mm, hannes, yosryahmed,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh

On Tue, Dec 03, 2024 at 12:30:30AM +0000, Sridhar, Kanchana P wrote:
>
> > Why do we need this "can_batch" field? IIUC, this can be determined
> > from the compressor internal fields itself, no?
> > 
> > acomp_has_async_batching(acomp);
> > 
> > Is this just for convenience, or is this actually an expensive thing to compute?
> 
> Thanks for your comments. This is a good question. I tried not to imply that
> batching resources have been allocated for the cpu based only on what
> acomp_has_async_batching() returns. It is possible that the cpu onlining
> code ran into an -ENOMEM error on any particular cpu. In this case, I set
> the pool->can_batch to "false", mainly for convenience, so that zswap
> can be somewhat insulated from migration. I agree that this may not be
> the best solution; and whether or not batching is enabled can be directly
> determined just before the call to crypto_acomp_batch_compress()
> based on:
> 
> acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE;

With ahash request chaining, the idea is to accumulate as much
data as you can before you provide it to the Crypto API.  The
API is responsible for dividing it up if the underlying driver
is only able to handle one request at a time.

So that would be the ideal model to use for compression as well.
Provide as much data as you can and let the API handle the case
where the data needs to be divided up.
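
A minimal userspace sketch of that division of labour follows. It is only an
illustration under assumed names ("struct model_driver", model_api_compress()
and the per-submit limit are hypothetical), not the ahash request chaining
code or any existing acomp interface. The caller hands over everything it
has, and the API layer splits the work to fit what the driver can take per
submission.

```c
/* Hypothetical driver model: counts how it was invoked. */
struct model_driver {
	unsigned int max_per_submit;	/* 1 for a non-batching driver */
	unsigned int submits;		/* number of driver invocations */
	unsigned int processed;		/* total items handed to the driver */
};

/* Stand-in for the driver's compress entry point. */
static void model_driver_submit(struct model_driver *d, unsigned int n)
{
	d->submits++;
	d->processed += n;
}

/*
 * API layer: accepts all nr_items from the caller and divides them into
 * chunks no larger than the driver's per-submit limit. Returns the
 * number of items processed.
 */
unsigned int model_api_compress(struct model_driver *d, unsigned int nr_items)
{
	unsigned int limit = d->max_per_submit ? d->max_per_submit : 1;
	unsigned int done = 0;

	while (done < nr_items) {
		unsigned int chunk = nr_items - done;

		if (chunk > limit)
			chunk = limit;
		model_driver_submit(d, chunk);
		done += chunk;
	}

	return done;
}
```

The caller's code is identical for both kinds of driver; only the number of
driver invocations differs.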

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt



* RE: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching.
  2024-12-03  8:00       ` Herbert Xu
@ 2024-12-03 21:37         ` Sridhar, Kanchana P
  2024-12-03 21:44           ` Yosry Ahmed
  0 siblings, 1 reply; 39+ messages in thread
From: Sridhar, Kanchana P @ 2024-12-03 21:37 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Nhat Pham, linux-kernel, linux-mm, hannes, yosryahmed,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P


> On Tue, Dec 03, 2024 at 12:30:30AM +0000, Sridhar, Kanchana P wrote:
> >
> > > Why do we need this "can_batch" field? IIUC, this can be determined
> > > from the compressor internal fields itself, no?
> > >
> > > acomp_has_async_batching(acomp);
> > >
> > > Is this just for convenience, or is this actually an expensive thing to compute?
> >
> > Thanks for your comments. This is a good question. I tried not to imply that
> > batching resources have been allocated for the cpu based only on what
> > acomp_has_async_batching() returns. It is possible that the cpu onlining
> > code ran into an -ENOMEM error on any particular cpu. In this case, I set
> > the pool->can_batch to "false", mainly for convenience, so that zswap
> > can be somewhat insulated from migration. I agree that this may not be
> > the best solution; and whether or not batching is enabled can be directly
> > determined just before the call to crypto_acomp_batch_compress()
> > based on:
> >
> > acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE;
> 
> With ahash request chaining, the idea is to accumulate as much
> data as you can before you provide it to the Crypto API.  The
> API is responsible for dividing it up if the underlying driver
> is only able to handle one request at a time.
> 
> So that would be the ideal model to use for compression as well.
> Provide as much data as you can and let the API handle the case
> where the data needs to be divided up.

Thanks for this suggestion! This sounds like a clean way to handle the
batching/sequential compress/decompress within the crypto API as long
as it can be contained in the crypto acompress layer. 
If the zswap maintainers don't have any objections, I can look into the
feasibility of doing this.

Thanks,
Kanchana

> 
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt



* Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching.
  2024-12-03 21:37         ` Sridhar, Kanchana P
@ 2024-12-03 21:44           ` Yosry Ahmed
  2024-12-03 22:17             ` Sridhar, Kanchana P
  2024-12-04  1:42             ` Herbert Xu
  0 siblings, 2 replies; 39+ messages in thread
From: Yosry Ahmed @ 2024-12-03 21:44 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: Herbert Xu, Nhat Pham, linux-kernel, linux-mm, hannes,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh

On Tue, Dec 3, 2024 at 1:37 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
>
> > -----Original Message-----
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Sent: Tuesday, December 3, 2024 12:01 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: Nhat Pham <nphamcs@gmail.com>; linux-kernel@vger.kernel.org;
> > linux-mm@kvack.org; hannes@cmpxchg.org; yosryahmed@google.com;
> > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > ryan.roberts@arm.com; ying.huang@intel.com; 21cnbao@gmail.com;
> > akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> > Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> > the crypto_alg supports batching.
> >
> > On Tue, Dec 03, 2024 at 12:30:30AM +0000, Sridhar, Kanchana P wrote:
> > >
> > > > Why do we need this "can_batch" field? IIUC, this can be determined
> > > > from the compressor internal fields itself, no?
> > > >
> > > > acomp_has_async_batching(acomp);
> > > >
> > > > Is this just for convenience, or is this actually an expensive thing to
> > > > compute?
> > >
> > > Thanks for your comments. This is a good question. I tried not to imply that
> > > batching resources have been allocated for the cpu based only on what
> > > acomp_has_async_batching() returns. It is possible that the cpu onlining
> > > code ran into an -ENOMEM error on any particular cpu. In this case, I set
> > > the pool->can_batch to "false", mainly for convenience, so that zswap
> > > can be somewhat insulated from migration. I agree that this may not be
> > > the best solution; and whether or not batching is enabled can be directly
> > > determined just before the call to crypto_acomp_batch_compress()
> > > based on:
> > >
> > > acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE;
> >
> > With ahash request chaining, the idea is to accumulate as much
> > data as you can before you provide it to the Crypto API.  The
> > API is responsible for dividing it up if the underlying driver
> > is only able to handle one request at a time.
> >
> > So that would be the ideal model to use for compression as well.
> > Provide as much data as you can and let the API handle the case
> > where the data needs to be divided up.
>
> Thanks for this suggestion! This sounds like a clean way to handle the
> batching/sequential compress/decompress within the crypto API as long
> as it can be contained in the crypto acompress layer.
> If the zswap maintainers don't have any objections, I can look into the
> feasibility of doing this.

Does this mean that instead of zswap breaking down the folio into
SWAP_CRYPTO_BATCH_SIZE -sized batches, we pass all the pages to the
crypto layer and let it do the batching as it pleases?

It sounds nice on the surface, but this implies that we have to
allocate folio_nr_pages() buffers in zswap, essentially as the
allocation is the same size as the folio itself. While the allocation
does not need to be contiguous, making a large number of allocations
in the reclaim path is definitely not something we want. For a 2M THP,
we'd need to allocate 2M in zswap_store().

If we choose to keep preallocating, assuming the maximum THP size is
2M, we need to allocate 2M * nr_cpus worth of buffers. That's a lot of
memory.
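
To put rough numbers on this, here is a userspace sketch of the arithmetic.
It assumes 4K base pages and one page-sized buffer per page in flight; the
names (sketch_buf_bytes() and friends) and the 64-CPU machine are made up
for illustration and are not zswap code:

```c
#define SKETCH_PAGE_SIZE 4096UL		/* assumption: 4K base pages */

/* Bytes of destination buffers needed to cover nr_pages pages at once. */
static unsigned long sketch_buf_bytes(unsigned long nr_pages)
{
	return nr_pages * SKETCH_PAGE_SIZE;
}

/* Total preallocation if every CPU keeps its own set of buffers. */
static unsigned long sketch_prealloc_bytes(unsigned long nr_pages,
					   unsigned long nr_cpus)
{
	return sketch_buf_bytes(nr_pages) * nr_cpus;
}
```

A 2M THP is 512 such pages, so sketch_buf_bytes(512) is 2M, and
sketch_prealloc_bytes(512, 64) is 128M on a hypothetical 64-CPU machine. An
8-page batch per CPU on the same machine would be 2M in total.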

I feel like I am missing something.

>
> Thanks,
> Kanchana
>
> >
> > Cheers,
> > --
> > Email: Herbert Xu <herbert@gondor.apana.org.au>
> > Home Page: http://gondor.apana.org.au/~herbert/
> > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt



* RE: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching.
  2024-12-03 21:44           ` Yosry Ahmed
@ 2024-12-03 22:17             ` Sridhar, Kanchana P
  2024-12-03 22:24               ` Sridhar, Kanchana P
  2024-12-04  1:42             ` Herbert Xu
  1 sibling, 1 reply; 39+ messages in thread
From: Sridhar, Kanchana P @ 2024-12-03 22:17 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Herbert Xu, Nhat Pham, linux-kernel, linux-mm, hannes,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, davem, clabbe, ardb, ebiggers, surenb, Accardi,
	Kristen C, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, December 3, 2024 1:44 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>; Nhat Pham
> <nphamcs@gmail.com>; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; ying.huang@intel.com;
> 21cnbao@gmail.com; akpm@linux-foundation.org;
> linux-crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> Kristen C <kristen.c.accardi@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> the crypto_alg supports batching.
> 
> On Tue, Dec 3, 2024 at 1:37 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> >
> > > -----Original Message-----
> > > From: Herbert Xu <herbert@gondor.apana.org.au>
> > > Sent: Tuesday, December 3, 2024 12:01 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > > Cc: Nhat Pham <nphamcs@gmail.com>; linux-kernel@vger.kernel.org;
> > > > linux-mm@kvack.org; hannes@cmpxchg.org; yosryahmed@google.com;
> > > > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > > > ryan.roberts@arm.com; ying.huang@intel.com; 21cnbao@gmail.com;
> > > > akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> > > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > > > <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> > > > Gopal, Vinodh <vinodh.gopal@intel.com>
> > > > Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> > > > the crypto_alg supports batching.
> > >
> > > On Tue, Dec 03, 2024 at 12:30:30AM +0000, Sridhar, Kanchana P wrote:
> > > >
> > > > > Why do we need this "can_batch" field? IIUC, this can be determined
> > > > > from the compressor internal fields itself, no?
> > > > >
> > > > > acomp_has_async_batching(acomp);
> > > > >
> > > > > > Is this just for convenience, or is this actually an expensive thing to
> > > > > > compute?
> > > > >
> > > > > Thanks for your comments. This is a good question. I tried not to imply that
> > > > > batching resources have been allocated for the cpu based only on what
> > > > > acomp_has_async_batching() returns. It is possible that the cpu onlining
> > > > > code ran into an -ENOMEM error on any particular cpu. In this case, I set
> > > > the pool->can_batch to "false", mainly for convenience, so that zswap
> > > > can be somewhat insulated from migration. I agree that this may not be
> > > > the best solution; and whether or not batching is enabled can be directly
> > > > determined just before the call to crypto_acomp_batch_compress()
> > > > based on:
> > > >
> > > > acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE;
> > >
> > > With ahash request chaining, the idea is to accumulate as much
> > > data as you can before you provide it to the Crypto API.  The
> > > API is responsible for dividing it up if the underlying driver
> > > is only able to handle one request at a time.
> > >
> > > So that would be the ideal model to use for compression as well.
> > > Provide as much data as you can and let the API handle the case
> > > where the data needs to be divided up.
> >
> > Thanks for this suggestion! This sounds like a clean way to handle the
> > batching/sequential compress/decompress within the crypto API as long
> > as it can be contained in the crypto acompress layer.
> > If the zswap maintainers don't have any objections, I can look into the
> > feasibility of doing this.
> 
> Does this mean that instead of zswap breaking down the folio into
> SWAP_CRYPTO_BATCH_SIZE -sized batches, we pass all the pages to the
> crypto layer and let it do the batching as it pleases?

If I understand Herbert's suggestion correctly, I think what he meant was
that we allocate only SWAP_CRYPTO_BATCH_SIZE # of buffers in zswap (say, 8)
during the cpu onlining always. The acomp_has_async_batching() API can
be used to determine whether to allocate more than one acomp_req and
crypto_wait (fyi, I am creating SWAP_CRYPTO_BATCH_SIZE # of crypto_wait
for the request chaining with the goal of understanding performance wrt the
existing implementation of crypto_acomp_batch_compress()).
In zswap_store_folio(), we process the large folio in batches of 8 pages
and call "crypto_acomp_batch_compress()" for each batch. Based on earlier
discussions in this thread, it might make sense to add a bool option to
crypto_acomp_batch_compress() as follows:

static inline bool crypto_acomp_batch_compress(struct acomp_req *reqs[],
					       struct crypto_wait *waits[],
					       struct page *pages[],
					       u8 *dsts[],
					       unsigned int dlens[],
					       int errors[],
					       int nr_pages,
					       bool parallel);

zswap would acquire the per-cpu acomp_ctx->mutex, and pass
(acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE) for the "parallel" parameter.
This indicates to crypto_acomp_batch_compress() whether or not
SWAP_CRYPTO_BATCH_SIZE # of elements are available in "reqs" and "waits".

If we have multiple "reqs" (parallel == true), we use request chaining (or the
existing asynchronous poll implementation) for IAA batching. If (parallel == false),
crypto_acomp_batch_compress() will look something like this:

static inline bool crypto_acomp_batch_compress(struct acomp_req *reqs[],
					       struct crypto_wait *waits[],
					       struct page *pages[],
					       u8 *dsts[],
					       unsigned int dlens[],
					       int errors[],
					       int nr_pages,
					       bool parallel)
{
	bool ok = true;

	if (!parallel) {
		struct scatterlist input, output;
		int i;

		for (i = 0; i < nr_pages; ++i) {
			/*
			 * For pages[i], dsts[i], dlens[i]: borrow first half
			 * of zswap_compress() functionality, reusing the
			 * single request/wait pair in reqs[0]/waits[0].
			 * zswap would pass acomp_ctx->buffers as "dsts".
			 */
			sg_init_table(&input, 1);
			sg_set_page(&input, pages[i], PAGE_SIZE, 0);

			sg_init_one(&output, dsts[i], PAGE_SIZE * 2);
			acomp_request_set_params(reqs[0], &input, &output,
						 PAGE_SIZE, dlens[i]);

			errors[i] = crypto_wait_req(crypto_acomp_compress(reqs[0]),
						    waits[0]);
			dlens[i] = reqs[0]->dlen;
			if (errors[i])
				ok = false;
		}
	}

	/*
	 * At this point we would have sequentially compressed the batch.
	 * zswap_store_folio() can process the buffers and dlens using
	 * common code for batching and non-batching compressors.
	 */
	return ok;
}

IIUC, this suggestion appears to be along the lines of using common
code in zswap as far as possible, for compressors that do and do not
support batching. Herbert can correct me if I am wrong.

If this is indeed the case, the memory penalty for software compressors
would be:
1) pre-allocating SWAP_CRYPTO_BATCH_SIZE acomp_ctx->buffers in
    zswap_cpu_comp_prepare().
2) SWAP_CRYPTO_BATCH_SIZE stack variables for pages and dlens in
    zswap_store_folio().

This would be an additional memory penalty for what we gain by
having the common code paths in zswap for compressors that do
and do not support batching.
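
As a rough userspace sketch of item 1), assuming 4K pages, 2 * PAGE_SIZE
per buffer (as in the sg_init_one() call above), SWAP_CRYPTO_BATCH_SIZE of
8, and a single such buffer per cpu today (the sketch_* names are only for
illustration):

```c
#define SKETCH_PAGE_SIZE	4096UL	/* assumption: 4K base pages */
#define SKETCH_BATCH_SIZE	8UL	/* stands in for SWAP_CRYPTO_BATCH_SIZE */

/* Per-cpu bytes of dst buffers when each buffer is 2 * PAGE_SIZE. */
static unsigned long sketch_percpu_buf_bytes(unsigned long nr_buffers)
{
	return nr_buffers * 2UL * SKETCH_PAGE_SIZE;
}

/* Extra per-cpu bytes relative to a single preallocated buffer. */
static unsigned long sketch_extra_percpu_bytes(void)
{
	return sketch_percpu_buf_bytes(SKETCH_BATCH_SIZE) -
	       sketch_percpu_buf_bytes(1);
}
```

Under these assumptions the per-cpu buffers grow from 8K to 64K, i.e. an
extra 56K per cpu for a software compressor.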

Thanks,
Kanchana

> 
> It sounds nice on the surface, but this implies that we have to
> allocate folio_nr_pages() buffers in zswap, essentially as the
> allocation is the same size as the folio itself. While the allocation
> does not need to be contiguous, making a large number of allocations
> in the reclaim path is definitely not something we want. For a 2M THP,
> we'd need to allocate 2M in zswap_store().
> 
> If we choose to keep preallocating, assuming the maximum THP size is
> 2M, we need to allocate 2M * nr_cpus worth of buffers. That's a lot of
> memory.
> 
> I feel like I am missing something.
> 
> >
> > Thanks,
> > Kanchana
> >
> > >
> > > Cheers,
> > > --
> > > Email: Herbert Xu <herbert@gondor.apana.org.au>
> > > Home Page: http://gondor.apana.org.au/~herbert/
> > > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* RE: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching.
  2024-12-03 22:17             ` Sridhar, Kanchana P
@ 2024-12-03 22:24               ` Sridhar, Kanchana P
  0 siblings, 0 replies; 39+ messages in thread
From: Sridhar, Kanchana P @ 2024-12-03 22:24 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Herbert Xu, Nhat Pham, linux-kernel, linux-mm, hannes,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, davem, clabbe, ardb, ebiggers, surenb, Accardi,
	Kristen C, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Tuesday, December 3, 2024 2:18 PM
> To: Yosry Ahmed <yosryahmed@google.com>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>; Nhat Pham
> <nphamcs@gmail.com>; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>; Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com>
> Subject: RE: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> the crypto_alg supports batching.
> 
> 
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@google.com>
> > Sent: Tuesday, December 3, 2024 1:44 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: Herbert Xu <herbert@gondor.apana.org.au>; Nhat Pham
> > <nphamcs@gmail.com>; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; ryan.roberts@arm.com; ying.huang@intel.com;
> > 21cnbao@gmail.com; akpm@linux-foundation.org;
> > linux-crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> > ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> > Kristen C <kristen.c.accardi@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> > the crypto_alg supports batching.
> >
> > On Tue, Dec 3, 2024 at 1:37 PM Sridhar, Kanchana P
> > <kanchana.p.sridhar@intel.com> wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Herbert Xu <herbert@gondor.apana.org.au>
> > > > Sent: Tuesday, December 3, 2024 12:01 AM
> > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > > > Cc: Nhat Pham <nphamcs@gmail.com>; linux-kernel@vger.kernel.org;
> > > > > linux-mm@kvack.org; hannes@cmpxchg.org; yosryahmed@google.com;
> > > > > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > > > > ryan.roberts@arm.com; ying.huang@intel.com; 21cnbao@gmail.com;
> > > > > akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> > > > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > > > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > > > > <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> > > > > Gopal, Vinodh <vinodh.gopal@intel.com>
> > > > > Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> > > > > the crypto_alg supports batching.
> > > >
> > > > On Tue, Dec 03, 2024 at 12:30:30AM +0000, Sridhar, Kanchana P wrote:
> > > > >
> > > > > > > Why do we need this "can_batch" field? IIUC, this can be determined
> > > > > > > from the compressor internal fields itself, no?
> > > > > > >
> > > > > > > acomp_has_async_batching(acomp);
> > > > > > >
> > > > > > > Is this just for convenience, or is this actually an expensive thing to
> > > > > > > compute?
> > > > > >
> > > > > > Thanks for your comments. This is a good question. I tried not to imply that
> > > > > > batching resources have been allocated for the cpu based only on what
> > > > > > acomp_has_async_batching() returns. It is possible that the cpu onlining
> > > > > > code ran into an -ENOMEM error on any particular cpu. In this case, I set
> > > > > > the pool->can_batch to "false", mainly for convenience, so that zswap
> > > > > > can be somewhat insulated from migration. I agree that this may not be
> > > > > > the best solution; and whether or not batching is enabled can be directly
> > > > > > determined just before the call to crypto_acomp_batch_compress()
> > > > > > based on:
> > > > >
> > > > > acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE;
> > > >
> > > > With ahash request chaining, the idea is to accumulate as much
> > > > data as you can before you provide it to the Crypto API.  The
> > > > API is responsible for dividing it up if the underlying driver
> > > > is only able to handle one request at a time.
> > > >
> > > > So that would be the ideal model to use for compression as well.
> > > > Provide as much data as you can and let the API handle the case
> > > > where the data needs to be divided up.
> > >
> > > Thanks for this suggestion! This sounds like a clean way to handle the
> > > batching/sequential compress/decompress within the crypto API as long
> > > as it can be contained in the crypto acompress layer.
> > > If the zswap maintainers don't have any objections, I can look into the
> > > feasibility of doing this.
> >
> > Does this mean that instead of zswap breaking down the folio into
> > SWAP_CRYPTO_BATCH_SIZE -sized batches, we pass all the pages to the
> > crypto layer and let it do the batching as it pleases?
> 
> If I understand Herbert's suggestion correctly, I think what he meant was
> that we allocate only SWAP_CRYPTO_BATCH_SIZE # of buffers in zswap (say, 8)
> during the cpu onlining always. The acomp_has_async_batching() API can
> be used to determine whether to allocate more than one acomp_req and
> crypto_wait (fyi, I am creating SWAP_CRYPTO_BATCH_SIZE # of crypto_wait
> for the request chaining with the goal of understanding performance wrt the
> existing implementation of crypto_acomp_batch_compress()).
> In zswap_store_folio(), we process the large folio in batches of 8 pages
> and call "crypto_acomp_batch_compress()" for each batch. Based on earlier
> discussions in this thread, it might make sense to add a bool option to
> crypto_acomp_batch_compress() as follows:
> 
> static inline bool crypto_acomp_batch_compress(struct acomp_req *reqs[],
> 					       struct crypto_wait *waits[],
> 					       struct page *pages[],
> 					       u8 *dsts[],
> 					       unsigned int dlens[],
> 					       int errors[],
> 					       int nr_pages,
> 					       bool parallel);
> 
> zswap would acquire the per-cpu acomp_ctx->mutex, and pass
> (acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE) for the "parallel" parameter.
> This indicates to crypto_acomp_batch_compress() whether or not
> SWAP_CRYPTO_BATCH_SIZE # of elements are available in "reqs" and "waits".
> 
> If we have multiple "reqs" (parallel == true), we use request chaining (or the
> existing asynchronous poll implementation) for IAA batching. If (parallel == false),
> crypto_acomp_batch_compress() will look something like this:
> 
> static inline bool crypto_acomp_batch_compress(struct acomp_req *reqs[],
> 					       struct crypto_wait *waits[],
> 					       struct page *pages[],
> 					       u8 *dsts[],
> 					       unsigned int dlens[],
> 					       int errors[],
> 					       int nr_pages,
> 					       bool parallel)
> {
> 	bool ok = true;
> 
> 	if (!parallel) {
> 		struct scatterlist input, output;
> 		int i;
> 
> 		for (i = 0; i < nr_pages; ++i) {
> 			/*
> 			 * For pages[i], dsts[i], dlens[i]: borrow first half
> 			 * of zswap_compress() functionality, reusing the
> 			 * single request/wait pair in reqs[0]/waits[0].
> 			 * zswap would pass acomp_ctx->buffers as "dsts".
> 			 */
> 			sg_init_table(&input, 1);
> 			sg_set_page(&input, pages[i], PAGE_SIZE, 0);
> 
> 			sg_init_one(&output, dsts[i], PAGE_SIZE * 2);
> 			acomp_request_set_params(reqs[0], &input, &output,
> 						 PAGE_SIZE, dlens[i]);
> 
> 			errors[i] = crypto_wait_req(crypto_acomp_compress(reqs[0]),
> 						    waits[0]);
> 			dlens[i] = reqs[0]->dlen;
> 			if (errors[i])
> 				ok = false;
> 		}
> 	}
> 
> 	/*
> 	 * At this point we would have sequentially compressed the batch.
> 	 * zswap_store_folio() can process the buffers and dlens using
> 	 * common code for batching and non-batching compressors.
> 	 */
> 	return ok;
> }
> 
> IIUC, this suggestion appears to be along the lines of using common
> code in zswap as far as possible, for compressors that do and do not
> support batching. Herbert can correct me if I am wrong.
> 
> If this is indeed the case, the memory penalty for software compressors
> would be:
> 1) pre-allocating SWAP_CRYPTO_BATCH_SIZE acomp_ctx->buffers in
>     zswap_cpu_comp_prepare().
> 2) SWAP_CRYPTO_BATCH_SIZE stack variables for pages and dlens in
>     zswap_store_folio().
> 
> This would be an additional memory penalty for what we gain by
> having the common code paths in zswap for compressors that do
> and do not support batching.

Alternately, we could use request chaining always, even for software
compressors for a larger memory penalty per-cpu, by allocating
SWAP_CRYPTO_BATCH_SIZE # of reqs/waits by default. I don't know
if this would have functional issues because the chain of requests
will be processed sequentially (basically all requests are added to a
list), but maybe Herbert is suggesting this (not sure).

Thanks,
Kanchana

> 
> Thanks,
> Kanchana
> 
> >
> > It sounds nice on the surface, but this implies that we have to
> > allocate folio_nr_pages() buffers in zswap, essentially as the
> > allocation is the same size as the folio itself. While the allocation
> > does not need to be contiguous, making a large number of allocations
> > in the reclaim path is definitely not something we want. For a 2M THP,
> > we'd need to allocate 2M in zswap_store().
> >
> > If we choose to keep preallocating, assuming the maximum THP size is
> > 2M, we need to allocate 2M * nr_cpus worth of buffers. That's a lot of
> > memory.
> >
> > I feel like I am missing something.
> >
> > >
> > > Thanks,
> > > Kanchana
> > >
> > > >
> > > > Cheers,
> > > > --
> > > > Email: Herbert Xu <herbert@gondor.apana.org.au>
> > > > Home Page: http://gondor.apana.org.au/~herbert/
> > > > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching.
  2024-12-03 21:44           ` Yosry Ahmed
  2024-12-03 22:17             ` Sridhar, Kanchana P
@ 2024-12-04  1:42             ` Herbert Xu
  2024-12-04 22:35               ` Yosry Ahmed
  1 sibling, 1 reply; 39+ messages in thread
From: Herbert Xu @ 2024-12-04  1:42 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Sridhar, Kanchana P, Nhat Pham, linux-kernel, linux-mm, hannes,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh

On Tue, Dec 03, 2024 at 01:44:00PM -0800, Yosry Ahmed wrote:
>
> Does this mean that instead of zswap breaking down the folio into
> SWAP_CRYPTO_BATCH_SIZE -sized batches, we pass all the pages to the
> crypto layer and let it do the batching as it pleases?

You provide as much (or little) as you're comfortable with.  Just
treat the acomp API as one that can take as much as you want to
give it.
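
As a rough userspace illustration of that model (this is not the acomp
interface; every name below is made up for the sketch): the caller hands
over however many items it has, and the API layer splits them into chunks
the driver can take.

```c
#include <stddef.h>

/* Hypothetical driver hook: processes up to 'n' items in one go. */
typedef void (*sketch_submit_fn)(const int *items, size_t n, void *priv);

/* Example hook that only counts how many items it was handed. */
static void sketch_count_items(const int *items, size_t n, void *priv)
{
	(void)items;
	*(size_t *)priv += n;
}

/*
 * Hand 'nr' items to a driver that can take at most 'driver_max' at a
 * time. Returns the number of driver submissions that were needed.
 */
static size_t sketch_api_submit(const int *items, size_t nr,
				size_t driver_max,
				sketch_submit_fn submit, void *priv)
{
	size_t done = 0, calls = 0;

	if (!driver_max)
		driver_max = 1;	/* treat 0 as "one request at a time" */

	while (done < nr) {
		size_t chunk = nr - done;

		if (chunk > driver_max)
			chunk = driver_max;
		submit(items + done, chunk, priv);
		done += chunk;
		calls++;
	}
	return calls;
}
```

With 20 items, a driver limit of 8 takes three submissions (8, 8, 4), and
a driver that handles one request at a time takes 20. The caller's code is
the same in both cases.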

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt



* Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching.
  2024-12-04  1:42             ` Herbert Xu
@ 2024-12-04 22:35               ` Yosry Ahmed
  2024-12-04 22:49                 ` Sridhar, Kanchana P
  0 siblings, 1 reply; 39+ messages in thread
From: Yosry Ahmed @ 2024-12-04 22:35 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Sridhar, Kanchana P, Nhat Pham, linux-kernel, linux-mm, hannes,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh

On Tue, Dec 3, 2024 at 5:42 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Tue, Dec 03, 2024 at 01:44:00PM -0800, Yosry Ahmed wrote:
> >
> > Does this mean that instead of zswap breaking down the folio into
> > SWAP_CRYPTO_BATCH_SIZE -sized batches, we pass all the pages to the
> > crypto layer and let it do the batching as it pleases?
>
> You provide as much (or little) as you're comfortable with.  Just
> treat the acomp API as one that can take as much as you want to
> give it.

In this case, it seems like the batch size is completely up to zswap,
and not necessarily dependent on the compressor. That being said,
Intel IAA will naturally prefer a batch size that maximizes the
parallelization.

How about this, we can define a fixed max batch size in zswap, to
provide a hard limit on the number of buffers we preallocate (e.g.
MAX_BATCH_SIZE). The compressors can provide zswap a hint with their
desired batch size (e.g. 8 for Intel IAA). Then zswap can allocate
min(MAX_BATCH_SIZE, compressor_batch_size).

Assuming software compressors provide 1 for the batch size, if
MAX_BATCH_SIZE is >= 8, Intel IAA gets the batching rate it wants, and
software compressors get the same behavior as today. This abstracts
the batch size needed by the compressor while making sure zswap does
not preallocate a ridiculous amount of memory.
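
Roughly, as a userspace sketch (the SKETCH_* names and the value 8 for the
cap are placeholders, not existing zswap symbols):

```c
#define SKETCH_MAX_BATCH_SIZE 8U	/* hypothetical zswap-side hard limit */

/* Number of buffers zswap would preallocate for a given compressor hint. */
static unsigned int sketch_zswap_batch_size(unsigned int compressor_batch_size)
{
	if (compressor_batch_size == 0)
		compressor_batch_size = 1;	/* no hint: behave as today */

	return compressor_batch_size < SKETCH_MAX_BATCH_SIZE ?
	       compressor_batch_size : SKETCH_MAX_BATCH_SIZE;
}
```

A software compressor hinting 1 gets 1, IAA hinting 8 gets 8, and a
compressor asking for, say, 32 is clamped to 8.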

Does this make sense to everyone or am I missing something?


* RE: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching.
  2024-12-04 22:35               ` Yosry Ahmed
@ 2024-12-04 22:49                 ` Sridhar, Kanchana P
  2024-12-04 22:55                   ` Yosry Ahmed
  0 siblings, 1 reply; 39+ messages in thread
From: Sridhar, Kanchana P @ 2024-12-04 22:49 UTC (permalink / raw)
  To: Yosry Ahmed, Herbert Xu
  Cc: Nhat Pham, linux-kernel, linux-mm, hannes, chengming.zhou,
	usamaarif642, ryan.roberts, ying.huang, 21cnbao, akpm,
	linux-crypto, davem, clabbe, ardb, ebiggers, surenb, Accardi,
	Kristen C, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, December 4, 2024 2:36 PM
> To: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; Nhat Pham
> <nphamcs@gmail.com>; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; ying.huang@intel.com;
> 21cnbao@gmail.com; akpm@linux-foundation.org;
> linux-crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> Kristen C <kristen.c.accardi@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> the crypto_alg supports batching.
> 
> On Tue, Dec 3, 2024 at 5:42 PM Herbert Xu <herbert@gondor.apana.org.au>
> wrote:
> >
> > On Tue, Dec 03, 2024 at 01:44:00PM -0800, Yosry Ahmed wrote:
> > >
> > > Does this mean that instead of zswap breaking down the folio into
> > > SWAP_CRYPTO_BATCH_SIZE -sized batches, we pass all the pages to the
> > > crypto layer and let it do the batching as it pleases?
> >
> > You provide as much (or little) as you're comfortable with.  Just
> > treat the acomp API as one that can take as much as you want to
> > give it.
> 
> In this case, it seems like the batch size is completely up to zswap,
> and not necessarily dependent on the compressor. That being said,
> Intel IAA will naturally prefer a batch size that maximizes the
> parallelization.
> 
> How about this, we can define a fixed max batch size in zswap, to
> provide a hard limit on the number of buffers we preallocate (e.g.
> MAX_BATCH_SIZE). The compressors can provide zswap a hint with their
> desired batch size (e.g. 8 for Intel IAA). Then zswap can allocate
> min(MAX_BATCH_SIZE, compressor_batch_size).
> 
> Assuming software compressors provide 1 for the batch size, if
> MAX_BATCH_SIZE is >= 8, Intel IAA gets the batching rate it wants, and
> software compressors get the same behavior as today. This abstracts
> the batch size needed by the compressor while making sure zswap does
> not preallocate a ridiculous amount of memory.
> 
> Does this make sense to everyone or am I missing something?

Thanks Yosry, this makes perfect sense. I can declare a default
CRYPTO_ACOMP_BATCH_SIZE=1, and a crypto API that zswap can
query, acomp_get_batch_size(struct crypto_acomp *tfm). This would
call a crypto algorithm interface if one is registered, e.g.
crypto_get_batch_size(), which IAA can register to return the max
batch size for IAA. If a compressor does not provide an
implementation for crypto_get_batch_size(), we would return
CRYPTO_ACOMP_BATCH_SIZE. This way, nothing specific will
need to be done for the software compressors for now. Unless
they define a specific batch_size via, say, another interface such as
crypto_set_batch_size(), acomp_get_batch_size() will return 1.
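
A rough userspace sketch of that fallback, with made-up stand-ins for the
crypto structures (struct sketch_acomp_alg is not the real struct acomp_alg,
and 8 is just the example IAA figure from this thread):

```c
#include <stddef.h>

#define SKETCH_ACOMP_BATCH_SIZE 1U	/* default, like CRYPTO_ACOMP_BATCH_SIZE */

/* Hypothetical stand-in for the algorithm ops. */
struct sketch_acomp_alg {
	unsigned int (*get_batch_size)(void);	/* optional, may be NULL */
};

/* Returns the algorithm's batch size, or the default if none is registered. */
static unsigned int
sketch_acomp_get_batch_size(const struct sketch_acomp_alg *alg)
{
	if (alg && alg->get_batch_size)
		return alg->get_batch_size();

	return SKETCH_ACOMP_BATCH_SIZE;
}

/* What a batching driver such as IAA might register. */
static unsigned int sketch_iaa_get_batch_size(void)
{
	return 8;
}
```

An algorithm that leaves get_batch_size unset reports 1, so software
compressors need no changes; one that registers sketch_iaa_get_batch_size()
reports 8.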

Thanks,
Kanchana


* Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching.
  2024-12-04 22:49                 ` Sridhar, Kanchana P
@ 2024-12-04 22:55                   ` Yosry Ahmed
  2024-12-04 23:12                     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 39+ messages in thread
From: Yosry Ahmed @ 2024-12-04 22:55 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: Herbert Xu, Nhat Pham, linux-kernel, linux-mm, hannes,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh

On Wed, Dec 4, 2024 at 2:49 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@google.com>
> > Sent: Wednesday, December 4, 2024 2:36 PM
> > To: Herbert Xu <herbert@gondor.apana.org.au>
> > Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; Nhat Pham
> > <nphamcs@gmail.com>; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; ryan.roberts@arm.com; ying.huang@intel.com;
> > 21cnbao@gmail.com; akpm@linux-foundation.org; linux-
> > crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> > ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> > Kristen C <kristen.c.accardi@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> > the crypto_alg supports batching.
> >
> > On Tue, Dec 3, 2024 at 5:42 PM Herbert Xu <herbert@gondor.apana.org.au>
> > wrote:
> > >
> > > On Tue, Dec 03, 2024 at 01:44:00PM -0800, Yosry Ahmed wrote:
> > > >
> > > > Does this mean that instead of zswap breaking down the folio into
> > > > SWAP_CRYPTO_BATCH_SIZE -sized batches, we pass all the pages to the
> > > > crypto layer and let it do the batching as it pleases?
> > >
> > > You provide as much (or little) as you're comfortable with.  Just
> > > treat the acomp API as one that can take as much as you want to
> > > give it.
> >
> > In this case, it seems like the batch size is completely up to zswap,
> > and not necessarily dependent on the compressor. That being said,
> > Intel IAA will naturally prefer a batch size that maximizes the
> > parallelization.
> >
> > How about this, we can define a fixed max batch size in zswap, to
> > provide a hard limit on the number of buffers we preallocate (e.g.
> > MAX_BATCH_SIZE). The compressors can provide zswap a hint with their
> > desired batch size (e.g. 8 for Intel IAA). Then zswap can allocate
> > min(MAX_BATCH_SIZE, compressor_batch_size).
> >
> > Assuming software compressors provide 1 for the batch size, if
> > MAX_BATCH_SIZE is >= 8, Intel IAA gets the batching rate it wants, and
> > software compressors get the same behavior as today. This abstracts
> > the batch size needed by the compressor while making sure zswap does
> > not preallocate a ridiculous amount of memory.
> >
> > Does this make sense to everyone or am I missing something?
>
> Thanks Yosry, this makes perfect sense. I can declare a default
> CRYPTO_ACOMP_BATCH_SIZE=1, and a crypto API that zswap can
> query, acomp_get_batch_size(struct crypto_acomp *tfm) that
> can call a crypto algorithm interface if it is registered, for e.g.
> crypto_get_batch_size() that IAA can register to return the max
> batch size for IAA. If a compressor does not provide an
> implementation for crypto_get_batch_size(), we would return
> CRYPTO_ACOMP_BATCH_SIZE. This way, nothing specific will
> need to be done for the software compressors for now. Unless
> they define a specific batch_size via say, another interface,
> crypto_set_batch_size(), the acomp_get_batch_size() will return 1.

I still think zswap should define its own maximum to avoid having the
compressors have complete control over the amount of memory that zswap
preallocates.

For the acomp stuff I will let Herbert decide what he thinks is best.
From the zswap side, I just want:
- A hard limit on the amount of memory we preallocate.
- No change for the software compressors.
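
To make the two requirements concrete, here is a rough user-space C
sketch of the clamp. DEMO_ZSWAP_MAX_BATCH_SIZE and
demo_zswap_nr_buffers() are made-up names for illustration, and the
limit of 8 is only an assumed value:

```c
#include <stddef.h>

/* Hypothetical zswap-side hard limit on preallocated buffers. */
#define DEMO_ZSWAP_MAX_BATCH_SIZE 8U

/*
 * Number of buffers zswap would preallocate for a given compressor hint:
 * min(max batch size, compressor batch size), never less than 1.
 */
static unsigned int demo_zswap_nr_buffers(unsigned int compressor_batch_size)
{
	if (compressor_batch_size == 0)
		compressor_batch_size = 1; /* treat "no hint" as no batching */

	return compressor_batch_size < DEMO_ZSWAP_MAX_BATCH_SIZE ?
	       compressor_batch_size : DEMO_ZSWAP_MAX_BATCH_SIZE;
}
```

A hint of 1 from a software compressor yields 1 buffer, as today, and
an oversized hint is capped at the zswap limit.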

>
> Thanks,
> Kanchana



* RE: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching.
  2024-12-04 22:55                   ` Yosry Ahmed
@ 2024-12-04 23:12                     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 39+ messages in thread
From: Sridhar, Kanchana P @ 2024-12-04 23:12 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Herbert Xu, Nhat Pham, linux-kernel, linux-mm, hannes,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P


> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, December 4, 2024 2:55 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>; Nhat Pham
> <nphamcs@gmail.com>; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; ying.huang@intel.com;
> 21cnbao@gmail.com; akpm@linux-foundation.org; linux-
> crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> Kristen C <kristen.c.accardi@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> the crypto_alg supports batching.
> 
> On Wed, Dec 4, 2024 at 2:49 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosryahmed@google.com>
> > > Sent: Wednesday, December 4, 2024 2:36 PM
> > > To: Herbert Xu <herbert@gondor.apana.org.au>
> > > Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; Nhat Pham
> > > > <nphamcs@gmail.com>; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > > hannes@cmpxchg.org; chengming.zhou@linux.dev;
> > > > usamaarif642@gmail.com; ryan.roberts@arm.com; ying.huang@intel.com;
> > > 21cnbao@gmail.com; akpm@linux-foundation.org; linux-
> > > crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> > > ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> > > Kristen C <kristen.c.accardi@intel.com>; Feghali, Wajdi K
> > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > > > Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> > > > the crypto_alg supports batching.
> > >
> > > > On Tue, Dec 3, 2024 at 5:42 PM Herbert Xu <herbert@gondor.apana.org.au>
> > > > wrote:
> > > >
> > > > On Tue, Dec 03, 2024 at 01:44:00PM -0800, Yosry Ahmed wrote:
> > > > >
> > > > > > Does this mean that instead of zswap breaking down the folio into
> > > > > > SWAP_CRYPTO_BATCH_SIZE -sized batches, we pass all the pages to the
> > > > > > crypto layer and let it do the batching as it pleases?
> > > >
> > > > You provide as much (or little) as you're comfortable with.  Just
> > > > treat the acomp API as one that can take as much as you want to
> > > > give it.
> > >
> > > In this case, it seems like the batch size is completely up to zswap,
> > > and not necessarily dependent on the compressor. That being said,
> > > Intel IAA will naturally prefer a batch size that maximizes the
> > > parallelization.
> > >
> > > How about this, we can define a fixed max batch size in zswap, to
> > > provide a hard limit on the number of buffers we preallocate (e.g.
> > > MAX_BATCH_SIZE). The compressors can provide zswap a hint with their
> > > desired batch size (e.g. 8 for Intel IAA). Then zswap can allocate
> > > min(MAX_BATCH_SIZE, compressor_batch_size).
> > >
> > > Assuming software compressors provide 1 for the batch size, if
> > > MAX_BATCH_SIZE is >= 8, Intel IAA gets the batching rate it wants, and
> > > software compressors get the same behavior as today. This abstracts
> > > the batch size needed by the compressor while making sure zswap does
> > > not preallocate a ridiculous amount of memory.
> > >
> > > Does this make sense to everyone or am I missing something?
> >
> > Thanks Yosry, this makes perfect sense. I can declare a default
> > CRYPTO_ACOMP_BATCH_SIZE=1, and a crypto API that zswap can
> > query, acomp_get_batch_size(struct crypto_acomp *tfm) that
> > can call a crypto algorithm interface if it is registered, for e.g.
> > crypto_get_batch_size() that IAA can register to return the max
> > batch size for IAA. If a compressor does not provide an
> > implementation for crypto_get_batch_size(), we would return
> > CRYPTO_ACOMP_BATCH_SIZE. This way, nothing specific will
> > need to be done for the software compressors for now. Unless
> > they define a specific batch_size via say, another interface,
> > crypto_set_batch_size(), the acomp_get_batch_size() will return 1.
> 
> I still think zswap should define its own maximum to avoid having the
> compressors have complete control over the amount of memory that zswap
> preallocates.

For sure, zswap should set the MAX_BATCH_SIZE for this purpose.

> 
> For the acomp stuff I will let Herbert decide what he thinks is best.
> From the zswap side, I just want:
> - A hard limit on the amount of memory we preallocate.
> - No change for the software compressors.

Sounds good!

Thanks,
Kanchana

> 
> >
> > Thanks,
> > Kanchana


* RE: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching.
  2024-12-03  0:30     ` Sridhar, Kanchana P
  2024-12-03  8:00       ` Herbert Xu
@ 2024-12-21  6:30       ` Sridhar, Kanchana P
  1 sibling, 0 replies; 39+ messages in thread
From: Sridhar, Kanchana P @ 2024-12-21  6:30 UTC (permalink / raw)
  To: Nhat Pham
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
	usamaarif642, ryan.roberts, ying.huang, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P

Hi Nhat,

> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Monday, December 2, 2024 4:31 PM
> To: Nhat Pham <nphamcs@gmail.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; ying.huang@intel.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>; Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com>
> Subject: RE: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> the crypto_alg supports batching.
> 
> Hi Nhat,
> 
> > -----Original Message-----
> > From: Nhat Pham <nphamcs@gmail.com>
> > Sent: Monday, December 2, 2024 11:16 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; yosryahmed@google.com;
> > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > ryan.roberts@arm.com; ying.huang@intel.com; 21cnbao@gmail.com;
> > akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> > herbert@gondor.apana.org.au; davem@davemloft.net;
> > clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> > surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> > > the crypto_alg supports batching.
> >
> > On Fri, Nov 22, 2024 at 11:01 PM Kanchana P Sridhar
> > <kanchana.p.sridhar@intel.com> wrote:
> > >
> > > This patch does the following:
> > >
> > > 1) Modifies the definition of "struct crypto_acomp_ctx" to represent a
> > >    configurable number of acomp_reqs and buffers. Adds a "nr_reqs" to
> > >    "struct crypto_acomp_ctx" to contain the nr of resources that will be
> > >    allocated in the cpu onlining code.
> > >
> > > > 2) The zswap_cpu_comp_prepare() cpu onlining code will detect if the
> > > >    crypto_acomp created for the pool (in other words, the zswap compression
> > > >    algorithm) has registered an implementation for batch_compress() and
> > > >    batch_decompress(). If so, it will set "nr_reqs" to
> > > >    SWAP_CRYPTO_BATCH_SIZE and allocate these many reqs/buffers, and set
> > > >    the acomp_ctx->nr_reqs accordingly. If the crypto_acomp does not support
> > > >    batching, "nr_reqs" defaults to 1.
> > >
> > > 3) Adds a "bool can_batch" to "struct zswap_pool" that step (2) will set to
> > >    true if the batching API are present for the crypto_acomp.
> >
> > Why do we need this "can_batch" field? IIUC, this can be determined
> > from the compressor internal fields itself, no?
> >
> > acomp_has_async_batching(acomp);
> >
> > > Is this just for convenience, or is this actually an expensive thing to
> > > compute?
> 
> Thanks for your comments. This is a good question. I tried not to imply that
> batching resources have been allocated for the cpu based only on what
> acomp_has_async_batching() returns. It is possible that the cpu onlining
> code ran into an -ENOMEM error on any particular cpu. In this case, I set
> the pool->can_batch to "false", mainly for convenience, so that zswap
> can be somewhat insulated from migration. I agree that this may not be
> the best solution; and whether or not batching is enabled can be directly
> determined just before the call to crypto_acomp_batch_compress()
> based on:
> 
> acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE;
> 
> I currently have a BUG_ON() for this condition not being met, that relies
> on the pool->can_batch gating the flow to get to zswap_batch_compress().
> 
> > I think a better solution would be to check for having
> > SWAP_CRYPTO_BATCH_SIZE # of acomp_ctx resources right after we acquire
> > the acomp_ctx->mutex and before the call to
> > crypto_acomp_batch_compress(). If so, we proceed, and if not, we call
> > crypto_acomp_compress(). It seems this might be the only way to know
> > for sure whether the crypto batching API can be called, given that
> > migration is possible at any point in zswap_store(). Once we have
> > obtained the mutex_lock, it seems we can proceed with batching based
> > on this check (although the UAF situation remains as a larger issue,
> > beyond the scope of this patch). I would appreciate other ideas as well.
> 
> > Also, I have submitted a patch-series [1] with Yosry's & Johannes' suggestions
> > to this series. This is setting up a consolidated
> > zswap_store()/zswap_store_pages() code path for batching and non-batching
> > compressors. My goal is for [1] to go through code reviews and be able to
> > transition to batching, with a simple check:
> 
> if (acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE)
>          zswap_batch_compress();
> else
>          zswap_compress();
> 
> Please feel free to provide code review comments in [1]. Thanks!
> 
> [1]: https://patchwork.kernel.org/project/linux-mm/list/?series=912937
> 
> >
> > >
> > > > SWAP_CRYPTO_BATCH_SIZE is set to 8, which will be the IAA compress batching
> >
> > I like a sane default value as much as the next guy, but this seems a
> > bit odd to me:
> >
> > 1. The placement of this constant/default value seems strange to me.
> > This is a compressor-specific value no? Why are we enforcing this
> > batching size at the zswap level, and uniformly at that? What if we
> > introduce a new batch compression algorithm...? Or am I missing
> > something, and this is a sane default for other compressors too?
> 
> You bring up an excellent point. This is a compressor-specific value.
> Instead of setting this up as a constant, which as you correctly observe,
> may not make sense for a non-IAA compressor, one way to get
> this could be by querying the compressor, say:
> 
> int acomp_get_max_batchsize(struct crypto_acomp *tfm) {...};
> 
> to then allocate sufficient acomp_reqs/buffers/etc. in the zswap
> cpu onlining code.
> 
> >
> > 2. Why is this value set to 8? Experimentation? Could you add some
> > justification in documentation?
> 
> Can I get back to you later this week with a proposal for this? We plan
> to have a team discussion on how best to approach this for current
> and future hardware.

Sorry it took me quite a while to get back to you on this. I have been busy
with implementing request chaining, and other major improvements to this
series based on the comments received thus far.

I will be submitting a v5 of this series shortly, in which I have implemented
an IAA_CRYPTO_MAX_BATCH_SIZE in the iaa_crypto driver. For now I set this
to 8 since we have done all our testing with a batch size of 8, but we are still
running experiments to figure this out, hence this #define in the iaa_crypto
driver (in v5) can potentially change. Further, there is a zswap-specific
ZSWAP_MAX_BATCH_SIZE in v5, which is also 8. I would appreciate code
review comments for v5. If the approach I've taken in v5 is acceptable, I
will add more details/justification in the documentation in a v6.

Thanks,
Kanchana

> 
> Thanks,
> Kanchana



* [PATCH v4 10/10] mm: zswap: Compress batching with Intel IAA in zswap_batch_store() of large folios.
  2024-11-23  7:01 [PATCH v4 00/10] zswap IAA compress batching Kanchana P Sridhar
                   ` (8 preceding siblings ...)
  2024-11-23  7:01 ` [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching Kanchana P Sridhar
@ 2024-11-23  7:01 ` Kanchana P Sridhar
  2024-11-25  8:00   ` kernel test robot
  2024-11-25 20:20   ` Yosry Ahmed
  9 siblings, 2 replies; 39+ messages in thread
From: Kanchana P Sridhar @ 2024-11-23  7:01 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch adds two new zswap APIs:

 1) bool zswap_can_batch(void);
 2) void zswap_batch_store(struct folio_batch *batch, int *errors);

Higher level mm code, for instance, swap_writepage(), can query if the
current zswap pool supports batching, by calling zswap_can_batch(). If so,
it can invoke zswap_batch_store() instead of zswap_store() to swap out a
large folio to zswap with its pages compressed in batches.

Hence, on systems with Intel IAA hardware compress/decompress accelerators,
swap_writepage() will invoke zswap_batch_store() for large folios.

zswap_batch_store() will call crypto_acomp_batch_compress() to compress up
to SWAP_CRYPTO_BATCH_SIZE (i.e. 8) pages in large folios in parallel using
the multiple compress engines available in IAA.

On platforms with multiple IAA devices per package, compress jobs from all
cores in a package will be distributed among all IAA devices in the package
by the iaa_crypto driver.

The newly added zswap_batch_store() follows the general structure of
zswap_store(). Some restructuring and optimization is done to minimize
failure points for a batch, fail early, and maximize the zswap store
pipeline occupancy with SWAP_CRYPTO_BATCH_SIZE pages, potentially from
multiple folios in the future. This is intended to maximize reclaim
throughput with the IAA hardware parallel compressions.
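
As a rough illustration of the sub-batching arithmetic only (user-space
C, not code from this patch; DEMO_BATCH_SIZE and the demo_* helpers are
hypothetical names), a folio's pages can be carved into sub-batches of
up to 8 pages like this:

```c
#include <stddef.h>

/* Assumed sub-batch size, matching SWAP_CRYPTO_BATCH_SIZE == 8. */
#define DEMO_BATCH_SIZE 8U

/* Number of sub-batches needed to cover nr_pages pages. */
static unsigned int demo_nr_sub_batches(unsigned int nr_pages)
{
	return (nr_pages + DEMO_BATCH_SIZE - 1) / DEMO_BATCH_SIZE;
}

/* Pages in sub-batch idx (0-based); the last sub-batch may be partial. */
static unsigned int demo_sub_batch_len(unsigned int nr_pages, unsigned int idx)
{
	unsigned int start = idx * DEMO_BATCH_SIZE;
	unsigned int left;

	if (start >= nr_pages)
		return 0;

	left = nr_pages - start;
	return left < DEMO_BATCH_SIZE ? left : DEMO_BATCH_SIZE;
}
```

For example, a 16-page folio maps to two full sub-batches, while a
10-page folio maps to one full sub-batch followed by a partial one of
2 pages.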

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 include/linux/zswap.h |  12 +
 mm/page_io.c          |  16 +-
 mm/zswap.c            | 639 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 666 insertions(+), 1 deletion(-)

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 9ad27ab3d222..a05f59139a6e 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -4,6 +4,7 @@
 
 #include <linux/types.h>
 #include <linux/mm_types.h>
+#include <linux/pagevec.h>
 
 struct lruvec;
 
@@ -33,6 +34,8 @@ struct zswap_lruvec_state {
 
 unsigned long zswap_total_pages(void);
 bool zswap_store(struct folio *folio);
+bool zswap_can_batch(void);
+void zswap_batch_store(struct folio_batch *batch, int *errors);
 bool zswap_load(struct folio *folio);
 void zswap_invalidate(swp_entry_t swp);
 int zswap_swapon(int type, unsigned long nr_pages);
@@ -51,6 +54,15 @@ static inline bool zswap_store(struct folio *folio)
 	return false;
 }
 
+static inline bool zswap_can_batch(void)
+{
+	return false;
+}
+
+static inline void zswap_batch_store(struct folio_batch *batch, int *errors)
+{
+}
+
 static inline bool zswap_load(struct folio *folio)
 {
 	return false;
diff --git a/mm/page_io.c b/mm/page_io.c
index 4b4ea8e49cf6..271d3a40c0c1 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -276,7 +276,21 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 		 */
 		swap_zeromap_folio_clear(folio);
 	}
-	if (zswap_store(folio)) {
+
+	if (folio_test_large(folio) && zswap_can_batch()) {
+		struct folio_batch batch;
+		int error = -1;
+
+		folio_batch_init(&batch);
+		folio_batch_add(&batch, folio);
+		zswap_batch_store(&batch, &error);
+
+		if (!error) {
+			count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
+			folio_unlock(folio);
+			return 0;
+		}
+	} else if (zswap_store(folio)) {
 		count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
 		folio_unlock(folio);
 		return 0;
diff --git a/mm/zswap.c b/mm/zswap.c
index 173f7632990e..53c8e39b778b 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -229,6 +229,80 @@ static DEFINE_MUTEX(zswap_init_lock);
 /* init completed, but couldn't create the initial pool */
 static bool zswap_has_pool;
 
+/*
+ * struct zswap_batch_store_sub_batch:
+ *
+ * This represents a sub-batch of SWAP_CRYPTO_BATCH_SIZE pages during IAA
+ * compress batching of a folio or (conceptually, a reclaim batch of) folios.
+ * The new zswap_batch_store() API will break down the batch of folios being
+ * reclaimed into sub-batches of SWAP_CRYPTO_BATCH_SIZE pages, batch compress
+ * the pages by calling the iaa_crypto driver API crypto_acomp_batch_compress(),
+ * and store the sub-batch in zpool/xarray before updating objcg/vm/zswap
+ * stats.
+ *
+ * Although the page itself is represented directly, the structure adds a
+ * "u8 folio_id" to represent an index for the folio in a conceptual
+ * "reclaim batch of folios" that can be passed to zswap_batch_store().
+ * Conceptually, a u8 allows for up to 256 folios in such a batch.
+ * Even though the folio_id seems redundant in the context of a single large
+ * folio being stored by zswap, it simplifies error handling and avoids
+ * redundant computes/rewinding state, all of which can add latency. Since the
+ * zswap_batch_store() of a large folio can fail for any of these reasons --
+ * compress errors, zpool malloc errors, xarray store errors -- the procedures
+ * that detect these errors for a sub-batch, can all call a single cleanup
+ * procedure, zswap_batch_cleanup(), which will de-allocate zpool memory and
+ * zswap_entries for the sub-batch and set the "errors[folio_id]" to -EINVAL.
+ * All subsequent procedures that operate on a sub-batch will do nothing if the
+ * errors[folio_id] is non-0. Hence, the folio_id facilitates the use of the
+ * "errors" passed to zswap_batch_store() as a global folio error status for a
+ * single folio (which could also be a folio in the folio_batch).
+ *
+ * The sub-batch concept could be further evolved to use pipelining to
+ * overlap CPU computes with IAA computes. For instance, we could stage
+ * the post-compress computes for sub-batch "N-1" to happen in parallel with
+ * IAA batch compression of sub-batch "N".
+ *
+ * We begin by developing the concept of compress batching. Pipelining with
+ * overlap can be future work.
+ *
+ * @pages: The individual pages in the sub-batch. There are no assumptions
+ *         about all of them belonging to the same folio.
+ * @dsts: The destination buffers for batch compress of the sub-batch.
+ * @dlens: The destination length constraints, and eventual compressed lengths
+ *         of successful compressions.
+ * @comp_errors: The compress error status for each page in the sub-batch, set
+ *               by crypto_acomp_batch_compress().
+ * @folio_ids: The containing folio_id of each sub-batch page.
+ * @swpentries: The page_swap_entry() for each corresponding sub-batch page.
+ * @objcgs: The objcg for each corresponding sub-batch page.
+ * @entries: The zswap_entry for each corresponding sub-batch page.
+ * @nr_pages: Total number of pages in @sub_batch.
+ * @pool: A valid zswap_pool that can_batch.
+ *
+ * Note:
+ * The max sub-batch size is SWAP_CRYPTO_BATCH_SIZE, currently 8UL.
+ * If SWAP_CRYPTO_BATCH_SIZE ever exceeds 255, @nr_pages needs to become u16.
+ * The sub-batch representation is future-proofed to a small extent to be able
+ * to easily scale the zswap_batch_store() implementation to handle a conceptual
+ * "reclaim batch of folios", without adding too much complexity, while
+ * benefiting from simpler error handling, localized sub-batch resources cleanup
+ * and avoiding expensive rewinding state. If this conceptual number of reclaim
+ * folios sent to zswap_batch_store() exceeds 256, @folio_ids needs to
+ * become u16.
+ */
+struct zswap_batch_store_sub_batch {
+	struct page *pages[SWAP_CRYPTO_BATCH_SIZE];
+	u8 *dsts[SWAP_CRYPTO_BATCH_SIZE];
+	unsigned int dlens[SWAP_CRYPTO_BATCH_SIZE];
+	int comp_errors[SWAP_CRYPTO_BATCH_SIZE]; /* per-page compress status. */
+	u8 folio_ids[SWAP_CRYPTO_BATCH_SIZE];
+	swp_entry_t swpentries[SWAP_CRYPTO_BATCH_SIZE];
+	struct obj_cgroup *objcgs[SWAP_CRYPTO_BATCH_SIZE];
+	struct zswap_entry *entries[SWAP_CRYPTO_BATCH_SIZE];
+	u8 nr_pages;
+	struct zswap_pool *pool;
+};
+
 /*********************************
 * helpers and fwd declarations
 **********************************/
@@ -1705,6 +1779,571 @@ void zswap_invalidate(swp_entry_t swp)
 		zswap_entry_free(entry);
 }
 
+/******************************************************
+ * zswap_batch_store() with compress batching.
+ ******************************************************/
+
+/*
+ * Note: If SWAP_CRYPTO_BATCH_SIZE exceeds 255, change the
+ * u8 stack variables in the next several functions to u16.
+ */
+bool zswap_can_batch(void)
+{
+	struct zswap_pool *pool;
+	bool ret = false;
+
+	pool = zswap_pool_current_get();
+
+	if (!pool)
+		return ret;
+
+	if (pool->can_batch)
+		ret = true;
+
+	zswap_pool_put(pool);
+
+	return ret;
+}
+
+/*
+ * If the zswap store fails or zswap is disabled, we must invalidate
+ * the possibly stale entries which were previously stored at the
+ * offsets corresponding to each page of the folio. Otherwise,
+ * writeback could overwrite the new data in the swapfile.
+ */
+static void zswap_delete_stored_entries(struct folio *folio)
+{
+	swp_entry_t swp = folio->swap;
+	unsigned type = swp_type(swp);
+	pgoff_t offset = swp_offset(swp);
+	struct zswap_entry *entry;
+	struct xarray *tree;
+	long index;
+
+	for (index = 0; index < folio_nr_pages(folio); ++index) {
+		tree = swap_zswap_tree(swp_entry(type, offset + index));
+		entry = xa_erase(tree, offset + index);
+		if (entry)
+			zswap_entry_free(entry);
+	}
+}
+
+static __always_inline void zswap_batch_reset(struct zswap_batch_store_sub_batch *sb)
+{
+	sb->nr_pages = 0;
+}
+
+/*
+ * Upon encountering the first sub-batch page in a folio with an error due to
+ * any of the following:
+ *  - compression
+ *  - zpool malloc
+ *  - xarray store
+ * clean up the sub-batch resources (zpool memory, zswap_entry) for all other
+ * sub-batch elements belonging to the same folio, using the "error_folio_id".
+ *
+ * Set "errors[error_folio_id]" to signify to all downstream computes in
+ * zswap_batch_store(), that no further processing is required for the folio
+ * with "error_folio_id" in the batch: this folio's zswap store status will
+ * be considered an error, and existing zswap_entries in the xarray will be
+ * deleted before zswap_batch_store() exits.
+ */
+static void zswap_batch_cleanup(struct zswap_batch_store_sub_batch *sb,
+				int *errors,
+				u8 error_folio_id)
+{
+	u8 i;
+
+	if (errors[error_folio_id])
+		return;
+
+	for (i = 0; i < sb->nr_pages; ++i) {
+		if (sb->folio_ids[i] == error_folio_id) {
+			if (sb->entries[i]) {
+				if (!IS_ERR_VALUE(sb->entries[i]->handle))
+					zpool_free(sb->pool->zpool, sb->entries[i]->handle);
+
+				zswap_entry_cache_free(sb->entries[i]);
+				sb->entries[i] = NULL;
+			}
+		}
+	}
+
+	errors[error_folio_id] = -EINVAL;
+}
+
+/*
+ * Returns true if the entry was successfully
+ * stored in the xarray, and false otherwise.
+ */
+static bool zswap_store_entry(swp_entry_t page_swpentry, struct zswap_entry *entry)
+{
+	struct zswap_entry *old = xa_store(swap_zswap_tree(page_swpentry),
+					   swp_offset(page_swpentry),
+					   entry, GFP_KERNEL);
+	if (xa_is_err(old)) {
+		int err = xa_err(old);
+
+		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
+		zswap_reject_alloc_fail++;
+		return false;
+	}
+
+	/*
+	 * We may have had an existing entry that became stale when
+	 * the folio was redirtied and now the new version is being
+	 * swapped out. Get rid of the old.
+	 */
+	if (old)
+		zswap_entry_free(old);
+
+	return true;
+}
+
+/*
+ * The stats accounting makes no assumptions about all pages in the sub-batch
+ * belonging to the same folio, or having the same objcg; while still doing
+ * the updates in aggregation.
+ */
+static void zswap_batch_xarray_stats(struct zswap_batch_store_sub_batch *sb,
+				     int *errors)
+{
+	int nr_objcg_pages = 0, nr_pages = 0;
+	struct obj_cgroup *objcg = NULL;
+	size_t compressed_bytes = 0;
+	u8 i;
+
+	for (i = 0; i < sb->nr_pages; ++i) {
+		if (errors[sb->folio_ids[i]])
+			continue;
+
+		if (!zswap_store_entry(sb->swpentries[i], sb->entries[i])) {
+			zswap_batch_cleanup(sb, errors, sb->folio_ids[i]);
+			continue;
+		}
+
+		/*
+		 * The entry is successfully compressed and stored in the tree,
+		 * there is no further possibility of failure. Grab refs to the
+		 * pool and objcg. These refs will be dropped by
+		 * zswap_entry_free() when the entry is removed from the tree.
+		 */
+		zswap_pool_get(sb->pool);
+		if (sb->objcgs[i])
+			obj_cgroup_get(sb->objcgs[i]);
+
+		/*
+		 * We finish initializing the entry while it's already in xarray.
+		 * This is safe because:
+		 *
+		 * 1. Concurrent stores and invalidations are excluded by folio
+		 *    lock.
+		 *
+		 * 2. Writeback is excluded by the entry not being on the LRU yet.
+		 *    The publishing order matters to prevent writeback from seeing
+		 *    an incoherent entry.
+		 */
+		sb->entries[i]->pool = sb->pool;
+		sb->entries[i]->swpentry = sb->swpentries[i];
+		sb->entries[i]->objcg = sb->objcgs[i];
+		sb->entries[i]->referenced = true;
+		if (sb->entries[i]->length) {
+			INIT_LIST_HEAD(&(sb->entries[i]->lru));
+			zswap_lru_add(&zswap_list_lru, sb->entries[i]);
+		}
+
+		if (!objcg && sb->objcgs[i]) {
+			objcg = sb->objcgs[i];
+		} else if (objcg && sb->objcgs[i] && (objcg != sb->objcgs[i])) {
+			obj_cgroup_charge_zswap(objcg, compressed_bytes);
+			count_objcg_events(objcg, ZSWPOUT, nr_objcg_pages);
+			compressed_bytes = 0;
+			nr_objcg_pages = 0;
+			objcg = sb->objcgs[i];
+		}
+
+		if (sb->objcgs[i]) {
+			compressed_bytes += sb->entries[i]->length;
+			++nr_objcg_pages;
+		}
+
+		++nr_pages;
+	} /* for sub-batch pages. */
+
+	if (objcg) {
+		obj_cgroup_charge_zswap(objcg, compressed_bytes);
+		count_objcg_events(objcg, ZSWPOUT, nr_objcg_pages);
+	}
+
+	atomic_long_add(nr_pages, &zswap_stored_pages);
+	count_vm_events(ZSWPOUT, nr_pages);
+}
+
+static void zswap_batch_zpool_store(struct zswap_batch_store_sub_batch *sb,
+				    int *errors)
+{
+	u8 i;
+
+	for (i = 0; i < sb->nr_pages; ++i) {
+		struct zpool *zpool;
+		unsigned long handle;
+		char *buf;
+		gfp_t gfp;
+		int err;
+
+		/* Skip pages belonging to folios that had compress errors. */
+		if (errors[sb->folio_ids[i]])
+			continue;
+
+		zpool = sb->pool->zpool;
+		gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
+		if (zpool_malloc_support_movable(zpool))
+			gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
+		err = zpool_malloc(zpool, sb->dlens[i], gfp, &handle);
+
+		if (err) {
+			if (err == -ENOSPC)
+				zswap_reject_compress_poor++;
+			else
+				zswap_reject_alloc_fail++;
+
+			/*
+			 * A zpool malloc error should trigger cleanup for
+			 * other same-folio pages in the sub-batch, and zpool
+			 * resources/zswap_entries for those pages should be
+			 * de-allocated.
+			 */
+			zswap_batch_cleanup(sb, errors, sb->folio_ids[i]);
+			continue;
+		}
+
+		buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
+		memcpy(buf, sb->dsts[i], sb->dlens[i]);
+		zpool_unmap_handle(zpool, handle);
+
+		sb->entries[i]->handle = handle;
+		sb->entries[i]->length = sb->dlens[i];
+	}
+}
+
+static void zswap_batch_proc_comp_errors(struct zswap_batch_store_sub_batch *sb,
+					 int *errors)
+{
+	u8 i;
+
+	for (i = 0; i < sb->nr_pages; ++i) {
+		if (sb->comp_errors[i]) {
+			if (sb->comp_errors[i] == -ENOSPC)
+				zswap_reject_compress_poor++;
+			else
+				zswap_reject_compress_fail++;
+
+			if (!errors[sb->folio_ids[i]])
+				zswap_batch_cleanup(sb, errors, sb->folio_ids[i]);
+		}
+	}
+}
+
+/*
+ * Batch compress up to SWAP_CRYPTO_BATCH_SIZE pages with IAA.
+ * It is important to note that the SWAP_CRYPTO_BATCH_SIZE
+ * resources are allocated for the pool's per-cpu acomp_ctx during cpu
+ * hotplug only if the crypto_acomp has registered both
+ * batch_compress() and batch_decompress().
+ * The iaa_crypto driver registers implementations for both these API.
+ * Hence, if IAA is the zswap compressor, the call to
+ * crypto_acomp_batch_compress() will compress the pages in parallel,
+ * resulting in significant performance improvements as compared to
+ * software compressors.
+ */
+static void zswap_batch_compress(struct zswap_batch_store_sub_batch *sb,
+				 int *errors)
+{
+	struct crypto_acomp_ctx *acomp_ctx = raw_cpu_ptr(sb->pool->acomp_ctx);
+	u8 i;
+
+	mutex_lock(&acomp_ctx->mutex);
+
+	BUG_ON(acomp_ctx->nr_reqs != SWAP_CRYPTO_BATCH_SIZE);
+
+	for (i = 0; i < sb->nr_pages; ++i) {
+		sb->dsts[i] = acomp_ctx->buffers[i];
+		sb->dlens[i] = PAGE_SIZE;
+	}
+
+	/*
+	 * Batch compress sub-batch "N". If IAA is the compressor, the
+	 * hardware will compress multiple pages in parallel.
+	 */
+	crypto_acomp_batch_compress(
+		acomp_ctx->reqs,
+		&acomp_ctx->wait,
+		sb->pages,
+		sb->dsts,
+		sb->dlens,
+		sb->comp_errors,
+		sb->nr_pages);
+
+	/*
+	 * Scan the sub-batch for any compression errors,
+	 * and invalidate pages with errors, along with other
+	 * pages belonging to the same folio as the error page(s).
+	 * Set the folio's error status in "errors" so that no
+	 * further zswap_batch_store() processing is done for
+	 * the folio(s) with compression errors.
+	 */
+	zswap_batch_proc_comp_errors(sb, errors);
+
+	zswap_batch_zpool_store(sb, errors);
+
+	mutex_unlock(&acomp_ctx->mutex);
+}
+
+static void zswap_batch_add_pages(struct zswap_batch_store_sub_batch *sb,
+				  struct folio *folio,
+				  u8 folio_id,
+				  struct obj_cgroup *objcg,
+				  struct zswap_entry *entries[],
+				  long start_idx,
+				  u8 nr)
+{
+	long index;
+
+	for (index = start_idx; index < (start_idx + nr); ++index) {
+		u8 i = sb->nr_pages;
+		struct page *page = folio_page(folio, index);
+		sb->pages[i] = page;
+		sb->swpentries[i] = page_swap_entry(page);
+		sb->folio_ids[i] = folio_id;
+		sb->objcgs[i] = objcg;
+		sb->entries[i] = entries[index - start_idx];
+		sb->comp_errors[i] = 0;
+		++sb->nr_pages;
+	}
+}
+
+/* Allocate entries for the next sub-batch. */
+static int zswap_batch_alloc_entries(struct zswap_entry *entries[], int node_id, u8 nr)
+{
+	u8 i;
+
+	for (i = 0; i < nr; ++i) {
+		entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
+		if (!entries[i]) {
+			u8 j;
+
+			zswap_reject_kmemcache_fail++;
+			for (j = 0; j < i; ++j)
+				zswap_entry_cache_free(entries[j]);
+			return -EINVAL;
+		}
+
+		entries[i]->handle = (unsigned long)ERR_PTR(-EINVAL);
+	}
+
+	return 0;
+}
+
+static bool zswap_batch_comp_folio(struct folio *folio, int *errors, u8 folio_id,
+				   struct obj_cgroup *objcg,
+				   struct zswap_batch_store_sub_batch *sub_batch,
+				   bool last)
+{
+	long folio_start_idx = 0, nr_folio_pages = folio_nr_pages(folio);
+	struct zswap_entry *entries[SWAP_CRYPTO_BATCH_SIZE];
+	int node_id = folio_nid(folio);
+
+	/*
+	 * Iterate over the pages in the folio passed in. Construct compress
+	 * sub-batches of up to SWAP_CRYPTO_BATCH_SIZE pages. Process each
+	 * sub-batch with IAA batch compression. Detect errors from batch
+	 * compression and set the folio's error status.
+	 */
+	while (nr_folio_pages > 0) {
+		u8 add_nr_pages;
+
+		/*
+		 * If we have accumulated SWAP_CRYPTO_BATCH_SIZE
+		 * pages, process the sub-batch.
+		 */
+		if (sub_batch->nr_pages == SWAP_CRYPTO_BATCH_SIZE) {
+			zswap_batch_compress(sub_batch, errors);
+			zswap_batch_xarray_stats(sub_batch, errors);
+			zswap_batch_reset(sub_batch);
+			/*
+			 * Stop processing this folio if it had compress errors.
+			 */
+			if (errors[folio_id])
+				goto ret_folio;
+		}
+
+		/* Add pages from the folio to the compress sub-batch. */
+		add_nr_pages = min3((
+				(long)SWAP_CRYPTO_BATCH_SIZE -
+				(long)sub_batch->nr_pages),
+				nr_folio_pages,
+				(long)SWAP_CRYPTO_BATCH_SIZE);
+
+		/*
+		 * Allocate zswap entries for this sub-batch. If we get errors
+		 * while doing so, we can fail early and flag an error for the
+		 * folio.
+		 */
+		if (zswap_batch_alloc_entries(entries, node_id, add_nr_pages)) {
+			zswap_batch_reset(sub_batch);
+			errors[folio_id] = -EINVAL;
+			goto ret_folio;
+		}
+
+		zswap_batch_add_pages(sub_batch, folio, folio_id, objcg,
+				      entries, folio_start_idx, add_nr_pages);
+
+		nr_folio_pages -= add_nr_pages;
+		folio_start_idx += add_nr_pages;
+	} /* this folio has pages to be compressed. */
+
+	/*
+	 * Process last sub-batch: it could contain pages from multiple folios.
+	 */
+	if (last && sub_batch->nr_pages) {
+		zswap_batch_compress(sub_batch, errors);
+		zswap_batch_xarray_stats(sub_batch, errors);
+	}
+
+ret_folio:
+	return (!errors[folio_id]);
+}
+
+/*
+ * Store a large folio and/or a batch of any-order folio(s) in zswap
+ * using IAA compress batching API.
+ *
+ * This is the main procedure for batching within large folios and for
+ * batching of folios. Each large folio will be broken into sub-batches of
+ * SWAP_CRYPTO_BATCH_SIZE pages, the sub-batch pages will be compressed by
+ * IAA hardware compress engines in parallel, then stored in zpool/xarray.
+ *
+ * This procedure should only be called if zswap supports batching of stores.
+ * Otherwise, the sequential implementation for storing folios as in the
+ * current zswap_store() should be used. The code handles the unlikely event
+ * that the zswap pool changes from batching to non-batching between
+ * swap_writepage() and the start of zswap_batch_store().
+ *
+ * The signature of this procedure is meant to allow the calling function,
+ * (for instance, swap_writepage()) to pass a batch of folios @batch
+ * (the "reclaim batch") to be stored in zswap.
+ *
+ * @batch and @errors have folio_batch_count(@batch) number of entries,
+ * with one-one correspondence (@errors[i] represents the error status of
+ * @batch->folios[i], for i in folio_batch_count(@batch)). Please also
+ * see comments preceding "struct zswap_batch_store_sub_batch" definition
+ * above.
+ *
+ * The calling function (for instance, swap_writepage()) should initialize
+ * @errors[i] to a non-0 value.
+ * If zswap successfully stores @batch->folios[i], it will set @errors[i] to 0.
+ * If there is an error in zswap, it will set @errors[i] to -EINVAL.
+ *
+ * @batch: folio_batch of folios to be batch compressed.
+ * @errors: zswap_batch_store() error status for the folios in @batch.
+ */
+void zswap_batch_store(struct folio_batch *batch, int *errors)
+{
+	struct zswap_batch_store_sub_batch sub_batch;
+	struct zswap_pool *pool;
+	u8 i;
+
+	/*
+	 * If zswap is disabled, we must invalidate the possibly stale entry
+	 * which was previously stored at this offset. Otherwise, writeback
+	 * could overwrite the new data in the swapfile.
+	 */
+	if (!zswap_enabled)
+		goto check_old;
+
+	pool = zswap_pool_current_get();
+
+	if (!pool) {
+		if (zswap_check_limits())
+			queue_work(shrink_wq, &zswap_shrink_work);
+		goto check_old;
+	}
+
+	if (!pool->can_batch) {
+		for (i = 0; i < folio_batch_count(batch); ++i)
+			if (zswap_store(batch->folios[i]))
+				errors[i] = 0;
+			else
+				errors[i] = -EINVAL;
+		/*
+		 * Seems preferable to release the pool ref after the calls to
+		 * zswap_store(), so that the non-batching pool cannot be
+		 * deleted, can be used for sequential stores, and the zswap pool
+		 * cannot morph into a batching pool.
+		 */
+		zswap_pool_put(pool);
+		return;
+	}
+
+	zswap_batch_reset(&sub_batch);
+	sub_batch.pool = pool;
+
+	for (i = 0; i < folio_batch_count(batch); ++i) {
+		struct folio *folio = batch->folios[i];
+		struct obj_cgroup *objcg = NULL;
+		struct mem_cgroup *memcg = NULL;
+		bool ret;
+
+		VM_WARN_ON_ONCE(!folio_test_locked(folio));
+		VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
+
+		objcg = get_obj_cgroup_from_folio(folio);
+		if (objcg && !obj_cgroup_may_zswap(objcg)) {
+			memcg = get_mem_cgroup_from_objcg(objcg);
+			if (shrink_memcg(memcg)) {
+				mem_cgroup_put(memcg);
+				goto put_objcg;
+			}
+			mem_cgroup_put(memcg);
+		}
+
+		if (zswap_check_limits())
+			goto put_objcg;
+
+		if (objcg) {
+			memcg = get_mem_cgroup_from_objcg(objcg);
+			if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
+				mem_cgroup_put(memcg);
+				goto put_objcg;
+			}
+			mem_cgroup_put(memcg);
+		}
+
+		/*
+		 * By default, set zswap error status in "errors" to "success"
+		 * for use in swap_writepage() when this returns. In case of
+		 * errors encountered in any sub-batch in which this folio's
+		 * pages are batch-compressed, a negative error number will
+		 * over-write this when zswap_batch_cleanup() is called.
+		 */
+		errors[i] = 0;
+		ret = zswap_batch_comp_folio(folio, errors, i, objcg, &sub_batch,
+					     (i == folio_batch_count(batch) - 1));
+
+put_objcg:
+		obj_cgroup_put(objcg);
+		if (!ret && zswap_pool_reached_full)
+			queue_work(shrink_wq, &zswap_shrink_work);
+	} /* for batch folios */
+
+	zswap_pool_put(pool);
+
+check_old:
+	for (i = 0; i < folio_batch_count(batch); ++i)
+		if (errors[i])
+			zswap_delete_stored_entries(batch->folios[i]);
+}
+
 int zswap_swapon(int type, unsigned long nr_pages)
 {
 	struct xarray *trees, *tree;
-- 
2.27.0
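
The sub-batch accounting in zswap_batch_comp_folio() above can be modelled in user space. The following is only a sketch under stated assumptions: the batch size is taken as 8 (the value the series quotes for SWAP_CRYPTO_BATCH_SIZE), and struct sub_batch_model, model_flush() and model_add_folio() are hypothetical names that do not exist in the patch. It tracks nothing but the fill level and the number of times a sub-batch would be handed to the compressor.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_SIZE 8 /* assumed stand-in for SWAP_CRYPTO_BATCH_SIZE */

/* Hypothetical model of a sub-batch: only the fill level matters here. */
struct sub_batch_model {
	unsigned int nr_pages;   /* pages currently queued in the sub-batch */
	unsigned int nr_flushes; /* times the sub-batch was "compressed" */
};

static long min_long(long a, long b)
{
	return a < b ? a : b;
}

/* Models zswap_batch_compress() + stats + reset for a non-empty sub-batch. */
static void model_flush(struct sub_batch_model *sb)
{
	if (sb->nr_pages) {
		sb->nr_flushes++;
		sb->nr_pages = 0;
	}
}

/*
 * Feed one folio of nr_folio_pages pages into the sub-batch. A full
 * sub-batch is flushed before more pages are added; a partial one is
 * flushed only when this is the last folio of the reclaim batch.
 */
static void model_add_folio(struct sub_batch_model *sb, long nr_folio_pages,
			    int last)
{
	assert(sb != NULL);

	while (nr_folio_pages > 0) {
		long room, add;

		if (sb->nr_pages == BATCH_SIZE)
			model_flush(sb);

		/* room never exceeds BATCH_SIZE, so a two-way min suffices. */
		room = (long)BATCH_SIZE - (long)sb->nr_pages;
		add = min_long(room, nr_folio_pages);

		sb->nr_pages += (unsigned int)add;
		nr_folio_pages -= add;
	}

	if (last)
		model_flush(sb);
}
```

In this model a 16-page folio passed with last set produces two flushes of 8 pages each. Because the remaining room can never exceed the batch size, the third min3() argument in the patch looks redundant for the same reason, though that is an observation about the model rather than a tested claim about the kernel code.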




* Re: [PATCH v4 10/10] mm: zswap: Compress batching with Intel IAA in zswap_batch_store() of large folios.
  2024-11-23  7:01 ` [PATCH v4 10/10] mm: zswap: Compress batching with Intel IAA in zswap_batch_store() of large folios Kanchana P Sridhar
@ 2024-11-25  8:00   ` kernel test robot
  2024-11-25 20:20   ` Yosry Ahmed
  1 sibling, 0 replies; 39+ messages in thread
From: kernel test robot @ 2024-11-25  8:00 UTC (permalink / raw)
  To: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, yosryahmed,
	nphamcs, chengming.zhou, usamaarif642, ryan.roberts, ying.huang,
	21cnbao, akpm, linux-crypto, herbert, davem, clabbe, ardb,
	ebiggers, surenb, kristen.c.accardi
  Cc: llvm, oe-kbuild-all, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Hi Kanchana,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 5a7056135bb69da2ce0a42eb8c07968c1331777b]

url:    https://github.com/intel-lab-lkp/linux/commits/Kanchana-P-Sridhar/crypto-acomp-Define-two-new-interfaces-for-compress-decompress-batching/20241125-110412
base:   5a7056135bb69da2ce0a42eb8c07968c1331777b
patch link:    https://lore.kernel.org/r/20241123070127.332773-11-kanchana.p.sridhar%40intel.com
patch subject: [PATCH v4 10/10] mm: zswap: Compress batching with Intel IAA in zswap_batch_store() of large folios.
config: x86_64-buildonly-randconfig-003-20241125 (https://download.01.org/0day-ci/archive/20241125/202411251534.ETkkSgz6-lkp@intel.com/config)
compiler: clang version 19.1.3 (https://github.com/llvm/llvm-project ab51eccf88f5321e7c60591c5546b254b6afab99)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241125/202411251534.ETkkSgz6-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202411251534.ETkkSgz6-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from mm/zswap.c:18:
   In file included from include/linux/highmem.h:8:
   In file included from include/linux/cacheflush.h:5:
   In file included from arch/x86/include/asm/cacheflush.h:5:
   In file included from include/linux/mm.h:2211:
   include/linux/vmstat.h:518:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     518 |         return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
         |                               ~~~~~~~~~~~ ^ ~~~
   In file included from mm/zswap.c:40:
   In file included from mm/internal.h:13:
   include/linux/mm_inline.h:47:41: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
      47 |         __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
         |                                    ~~~~~~~~~~~ ^ ~~~
   include/linux/mm_inline.h:49:22: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
      49 |                                 NR_ZONE_LRU_BASE + lru, nr_pages);
         |                                 ~~~~~~~~~~~~~~~~ ^ ~~~
>> mm/zswap.c:2315:8: warning: variable 'ret' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
    2315 |                         if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
         |                             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   mm/zswap.c:2335:8: note: uninitialized use occurs here
    2335 |                 if (!ret && zswap_pool_reached_full)
         |                      ^~~
   mm/zswap.c:2315:4: note: remove the 'if' if its condition is always false
    2315 |                         if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
         |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    2316 |                                 mem_cgroup_put(memcg);
         |                                 ~~~~~~~~~~~~~~~~~~~~~~
    2317 |                                 goto put_objcg;
         |                                 ~~~~~~~~~~~~~~~
    2318 |                         }
         |                         ~
   mm/zswap.c:2310:7: warning: variable 'ret' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
    2310 |                 if (zswap_check_limits())
         |                     ^~~~~~~~~~~~~~~~~~~~
   mm/zswap.c:2335:8: note: uninitialized use occurs here
    2335 |                 if (!ret && zswap_pool_reached_full)
         |                      ^~~
   mm/zswap.c:2310:3: note: remove the 'if' if its condition is always false
    2310 |                 if (zswap_check_limits())
         |                 ^~~~~~~~~~~~~~~~~~~~~~~~~
    2311 |                         goto put_objcg;
         |                         ~~~~~~~~~~~~~~
   mm/zswap.c:2303:8: warning: variable 'ret' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
    2303 |                         if (shrink_memcg(memcg)) {
         |                             ^~~~~~~~~~~~~~~~~~~
   mm/zswap.c:2335:8: note: uninitialized use occurs here
    2335 |                 if (!ret && zswap_pool_reached_full)
         |                      ^~~
   mm/zswap.c:2303:4: note: remove the 'if' if its condition is always false
    2303 |                         if (shrink_memcg(memcg)) {
         |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~
    2304 |                                 mem_cgroup_put(memcg);
         |                                 ~~~~~~~~~~~~~~~~~~~~~~
    2305 |                                 goto put_objcg;
         |                                 ~~~~~~~~~~~~~~~
    2306 |                         }
         |                         ~
   mm/zswap.c:2295:11: note: initialize the variable 'ret' to silence this warning
    2295 |                 bool ret;
         |                         ^
         |                          = 0
   6 warnings generated.
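
All three -Wsometimes-uninitialized reports above share one cause: each early "goto put_objcg" skips the only assignment to 'ret', which is then read after the label. The user-space sketch below reproduces that control flow with the initialization the compiler note suggests. Everything in it (struct store_model, model_store_one() and its fields) is a hypothetical stand-in, and the thread does not say which fix the author will choose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the checks made per folio in the store loop. */
struct store_model {
	bool limit_hit;    /* models zswap_check_limits() returning true */
	bool compress_ok;  /* models zswap_batch_comp_folio() succeeding */
	bool pool_full;    /* models zswap_pool_reached_full */
	int shrink_queued; /* counts modelled queue_work() calls */
};

static bool model_store_one(struct store_model *m)
{
	/*
	 * Initialized to the failure value, so every early goto below
	 * reaches the label with a defined result.
	 */
	bool ret = false;

	assert(m != NULL);

	if (m->limit_hit)
		goto out;

	ret = m->compress_ok;

out:
	if (!ret && m->pool_full)
		m->shrink_queued++;
	return ret;
}
```

With the initializer in place, an early exit is treated as a failed store, so the shrinker is queued when the pool is full. Whether that is the intended behaviour for every early-exit path in the patch is a design question for the author.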


vim +2315 mm/zswap.c

  2216	
  2217	/*
  2218	 * Store a large folio and/or a batch of any-order folio(s) in zswap
  2219	 * using IAA compress batching API.
  2220	 *
  2221	 * This is the main procedure for batching within large folios and for batching
  2222	 * of folios. Each large folio will be broken into sub-batches of
  2223	 * SWAP_CRYPTO_BATCH_SIZE pages, the sub-batch pages will be compressed by
  2224	 * IAA hardware compress engines in parallel, then stored in zpool/xarray.
  2225	 *
  2226	 * This procedure should only be called if zswap supports batching of stores.
  2227	 * Otherwise, the sequential implementation for storing folios as in the
  2228	 * current zswap_store() should be used. The code handles the unlikely event
  2229	 * that the zswap pool changes from batching to non-batching between
  2230	 * swap_writepage() and the start of zswap_batch_store().
  2231	 *
  2232	 * The signature of this procedure is meant to allow the calling function,
  2233	 * (for instance, swap_writepage()) to pass a batch of folios @batch
  2234	 * (the "reclaim batch") to be stored in zswap.
  2235	 *
  2236	 * @batch and @errors have folio_batch_count(@batch) number of entries,
  2237	 * with one-one correspondence (@errors[i] represents the error status of
  2238	 * @batch->folios[i], for i in folio_batch_count(@batch)). Please also
  2239	 * see comments preceding "struct zswap_batch_store_sub_batch" definition
  2240	 * above.
  2241	 *
  2242	 * The calling function (for instance, swap_writepage()) should initialize
  2243	 * @errors[i] to a non-0 value.
  2244	 * If zswap successfully stores @batch->folios[i], it will set @errors[i] to 0.
  2245	 * If there is an error in zswap, it will set @errors[i] to -EINVAL.
  2246	 *
  2247	 * @batch: folio_batch of folios to be batch compressed.
  2248	 * @errors: zswap_batch_store() error status for the folios in @batch.
  2249	 */
  2250	void zswap_batch_store(struct folio_batch *batch, int *errors)
  2251	{
  2252		struct zswap_batch_store_sub_batch sub_batch;
  2253		struct zswap_pool *pool;
  2254		u8 i;
  2255	
  2256		/*
  2257		 * If zswap is disabled, we must invalidate the possibly stale entry
  2258		 * which was previously stored at this offset. Otherwise, writeback
  2259		 * could overwrite the new data in the swapfile.
  2260		 */
  2261		if (!zswap_enabled)
  2262			goto check_old;
  2263	
  2264		pool = zswap_pool_current_get();
  2265	
  2266		if (!pool) {
  2267			if (zswap_check_limits())
  2268				queue_work(shrink_wq, &zswap_shrink_work);
  2269			goto check_old;
  2270		}
  2271	
  2272		if (!pool->can_batch) {
  2273			for (i = 0; i < folio_batch_count(batch); ++i)
  2274				if (zswap_store(batch->folios[i]))
  2275					errors[i] = 0;
  2276				else
  2277					errors[i] = -EINVAL;
  2278			/*
  2279			 * Seems preferable to release the pool ref after the calls to
  2280			 * zswap_store(), so that the non-batching pool cannot be
  2281			 * deleted, can be used for sequential stores, and the zswap pool
  2282			 * cannot morph into a batching pool.
  2283			 */
  2284			zswap_pool_put(pool);
  2285			return;
  2286		}
  2287	
  2288		zswap_batch_reset(&sub_batch);
  2289		sub_batch.pool = pool;
  2290	
  2291		for (i = 0; i < folio_batch_count(batch); ++i) {
  2292			struct folio *folio = batch->folios[i];
  2293			struct obj_cgroup *objcg = NULL;
  2294			struct mem_cgroup *memcg = NULL;
  2295			bool ret;
  2296	
  2297			VM_WARN_ON_ONCE(!folio_test_locked(folio));
  2298			VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
  2299	
  2300			objcg = get_obj_cgroup_from_folio(folio);
  2301			if (objcg && !obj_cgroup_may_zswap(objcg)) {
  2302				memcg = get_mem_cgroup_from_objcg(objcg);
  2303				if (shrink_memcg(memcg)) {
  2304					mem_cgroup_put(memcg);
  2305					goto put_objcg;
  2306				}
  2307				mem_cgroup_put(memcg);
  2308			}
  2309	
  2310			if (zswap_check_limits())
  2311				goto put_objcg;
  2312	
  2313			if (objcg) {
  2314				memcg = get_mem_cgroup_from_objcg(objcg);
> 2315				if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
  2316					mem_cgroup_put(memcg);
  2317					goto put_objcg;
  2318				}
  2319				mem_cgroup_put(memcg);
  2320			}
  2321	
  2322			/*
  2323			 * By default, set zswap error status in "errors" to "success"
  2324			 * for use in swap_writepage() when this returns. In case of
  2325			 * errors encountered in any sub-batch in which this folio's
  2326			 * pages are batch-compressed, a negative error number will
  2327			 * over-write this when zswap_batch_cleanup() is called.
  2328			 */
  2329			errors[i] = 0;
  2330			ret = zswap_batch_comp_folio(folio, errors, i, objcg, &sub_batch,
  2331						     (i == folio_batch_count(batch) - 1));
  2332	
  2333	put_objcg:
  2334			obj_cgroup_put(objcg);
  2335			if (!ret && zswap_pool_reached_full)
  2336				queue_work(shrink_wq, &zswap_shrink_work);
  2337		} /* for batch folios */
  2338	
  2339		zswap_pool_put(pool);
  2340	
  2341	check_old:
  2342		for (i = 0; i < folio_batch_count(batch); ++i)
  2343			if (errors[i])
  2344				zswap_delete_stored_entries(batch->folios[i]);
  2345	}
  2346	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



* Re: [PATCH v4 10/10] mm: zswap: Compress batching with Intel IAA in zswap_batch_store() of large folios.
  2024-11-23  7:01 ` [PATCH v4 10/10] mm: zswap: Compress batching with Intel IAA in zswap_batch_store() of large folios Kanchana P Sridhar
  2024-11-25  8:00   ` kernel test robot
@ 2024-11-25 20:20   ` Yosry Ahmed
  2024-11-25 21:47     ` Johannes Weiner
  2024-11-25 21:54     ` Sridhar, Kanchana P
  1 sibling, 2 replies; 39+ messages in thread
From: Yosry Ahmed @ 2024-11-25 20:20 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, ying.huang, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi, wajdi.k.feghali, vinodh.gopal

On Fri, Nov 22, 2024 at 11:01 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> This patch adds two new zswap API:
>
>  1) bool zswap_can_batch(void);
>  2) void zswap_batch_store(struct folio_batch *batch, int *errors);
>
> Higher level mm code, for instance, swap_writepage(), can query if the
> current zswap pool supports batching, by calling zswap_can_batch(). If so
> it can invoke zswap_batch_store() to swapout a large folio much more
> efficiently to zswap, instead of calling zswap_store().
>
> Hence, on systems with Intel IAA hardware compress/decompress accelerators,
> swap_writepage() will invoke zswap_batch_store() for large folios.
>
> zswap_batch_store() will call crypto_acomp_batch_compress() to compress up
> to SWAP_CRYPTO_BATCH_SIZE (i.e. 8) pages in large folios in parallel using
> the multiple compress engines available in IAA.
>
> On platforms with multiple IAA devices per package, compress jobs from all
> cores in a package will be distributed among all IAA devices in the package
> by the iaa_crypto driver.
>
> The newly added zswap_batch_store() follows the general structure of
> zswap_store(). Some amount of restructuring and optimization is done to
> minimize failure points for a batch, fail early and maximize the zswap
> store pipeline occupancy with SWAP_CRYPTO_BATCH_SIZE pages, potentially
> from multiple folios in future. This is intended to maximize reclaim
> throughput with the IAA hardware parallel compressions.
>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Suggested-by: Yosry Ahmed <yosryahmed@google.com>

This is definitely not what I suggested :)

I won't speak for Johannes here but I suspect it's not quite what he
wanted either.

More below.

> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  include/linux/zswap.h |  12 +
>  mm/page_io.c          |  16 +-
>  mm/zswap.c            | 639 ++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 666 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 9ad27ab3d222..a05f59139a6e 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -4,6 +4,7 @@
>
>  #include <linux/types.h>
>  #include <linux/mm_types.h>
> +#include <linux/pagevec.h>
>
>  struct lruvec;
>
> @@ -33,6 +34,8 @@ struct zswap_lruvec_state {
>
>  unsigned long zswap_total_pages(void);
>  bool zswap_store(struct folio *folio);
> +bool zswap_can_batch(void);
> +void zswap_batch_store(struct folio_batch *batch, int *errors);
>  bool zswap_load(struct folio *folio);
>  void zswap_invalidate(swp_entry_t swp);
>  int zswap_swapon(int type, unsigned long nr_pages);
> @@ -51,6 +54,15 @@ static inline bool zswap_store(struct folio *folio)
>         return false;
>  }
>
> +static inline bool zswap_can_batch(void)
> +{
> +       return false;
> +}
> +
> +static inline void zswap_batch_store(struct folio_batch *batch, int *errors)
> +{
> +}
> +
>  static inline bool zswap_load(struct folio *folio)
>  {
>         return false;
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 4b4ea8e49cf6..271d3a40c0c1 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -276,7 +276,21 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>                  */
>                 swap_zeromap_folio_clear(folio);
>         }
> -       if (zswap_store(folio)) {
> +
> +       if (folio_test_large(folio) && zswap_can_batch()) {
> +               struct folio_batch batch;
> +               int error = -1;
> +
> +               folio_batch_init(&batch);
> +               folio_batch_add(&batch, folio);
> +               zswap_batch_store(&batch, &error);
> +
> +               if (!error) {
> +                       count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
> +                       folio_unlock(folio);
> +                       return 0;
> +               }

First of all, why does the code outside of zswap have to care or be changed?

We should still call zswap_store() here, and within that figure out if
we can do batching or not. I am not sure what we gain by adding a
separate interface here, especially since we are creating a batch of a
single folio and passing it in anyway. I suspect that this leaked here
from the patch that batches swapout of unrelated folios, but please
don't do that. This patch is about batching compression of pages in the
same folio, and for that, there is no need for the code here to do
anything differently or pass in a folio batch.

Also, eliminating the need for a folio batch eliminates the need to
call the batches in the zswap code "sub batches", which is really
confusing imo :)
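
A minimal user-space sketch of the shape being asked for here, assuming a single entry point that decides internally whether to batch. All names (struct pool_model, model_zswap_store() and the two compress helpers) are hypothetical and only count compressor invocations; none of this is the actual zswap code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_BATCH_SIZE 8 /* assumed batch size, as quoted in the series */

/* Hypothetical pool: records whether batching is available. */
struct pool_model {
	bool can_batch;
	unsigned int compress_calls; /* number of compressor invocations */
};

/* One compressor call per page, as in the existing sequential path. */
static void model_compress_sequential(struct pool_model *pool, size_t nr)
{
	pool->compress_calls += (unsigned int)nr;
}

/* One compressor call per chunk of up to MODEL_BATCH_SIZE pages. */
static void model_compress_batched(struct pool_model *pool, size_t nr)
{
	while (nr > 0) {
		size_t chunk = nr < MODEL_BATCH_SIZE ? nr : MODEL_BATCH_SIZE;

		pool->compress_calls++;
		nr -= chunk;
	}
}

/*
 * Single entry point: the caller passes a folio (modelled by its page
 * count) and never learns whether batching was used.
 */
static bool model_zswap_store(struct pool_model *pool, size_t nr_folio_pages)
{
	assert(pool != NULL);

	if (nr_folio_pages == 0)
		return false;

	if (pool->can_batch && nr_folio_pages > 1)
		model_compress_batched(pool, nr_folio_pages);
	else
		model_compress_sequential(pool, nr_folio_pages);

	return true;
}
```

In this model the caller in mm/page_io.c would stay unchanged, and the batch becomes a local construct of the store path rather than a "sub-batch" of a caller-supplied folio batch.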

> +       } else if (zswap_store(folio)) {
>                 count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
>                 folio_unlock(folio);
>                 return 0;
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 173f7632990e..53c8e39b778b 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -229,6 +229,80 @@ static DEFINE_MUTEX(zswap_init_lock);
>  /* init completed, but couldn't create the initial pool */
>  static bool zswap_has_pool;
>
> +/*
> + * struct zswap_batch_store_sub_batch:
> + *
> + * This represents a sub-batch of SWAP_CRYPTO_BATCH_SIZE pages during IAA
> + * compress batching of a folio or (conceptually, a reclaim batch of) folios.
> + * The new zswap_batch_store() API will break down the batch of folios being
> + * reclaimed into sub-batches of SWAP_CRYPTO_BATCH_SIZE pages, batch compress
> + * the pages by calling the iaa_crypto driver API crypto_acomp_batch_compress();
> + * and storing the sub-batch in zpool/xarray before updating objcg/vm/zswap
> + * stats.
> + *
> + * Although the page itself is represented directly, the structure adds a
> + * "u8 folio_id" to represent an index for the folio in a conceptual
> + * "reclaim batch of folios" that can be passed to zswap_store(). Conceptually,
> + * this allows for up to 256 folios that can be passed to zswap_store().
> + * Even though the folio_id seems redundant in the context of a single large
> + * folio being stored by zswap, it does simplify error handling and redundant
> + * computes/rewinding state, all of which can add latency. Since the
> + * zswap_batch_store() of a large folio can fail for any of these reasons --
> + * compress errors, zpool malloc errors, xarray store errors -- the procedures
> + * that detect these errors for a sub-batch, can all call a single cleanup
> + * procedure, zswap_batch_cleanup(), which will de-allocate zpool memory and
> + * zswap_entries for the sub-batch and set the "errors[folio_id]" to -EINVAL.
> + * All subsequent procedures that operate on a sub-batch will do nothing if the
> + * errors[folio_id] is non-0. Hence, the folio_id facilitates the use of the
> + * "errors" passed to zswap_batch_store() as a global folio error status for a
> + * single folio (which could also be a folio in the folio_batch).
> + *
> + * The sub-batch concept could be further evolved to use pipelining to
> + * overlap CPU computes with IAA computes. For instance, we could stage
> + * the post-compress computes for sub-batch "N-1" to happen in parallel with
> + * IAA batch compression of sub-batch "N".
> + *
> + * We begin by developing the concept of compress batching. Pipelining with
> + * overlap can be future work.

I suppose all this gets simplified once we eliminate passing in a
folio batch to zswap. In that case, a batch is just a construct
created by zswap if the allocator supports batching page compression.
There is also no need to describe what the code does, especially a
centralized comment like that rather than per-function docs/comments.

We also don't want the comments to be very specific to IAA. It is
currently the only implementation, but the code here is not specific
to IAA AFAICT. It should just be an example or so.
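
To illustrate the point about not tying the code or comments to IAA, here is a user-space sketch in which batching is just an optional capability of a compressor. struct compressor_model, the stub functions and the two example descriptors are hypothetical and are not the crypto acomp interface from this series.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical compressor descriptor; batch_compress is optional. */
struct compressor_model {
	const char *name;
	int (*compress)(const unsigned char *src, size_t len);
	int (*batch_compress)(const unsigned char *const srcs[], size_t nr,
			      size_t len, int errors[]);
};

/* Stub: "compresses" one page, failing only on empty input. */
static int stub_compress(const unsigned char *src, size_t len)
{
	return (src && len) ? 0 : -1;
}

/* Stub batch op: fills errors[] per page and returns the failure count. */
static int stub_batch_compress(const unsigned char *const srcs[], size_t nr,
			       size_t len, int errors[])
{
	size_t i;
	int failed = 0;

	for (i = 0; i < nr; i++) {
		errors[i] = stub_compress(srcs[i], len);
		if (errors[i])
			failed++;
	}
	return failed;
}

/* A software-only compressor and one that also offers batching. */
static const struct compressor_model model_sw = {
	"sw-model", stub_compress, NULL
};
static const struct compressor_model model_hw = {
	"hw-model", stub_compress, stub_batch_compress
};

static int model_can_batch(const struct compressor_model *c)
{
	return c->batch_compress != NULL;
}

/* Generic path: use the batch op if present, else loop over pages. */
static int model_compress_pages(const struct compressor_model *c,
				const unsigned char *const srcs[], size_t nr,
				size_t len, int errors[])
{
	size_t i;
	int failed = 0;

	assert(c != NULL && c->compress != NULL);

	if (model_can_batch(c))
		return c->batch_compress(srcs, nr, len, errors);

	for (i = 0; i < nr; i++) {
		errors[i] = c->compress(srcs[i], len);
		if (errors[i])
			failed++;
	}
	return failed;
}
```

Nothing in model_compress_pages() names a particular accelerator; a hardware driver would simply be one descriptor that happens to fill in the batch operation.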

> + *
> + * @pages: The individual pages in the sub-batch. There are no assumptions
> + *         about all of them belonging to the same folio.
> + * @dsts: The destination buffers for batch compress of the sub-batch.
> + * @dlens: The destination length constraints, and eventual compressed lengths
> + *         of successful compressions.
> + * @comp_errors: The compress error status for each page in the sub-batch, set
> + *               by crypto_acomp_batch_compress().
> + * @folio_ids: The containing folio_id of each sub-batch page.
> + * @swpentries: The page_swap_entry() for each corresponding sub-batch page.
> + * @objcgs: The objcg for each corresponding sub-batch page.
> + * @entries: The zswap_entry for each corresponding sub-batch page.
> + * @nr_pages: Total number of pages in @sub_batch.
> + * @pool: A valid zswap_pool that can_batch.
> + *
> + * Note:
> + * The max sub-batch size is SWAP_CRYPTO_BATCH_SIZE, currently 8UL.
> + * Hence, if SWAP_CRYPTO_BATCH_SIZE exceeds 255, @nr_pages needs to become u16.
> + * The sub-batch representation is future-proofed to a small extent to be able
> + * to easily scale the zswap_batch_store() implementation to handle a conceptual
> + * "reclaim batch of folios"; without adding too much complexity, while
> + * benefiting from simpler error handling, localized sub-batch resources cleanup
> + * and avoiding expensive rewinding state. If this conceptual number of reclaim
> + * folios sent to zswap_batch_store() exceeds 256, @folio_ids needs to
> + * become u16.
> + */
> +struct zswap_batch_store_sub_batch {
> +       struct page *pages[SWAP_CRYPTO_BATCH_SIZE];
> +       u8 *dsts[SWAP_CRYPTO_BATCH_SIZE];
> +       unsigned int dlens[SWAP_CRYPTO_BATCH_SIZE];
> +       int comp_errors[SWAP_CRYPTO_BATCH_SIZE]; /* per-page compress error status. */
> +       u8 folio_ids[SWAP_CRYPTO_BATCH_SIZE];
> +       swp_entry_t swpentries[SWAP_CRYPTO_BATCH_SIZE];
> +       struct obj_cgroup *objcgs[SWAP_CRYPTO_BATCH_SIZE];
> +       struct zswap_entry *entries[SWAP_CRYPTO_BATCH_SIZE];
> +       u8 nr_pages;
> +       struct zswap_pool *pool;
> +};
> +
>  /*********************************
>  * helpers and fwd declarations
>  **********************************/
> @@ -1705,6 +1779,571 @@ void zswap_invalidate(swp_entry_t swp)
>                 zswap_entry_free(entry);
>  }
>
> +/******************************************************
> + * zswap_batch_store() with compress batching.
> + ******************************************************/
> +
> +/*
> + * Note: If SWAP_CRYPTO_BATCH_SIZE exceeds 255, change the
> + * u8 stack variables in the next several functions, to u16.
> + */
> +bool zswap_can_batch(void)
> +{
> +       struct zswap_pool *pool;
> +       bool ret = false;
> +
> +       pool = zswap_pool_current_get();
> +
> +       if (!pool)
> +               return ret;
> +
> +       if (pool->can_batch)
> +               ret = true;
> +
> +       zswap_pool_put(pool);
> +
> +       return ret;
> +}
> +
> +/*
> + * If the zswap store fails or zswap is disabled, we must invalidate
> + * the possibly stale entries which were previously stored at the
> + * offsets corresponding to each page of the folio. Otherwise,
> + * writeback could overwrite the new data in the swapfile.
> + */
> +static void zswap_delete_stored_entries(struct folio *folio)
> +{
> +       swp_entry_t swp = folio->swap;
> +       unsigned type = swp_type(swp);
> +       pgoff_t offset = swp_offset(swp);
> +       struct zswap_entry *entry;
> +       struct xarray *tree;
> +       long index;
> +
> +       for (index = 0; index < folio_nr_pages(folio); ++index) {
> +               tree = swap_zswap_tree(swp_entry(type, offset + index));
> +               entry = xa_erase(tree, offset + index);
> +               if (entry)
> +                       zswap_entry_free(entry);
> +       }
> +}
> +
> +static __always_inline void zswap_batch_reset(struct zswap_batch_store_sub_batch *sb)
> +{
> +       sb->nr_pages = 0;
> +}
> +
> +/*
> + * Upon encountering the first sub-batch page in a folio with an error due to
> + * any of the following:
> + *  - compression
> + *  - zpool malloc
> + *  - xarray store
> + * , cleanup the sub-batch resources (zpool memory, zswap_entry) for all other
> + * sub_batch elements belonging to the same folio, using the "error_folio_id".
> + *
> + * Set "errors[error_folio_id]" to signify to all downstream computes in
> + * zswap_batch_store(), that no further processing is required for the folio
> + * with "error_folio_id" in the batch: this folio's zswap store status will
> + * be considered an error, and existing zswap_entries in the xarray will be
> + * deleted before zswap_batch_store() exits.
> + */
> +static void zswap_batch_cleanup(struct zswap_batch_store_sub_batch *sb,
> +                               int *errors,
> +                               u8 error_folio_id)
> +{
> +       u8 i;
> +
> +       if (errors[error_folio_id])
> +               return;
> +
> +       for (i = 0; i < sb->nr_pages; ++i) {
> +               if (sb->folio_ids[i] == error_folio_id) {
> +                       if (sb->entries[i]) {
> +                               if (!IS_ERR_VALUE(sb->entries[i]->handle))
> +                                       zpool_free(sb->pool->zpool, sb->entries[i]->handle);
> +
> +                               zswap_entry_cache_free(sb->entries[i]);
> +                               sb->entries[i] = NULL;
> +                       }
> +               }
> +       }
> +
> +       errors[error_folio_id] = -EINVAL;
> +}
> +
> +/*
> + * Returns true if the entry was successfully
> + * stored in the xarray, and false otherwise.
> + */
> +static bool zswap_store_entry(swp_entry_t page_swpentry, struct zswap_entry *entry)
> +{
> +       struct zswap_entry *old = xa_store(swap_zswap_tree(page_swpentry),
> +                                          swp_offset(page_swpentry),
> +                                          entry, GFP_KERNEL);
> +       if (xa_is_err(old)) {
> +               int err = xa_err(old);
> +
> +               WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> +               zswap_reject_alloc_fail++;
> +               return false;
> +       }
> +
> +       /*
> +        * We may have had an existing entry that became stale when
> +        * the folio was redirtied and now the new version is being
> +        * swapped out. Get rid of the old.
> +        */
> +       if (old)
> +               zswap_entry_free(old);
> +
> +       return true;
> +}
> +
> +/*
> + * The stats accounting makes no assumptions about all pages in the sub-batch
> + * belonging to the same folio, or having the same objcg; while still doing
> + * the updates in aggregation.
> + */
> +static void zswap_batch_xarray_stats(struct zswap_batch_store_sub_batch *sb,
> +                                    int *errors)
> +{
> +       int nr_objcg_pages = 0, nr_pages = 0;
> +       struct obj_cgroup *objcg = NULL;
> +       size_t compressed_bytes = 0;
> +       u8 i;
> +
> +       for (i = 0; i < sb->nr_pages; ++i) {
> +               if (errors[sb->folio_ids[i]])
> +                       continue;
> +
> +               if (!zswap_store_entry(sb->swpentries[i], sb->entries[i])) {
> +                       zswap_batch_cleanup(sb, errors, sb->folio_ids[i]);
> +                       continue;
> +               }
> +
> +               /*
> +                * The entry is successfully compressed and stored in the tree,
> +                * there is no further possibility of failure. Grab refs to the
> +                * pool and objcg. These refs will be dropped by
> +                * zswap_entry_free() when the entry is removed from the tree.
> +                */
> +               zswap_pool_get(sb->pool);
> +               if (sb->objcgs[i])
> +                       obj_cgroup_get(sb->objcgs[i]);
> +
> +               /*
> +                * We finish initializing the entry while it's already in xarray.
> +                * This is safe because:
> +                *
> +                * 1. Concurrent stores and invalidations are excluded by folio
> +                *    lock.
> +                *
> +                * 2. Writeback is excluded by the entry not being on the LRU yet.
> +                *    The publishing order matters to prevent writeback from seeing
> +                *    an incoherent entry.
> +                */
> +               sb->entries[i]->pool = sb->pool;
> +               sb->entries[i]->swpentry = sb->swpentries[i];
> +               sb->entries[i]->objcg = sb->objcgs[i];
> +               sb->entries[i]->referenced = true;
> +               if (sb->entries[i]->length) {
> +                       INIT_LIST_HEAD(&(sb->entries[i]->lru));
> +                       zswap_lru_add(&zswap_list_lru, sb->entries[i]);
> +               }
> +
> +               if (!objcg && sb->objcgs[i]) {
> +                       objcg = sb->objcgs[i];
> +               } else if (objcg && sb->objcgs[i] && (objcg != sb->objcgs[i])) {
> +                       obj_cgroup_charge_zswap(objcg, compressed_bytes);
> +                       count_objcg_events(objcg, ZSWPOUT, nr_objcg_pages);
> +                       compressed_bytes = 0;
> +                       nr_objcg_pages = 0;
> +                       objcg = sb->objcgs[i];
> +               }
> +
> +               if (sb->objcgs[i]) {
> +                       compressed_bytes += sb->entries[i]->length;
> +                       ++nr_objcg_pages;
> +               }
> +
> +               ++nr_pages;
> +       } /* for sub-batch pages. */
> +
> +       if (objcg) {
> +               obj_cgroup_charge_zswap(objcg, compressed_bytes);
> +               count_objcg_events(objcg, ZSWPOUT, nr_objcg_pages);
> +       }
> +
> +       atomic_long_add(nr_pages, &zswap_stored_pages);
> +       count_vm_events(ZSWPOUT, nr_pages);
> +}
> +
> +static void zswap_batch_zpool_store(struct zswap_batch_store_sub_batch *sb,
> +                                   int *errors)
> +{
> +       u8 i;
> +
> +       for (i = 0; i < sb->nr_pages; ++i) {
> +               struct zpool *zpool;
> +               unsigned long handle;
> +               char *buf;
> +               gfp_t gfp;
> +               int err;
> +
> +               /* Skip pages belonging to folios that had compress errors. */
> +               if (errors[sb->folio_ids[i]])
> +                       continue;
> +
> +               zpool = sb->pool->zpool;
> +               gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
> +               if (zpool_malloc_support_movable(zpool))
> +                       gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
> +               err = zpool_malloc(zpool, sb->dlens[i], gfp, &handle);
> +
> +               if (err) {
> +                       if (err == -ENOSPC)
> +                               zswap_reject_compress_poor++;
> +                       else
> +                               zswap_reject_alloc_fail++;
> +
> +                       /*
> +                        * A zpool malloc error should trigger cleanup for
> +                        * other same-folio pages in the sub-batch, and zpool
> +                        * resources/zswap_entries for those pages should be
> +                        * de-allocated.
> +                        */
> +                       zswap_batch_cleanup(sb, errors, sb->folio_ids[i]);
> +                       continue;
> +               }
> +
> +               buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
> +               memcpy(buf, sb->dsts[i], sb->dlens[i]);
> +               zpool_unmap_handle(zpool, handle);
> +
> +               sb->entries[i]->handle = handle;
> +               sb->entries[i]->length = sb->dlens[i];
> +       }
> +}
> +
> +static void zswap_batch_proc_comp_errors(struct zswap_batch_store_sub_batch *sb,
> +                                        int *errors)
> +{
> +       u8 i;
> +
> +       for (i = 0; i < sb->nr_pages; ++i) {
> +               if (sb->comp_errors[i]) {
> +                       if (sb->comp_errors[i] == -ENOSPC)
> +                               zswap_reject_compress_poor++;
> +                       else
> +                               zswap_reject_compress_fail++;
> +
> +                       if (!errors[sb->folio_ids[i]])
> +                               zswap_batch_cleanup(sb, errors, sb->folio_ids[i]);
> +               }
> +       }
> +}
> +
> +/*
> + * Batch compress up to SWAP_CRYPTO_BATCH_SIZE pages with IAA.
> + * It is important to note that the SWAP_CRYPTO_BATCH_SIZE batching
> + * resources are allocated for the pool's per-cpu acomp_ctx during cpu
> + * hotplug only if the crypto_acomp has registered
> + * batch_compress() and batch_decompress().
> + * The iaa_crypto driver registers implementations for both these API.
> + * Hence, if IAA is the zswap compressor, the call to
> + * crypto_acomp_batch_compress() will compress the pages in parallel,
> + * resulting in significant performance improvements as compared to
> + * software compressors.
> + */
> +static void zswap_batch_compress(struct zswap_batch_store_sub_batch *sb,
> +                                int *errors)
> +{
> +       struct crypto_acomp_ctx *acomp_ctx = raw_cpu_ptr(sb->pool->acomp_ctx);
> +       u8 i;
> +
> +       mutex_lock(&acomp_ctx->mutex);
> +
> +       BUG_ON(acomp_ctx->nr_reqs != SWAP_CRYPTO_BATCH_SIZE);
> +
> +       for (i = 0; i < sb->nr_pages; ++i) {
> +               sb->dsts[i] = acomp_ctx->buffers[i];
> +               sb->dlens[i] = PAGE_SIZE;
> +       }
> +
> +       /*
> +        * Batch compress sub-batch "N". If IAA is the compressor, the
> +        * hardware will compress multiple pages in parallel.
> +        */
> +       crypto_acomp_batch_compress(
> +               acomp_ctx->reqs,
> +               &acomp_ctx->wait,
> +               sb->pages,
> +               sb->dsts,
> +               sb->dlens,
> +               sb->comp_errors,
> +               sb->nr_pages);
> +
> +       /*
> +        * Scan the sub-batch for any compression errors,
> +        * and invalidate pages with errors, along with other
> +        * pages belonging to the same folio as the error page(s).
> +        * Set the folio's error status in "errors" so that no
> +        * further zswap_batch_store() processing is done for
> +        * the folio(s) with compression errors.
> +        */
> +       zswap_batch_proc_comp_errors(sb, errors);
> +
> +       zswap_batch_zpool_store(sb, errors);
> +
> +       mutex_unlock(&acomp_ctx->mutex);
> +}
> +
> +static void zswap_batch_add_pages(struct zswap_batch_store_sub_batch *sb,
> +                                 struct folio *folio,
> +                                 u8 folio_id,
> +                                 struct obj_cgroup *objcg,
> +                                 struct zswap_entry *entries[],
> +                                 long start_idx,
> +                                 u8 nr)
> +{
> +       long index;
> +
> +       for (index = start_idx; index < (start_idx + nr); ++index) {
> +               u8 i = sb->nr_pages;
> +               struct page *page = folio_page(folio, index);
> +               sb->pages[i] = page;
> +               sb->swpentries[i] = page_swap_entry(page);
> +               sb->folio_ids[i] = folio_id;
> +               sb->objcgs[i] = objcg;
> +               sb->entries[i] = entries[index - start_idx];
> +               sb->comp_errors[i] = 0;
> +               ++sb->nr_pages;
> +       }
> +}
> +
> +/* Allocate entries for the next sub-batch. */
> +static int zswap_batch_alloc_entries(struct zswap_entry *entries[], int node_id, u8 nr)
> +{
> +       u8 i;
> +
> +       for (i = 0; i < nr; ++i) {
> +               entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
> +               if (!entries[i]) {
> +                       u8 j;
> +
> +                       zswap_reject_kmemcache_fail++;
> +                       for (j = 0; j < i; ++j)
> +                               zswap_entry_cache_free(entries[j]);
> +                       return -EINVAL;
> +               }
> +
> +               entries[i]->handle = (unsigned long)ERR_PTR(-EINVAL);
> +       }
> +
> +       return 0;
> +}
> +
> +static bool zswap_batch_comp_folio(struct folio *folio, int *errors, u8 folio_id,
> +                                  struct obj_cgroup *objcg,
> +                                  struct zswap_batch_store_sub_batch *sub_batch,
> +                                  bool last)
> +{
> +       long folio_start_idx = 0, nr_folio_pages = folio_nr_pages(folio);
> +       struct zswap_entry *entries[SWAP_CRYPTO_BATCH_SIZE];
> +       int node_id = folio_nid(folio);
> +
> +       /*
> +        * Iterate over the pages in the folio passed in. Construct compress
> +        * sub-batches of up to SWAP_CRYPTO_BATCH_SIZE pages. Process each
> +        * sub-batch with IAA batch compression. Detect errors from batch
> +        * compression and set the folio's error status.
> +        */
> +       while (nr_folio_pages > 0) {
> +               u8 add_nr_pages;
> +
> +               /*
> +                * If we have accumulated SWAP_CRYPTO_BATCH_SIZE
> +                * pages, process the sub-batch.
> +                */
> +               if (sub_batch->nr_pages == SWAP_CRYPTO_BATCH_SIZE) {
> +                       zswap_batch_compress(sub_batch, errors);
> +                       zswap_batch_xarray_stats(sub_batch, errors);
> +                       zswap_batch_reset(sub_batch);
> +                       /*
> +                        * Stop processing this folio if it had compress errors.
> +                        */
> +                       if (errors[folio_id])
> +                               goto ret_folio;
> +               }
> +
> +               /* Add pages from the folio to the compress sub-batch. */
> +               add_nr_pages = min3((
> +                               (long)SWAP_CRYPTO_BATCH_SIZE -
> +                               (long)sub_batch->nr_pages),
> +                               nr_folio_pages,
> +                               (long)SWAP_CRYPTO_BATCH_SIZE);
> +
> +               /*
> +                * Allocate zswap entries for this sub-batch. If we get errors
> +                * while doing so, we can fail early and flag an error for the
> +                * folio.
> +                */
> +               if (zswap_batch_alloc_entries(entries, node_id, add_nr_pages)) {
> +                       zswap_batch_reset(sub_batch);
> +                       errors[folio_id] = -EINVAL;
> +                       goto ret_folio;
> +               }
> +
> +               zswap_batch_add_pages(sub_batch, folio, folio_id, objcg,
> +                                     entries, folio_start_idx, add_nr_pages);
> +
> +               nr_folio_pages -= add_nr_pages;
> +               folio_start_idx += add_nr_pages;
> +       } /* this folio has pages to be compressed. */
> +
> +       /*
> +        * Process last sub-batch: it could contain pages from multiple folios.
> +        */
> +       if (last && sub_batch->nr_pages) {
> +               zswap_batch_compress(sub_batch, errors);
> +               zswap_batch_xarray_stats(sub_batch, errors);
> +       }
> +
> +ret_folio:
> +       return (!errors[folio_id]);
> +}
> +
> +/*
> + * Store a large folio and/or a batch of any-order folio(s) in zswap
> + * using IAA compress batching API.
> + *
> + * This is the main procedure for batching within large folios and for batching
> + * of folios. Each large folio will be broken into sub-batches of
> + * SWAP_CRYPTO_BATCH_SIZE pages, the sub-batch pages will be compressed by
> + * IAA hardware compress engines in parallel, then stored in zpool/xarray.
> + *
> + * This procedure should only be called if zswap supports batching of stores.
> + * Otherwise, the sequential implementation for storing folios as in the
> + * current zswap_store() should be used. The code handles the unlikely event
> + * that the zswap pool changes from batching to non-batching between
> + * swap_writepage() and the start of zswap_batch_store().
> + *
> + * The signature of this procedure is meant to allow the calling function,
> + * (for instance, swap_writepage()) to pass a batch of folios @batch
> + * (the "reclaim batch") to be stored in zswap.
> + *
> + * @batch and @errors have folio_batch_count(@batch) number of entries,
> + * with one-one correspondence (@errors[i] represents the error status of
> + * @batch->folios[i], for i in folio_batch_count(@batch)). Please also
> + * see comments preceding "struct zswap_batch_store_sub_batch" definition
> + * above.
> + *
> + * The calling function (for instance, swap_writepage()) should initialize
> + * @errors[i] to a non-0 value.
> + * If zswap successfully stores @batch->folios[i], it will set @errors[i] to 0.
> + * If there is an error in zswap, it will set @errors[i] to -EINVAL.
> + *
> + * @batch: folio_batch of folios to be batch compressed.
> + * @errors: zswap_batch_store() error status for the folios in @batch.
> + */
> +void zswap_batch_store(struct folio_batch *batch, int *errors)
> +{
> +       struct zswap_batch_store_sub_batch sub_batch;
> +       struct zswap_pool *pool;
> +       u8 i;
> +
> +       /*
> +        * If zswap is disabled, we must invalidate the possibly stale entry
> +        * which was previously stored at this offset. Otherwise, writeback
> +        * could overwrite the new data in the swapfile.
> +        */
> +       if (!zswap_enabled)
> +               goto check_old;
> +
> +       pool = zswap_pool_current_get();
> +
> +       if (!pool) {
> +               if (zswap_check_limits())
> +                       queue_work(shrink_wq, &zswap_shrink_work);
> +               goto check_old;
> +       }
> +
> +       if (!pool->can_batch) {
> +               for (i = 0; i < folio_batch_count(batch); ++i)
> +                       if (zswap_store(batch->folios[i]))
> +                               errors[i] = 0;
> +                       else
> +                               errors[i] = -EINVAL;
> +               /*
> +                * Seems preferable to release the pool ref after the calls to
> +                * zswap_store(), so that the non-batching pool cannot be
> +                * deleted, can be used for sequential stores, and the zswap pool
> +                * cannot morph into a batching pool.
> +                */
> +               zswap_pool_put(pool);
> +               return;
> +       }
> +
> +       zswap_batch_reset(&sub_batch);
> +       sub_batch.pool = pool;
> +
> +       for (i = 0; i < folio_batch_count(batch); ++i) {
> +               struct folio *folio = batch->folios[i];
> +               struct obj_cgroup *objcg = NULL;
> +               struct mem_cgroup *memcg = NULL;
> +               bool ret;
> +
> +               VM_WARN_ON_ONCE(!folio_test_locked(folio));
> +               VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> +
> +               objcg = get_obj_cgroup_from_folio(folio);
> +               if (objcg && !obj_cgroup_may_zswap(objcg)) {
> +                       memcg = get_mem_cgroup_from_objcg(objcg);
> +                       if (shrink_memcg(memcg)) {
> +                               mem_cgroup_put(memcg);
> +                               goto put_objcg;
> +                       }
> +                       mem_cgroup_put(memcg);
> +               }
> +
> +               if (zswap_check_limits())
> +                       goto put_objcg;
> +
> +               if (objcg) {
> +                       memcg = get_mem_cgroup_from_objcg(objcg);
> +                       if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> +                               mem_cgroup_put(memcg);
> +                               goto put_objcg;
> +                       }
> +                       mem_cgroup_put(memcg);
> +               }
> +
> +               /*
> +                * By default, set zswap error status in "errors" to "success"
> +                * for use in swap_writepage() when this returns. In case of
> +                * errors encountered in any sub-batch in which this folio's
> +                * pages are batch-compressed, a negative error number will
> +                * over-write this when zswap_batch_cleanup() is called.
> +                */
> +               errors[i] = 0;
> +               ret = zswap_batch_comp_folio(folio, errors, i, objcg, &sub_batch,
> +                                            (i == folio_batch_count(batch) - 1));
> +
> +put_objcg:
> +               obj_cgroup_put(objcg);
> +               if (!ret && zswap_pool_reached_full)
> +                       queue_work(shrink_wq, &zswap_shrink_work);
> +       } /* for batch folios */
> +
> +       zswap_pool_put(pool);
> +
> +check_old:
> +       for (i = 0; i < folio_batch_count(batch); ++i)
> +               if (errors[i])
> +                       zswap_delete_stored_entries(batch->folios[i]);
> +}
> +

I didn't look too closely at the code, but you are essentially
replicating the entire zswap store code path and making it work with
batches. This is a maintenance nightmare, and the code could very
easily go out-of-sync.

What we really need to do (and I suppose what Johannes meant, but
please correct me if I am wrong), is to make the existing flow work
with batches.

For example, most of zswap_store() should remain the same. It is still
getting a folio to compress, the only difference is that we will
parallelize the page compressions. zswap_store_page() is where some
changes need to be made. Instead of a single function that handles the
storage of each page, we need a vectorized function that handles the
storage of N pages in a folio (allocate zswap_entry's, do xarray
insertions, etc). This should be a refactoring in a separate patch.
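
As a rough illustration of what "vectorized" could mean here, the following
is a userspace sketch, not kernel code. store_pages(), struct entry, slots[]
and the fail_at test hook are all hypothetical names made up for this
example; a plain array stands in for the xarray and malloc() for the
zswap_entry cache. The point it tries to show is that N pages are stored as
a unit, and a failure part-way through undoes what this call already stored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define SLOT_COUNT 64

/* Hypothetical stand-in for a zswap_entry; not the kernel structure. */
struct entry {
	size_t length;
};

/* Hypothetical stand-in for the xarray: one slot per swap offset. */
static struct entry *slots[SLOT_COUNT];

/*
 * Store entries for offsets [start, start + nr) as a unit. If any
 * allocation fails, free the entries this call already stored and
 * return false, so the caller never sees a partially stored range.
 * fail_at is a test hook: the allocation for that offset is forced
 * to fail (pass -1 to disable).
 */
static bool store_pages(size_t start, size_t nr, long fail_at)
{
	size_t i;

	assert(start + nr <= SLOT_COUNT);

	for (i = 0; i < nr; i++) {
		struct entry *e = NULL;

		if ((long)(start + i) != fail_at)
			e = malloc(sizeof(*e));
		if (!e)
			goto rollback;

		e->length = 0;
		free(slots[start + i]);	/* drop any stale entry */
		slots[start + i] = e;
	}
	return true;

rollback:
	/* Undo offsets start + i - 1 down to start. */
	while (i-- > 0) {
		free(slots[start + i]);
		slots[start + i] = NULL;
	}
	return false;
}
```

With this shape, store_pages(0, 4, -1) fills four slots, while
store_pages(8, 4, 10) fails at the third page and leaves slots 8..11 empty.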

Once we have that, the logic introduced by this patch should really be
mostly limited to zswap_compress(), where the acomp interfacing would
be different based on whether batching is supported or not. This could
be changes in zswap_compress() itself, or maybe at this point we can
have a completely different path (e.g. zswap_compress_batch()). But
outside of that, I don't see why we should have a completely different
store path for the batching.
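
A minimal sketch of keeping the difference inside the compress step, again
in userspace C and under stated assumptions: struct compressor,
compress_pages() and the demo_* callbacks are hypothetical and are not the
crypto_acomp API. The caller always gets one status per page, regardless of
whether the compressor offers a batch hook.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical compressor description; not the crypto_acomp API. */
struct compressor {
	/* Compress one page; returns 0 on success. */
	int (*compress_one)(const void *src);
	/* Optional: compress nr pages at once; NULL if unsupported. */
	int (*compress_batch)(const void *const srcs[], size_t nr, int errs[]);
};

/* Demo callbacks: a NULL source page is treated as a compress error. */
static int demo_one(const void *src)
{
	return src ? 0 : -1;
}

static int demo_batch(const void *const srcs[], size_t nr, int errs[])
{
	size_t i;

	for (i = 0; i < nr; i++)
		errs[i] = srcs[i] ? 0 : -1;
	return 0;
}

/*
 * Compress nr pages and fill errs[] with one status per page. Use the
 * batch hook when the compressor has one, otherwise fall back to a
 * sequential loop. Returns the number of pages that failed.
 */
static size_t compress_pages(const struct compressor *c,
			     const void *const srcs[], size_t nr, int errs[])
{
	size_t i, failed = 0;

	assert(c && c->compress_one);

	if (c->compress_batch) {
		c->compress_batch(srcs, nr, errs);
	} else {
		for (i = 0; i < nr; i++)
			errs[i] = c->compress_one(srcs[i]);
	}

	for (i = 0; i < nr; i++)
		if (errs[i])
			failed++;

	return failed;
}
```

Everything after the compress step (allocation, tree insertion, stats) can
then consume errs[] the same way for both kinds of compressor.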

>  int zswap_swapon(int type, unsigned long nr_pages)
>  {
>         struct xarray *trees, *tree;
> --
> 2.27.0
>



* Re: [PATCH v4 10/10] mm: zswap: Compress batching with Intel IAA in zswap_batch_store() of large folios.
  2024-11-25 20:20   ` Yosry Ahmed
@ 2024-11-25 21:47     ` Johannes Weiner
  2024-11-25 21:54     ` Sridhar, Kanchana P
  1 sibling, 0 replies; 39+ messages in thread
From: Johannes Weiner @ 2024-11-25 21:47 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Kanchana P Sridhar, linux-kernel, linux-mm, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, wajdi.k.feghali, vinodh.gopal

On Mon, Nov 25, 2024 at 12:20:01PM -0800, Yosry Ahmed wrote:
> On Fri, Nov 22, 2024 at 11:01 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > This patch adds two new zswap API:
> >
> >  1) bool zswap_can_batch(void);
> >  2) void zswap_batch_store(struct folio_batch *batch, int *errors);
> >
> > Higher level mm code, for instance, swap_writepage(), can query if the
> > current zswap pool supports batching, by calling zswap_can_batch(). If so
> > it can invoke zswap_batch_store() to swapout a large folio much more
> > efficiently to zswap, instead of calling zswap_store().
> >
> > Hence, on systems with Intel IAA hardware compress/decompress accelerators,
> > swap_writepage() will invoke zswap_batch_store() for large folios.
> >
> > zswap_batch_store() will call crypto_acomp_batch_compress() to compress up
> > to SWAP_CRYPTO_BATCH_SIZE (i.e. 8) pages in large folios in parallel using
> > the multiple compress engines available in IAA.
> >
> > On platforms with multiple IAA devices per package, compress jobs from all
> > cores in a package will be distributed among all IAA devices in the package
> > by the iaa_crypto driver.
> >
> > The newly added zswap_batch_store() follows the general structure of
> > zswap_store(). Some amount of restructuring and optimization is done to
> > minimize failure points for a batch, fail early and maximize the zswap
> > store pipeline occupancy with SWAP_CRYPTO_BATCH_SIZE pages, potentially
> > from multiple folios in future. This is intended to maximize reclaim
> > throughput with the IAA hardware parallel compressions.
> >
> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > Suggested-by: Yosry Ahmed <yosryahmed@google.com>
> 
> This is definitely not what I suggested :)
> 
> I won't speak for Johannes here but I suspect it's not quite what he
> wanted either.

It is not.

I suggested having an integrated code path where "legacy" stores of
single pages are just the batch_size=1 case.

https://lore.kernel.org/linux-mm/20241107185340.GG1172372@cmpxchg.org/
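
One way to picture the integrated path, as a userspace sketch only:
store_folio() and struct walk_stats are hypothetical names, and the loop
body only counts chunks where real code would compress and store them. The
same loop serves both cases; a non-batching compressor simply runs it with
batch_size == 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for the walk below. */
struct walk_stats {
	size_t calls;	/* number of compress-and-store steps issued */
	size_t pages;	/* total pages covered by those steps */
};

/*
 * Walk a folio of nr_pages pages in chunks of at most batch_size
 * pages. With batch_size == 1 this degenerates to one step per page,
 * i.e. the sequential store.
 */
static struct walk_stats store_folio(size_t nr_pages, size_t batch_size)
{
	struct walk_stats st = { 0, 0 };
	size_t start;

	assert(batch_size > 0);

	for (start = 0; start < nr_pages; start += batch_size) {
		size_t nr = nr_pages - start;

		if (nr > batch_size)
			nr = batch_size;

		/* Real code would compress and store [start, start + nr). */
		st.calls++;
		st.pages += nr;
	}

	return st;
}
```

For a 16-page folio this gives 16 steps with batch_size 1 and 2 steps with
batch_size 8; a 10-page folio with batch_size 8 takes 2 steps (8 + 2 pages).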

> What we really need to do (and I suppose what Johannes meant, but
> please correct me if I am wrong), is to make the existing flow work
> with batches.
> 
> For example, most of zswap_store() should remain the same. It is still
> getting a folio to compress, the only difference is that we will
> parallelize the page compressions. zswap_store_page() is where some
> changes need to be made. Instead of a single function that handles the
> storage of each page, we need a vectorized function that handles the
> storage of N pages in a folio (allocate zswap_entry's, do xarray
> insertions, etc). This should be a refactoring in a separate patch.
> 
> Once we have that, the logic introduced by this patch should really be
> mostly limited to zswap_compress(), where the acomp interfacing would
> be different based on whether batching is supported or not. This could
> be changes in zswap_compress() itself, or maybe at this point we can
> have a completely different path (e.g. zswap_compress_batch()). But
> outside of that, I don't see why we should have a completely different
> store path for the batching.

+1



* RE: [PATCH v4 10/10] mm: zswap: Compress batching with Intel IAA in zswap_batch_store() of large folios.
  2024-11-25 20:20   ` Yosry Ahmed
  2024-11-25 21:47     ` Johannes Weiner
@ 2024-11-25 21:54     ` Sridhar, Kanchana P
  2024-11-25 22:08       ` Yosry Ahmed
  2024-12-02 19:26       ` Nhat Pham
  1 sibling, 2 replies; 39+ messages in thread
From: Sridhar, Kanchana P @ 2024-11-25 21:54 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, akpm, linux-crypto, herbert,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Monday, November 25, 2024 12:20 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; ying.huang@intel.com;
> 21cnbao@gmail.com; akpm@linux-foundation.org;
> linux-crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 10/10] mm: zswap: Compress batching with Intel IAA
> in zswap_batch_store() of large folios.
> 
> On Fri, Nov 22, 2024 at 11:01 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > This patch adds two new zswap API:
> >
> >  1) bool zswap_can_batch(void);
> >  2) void zswap_batch_store(struct folio_batch *batch, int *errors);
> >
> > Higher level mm code, for instance, swap_writepage(), can query if the
> > current zswap pool supports batching, by calling zswap_can_batch(). If so
> > it can invoke zswap_batch_store() to swapout a large folio much more
> > efficiently to zswap, instead of calling zswap_store().
> >
> > Hence, on systems with Intel IAA hardware compress/decompress accelerators,
> > swap_writepage() will invoke zswap_batch_store() for large folios.
> >
> > zswap_batch_store() will call crypto_acomp_batch_compress() to compress up
> > to SWAP_CRYPTO_BATCH_SIZE (i.e. 8) pages in large folios in parallel using
> > the multiple compress engines available in IAA.
> >
> > On platforms with multiple IAA devices per package, compress jobs from all
> > cores in a package will be distributed among all IAA devices in the package
> > by the iaa_crypto driver.
> >
> > The newly added zswap_batch_store() follows the general structure of
> > zswap_store(). Some amount of restructuring and optimization is done to
> > minimize failure points for a batch, fail early and maximize the zswap
> > store pipeline occupancy with SWAP_CRYPTO_BATCH_SIZE pages, potentially
> > from multiple folios in future. This is intended to maximize reclaim
> > throughput with the IAA hardware parallel compressions.
> >
> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > Suggested-by: Yosry Ahmed <yosryahmed@google.com>
> 
> This is definitely not what I suggested :)
> 
> I won't speak for Johannes here but I suspect it's not quite what he
> wanted either.

Thanks for the comments, Yosry. The "Suggested-by" tags were meant to
attribute these suggestions noted in the change log:

5) Incorporated Johannes' suggestion to not have a sysctl to enable
   compress batching.
6) Incorporated Yosry's suggestion to allocate batching resources in the
   cpu hotplug onlining code, since there is no longer a sysctl to control
   batching. Thanks Yosry!
7) Incorporated Johannes' suggestions related to making the overall
   sequence of events between zswap_store() and zswap_batch_store() as
   similar as possible for readability and control flow, better naming of
   procedures, avoiding forward declarations, not inlining error path
   procedures, deleting zswap internal details from zswap.h, etc. Thanks
   Johannes, really appreciate the direction!
   I have tried to explain the minimal future-proofing in terms of the
   zswap_batch_store() signature and the definition of "struct
   zswap_batch_store_sub_batch" in the comments for this struct. I hope the
   new code explains the control flow a bit better.

I will drop the "Suggested-by" tags in subsequent revisions, not a problem.

> 
> More below.
> 
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  include/linux/zswap.h |  12 +
> >  mm/page_io.c          |  16 +-
> >  mm/zswap.c            | 639 ++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 666 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> > index 9ad27ab3d222..a05f59139a6e 100644
> > --- a/include/linux/zswap.h
> > +++ b/include/linux/zswap.h
> > @@ -4,6 +4,7 @@
> >
> >  #include <linux/types.h>
> >  #include <linux/mm_types.h>
> > +#include <linux/pagevec.h>
> >
> >  struct lruvec;
> >
> > @@ -33,6 +34,8 @@ struct zswap_lruvec_state {
> >
> >  unsigned long zswap_total_pages(void);
> >  bool zswap_store(struct folio *folio);
> > +bool zswap_can_batch(void);
> > +void zswap_batch_store(struct folio_batch *batch, int *errors);
> >  bool zswap_load(struct folio *folio);
> >  void zswap_invalidate(swp_entry_t swp);
> >  int zswap_swapon(int type, unsigned long nr_pages);
> > @@ -51,6 +54,15 @@ static inline bool zswap_store(struct folio *folio)
> >         return false;
> >  }
> >
> > +static inline bool zswap_can_batch(void)
> > +{
> > +       return false;
> > +}
> > +
> > +static inline void zswap_batch_store(struct folio_batch *batch, int *errors)
> > +{
> > +}
> > +
> >  static inline bool zswap_load(struct folio *folio)
> >  {
> >         return false;
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index 4b4ea8e49cf6..271d3a40c0c1 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -276,7 +276,21 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> >                  */
> >                 swap_zeromap_folio_clear(folio);
> >         }
> > -       if (zswap_store(folio)) {
> > +
> > +       if (folio_test_large(folio) && zswap_can_batch()) {
> > +               struct folio_batch batch;
> > +               int error = -1;
> > +
> > +               folio_batch_init(&batch);
> > +               folio_batch_add(&batch, folio);
> > +               zswap_batch_store(&batch, &error);
> > +
> > +               if (!error) {
> > +                       count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
> > +                       folio_unlock(folio);
> > +                       return 0;
> > +               }
> 
> First of all, why does the code outside of zswap has to care or be changed?
> 
> We should still call zswap_store() here, and within that figure out if
> we can do batching or not. I am not sure what we gain by adding a
> separate interface here, especially that we are creating a batch of a
> single folio and passing it in anyway. I suspect that this leaked here
> from the patch that batches unrelated folios swapout, but please don't
> do that. This patch is about batching compression of pages in the same
> folio, and for that, there is no need for the code here to do anything
> differently or pass in a folio batch.

This was the rationale I alluded to: the "minimal future proofing" noted in
the change log, and the "error handling simplification / avoiding added
latency due to rewinds" noted in the comments for "struct
zswap_batch_store_sub_batch". These are the benefits I was trying to
articulate for the new signature of zswap_batch_store().

The change in swap_writepage() was simply an illustration to showcase
how reclaim batching would work, to try to explain how IAA can
significantly improve reclaim latency, not just zswap latency (and to get
suggestions early on).

I don't mind keeping swap_writepage() unchanged if the maintainers
feel strongly about this. I guess I am eager to demonstrate the full potential
of IAA, hence guilty of the minimal future-proofing.
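
For reference, the alternative suggested above (a single store entry point
that decides internally whether to batch) can be modeled in user space
roughly as below. This is only a sketch under stated assumptions:
model_pool, model_compress() and model_store() are hypothetical names, not
zswap code, and MODEL_BATCH_SIZE merely stands in for
SWAP_CRYPTO_BATCH_SIZE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_BATCH_SIZE 8	/* stands in for SWAP_CRYPTO_BATCH_SIZE */

/* Hypothetical stand-in for a zswap pool; only the capability bit matters. */
struct model_pool {
	bool can_batch;
	size_t compress_calls;	/* number of compress submissions made */
};

/* Hypothetical: one submission compresses 'nr' pages (1 unless batching). */
static bool model_compress(struct model_pool *pool, size_t nr)
{
	assert(nr >= 1 && nr <= MODEL_BATCH_SIZE);
	pool->compress_calls++;
	return true;
}

/*
 * Single entry point: the caller hands over a folio of 'nr_pages' pages
 * and never learns whether batching was used to store it.
 */
static bool model_store(struct model_pool *pool, size_t nr_pages)
{
	size_t step = pool->can_batch ? MODEL_BATCH_SIZE : 1;
	size_t done = 0;

	while (done < nr_pages) {
		size_t left = nr_pages - done;
		size_t nr = left < step ? left : step;

		if (!model_compress(pool, nr))
			return false;
		done += nr;
	}

	return true;
}
```

In this model, the caller's code path is the same for batching and
non-batching pools; only the number of compress submissions differs.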

> 
> Also, eliminating the need for a folio batch eliminates the need to
> call the batches in the zswap code "sub batches", which is really
> confusing imo :)

Ok, I understand.

> 
> > +       } else if (zswap_store(folio)) {
> >                 count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
> >                 folio_unlock(folio);
> >                 return 0;
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 173f7632990e..53c8e39b778b 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -229,6 +229,80 @@ static DEFINE_MUTEX(zswap_init_lock);
> >  /* init completed, but couldn't create the initial pool */
> >  static bool zswap_has_pool;
> >
> > +/*
> > + * struct zswap_batch_store_sub_batch:
> > + *
> > + * This represents a sub-batch of SWAP_CRYPTO_BATCH_SIZE pages during IAA
> > + * compress batching of a folio or (conceptually, a reclaim batch of) folios.
> > + * The new zswap_batch_store() API will break down the batch of folios being
> > + * reclaimed into sub-batches of SWAP_CRYPTO_BATCH_SIZE pages, batch compress
> > + * the pages by calling the iaa_crypto driver API crypto_acomp_batch_compress();
> > + * and storing the sub-batch in zpool/xarray before updating objcg/vm/zswap
> > + * stats.
> > + *
> > + * Although the page itself is represented directly, the structure adds a
> > + * "u8 folio_id" to represent an index for the folio in a conceptual
> > + * "reclaim batch of folios" that can be passed to zswap_store(). Conceptually,
> > + * this allows for up to 256 folios that can be passed to zswap_store().
> > + * Even though the folio_id seems redundant in the context of a single large
> > + * folio being stored by zswap, it does simplify error handling and redundant
> > + * computes/rewinding state, all of which can add latency. Since the
> > + * zswap_batch_store() of a large folio can fail for any of these reasons --
> > + * compress errors, zpool malloc errors, xarray store errors -- the procedures
> > + * that detect these errors for a sub-batch, can all call a single cleanup
> > + * procedure, zswap_batch_cleanup(), which will de-allocate zpool memory and
> > + * zswap_entries for the sub-batch and set the "errors[folio_id]" to -EINVAL.
> > + * All subsequent procedures that operate on a sub-batch will do nothing if the
> > + * errors[folio_id] is non-0. Hence, the folio_id facilitates the use of the
> > + * "errors" passed to zswap_batch_store() as a global folio error status for a
> > + * single folio (which could also be a folio in the folio_batch).
> > + *
> > + * The sub-batch concept could be further evolved to use pipelining to
> > + * overlap CPU computes with IAA computes. For instance, we could stage
> > + * the post-compress computes for sub-batch "N-1" to happen in parallel with
> > + * IAA batch compression of sub-batch "N".
> > + *
> > + * We begin by developing the concept of compress batching. Pipelining with
> > + * overlap can be future work.
> 
> I suppose all this gets simplified once we eliminate passing in a
> folio batch to zswap. In that case, a batch is just a construct
> created by zswap if the allocator supports batching page compression.
> There is also no need to describe what the code does, especially a
> centralized comment like that rather than per-function docs/comments.

These comments are only intended to articulate the vision for batching,
as we go through the revisions. I will delete these comments in the next rev.
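
The error-handling idea those comments describe (a per-folio error status
that every stage consults, plus one shared cleanup routine) can be modeled
in user space roughly as below. This is a sketch only: the model_* names
are hypothetical, has_resource[] merely stands in for the zpool memory and
zswap_entry of a page, and, unlike the patch, errors[] starts out as 0 here
to keep the model small.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_BATCH_SIZE 8	/* stands in for SWAP_CRYPTO_BATCH_SIZE */

struct model_sub_batch {
	unsigned char folio_ids[MODEL_BATCH_SIZE]; /* owning folio per page */
	bool has_resource[MODEL_BATCH_SIZE];	   /* models an allocated entry */
	size_t nr_pages;
};

/*
 * Release the resources of every sub-batch page that belongs to the
 * failed folio, and mark the folio as failed. Runs at most once per folio.
 */
static void model_cleanup(struct model_sub_batch *sb, int *errors,
			  unsigned char folio_id)
{
	if (errors[folio_id])
		return;

	for (size_t i = 0; i < sb->nr_pages; i++)
		if (sb->folio_ids[i] == folio_id)
			sb->has_resource[i] = false;

	errors[folio_id] = -EINVAL;
}

/*
 * One processing stage (compress, zpool store, xarray store, ...).
 * Pages of already-failed folios are skipped; step_fails[i] models a
 * failure on page i. Returns the number of pages the stage completed.
 */
static size_t model_stage(struct model_sub_batch *sb, int *errors,
			  const bool *step_fails)
{
	size_t processed = 0;

	for (size_t i = 0; i < sb->nr_pages; i++) {
		if (errors[sb->folio_ids[i]])
			continue;

		if (step_fails[i]) {
			model_cleanup(sb, errors, sb->folio_ids[i]);
			continue;
		}

		processed++;
	}

	return processed;
}
```

With pages from two folios in one sub-batch, a failure on any page of the
first folio leaves the second folio's pages untouched, and later stages
skip the failed folio without rewinding any state.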

> 
> We also don't want the comments to be very specific to IAA. It is
> currently the only implementation, but the code here is not specific
> to IAA AFAICT. It should just be an example or so.

Sure. You are absolutely right.
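
To illustrate keeping the code compressor-agnostic, here is a rough
user-space sketch in which batching is just an optional capability of a
compressor, with a sequential fallback. The model_* names are hypothetical
and do not correspond to the crypto acomp API; an int stands in for a page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical compressor descriptor; batch_compress is optional. */
struct model_compressor {
	int (*compress)(int page);
	int (*batch_compress)(const int *pages, size_t nr);
};

/* Counts submissions made to a compressor, for illustration only. */
static size_t model_submissions;

static int model_compress_one(int page)
{
	(void)page;
	model_submissions++;
	return 0;
}

static int model_compress_many(const int *pages, size_t nr)
{
	(void)pages;
	(void)nr;
	model_submissions++;
	return 0;
}

/* Batching is available only if the compressor registered the callback. */
static bool model_can_batch(const struct model_compressor *c)
{
	return c->batch_compress != NULL;
}

/* Use one batch submission when possible, else compress page by page. */
static int model_compress_pages(const struct model_compressor *c,
				const int *pages, size_t nr)
{
	if (model_can_batch(c))
		return c->batch_compress(pages, nr);

	for (size_t i = 0; i < nr; i++) {
		int err = c->compress(pages[i]);

		if (err)
			return err;
	}

	return 0;
}
```

Nothing in the caller names a specific accelerator; a hardware compressor
would simply be one that provides the batch callback.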

> 
> > + *
> > + * @pages: The individual pages in the sub-batch. There are no assumptions
> > + *         about all of them belonging to the same folio.
> > + * @dsts: The destination buffers for batch compress of the sub-batch.
> > + * @dlens: The destination length constraints, and eventual compressed lengths
> > + *         of successful compressions.
> > + * @comp_errors: The compress error status for each page in the sub-batch, set
> > + *               by crypto_acomp_batch_compress().
> > + * @folio_ids: The containing folio_id of each sub-batch page.
> > + * @swpentries: The page_swap_entry() for each corresponding sub-batch page.
> > + * @objcgs: The objcg for each corresponding sub-batch page.
> > + * @entries: The zswap_entry for each corresponding sub-batch page.
> > + * @nr_pages: Total number of pages in @sub_batch.
> > + * @pool: A valid zswap_pool that can_batch.
> > + *
> > + * Note:
> > + * The max sub-batch size is SWAP_CRYPTO_BATCH_SIZE, currently 8UL.
> > + * Hence, if SWAP_CRYPTO_BATCH_SIZE exceeds 256, @nr_pages needs to become u16.
> > + * The sub-batch representation is future-proofed to a small extent to be able
> > + * to easily scale the zswap_batch_store() implementation to handle a conceptual
> > + * "reclaim batch of folios"; without adding too much complexity, while
> > + * benefiting from simpler error handling, localized sub-batch resources cleanup
> > + * and avoiding expensive rewinding state. If this conceptual number of reclaim
> > + * folios sent to zswap_batch_store() exceeds 256, @folio_ids needs to
> > + * become u16.
> > + */
> > +struct zswap_batch_store_sub_batch {
> > +       struct page *pages[SWAP_CRYPTO_BATCH_SIZE];
> > +       u8 *dsts[SWAP_CRYPTO_BATCH_SIZE];
> > +       unsigned int dlens[SWAP_CRYPTO_BATCH_SIZE];
> > +       int comp_errors[SWAP_CRYPTO_BATCH_SIZE]; /* folio error status. */
> > +       u8 folio_ids[SWAP_CRYPTO_BATCH_SIZE];
> > +       swp_entry_t swpentries[SWAP_CRYPTO_BATCH_SIZE];
> > +       struct obj_cgroup *objcgs[SWAP_CRYPTO_BATCH_SIZE];
> > +       struct zswap_entry *entries[SWAP_CRYPTO_BATCH_SIZE];
> > +       u8 nr_pages;
> > +       struct zswap_pool *pool;
> > +};
> > +
> >  /*********************************
> >  * helpers and fwd declarations
> >  **********************************/
> > @@ -1705,6 +1779,571 @@ void zswap_invalidate(swp_entry_t swp)
> >                 zswap_entry_free(entry);
> >  }
> >
> > +/******************************************************
> > + * zswap_batch_store() with compress batching.
> > + ******************************************************/
> > +
> > +/*
> > + * Note: If SWAP_CRYPTO_BATCH_SIZE exceeds 256, change the
> > + * u8 stack variables in the next several functions, to u16.
> > + */
> > +bool zswap_can_batch(void)
> > +{
> > +       struct zswap_pool *pool;
> > +       bool ret = false;
> > +
> > +       pool = zswap_pool_current_get();
> > +
> > +       if (!pool)
> > +               return ret;
> > +
> > +       if (pool->can_batch)
> > +               ret = true;
> > +
> > +       zswap_pool_put(pool);
> > +
> > +       return ret;
> > +}
> > +
> > +/*
> > + * If the zswap store fails or zswap is disabled, we must invalidate
> > + * the possibly stale entries which were previously stored at the
> > + * offsets corresponding to each page of the folio. Otherwise,
> > + * writeback could overwrite the new data in the swapfile.
> > + */
> > +static void zswap_delete_stored_entries(struct folio *folio)
> > +{
> > +       swp_entry_t swp = folio->swap;
> > +       unsigned type = swp_type(swp);
> > +       pgoff_t offset = swp_offset(swp);
> > +       struct zswap_entry *entry;
> > +       struct xarray *tree;
> > +       long index;
> > +
> > +       for (index = 0; index < folio_nr_pages(folio); ++index) {
> > +               tree = swap_zswap_tree(swp_entry(type, offset + index));
> > +               entry = xa_erase(tree, offset + index);
> > +               if (entry)
> > +                       zswap_entry_free(entry);
> > +       }
> > +}
> > +
> > +static __always_inline void zswap_batch_reset(struct zswap_batch_store_sub_batch *sb)
> > +{
> > +       sb->nr_pages = 0;
> > +}
> > +
> > +/*
> > + * Upon encountering the first sub-batch page in a folio with an error due to
> > + * any of the following:
> > + *  - compression
> > + *  - zpool malloc
> > + *  - xarray store
> > + * , cleanup the sub-batch resources (zpool memory, zswap_entry) for all other
> > + * sub_batch elements belonging to the same folio, using the "error_folio_id".
> > + *
> > + * Set the "errors[error_folio_id]" to signify to all downstream computes in
> > + * zswap_batch_store(), that no further processing is required for the folio
> > + * with "error_folio_id" in the batch: this folio's zswap store status will
> > + * be considered an error, and existing zswap_entries in the xarray will be
> > + * deleted before zswap_batch_store() exits.
> > + */
> > +static void zswap_batch_cleanup(struct zswap_batch_store_sub_batch *sb,
> > +                               int *errors,
> > +                               u8 error_folio_id)
> > +{
> > +       u8 i;
> > +
> > +       if (errors[error_folio_id])
> > +               return;
> > +
> > +       for (i = 0; i < sb->nr_pages; ++i) {
> > +               if (sb->folio_ids[i] == error_folio_id) {
> > +                       if (sb->entries[i]) {
> > +                               if (!IS_ERR_VALUE(sb->entries[i]->handle))
> > +                                       zpool_free(sb->pool->zpool, sb->entries[i]->handle);
> > +
> > +                               zswap_entry_cache_free(sb->entries[i]);
> > +                               sb->entries[i] = NULL;
> > +                       }
> > +               }
> > +       }
> > +
> > +       errors[error_folio_id] = -EINVAL;
> > +}
> > +
> > +/*
> > + * Returns true if the entry was successfully
> > + * stored in the xarray, and false otherwise.
> > + */
> > +static bool zswap_store_entry(swp_entry_t page_swpentry, struct zswap_entry *entry)
> > +{
> > +       struct zswap_entry *old = xa_store(swap_zswap_tree(page_swpentry),
> > +                                          swp_offset(page_swpentry),
> > +                                          entry, GFP_KERNEL);
> > +       if (xa_is_err(old)) {
> > +               int err = xa_err(old);
> > +
> > +               WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> > +               zswap_reject_alloc_fail++;
> > +               return false;
> > +       }
> > +
> > +       /*
> > +        * We may have had an existing entry that became stale when
> > +        * the folio was redirtied and now the new version is being
> > +        * swapped out. Get rid of the old.
> > +        */
> > +       if (old)
> > +               zswap_entry_free(old);
> > +
> > +       return true;
> > +}
> > +
> > +/*
> > + * The stats accounting makes no assumptions about all pages in the sub-batch
> > + * belonging to the same folio, or having the same objcg; while still doing
> > + * the updates in aggregation.
> > + */
> > +static void zswap_batch_xarray_stats(struct zswap_batch_store_sub_batch *sb,
> > +                                    int *errors)
> > +{
> > +       int nr_objcg_pages = 0, nr_pages = 0;
> > +       struct obj_cgroup *objcg = NULL;
> > +       size_t compressed_bytes = 0;
> > +       u8 i;
> > +
> > +       for (i = 0; i < sb->nr_pages; ++i) {
> > +               if (errors[sb->folio_ids[i]])
> > +                       continue;
> > +
> > +               if (!zswap_store_entry(sb->swpentries[i], sb->entries[i])) {
> > +                       zswap_batch_cleanup(sb, errors, sb->folio_ids[i]);
> > +                       continue;
> > +               }
> > +
> > +               /*
> > +                * The entry is successfully compressed and stored in the tree,
> > +                * there is no further possibility of failure. Grab refs to the
> > +                * pool and objcg. These refs will be dropped by
> > +                * zswap_entry_free() when the entry is removed from the tree.
> > +                */
> > +               zswap_pool_get(sb->pool);
> > +               if (sb->objcgs[i])
> > +                       obj_cgroup_get(sb->objcgs[i]);
> > +
> > +               /*
> > +                * We finish initializing the entry while it's already in xarray.
> > +                * This is safe because:
> > +                *
> > +                * 1. Concurrent stores and invalidations are excluded by folio
> > +                *    lock.
> > +                *
> > +                * 2. Writeback is excluded by the entry not being on the LRU yet.
> > +                *    The publishing order matters to prevent writeback from seeing
> > +                *    an incoherent entry.
> > +                */
> > +               sb->entries[i]->pool = sb->pool;
> > +               sb->entries[i]->swpentry = sb->swpentries[i];
> > +               sb->entries[i]->objcg = sb->objcgs[i];
> > +               sb->entries[i]->referenced = true;
> > +               if (sb->entries[i]->length) {
> > +                       INIT_LIST_HEAD(&(sb->entries[i]->lru));
> > +                       zswap_lru_add(&zswap_list_lru, sb->entries[i]);
> > +               }
> > +
> > +               if (!objcg && sb->objcgs[i]) {
> > +                       objcg = sb->objcgs[i];
> > +               } else if (objcg && sb->objcgs[i] && (objcg != sb->objcgs[i])) {
> > +                       obj_cgroup_charge_zswap(objcg, compressed_bytes);
> > +                       count_objcg_events(objcg, ZSWPOUT, nr_objcg_pages);
> > +                       compressed_bytes = 0;
> > +                       nr_objcg_pages = 0;
> > +                       objcg = sb->objcgs[i];
> > +               }
> > +
> > +               if (sb->objcgs[i]) {
> > +                       compressed_bytes += sb->entries[i]->length;
> > +                       ++nr_objcg_pages;
> > +               }
> > +
> > +               ++nr_pages;
> > +       } /* for sub-batch pages. */
> > +
> > +       if (objcg) {
> > +               obj_cgroup_charge_zswap(objcg, compressed_bytes);
> > +               count_objcg_events(objcg, ZSWPOUT, nr_objcg_pages);
> > +       }
> > +
> > +       atomic_long_add(nr_pages, &zswap_stored_pages);
> > +       count_vm_events(ZSWPOUT, nr_pages);
> > +}
> > +
> > +static void zswap_batch_zpool_store(struct zswap_batch_store_sub_batch *sb,
> > +                                   int *errors)
> > +{
> > +       u8 i;
> > +
> > +       for (i = 0; i < sb->nr_pages; ++i) {
> > +               struct zpool *zpool;
> > +               unsigned long handle;
> > +               char *buf;
> > +               gfp_t gfp;
> > +               int err;
> > +
> > +               /* Skip pages belonging to folios that had compress errors. */
> > +               if (errors[sb->folio_ids[i]])
> > +                       continue;
> > +
> > +               zpool = sb->pool->zpool;
> > +               gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
> > +               if (zpool_malloc_support_movable(zpool))
> > +                       gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
> > +               err = zpool_malloc(zpool, sb->dlens[i], gfp, &handle);
> > +
> > +               if (err) {
> > +                       if (err == -ENOSPC)
> > +                               zswap_reject_compress_poor++;
> > +                       else
> > +                               zswap_reject_alloc_fail++;
> > +
> > +                       /*
> > +                        * A zpool malloc error should trigger cleanup for
> > +                        * other same-folio pages in the sub-batch, and zpool
> > +                        * resources/zswap_entries for those pages should be
> > +                        * de-allocated.
> > +                        */
> > +                       zswap_batch_cleanup(sb, errors, sb->folio_ids[i]);
> > +                       continue;
> > +               }
> > +
> > +               buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
> > +               memcpy(buf, sb->dsts[i], sb->dlens[i]);
> > +               zpool_unmap_handle(zpool, handle);
> > +
> > +               sb->entries[i]->handle = handle;
> > +               sb->entries[i]->length = sb->dlens[i];
> > +       }
> > +}
> > +
> > +static void zswap_batch_proc_comp_errors(struct zswap_batch_store_sub_batch *sb,
> > +                                        int *errors)
> > +{
> > +       u8 i;
> > +
> > +       for (i = 0; i < sb->nr_pages; ++i) {
> > +               if (sb->comp_errors[i]) {
> > +                       if (sb->comp_errors[i] == -ENOSPC)
> > +                               zswap_reject_compress_poor++;
> > +                       else
> > +                               zswap_reject_compress_fail++;
> > +
> > +                       if (!errors[sb->folio_ids[i]])
> > +                               zswap_batch_cleanup(sb, errors, sb->folio_ids[i]);
> > +               }
> > +       }
> > +}
> > +
> > +/*
> > + * Batch compress up to SWAP_CRYPTO_BATCH_SIZE pages with IAA.
> > + * It is important to note that the SWAP_CRYPTO_BATCH_SIZE resources
> > + * are allocated for the pool's per-cpu acomp_ctx during cpu
> > + * hotplug only if the crypto_acomp has registered either
> > + * batch_compress() and batch_decompress().
> > + * The iaa_crypto driver registers implementations for both these API.
> > + * Hence, if IAA is the zswap compressor, the call to
> > + * crypto_acomp_batch_compress() will compress the pages in parallel,
> > + * resulting in significant performance improvements as compared to
> > + * software compressors.
> > + */
> > +static void zswap_batch_compress(struct zswap_batch_store_sub_batch *sb,
> > +                                int *errors)
> > +{
> > +       struct crypto_acomp_ctx *acomp_ctx = raw_cpu_ptr(sb->pool->acomp_ctx);
> > +       u8 i;
> > +
> > +       mutex_lock(&acomp_ctx->mutex);
> > +
> > +       BUG_ON(acomp_ctx->nr_reqs != SWAP_CRYPTO_BATCH_SIZE);
> > +
> > +       for (i = 0; i < sb->nr_pages; ++i) {
> > +               sb->dsts[i] = acomp_ctx->buffers[i];
> > +               sb->dlens[i] = PAGE_SIZE;
> > +       }
> > +
> > +       /*
> > +        * Batch compress sub-batch "N". If IAA is the compressor, the
> > +        * hardware will compress multiple pages in parallel.
> > +        */
> > +       crypto_acomp_batch_compress(
> > +               acomp_ctx->reqs,
> > +               &acomp_ctx->wait,
> > +               sb->pages,
> > +               sb->dsts,
> > +               sb->dlens,
> > +               sb->comp_errors,
> > +               sb->nr_pages);
> > +
> > +       /*
> > +        * Scan the sub-batch for any compression errors,
> > +        * and invalidate pages with errors, along with other
> > +        * pages belonging to the same folio as the error page(s).
> > +        * Set the folio's error status in "errors" so that no
> > +        * further zswap_batch_store() processing is done for
> > +        * the folio(s) with compression errors.
> > +        */
> > +       zswap_batch_proc_comp_errors(sb, errors);
> > +
> > +       zswap_batch_zpool_store(sb, errors);
> > +
> > +       mutex_unlock(&acomp_ctx->mutex);
> > +}
> > +
> > +static void zswap_batch_add_pages(struct zswap_batch_store_sub_batch *sb,
> > +                                 struct folio *folio,
> > +                                 u8 folio_id,
> > +                                 struct obj_cgroup *objcg,
> > +                                 struct zswap_entry *entries[],
> > +                                 long start_idx,
> > +                                 u8 nr)
> > +{
> > +       long index;
> > +
> > +       for (index = start_idx; index < (start_idx + nr); ++index) {
> > +               u8 i = sb->nr_pages;
> > +               struct page *page = folio_page(folio, index);
> > +               sb->pages[i] = page;
> > +               sb->swpentries[i] = page_swap_entry(page);
> > +               sb->folio_ids[i] = folio_id;
> > +               sb->objcgs[i] = objcg;
> > +               sb->entries[i] = entries[index - start_idx];
> > +               sb->comp_errors[i] = 0;
> > +               ++sb->nr_pages;
> > +       }
> > +}
> > +
> > +/* Allocate entries for the next sub-batch. */
> > +static int zswap_batch_alloc_entries(struct zswap_entry *entries[], int node_id, u8 nr)
> > +{
> > +       u8 i;
> > +
> > +       for (i = 0; i < nr; ++i) {
> > +               entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
> > +               if (!entries[i]) {
> > +                       u8 j;
> > +
> > +                       zswap_reject_kmemcache_fail++;
> > +                       for (j = 0; j < i; ++j)
> > +                               zswap_entry_cache_free(entries[j]);
> > +                       return -EINVAL;
> > +               }
> > +
> > +               entries[i]->handle = (unsigned long)ERR_PTR(-EINVAL);
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static bool zswap_batch_comp_folio(struct folio *folio, int *errors, u8 folio_id,
> > +                                  struct obj_cgroup *objcg,
> > +                                  struct zswap_batch_store_sub_batch *sub_batch,
> > +                                  bool last)
> > +{
> > +       long folio_start_idx = 0, nr_folio_pages = folio_nr_pages(folio);
> > +       struct zswap_entry *entries[SWAP_CRYPTO_BATCH_SIZE];
> > +       int node_id = folio_nid(folio);
> > +
> > +       /*
> > +        * Iterate over the pages in the folio passed in. Construct compress
> > +        * sub-batches of up to SWAP_CRYPTO_BATCH_SIZE pages. Process each
> > +        * sub-batch with IAA batch compression. Detect errors from batch
> > +        * compression and set the folio's error status.
> > +        */
> > +       while (nr_folio_pages > 0) {
> > +               u8 add_nr_pages;
> > +
> > +               /*
> > +                * If we have accumulated SWAP_CRYPTO_BATCH_SIZE
> > +                * pages, process the sub-batch.
> > +                */
> > +               if (sub_batch->nr_pages == SWAP_CRYPTO_BATCH_SIZE) {
> > +                       zswap_batch_compress(sub_batch, errors);
> > +                       zswap_batch_xarray_stats(sub_batch, errors);
> > +                       zswap_batch_reset(sub_batch);
> > +                       /*
> > +                        * Stop processing this folio if it had compress errors.
> > +                        */
> > +                       if (errors[folio_id])
> > +                               goto ret_folio;
> > +               }
> > +
> > +               /* Add pages from the folio to the compress sub-batch. */
> > +               add_nr_pages = min3((
> > +                               (long)SWAP_CRYPTO_BATCH_SIZE -
> > +                               (long)sub_batch->nr_pages),
> > +                               nr_folio_pages,
> > +                               (long)SWAP_CRYPTO_BATCH_SIZE);
> > +
> > +               /*
> > +                * Allocate zswap entries for this sub-batch. If we get errors
> > +                * while doing so, we can fail early and flag an error for the
> > +                * folio.
> > +                */
> > +               if (zswap_batch_alloc_entries(entries, node_id, add_nr_pages)) {
> > +                       zswap_batch_reset(sub_batch);
> > +                       errors[folio_id] = -EINVAL;
> > +                       goto ret_folio;
> > +               }
> > +
> > +               zswap_batch_add_pages(sub_batch, folio, folio_id, objcg,
> > +                                     entries, folio_start_idx, add_nr_pages);
> > +
> > +               nr_folio_pages -= add_nr_pages;
> > +               folio_start_idx += add_nr_pages;
> > +       } /* this folio has pages to be compressed. */
> > +
> > +       /*
> > +        * Process last sub-batch: it could contain pages from multiple folios.
> > +        */
> > +       if (last && sub_batch->nr_pages) {
> > +               zswap_batch_compress(sub_batch, errors);
> > +               zswap_batch_xarray_stats(sub_batch, errors);
> > +       }
> > +
> > +ret_folio:
> > +       return (!errors[folio_id]);
> > +}
> > +
> > +/*
> > + * Store a large folio and/or a batch of any-order folio(s) in zswap
> > + * using IAA compress batching API.
> > + *
> > + * This is the main procedure for batching within large folios and for batching
> > + * of folios. Each large folio will be broken into sub-batches of
> > + * SWAP_CRYPTO_BATCH_SIZE pages, the sub-batch pages will be compressed by
> > + * IAA hardware compress engines in parallel, then stored in zpool/xarray.
> > + *
> > + * This procedure should only be called if zswap supports batching of stores.
> > + * Otherwise, the sequential implementation for storing folios as in the
> > + * current zswap_store() should be used. The code handles the unlikely event
> > + * that the zswap pool changes from batching to non-batching between
> > + * swap_writepage() and the start of zswap_batch_store().
> > + *
> > + * The signature of this procedure is meant to allow the calling function,
> > + * (for instance, swap_writepage()) to pass a batch of folios @batch
> > + * (the "reclaim batch") to be stored in zswap.
> > + *
> > + * @batch and @errors have folio_batch_count(@batch) number of entries,
> > + * with one-one correspondence (@errors[i] represents the error status of
> > + * @batch->folios[i], for i in folio_batch_count(@batch)). Please also
> > + * see comments preceding "struct zswap_batch_store_sub_batch" definition
> > + * above.
> > + *
> > + * The calling function (for instance, swap_writepage()) should initialize
> > + * @errors[i] to a non-0 value.
> > + * If zswap successfully stores @batch->folios[i], it will set @errors[i] to 0.
> > + * If there is an error in zswap, it will set @errors[i] to -EINVAL.
> > + *
> > + * @batch: folio_batch of folios to be batch compressed.
> > + * @errors: zswap_batch_store() error status for the folios in @batch.
> > + */
> > +void zswap_batch_store(struct folio_batch *batch, int *errors)
> > +{
> > +       struct zswap_batch_store_sub_batch sub_batch;
> > +       struct zswap_pool *pool;
> > +       u8 i;
> > +
> > +       /*
> > +        * If zswap is disabled, we must invalidate the possibly stale entry
> > +        * which was previously stored at this offset. Otherwise, writeback
> > +        * could overwrite the new data in the swapfile.
> > +        */
> > +       if (!zswap_enabled)
> > +               goto check_old;
> > +
> > +       pool = zswap_pool_current_get();
> > +
> > +       if (!pool) {
> > +               if (zswap_check_limits())
> > +                       queue_work(shrink_wq, &zswap_shrink_work);
> > +               goto check_old;
> > +       }
> > +
> > +       if (!pool->can_batch) {
> > +               for (i = 0; i < folio_batch_count(batch); ++i)
> > +                       if (zswap_store(batch->folios[i]))
> > +                               errors[i] = 0;
> > +                       else
> > +                               errors[i] = -EINVAL;
> > +               /*
> > +                * Seems preferable to release the pool ref after the calls to
> > +                * zswap_store(), so that the non-batching pool cannot be
> > +                * deleted, can be used for sequential stores, and the zswap pool
> > +                * cannot morph into a batching pool.
> > +                */
> > +               zswap_pool_put(pool);
> > +               return;
> > +       }
> > +
> > +       zswap_batch_reset(&sub_batch);
> > +       sub_batch.pool = pool;
> > +
> > +       for (i = 0; i < folio_batch_count(batch); ++i) {
> > +               struct folio *folio = batch->folios[i];
> > +               struct obj_cgroup *objcg = NULL;
> > +               struct mem_cgroup *memcg = NULL;
> > +               bool ret = false;
> > +
> > +               VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > +               VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > +
> > +               objcg = get_obj_cgroup_from_folio(folio);
> > +               if (objcg && !obj_cgroup_may_zswap(objcg)) {
> > +                       memcg = get_mem_cgroup_from_objcg(objcg);
> > +                       if (shrink_memcg(memcg)) {
> > +                               mem_cgroup_put(memcg);
> > +                               goto put_objcg;
> > +                       }
> > +                       mem_cgroup_put(memcg);
> > +               }
> > +
> > +               if (zswap_check_limits())
> > +                       goto put_objcg;
> > +
> > +               if (objcg) {
> > +                       memcg = get_mem_cgroup_from_objcg(objcg);
> > +                       if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> > +                               mem_cgroup_put(memcg);
> > +                               goto put_objcg;
> > +                       }
> > +                       mem_cgroup_put(memcg);
> > +               }
> > +
> > +               /*
> > +                * By default, set zswap error status in "errors" to "success"
> > +                * for use in swap_writepage() when this returns. In case of
> > +                * errors encountered in any sub-batch in which this folio's
> > +                * pages are batch-compressed, a negative error number will
> > +                * over-write this when zswap_batch_cleanup() is called.
> > +                */
> > +               errors[i] = 0;
> > +               ret = zswap_batch_comp_folio(folio, errors, i, objcg, &sub_batch,
> > +                                            (i == folio_batch_count(batch) - 1));
> > +
> > +put_objcg:
> > +               obj_cgroup_put(objcg);
> > +               if (!ret && zswap_pool_reached_full)
> > +                       queue_work(shrink_wq, &zswap_shrink_work);
> > +       } /* for batch folios */
> > +
> > +       zswap_pool_put(pool);
> > +
> > +check_old:
> > +       for (i = 0; i < folio_batch_count(batch); ++i)
> > +               if (errors[i])
> > +                       zswap_delete_stored_entries(batch->folios[i]);
> > +}
> > +
> 
> I didn't look too closely at the code, but you are essentially
> replicating the entire zswap store code path and making it work with
> batches. This is a maintenance nightmare, and the code could very
> easily go out-of-sync.
> 
> What we really need to do (and I suppose what Johannes meant, but
> please correct me if I am wrong), is to make the existing flow work
> with batches.
> 
> For example, most of zswap_store() should remain the same. It is still
> getting a folio to compress, the only difference is that we will
> parallelize the page compressions. zswap_store_page() is where some
> changes need to be made. Instead of a single function that handles the
> storage of each page, we need a vectorized function that handles the
> storage of N pages in a folio (allocate zswap_entry's, do xarray
> insertions, etc). This should be a refactoring in a separate patch.
> 
> Once we have that, the logic introduced by this patch should really be
> mostly limited to zswap_compress(), where the acomp interfacing would
> be different based on whether batching is supported or not. This could
> be changes in zswap_compress() itself, or maybe at this point we can
> have a completely different path (e.g. zswap_compress_batch()). But
> outside of that, I don't see why we should have a completely different
> store path for the batching.

You are absolutely right, and that is my eventual goal. I see no reason
why zswap_batch_store() cannot morph into a vectorized implementation
of zswap_store()/zswap_store_batch(). I just did not want to make a
drastic change in v4.

As per earlier suggestions, I have tried to derive the same structure
for zswap_batch_store() as is in place for zswap_store(), plus made some
optimizations that can only benefit the current zswap_store(), such as
minimizing the rewinding of state from must-have computes for a large
folio such as allocating zswap_entries upfront. If zswap_batch_store()
replaces zswap_store(), this would be an optimization over-and-above
the existing zswap_store().

For sure, it wouldn't make sense to maintain both versions. 

There are some minimal "future-proofing" details such as:

1) The folio_batch: This is the most contentious, I believe, because it
     is aimed toward evolving the zswap_batch_store() interface for
     reclaim batching, while allowing the folio-error association for the
     partial benefits provided by (2). As mentioned earlier, I can delete this
     in the next rev if the maintainers feel strongly about this.
2) int* error signature: benefit can be realized today due to the latency
    optimization it enables from detecting errors early, localized cleanup,
    preventing unwinding state. That said, the same benefits can be realized
    without making it a part of the interface.
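
To make the @errors contract from the zswap_batch_store() comment concrete,
here is a minimal userspace sketch. It is not kernel code: "struct
model_folio", its fields, and model_batch_store() are hypothetical names
invented for this illustration. It only models the documented behavior:
the caller pre-sets each status to a non-zero value, a successful store
writes 0, a failed store writes -EINVAL, and every folio left with a
non-zero status has its possibly stale entry invalidated.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a folio; not a kernel type. */
struct model_folio {
	bool compressible;	/* whether the modelled store succeeds */
	bool stale_entry;	/* an older copy is still held by the store */
	bool stored;		/* the current data is held by the store */
};

/*
 * Model of the documented contract. The caller initializes errors[i]
 * to a non-zero value; this function sets 0 on success and -EINVAL on
 * failure. If the store is disabled, no status is touched.
 */
static void model_batch_store(struct model_folio *folios, size_t n,
			      int *errors, bool store_enabled)
{
	size_t i;

	if (store_enabled) {
		for (i = 0; i < n; i++) {
			if (folios[i].compressible) {
				folios[i].stored = true;
				/* The new data replaces any older copy. */
				folios[i].stale_entry = false;
				errors[i] = 0;
			} else {
				errors[i] = -EINVAL;
			}
		}
	}

	/*
	 * Counterpart of the check_old loop: any folio that was not
	 * stored must not leave an older copy behind.
	 */
	for (i = 0; i < n; i++)
		if (errors[i])
			folios[i].stale_entry = false;
}
```

The non-zero initial value is what lets the disabled path skip the store
loop entirely and still reach the invalidation step for every folio.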

Thanks,
Kanchana

> 
> >  int zswap_swapon(int type, unsigned long nr_pages)
> >  {
> >         struct xarray *trees, *tree;
> > --
> > 2.27.0
> >


* Re: [PATCH v4 10/10] mm: zswap: Compress batching with Intel IAA in zswap_batch_store() of large folios.
  2024-11-25 21:54     ` Sridhar, Kanchana P
@ 2024-11-25 22:08       ` Yosry Ahmed
  2024-12-02 19:26       ` Nhat Pham
  1 sibling, 0 replies; 39+ messages in thread
From: Yosry Ahmed @ 2024-11-25 22:08 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, akpm, linux-crypto, herbert,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh

[..]
> > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > > Suggested-by: Yosry Ahmed <yosryahmed@google.com>
> >
> > This is definitely not what I suggested :)
> >
> > I won't speak for Johannes here but I suspect it's not quite what he
> > wanted either.
>
> Thanks for the comments Yosry. I was attributing these suggestions noted
> in the change log:
>
> 5) Incorporated Johannes' suggestion to not have a sysctl to enable
>    compress batching.
> 6) Incorporated Yosry's suggestion to allocate batching resources in the
>    cpu hotplug onlining code, since there is no longer a sysctl to control
>    batching. Thanks Yosry!
> 7) Incorporated Johannes' suggestions related to making the overall
>    sequence of events between zswap_store() and zswap_batch_store() similar
>    as much as possible for readability and control flow, better naming of
>    procedures, avoiding forward declarations, not inlining error path
>    procedures, deleting zswap internal details from zswap.h, etc. Thanks
>    Johannes, really appreciate the direction!
>    I have tried to explain the minimal future-proofing in terms of the
>    zswap_batch_store() signature and the definition of "struct
>    zswap_batch_store_sub_batch" in the comments for this struct. I hope the
>    new code explains the control flow a bit better.
>
> I will delete the "Suggested-by" in subsequent revs, not a problem.

I wasn't really arguing about the tag, but rather that this
implementation is not in the direction that we suggested.

FWIW, the way I usually handle "Suggested-by" is if the core idea of a
patch was suggested by someone. In this case, these are normal review
comments that you addressed, so I wouldn't add the tag. The only case
in which I add the tag in response to review comments is if they
resulted in a new patch that was the reviewer's idea.

Just my 2c, perhaps others do it differently.

[..]
> > > @@ -276,7 +276,21 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> > >                  */
> > >                 swap_zeromap_folio_clear(folio);
> > >         }
> > > -       if (zswap_store(folio)) {
> > > +
> > > +       if (folio_test_large(folio) && zswap_can_batch()) {
> > > +               struct folio_batch batch;
> > > +               int error = -1;
> > > +
> > > +               folio_batch_init(&batch);
> > > +               folio_batch_add(&batch, folio);
> > > +               zswap_batch_store(&batch, &error);
> > > +
> > > +               if (!error) {
> > > +                       count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
> > > +                       folio_unlock(folio);
> > > +                       return 0;
> > > +               }
> >
> > First of all, why does the code outside of zswap have to care or be changed?
> >
> > We should still call zswap_store() here, and within that figure out if
> > we can do batching or not. I am not sure what we gain by adding a
> > separate interface here, especially that we are creating a batch of a
> > single folio and passing it in anyway. I suspect that this leaked here
> > from the patch that batches unrelated folios swapout, but please don't
> > do that. This patch is about batching compression of pages in the same
> > folio, and for that, there is no need for the code here to do anything
> > differently or pass in a folio batch.
>
> This was the "minimal future proofing" and "error handling simplification/
> avoiding adding latency due to rewinds" rationale I alluded to in the
> change log and in the comments for "struct zswap_batch_store_sub_batch"
> respectively. This is what I was trying to articulate in terms of the benefits
> of the new signature of zswap_batch_store().
>
> The change in swap_writepage() was simply an illustration to show-case
> how the reclaim batching would work, to try and explain how IAA can
> significantly improve reclaim latency, not just zswap latency (and get
> suggestions early-on).
>
> I don't mind keeping swap_writepage() unchanged if the maintainers
> feel strongly about this. I guess I am eager to demonstrate the full potential
> of IAA, hence guilty of the minimal future-proofing.

While I do appreciate your eagerness, I will be completely honest
here. The change to batch reclaim of unrelated folios, while it can be
reasonable, is an order of magnitude more controversial imo. By
tightly coupling this to this series, you are doing yourself a
disservice tbh. I would say drop any changes specific to that for now.
Leave that to a completely separate discussion. This is the easiest
way to make forward progress on this series, which I am sure is what
we all want here (and I do very much appreciate your work!).
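
As a rough illustration of keeping the caller untouched, here is a minimal
userspace sketch, not kernel code. All names (model_pool, model_folio,
model_store(), model_writepage(), and the two counters) are hypothetical
and exist only for this example. The caller keeps a single boolean store
call, and the choice between the batched and the per-page path is made
inside the store function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; not kernel types. */
struct model_pool {
	bool can_batch;		/* compressor supports batching */
};

struct model_folio {
	unsigned int nr_pages;
};

/* Counters so the path taken can be observed. */
static unsigned int model_batched_calls;
static unsigned int model_single_calls;

/* Models one batched submission covering all pages of the folio. */
static bool model_compress_batch(const struct model_folio *folio)
{
	(void)folio;
	model_batched_calls++;
	return true;
}

/* Models the compression of a single page. */
static bool model_compress_page(const struct model_folio *folio,
				unsigned int idx)
{
	(void)folio;
	(void)idx;
	model_single_calls++;
	return true;
}

/* Single entry point: batching is an internal decision. */
static bool model_store(const struct model_pool *pool,
			const struct model_folio *folio)
{
	unsigned int i;

	if (pool->can_batch && folio->nr_pages > 1)
		return model_compress_batch(folio);

	for (i = 0; i < folio->nr_pages; i++)
		if (!model_compress_page(folio, i))
			return false;
	return true;
}

/* The caller does not know or care whether batching happened. */
static int model_writepage(const struct model_pool *pool,
			   const struct model_folio *folio)
{
	if (model_store(pool, folio))
		return 0;
	return 1;	/* would fall through to the backing swap device */
}
```

With this shape, no folio_batch has to be built by the caller for a single
folio, and the batching capability stays a private detail of the store.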

[..]
> > I didn't look too closely at the code, but you are essentially
> > replicating the entire zswap store code path and making it work with
> > batches. This is a maintenance nightmare, and the code could very
> > easily go out-of-sync.
> >
> > What we really need to do (and I suppose what Johannes meant, but
> > please correct me if I am wrong), is to make the existing flow work
> > with batches.
> >
> > For example, most of zswap_store() should remain the same. It is still
> > getting a folio to compress, the only difference is that we will
> > parallelize the page compressions. zswap_store_page() is where some
> > changes need to be made. Instead of a single function that handles the
> > storage of each page, we need a vectorized function that handles the
> > storage of N pages in a folio (allocate zswap_entry's, do xarray
> > insertions, etc). This should be a refactoring in a separate patch.
> >
> > Once we have that, the logic introduced by this patch should really be
> > mostly limited to zswap_compress(), where the acomp interfacing would
> > be different based on whether batching is supported or not. This could
> > be changes in zswap_compress() itself, or maybe at this point we can
> > have a completely different path (e.g. zswap_compress_batch()). But
> > outside of that, I don't see why we should have a completely different
> > store path for the batching.
>
> You are absolutely right, and that is my eventual goal. I see no reason
> why zswap_batch_store() cannot morph into a vectorized implementation
> of zswap_store()/zswap_store_batch(). I just did not want to make a
> drastic change in v4.

I wouldn't send intermediate versions that are not the code you want
to be merged. It's not the best use of anyone's time to send code that
both of us agree is not what we want to merge anyway. Even if that
means a rewrite between versions, that's not uncommon as long as you
describe the changes in the changelog.

>
> As per earlier suggestions, I have tried to derive the same structure
> for zswap_batch_store() as is in place for zswap_store(), plus made some
> optimizations that can only benefit the current zswap_store(), such as
> minimizing the rewinding of state from must-have computes for a large
> folio such as allocating zswap_entries upfront. If zswap_batch_store()
> replaces zswap_store(), this would be an optimization over-and-above
> the existing zswap_store().
>
> For sure, it wouldn't make sense to maintain both versions.
>
> There are some minimal "future-proofing" details such as:
>
> 1) The folio_batch: This is the most contentious, I believe, because it
>      is aimed toward evolving the zswap_batch_store() interface for
>      reclaim batching, while allowing the folio-error association for the
>      partial benefits provided by (2). As mentioned earlier, I can delete this
>      in the next rev if the maintainers feel strongly about this.

Yeah I would drop this for now as I mentioned earlier.

> 2) int* error signature: benefit can be realized today due to the latency
>     optimization it enables from detecting errors early, localized cleanup,
>     preventing unwinding state. That said, the same benefits can be realized
>     without making it a part of the interface.

If this is a pure optimization that is not needed initially, I would
do it in a separate patch after this one. Even better if it's a follow
up patch, this is already dense enough :)

To summarize, I understand your eagerness to present all the work you
have in mind, and I appreciate it. But it's most effective to have
changes in independent digestible chunks wherever possible. For next
versions, I would only include the bare minimum to support the
compression batching and showcase its performance benefits. Extensions
and optimizations can be added on top once this lands. This makes both
your life and the reviewers' lives a lot easier, and ultimately gets
things merged faster.
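
For what it's worth, the "vectorized function that handles the storage of
N pages in a folio" could look roughly like the userspace sketch below.
This is only a model under stated assumptions, not a proposed
implementation: MODEL_BATCH_SIZE, struct model_entry, the compress_fn
hook, and the two sample compressors are all hypothetical. It allocates
all entries upfront, compresses the pages in sub-batches, and frees
everything if any sub-batch fails, so the folio is stored entirely or not
at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_BATCH_SIZE 8	/* hypothetical sub-batch size */

/* Hypothetical per-page record; not the kernel's zswap_entry. */
struct model_entry {
	size_t compressed_len;
};

/* Hypothetical hook: compresses n pages, fills lens[0..n-1]. */
typedef bool (*compress_fn)(const size_t *page_ids, size_t n, size_t *lens);

/* Sample compressor that always succeeds. */
static bool model_compress_ok(const size_t *page_ids, size_t n, size_t *lens)
{
	size_t i;

	for (i = 0; i < n; i++)
		lens[i] = 100 + page_ids[i];
	return true;
}

/* Sample compressor that fails for any page index of 8 or higher. */
static bool model_compress_fail_high(const size_t *page_ids, size_t n,
				     size_t *lens)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (page_ids[i] >= 8)
			return false;
		lens[i] = 100 + page_ids[i];
	}
	return true;
}

/* Stores nr_pages pages; on success *out owns the entry array. */
static bool model_store_pages(size_t nr_pages, compress_fn compress,
			      struct model_entry **out)
{
	struct model_entry *entries;
	size_t ids[MODEL_BATCH_SIZE], lens[MODEL_BATCH_SIZE];
	size_t start, i, n = 0;

	if (!nr_pages)
		return false;

	/* Allocate every entry upfront so later steps cannot run short. */
	entries = calloc(nr_pages, sizeof(*entries));
	if (!entries)
		return false;

	for (start = 0; start < nr_pages; start += n) {
		n = nr_pages - start;
		if (n > MODEL_BATCH_SIZE)
			n = MODEL_BATCH_SIZE;

		for (i = 0; i < n; i++)
			ids[i] = start + i;

		/* One failing sub-batch fails the whole folio. */
		if (!compress(ids, n, lens)) {
			free(entries);
			return false;
		}

		for (i = 0; i < n; i++)
			entries[start + i].compressed_len = lens[i];
	}

	*out = entries;
	return true;
}
```

Only the compress hook differs between a batching and a non-batching
compressor here; the allocation, iteration, and unwinding are shared.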



* Re: [PATCH v4 10/10] mm: zswap: Compress batching with Intel IAA in zswap_batch_store() of large folios.
  2024-11-25 21:54     ` Sridhar, Kanchana P
  2024-11-25 22:08       ` Yosry Ahmed
@ 2024-12-02 19:26       ` Nhat Pham
  2024-12-03  0:34         ` Sridhar, Kanchana P
  1 sibling, 1 reply; 39+ messages in thread
From: Nhat Pham @ 2024-12-02 19:26 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: Yosry Ahmed, linux-kernel, linux-mm, hannes, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, akpm, linux-crypto, herbert,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh

On Mon, Nov 25, 2024 at 1:54 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
> There are some minimal "future-proofing" details such as:

I don't think they're minimal :)

>
> 1) The folio_batch: This is the most contentious, I believe, because it
>      is aimed toward evolving the zswap_batch_store() interface for
>      reclaim batching, while allowing the folio-error association for the
>      partial benefits provided by (2). As mentioned earlier, I can delete this
>      in the next rev if the maintainers feel strongly about this.

Let's delete it, and focus on the low hanging fruit (large folio zswap storing).

> 2) int* error signature: benefit can be realized today due to the latency
>     optimization it enables from detecting errors early, localized cleanup,
>     preventing unwinding state. That said, the same benefits can be realized
>     without making it a part of the interface.

This can be done in a separate patch/follow up. It's not related to this work :)



* RE: [PATCH v4 10/10] mm: zswap: Compress batching with Intel IAA in zswap_batch_store() of large folios.
  2024-12-02 19:26       ` Nhat Pham
@ 2024-12-03  0:34         ` Sridhar, Kanchana P
  0 siblings, 0 replies; 39+ messages in thread
From: Sridhar, Kanchana P @ 2024-12-03  0:34 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Yosry Ahmed, linux-kernel, linux-mm, hannes, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, akpm, linux-crypto, herbert,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Monday, December 2, 2024 11:27 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Yosry Ahmed <yosryahmed@google.com>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; hannes@cmpxchg.org; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 10/10] mm: zswap: Compress batching with Intel IAA
> in zswap_batch_store() of large folios.
> 
> On Mon, Nov 25, 2024 at 1:54 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > There are some minimal "future-proofing" details such as:
> 
> I don't think they're minimal :)

Sure. Deprecated :). The suggestions are being pursued in [1], with the
goal of converging with a v5 of this series.

[1]: https://patchwork.kernel.org/project/linux-mm/list/?series=912937

> 
> >
> > 1) The folio_batch: This is the most contentious, I believe, because it
> >      is aimed toward evolving the zswap_batch_store() interface for
> >      reclaim batching, while allowing the folio-error association for the
> >      partial benefits provided by (2). As mentioned earlier, I can delete this
> >      in the next rev if the maintainers feel strongly about this.
> 
> Let's delete it, and focus on the low hanging fruit (large folio zswap storing).

Sure.

> 
> > 2) int* error signature: benefit can be realized today due to the latency
> >     optimization it enables from detecting errors early, localized cleanup,
> >     preventing unwinding state. That said, the same benefits can be realized
> >     without making it a part of the interface.
> 
> This can be done in a separate patch/follow up. It's not related to this work :)

Agreed. A simpler approach with consolidated batching/non-batching code
paths is being pursued in [1].

Thanks,
Kanchana


Thread overview: 39+ messages
2024-11-23  7:01 [PATCH v4 00/10] zswap IAA compress batching Kanchana P Sridhar
2024-11-23  7:01 ` [PATCH v4 01/10] crypto: acomp - Define two new interfaces for compress/decompress batching Kanchana P Sridhar
2024-11-25  9:35   ` Herbert Xu
2024-11-25 20:03     ` Sridhar, Kanchana P
2024-11-26  2:13       ` Sridhar, Kanchana P
2024-11-26  2:14         ` Herbert Xu
2024-11-26  2:37           ` Sridhar, Kanchana P
2024-11-27  1:22             ` Sridhar, Kanchana P
2024-11-27  5:04               ` Herbert Xu
2024-11-23  7:01 ` [PATCH v4 02/10] crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable async mode Kanchana P Sridhar
2024-11-23  7:01 ` [PATCH v4 03/10] crypto: iaa - Implement batch_compress(), batch_decompress() API in iaa_crypto Kanchana P Sridhar
2024-11-26  7:05   ` kernel test robot
2024-11-23  7:01 ` [PATCH v4 04/10] crypto: iaa - Make async mode the default Kanchana P Sridhar
2024-11-23  7:01 ` [PATCH v4 05/10] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
2024-11-23  7:01 ` [PATCH v4 06/10] crypto: iaa - Re-organize the iaa_crypto driver code Kanchana P Sridhar
2024-11-23  7:01 ` [PATCH v4 07/10] crypto: iaa - Map IAA devices/wqs to cores based on packages instead of NUMA Kanchana P Sridhar
2024-11-23  7:01 ` [PATCH v4 08/10] crypto: iaa - Distribute compress jobs from all cores to all IAAs on a package Kanchana P Sridhar
2024-11-23  7:01 ` [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching Kanchana P Sridhar
2024-12-02 19:15   ` Nhat Pham
2024-12-03  0:30     ` Sridhar, Kanchana P
2024-12-03  8:00       ` Herbert Xu
2024-12-03 21:37         ` Sridhar, Kanchana P
2024-12-03 21:44           ` Yosry Ahmed
2024-12-03 22:17             ` Sridhar, Kanchana P
2024-12-03 22:24               ` Sridhar, Kanchana P
2024-12-04  1:42             ` Herbert Xu
2024-12-04 22:35               ` Yosry Ahmed
2024-12-04 22:49                 ` Sridhar, Kanchana P
2024-12-04 22:55                   ` Yosry Ahmed
2024-12-04 23:12                     ` Sridhar, Kanchana P
2024-12-21  6:30       ` Sridhar, Kanchana P
2024-11-23  7:01 ` [PATCH v4 10/10] mm: zswap: Compress batching with Intel IAA in zswap_batch_store() of large folios Kanchana P Sridhar
2024-11-25  8:00   ` kernel test robot
2024-11-25 20:20   ` Yosry Ahmed
2024-11-25 21:47     ` Johannes Weiner
2024-11-25 21:54     ` Sridhar, Kanchana P
2024-11-25 22:08       ` Yosry Ahmed
2024-12-02 19:26       ` Nhat Pham
2024-12-03  0:34         ` Sridhar, Kanchana P
