* [PATCH v3 00/13] zswap IAA compress batching
@ 2024-11-06 19:20 Kanchana P Sridhar
  2024-11-06 19:20 ` [PATCH v3 01/13] crypto: acomp - Define two new interfaces for compress/decompress batching Kanchana P Sridhar
                   ` (13 more replies)
  0 siblings, 14 replies; 38+ messages in thread
From: Kanchana P Sridhar @ 2024-11-06 19:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar


IAA Compression Batching:
=========================

This patch-series introduces the use of the Intel In-Memory Analytics
Accelerator (IAA) for parallel compression of pages in large folios.

The patch-series is organized as follows:

 1) crypto acomp & iaa_crypto driver enablers for batching: Relevant
    patches are tagged with "crypto:" in the subject:

    Patch 1) New acomp_alg/crypto_acomp batch_compress() and batch_decompress()
             interfaces that swap modules can invoke via the new batching
             API crypto_acomp_batch_compress() and
             crypto_acomp_batch_decompress() (see the caller-side sketch
             after this list).
    Patch 2) New CRYPTO_ACOMP_REQ_POLL acomp_req flag to act as a gate for
             async poll mode in iaa_crypto.
    Patch 3) iaa-crypto driver implementations for async polling,
             crypto_acomp_batch_compress() and crypto_acomp_batch_decompress().
    Patch 4) Modifying the default iaa_crypto driver mode to async.
    Patch 5) Disabling verify_compress by default, to make it easier for
             users to compare IAA with software compressors.
    Patch 6) Changing the cpu-to-iaa mappings to more evenly balance cores
             to IAA devices.
    Patch 7) Addition of a "global_wq" per IAA device, which serves as a
             socket-wide resource for compress jobs. If the user
             configures 2 WQs per IAA device, the driver distributes
             compress jobs from all cores on the socket across the
             "global_wqs" of all the IAA devices on that socket, in
             round-robin manner. This can improve compression throughput
             for workloads with heavy swapout activity.
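
For reference, below is a minimal caller-side sketch of the Patch 1
batching API (not code from this series). It assumes the caller has
already allocated the acomp_reqs ("reqs"), the source "pages", and
destination buffers "dst_bufs" of size 2 * PAGE_SIZE each, e.g. per-cpu
as the zswap patches later in this series do; the function and variable
names are illustrative:

  static void example_batch_compress(struct acomp_req *reqs[],
                                     struct page *pages[],
                                     u8 *dst_bufs[],
                                     int nr_pages)
  {
          unsigned int dlens[CRYPTO_BATCH_SIZE];
          int errors[CRYPTO_BATCH_SIZE];
          struct crypto_wait wait;
          int i;

          crypto_init_wait(&wait);

          /* One call submits and completes up to CRYPTO_BATCH_SIZE pages. */
          crypto_acomp_batch_compress(reqs, &wait, pages, dst_bufs,
                                      dlens, errors, nr_pages);

          for (i = 0; i < nr_pages; ++i) {
                  if (errors[i])
                          pr_debug("page %d: compress error %d\n",
                                   i, errors[i]);
                  else
                          pr_debug("page %d: compressed to %u bytes\n",
                                   i, dlens[i]);
          }
  }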

 2) zswap modifications to enable compress batching in zswap_store() of
    large folios (including pmd-mappable folios):

    Patch 8) acomp_ctx mutex lock acquire/release once optimizations in
             zswap_store() and a minor change in releasing the lock in
             zswap_decompress().
    Patch 9) Change the "struct crypto_acomp_ctx" to contain a configurable
             number of acomp_reqs and buffers.
    Patch 10) Introduce a separate per-cpu "acomp_batch_ctx" member in
              "struct zswap_pool" to be able to allocate multiple
              acomp_reqs/buffers for use in batching, as needed, per core.
    Patch 11) Allocation of the per-cpu "acomp_batch_ctx" for a
              zswap_pool.
    Patch 12) Add a new "sysctl vm.compress-batching" 0/1 switch to
              enable/disable compress batching dynamically at runtime (a
              generic sketch of the underlying sysctl pattern follows
              this list).
    Patch 13) zswap_store() IAA compress batching implementation with
              minimal memory footprint cost per-cpu, and using the new
              crypto_acomp_batch_compress() iaa_crypto driver API.
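
As a point of reference for Patch 12, this is the standard kernel pattern
for a 0/1 "vm." sysctl. It is a hypothetical sketch only; the variable and
table names below are illustrative, not the actual ones added by the
patch:

  static int sysctl_compress_batching;      /* 0 = disabled, 1 = enabled */

  static struct ctl_table example_vm_table[] = {
          {
                  .procname       = "compress-batching",
                  .data           = &sysctl_compress_batching,
                  .maxlen         = sizeof(int),
                  .mode           = 0644,
                  .proc_handler   = proc_dointvec_minmax,
                  .extra1         = SYSCTL_ZERO,
                  .extra2         = SYSCTL_ONE,
          },
  };

At runtime, such a switch is toggled with "sysctl vm.compress-batching=1"
(or by writing to /proc/sys/vm/compress-batching).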

With the runtime switch to enable compress batching and the crypto batching
API added in v2, this feature is enabled only on Intel platforms that have
IAA.
 

System setup for testing:
=========================
Testing of this patch-series was done with mm-unstable as of 10-29-2024,
commit 9fb8e0a1c486, both without and with this patch-series.
Data was gathered on a dual-socket Intel Sapphire Rapids server with 56
cores per socket, 4 IAA devices per socket, 503 GiB RAM and a 525G SSD
disk partition as swap. Core frequency was fixed at 2500 MHz.

Other kernel configuration parameters:

    zswap compressor  : zstd, deflate-iaa
    zswap allocator   : zsmalloc
    vm.page-cluster   : 0, 2

IAA "compression verification" is disabled and IAA is run in the the async
poll mode (the defaults with this series). 2WQs are configured per IAA
device. Compress jobs from all cores on a socket are distributed among all
4 IAA devices on the same socket.

I ran experiments with these workloads:

1) usemem with 30 processes, with these large folio sizes set to "always":
   - 16k/32k/64k
   - 2048k

2) Kernel compilation, allmodconfig, with 2G max memory, run in tmpfs with
   these large folio sizes set to "always":
   - 16k/32k/64k


Performance testing (usemem30):
===============================
The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
processes were run, each allocating and writing 10G of memory, and sleeping
for 10 sec before exiting:

usemem --init-time -w -O -s 10 -n 30 10g


 16k/32k/64k folios: usemem30/deflate-iaa:
 =========================================

 -------------------------------------------------------------------------------
                     mm-unstable-10-29-2024             v2 of this patch-series
 -------------------------------------------------------------------------------
 zswap compressor               deflate-iaa        deflate-iaa      deflate-iaa      
 vm.compress-batching                   n/a                  0                1
 vm.page-cluster                          2                  2                2
 -------------------------------------------------------------------------------
 Total throughput (KB/s)          7,756,632          7,753,984        8,075,817 
 Avg throughput (KB/s)              258,554            258,466          269,193 
 elapsed time (sec)                   87.75              88.71            85.82 
 sys time (sec)                    2,073.04           2,147.47         2,030.52 
                                                                               
 -------------------------------------------------------------------------------
 memcg_high                         715,854            714,238          720,459
 memcg_swap_fail                      1,194              1,175            1,250
 zswpout                         64,510,869         64,510,832       64,511,219
 zswpin                                 458                456              450
 pswpout                                  0                  0                0
 pswpin                                   0                  0                0
 thp_swpout                               0                  0                0
 thp_swpout_fallback                      0                  0                0
 16kB-mthp_swpout_fallback                0                  0                0                                          
 32kB-mthp_swpout_fallback                0                  0                0
 64kB-mthp_swpout_fallback            1,194              1,175            1,250
 pgmajfault                           3,183              3,513            3,116
 swap_ra                                108                125              116
 swap_ra_hit                             45                 65               43
 ZSWPOUT-16kB                             2                  3                3
 ZSWPOUT-32kB                             1                  1                2
 ZSWPOUT-64kB                     4,030,658          4,030,672        4,030,624
 SWPOUT-16kB                              0                  0                0
 SWPOUT-32kB                              0                  0                0
 SWPOUT-64kB                              0                  0                0
 -------------------------------------------------------------------------------


 16k/32k/64k folios: usemem30/zstd:
 ==================================

 -------------------------------------------------------------------------------
                     mm-unstable-10-29-2024        v2 of this patch-series
 -------------------------------------------------------------------------------
 zswap compressor                      zstd                  zstd
 vm.compress-batching                   n/a                     0 
 vm.page-cluster                          2                     2
 -----------------------------------------------------------------------------
 Total throughput (KB/s)          6,054,147             6,109,360 
 Avg throughput (KB/s)              201,804               203,645 
 elapsed time (sec)                  111.66                111.72 
 sys time (sec)                    2,693.21              2,685.27 
                                                                 
 -----------------------------------------------------------------------------
 memcg_high                         489,133               480,524
 memcg_swap_fail                      1,045                 1,308
 zswpout                         48,931,716            48,931,540
 zswpin                                 407                   394
 pswpout                                  0                     0
 pswpin                                   0                     0
 thp_swpout                               0                     0
 thp_swpout_fallback                      0                     0
 16kB-mthp_swpout_fallback                0                     0                                        
 32kB-mthp_swpout_fallback                0                     0
 64kB-mthp_swpout_fallback            1,045                 1,308
 pgmajfault                           3,095                 3,424
 swap_ra                                136                   101
 swap_ra_hit                             86                    50
 ZSWPOUT-16kB                             2                     4
 ZSWPOUT-32kB                             0                     2
 ZSWPOUT-64kB                     3,057,161             3,056,927
 SWPOUT-16kB                              0                     0
 SWPOUT-32kB                              0                     0
 SWPOUT-64kB                              0                     0
 -----------------------------------------------------------------------------


 2M folios: usemem30/deflate-iaa:
 ================================

 -------------------------------------------------------------------------------
                     mm-unstable-10-29-2024             v2 of this patch-series
 -------------------------------------------------------------------------------
 zswap compressor               deflate-iaa        deflate-iaa      deflate-iaa      
 vm.compress-batching                   n/a                  0                1
 vm.page-cluster                          2                  2                2
 -------------------------------------------------------------------------------
 Total throughput (KB/s)          7,948,345          8,096,440        8,165,171 
 Avg throughput (KB/s)              264,944            269,881          272,172 
 elapsed time (sec)                   88.18              87.13            87.30 
 sys time (sec)                    2,067.56           2,018.08         2,046.79 
                                                                               
 -------------------------------------------------------------------------------
 memcg_high                          91,002             87,243           92,084
 memcg_swap_fail                         39                 56               54
 zswpout                         64,518,833         64,520,439       64,520,116
 zswpin                                 413                452              504
 pswpout                                  0                  0                0
 pswpin                                   0                  0                0
 thp_swpout                               0                  0                0
 thp_swpout_fallback                     39                 56               54
 2048kB-mthp_swpout_fallback             39                 56               54                                          
 pgmajfault                          10,946             15,737            9,645
 swap_ra                             23,456             36,495           19,247
 swap_ra_hit                         23,406             36,431           19,193
 ZSWPOUT-2048kB                     125,915            125,913          125,912
 SWPOUT-2048kB                            0                  0                0
 -------------------------------------------------------------------------------


 2M folios: usemem30/zstd:
 =========================

 -------------------------------------------------------------------------------
                     mm-unstable-10-29-2024        v2 of this patch-series
 -------------------------------------------------------------------------------
 zswap compressor                      zstd                  zstd
 vm.compress-batching                   n/a                     0 
 vm.page-cluster                          2                     2
 -----------------------------------------------------------------------------
 Total throughput (KB/s)          6,300,116             6,278,179 
 Avg throughput (KB/s)              210,003               209,272 
 elapsed time (sec)                  110.21                111.72 
 sys time (sec)                    2,504.45              2,542.59 
                                                                 
 -----------------------------------------------------------------------------
 memcg_high                          57,036                60,090
 memcg_swap_fail                         61                    50
 zswpout                         48,934,256            48,904,582
 zswpin                                 387                   380
 pswpout                                  0                     0
 pswpin                                   0                     0
 thp_swpout                               0                     0
 thp_swpout_fallback                     61                    50
 2048kB-mthp_swpout_fallback             61                    50
 pgmajfault                           3,713                 6,146
 swap_ra                              2,004                 8,133
 swap_ra_hit                          1,960                 8,088
 ZSWPOUT-2048kB                      95,511                95,460
 SWPOUT-2048kB                            0                     0
 -----------------------------------------------------------------------------


Performance testing (Kernel compilation, allmodconfig):
=======================================================

The kernel compilation experiments use "allmodconfig", built in tmpfs,
which takes ~12 minutes and generates considerable swapout activity. The
cgroup's memory.max is set to 2G.


 16k/32k/64k folios: Kernel compilation/allmodconfig/deflate-iaa:
 ================================================================

 -------------------------------------------------------------------------------
                     mm-unstable-10-29-2024             v2 of this patch-series
 -------------------------------------------------------------------------------
 zswap compressor               deflate-iaa        deflate-iaa      deflate-iaa      
 vm.compress-batching                   n/a                  0                1
 vm.page-cluster                          0                  0                0
 -------------------------------------------------------------------------------
 real_sec                            801.25            790.87            768.92
 user_sec                         15,776.31         15,755.97         15,753.89
 sys_sec                           4,250.34          3,877.02          3,892.17
 Max_Res_Set_Size_KB              1,869,428         1,873,376         1,871,600
                                                                                          
 -------------------------------------------------------------------------------
 memcg_high                               0                 0                 0
 memcg_swap_fail                          0                 0                 0
 zswpout                        106,798,327       105,469,307       104,528,841
 zswpin                          31,542,093        30,469,671        30,596,840
 pswpout                                774               290                80
 pswpin                                 370               288                59
 thp_swpout                               0                 0                 0
 thp_swpout_fallback                      0                 0                 0
 16kB-mthp_swpout_fallback                0                 0                 0                                          
 32kB-mthp_swpout_fallback                0                 0                 0
 64kB-mthp_swpout_fallback           16,340            12,633            12,000
 pgmajfault                      33,983,602        32,783,214        32,731,862
 swap_ra                                  0                 0                 0
 swap_ra_hit                          1,467             5,112             3,854
 ZSWPOUT-16kB                     1,475,121         1,435,571         1,426,738
 ZSWPOUT-32kB                       821,119           813,202           790,658
 ZSWPOUT-64kB                     3,483,295         3,490,244         3,435,056
 SWPOUT-16kB                              1                 0                 0
 SWPOUT-32kB                              3                 0                 0
 SWPOUT-64kB                             40                18                 4
 -------------------------------------------------------------------------------


 16k/32k/64k folios: Kernel compilation/allmodconfig/zstd:
 =========================================================

 -------------------------------------------------------------------------------
                   mm-unstable-10-29-2024       v2 of this patch-series
 -------------------------------------------------------------------------------
 zswap compressor                    zstd                   zstd
 vm.compress-batching                 n/a                      0
 vm.page-cluster                        0                      0
 -------------------------------------------------------------------------------
 real_sec                          812.38                 800.09  
 user_sec                       15,774.12              15,771.02
 sys_sec                         5,283.64               5,257.05
 Max_Res_Set_Size_KB            1,872,688              1,873,444
                                                                
 -------------------------------------------------------------------------------
 memcg_high                             0                      0
 memcg_swap_fail                        0                      0
 zswpout                       91,540,018             90,338,507
 zswpin                        26,421,271             26,485,837
 pswpout                               64                    144
 pswpin                                64                    114
 thp_swpout                             0                      0
 thp_swpout_fallback                    0                      0
 16kB-mthp_swpout_fallback              0                      0                         
 32kB-mthp_swpout_fallback              0                      0
 64kB-mthp_swpout_fallback          4,509                    566
 pgmajfault                    28,341,722             28,427,509
 swap_ra                                0                      0
 swap_ra_hit                        3,359                  2,931
 ZSWPOUT-16kB                   1,287,206              1,266,947
 ZSWPOUT-32kB                     707,746                700,270
 ZSWPOUT-64kB                   2,985,002              2,940,288
 SWPOUT-16kB                            0                      0
 SWPOUT-32kB                            0                      0
 SWPOUT-64kB                            4                      9
 -------------------------------------------------------------------------------


Summary:
========
The performance data from the usemem 30-process and kernel compilation
tests show throughput gains and elapsed/sys time reductions when
zswap_store() compresses large folios using IAA compress batching.

The iaa_crypto wq stats show almost the same number of compress calls for
wq.1 of each IAA device, while wq.0 handles decompress calls exclusively.
We see a 2.5% latency reduction by distributing compress jobs among all
IAA devices on the socket (based on v1 data).

We expect even more significant performance and throughput improvements if
we use the parallelism offered by IAA to batch-compress the pages of a
batch of folios of any order (including 4K), not just the pages within a
single large folio. This is the reclaim batching patch (patch 13 in v1),
which will be submitted in a separate patch-series.

Our internal validation of IAA compress/decompress batching in highly
contended Sapphire Rapids server setups with workloads running on 72 cores
for ~25 minutes under stringent memory limit constraints has shown up to
50% reduction in sys time and 3.5% reduction in workload run time as
compared to software compressors.


Changes since v2:
=================
1) Rebased to mm-unstable as of 11-5-2024, commit 7994b7ea6ac8.
2) Fixed an issue in zswap_create_acomp_ctx() with checking for NULL
   returns from kmalloc_node() for acomp_ctx->buffers and
   acomp_ctx->reqs.
3) Fixed a bug in zswap_pool_can_batch(): it now returns true only if
   pool->can_batch_comp equals BATCH_COMP_ENABLED and the per-cpu
   acomp_batch_ctx has batching resources allocated on this cpu. Also
   changed per_cpu_ptr() to raw_cpu_ptr().
4) Incorporated the zswap_store_propagate_errors() compilation warning fix
   suggested by Dan Carpenter. Thanks Dan!
5) Replaced the references to SWAP_CRYPTO_SUB_BATCH_SIZE in comments in
   zswap.h, with SWAP_CRYPTO_BATCH_SIZE.

Changes since v1:
=================
1) Rebased to mm-unstable as of 11-1-2024, commit 5c4cf96cd702.
2) Incorporated Herbert's suggestions to use an acomp_req flag to indicate
   async/poll mode, and to encapsulate the polling functionality in the
   iaa_crypto driver. Thanks Herbert!
3) Incorporated Herbert's and Yosry's suggestions to implement the batching
   API in iaa_crypto and to make its use seamless from zswap's
   perspective. Thanks Herbert and Yosry!
4) Incorporated Yosry's suggestion to make it more convenient for the user
   to enable compress batching, while minimizing the memory footprint
   cost. Thanks Yosry!
5) Incorporated Yosry's suggestion to de-couple the shrink_folio_list()
   reclaim batching patch from this series, since it requires a broader
   discussion.


Requesting the maintainers & reviewers to kindly review v3 of this
patch-series instead of v2.

I would greatly appreciate code review comments for the iaa_crypto driver
and mm patches included in this series!

Thanks,
Kanchana



Kanchana P Sridhar (13):
  crypto: acomp - Define two new interfaces for compress/decompress
    batching.
  crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable
    async mode.
  crypto: iaa - Implement compress/decompress batching API in
    iaa_crypto.
  crypto: iaa - Make async mode the default.
  crypto: iaa - Disable iaa_verify_compress by default.
  crypto: iaa - Change cpu-to-iaa mappings to evenly balance cores to
    IAAs.
  crypto: iaa - Distribute compress jobs to all IAA devices on a NUMA
    node.
  mm: zswap: acomp_ctx mutex lock/unlock optimizations.
  mm: zswap: Modify struct crypto_acomp_ctx to be configurable in nr of
    acomp_reqs.
  mm: zswap: Add a per-cpu "acomp_batch_ctx" to struct zswap_pool.
  mm: zswap: Allocate acomp_batch_ctx resources for a given zswap_pool.
  mm: Add sysctl vm.compress-batching switch for compress batching
    during swapout.
  mm: zswap: Compress batching with Intel IAA in zswap_store() of large
    folios.

 crypto/acompress.c                         |   2 +
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 717 +++++++++++++++--
 include/crypto/acompress.h                 |  87 +++
 include/crypto/internal/acompress.h        |  16 +
 include/linux/mm.h                         |   2 +
 include/linux/zswap.h                      |  91 +++
 kernel/sysctl.c                            |   9 +
 mm/swap.c                                  |   6 +
 mm/zswap.c                                 | 865 +++++++++++++++++++--
 9 files changed, 1701 insertions(+), 94 deletions(-)


base-commit: 7994b7ea6ac880efd0c38fedfbffd5ab8b1b7b2b
-- 
2.27.0




* [PATCH v3 01/13] crypto: acomp - Define two new interfaces for compress/decompress batching.
  2024-11-06 19:20 [PATCH v3 00/13] zswap IAA compress batching Kanchana P Sridhar
@ 2024-11-06 19:20 ` Kanchana P Sridhar
  2024-11-06 19:20 ` [PATCH v3 02/13] crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable async mode Kanchana P Sridhar
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Kanchana P Sridhar @ 2024-11-06 19:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This commit adds batch_compress() and batch_decompress() interfaces to:

  struct acomp_alg
  struct crypto_acomp

This allows the iaa_crypto Intel IAA driver to register implementations of
the batch_compress() and batch_decompress() API, which can subsequently be
invoked from the kernel zswap/zram swap modules to compress/decompress
up to CRYPTO_BATCH_SIZE (i.e. 8) pages in parallel in the IAA hardware
accelerator to improve swapout/swapin performance.
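
For instance, a driver that supports batching would wire its callbacks
into its acomp_alg roughly as follows (a sketch with hypothetical callback
names; the actual iaa_crypto registration is added in a subsequent patch
in this series):

  static struct acomp_alg example_acomp_alg = {
          .compress               = example_compress,
          .decompress             = example_decompress,
          .batch_compress         = example_batch_compress,
          .batch_decompress       = example_batch_decompress,
          /* .init, .exit, .base, etc. as usual */
  };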

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 crypto/acompress.c                  |  2 +
 include/crypto/acompress.h          | 82 +++++++++++++++++++++++++++++
 include/crypto/internal/acompress.h | 16 ++++++
 3 files changed, 100 insertions(+)

diff --git a/crypto/acompress.c b/crypto/acompress.c
index 6fdf0ff9f3c0..a506db499a37 100644
--- a/crypto/acompress.c
+++ b/crypto/acompress.c
@@ -71,6 +71,8 @@ static int crypto_acomp_init_tfm(struct crypto_tfm *tfm)
 
 	acomp->compress = alg->compress;
 	acomp->decompress = alg->decompress;
+	acomp->batch_compress = alg->batch_compress;
+	acomp->batch_decompress = alg->batch_decompress;
 	acomp->dst_free = alg->dst_free;
 	acomp->reqsize = alg->reqsize;
 
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index 54937b615239..ab0d9987bde1 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -37,12 +37,20 @@ struct acomp_req {
 	void *__ctx[] CRYPTO_MINALIGN_ATTR;
 };
 
+/*
+ * The max compress/decompress batch size, for crypto algorithms
+ * that support batch_compress and batch_decompress API.
+ */
+#define CRYPTO_BATCH_SIZE 8UL
+
 /**
  * struct crypto_acomp - user-instantiated objects which encapsulate
  * algorithms and core processing logic
  *
  * @compress:		Function performs a compress operation
  * @decompress:		Function performs a de-compress operation
+ * @batch_compress:	Function performs a batch compress operation
+ * @batch_decompress:	Function performs a batch decompress operation
  * @dst_free:		Frees destination buffer if allocated inside the
  *			algorithm
  * @reqsize:		Context size for (de)compression requests
@@ -51,6 +59,20 @@ struct acomp_req {
 struct crypto_acomp {
 	int (*compress)(struct acomp_req *req);
 	int (*decompress)(struct acomp_req *req);
+	void (*batch_compress)(struct acomp_req *reqs[],
+			       struct crypto_wait *wait,
+			       struct page *pages[],
+			       u8 *dsts[],
+			       unsigned int dlens[],
+			       int errors[],
+			       int nr_pages);
+	void (*batch_decompress)(struct acomp_req *reqs[],
+				 struct crypto_wait *wait,
+				 u8 *srcs[],
+				 struct page *pages[],
+				 unsigned int slens[],
+				 int errors[],
+				 int nr_pages);
 	void (*dst_free)(struct scatterlist *dst);
 	unsigned int reqsize;
 	struct crypto_tfm base;
@@ -265,4 +287,64 @@ static inline int crypto_acomp_decompress(struct acomp_req *req)
 	return crypto_acomp_reqtfm(req)->decompress(req);
 }
 
+/**
+ * crypto_acomp_batch_compress() -- compress a batch of requests
+ *
+ * Function invokes the batch compress operation
+ *
+ * @reqs: @nr_pages asynchronous compress requests.
+ * @wait: crypto_wait for synchronous acomp batch compress. If NULL, the
+ *        driver must provide a way to process completions asynchronously.
+ * @pages: Pages to be compressed.
+ * @dsts: Pre-allocated destination buffers to store results of compression.
+ * @dlens: Will contain the compressed lengths.
+ * @errors: zero on successful compression of the corresponding
+ *          req, or error code in case of error.
+ * @nr_pages: The number of pages, up to CRYPTO_BATCH_SIZE,
+ *            to be compressed.
+ */
+static inline void crypto_acomp_batch_compress(struct acomp_req *reqs[],
+					       struct crypto_wait *wait,
+					       struct page *pages[],
+					       u8 *dsts[],
+					       unsigned int dlens[],
+					       int errors[],
+					       int nr_pages)
+{
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(reqs[0]);
+
+	return tfm->batch_compress(reqs, wait, pages, dsts,
+				   dlens, errors, nr_pages);
+}
+
+/**
+ * crypto_acomp_batch_decompress() -- decompress a batch of requests
+ *
+ * Function invokes the batch decompress operation
+ *
+ * @reqs: @nr_pages asynchronous decompress requests.
+ * @wait: crypto_wait for synchronous acomp batch decompress. If NULL, the
+ *        driver must provide a way to process completions asynchronously.
+ * @srcs: The src buffers to be decompressed.
+ * @pages: The pages to store the decompressed buffers.
+ * @slens: Compressed lengths of @srcs.
+ * @errors: zero on successful decompression of the corresponding
+ *          req, or error code in case of error.
+ * @nr_pages: The number of pages, up to CRYPTO_BATCH_SIZE,
+ *            to be decompressed.
+ */
+static inline void crypto_acomp_batch_decompress(struct acomp_req *reqs[],
+						 struct crypto_wait *wait,
+						 u8 *srcs[],
+						 struct page *pages[],
+						 unsigned int slens[],
+						 int errors[],
+						 int nr_pages)
+{
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(reqs[0]);
+
+	return tfm->batch_decompress(reqs, wait, srcs, pages,
+				     slens, errors, nr_pages);
+}
+
 #endif
diff --git a/include/crypto/internal/acompress.h b/include/crypto/internal/acompress.h
index 8831edaafc05..acfe2d9d5a83 100644
--- a/include/crypto/internal/acompress.h
+++ b/include/crypto/internal/acompress.h
@@ -17,6 +17,8 @@
  *
  * @compress:	Function performs a compress operation
  * @decompress:	Function performs a de-compress operation
+ * @batch_compress:	Function performs a batch compress operation
+ * @batch_decompress:	Function performs a batch decompress operation
  * @dst_free:	Frees destination buffer if allocated inside the algorithm
  * @init:	Initialize the cryptographic transformation object.
  *		This function is used to initialize the cryptographic
@@ -37,6 +39,20 @@
 struct acomp_alg {
 	int (*compress)(struct acomp_req *req);
 	int (*decompress)(struct acomp_req *req);
+	void (*batch_compress)(struct acomp_req *reqs[],
+			       struct crypto_wait *wait,
+			       struct page *pages[],
+			       u8 *dsts[],
+			       unsigned int dlens[],
+			       int errors[],
+			       int nr_pages);
+	void (*batch_decompress)(struct acomp_req *reqs[],
+				 struct crypto_wait *wait,
+				 u8 *srcs[],
+				 struct page *pages[],
+				 unsigned int slens[],
+				 int errors[],
+				 int nr_pages);
 	void (*dst_free)(struct scatterlist *dst);
 	int (*init)(struct crypto_acomp *tfm);
 	void (*exit)(struct crypto_acomp *tfm);
-- 
2.27.0




* [PATCH v3 02/13] crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable async mode.
  2024-11-06 19:20 [PATCH v3 00/13] zswap IAA compress batching Kanchana P Sridhar
  2024-11-06 19:20 ` [PATCH v3 01/13] crypto: acomp - Define two new interfaces for compress/decompress batching Kanchana P Sridhar
@ 2024-11-06 19:20 ` Kanchana P Sridhar
  2024-11-06 19:20 ` [PATCH v3 03/13] crypto: iaa - Implement compress/decompress batching API in iaa_crypto Kanchana P Sridhar
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Kanchana P Sridhar @ 2024-11-06 19:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Even if the iaa_crypto driver has async_mode set to true and use_irq set to
false, a request can still be forced to be processed synchronously by
clearing the CRYPTO_ACOMP_REQ_POLL flag in req->flags.

All three of the following need to be true for a request to be processed in
fully async poll mode:

 1) async_mode should be "true"
 2) use_irq should be "false"
 3) req->flags & CRYPTO_ACOMP_REQ_POLL should be "true"

Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 11 ++++++++++-
 include/crypto/acompress.h                 |  5 +++++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 237f87000070..2edaecd42cc6 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1510,6 +1510,10 @@ static int iaa_comp_acompress(struct acomp_req *req)
 		return -EINVAL;
 	}
 
+	/* If the caller has requested no polling, disable async. */
+	if (!(req->flags & CRYPTO_ACOMP_REQ_POLL))
+		disable_async = true;
+
 	cpu = get_cpu();
 	wq = wq_table_next_wq(cpu);
 	put_cpu();
@@ -1702,6 +1706,7 @@ static int iaa_comp_adecompress(struct acomp_req *req)
 {
 	struct crypto_tfm *tfm = req->base.tfm;
 	dma_addr_t src_addr, dst_addr;
+	bool disable_async = false;
 	int nr_sgs, cpu, ret = 0;
 	struct iaa_wq *iaa_wq;
 	struct device *dev;
@@ -1717,6 +1722,10 @@ static int iaa_comp_adecompress(struct acomp_req *req)
 		return -EINVAL;
 	}
 
+	/* If the caller has requested no polling, disable async. */
+	if (!(req->flags & CRYPTO_ACOMP_REQ_POLL))
+		disable_async = true;
+
 	if (!req->dst)
 		return iaa_comp_adecompress_alloc_dest(req);
 
@@ -1765,7 +1774,7 @@ static int iaa_comp_adecompress(struct acomp_req *req)
 		req->dst, req->dlen, sg_dma_len(req->dst));
 
 	ret = iaa_decompress(tfm, req, wq, src_addr, req->slen,
-			     dst_addr, &req->dlen, false);
+			     dst_addr, &req->dlen, disable_async);
 	if (ret == -EINPROGRESS)
 		return ret;
 
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index ab0d9987bde1..5973f5f67954 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -14,6 +14,11 @@
 #include <linux/crypto.h>
 
 #define CRYPTO_ACOMP_ALLOC_OUTPUT	0x00000001
+/*
+ * If set, the driver must have a way to submit the req, then
+ * poll its completion status for success/error.
+ */
+#define CRYPTO_ACOMP_REQ_POLL		0x00000002
 #define CRYPTO_ACOMP_DST_MAX		131072
 
 /**
-- 
2.27.0




* [PATCH v3 03/13] crypto: iaa - Implement compress/decompress batching API in iaa_crypto.
  2024-11-06 19:20 [PATCH v3 00/13] zswap IAA compress batching Kanchana P Sridhar
  2024-11-06 19:20 ` [PATCH v3 01/13] crypto: acomp - Define two new interfaces for compress/decompress batching Kanchana P Sridhar
  2024-11-06 19:20 ` [PATCH v3 02/13] crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable async mode Kanchana P Sridhar
@ 2024-11-06 19:20 ` Kanchana P Sridhar
  2024-11-06 19:20 ` [PATCH v3 04/13] crypto: iaa - Make async mode the default Kanchana P Sridhar
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Kanchana P Sridhar @ 2024-11-06 19:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch provides iaa_crypto driver implementations for the newly added
crypto_acomp batch_compress() and batch_decompress() interfaces.

This allows swap modules such as zswap/zram to invoke batch parallel
compression/decompression of pages on systems with Intel IAA by invoking
these APIs, respectively:

 crypto_acomp_batch_compress(...);
 crypto_acomp_batch_decompress(...);

This enables zswap_store() compress batching code to be developed in a
manner similar to the current single-page synchronous calls to:

 crypto_acomp_compress(...);
 crypto_acomp_decompress(...);

thereby facilitating an encapsulated and modular hand-off between the
kernel zswap/zram code and the crypto_acomp layer.
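
At a high level, the batch implementations below follow a
"submit-everything, then poll" pattern in the async-poll case. A condensed
sketch of that control flow (scatterlist setup, the synchronous fallback
and stats updates elided):

  for (i = 0; i < nr_pages; ++i) {
          /* set up reqs[i] src/dst, then submit without waiting */
          errors[i] = iaa_comp_acompress(reqs[i]);
          errors[i] = (errors[i] == -EINPROGRESS) ? -EAGAIN : -EINVAL;
  }

  do {
          all_done = true;
          for (i = 0; i < nr_pages; ++i) {
                  if (errors[i] != -EAGAIN)
                          continue;               /* finished or failed */
                  errors[i] = iaa_comp_poll(reqs[i]);
                  if (errors[i] == -EAGAIN)
                          all_done = false;       /* still in flight */
                  else if (!errors[i])
                          dlens[i] = reqs[i]->dlen;
          }
  } while (!all_done);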

Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 313 +++++++++++++++++++++
 1 file changed, 313 insertions(+)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 2edaecd42cc6..3ac3a37fd2e6 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1797,6 +1797,317 @@ static void compression_ctx_init(struct iaa_compression_ctx *ctx)
 	ctx->use_irq = use_irq;
 }
 
+static int iaa_comp_poll(struct acomp_req *req)
+{
+	struct idxd_desc *idxd_desc;
+	struct idxd_device *idxd;
+	struct iaa_wq *iaa_wq;
+	struct pci_dev *pdev;
+	struct device *dev;
+	struct idxd_wq *wq;
+	bool compress_op;
+	int ret;
+
+	idxd_desc = req->base.data;
+	if (!idxd_desc)
+		return -EAGAIN;
+
+	compress_op = (idxd_desc->iax_hw->opcode == IAX_OPCODE_COMPRESS);
+	wq = idxd_desc->wq;
+	iaa_wq = idxd_wq_get_private(wq);
+	idxd = iaa_wq->iaa_device->idxd;
+	pdev = idxd->pdev;
+	dev = &pdev->dev;
+
+	ret = check_completion(dev, idxd_desc->iax_completion, true, true);
+	if (ret == -EAGAIN)
+		return ret;
+	if (ret)
+		goto out;
+
+	req->dlen = idxd_desc->iax_completion->output_size;
+
+	/* Update stats */
+	if (compress_op) {
+		update_total_comp_bytes_out(req->dlen);
+		update_wq_comp_bytes(wq, req->dlen);
+	} else {
+		update_total_decomp_bytes_in(req->slen);
+		update_wq_decomp_bytes(wq, req->slen);
+	}
+
+	if (iaa_verify_compress && (idxd_desc->iax_hw->opcode == IAX_OPCODE_COMPRESS)) {
+		struct crypto_tfm *tfm = req->base.tfm;
+		dma_addr_t src_addr, dst_addr;
+		u32 compression_crc;
+
+		compression_crc = idxd_desc->iax_completion->crc;
+
+		dma_sync_sg_for_device(dev, req->dst, 1, DMA_FROM_DEVICE);
+		dma_sync_sg_for_device(dev, req->src, 1, DMA_TO_DEVICE);
+
+		src_addr = sg_dma_address(req->src);
+		dst_addr = sg_dma_address(req->dst);
+
+		ret = iaa_compress_verify(tfm, req, wq, src_addr, req->slen,
+					  dst_addr, &req->dlen, compression_crc);
+	}
+out:
+	/* caller doesn't call crypto_wait_req, so no acomp_request_complete() */
+
+	dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
+	dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
+
+	idxd_free_desc(idxd_desc->wq, idxd_desc);
+
+	dev_dbg(dev, "%s: returning ret=%d\n", __func__, ret);
+
+	return ret;
+}
+
+static void iaa_set_req_poll(
+	struct acomp_req *reqs[],
+	int nr_reqs,
+	bool set_flag)
+{
+	int i;
+
+	for (i = 0; i < nr_reqs; ++i) {
+		set_flag ? (reqs[i]->flags |= CRYPTO_ACOMP_REQ_POLL) :
+			   (reqs[i]->flags &= ~CRYPTO_ACOMP_REQ_POLL);
+	}
+}
+
+/**
+ * This API provides IAA compress batching functionality for use by swap
+ * modules.
+ *
+ * @reqs: @nr_pages asynchronous compress requests.
+ * @wait: crypto_wait for synchronous acomp batch compress. If NULL, the
+ *        completions will be processed asynchronously.
+ * @pages: Pages to be compressed by IAA in parallel.
+ * @dsts: Pre-allocated destination buffers to store results of IAA
+ *        compression. Each element of @dsts must be of size "PAGE_SIZE * 2".
+ * @dlens: Will contain the compressed lengths.
+ * @errors: zero on successful compression of the corresponding
+ *          req, or error code in case of error.
+ * @nr_pages: The number of pages, up to CRYPTO_BATCH_SIZE,
+ *            to be compressed.
+ */
+static void iaa_comp_acompress_batch(
+	struct acomp_req *reqs[],
+	struct crypto_wait *wait,
+	struct page *pages[],
+	u8 *dsts[],
+	unsigned int dlens[],
+	int errors[],
+	int nr_pages)
+{
+	struct scatterlist inputs[CRYPTO_BATCH_SIZE];
+	struct scatterlist outputs[CRYPTO_BATCH_SIZE];
+	bool compressions_done = false;
+	bool poll = (async_mode && !use_irq);
+	int i;
+
+	BUG_ON(nr_pages > CRYPTO_BATCH_SIZE);
+	BUG_ON(!poll && !wait);
+
+	if (poll)
+		iaa_set_req_poll(reqs, nr_pages, true);
+	else
+		iaa_set_req_poll(reqs, nr_pages, false);
+
+	/*
+	 * Prepare and submit acomp_reqs to IAA. IAA will process these
+	 * compress jobs in parallel if async-poll mode is enabled.
+	 * If IAA is used in sync mode, the jobs will be processed sequentially
+	 * using "wait".
+	 */
+	for (i = 0; i < nr_pages; ++i) {
+		sg_init_table(&inputs[i], 1);
+		sg_set_page(&inputs[i], pages[i], PAGE_SIZE, 0);
+
+		/*
+		 * Each dst buffer should be of size (PAGE_SIZE * 2).
+		 * Reflect same in sg_list.
+		 */
+		sg_init_one(&outputs[i], dsts[i], PAGE_SIZE * 2);
+		acomp_request_set_params(reqs[i], &inputs[i],
+					 &outputs[i], PAGE_SIZE, dlens[i]);
+
+		/*
+		 * If poll is in effect, submit the request now, and poll for
+		 * a completion status later, after all descriptors have been
+		 * submitted. If polling is not enabled, submit the request
+		 * and wait for it to complete, i.e., synchronously, before
+		 * moving on to the next request.
+		 */
+		if (poll) {
+			errors[i] = iaa_comp_acompress(reqs[i]);
+
+			if (errors[i] != -EINPROGRESS)
+				errors[i] = -EINVAL;
+			else
+				errors[i] = -EAGAIN;
+		} else {
+			acomp_request_set_callback(reqs[i],
+						   CRYPTO_TFM_REQ_MAY_BACKLOG,
+						   crypto_req_done, wait);
+			errors[i] = crypto_wait_req(iaa_comp_acompress(reqs[i]),
+						    wait);
+			if (!errors[i])
+				dlens[i] = reqs[i]->dlen;
+		}
+	}
+
+	/*
+	 * If not doing async compressions, the batch has been processed at
+	 * this point and we can return.
+	 */
+	if (!poll)
+		return;
+
+	/*
+	 * Poll for and process IAA compress job completions
+	 * in out-of-order manner.
+	 */
+	while (!compressions_done) {
+		compressions_done = true;
+
+		for (i = 0; i < nr_pages; ++i) {
+			/*
+			 * Skip, if the compression has already completed
+			 * successfully or with an error.
+			 */
+			if (errors[i] != -EAGAIN)
+				continue;
+
+			errors[i] = iaa_comp_poll(reqs[i]);
+
+			if (errors[i]) {
+				if (errors[i] == -EAGAIN)
+					compressions_done = false;
+			} else {
+				dlens[i] = reqs[i]->dlen;
+			}
+		}
+	}
+}
+
+/**
+ * This API provides IAA decompress batching functionality for use by swap
+ * modules.
+ *
+ * @reqs: @nr_pages asynchronous decompress requests.
+ * @wait: crypto_wait for synchronous acomp batch decompress. If NULL, the
+ *        driver must provide a way to process completions asynchronously.
+ * @srcs: The src buffers to be decompressed by IAA in parallel.
+ * @pages: The pages to store the decompressed buffers.
+ * @slens: Compressed lengths of @srcs.
+ * @errors: zero on successful decompression of the corresponding
+ *          req, or error code in case of error.
+ * @nr_pages: The number of pages, up to CRYPTO_BATCH_SIZE,
+ *            to be decompressed.
+ */
+static void iaa_comp_adecompress_batch(
+	struct acomp_req *reqs[],
+	struct crypto_wait *wait,
+	u8 *srcs[],
+	struct page *pages[],
+	unsigned int slens[],
+	int errors[],
+	int nr_pages)
+{
+	struct scatterlist inputs[CRYPTO_BATCH_SIZE];
+	struct scatterlist outputs[CRYPTO_BATCH_SIZE];
+	unsigned int dlens[CRYPTO_BATCH_SIZE];
+	bool decompressions_done = false;
+	bool poll = (async_mode && !use_irq);
+	int i;
+
+	BUG_ON(nr_pages > CRYPTO_BATCH_SIZE);
+	BUG_ON(!poll && !wait);
+
+	if (poll)
+		iaa_set_req_poll(reqs, nr_pages, true);
+	else
+		iaa_set_req_poll(reqs, nr_pages, false);
+
+	/*
+	 * Prepare and submit acomp_reqs to IAA. IAA will process these
+	 * decompress jobs in parallel if async-poll mode is enabled.
+	 * If IAA is used in sync mode, the jobs will be processed sequentially
+	 * using "wait".
+	 */
+	for (i = 0; i < nr_pages; ++i) {
+		dlens[i] = PAGE_SIZE;
+		sg_init_one(&inputs[i], srcs[i], slens[i]);
+		sg_init_table(&outputs[i], 1);
+		sg_set_page(&outputs[i], pages[i], PAGE_SIZE, 0);
+		acomp_request_set_params(reqs[i], &inputs[i],
+					&outputs[i], slens[i], dlens[i]);
+		/*
+		 * If poll is in effect, submit the request now, and poll for
+		 * a completion status later, after all descriptors have been
+		 * submitted. If polling is not enabled, submit the request
+		 * and wait for it to complete, i.e., synchronously, before
+		 * moving on to the next request.
+		 */
+		if (poll) {
+			errors[i] = iaa_comp_adecompress(reqs[i]);
+
+			if (errors[i] != -EINPROGRESS)
+				errors[i] = -EINVAL;
+			else
+				errors[i] = -EAGAIN;
+		} else {
+			acomp_request_set_callback(reqs[i],
+						   CRYPTO_TFM_REQ_MAY_BACKLOG,
+						   crypto_req_done, wait);
+			errors[i] = crypto_wait_req(iaa_comp_adecompress(reqs[i]),
+						    wait);
+			if (!errors[i]) {
+				dlens[i] = reqs[i]->dlen;
+				BUG_ON(dlens[i] != PAGE_SIZE);
+			}
+		}
+	}
+
+	/*
+	 * If not doing async decompressions, the batch has been processed at
+	 * this point and we can return.
+	 */
+	if (!poll)
+		return;
+
+	/*
+	 * Poll for and process IAA decompress job completions
+	 * in out-of-order manner.
+	 */
+	while (!decompressions_done) {
+		decompressions_done = true;
+
+		for (i = 0; i < nr_pages; ++i) {
+			/*
+			 * Skip, if the decompression has already completed
+			 * successfully or with an error.
+			 */
+			if (errors[i] != -EAGAIN)
+				continue;
+
+			errors[i] = iaa_comp_poll(reqs[i]);
+
+			if (errors[i]) {
+				if (errors[i] == -EAGAIN)
+					decompressions_done = false;
+			} else {
+				dlens[i] = reqs[i]->dlen;
+				BUG_ON(dlens[i] != PAGE_SIZE);
+			}
+		}
+	}
+}
+
 static int iaa_comp_init_fixed(struct crypto_acomp *acomp_tfm)
 {
 	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
@@ -1822,6 +2133,8 @@ static struct acomp_alg iaa_acomp_fixed_deflate = {
 	.compress		= iaa_comp_acompress,
 	.decompress		= iaa_comp_adecompress,
 	.dst_free               = dst_free,
+	.batch_compress		= iaa_comp_acompress_batch,
+	.batch_decompress	= iaa_comp_adecompress_batch,
 	.base			= {
 		.cra_name		= "deflate",
 		.cra_driver_name	= "deflate-iaa",
-- 
2.27.0




* [PATCH v3 04/13] crypto: iaa - Make async mode the default.
  2024-11-06 19:20 [PATCH v3 00/13] zswap IAA compress batching Kanchana P Sridhar
                   ` (2 preceding siblings ...)
  2024-11-06 19:20 ` [PATCH v3 03/13] crypto: iaa - Implement compress/decompress batching API in iaa_crypto Kanchana P Sridhar
@ 2024-11-06 19:20 ` Kanchana P Sridhar
  2024-11-06 19:20 ` [PATCH v3 05/13] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Kanchana P Sridhar @ 2024-11-06 19:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch makes the iaa_crypto driver load by default in the most
efficient/recommended "async" mode for parallel compressions and
decompressions, namely asynchronous submission of descriptors, followed by
polling for job completions. Earlier, the "sync" mode used to be the
default.

This way, anyone that wants to use IAA can do so after building the kernel,
and without having to go through these steps to use async poll:

  1) disable all the IAA device/wq bindings that happen at boot time
  2) rmmod iaa_crypto
  3) modprobe iaa_crypto
  4) echo async > /sys/bus/dsa/drivers/crypto/sync_mode
  5) re-run initialization of the IAA devices and wqs

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 3ac3a37fd2e6..13f9d22811ff 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -153,7 +153,7 @@ static DRIVER_ATTR_RW(verify_compress);
  */
 
 /* Use async mode */
-static bool async_mode;
+static bool async_mode = true;
 /* Use interrupts */
 static bool use_irq;
 
-- 
2.27.0




* [PATCH v3 05/13] crypto: iaa - Disable iaa_verify_compress by default.
  2024-11-06 19:20 [PATCH v3 00/13] zswap IAA compress batching Kanchana P Sridhar
                   ` (3 preceding siblings ...)
  2024-11-06 19:20 ` [PATCH v3 04/13] crypto: iaa - Make async mode the default Kanchana P Sridhar
@ 2024-11-06 19:20 ` Kanchana P Sridhar
  2024-11-06 19:20 ` [PATCH v3 06/13] crypto: iaa - Change cpu-to-iaa mappings to evenly balance cores to IAAs Kanchana P Sridhar
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Kanchana P Sridhar @ 2024-11-06 19:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch makes the iaa_crypto driver load by default with
"iaa_verify_compress" disabled, to facilitate performance comparisons with
software compressors (which also do not run compress verification by
default). Earlier, iaa_crypto compress verification used to be enabled by
default.

With this patch, if users want to enable compress verification, they can do
so with these steps:

  1) disable all the IAA device/wq bindings that happen at boot time
  2) rmmod iaa_crypto
  3) modprobe iaa_crypto
  4) echo 1 > /sys/bus/dsa/drivers/crypto/verify_compress
  5) re-run initialization of the IAA devices and wqs

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 13f9d22811ff..c4b143dd1ddd 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -94,7 +94,7 @@ static bool iaa_crypto_enabled;
 static bool iaa_crypto_registered;
 
 /* Verify results of IAA compress or not */
-static bool iaa_verify_compress = true;
+static bool iaa_verify_compress = false;
 
 static ssize_t verify_compress_show(struct device_driver *driver, char *buf)
 {
-- 
2.27.0




* [PATCH v3 06/13] crypto: iaa - Change cpu-to-iaa mappings to evenly balance cores to IAAs.
  2024-11-06 19:20 [PATCH v3 00/13] zswap IAA compress batching Kanchana P Sridhar
                   ` (4 preceding siblings ...)
  2024-11-06 19:20 ` [PATCH v3 05/13] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
@ 2024-11-06 19:20 ` Kanchana P Sridhar
  2024-11-06 19:20 ` [PATCH v3 07/13] crypto: iaa - Distribute compress jobs to all IAA devices on a NUMA node Kanchana P Sridhar
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Kanchana P Sridhar @ 2024-11-06 19:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This change distributes the cpus more evenly among the IAAs in each socket.

 Old algorithm to assign cpus to IAA:
 ------------------------------------
 If "nr_cpus" = nr_logical_cpus (includes hyper-threading), the current
 algorithm determines "nr_cpus_per_node" = nr_cpus / nr_nodes.

 Hence, on a 2-socket Sapphire Rapids server where each socket has 56 cores
 and 4 IAA devices, nr_cpus_per_node = 112.

 Further, cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa
 Hence, cpus_per_iaa = 224/8 = 28.

 The iaa_crypto driver then assigns 28 "logical" node cpus per IAA device
 on that node, which results in this cpu-to-iaa mapping:

 lscpu|grep NUMA
 NUMA node(s):        2
 NUMA node0 CPU(s):   0-55,112-167
 NUMA node1 CPU(s):   56-111,168-223

 NUMA node 0:
 cpu   0-27    28-55  112-139  140-167
 iaa   iax1    iax3   iax5     iax7

 NUMA node 1:
 cpu   56-83  84-111  168-195   196-223
 iaa   iax9   iax11   iax13     iax15

 This appears non-optimal for a few reasons:

 1) The 2 logical threads on a core will get assigned to different IAA
    devices. For example:
      cpu 0:   iax1
      cpu 112: iax5
 2) One of the logical threads on a core is assigned to an IAA that is not
    closest to that core, e.g. cpu 112.
 3) If numactl is used to start processes sequentially on the logical
    cores, some of the IAA devices on the socket could be over-subscribed,
    while some could be under-utilized.

This patch introduces a scheme to more evenly balance the logical cores to
IAA devices on a socket.

 New algorithm to assign cpus to IAA:
 ------------------------------------
 We introduce a function "cpu_to_iaa()" that takes a logical cpu and
 returns the IAA device closest to it.

 If "nr_cpus" = nr_logical_cpus (includes hyper-threading), the new
 algorithm determines "nr_cpus_per_node" = topology_num_cores_per_package().

 Hence, on a 2-socket Sapphire Rapids server where each socket has 56 cores
 and 4 IAA devices, nr_cpus_per_node = 56.

 Further, cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa
 Hence, cpus_per_iaa = 112/8 = 14.

 The iaa_crypto driver then assigns 14 "logical" node cpus per IAA device
 on that node, which results in this cpu-to-iaa mapping:

 NUMA node 0:
 cpu   0-13,112-125   14-27,126-139  28-41,140-153  42-55,154-167
 iaa   iax1           iax3           iax5           iax7

 NUMA node 1:
 cpu   56-69,168-181  70-83,182-195  84-97,196-209   98-111,210-223
 iaa   iax9           iax11          iax13           iax15

 This resolves the 3 issues with non-optimality of cpu-to-iaa mappings
 pointed out earlier with the existing approach.

Originally-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 84 ++++++++++++++--------
 1 file changed, 54 insertions(+), 30 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index c4b143dd1ddd..a12a8f9caa84 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -55,6 +55,46 @@ static struct idxd_wq *wq_table_next_wq(int cpu)
 	return entry->wqs[entry->cur_wq];
 }
 
+/*
+ * Given a cpu, find the closest IAA instance.  The idea is to try to
+ * choose the most appropriate IAA instance for a caller and spread
+ * available workqueues around to clients.
+ */
+static inline int cpu_to_iaa(int cpu)
+{
+	int node, n_cpus = 0, test_cpu, iaa = 0;
+	int nr_iaa_per_node;
+	const struct cpumask *node_cpus;
+
+	if (!nr_nodes)
+		return 0;
+
+	nr_iaa_per_node = nr_iaa / nr_nodes;
+	if (!nr_iaa_per_node)
+		return 0;
+
+	for_each_online_node(node) {
+		node_cpus = cpumask_of_node(node);
+		if (!cpumask_test_cpu(cpu, node_cpus))
+			continue;
+
+		for_each_cpu(test_cpu, node_cpus) {
+			if ((n_cpus % nr_cpus_per_node) == 0)
+				iaa = node * nr_iaa_per_node;
+
+			if (test_cpu == cpu)
+				return iaa;
+
+			n_cpus++;
+
+			if ((n_cpus % cpus_per_iaa) == 0)
+				iaa++;
+		}
+	}
+
+	return -1;
+}
+
 static void wq_table_add(int cpu, struct idxd_wq *wq)
 {
 	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
@@ -895,8 +935,7 @@ static int wq_table_add_wqs(int iaa, int cpu)
  */
 static void rebalance_wq_table(void)
 {
-	const struct cpumask *node_cpus;
-	int node, cpu, iaa = -1;
+	int cpu, iaa;
 
 	if (nr_iaa == 0)
 		return;
@@ -906,37 +945,22 @@ static void rebalance_wq_table(void)
 
 	clear_wq_table();
 
-	if (nr_iaa == 1) {
-		for (cpu = 0; cpu < nr_cpus; cpu++) {
-			if (WARN_ON(wq_table_add_wqs(0, cpu))) {
-				pr_debug("could not add any wqs for iaa 0 to cpu %d!\n", cpu);
-				return;
-			}
-		}
-
-		return;
-	}
-
-	for_each_node_with_cpus(node) {
-		node_cpus = cpumask_of_node(node);
-
-		for (cpu = 0; cpu <  cpumask_weight(node_cpus); cpu++) {
-			int node_cpu = cpumask_nth(cpu, node_cpus);
-
-			if (WARN_ON(node_cpu >= nr_cpu_ids)) {
-				pr_debug("node_cpu %d doesn't exist!\n", node_cpu);
-				return;
-			}
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		iaa = cpu_to_iaa(cpu);
+		pr_debug("rebalance: cpu=%d iaa=%d\n", cpu, iaa);
 
-			if ((cpu % cpus_per_iaa) == 0)
-				iaa++;
+		if (WARN_ON(iaa == -1)) {
+			pr_debug("rebalance (cpu_to_iaa(%d)) failed!\n", cpu);
+			return;
+		}
 
-			if (WARN_ON(wq_table_add_wqs(iaa, node_cpu))) {
-				pr_debug("could not add any wqs for iaa %d to cpu %d!\n", iaa, cpu);
-				return;
-			}
+		if (WARN_ON(wq_table_add_wqs(iaa, cpu))) {
+			pr_debug("could not add any wqs for iaa %d to cpu %d!\n", iaa, cpu);
+			return;
 		}
 	}
+
+	pr_debug("Finished rebalance local wqs.");
 }
 
 static inline int check_completion(struct device *dev,
@@ -2332,7 +2356,7 @@ static int __init iaa_crypto_init_module(void)
 		pr_err("IAA couldn't find any nodes with cpus\n");
 		return -ENODEV;
 	}
-	nr_cpus_per_node = nr_cpus / nr_nodes;
+	nr_cpus_per_node = topology_num_cores_per_package();
 
 	if (crypto_has_comp("deflate-generic", 0, 0))
 		deflate_generic_tfm = crypto_alloc_comp("deflate-generic", 0, 0);
-- 
2.27.0



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v3 07/13] crypto: iaa - Distribute compress jobs to all IAA devices on a NUMA node.
  2024-11-06 19:20 [PATCH v3 00/13] zswap IAA compress batching Kanchana P Sridhar
                   ` (5 preceding siblings ...)
  2024-11-06 19:20 ` [PATCH v3 06/13] crypto: iaa - Change cpu-to-iaa mappings to evenly balance cores to IAAs Kanchana P Sridhar
@ 2024-11-06 19:20 ` Kanchana P Sridhar
  2024-11-06 19:21 ` [PATCH v3 08/13] mm: zswap: acomp_ctx mutex lock/unlock optimizations Kanchana P Sridhar
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Kanchana P Sridhar @ 2024-11-06 19:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This change enables processes running on any logical core on a NUMA node to
use all the IAA devices enabled on that NUMA node for compress jobs. In
other words, compressions originating from any process in a node will be
distributed in a round-robin manner to the available IAA devices on the
same socket. The main premise behind this change is to ensure that no
compress engines on any IAA device are left unutilized or under-utilized;
that is, the compress engines on all IAA devices are treated as a global
resource for that socket.

This allows the use of all IAA devices present in a given NUMA node for
(batched) compressions originating from zswap/zram, from all cores
on this node.

A new per-cpu "global_wq_table" implements this in the iaa_crypto driver.
We can think of the global WQ per IAA as a WQ to which all cores on
that socket can submit compress jobs.

To use this feature, the user must configure 2 WQs per IAA device in order
to enable distribution of compress jobs to multiple IAA devices.

Each IAA will have 2 WQs:
 wq.0 (local WQ):
   Used for decompress jobs from cores mapped by the cpu_to_iaa() "even
   balancing of logical cores to IAA devices" algorithm.

 wq.1 (global WQ):
   Used for compress jobs from *all* logical cores on that socket.

The iaa_crypto driver will place all global WQs from all same-socket IAA
devices in the per-cpu global_wq_table of every cpu on that socket. When
the driver receives a compress job, it will look up the "next" global WQ in
the cpu's global_wq_table and submit the descriptor to it.

The starting wq in the global_wq_table for each cpu is the global wq
associated with the IAA nearest to it, so that the starting global wq is
staggered across cpus. This results in very uniform usage of all IAAs
for compress jobs.

Two new driver module parameters are added for this feature:

g_wqs_per_iaa (default 1):

 /sys/bus/dsa/drivers/crypto/g_wqs_per_iaa

 This represents the number of global WQs that can be configured per IAA
 device. The default is 1, and is the recommended setting to enable the use
 of this feature once the user configures 2 WQs per IAA using higher level
 scripts as described in
 Documentation/driver-api/crypto/iaa/iaa-crypto.rst.

g_consec_descs_per_gwq (default 1):

 /sys/bus/dsa/drivers/crypto/g_consec_descs_per_gwq

 This represents the number of consecutive compress jobs that will be
 submitted to the same global WQ (i.e. to the same IAA device) from a given
 core, before moving to the next global WQ. The default is 1, which is also
 the recommended setting for using this feature.
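
 For example, with 4 IAA devices on a socket, each contributing 1 global WQ
 (g_wqs_per_iaa=1), setting g_consec_descs_per_gwq to 2 would make a given
 core submit 2 consecutive compress jobs to one IAA's global WQ before
 moving on to the next IAA's global WQ, cycling round-robin through all 4
 global WQs on the socket. With the default of 1, each successive compress
 job from a core moves on to the next IAA's global WQ.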

The decompress jobs from any core will be sent to the "local" IAA, namely
the one that the driver assigns with the cpu_to_iaa() mapping algorithm
that evenly balances the assignment of logical cores to IAA devices on a
NUMA node.

On a 2-socket Sapphire Rapids server where each socket has 56 cores and
4 IAA devices, this is how the compress/decompress jobs will be mapped
when the user configures 2 WQs per IAA device (which implies wq.1 will
be added to the global WQ table for each logical core on that NUMA node):

 lscpu|grep NUMA
 NUMA node(s):        2
 NUMA node0 CPU(s):   0-55,112-167
 NUMA node1 CPU(s):   56-111,168-223

 Compress jobs:
 --------------
 NUMA node 0:
 All cpus (0-55,112-167) can send compress jobs to all IAA devices on the
 socket (iax1/iax3/iax5/iax7) in round-robin manner:
 iaa   iax1           iax3           iax5           iax7

 NUMA node 1:
 All cpus (56-111,168-223) can send compress jobs to all IAA devices on the
 socket (iax9/iax11/iax13/iax15) in round-robin manner:
 iaa   iax9           iax11          iax13           iax15

 Decompress jobs:
 ----------------
 NUMA node 0:
 cpu   0-13,112-125   14-27,126-139  28-41,140-153  42-55,154-167
 iaa   iax1           iax3           iax5           iax7

 NUMA node 1:
 cpu   56-69,168-181  70-83,182-195  84-97,196-209   98-111,210-223
 iaa   iax9           iax11          iax13           iax15

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 305 ++++++++++++++++++++-
 1 file changed, 290 insertions(+), 15 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index a12a8f9caa84..ca0a71b8f31d 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -29,14 +29,23 @@ static unsigned int nr_iaa;
 static unsigned int nr_cpus;
 static unsigned int nr_nodes;
 static unsigned int nr_cpus_per_node;
-
 /* Number of physical cpus sharing each iaa instance */
 static unsigned int cpus_per_iaa;
 
 static struct crypto_comp *deflate_generic_tfm;
 
 /* Per-cpu lookup table for balanced wqs */
-static struct wq_table_entry __percpu *wq_table;
+static struct wq_table_entry __percpu *wq_table = NULL;
+
+/* Per-cpu lookup table for global wqs shared by all cpus. */
+static struct wq_table_entry __percpu *global_wq_table = NULL;
+
+/*
+ * Per-cpu counter of consecutive descriptors allocated to
+ * the same wq in the global_wq_table, so that we know
+ * when to switch to the next wq in the global_wq_table.
+ */
+static int __percpu *num_consec_descs_per_wq = NULL;
 
 static struct idxd_wq *wq_table_next_wq(int cpu)
 {
@@ -104,26 +113,68 @@ static void wq_table_add(int cpu, struct idxd_wq *wq)
 
 	entry->wqs[entry->n_wqs++] = wq;
 
-	pr_debug("%s: added iaa wq %d.%d to idx %d of cpu %d\n", __func__,
-		 entry->wqs[entry->n_wqs - 1]->idxd->id,
-		 entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
+	pr_debug("%s: added iaa local wq %d.%d to idx %d of cpu %d\n", __func__,
+		entry->wqs[entry->n_wqs - 1]->idxd->id,
+		entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
+}
+
+static void global_wq_table_add(int cpu, struct idxd_wq *wq)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(global_wq_table, cpu);
+
+	if (WARN_ON(entry->n_wqs == entry->max_wqs))
+		return;
+
+	entry->wqs[entry->n_wqs++] = wq;
+
+	pr_debug("%s: added iaa global wq %d.%d to idx %d of cpu %d\n", __func__,
+		entry->wqs[entry->n_wqs - 1]->idxd->id,
+		entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
+}
+
+static void global_wq_table_set_start_wq(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(global_wq_table, cpu);
+	int start_wq = (entry->n_wqs / nr_iaa) * cpu_to_iaa(cpu);
+
+	if ((start_wq >= 0) && (start_wq < entry->n_wqs))
+		entry->cur_wq = start_wq;
 }
 
 static void wq_table_free_entry(int cpu)
 {
 	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
 
-	kfree(entry->wqs);
-	memset(entry, 0, sizeof(*entry));
+	if (entry) {
+		kfree(entry->wqs);
+		memset(entry, 0, sizeof(*entry));
+	}
+
+	if (global_wq_table) {
+		entry = per_cpu_ptr(global_wq_table, cpu);
+
+		kfree(entry->wqs);
+		memset(entry, 0, sizeof(*entry));
+	}
 }
 
 static void wq_table_clear_entry(int cpu)
 {
 	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
 
-	entry->n_wqs = 0;
-	entry->cur_wq = 0;
-	memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
+	if (entry) {
+		entry->n_wqs = 0;
+		entry->cur_wq = 0;
+		memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
+	}
+
+	if (global_wq_table) {
+		entry = per_cpu_ptr(global_wq_table, cpu);
+
+		entry->n_wqs = 0;
+		entry->cur_wq = 0;
+		memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
+	}
 }
 
 LIST_HEAD(iaa_devices);
@@ -163,6 +214,70 @@ static ssize_t verify_compress_store(struct device_driver *driver,
 }
 static DRIVER_ATTR_RW(verify_compress);
 
+/* Number of global wqs per iaa */
+static int g_wqs_per_iaa = 1;
+
+static ssize_t g_wqs_per_iaa_show(struct device_driver *driver, char *buf)
+{
+	return sprintf(buf, "%d\n", g_wqs_per_iaa);
+}
+
+static ssize_t g_wqs_per_iaa_store(struct device_driver *driver,
+				     const char *buf, size_t count)
+{
+	int ret = -EBUSY;
+
+	mutex_lock(&iaa_devices_lock);
+
+	if (iaa_crypto_enabled)
+		goto out;
+
+	ret = kstrtoint(buf, 10, &g_wqs_per_iaa);
+	if (ret)
+		goto out;
+
+	ret = count;
+out:
+	mutex_unlock(&iaa_devices_lock);
+
+	return ret;
+}
+static DRIVER_ATTR_RW(g_wqs_per_iaa);
+
+/*
+ * Number of consecutive descriptors to allocate from a
+ * given global wq before switching to the next wq in
+ * the global_wq_table.
+ */
+static int g_consec_descs_per_gwq = 1;
+
+static ssize_t g_consec_descs_per_gwq_show(struct device_driver *driver, char *buf)
+{
+	return sprintf(buf, "%d\n", g_consec_descs_per_gwq);
+}
+
+static ssize_t g_consec_descs_per_gwq_store(struct device_driver *driver,
+				     const char *buf, size_t count)
+{
+	int ret = -EBUSY;
+
+	mutex_lock(&iaa_devices_lock);
+
+	if (iaa_crypto_enabled)
+		goto out;
+
+	ret = kstrtoint(buf, 10, &g_consec_descs_per_gwq);
+	if (ret)
+		goto out;
+
+	ret = count;
+out:
+	mutex_unlock(&iaa_devices_lock);
+
+	return ret;
+}
+static DRIVER_ATTR_RW(g_consec_descs_per_gwq);
+
 /*
  * The iaa crypto driver supports three 'sync' methods determining how
  * compressions and decompressions are performed:
@@ -751,7 +866,20 @@ static void free_wq_table(void)
 	for (cpu = 0; cpu < nr_cpus; cpu++)
 		wq_table_free_entry(cpu);
 
-	free_percpu(wq_table);
+	if (wq_table) {
+		free_percpu(wq_table);
+		wq_table = NULL;
+	}
+
+	if (global_wq_table) {
+		free_percpu(global_wq_table);
+		global_wq_table = NULL;
+	}
+
+	if (num_consec_descs_per_wq) {
+		free_percpu(num_consec_descs_per_wq);
+		num_consec_descs_per_wq = NULL;
+	}
 
 	pr_debug("freed wq table\n");
 }
@@ -774,6 +902,38 @@ static int alloc_wq_table(int max_wqs)
 		}
 
 		entry->max_wqs = max_wqs;
+		entry->n_wqs = 0;
+		entry->cur_wq = 0;
+	}
+
+	global_wq_table = alloc_percpu(struct wq_table_entry);
+	if (!global_wq_table) {
+		free_wq_table();
+		return -ENOMEM;
+	}
+
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		entry = per_cpu_ptr(global_wq_table, cpu);
+		entry->wqs = kzalloc(max_wqs * sizeof(struct wq *), GFP_KERNEL);
+		if (!entry->wqs) {
+			free_wq_table();
+			return -ENOMEM;
+		}
+
+		entry->max_wqs = max_wqs;
+		entry->n_wqs = 0;
+		entry->cur_wq = 0;
+	}
+
+	num_consec_descs_per_wq = alloc_percpu(int);
+	if (!num_consec_descs_per_wq) {
+		free_wq_table();
+		return -ENOMEM;
+	}
+
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		int *num_consec_descs = per_cpu_ptr(num_consec_descs_per_wq, cpu);
+		*num_consec_descs = 0;
 	}
 
 	pr_debug("initialized wq table\n");
@@ -912,9 +1072,14 @@ static int wq_table_add_wqs(int iaa, int cpu)
 	}
 
 	list_for_each_entry(iaa_wq, &found_device->wqs, list) {
-		wq_table_add(cpu, iaa_wq->wq);
+
+		if (((found_device->n_wq - g_wqs_per_iaa) < 1) ||
+			(n_wqs_added < (found_device->n_wq - g_wqs_per_iaa))) {
+			wq_table_add(cpu, iaa_wq->wq);
+		}
+
 		pr_debug("rebalance: added wq for cpu=%d: iaa wq %d.%d\n",
-			 cpu, iaa_wq->wq->idxd->id, iaa_wq->wq->id);
+			cpu, iaa_wq->wq->idxd->id, iaa_wq->wq->id);
 		n_wqs_added++;
 	}
 
@@ -927,6 +1092,63 @@ static int wq_table_add_wqs(int iaa, int cpu)
 	return ret;
 }
 
+static int global_wq_table_add_wqs(void)
+{
+	struct iaa_device *iaa_device;
+	int ret = 0, n_wqs_added;
+	struct idxd_device *idxd;
+	struct iaa_wq *iaa_wq;
+	struct pci_dev *pdev;
+	struct device *dev;
+	int cpu, node, node_of_cpu = -1;
+
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+
+#ifdef CONFIG_NUMA
+		node_of_cpu = -1;
+		for_each_online_node(node) {
+			const struct cpumask *node_cpus;
+			node_cpus = cpumask_of_node(node);
+			if (!cpumask_test_cpu(cpu, node_cpus))
+				continue;
+			node_of_cpu = node;
+			break;
+		}
+#endif
+		list_for_each_entry(iaa_device, &iaa_devices, list) {
+			idxd = iaa_device->idxd;
+			pdev = idxd->pdev;
+			dev = &pdev->dev;
+
+#ifdef CONFIG_NUMA
+			if (dev && (node_of_cpu != dev->numa_node))
+				continue;
+#endif
+
+			if (iaa_device->n_wq <= g_wqs_per_iaa)
+				continue;
+
+			n_wqs_added = 0;
+
+			list_for_each_entry(iaa_wq, &iaa_device->wqs, list) {
+
+				if (n_wqs_added < (iaa_device->n_wq - g_wqs_per_iaa)) {
+					n_wqs_added++;
+				}
+				else {
+					global_wq_table_add(cpu, iaa_wq->wq);
+					pr_debug("rebalance: added global wq for cpu=%d: iaa wq %d.%d\n",
+						cpu, iaa_wq->wq->idxd->id, iaa_wq->wq->id);
+				}
+			}
+		}
+
+		global_wq_table_set_start_wq(cpu);
+	}
+
+	return ret;
+}
+
 /*
  * Rebalance the wq table so that given a cpu, it's easy to find the
  * closest IAA instance.  The idea is to try to choose the most
@@ -961,6 +1183,7 @@ static void rebalance_wq_table(void)
 	}
 
 	pr_debug("Finished rebalance local wqs.");
+	global_wq_table_add_wqs();
 }
 
 static inline int check_completion(struct device *dev,
@@ -1509,6 +1732,27 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 	goto out;
 }
 
+/*
+ * Caller should make sure to call only if the
+ * per_cpu_ptr "global_wq_table" is non-NULL
+ * and has at least one wq configured.
+ */
+static struct idxd_wq *global_wq_table_next_wq(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(global_wq_table, cpu);
+	int *num_consec_descs = per_cpu_ptr(num_consec_descs_per_wq, cpu);
+
+	if ((*num_consec_descs) == g_consec_descs_per_gwq) {
+		if (++entry->cur_wq >= entry->n_wqs)
+			entry->cur_wq = 0;
+		*num_consec_descs = 0;
+	}
+
+	++(*num_consec_descs);
+
+	return entry->wqs[entry->cur_wq];
+}
+
 static int iaa_comp_acompress(struct acomp_req *req)
 {
 	struct iaa_compression_ctx *compression_ctx;
@@ -1521,6 +1765,7 @@ static int iaa_comp_acompress(struct acomp_req *req)
 	struct idxd_wq *wq;
 	struct device *dev;
 	int order = -1;
+	struct wq_table_entry *entry;
 
 	compression_ctx = crypto_tfm_ctx(tfm);
 
@@ -1539,8 +1784,15 @@ static int iaa_comp_acompress(struct acomp_req *req)
 		disable_async = true;
 
 	cpu = get_cpu();
-	wq = wq_table_next_wq(cpu);
+	entry = per_cpu_ptr(global_wq_table, cpu);
+
+	if (!entry || entry->n_wqs == 0) {
+		wq = wq_table_next_wq(cpu);
+	} else {
+		wq = global_wq_table_next_wq(cpu);
+	}
 	put_cpu();
+
 	if (!wq) {
 		pr_debug("no wq configured for cpu=%d\n", cpu);
 		return -ENODEV;
@@ -2393,13 +2645,32 @@ static int __init iaa_crypto_init_module(void)
 		goto err_sync_attr_create;
 	}
 
+	ret = driver_create_file(&iaa_crypto_driver.drv,
+				&driver_attr_g_wqs_per_iaa);
+	if (ret) {
+		pr_debug("IAA g_wqs_per_iaa attr creation failed\n");
+		goto err_g_wqs_per_iaa_attr_create;
+	}
+
+	ret = driver_create_file(&iaa_crypto_driver.drv,
+				&driver_attr_g_consec_descs_per_gwq);
+	if (ret) {
+		pr_debug("IAA g_consec_descs_per_gwq attr creation failed\n");
+		goto err_g_consec_descs_per_gwq_attr_create;
+	}
+
 	if (iaa_crypto_debugfs_init())
 		pr_warn("debugfs init failed, stats not available\n");
 
 	pr_debug("initialized\n");
 out:
 	return ret;
-
+err_g_consec_descs_per_gwq_attr_create:
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_g_wqs_per_iaa);
+err_g_wqs_per_iaa_attr_create:
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_sync_mode);
 err_sync_attr_create:
 	driver_remove_file(&iaa_crypto_driver.drv,
 			   &driver_attr_verify_compress);
@@ -2423,6 +2694,10 @@ static void __exit iaa_crypto_cleanup_module(void)
 			   &driver_attr_sync_mode);
 	driver_remove_file(&iaa_crypto_driver.drv,
 			   &driver_attr_verify_compress);
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_g_wqs_per_iaa);
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_g_consec_descs_per_gwq);
 	idxd_driver_unregister(&iaa_crypto_driver);
 	iaa_aecs_cleanup_fixed();
 	crypto_free_comp(deflate_generic_tfm);
-- 
2.27.0



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v3 08/13] mm: zswap: acomp_ctx mutex lock/unlock optimizations.
  2024-11-06 19:20 [PATCH v3 00/13] zswap IAA compress batching Kanchana P Sridhar
                   ` (6 preceding siblings ...)
  2024-11-06 19:20 ` [PATCH v3 07/13] crypto: iaa - Distribute compress jobs to all IAA devices on a NUMA node Kanchana P Sridhar
@ 2024-11-06 19:21 ` Kanchana P Sridhar
  2024-11-08 20:14   ` Yosry Ahmed
  2024-11-06 19:21 ` [PATCH v3 09/13] mm: zswap: Modify struct crypto_acomp_ctx to be configurable in nr of acomp_reqs Kanchana P Sridhar
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 38+ messages in thread
From: Kanchana P Sridhar @ 2024-11-06 19:21 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch implements two changes with respect to the acomp_ctx mutex lock:

1) The mutex lock is no longer acquired/released in zswap_compress().
   Instead, zswap_store() acquires the mutex lock once, before compressing
   the pages of a large folio, and releases the lock once all pages in the
   folio have been compressed. This should save some compute cycles in the
   case of large folio stores.
2) In zswap_decompress(), the mutex lock is released after the conditional
   zpool_unmap_handle() based on "src != acomp_ctx->buffer", rather than
   before. This ensures that the value of "src" obtained earlier does not
   change. If the mutex lock were released before the comparison of "src",
   it would be possible for another call to reclaim by the same process to
   obtain the mutex lock and overwrite the value of "src".

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index f6316b66fb23..3e899fa61445 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -880,6 +880,9 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
 	return 0;
 }
 
+/*
+ * The acomp_ctx->mutex must be locked/unlocked in the calling procedure.
+ */
 static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 			   struct zswap_pool *pool)
 {
@@ -895,8 +898,6 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 
 	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
 
-	mutex_lock(&acomp_ctx->mutex);
-
 	dst = acomp_ctx->buffer;
 	sg_init_table(&input, 1);
 	sg_set_page(&input, page, PAGE_SIZE, 0);
@@ -949,7 +950,6 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	else if (alloc_ret)
 		zswap_reject_alloc_fail++;
 
-	mutex_unlock(&acomp_ctx->mutex);
 	return comp_ret == 0 && alloc_ret == 0;
 }
 
@@ -986,10 +986,16 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
 	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
 	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
-	mutex_unlock(&acomp_ctx->mutex);
 
 	if (src != acomp_ctx->buffer)
 		zpool_unmap_handle(zpool, entry->handle);
+
+	/*
+	 * It is safer to unlock the mutex after the check for
+	 * "src != acomp_ctx->buffer" so that the value of "src"
+	 * does not change.
+	 */
+	mutex_unlock(&acomp_ctx->mutex);
 }
 
 /*********************************
@@ -1487,6 +1493,7 @@ bool zswap_store(struct folio *folio)
 {
 	long nr_pages = folio_nr_pages(folio);
 	swp_entry_t swp = folio->swap;
+	struct crypto_acomp_ctx *acomp_ctx;
 	struct obj_cgroup *objcg = NULL;
 	struct mem_cgroup *memcg = NULL;
 	struct zswap_pool *pool;
@@ -1526,6 +1533,9 @@ bool zswap_store(struct folio *folio)
 		mem_cgroup_put(memcg);
 	}
 
+	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
+	mutex_lock(&acomp_ctx->mutex);
+
 	for (index = 0; index < nr_pages; ++index) {
 		struct page *page = folio_page(folio, index);
 		ssize_t bytes;
@@ -1547,6 +1557,7 @@ bool zswap_store(struct folio *folio)
 	ret = true;
 
 put_pool:
+	mutex_unlock(&acomp_ctx->mutex);
 	zswap_pool_put(pool);
 put_objcg:
 	obj_cgroup_put(objcg);
-- 
2.27.0



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v3 09/13] mm: zswap: Modify struct crypto_acomp_ctx to be configurable in nr of acomp_reqs.
  2024-11-06 19:20 [PATCH v3 00/13] zswap IAA compress batching Kanchana P Sridhar
                   ` (7 preceding siblings ...)
  2024-11-06 19:21 ` [PATCH v3 08/13] mm: zswap: acomp_ctx mutex lock/unlock optimizations Kanchana P Sridhar
@ 2024-11-06 19:21 ` Kanchana P Sridhar
  2024-11-07 17:20   ` Johannes Weiner
  2024-11-06 19:21 ` [PATCH v3 10/13] mm: zswap: Add a per-cpu "acomp_batch_ctx" to struct zswap_pool Kanchana P Sridhar
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 38+ messages in thread
From: Kanchana P Sridhar @ 2024-11-06 19:21 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Modified the definition of "struct crypto_acomp_ctx" to represent a
configurable number of acomp_reqs and the required number of buffers.

Accordingly, refactored the code that allocates/deallocates the acomp_ctx
resources, so that it can be called to create a regular acomp_ctx with
exactly one acomp_req/buffer, for use in the existing non-batching
zswap_store(), as well as to create a separate "batching acomp_ctx" with
multiple acomp_reqs/buffers for IAA compress batching.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 149 ++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 107 insertions(+), 42 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 3e899fa61445..02e031122fdf 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -143,9 +143,10 @@ bool zswap_never_enabled(void)
 
 struct crypto_acomp_ctx {
 	struct crypto_acomp *acomp;
-	struct acomp_req *req;
+	struct acomp_req **reqs;
+	u8 **buffers;
+	unsigned int nr_reqs;
 	struct crypto_wait wait;
-	u8 *buffer;
 	struct mutex mutex;
 	bool is_sleepable;
 };
@@ -241,6 +242,11 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
 	pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name,		\
 		 zpool_get_type((p)->zpool))
 
+static int zswap_create_acomp_ctx(unsigned int cpu,
+				  struct crypto_acomp_ctx *acomp_ctx,
+				  char *tfm_name,
+				  unsigned int nr_reqs);
+
 /*********************************
 * pool functions
 **********************************/
@@ -813,69 +819,128 @@ static void zswap_entry_free(struct zswap_entry *entry)
 /*********************************
 * compressed storage functions
 **********************************/
-static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
+static int zswap_create_acomp_ctx(unsigned int cpu,
+				  struct crypto_acomp_ctx *acomp_ctx,
+				  char *tfm_name,
+				  unsigned int nr_reqs)
 {
-	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
-	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
 	struct crypto_acomp *acomp;
-	struct acomp_req *req;
-	int ret;
+	int ret = -ENOMEM;
+	int i, j;
 
+	acomp_ctx->nr_reqs = 0;
 	mutex_init(&acomp_ctx->mutex);
 
-	acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
-	if (!acomp_ctx->buffer)
-		return -ENOMEM;
-
-	acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
+	acomp = crypto_alloc_acomp_node(tfm_name, 0, 0, cpu_to_node(cpu));
 	if (IS_ERR(acomp)) {
 		pr_err("could not alloc crypto acomp %s : %ld\n",
-				pool->tfm_name, PTR_ERR(acomp));
-		ret = PTR_ERR(acomp);
-		goto acomp_fail;
+				tfm_name, PTR_ERR(acomp));
+		return PTR_ERR(acomp);
 	}
+
 	acomp_ctx->acomp = acomp;
 	acomp_ctx->is_sleepable = acomp_is_async(acomp);
 
-	req = acomp_request_alloc(acomp_ctx->acomp);
-	if (!req) {
-		pr_err("could not alloc crypto acomp_request %s\n",
-		       pool->tfm_name);
-		ret = -ENOMEM;
+	acomp_ctx->buffers = kmalloc_node(nr_reqs * sizeof(u8 *),
+					  GFP_KERNEL, cpu_to_node(cpu));
+	if (!acomp_ctx->buffers)
+		goto buf_fail;
+
+	for (i = 0; i < nr_reqs; ++i) {
+		acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2,
+						     GFP_KERNEL, cpu_to_node(cpu));
+		if (!acomp_ctx->buffers[i]) {
+			for (j = 0; j < i; ++j)
+				kfree(acomp_ctx->buffers[j]);
+			kfree(acomp_ctx->buffers);
+			ret = -ENOMEM;
+			goto buf_fail;
+		}
+	}
+
+	acomp_ctx->reqs = kmalloc_node(nr_reqs * sizeof(struct acomp_req *),
+				       GFP_KERNEL, cpu_to_node(cpu));
+	if (!acomp_ctx->reqs)
 		goto req_fail;
+
+	for (i = 0; i < nr_reqs; ++i) {
+		acomp_ctx->reqs[i] = acomp_request_alloc(acomp_ctx->acomp);
+		if (!acomp_ctx->reqs[i]) {
+			pr_err("could not alloc crypto acomp_request reqs[%d] %s\n",
+			       i, tfm_name);
+			for (j = 0; j < i; ++j)
+				acomp_request_free(acomp_ctx->reqs[j]);
+			kfree(acomp_ctx->reqs);
+			ret = -ENOMEM;
+			goto req_fail;
+		}
 	}
-	acomp_ctx->req = req;
 
+	/*
+	 * The crypto_wait is used only in the fully synchronous case, i.e., with
+	 * scomp or the non-poll mode of acomp; hence there is only one "wait" per
+	 * acomp_ctx, with callback set to reqs[0], under the assumption that
+	 * there is at least 1 request per acomp_ctx.
+	 */
 	crypto_init_wait(&acomp_ctx->wait);
 	/*
 	 * if the backend of acomp is async zip, crypto_req_done() will wakeup
 	 * crypto_wait_req(); if the backend of acomp is scomp, the callback
 	 * won't be called, crypto_wait_req() will return without blocking.
 	 */
-	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+	acomp_request_set_callback(acomp_ctx->reqs[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
 				   crypto_req_done, &acomp_ctx->wait);
 
+	acomp_ctx->nr_reqs = nr_reqs;
 	return 0;
 
 req_fail:
+	for (i = 0; i < nr_reqs; ++i)
+		kfree(acomp_ctx->buffers[i]);
+	kfree(acomp_ctx->buffers);
+buf_fail:
 	crypto_free_acomp(acomp_ctx->acomp);
-acomp_fail:
-	kfree(acomp_ctx->buffer);
 	return ret;
 }
 
-static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
+static void zswap_delete_acomp_ctx(struct crypto_acomp_ctx *acomp_ctx)
 {
-	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
-	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
-
 	if (!IS_ERR_OR_NULL(acomp_ctx)) {
-		if (!IS_ERR_OR_NULL(acomp_ctx->req))
-			acomp_request_free(acomp_ctx->req);
+		int i;
+
+		for (i = 0; i < acomp_ctx->nr_reqs; ++i)
+			if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
+				acomp_request_free(acomp_ctx->reqs[i]);
+		kfree(acomp_ctx->reqs);
+
+		for (i = 0; i < acomp_ctx->nr_reqs; ++i)
+			kfree(acomp_ctx->buffers[i]);
+		kfree(acomp_ctx->buffers);
+
 		if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
 			crypto_free_acomp(acomp_ctx->acomp);
-		kfree(acomp_ctx->buffer);
+
+		acomp_ctx->nr_reqs = 0;
+		acomp_ctx = NULL;
 	}
+}
+
+static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
+{
+	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
+	struct crypto_acomp_ctx *acomp_ctx;
+
+	acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
+	return zswap_create_acomp_ctx(cpu, acomp_ctx, pool->tfm_name, 1);
+}
+
+static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
+{
+	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
+	struct crypto_acomp_ctx *acomp_ctx;
+
+	acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
+	zswap_delete_acomp_ctx(acomp_ctx);
 
 	return 0;
 }
@@ -898,7 +963,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 
 	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
 
-	dst = acomp_ctx->buffer;
+	dst = acomp_ctx->buffers[0];
 	sg_init_table(&input, 1);
 	sg_set_page(&input, page, PAGE_SIZE, 0);
 
@@ -908,7 +973,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	 * giving the dst buffer with enough length to avoid buffer overflow.
 	 */
 	sg_init_one(&output, dst, PAGE_SIZE * 2);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
+	acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, PAGE_SIZE, dlen);
 
 	/*
 	 * it maybe looks a little bit silly that we send an asynchronous request,
@@ -922,8 +987,8 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	 * but in different threads running on different cpu, we have different
 	 * acomp instance, so multiple threads can do (de)compression in parallel.
 	 */
-	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
-	dlen = acomp_ctx->req->dlen;
+	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
+	dlen = acomp_ctx->reqs[0]->dlen;
 	if (comp_ret)
 		goto unlock;
 
@@ -975,24 +1040,24 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	 */
 	if ((acomp_ctx->is_sleepable && !zpool_can_sleep_mapped(zpool)) ||
 	    !virt_addr_valid(src)) {
-		memcpy(acomp_ctx->buffer, src, entry->length);
-		src = acomp_ctx->buffer;
+		memcpy(acomp_ctx->buffers[0], src, entry->length);
+		src = acomp_ctx->buffers[0];
 		zpool_unmap_handle(zpool, entry->handle);
 	}
 
 	sg_init_one(&input, src, entry->length);
 	sg_init_table(&output, 1);
 	sg_set_folio(&output, folio, PAGE_SIZE, 0);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
-	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
-	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
+	acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, entry->length, PAGE_SIZE);
+	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->reqs[0]), &acomp_ctx->wait));
+	BUG_ON(acomp_ctx->reqs[0]->dlen != PAGE_SIZE);
 
-	if (src != acomp_ctx->buffer)
+	if (src != acomp_ctx->buffers[0])
 		zpool_unmap_handle(zpool, entry->handle);
 
 	/*
 	 * It is safer to unlock the mutex after the check for
-	 * "src != acomp_ctx->buffer" so that the value of "src"
+	 * "src != acomp_ctx->buffers[0]" so that the value of "src"
 	 * does not change.
 	 */
 	mutex_unlock(&acomp_ctx->mutex);
-- 
2.27.0



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v3 10/13] mm: zswap: Add a per-cpu "acomp_batch_ctx" to struct zswap_pool.
  2024-11-06 19:20 [PATCH v3 00/13] zswap IAA compress batching Kanchana P Sridhar
                   ` (8 preceding siblings ...)
  2024-11-06 19:21 ` [PATCH v3 09/13] mm: zswap: Modify struct crypto_acomp_ctx to be configurable in nr of acomp_reqs Kanchana P Sridhar
@ 2024-11-06 19:21 ` Kanchana P Sridhar
  2024-11-08 20:23   ` Yosry Ahmed
  2024-11-06 19:21 ` [PATCH v3 11/13] mm: zswap: Allocate acomp_batch_ctx resources for a given zswap_pool Kanchana P Sridhar
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 38+ messages in thread
From: Kanchana P Sridhar @ 2024-11-06 19:21 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch adds a separate per-cpu batching acomp context "acomp_batch_ctx"
to the zswap_pool. The per-cpu acomp_batch_ctx pointer is allocated at pool
creation time, but no per-cpu resources are allocated for it.

The idea is to not incur the memory footprint cost of multiple acomp_reqs
and buffers in the existing "acomp_ctx" for cases where compress batching
is not possible; for instance, with software compressor algorithms, on
systems without IAA, or on systems with IAA that want to run the existing
non-batching implementation of zswap_store() for large folios.

By creating a separate acomp_batch_ctx, we have the ability to allocate
additional memory per-cpu only if the zswap compressor supports batching,
and if the user wants to enable the use of compress batching in
zswap_store() to improve swapout performance of large folios.

Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 02e031122fdf..80a928cf0f7e 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -160,6 +160,7 @@ struct crypto_acomp_ctx {
 struct zswap_pool {
 	struct zpool *zpool;
 	struct crypto_acomp_ctx __percpu *acomp_ctx;
+	struct crypto_acomp_ctx __percpu *acomp_batch_ctx;
 	struct percpu_ref ref;
 	struct list_head list;
 	struct work_struct release_work;
@@ -287,10 +288,14 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 
 	pool->acomp_ctx = alloc_percpu(*pool->acomp_ctx);
 	if (!pool->acomp_ctx) {
-		pr_err("percpu alloc failed\n");
+		pr_err("percpu acomp_ctx alloc failed\n");
 		goto error;
 	}
 
+	pool->acomp_batch_ctx = alloc_percpu(*pool->acomp_batch_ctx);
+	if (!pool->acomp_batch_ctx)
+		pr_err("percpu acomp_batch_ctx alloc failed\n");
+
 	ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
 				       &pool->node);
 	if (ret)
@@ -312,6 +317,8 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 ref_fail:
 	cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
 error:
+	if (pool->acomp_batch_ctx)
+		free_percpu(pool->acomp_batch_ctx);
 	if (pool->acomp_ctx)
 		free_percpu(pool->acomp_ctx);
 	if (pool->zpool)
@@ -368,6 +375,8 @@ static void zswap_pool_destroy(struct zswap_pool *pool)
 
 	cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
 	free_percpu(pool->acomp_ctx);
+	if (pool->acomp_batch_ctx)
+		free_percpu(pool->acomp_batch_ctx);
 
 	zpool_destroy_pool(pool->zpool);
 	kfree(pool);
@@ -930,6 +939,11 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
 	struct crypto_acomp_ctx *acomp_ctx;
 
+	if (pool->acomp_batch_ctx) {
+		acomp_ctx = per_cpu_ptr(pool->acomp_batch_ctx, cpu);
+		acomp_ctx->nr_reqs = 0;
+	}
+
 	acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
 	return zswap_create_acomp_ctx(cpu, acomp_ctx, pool->tfm_name, 1);
 }
@@ -939,6 +953,12 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
 	struct crypto_acomp_ctx *acomp_ctx;
 
+	if (pool->acomp_batch_ctx) {
+		acomp_ctx = per_cpu_ptr(pool->acomp_batch_ctx, cpu);
+		if (!IS_ERR_OR_NULL(acomp_ctx) && (acomp_ctx->nr_reqs > 0))
+			zswap_delete_acomp_ctx(acomp_ctx);
+	}
+
 	acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
 	zswap_delete_acomp_ctx(acomp_ctx);
 
-- 
2.27.0



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v3 11/13] mm: zswap: Allocate acomp_batch_ctx resources for a given zswap_pool.
  2024-11-06 19:20 [PATCH v3 00/13] zswap IAA compress batching Kanchana P Sridhar
                   ` (9 preceding siblings ...)
  2024-11-06 19:21 ` [PATCH v3 10/13] mm: zswap: Add a per-cpu "acomp_batch_ctx" to struct zswap_pool Kanchana P Sridhar
@ 2024-11-06 19:21 ` Kanchana P Sridhar
  2024-11-07 17:31   ` Johannes Weiner
  2024-11-06 19:21 ` [PATCH v3 12/13] mm: Add sysctl vm.compress-batching switch for compress batching during swapout Kanchana P Sridhar
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 38+ messages in thread
From: Kanchana P Sridhar @ 2024-11-06 19:21 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

If the zswap_pool is associated with an acomp_alg/crypto_acomp that has
registered batch_compress() and batch_decompress() API, we can allocate the
necessary batching resources for the pool's acomp_batch_ctx.

This patch makes the above determination on incurring the per-cpu memory
footprint cost for batching, and if so, goes ahead and allocates
SWAP_CRYPTO_BATCH_SIZE (i.e. 8) acomp_reqs/buffers for the
pool->acomp_batch_ctx on that specific cpu.

It also "remembers" the pool's batching readiness as a result of the above,
through a new

   	enum batch_comp_status can_batch_comp;

member added to struct zswap_pool, for fast retrieval during
zswap_store().

This gives us a way to incur the memory footprint cost of the
pool->acomp_batch_ctx resources only on a given cpu on which zswap_store()
actually needs to process a large folio.
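
To put the memory footprint in perspective: with SWAP_CRYPTO_BATCH_SIZE set
to 8, and assuming 4KB pages, a cpu that does end up batching allocates 8
acomp_reqs plus 8 destination buffers of 2 pages each (64KB of buffers) for
its pool->acomp_batch_ctx; cpus that never store a large folio allocate
nothing beyond the existing single-req acomp_ctx.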

Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Suggested-by: Ying Huang <ying.huang@intel.com>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 include/linux/zswap.h |  7 ++++++
 mm/zswap.c            | 52 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+)

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index d961ead91bf1..9ad27ab3d222 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -7,6 +7,13 @@
 
 struct lruvec;
 
+/*
+ * For IAA compression batching:
+ * Maximum number of IAA acomp compress requests that will be processed
+ * in a batch: in parallel, if iaa_crypto async/no irq mode is enabled
+ * (the default); else sequentially, if iaa_crypto sync mode is in effect.
+ */
+#define SWAP_CRYPTO_BATCH_SIZE 8UL
 extern atomic_long_t zswap_stored_pages;
 
 #ifdef CONFIG_ZSWAP
diff --git a/mm/zswap.c b/mm/zswap.c
index 80a928cf0f7e..2af736e38213 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -151,6 +151,12 @@ struct crypto_acomp_ctx {
 	bool is_sleepable;
 };
 
+enum batch_comp_status {
+	UNINIT_BATCH_COMP = -1,
+	CANNOT_BATCH_COMP = 0,
+	BATCH_COMP_ENABLED = 1,
+};
+
 /*
  * The lock ordering is zswap_tree.lock -> zswap_pool.lru_lock.
  * The only case where lru_lock is not acquired while holding tree.lock is
@@ -159,6 +165,7 @@ struct crypto_acomp_ctx {
  */
 struct zswap_pool {
 	struct zpool *zpool;
+	enum batch_comp_status can_batch_comp;
 	struct crypto_acomp_ctx __percpu *acomp_ctx;
 	struct crypto_acomp_ctx __percpu *acomp_batch_ctx;
 	struct percpu_ref ref;
@@ -310,6 +317,7 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 		goto ref_fail;
 	INIT_LIST_HEAD(&pool->list);
 
+	pool->can_batch_comp = UNINIT_BATCH_COMP;
 	zswap_pool_debug("created", pool);
 
 	return pool;
@@ -695,6 +703,39 @@ static int zswap_enabled_param_set(const char *val,
 	return ret;
 }
 
+/* Called only if sysctl vm.compress-batching is set to "1". */
+static __always_inline bool zswap_pool_can_batch(struct zswap_pool *pool)
+{
+	struct crypto_acomp_ctx *acomp_ctx;
+
+	if ((pool->can_batch_comp == BATCH_COMP_ENABLED) &&
+		!IS_ERR_OR_NULL((acomp_ctx = raw_cpu_ptr(pool->acomp_batch_ctx))) &&
+		(acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE))
+		return true;
+
+	if (pool->can_batch_comp == CANNOT_BATCH_COMP)
+		return false;
+
+	if ((pool->can_batch_comp == UNINIT_BATCH_COMP) && pool->acomp_batch_ctx) {
+		acomp_ctx = raw_cpu_ptr(pool->acomp_batch_ctx);
+
+		if (!IS_ERR_OR_NULL(acomp_ctx)) {
+			if ((acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE) ||
+			    (!acomp_ctx->nr_reqs &&
+			     !zswap_create_acomp_ctx(raw_smp_processor_id(),
+						     acomp_ctx,
+						     pool->tfm_name,
+						     SWAP_CRYPTO_BATCH_SIZE))) {
+				pool->can_batch_comp = BATCH_COMP_ENABLED;
+				return true;
+			}
+		}
+	}
+
+	pool->can_batch_comp = CANNOT_BATCH_COMP;
+	return false;
+}
+
 /*********************************
 * lru functions
 **********************************/
@@ -850,6 +891,17 @@ static int zswap_create_acomp_ctx(unsigned int cpu,
 	acomp_ctx->acomp = acomp;
 	acomp_ctx->is_sleepable = acomp_is_async(acomp);
 
+	/*
+	 * Cannot create a batching ctx without the crypto acomp alg supporting
+	 * batch_compress and batch_decompress API.
+	 */
+	if ((nr_reqs > 1) && (!acomp->batch_compress || !acomp->batch_decompress)) {
+		WARN_ONCE(1, "Cannot alloc acomp_ctx with %d reqs since crypto acomp %s\nhas not registered batch_compress() and/or batch_decompress()\n",
+			  nr_reqs, tfm_name);
+		ret = -ENODEV;
+		goto buf_fail;
+	}
+
 	acomp_ctx->buffers = kmalloc_node(nr_reqs * sizeof(u8 *),
 					  GFP_KERNEL, cpu_to_node(cpu));
 	if (!acomp_ctx->buffers)
-- 
2.27.0



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v3 12/13] mm: Add sysctl vm.compress-batching switch for compress batching during swapout.
  2024-11-06 19:20 [PATCH v3 00/13] zswap IAA compress batching Kanchana P Sridhar
                   ` (10 preceding siblings ...)
  2024-11-06 19:21 ` [PATCH v3 11/13] mm: zswap: Allocate acomp_batch_ctx resources for a given zswap_pool Kanchana P Sridhar
@ 2024-11-06 19:21 ` Kanchana P Sridhar
  2024-11-06 20:17   ` Andrew Morton
  2024-11-07 17:34   ` Johannes Weiner
  2024-11-06 19:21 ` [PATCH v3 13/13] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios Kanchana P Sridhar
  2024-11-06 20:25 ` [PATCH v3 00/13] zswap IAA compress batching Andrew Morton
  13 siblings, 2 replies; 38+ messages in thread
From: Kanchana P Sridhar @ 2024-11-06 19:21 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

The sysctl vm.compress-batching parameter is 0 by default. If the platform
has Intel IAA, the user can run experiments with IAA compress batching of
large folios in zswap_store() as follows:

sysctl vm.compress-batching=1
echo deflate-iaa > /sys/module/zswap/parameters/compressor

This is expected to significantly improve zswap_store() latency when
swapping out large folios, due to parallel compression of 8 pages of the
large folio at a time, in hardware.

Setting vm.compress-batching to "1" takes effect only if the zswap
compression algorithm's crypto_acomp registers implementations for the
batch_compress() and batch_decompress() API. In other words, compress
batching currently works only with the iaa_crypto driver, which does
register these new batching API. It is a no-op for compressors that do not
register the batching API.

The sysctl vm.compress-batching acts as a runtime switch: it takes effect
on subsequent zswap_store() calls on any given core. If the switch is "1",
large folios will use parallel batched compression of the folio's pages.
If the switch is "0", zswap_store() will use sequential compression for
storing every page in a large folio.
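
For example, assuming 4KB pages, a 32KB folio has 8 pages: with the switch
set to "1" and deflate-iaa as the zswap compressor, those 8 pages are
compressed as a single batch, in parallel, by IAA. With the switch set to
"0", the same 8 pages are compressed one at a time.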

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 include/linux/mm.h | 2 ++
 kernel/sysctl.c    | 9 +++++++++
 mm/swap.c          | 6 ++++++
 3 files changed, 17 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fecd47239fa9..f61915aa2f37 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -82,8 +82,10 @@ extern const int page_cluster_max;
 
 #ifdef CONFIG_SYSCTL
 extern int sysctl_legacy_va_layout;
+extern unsigned int compress_batching;
 #else
 #define sysctl_legacy_va_layout 0
+#define compress_batching 0
 #endif
 
 #ifdef CONFIG_HAVE_ARCH_MMAP_RND_BITS
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 79e6cb1d5c48..e298857595b4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2064,6 +2064,15 @@ static struct ctl_table vm_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= (void *)&page_cluster_max,
 	},
+	{
+		.procname	= "compress-batching",
+		.data		= &compress_batching,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_douintvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 	{
 		.procname	= "dirtytime_expire_seconds",
 		.data		= &dirtytime_expire_interval,
diff --git a/mm/swap.c b/mm/swap.c
index 638a3f001676..bc4c9079769e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -47,6 +47,9 @@
 int page_cluster;
 const int page_cluster_max = 31;
 
+/* Enable/disable compress batching during swapout. */
+unsigned int compress_batching;
+
 struct cpu_fbatches {
 	/*
 	 * The following folio batches are grouped together because they are protected
@@ -1074,4 +1077,7 @@ void __init swap_setup(void)
 	 * Right now other parts of the system means that we
 	 * _really_ don't want to cluster much more
 	 */
+
+	/* Disable compress batching during swapout by default. */
+	compress_batching = 0;
 }
-- 
2.27.0



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v3 13/13] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios.
  2024-11-06 19:20 [PATCH v3 00/13] zswap IAA compress batching Kanchana P Sridhar
                   ` (11 preceding siblings ...)
  2024-11-06 19:21 ` [PATCH v3 12/13] mm: Add sysctl vm.compress-batching switch for compress batching during swapout Kanchana P Sridhar
@ 2024-11-06 19:21 ` Kanchana P Sridhar
  2024-11-07 18:16   ` Johannes Weiner
  2024-11-07 18:53   ` Johannes Weiner
  2024-11-06 20:25 ` [PATCH v3 00/13] zswap IAA compress batching Andrew Morton
  13 siblings, 2 replies; 38+ messages in thread
From: Kanchana P Sridhar @ 2024-11-06 19:21 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

If the system has Intel IAA, and if sysctl vm.compress-batching is set to
"1", zswap_store() will call crypto_acomp_batch_compress() to compress up
to SWAP_CRYPTO_BATCH_SIZE (i.e. 8) pages in large folios in parallel using
the multiple compress engines available in IAA hardware.

On platforms with multiple IAA devices per socket, compress jobs from all
cores in a socket will be distributed among all IAA devices on the socket
by the iaa_crypto driver.

With deflate-iaa configured as the zswap compressor and
sysctl vm.compress-batching enabled, the first time zswap_store() has to
swap out a large folio on any given cpu, it will allocate the
pool->acomp_batch_ctx resources on that cpu, and set pool->can_batch_comp
to BATCH_COMP_ENABLED. It will then proceed to call the main
__zswap_store_batch_core() compress batching function. Subsequent calls to
zswap_store() on the same cpu will go ahead and use the acomp_batch_ctx
after checking the pool->can_batch_comp status.

Hence, we allocate the per-cpu pool->acomp_batch_ctx resources only on an
as-needed basis, to reduce the memory footprint cost. The cost is not
incurred on cores that never get to swap out a large folio.
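
For example, assuming 4KB pages, the first zswap_store() of a 64KB
(16-page) folio on a given cpu triggers the one-time allocation of that
cpu's acomp_batch_ctx resources, and the folio is then compressed as two
batches of 8 pages each; a cpu that only ever stores single-page folios
never incurs this per-cpu cost.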

This patch introduces the main __zswap_store_batch_core() function for
compress batching. This interface represents the extensible compress
batching architecture that can potentially be called with a batch of
any-order folios from shrink_folio_list(). In other words, although
zswap_store() calls __zswap_store_batch_core() with exactly one large folio
in this patch, we can reuse this interface to reclaim a batch of folios, to
significantly improve the reclaim path efficiency due to IAA's parallel
compression capability.

The newly added functions that implement batched stores follow the
general structure of zswap_store() for a large folio. Some amount of
restructuring and optimization is done to minimize failure points
for a batch, fail early, and maximize the zswap store pipeline occupancy
with SWAP_CRYPTO_BATCH_SIZE pages, potentially from multiple
folios. This is intended to maximize reclaim throughput with IAA's
parallel compression in hardware.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 include/linux/zswap.h |  84 ++++++
 mm/zswap.c            | 625 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 709 insertions(+)

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 9ad27ab3d222..6d3ef4780c69 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -31,6 +31,88 @@ struct zswap_lruvec_state {
 	atomic_long_t nr_disk_swapins;
 };
 
+/*
+ * struct zswap_store_sub_batch_page:
+ *
+ * This represents one "zswap batching element", namely, the
+ * attributes associated with a page in a large folio that will
+ * be compressed and stored in zswap. The term "batch" is reserved
+ * for a conceptual "batch" of folios that can be sent to
+ * zswap_store() by reclaim. The term "sub-batch" is used to describe
+ * a collection of "zswap batching elements", i.e., an array of
+ * "struct zswap_store_sub_batch_page *".
+ *
+ * The zswap compress sub-batch size is specified by
+ * SWAP_CRYPTO_BATCH_SIZE, currently set as 8UL if the
+ * platform has Intel IAA. This means zswap can store a large folio
+ * by creating sub-batches of up to 8 pages and compressing this
+ * batch using IAA to parallelize the 8 compress jobs in hardware.
+ * For example, a 64KB folio can be compressed as 2 sub-batches of
+ * 8 pages each. This can significantly improve the zswap_store()
+ * performance for large folios.
+ *
+ * Although the page itself is represented directly, the structure
+ * adds a "u8 batch_idx" to represent an index for the folio in a
+ * conceptual "batch of folios" that can be passed to zswap_store().
+ * Conceptually, this allows for up to 256 folios that can be passed
+ * to zswap_store(). If this conceptual number of folios sent to
+ * zswap_store() exceeds 256, the "batch_idx" needs to become u16.
+ */
+struct zswap_store_sub_batch_page {
+	u8 batch_idx;
+	swp_entry_t swpentry;
+	struct obj_cgroup *objcg;
+	struct zswap_entry *entry;
+	int error; /* folio error status. */
+};
+
+/*
+ * struct zswap_store_pipeline_state:
+ *
+ * This stores state during IAA compress batching of (conceptually, a batch of)
+ * folios. The term pipelining in this context, refers to breaking down
+ * the batch of folios being reclaimed into sub-batches of
+ * SWAP_CRYPTO_BATCH_SIZE pages, batch compressing and storing the
+ * sub-batch. This concept could be further evolved to use overlap of CPU
+ * computes with IAA computes. For instance, we could stage the post-compress
+ * computes for sub-batch "N-1" to happen in parallel with IAA batch
+ * compression of sub-batch "N".
+ *
+ * We begin by developing the concept of compress batching. Pipelining with
+ * overlap can be future work.
+ *
+ * @errors: The errors status for the batch of reclaim folios passed in from
+ *          a higher mm layer such as swap_writepage().
+ * @pool: A valid zswap_pool.
+ * @acomp_ctx: The per-cpu pointer to the crypto_acomp_ctx for the @pool.
+ * @sub_batch: This is an array that represents the sub-batch of up to
+ *             SWAP_CRYPTO_BATCH_SIZE pages that are being stored
+ *             in zswap.
+ * @comp_dsts: The destination buffers for crypto_acomp_compress() for each
+ *             page being compressed.
+ * @comp_dlens: The destination buffers' lengths from crypto_acomp_compress()
+ *              obtained after crypto_acomp_poll() returns completion status,
+ *              for each page being compressed.
+ * @comp_errors: Compression errors for each page being compressed.
+ * @nr_comp_pages: Total number of pages in @sub_batch.
+ *
+ * Note:
+ * The max sub-batch size is SWAP_CRYPTO_BATCH_SIZE, currently 8UL.
+ * Hence, if SWAP_CRYPTO_BATCH_SIZE exceeds 256, some of the
+ * u8 members (except @comp_dsts) need to become u16.
+ */
+struct zswap_store_pipeline_state {
+	int *errors;
+	struct zswap_pool *pool;
+	struct crypto_acomp_ctx *acomp_ctx;
+	struct zswap_store_sub_batch_page *sub_batch;
+	struct page **comp_pages;
+	u8 **comp_dsts;
+	unsigned int *comp_dlens;
+	int *comp_errors;
+	u8 nr_comp_pages;
+};
+
 unsigned long zswap_total_pages(void);
 bool zswap_store(struct folio *folio);
 bool zswap_load(struct folio *folio);
@@ -45,6 +127,8 @@ bool zswap_never_enabled(void);
 #else
 
 struct zswap_lruvec_state {};
+struct zswap_store_sub_batch_page {};
+struct zswap_store_pipeline_state {};
 
 static inline bool zswap_store(struct folio *folio)
 {
diff --git a/mm/zswap.c b/mm/zswap.c
index 2af736e38213..538aac3fb552 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -255,6 +255,12 @@ static int zswap_create_acomp_ctx(unsigned int cpu,
 				  char *tfm_name,
 				  unsigned int nr_reqs);
 
+static bool __zswap_store_batch_core(
+	int node_id,
+	struct folio **folios,
+	int *errors,
+	unsigned int nr_folios);
+
 /*********************************
 * pool functions
 **********************************/
@@ -1626,6 +1632,12 @@ static ssize_t zswap_store_page(struct page *page,
 	return -EINVAL;
 }
 
+/*
+ * Modified to use the IAA compress batching framework implemented in
+ * __zswap_store_batch_core() if sysctl vm.compress-batching is 1.
+ * The batching code is intended to significantly improve folio store
+ * performance over the sequential code.
+ */
 bool zswap_store(struct folio *folio)
 {
 	long nr_pages = folio_nr_pages(folio);
@@ -1638,6 +1650,38 @@ bool zswap_store(struct folio *folio)
 	bool ret = false;
 	long index;
 
+	/*
+	 * Improve large folio zswap_store() latency with IAA compress batching,
+	 * if this is enabled by setting sysctl vm.compress-batching to "1".
+	 * If enabled, the large folio's pages are compressed in parallel in
+	 * batches of SWAP_CRYPTO_BATCH_SIZE pages. If disabled, every page in
+	 * the large folio is compressed sequentially.
+	 */
+	if (folio_test_large(folio) && READ_ONCE(compress_batching)) {
+		pool = zswap_pool_current_get();
+		if (!pool) {
+			pr_err("Cannot setup acomp_batch_ctx for compress batching: no current pool found\n");
+			goto sequential_store;
+		}
+
+		if (zswap_pool_can_batch(pool)) {
+			int error = -1;
+			bool store_batch = __zswap_store_batch_core(
+						folio_nid(folio),
+						&folio, &error, 1);
+
+			if (store_batch) {
+				zswap_pool_put(pool);
+				if (!error)
+					ret = true;
+				return ret;
+			}
+		}
+		zswap_pool_put(pool);
+	}
+
+sequential_store:
+
 	VM_WARN_ON_ONCE(!folio_test_locked(folio));
 	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
 
@@ -1724,6 +1768,587 @@ bool zswap_store(struct folio *folio)
 	return ret;
 }
 
+/*
+ * Note: If SWAP_CRYPTO_BATCH_SIZE exceeds 256, change the
+ * u8 stack variables in the next several functions to u16.
+ */
+
+/*
+ * Propagate the "sbp" error condition to other batch elements belonging to
+ * the same folio as "sbp".
+ */
+static __always_inline void zswap_store_propagate_errors(
+	struct zswap_store_pipeline_state *zst,
+	u8 error_batch_idx)
+{
+	u8 i;
+
+	if (zst->errors[error_batch_idx])
+		return;
+
+	for (i = 0; i < zst->nr_comp_pages; ++i) {
+		struct zswap_store_sub_batch_page *sbp = &zst->sub_batch[i];
+
+		if (sbp->batch_idx == error_batch_idx) {
+			if (!sbp->error) {
+				if (sbp->entry) {
+					if (!IS_ERR_VALUE(sbp->entry->handle))
+						zpool_free(zst->pool->zpool, sbp->entry->handle);
+
+					zswap_entry_cache_free(sbp->entry);
+					sbp->entry = NULL;
+				}
+				sbp->error = -EINVAL;
+			}
+		}
+	}
+
+	/*
+	 * Set zswap status for the folio to "error"
+	 * for use in swap_writepage.
+	 */
+	zst->errors[error_batch_idx] = -EINVAL;
+}
+
+static __always_inline void zswap_process_comp_errors(
+	struct zswap_store_pipeline_state *zst)
+{
+	u8 i;
+
+	for (i = 0; i < zst->nr_comp_pages; ++i) {
+		struct zswap_store_sub_batch_page *sbp = &zst->sub_batch[i];
+
+		if (zst->comp_errors[i]) {
+			if (zst->comp_errors[i] == -ENOSPC)
+				zswap_reject_compress_poor++;
+			else
+				zswap_reject_compress_fail++;
+
+			if (!sbp->error)
+				zswap_store_propagate_errors(zst,
+							     sbp->batch_idx);
+		}
+	}
+}
+
+static void zswap_compress_batch(struct zswap_store_pipeline_state *zst)
+{
+	/*
+	 * Compress up to SWAP_CRYPTO_BATCH_SIZE pages.
+	 * It is important to note that the zswap pool's per-cpu "acomp_batch_ctx"
+	 * resources are allocated only if the crypto_acomp has registered both the
+	 * crypto_acomp_batch_compress() and crypto_acomp_batch_decompress() APIs.
+	 * The iaa_crypto driver registers implementations of both these APIs.
+	 * Hence, if IAA is the zswap compressor and sysctl vm.compress-batching
+	 * is set to "1", the call to crypto_acomp_batch_compress() will compress
+	 * the pages in parallel, leading to significant performance improvements
+	 * over software compressors.
+	 */
+	crypto_acomp_batch_compress(
+		zst->acomp_ctx->reqs,
+		&zst->acomp_ctx->wait,
+		zst->comp_pages,
+		zst->comp_dsts,
+		zst->comp_dlens,
+		zst->comp_errors,
+		zst->nr_comp_pages);
+
+	/*
+	 * Scan the sub-batch for any compression errors,
+	 * and invalidate pages with errors, along with other
+	 * pages belonging to the same folio as the error pages.
+	 */
+	zswap_process_comp_errors(zst);
+}
+
+static void zswap_zpool_store_sub_batch(
+	struct zswap_store_pipeline_state *zst)
+{
+	u8 i;
+
+	for (i = 0; i < zst->nr_comp_pages; ++i) {
+		struct zswap_store_sub_batch_page *sbp = &zst->sub_batch[i];
+		struct zpool *zpool;
+		unsigned long handle;
+		char *buf;
+		gfp_t gfp;
+		int err;
+
+		/* Skip pages that had compress errors. */
+		if (sbp->error)
+			continue;
+
+		zpool = zst->pool->zpool;
+		gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
+		if (zpool_malloc_support_movable(zpool))
+			gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
+		err = zpool_malloc(zpool, zst->comp_dlens[i], gfp, &handle);
+
+		if (err) {
+			if (err == -ENOSPC)
+				zswap_reject_compress_poor++;
+			else
+				zswap_reject_alloc_fail++;
+
+			/*
+			 * An error should be propagated to other pages of the
+			 * same folio in the sub-batch, and zpool resources for
+			 * those pages (in sub-batch order prior to this zpool
+			 * error) should be de-allocated.
+			 */
+			zswap_store_propagate_errors(zst, sbp->batch_idx);
+			continue;
+		}
+
+		buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
+		memcpy(buf, zst->comp_dsts[i], zst->comp_dlens[i]);
+		zpool_unmap_handle(zpool, handle);
+
+		sbp->entry->handle = handle;
+		sbp->entry->length = zst->comp_dlens[i];
+	}
+}
+
+/*
+ * Returns true if the entry was successfully
+ * stored in the xarray, and false otherwise.
+ */
+static bool zswap_store_entry(swp_entry_t page_swpentry,
+			      struct zswap_entry *entry)
+{
+	struct zswap_entry *old = xa_store(swap_zswap_tree(page_swpentry),
+					   swp_offset(page_swpentry),
+					   entry, GFP_KERNEL);
+	if (xa_is_err(old)) {
+		int err = xa_err(old);
+
+		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
+		zswap_reject_alloc_fail++;
+		return false;
+	}
+
+	/*
+	 * We may have had an existing entry that became stale when
+	 * the folio was redirtied and now the new version is being
+	 * swapped out. Get rid of the old.
+	 */
+	if (old)
+		zswap_entry_free(old);
+
+	return true;
+}
+
+static void zswap_batch_compress_post_proc(
+	struct zswap_store_pipeline_state *zst)
+{
+	int nr_objcg_pages = 0, nr_pages = 0;
+	struct obj_cgroup *objcg = NULL;
+	size_t compressed_bytes = 0;
+	u8 i;
+
+	zswap_zpool_store_sub_batch(zst);
+
+	for (i = 0; i < zst->nr_comp_pages; ++i) {
+		struct zswap_store_sub_batch_page *sbp = &zst->sub_batch[i];
+
+		if (sbp->error)
+			continue;
+
+		if (!zswap_store_entry(sbp->swpentry, sbp->entry)) {
+			zswap_store_propagate_errors(zst, sbp->batch_idx);
+			continue;
+		}
+
+		/*
+		 * The entry is successfully compressed and stored in the tree,
+		 * there is no further possibility of failure. Grab refs to the
+		 * pool and objcg. These refs will be dropped by
+		 * zswap_entry_free() when the entry is removed from the tree.
+		 */
+		zswap_pool_get(zst->pool);
+		if (sbp->objcg)
+			obj_cgroup_get(sbp->objcg);
+
+		/*
+		 * We finish initializing the entry while it's already in xarray.
+		 * This is safe because:
+		 *
+		 * 1. Concurrent stores and invalidations are excluded by folio
+		 *    lock.
+		 *
+		 * 2. Writeback is excluded by the entry not being on the LRU yet.
+		 *    The publishing order matters to prevent writeback from seeing
+		 *    an incoherent entry.
+		 */
+		sbp->entry->pool = zst->pool;
+		sbp->entry->swpentry = sbp->swpentry;
+		sbp->entry->objcg = sbp->objcg;
+		sbp->entry->referenced = true;
+		if (sbp->entry->length) {
+			INIT_LIST_HEAD(&sbp->entry->lru);
+			zswap_lru_add(&zswap_list_lru, sbp->entry);
+		}
+
+		if (!objcg && sbp->objcg) {
+			objcg = sbp->objcg;
+		} else if (objcg && sbp->objcg && (objcg != sbp->objcg)) {
+			obj_cgroup_charge_zswap(objcg, compressed_bytes);
+			count_objcg_events(objcg, ZSWPOUT, nr_objcg_pages);
+			compressed_bytes = 0;
+			nr_objcg_pages = 0;
+			objcg = sbp->objcg;
+		}
+
+		if (sbp->objcg) {
+			compressed_bytes += sbp->entry->length;
+			++nr_objcg_pages;
+		}
+
+		++nr_pages;
+	} /* for sub-batch pages. */
+
+	if (objcg) {
+		obj_cgroup_charge_zswap(objcg, compressed_bytes);
+		count_objcg_events(objcg, ZSWPOUT, nr_objcg_pages);
+	}
+
+	atomic_long_add(nr_pages, &zswap_stored_pages);
+	count_vm_events(ZSWPOUT, nr_pages);
+}
+
+static void zswap_store_sub_batch(struct zswap_store_pipeline_state *zst)
+{
+	u8 i;
+
+	for (i = 0; i < zst->nr_comp_pages; ++i) {
+		zst->comp_dsts[i] = zst->acomp_ctx->buffers[i];
+		zst->comp_dlens[i] = PAGE_SIZE;
+	} /* for sub-batch pages. */
+
+	/*
+	 * Batch compress sub-batch "N". If IAA is the compressor, the
+	 * hardware will compress multiple pages in parallel.
+	 */
+	zswap_compress_batch(zst);
+
+	zswap_batch_compress_post_proc(zst);
+}
+
+static void zswap_add_folio_pages_to_sb(
+	struct zswap_store_pipeline_state *zst,
+	struct folio *folio,
+	u8 batch_idx,
+	struct obj_cgroup *objcg,
+	struct zswap_entry *entries[],
+	long start_idx,
+	u8 add_nr_pages)
+{
+	long index;
+
+	for (index = start_idx; index < (start_idx + add_nr_pages); ++index) {
+		u8 i = zst->nr_comp_pages;
+		struct zswap_store_sub_batch_page *sbp = &zst->sub_batch[i];
+		struct page *page = folio_page(folio, index);
+		zst->comp_pages[i] = page;
+		sbp->swpentry = page_swap_entry(page);
+		sbp->batch_idx = batch_idx;
+		sbp->objcg = objcg;
+		sbp->entry = entries[index - start_idx];
+		sbp->error = 0;
+		++zst->nr_comp_pages;
+	}
+}
+
+static __always_inline void zswap_store_reset_sub_batch(
+	struct zswap_store_pipeline_state *zst)
+{
+	zst->nr_comp_pages = 0;
+}
+
+/* Allocate entries for the next sub-batch. */
+static int zswap_alloc_entries(u8 nr_entries,
+			       struct zswap_entry *entries[],
+			       int node_id)
+{
+	u8 i;
+
+	for (i = 0; i < nr_entries; ++i) {
+		entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
+		if (!entries[i]) {
+			u8 j;
+
+			zswap_reject_kmemcache_fail++;
+			for (j = 0; j < i; ++j)
+				zswap_entry_cache_free(entries[j]);
+			return -EINVAL;
+		}
+
+		entries[i]->handle = (unsigned long)ERR_PTR(-EINVAL);
+	}
+
+	return 0;
+}
+
+/*
+ * If the zswap store fails or zswap is disabled, we must invalidate
+ * the possibly stale entries which were previously stored at the
+ * offsets corresponding to each page of the folio. Otherwise,
+ * writeback could overwrite the new data in the swapfile.
+ */
+static void zswap_delete_stored_entries(struct folio *folio)
+{
+	swp_entry_t swp = folio->swap;
+	unsigned type = swp_type(swp);
+	pgoff_t offset = swp_offset(swp);
+	struct zswap_entry *entry;
+	struct xarray *tree;
+	long index;
+
+	for (index = 0; index < folio_nr_pages(folio); ++index) {
+		tree = swap_zswap_tree(swp_entry(type, offset + index));
+		entry = xa_erase(tree, offset + index);
+		if (entry)
+			zswap_entry_free(entry);
+	}
+}
+
+static void zswap_store_process_folio_errors(
+	struct folio **folios,
+	int *errors,
+	unsigned int nr_folios)
+{
+	u8 batch_idx;
+
+	for (batch_idx = 0; batch_idx < nr_folios; ++batch_idx)
+		if (errors[batch_idx])
+			zswap_delete_stored_entries(folios[batch_idx]);
+}
+
+/*
+ * Store a (batch of) any-order large folio(s) in zswap. Each folio will be
+ * broken into sub-batches of SWAP_CRYPTO_BATCH_SIZE pages; each sub-batch
+ * is compressed in parallel by IAA and stored in the zpool/xarray.
+ *
+ * This is the main procedure for batching of folios, and for batching within
+ * large folios.
+ *
+ * This procedure should only be called if zswap supports batching of stores.
+ * Otherwise, the sequential implementation for storing folios as in the
+ * current zswap_store() should be used.
+ *
+ * The signature of this procedure is meant to allow the calling function
+ * (for instance, swap_writepage()) to pass an array @folios
+ * (the "reclaim batch") of @nr_folios folios to be stored in zswap.
+ * All folios in the batch must have the same swap type and folio_nid @node_id
+ * (simplifying assumptions only to manage code complexity).
+ *
+ * @errors and @folios each have @nr_folios entries, with a one-to-one
+ * correspondence (@errors[i] represents the error status of @folios[i],
+ * for each i in 0..@nr_folios-1).
+ * The calling function (for instance, swap_writepage()) should initialize
+ * @errors[i] to a non-0 value.
+ * If zswap successfully stores @folios[i], it will set @errors[i] to 0.
+ * If there is an error in zswap, it will set @errors[i] to -EINVAL.
+ */
+static bool __zswap_store_batch_core(
+	int node_id,
+	struct folio **folios,
+	int *errors,
+	unsigned int nr_folios)
+{
+	struct zswap_store_sub_batch_page sub_batch[SWAP_CRYPTO_BATCH_SIZE];
+	struct page *comp_pages[SWAP_CRYPTO_BATCH_SIZE];
+	u8 *comp_dsts[SWAP_CRYPTO_BATCH_SIZE] = { NULL };
+	unsigned int comp_dlens[SWAP_CRYPTO_BATCH_SIZE];
+	int comp_errors[SWAP_CRYPTO_BATCH_SIZE];
+	struct crypto_acomp_ctx *acomp_ctx, *acomp_batch_ctx;
+	struct zswap_pool *pool;
+	/*
+	 * For now, let's say a max of 256 large folios can be reclaimed
+	 * at a time, as a batch. If this exceeds 256, change this to u16.
+	 */
+	u8 batch_idx;
+
+	/* Initialize the compress batching pipeline state. */
+	struct zswap_store_pipeline_state zst = {
+		.errors = errors,
+		.pool = NULL,
+		.acomp_ctx = NULL,
+		.sub_batch = sub_batch,
+		.comp_pages = comp_pages,
+		.comp_dsts = comp_dsts,
+		.comp_dlens = comp_dlens,
+		.comp_errors = comp_errors,
+		.nr_comp_pages = 0,
+	};
+
+	pool = zswap_pool_current_get();
+	if (!pool) {
+		if (zswap_check_limits())
+			queue_work(shrink_wq, &zswap_shrink_work);
+		goto check_old;
+	}
+
+	/*
+	 * Caller should make sure that __zswap_store_batch_core() is
+	 * invoked only if sysctl vm.compress-batching is set to "1".
+	 *
+	 * Verify if we are still on the same cpu for which batching
+	 * resources in acomp_batch_ctx were allocated in zswap_store().
+	 * If not, return to zswap_store() for sequential store of the folio.
+	 */
+	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
+	mutex_lock(&acomp_ctx->mutex);
+
+	acomp_batch_ctx = raw_cpu_ptr(pool->acomp_batch_ctx);
+	if (!acomp_batch_ctx || !acomp_batch_ctx->nr_reqs) {
+		mutex_unlock(&acomp_ctx->mutex);
+		zswap_pool_put(pool);
+		return false;
+	}
+
+	mutex_lock(&acomp_batch_ctx->mutex);
+	mutex_unlock(&acomp_ctx->mutex);
+
+	zst.pool = pool;
+	zst.acomp_ctx = acomp_batch_ctx;
+
+	/*
+	 * Iterate over the folios passed in. Construct sub-batches of up to
+	 * SWAP_CRYPTO_BATCH_SIZE pages, if necessary, by iterating through
+	 * multiple folios from the input "folios". Process each sub-batch
+	 * with IAA batch compression. Detect errors from batch compression
+	 * and set the impacted folio's error status (this happens in
+	 * zswap_store_propagate_errors()).
+	 */
+	for (batch_idx = 0; batch_idx < nr_folios; ++batch_idx) {
+		struct folio *folio = folios[batch_idx];
+		BUG_ON(!folio);
+		long folio_start_idx, nr_pages = folio_nr_pages(folio);
+		struct zswap_entry *entries[SWAP_CRYPTO_BATCH_SIZE];
+		struct obj_cgroup *objcg = NULL;
+		struct mem_cgroup *memcg = NULL;
+
+		VM_WARN_ON_ONCE(!folio_test_locked(folio));
+		VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
+
+		/*
+		 * If zswap is disabled, we must invalidate the possibly stale entry
+		 * which was previously stored at this offset. Otherwise, writeback
+		 * could overwrite the new data in the swapfile.
+		 */
+		if (!zswap_enabled)
+			continue;
+
+		/* Check cgroup limits */
+		objcg = get_obj_cgroup_from_folio(folio);
+		if (objcg && !obj_cgroup_may_zswap(objcg)) {
+			memcg = get_mem_cgroup_from_objcg(objcg);
+			if (shrink_memcg(memcg)) {
+				mem_cgroup_put(memcg);
+				goto put_objcg;
+			}
+			mem_cgroup_put(memcg);
+		}
+
+		if (zswap_check_limits())
+			goto put_objcg;
+
+		if (objcg) {
+			memcg = get_mem_cgroup_from_objcg(objcg);
+			if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
+				mem_cgroup_put(memcg);
+				goto put_objcg;
+			}
+			mem_cgroup_put(memcg);
+		}
+
+		/*
+		 * By default, set zswap status to "success" for use in
+		 * swap_writepage() when this returns. In case of errors,
+		 * a negative error number will overwrite this when
+		 * zswap_store_propagate_errors() is called.
+		 */
+		errors[batch_idx] = 0;
+
+		folio_start_idx = 0;
+
+		while (nr_pages > 0) {
+			u8 add_nr_pages;
+
+			/*
+			 * If we have accumulated SWAP_CRYPTO_BATCH_SIZE
+			 * pages, process the sub-batch: it could contain pages
+			 * from multiple folios.
+			 */
+			if (zst.nr_comp_pages == SWAP_CRYPTO_BATCH_SIZE) {
+				zswap_store_sub_batch(&zst);
+				zswap_store_reset_sub_batch(&zst);
+				/*
+				 * Stop processing this folio if it had
+				 * compress errors.
+				 */
+				if (errors[batch_idx])
+					goto put_objcg;
+			}
+
+			add_nr_pages = min3((long)SWAP_CRYPTO_BATCH_SIZE -
+					    (long)zst.nr_comp_pages,
+					    nr_pages,
+					    (long)SWAP_CRYPTO_BATCH_SIZE);
+
+			/*
+			 * Allocate zswap_entries for this sub-batch. If we
+			 * get errors while doing so, we can flag an error
+			 * for the folio, call the shrinker and move on.
+			 */
+			if (zswap_alloc_entries(add_nr_pages,
+						entries, node_id)) {
+				zswap_store_reset_sub_batch(&zst);
+				errors[batch_idx] = -EINVAL;
+				goto put_objcg;
+			}
+
+			zswap_add_folio_pages_to_sb(
+				&zst,
+				folio,
+				batch_idx,
+				objcg,
+				entries,
+				folio_start_idx,
+				add_nr_pages);
+
+			nr_pages -= add_nr_pages;
+			folio_start_idx += add_nr_pages;
+		} /* this folio has pages to be compressed. */
+
+		obj_cgroup_put(objcg);
+		continue;
+
+put_objcg:
+		obj_cgroup_put(objcg);
+		if (zswap_pool_reached_full)
+			queue_work(shrink_wq, &zswap_shrink_work);
+	} /* for batch folios */
+
+	if (!zswap_enabled)
+		goto check_old;
+
+	/*
+	 * Process last sub-batch: it could contain pages from
+	 * multiple folios.
+	 */
+	if (zst.nr_comp_pages)
+		zswap_store_sub_batch(&zst);
+
+	mutex_unlock(&acomp_batch_ctx->mutex);
+	zswap_pool_put(pool);
+check_old:
+	zswap_store_process_folio_errors(folios, errors, nr_folios);
+	return true;
+}
+
 bool zswap_load(struct folio *folio)
 {
 	swp_entry_t swp = folio->swap;
-- 
2.27.0



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 12/13] mm: Add sysctl vm.compress-batching switch for compress batching during swapout.
  2024-11-06 19:21 ` [PATCH v3 12/13] mm: Add sysctl vm.compress-batching switch for compress batching during swapout Kanchana P Sridhar
@ 2024-11-06 20:17   ` Andrew Morton
  2024-11-06 20:39     ` Sridhar, Kanchana P
  2024-11-07 17:34   ` Johannes Weiner
  1 sibling, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2024-11-06 20:17 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi, zanussi, wajdi.k.feghali, vinodh.gopal

On Wed,  6 Nov 2024 11:21:04 -0800 Kanchana P Sridhar <kanchana.p.sridhar@intel.com> wrote:

>  extern int sysctl_legacy_va_layout;
> +extern unsigned int compress_batching;

nit: I suggest calling this "sysctl_compress_batching".  See how we
treated sysctl_legacy_va_layout.

> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -47,6 +47,9 @@
>  int page_cluster;
>  const int page_cluster_max = 31;
>  
> +/* Enable/disable compress batching during swapout. */
> +unsigned int compress_batching;
> +
>  struct cpu_fbatches {
>  	/*
>  	 * The following folio batches are grouped together because they are protected
> @@ -1074,4 +1077,7 @@ void __init swap_setup(void)
>  	 * Right now other parts of the system means that we
>  	 * _really_ don't want to cluster much more
>  	 */
> +
> +	/* Disable compress batching during swapout by default. */
> +	compress_batching = 0;

Not really needed?  The compiler already did that.

>  }
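
To make the nit concrete, a minimal sketch of what the rename could look
like (illustrative only, not taken from the posted patch):

	/* alongside sysctl_legacy_va_layout, e.g. in include/linux/mm.h */
	extern unsigned int sysctl_compress_batching;

	/* mm/swap.c */
	/* Enable/disable compress batching during swapout; off by default. */
	unsigned int sysctl_compress_batching;

File-scope variables are zero-initialized, so the explicit assignment in
swap_setup() can simply be dropped.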



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 00/13] zswap IAA compress batching
  2024-11-06 19:20 [PATCH v3 00/13] zswap IAA compress batching Kanchana P Sridhar
                   ` (12 preceding siblings ...)
  2024-11-06 19:21 ` [PATCH v3 13/13] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios Kanchana P Sridhar
@ 2024-11-06 20:25 ` Andrew Morton
  2024-11-06 20:44   ` Sridhar, Kanchana P
  13 siblings, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2024-11-06 20:25 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi, zanussi, wajdi.k.feghali, vinodh.gopal

On Wed,  6 Nov 2024 11:20:52 -0800 Kanchana P Sridhar <kanchana.p.sridhar@intel.com> wrote:

> IAA Compression Batching:

hm, is this a crypto patchset or a zswap patchset?

Thanks.  Unless someone stops me I think I'll add this to mm.git after
6.13-rc1 is released.  To get it additional testing exposure while
review proceeds.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v3 12/13] mm: Add sysctl vm.compress-batching switch for compress batching during swapout.
  2024-11-06 20:17   ` Andrew Morton
@ 2024-11-06 20:39     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 38+ messages in thread
From: Sridhar, Kanchana P @ 2024-11-06 20:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, Huang, Ying, 21cnbao,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P

Hi Andrew,

> -----Original Message-----
> From: Andrew Morton <akpm@linux-foundation.org>
> Sent: Wednesday, November 6, 2024 12:18 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> 21cnbao@gmail.com; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> zanussi@kernel.org; Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal,
> Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v3 12/13] mm: Add sysctl vm.compress-batching switch
> for compress batching during swapout.
> 
> On Wed,  6 Nov 2024 11:21:04 -0800 Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> 
> >  extern int sysctl_legacy_va_layout;
> > +extern unsigned int compress_batching;
> 
> nit: I suggest calling this "sysctl_compress_batching".  See how we
> treated sysctl_legacy_va_layout.

Thanks for the code review comments. Sure, I will incorporate this in v4.

> 
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -47,6 +47,9 @@
> >  int page_cluster;
> >  const int page_cluster_max = 31;
> >
> > +/* Enable/disable compress batching during swapout. */
> > +unsigned int compress_batching;
> > +
> >  struct cpu_fbatches {
> >  	/*
> >  	 * The following folio batches are grouped together because they are
> protected
> > @@ -1074,4 +1077,7 @@ void __init swap_setup(void)
> >  	 * Right now other parts of the system means that we
> >  	 * _really_ don't want to cluster much more
> >  	 */
> > +
> > +	/* Disable compress batching during swapout by default. */
> > +	compress_batching = 0;
> 
> Not really needed?  The compiler already did that.

Sure, will address this in v4.

Thanks,
Kanchana

> 
> >  }



^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v3 00/13] zswap IAA compress batching
  2024-11-06 20:25 ` [PATCH v3 00/13] zswap IAA compress batching Andrew Morton
@ 2024-11-06 20:44   ` Sridhar, Kanchana P
  0 siblings, 0 replies; 38+ messages in thread
From: Sridhar, Kanchana P @ 2024-11-06 20:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, Huang, Ying, 21cnbao,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P


> -----Original Message-----
> From: Andrew Morton <akpm@linux-foundation.org>
> Sent: Wednesday, November 6, 2024 12:26 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> 21cnbao@gmail.com; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> zanussi@kernel.org; Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal,
> Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v3 00/13] zswap IAA compress batching
> 
> On Wed,  6 Nov 2024 11:20:52 -0800 Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> 
> > IAA Compression Batching:
> 
> hm, is this a crypto patchset or a zswap patchset?

Thanks, Andrew, for this observation. Since this patch-series attempts to
improve zswap_store() latency for large folios using Intel IAA hardware
acceleration, it contains patches in both crypto and zswap; keeping them
together is meant to give reviewers the full context.

I am OK with organizing the patch-series differently if that makes better
sense, and would appreciate suggestions in this regard.

Thanks,
Kanchana



> 
> Thanks.  Unless someone stops me I think I'll add this to mm.git after
> 6.13-rc1 is released.  To get it additional testing exposure while
> review proceeds.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 09/13] mm: zswap: Modify struct crypto_acomp_ctx to be configurable in nr of acomp_reqs.
  2024-11-06 19:21 ` [PATCH v3 09/13] mm: zswap: Modify struct crypto_acomp_ctx to be configurable in nr of acomp_reqs Kanchana P Sridhar
@ 2024-11-07 17:20   ` Johannes Weiner
  2024-11-07 22:21     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 38+ messages in thread
From: Johannes Weiner @ 2024-11-07 17:20 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, yosryahmed, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, ying.huang, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi, zanussi, wajdi.k.feghali, vinodh.gopal

On Wed, Nov 06, 2024 at 11:21:01AM -0800, Kanchana P Sridhar wrote:
> Modified the definition of "struct crypto_acomp_ctx" to represent a
> configurable number of acomp_reqs and the required number of buffers.
> 
> Accordingly, refactored the code that allocates/deallocates the acomp_ctx
> resources, so that it can be called to create a regular acomp_ctx with
> exactly one acomp_req/buffer, for use in the the existing non-batching
> zswap_store(), as well as to create a separate "batching acomp_ctx" with
> multiple acomp_reqs/buffers for IAA compress batching.
> 
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/zswap.c | 149 ++++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 107 insertions(+), 42 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 3e899fa61445..02e031122fdf 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -143,9 +143,10 @@ bool zswap_never_enabled(void)
>  
>  struct crypto_acomp_ctx {
>  	struct crypto_acomp *acomp;
> -	struct acomp_req *req;
> +	struct acomp_req **reqs;
> +	u8 **buffers;
> +	unsigned int nr_reqs;
>  	struct crypto_wait wait;
> -	u8 *buffer;
>  	struct mutex mutex;
>  	bool is_sleepable;
>  };
> @@ -241,6 +242,11 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
>  	pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name,		\
>  		 zpool_get_type((p)->zpool))
>  
> +static int zswap_create_acomp_ctx(unsigned int cpu,
> +				  struct crypto_acomp_ctx *acomp_ctx,
> +				  char *tfm_name,
> +				  unsigned int nr_reqs);

This looks unnecessary.

> +
>  /*********************************
>  * pool functions
>  **********************************/
> @@ -813,69 +819,128 @@ static void zswap_entry_free(struct zswap_entry *entry)
>  /*********************************
>  * compressed storage functions
>  **********************************/
> -static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> +static int zswap_create_acomp_ctx(unsigned int cpu,
> +				  struct crypto_acomp_ctx *acomp_ctx,
> +				  char *tfm_name,
> +				  unsigned int nr_reqs)
>  {
> -	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> -	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
>  	struct crypto_acomp *acomp;
> -	struct acomp_req *req;
> -	int ret;
> +	int ret = -ENOMEM;
> +	int i, j;
>  
> +	acomp_ctx->nr_reqs = 0;
>  	mutex_init(&acomp_ctx->mutex);
>  
> -	acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
> -	if (!acomp_ctx->buffer)
> -		return -ENOMEM;
> -
> -	acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
> +	acomp = crypto_alloc_acomp_node(tfm_name, 0, 0, cpu_to_node(cpu));
>  	if (IS_ERR(acomp)) {
>  		pr_err("could not alloc crypto acomp %s : %ld\n",
> -				pool->tfm_name, PTR_ERR(acomp));
> -		ret = PTR_ERR(acomp);
> -		goto acomp_fail;
> +				tfm_name, PTR_ERR(acomp));
> +		return PTR_ERR(acomp);
>  	}
> +
>  	acomp_ctx->acomp = acomp;
>  	acomp_ctx->is_sleepable = acomp_is_async(acomp);
>  
> -	req = acomp_request_alloc(acomp_ctx->acomp);
> -	if (!req) {
> -		pr_err("could not alloc crypto acomp_request %s\n",
> -		       pool->tfm_name);
> -		ret = -ENOMEM;
> +	acomp_ctx->buffers = kmalloc_node(nr_reqs * sizeof(u8 *),
> +					  GFP_KERNEL, cpu_to_node(cpu));
> +	if (!acomp_ctx->buffers)
> +		goto buf_fail;
> +
> +	for (i = 0; i < nr_reqs; ++i) {
> +		acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2,
> +						     GFP_KERNEL, cpu_to_node(cpu));
> +		if (!acomp_ctx->buffers[i]) {
> +			for (j = 0; j < i; ++j)
> +				kfree(acomp_ctx->buffers[j]);
> +			kfree(acomp_ctx->buffers);
> +			ret = -ENOMEM;
> +			goto buf_fail;
> +		}
> +	}
> +
> +	acomp_ctx->reqs = kmalloc_node(nr_reqs * sizeof(struct acomp_req *),
> +				       GFP_KERNEL, cpu_to_node(cpu));
> +	if (!acomp_ctx->reqs)
>  		goto req_fail;
> +
> +	for (i = 0; i < nr_reqs; ++i) {
> +		acomp_ctx->reqs[i] = acomp_request_alloc(acomp_ctx->acomp);
> +		if (!acomp_ctx->reqs[i]) {
> +			pr_err("could not alloc crypto acomp_request reqs[%d] %s\n",
> +			       i, tfm_name);
> +			for (j = 0; j < i; ++j)
> +				acomp_request_free(acomp_ctx->reqs[j]);
> +			kfree(acomp_ctx->reqs);
> +			ret = -ENOMEM;
> +			goto req_fail;
> +		}
>  	}
> -	acomp_ctx->req = req;
>  
> +	/*
> +	 * The crypto_wait is used only in fully synchronous, i.e., with scomp
> +	 * or non-poll mode of acomp, hence there is only one "wait" per
> +	 * acomp_ctx, with callback set to reqs[0], under the assumption that
> +	 * there is at least 1 request per acomp_ctx.
> +	 */
>  	crypto_init_wait(&acomp_ctx->wait);
>  	/*
>  	 * if the backend of acomp is async zip, crypto_req_done() will wakeup
>  	 * crypto_wait_req(); if the backend of acomp is scomp, the callback
>  	 * won't be called, crypto_wait_req() will return without blocking.
>  	 */
> -	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
> +	acomp_request_set_callback(acomp_ctx->reqs[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
>  				   crypto_req_done, &acomp_ctx->wait);
>  
> +	acomp_ctx->nr_reqs = nr_reqs;
>  	return 0;
>  
>  req_fail:
> +	for (i = 0; i < nr_reqs; ++i)
> +		kfree(acomp_ctx->buffers[i]);
> +	kfree(acomp_ctx->buffers);
> +buf_fail:
>  	crypto_free_acomp(acomp_ctx->acomp);
> -acomp_fail:
> -	kfree(acomp_ctx->buffer);
>  	return ret;
>  }
>  
> -static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
> +static void zswap_delete_acomp_ctx(struct crypto_acomp_ctx *acomp_ctx)
>  {
> -	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> -	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> -
>  	if (!IS_ERR_OR_NULL(acomp_ctx)) {
> -		if (!IS_ERR_OR_NULL(acomp_ctx->req))
> -			acomp_request_free(acomp_ctx->req);
> +		int i;
> +
> +		for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> +			if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
> +				acomp_request_free(acomp_ctx->reqs[i]);
> +		kfree(acomp_ctx->reqs);
> +
> +		for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> +			kfree(acomp_ctx->buffers[i]);
> +		kfree(acomp_ctx->buffers);
> +
>  		if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
>  			crypto_free_acomp(acomp_ctx->acomp);
> -		kfree(acomp_ctx->buffer);
> +
> +		acomp_ctx->nr_reqs = 0;
> +		acomp_ctx = NULL;
>  	}
> +}
> +
> +static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> +{
> +	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> +	struct crypto_acomp_ctx *acomp_ctx;
> +
> +	acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> +	return zswap_create_acomp_ctx(cpu, acomp_ctx, pool->tfm_name, 1);
> +}
> +
> +static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
> +{
> +	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> +	struct crypto_acomp_ctx *acomp_ctx;
> +
> +	acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> +	zswap_delete_acomp_ctx(acomp_ctx);
>  
>  	return 0;
>  }

There are no other callers to these functions. Just do the work
directly in the cpu callbacks here like it used to be.

Otherwise it looks good to me.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 11/13] mm: zswap: Allocate acomp_batch_ctx resources for a given zswap_pool.
  2024-11-06 19:21 ` [PATCH v3 11/13] mm: zswap: Allocate acomp_batch_ctx resources for a given zswap_pool Kanchana P Sridhar
@ 2024-11-07 17:31   ` Johannes Weiner
  2024-11-07 22:22     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 38+ messages in thread
From: Johannes Weiner @ 2024-11-07 17:31 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, yosryahmed, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, ying.huang, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi, zanussi, wajdi.k.feghali, vinodh.gopal

On Wed, Nov 06, 2024 at 11:21:03AM -0800, Kanchana P Sridhar wrote:
> If the zswap_pool is associated with an acomp_alg/crypto_acomp that has
> registered batch_compress() and batch_decompress() API, we can allocate the
> necessary batching resources for the pool's acomp_batch_ctx.
> 
> This patch makes the above determination on incurring the per-cpu memory
> footprint cost for batching, and if so, goes ahead and allocates
> SWAP_CRYPTO_BATCH_SIZE (i.e. 8) acomp_reqs/buffers for the
> pool->acomp_batch_ctx on that specific cpu.
> 
> It also "remembers" the pool's batching readiness as a result of the above,
> through a new
> 
>    	enum batch_comp_status can_batch_comp;
> 
> member added to struct zswap_pool, for fast retrieval during
> zswap_store().
> 
> This allows us a way to only incur the memory footprint cost of the
> pool->acomp_batch_ctx resources for a given cpu on which zswap_store()
> needs to process a large folio.
> 
> Suggested-by: Yosry Ahmed <yosryahmed@google.com>
> Suggested-by: Ying Huang <ying.huang@intel.com>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>

A general observation: this is a lot of code for a hardware specific
feature that many CPUs and architectures do not support. Please keep
the code self-contained, and wrap struct members and functions in a
new CONFIG option, so that not everybody has to compile this in.
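
As a rough illustration of the wrapping meant here (a possible
CONFIG_ZSWAP_BATCH symbol; the stub is an assumption, not code from the
series):

	#ifdef CONFIG_ZSWAP_BATCH
	static bool zswap_batch_store(struct folio *folio);
	#else
	static inline bool zswap_batch_store(struct folio *folio)
	{
		return false;	/* callers fall back to the sequential path */
	}
	#endif

The per-cpu acomp_batch_ctx member and the batching helpers would be
guarded the same way, so non-IAA builds carry none of this.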


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 12/13] mm: Add sysctl vm.compress-batching switch for compress batching during swapout.
  2024-11-06 19:21 ` [PATCH v3 12/13] mm: Add sysctl vm.compress-batching switch for compress batching during swapout Kanchana P Sridhar
  2024-11-06 20:17   ` Andrew Morton
@ 2024-11-07 17:34   ` Johannes Weiner
  2024-11-07 22:24     ` Sridhar, Kanchana P
  2024-11-08 20:23     ` Yosry Ahmed
  1 sibling, 2 replies; 38+ messages in thread
From: Johannes Weiner @ 2024-11-07 17:34 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, yosryahmed, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, ying.huang, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi, zanussi, wajdi.k.feghali, vinodh.gopal

On Wed, Nov 06, 2024 at 11:21:04AM -0800, Kanchana P Sridhar wrote:
> The sysctl vm.compress-batching parameter is 0 by default. If the platform
> has Intel IAA, the user can run experiments with IAA compress batching of
> large folios in zswap_store() as follows:
> 
> sysctl vm.compress-batching=1
> echo deflate-iaa > /sys/module/zswap/parameters/compressor

A sysctl seems uncalled for. Can't the batching code be gated on
deflate-iaa being the compressor? It can still be generalized later if
another compressor is shown to benefit from batching.
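
Purely as an illustration of that gating (the strcmp test and the
zswap_batch_store() entry point are assumptions, not code from the
series):

	/* In zswap_store(), instead of consulting a sysctl: */
	if (folio_test_large(folio) &&
	    !strcmp(pool->tfm_name, "deflate-iaa") &&
	    zswap_pool_can_batch(pool))
		return zswap_batch_store(folio);

A later generalization could instead key off whether the acomp backend
has registered the batch_compress()/batch_decompress() API.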


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 13/13] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios.
  2024-11-06 19:21 ` [PATCH v3 13/13] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios Kanchana P Sridhar
@ 2024-11-07 18:16   ` Johannes Weiner
  2024-11-07 22:32     ` Sridhar, Kanchana P
  2024-11-07 18:53   ` Johannes Weiner
  1 sibling, 1 reply; 38+ messages in thread
From: Johannes Weiner @ 2024-11-07 18:16 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, yosryahmed, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, ying.huang, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi, zanussi, wajdi.k.feghali, vinodh.gopal

On Wed, Nov 06, 2024 at 11:21:05AM -0800, Kanchana P Sridhar wrote:
> If the system has Intel IAA, and if sysctl vm.compress-batching is set to
> "1", zswap_store() will call crypto_acomp_batch_compress() to compress up
> to SWAP_CRYPTO_BATCH_SIZE (i.e. 8) pages in large folios in parallel using
> the multiple compress engines available in IAA hardware.
> 
> On platforms with multiple IAA devices per socket, compress jobs from all
> cores in a socket will be distributed among all IAA devices on the socket
> by the iaa_crypto driver.
> 
> With deflate-iaa configured as the zswap compressor, and
> sysctl vm.compress-batching is enabled, the first time zswap_store() has to
> swapout a large folio on any given cpu, it will allocate the
> pool->acomp_batch_ctx resources on that cpu, and set pool->can_batch_comp
> to BATCH_COMP_ENABLED. It will then proceed to call the main
> __zswap_store_batch_core() compress batching function. Subsequent calls to
> zswap_store() on the same cpu will go ahead and use the acomp_batch_ctx by
> checking the pool->can_batch_comp status.
> 
> Hence, we allocate the per-cpu pool->acomp_batch_ctx resources only on an
> as-needed basis, to reduce memory footprint cost. The cost is not incurred
> on cores that never get to swapout a large folio.
> 
> This patch introduces the main __zswap_store_batch_core() function for
> compress batching. This interface represents the extensible compress
> batching architecture that can potentially be called with a batch of
> any-order folios from shrink_folio_list(). In other words, although
> zswap_store() calls __zswap_store_batch_core() with exactly one large folio
> in this patch, we can reuse this interface to reclaim a batch of folios, to
> significantly improve the reclaim path efficiency due to IAA's parallel
> compression capability.
> 
> The newly added functions that implement batched stores follow the
> general structure of zswap_store() of a large folio. Some amount of
> restructuring and optimization is done to minimize failure points
> for a batch, fail early and maximize the zswap store pipeline occupancy
> with SWAP_CRYPTO_BATCH_SIZE pages, potentially from multiple
> folios. This is intended to maximize reclaim throughput with the IAA
> hardware parallel compressions.
> 
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  include/linux/zswap.h |  84 ++++++
>  mm/zswap.c            | 625 ++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 709 insertions(+)
> 
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 9ad27ab3d222..6d3ef4780c69 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -31,6 +31,88 @@ struct zswap_lruvec_state {
>  	atomic_long_t nr_disk_swapins;
>  };
>  
> +/*
> + * struct zswap_store_sub_batch_page:
> + *
> + * This represents one "zswap batching element", namely, the
> + * attributes associated with a page in a large folio that will
> + * be compressed and stored in zswap. The term "batch" is reserved
> + * for a conceptual "batch" of folios that can be sent to
> + * zswap_store() by reclaim. The term "sub-batch" is used to describe
> + * a collection of "zswap batching elements", i.e., an array of
> + * "struct zswap_store_sub_batch_page *".
> + *
> + * The zswap compress sub-batch size is specified by
> + * SWAP_CRYPTO_BATCH_SIZE, currently set as 8UL if the
> + * platform has Intel IAA. This means zswap can store a large folio
> + * by creating sub-batches of up to 8 pages and compressing this
> + * batch using IAA to parallelize the 8 compress jobs in hardware.
> + * For e.g., a 64KB folio can be compressed as 2 sub-batches of
> + * 8 pages each. This can significantly improve the zswap_store()
> + * performance for large folios.
> + *
> + * Although the page itself is represented directly, the structure
> + * adds a "u8 batch_idx" to represent an index for the folio in a
> + * conceptual "batch of folios" that can be passed to zswap_store().
> + * Conceptually, this allows for up to 256 folios that can be passed
> + * to zswap_store(). If this conceptual number of folios sent to
> + * zswap_store() exceeds 256, the "batch_idx" needs to become u16.
> + */
> +struct zswap_store_sub_batch_page {
> +	u8 batch_idx;
> +	swp_entry_t swpentry;
> +	struct obj_cgroup *objcg;
> +	struct zswap_entry *entry;
> +	int error; /* folio error status. */
> +};
> +
> +/*
> + * struct zswap_store_pipeline_state:
> + *
> + * This stores state during IAA compress batching of (conceptually, a batch of)
> + * folios. The term pipelining in this context, refers to breaking down
> + * the batch of folios being reclaimed into sub-batches of
> + * SWAP_CRYPTO_BATCH_SIZE pages, batch compressing and storing the
> + * sub-batch. This concept could be further evolved to use overlap of CPU
> + * computes with IAA computes. For instance, we could stage the post-compress
> + * computes for sub-batch "N-1" to happen in parallel with IAA batch
> + * compression of sub-batch "N".
> + *
> + * We begin by developing the concept of compress batching. Pipelining with
> + * overlap can be future work.
> + *
> + * @errors: The errors status for the batch of reclaim folios passed in from
> + *          a higher mm layer such as swap_writepage().
> + * @pool: A valid zswap_pool.
> + * @acomp_ctx: The per-cpu pointer to the crypto_acomp_ctx for the @pool.
> + * @sub_batch: This is an array that represents the sub-batch of up to
> + *             SWAP_CRYPTO_BATCH_SIZE pages that are being stored
> + *             in zswap.
> + * @comp_dsts: The destination buffers for crypto_acomp_compress() for each
> + *             page being compressed.
> + * @comp_dlens: The destination buffers' lengths from crypto_acomp_compress()
> + *              obtained after crypto_acomp_poll() returns completion status,
> + *              for each page being compressed.
> + * @comp_errors: Compression errors for each page being compressed.
> + * @nr_comp_pages: Total number of pages in @sub_batch.
> + *
> + * Note:
> + * The max sub-batch size is SWAP_CRYPTO_BATCH_SIZE, currently 8UL.
> + * Hence, if SWAP_CRYPTO_BATCH_SIZE exceeds 256, some of the
> + * u8 members (except @comp_dsts) need to become u16.
> + */
> +struct zswap_store_pipeline_state {
> +	int *errors;
> +	struct zswap_pool *pool;
> +	struct crypto_acomp_ctx *acomp_ctx;
> +	struct zswap_store_sub_batch_page *sub_batch;
> +	struct page **comp_pages;
> +	u8 **comp_dsts;
> +	unsigned int *comp_dlens;
> +	int *comp_errors;
> +	u8 nr_comp_pages;
> +};

Why are these in the public header?

>  unsigned long zswap_total_pages(void);
>  bool zswap_store(struct folio *folio);
>  bool zswap_load(struct folio *folio);
> @@ -45,6 +127,8 @@ bool zswap_never_enabled(void);
>  #else
>  
>  struct zswap_lruvec_state {};
> +struct zswap_store_sub_batch_page {};
> +struct zswap_store_pipeline_state {};
> 
>  static inline bool zswap_store(struct folio *folio)
>  {
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 2af736e38213..538aac3fb552 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -255,6 +255,12 @@ static int zswap_create_acomp_ctx(unsigned int cpu,
>  				  char *tfm_name,
>  				  unsigned int nr_reqs);
>  
> +static bool __zswap_store_batch_core(
> +	int node_id,
> +	struct folio **folios,
> +	int *errors,
> +	unsigned int nr_folios);
> +

Please reorder the functions to avoid forward decls.

>  /*********************************
>  * pool functions
>  **********************************/
> @@ -1626,6 +1632,12 @@ static ssize_t zswap_store_page(struct page *page,
>  	return -EINVAL;
>  }
>  
> +/*
> + * Modified to use the IAA compress batching framework implemented in
> + * __zswap_store_batch_core() if sysctl vm.compress-batching is 1.
> + * The batching code is intended to significantly improve folio store
> + * performance over the sequential code.

This isn't helpful, please delete.

>  bool zswap_store(struct folio *folio)
>  {
>  	long nr_pages = folio_nr_pages(folio);
> @@ -1638,6 +1650,38 @@ bool zswap_store(struct folio *folio)
>  	bool ret = false;
>  	long index;
>  
> +	/*
> +	 * Improve large folio zswap_store() latency with IAA compress batching,
> +	 * if this is enabled by setting sysctl vm.compress-batching to "1".
> +	 * If enabled, the large folio's pages are compressed in parallel in
> +	 * batches of SWAP_CRYPTO_BATCH_SIZE pages. If disabled, every page in
> +	 * the large folio is compressed sequentially.
> +	 */

Same here. Reduce to "Try to batch compress large folios, fall back to
processing individual subpages if that fails."

> +	if (folio_test_large(folio) && READ_ONCE(compress_batching)) {
> +		pool = zswap_pool_current_get();

There is an existing zswap_pool_current_get() in zswap_store(), please
reorder the sequence so you don't need to add an extra one.

> +		if (!pool) {
> +			pr_err("Cannot setup acomp_batch_ctx for compress batching: no current pool found\n");

This is unnecessary.

> +			goto sequential_store;
> +		}
> +
> +		if (zswap_pool_can_batch(pool)) {

This function is introduced in another patch, where it isn't
used. Please add functions and callers in the same patch.

> +			int error = -1;
> +			bool store_batch = __zswap_store_batch_core(
> +						folio_nid(folio),
> +						&folio, &error, 1);
> +
> +			if (store_batch) {
> +				zswap_pool_put(pool);
> +				if (!error)
> +					ret = true;
> +				return ret;
> +			}
> +		}

Please don't future proof code like this, only implement what is
strictly necessary for the functionality in this patch. You're only
adding a single caller with nr_folios=1, so it shouldn't be a
parameter, and the function shouldn't have a that batch_idx loop.
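
In other words, the entry point could shrink to something like this
sketch, with the folio array, nr_folios and the batch_idx loop dropped:

	static bool __zswap_store_batch_core(int node_id, struct folio *folio,
					     int *error);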

> +		zswap_pool_put(pool);
> +	}
> +
> +sequential_store:
> +
>  	VM_WARN_ON_ONCE(!folio_test_locked(folio));
>  	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
>  
> @@ -1724,6 +1768,587 @@ bool zswap_store(struct folio *folio)
>  	return ret;
>  }
>  
> +/*
> + * Note: If SWAP_CRYPTO_BATCH_SIZE exceeds 256, change the
> + * u8 stack variables in the next several functions, to u16.
> + */
> +
> +/*
> + * Propagate the "sbp" error condition to other batch elements belonging to
> + * the same folio as "sbp".
> + */
> +static __always_inline void zswap_store_propagate_errors(
> +	struct zswap_store_pipeline_state *zst,
> +	u8 error_batch_idx)
> +{

Please observe surrounding coding style on how to wrap >80 col
function signatures.

Don't use __always_inline unless there is a clear, spelled out
performance reason. Since it's an error path, that's doubtful.

Please use a consistent namespace for all this:

CONFIG_ZSWAP_BATCH
zswap_batch_store()
zswap_batch_alloc_entries()
zswap_batch_add_folios()
zswap_batch_compress()

etc.

Again, order to avoid forward decls.

Try to keep the overall sequence of events between zswap_store() and
zswap_batch_store() as similar as possible for readability.
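
For example, with that namespace and the usual continuation-line
alignment, the error-propagation helper from above might be declared as
(sketch):

	static void zswap_batch_propagate_errors(struct zswap_store_pipeline_state *zst,
						 u8 error_batch_idx);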


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 13/13] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios.
  2024-11-06 19:21 ` [PATCH v3 13/13] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios Kanchana P Sridhar
  2024-11-07 18:16   ` Johannes Weiner
@ 2024-11-07 18:53   ` Johannes Weiner
  2024-11-07 22:50     ` Sridhar, Kanchana P
  1 sibling, 1 reply; 38+ messages in thread
From: Johannes Weiner @ 2024-11-07 18:53 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, yosryahmed, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, ying.huang, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi, zanussi, wajdi.k.feghali, vinodh.gopal

On Wed, Nov 06, 2024 at 11:21:05AM -0800, Kanchana P Sridhar wrote:
> +static void zswap_zpool_store_sub_batch(
> +	struct zswap_store_pipeline_state *zst)

There is a zswap_store_sub_batch() below, which does something
completely different. Naming is hard, but please invest a bit more
time into this to make this readable.

> +{
> +	u8 i;
> +
> +	for (i = 0; i < zst->nr_comp_pages; ++i) {
> +		struct zswap_store_sub_batch_page *sbp = &zst->sub_batch[i];
> +		struct zpool *zpool;
> +		unsigned long handle;
> +		char *buf;
> +		gfp_t gfp;
> +		int err;
> +
> +		/* Skip pages that had compress errors. */
> +		if (sbp->error)
> +			continue;
> +
> +		zpool = zst->pool->zpool;
> +		gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
> +		if (zpool_malloc_support_movable(zpool))
> +			gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
> +		err = zpool_malloc(zpool, zst->comp_dlens[i], gfp, &handle);
> +
> +		if (err) {
> +			if (err == -ENOSPC)
> +				zswap_reject_compress_poor++;
> +			else
> +				zswap_reject_alloc_fail++;
> +
> +			/*
> +			 * An error should be propagated to other pages of the
> +			 * same folio in the sub-batch, and zpool resources for
> +			 * those pages (in sub-batch order prior to this zpool
> +			 * error) should be de-allocated.
> +			 */
> +			zswap_store_propagate_errors(zst, sbp->batch_idx);
> +			continue;
> +		}
> +
> +		buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
> +		memcpy(buf, zst->comp_dsts[i], zst->comp_dlens[i]);
> +		zpool_unmap_handle(zpool, handle);
> +
> +		sbp->entry->handle = handle;
> +		sbp->entry->length = zst->comp_dlens[i];
> +	}
> +}
> +
> +/*
> + * Returns true if the entry was successfully
> + * stored in the xarray, and false otherwise.
> + */
> +static bool zswap_store_entry(swp_entry_t page_swpentry,
> +			      struct zswap_entry *entry)
> +{
> +	struct zswap_entry *old = xa_store(swap_zswap_tree(page_swpentry),
> +					   swp_offset(page_swpentry),
> +					   entry, GFP_KERNEL);
> +	if (xa_is_err(old)) {
> +		int err = xa_err(old);
> +
> +		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> +		zswap_reject_alloc_fail++;
> +		return false;
> +	}
> +
> +	/*
> +	 * We may have had an existing entry that became stale when
> +	 * the folio was redirtied and now the new version is being
> +	 * swapped out. Get rid of the old.
> +	 */
> +	if (old)
> +		zswap_entry_free(old);
> +
> +	return true;
> +}
> +
> +static void zswap_batch_compress_post_proc(
> +	struct zswap_store_pipeline_state *zst)
> +{
> +	int nr_objcg_pages = 0, nr_pages = 0;
> +	struct obj_cgroup *objcg = NULL;
> +	size_t compressed_bytes = 0;
> +	u8 i;
> +
> +	zswap_zpool_store_sub_batch(zst);
> +
> +	for (i = 0; i < zst->nr_comp_pages; ++i) {
> +		struct zswap_store_sub_batch_page *sbp = &zst->sub_batch[i];
> +
> +		if (sbp->error)
> +			continue;
> +
> +		if (!zswap_store_entry(sbp->swpentry, sbp->entry)) {
> +			zswap_store_propagate_errors(zst, sbp->batch_idx);
> +			continue;
> +		}
> +
> +		/*
> +		 * The entry is successfully compressed and stored in the tree,
> +		 * there is no further possibility of failure. Grab refs to the
> +		 * pool and objcg. These refs will be dropped by
> +		 * zswap_entry_free() when the entry is removed from the tree.
> +		 */
> +		zswap_pool_get(zst->pool);
> +		if (sbp->objcg)
> +			obj_cgroup_get(sbp->objcg);
> +
> +		/*
> +		 * We finish initializing the entry while it's already in xarray.
> +		 * This is safe because:
> +		 *
> +		 * 1. Concurrent stores and invalidations are excluded by folio
> +		 *    lock.
> +		 *
> +		 * 2. Writeback is excluded by the entry not being on the LRU yet.
> +		 *    The publishing order matters to prevent writeback from seeing
> +		 *    an incoherent entry.
> +		 */
> +		sbp->entry->pool = zst->pool;
> +		sbp->entry->swpentry = sbp->swpentry;
> +		sbp->entry->objcg = sbp->objcg;
> +		sbp->entry->referenced = true;
> +		if (sbp->entry->length) {
> +			INIT_LIST_HEAD(&sbp->entry->lru);
> +			zswap_lru_add(&zswap_list_lru, sbp->entry);
> +		}
> +
> +		if (!objcg && sbp->objcg) {
> +			objcg = sbp->objcg;
> +		} else if (objcg && sbp->objcg && (objcg != sbp->objcg)) {
> +			obj_cgroup_charge_zswap(objcg, compressed_bytes);
> +			count_objcg_events(objcg, ZSWPOUT, nr_objcg_pages);
> +			compressed_bytes = 0;
> +			nr_objcg_pages = 0;
> +			objcg = sbp->objcg;
> +		}
> +
> +		if (sbp->objcg) {
> +			compressed_bytes += sbp->entry->length;
> +			++nr_objcg_pages;
> +		}
> +
> +		++nr_pages;
> +	} /* for sub-batch pages. */
> +
> +	if (objcg) {
> +		obj_cgroup_charge_zswap(objcg, compressed_bytes);
> +		count_objcg_events(objcg, ZSWPOUT, nr_objcg_pages);
> +	}
> +
> +	atomic_long_add(nr_pages, &zswap_stored_pages);
> +	count_vm_events(ZSWPOUT, nr_pages);
> +}
> +
> +static void zswap_store_sub_batch(struct zswap_store_pipeline_state *zst)
> +{
> +	u8 i;
> +
> +	for (i = 0; i < zst->nr_comp_pages; ++i) {
> +		zst->comp_dsts[i] = zst->acomp_ctx->buffers[i];
> +		zst->comp_dlens[i] = PAGE_SIZE;
> +	} /* for sub-batch pages. */
> +
> +	/*
> +	 * Batch compress sub-batch "N". If IAA is the compressor, the
> +	 * hardware will compress multiple pages in parallel.
> +	 */
> +	zswap_compress_batch(zst);
> +
> +	zswap_batch_compress_post_proc(zst);

The control flow here is a mess. Keep loops over the same batch at the
same function level. IOW, pull the nr_comp_pages loop out of
zswap_batch_compress_post_proc() and call the function from the loop.

Also give it a more descriptive name. If that's hard to do, then
you're probably doing too many different things in it. Create
functions for a specific purpose, don't carve up sequences at
arbitrary points.

My impression after trying to read this is that the existing
zswap_store() sequence could be a subset of the batched store, where
you can reuse most code to get the pool, charge the cgroup, allocate
entries, store entries, bump the stats etc. for both cases. Alas, your
naming choices make it a bit difficult to be sure.

Please explore this direction. Don't worry about the CONFIG symbol for
now, we can still look at this later.

Right now, it's basically

	if (special case)
		lots of duplicative code in slightly different order
	regular store sequence

and that isn't going to be maintainable.

Look for a high-level sequence that makes sense for both cases. E.g.:

	if (!zswap_enabled)
		goto check_old;

	get objcg

	check limits

	allocate memcg list lru

	for each batch {
		for each entry {
			allocate entry
			acquire objcg ref
			acquire pool ref
		}
		compress
		for each entry {
			store in tree
			add to lru
			bump stats and counters
		}
	}

	put objcg

	return true;

check_error:
	...

and then set up the two loops such that they also make sense when the
folio is just a single page.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v3 09/13] mm: zswap: Modify struct crypto_acomp_ctx to be configurable in nr of acomp_reqs.
  2024-11-07 17:20   ` Johannes Weiner
@ 2024-11-07 22:21     ` Sridhar, Kanchana P
  2024-11-08 20:22       ` Yosry Ahmed
  0 siblings, 1 reply; 38+ messages in thread
From: Sridhar, Kanchana P @ 2024-11-07 22:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-kernel, linux-mm, yosryahmed, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, Huang, Ying, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P

Hi Johannes,

> -----Original Message-----
> From: Johannes Weiner <hannes@cmpxchg.org>
> Sent: Thursday, November 7, 2024 9:21 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> 21cnbao@gmail.com; akpm@linux-foundation.org; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; zanussi@kernel.org; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v3 09/13] mm: zswap: Modify struct crypto_acomp_ctx
> to be configurable in nr of acomp_reqs.
> 
> On Wed, Nov 06, 2024 at 11:21:01AM -0800, Kanchana P Sridhar wrote:
> > Modified the definition of "struct crypto_acomp_ctx" to represent a
> > configurable number of acomp_reqs and the required number of buffers.
> >
> > Accordingly, refactored the code that allocates/deallocates the acomp_ctx
> > resources, so that it can be called to create a regular acomp_ctx with
> > exactly one acomp_req/buffer, for use in the existing non-batching
> > zswap_store(), as well as to create a separate "batching acomp_ctx" with
> > multiple acomp_reqs/buffers for IAA compress batching.
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/zswap.c | 149 ++++++++++++++++++++++++++++++++++++++----------
> -----
> >  1 file changed, 107 insertions(+), 42 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 3e899fa61445..02e031122fdf 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -143,9 +143,10 @@ bool zswap_never_enabled(void)
> >
> >  struct crypto_acomp_ctx {
> >  	struct crypto_acomp *acomp;
> > -	struct acomp_req *req;
> > +	struct acomp_req **reqs;
> > +	u8 **buffers;
> > +	unsigned int nr_reqs;
> >  	struct crypto_wait wait;
> > -	u8 *buffer;
> >  	struct mutex mutex;
> >  	bool is_sleepable;
> >  };
> > @@ -241,6 +242,11 @@ static inline struct xarray
> *swap_zswap_tree(swp_entry_t swp)
> >  	pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name,		\
> >  		 zpool_get_type((p)->zpool))
> >
> > +static int zswap_create_acomp_ctx(unsigned int cpu,
> > +				  struct crypto_acomp_ctx *acomp_ctx,
> > +				  char *tfm_name,
> > +				  unsigned int nr_reqs);
> 
> This looks unnecessary.

Thanks for the code review comments. I will make sure to avoid the
forward declarations.

> 
> > +
> >  /*********************************
> >  * pool functions
> >  **********************************/
> > @@ -813,69 +819,128 @@ static void zswap_entry_free(struct
> zswap_entry *entry)
> >  /*********************************
> >  * compressed storage functions
> >  **********************************/
> > -static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node
> *node)
> > +static int zswap_create_acomp_ctx(unsigned int cpu,
> > +				  struct crypto_acomp_ctx *acomp_ctx,
> > +				  char *tfm_name,
> > +				  unsigned int nr_reqs)
> >  {
> > -	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> node);
> > -	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> >acomp_ctx, cpu);
> >  	struct crypto_acomp *acomp;
> > -	struct acomp_req *req;
> > -	int ret;
> > +	int ret = -ENOMEM;
> > +	int i, j;
> >
> > +	acomp_ctx->nr_reqs = 0;
> >  	mutex_init(&acomp_ctx->mutex);
> >
> > -	acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
> cpu_to_node(cpu));
> > -	if (!acomp_ctx->buffer)
> > -		return -ENOMEM;
> > -
> > -	acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0,
> cpu_to_node(cpu));
> > +	acomp = crypto_alloc_acomp_node(tfm_name, 0, 0,
> cpu_to_node(cpu));
> >  	if (IS_ERR(acomp)) {
> >  		pr_err("could not alloc crypto acomp %s : %ld\n",
> > -				pool->tfm_name, PTR_ERR(acomp));
> > -		ret = PTR_ERR(acomp);
> > -		goto acomp_fail;
> > +				tfm_name, PTR_ERR(acomp));
> > +		return PTR_ERR(acomp);
> >  	}
> > +
> >  	acomp_ctx->acomp = acomp;
> >  	acomp_ctx->is_sleepable = acomp_is_async(acomp);
> >
> > -	req = acomp_request_alloc(acomp_ctx->acomp);
> > -	if (!req) {
> > -		pr_err("could not alloc crypto acomp_request %s\n",
> > -		       pool->tfm_name);
> > -		ret = -ENOMEM;
> > +	acomp_ctx->buffers = kmalloc_node(nr_reqs * sizeof(u8 *),
> > +					  GFP_KERNEL, cpu_to_node(cpu));
> > +	if (!acomp_ctx->buffers)
> > +		goto buf_fail;
> > +
> > +	for (i = 0; i < nr_reqs; ++i) {
> > +		acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2,
> > +						     GFP_KERNEL,
> cpu_to_node(cpu));
> > +		if (!acomp_ctx->buffers[i]) {
> > +			for (j = 0; j < i; ++j)
> > +				kfree(acomp_ctx->buffers[j]);
> > +			kfree(acomp_ctx->buffers);
> > +			ret = -ENOMEM;
> > +			goto buf_fail;
> > +		}
> > +	}
> > +
> > +	acomp_ctx->reqs = kmalloc_node(nr_reqs * sizeof(struct acomp_req
> *),
> > +				       GFP_KERNEL, cpu_to_node(cpu));
> > +	if (!acomp_ctx->reqs)
> >  		goto req_fail;
> > +
> > +	for (i = 0; i < nr_reqs; ++i) {
> > +		acomp_ctx->reqs[i] = acomp_request_alloc(acomp_ctx-
> >acomp);
> > +		if (!acomp_ctx->reqs[i]) {
> > +			pr_err("could not alloc crypto acomp_request
> reqs[%d] %s\n",
> > +			       i, tfm_name);
> > +			for (j = 0; j < i; ++j)
> > +				acomp_request_free(acomp_ctx->reqs[j]);
> > +			kfree(acomp_ctx->reqs);
> > +			ret = -ENOMEM;
> > +			goto req_fail;
> > +		}
> >  	}
> > -	acomp_ctx->req = req;
> >
> > +	/*
> > +	 * The crypto_wait is used only in fully synchronous, i.e., with scomp
> > +	 * or non-poll mode of acomp, hence there is only one "wait" per
> > +	 * acomp_ctx, with callback set to reqs[0], under the assumption that
> > +	 * there is at least 1 request per acomp_ctx.
> > +	 */
> >  	crypto_init_wait(&acomp_ctx->wait);
> >  	/*
> >  	 * if the backend of acomp is async zip, crypto_req_done() will
> wakeup
> >  	 * crypto_wait_req(); if the backend of acomp is scomp, the callback
> >  	 * won't be called, crypto_wait_req() will return without blocking.
> >  	 */
> > -	acomp_request_set_callback(req,
> CRYPTO_TFM_REQ_MAY_BACKLOG,
> > +	acomp_request_set_callback(acomp_ctx->reqs[0],
> CRYPTO_TFM_REQ_MAY_BACKLOG,
> >  				   crypto_req_done, &acomp_ctx->wait);
> >
> > +	acomp_ctx->nr_reqs = nr_reqs;
> >  	return 0;
> >
> >  req_fail:
> > +	for (i = 0; i < nr_reqs; ++i)
> > +		kfree(acomp_ctx->buffers[i]);
> > +	kfree(acomp_ctx->buffers);
> > +buf_fail:
> >  	crypto_free_acomp(acomp_ctx->acomp);
> > -acomp_fail:
> > -	kfree(acomp_ctx->buffer);
> >  	return ret;
> >  }
> >
> > -static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node
> *node)
> > +static void zswap_delete_acomp_ctx(struct crypto_acomp_ctx
> *acomp_ctx)
> >  {
> > -	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> node);
> > -	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> >acomp_ctx, cpu);
> > -
> >  	if (!IS_ERR_OR_NULL(acomp_ctx)) {
> > -		if (!IS_ERR_OR_NULL(acomp_ctx->req))
> > -			acomp_request_free(acomp_ctx->req);
> > +		int i;
> > +
> > +		for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> > +			if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
> > +				acomp_request_free(acomp_ctx->reqs[i]);
> > +		kfree(acomp_ctx->reqs);
> > +
> > +		for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> > +			kfree(acomp_ctx->buffers[i]);
> > +		kfree(acomp_ctx->buffers);
> > +
> >  		if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> >  			crypto_free_acomp(acomp_ctx->acomp);
> > -		kfree(acomp_ctx->buffer);
> > +
> > +		acomp_ctx->nr_reqs = 0;
> > +		acomp_ctx = NULL;
> >  	}
> > +}
> > +
> > +static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node
> *node)
> > +{
> > +	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> node);
> > +	struct crypto_acomp_ctx *acomp_ctx;
> > +
> > +	acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> > +	return zswap_create_acomp_ctx(cpu, acomp_ctx, pool->tfm_name,
> 1);
> > +}
> > +
> > +static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node
> *node)
> > +{
> > +	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> node);
> > +	struct crypto_acomp_ctx *acomp_ctx;
> > +
> > +	acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> > +	zswap_delete_acomp_ctx(acomp_ctx);
> >
> >  	return 0;
> >  }
> 
> There are no other callers to these functions. Just do the work
> directly in the cpu callbacks here like it used to be.

There will be other callers to zswap_create_acomp_ctx() and
zswap_delete_acomp_ctx() in patches 10 and 11 of this series, when the
per-cpu "acomp_batch_ctx" is introduced in struct zswap_pool. I was trying
to modularize the code first, so as to split the changes into smaller commits.

The per-cpu "acomp_batch_ctx" resources are allocated in patch 11 in the
"zswap_pool_can_batch()" function, that allocates batching resources
for this cpu. This was to address Yosry's earlier comment about minimizing
the memory footprint cost of batching.

The way I decided to do this is by reusing the code that allocates the de-facto
pool->acomp_ctx for the selected compressor for all cpu's in zswap_pool_create().
However, I did not want to add the acomp_batch_ctx multiple reqs/buffers
allocation to the cpuhp_state_add_instance() code path which would incur the
memory cost on all cpu's.

Instead, the approach I chose to follow is to allocate the batching resources
in patch 11 only as needed, on "a given cpu" that has to store a large folio. Hope
this explains the purpose of the modularization better.
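
Roughly, the intent in patch 11 is along these lines (a sketch of the
idea rather than the exact patch code; zswap_create_acomp_ctx() is the
refactored allocator from this patch):

        /* First large folio stored on this cpu: set up batching lazily. */
        static bool zswap_pool_can_batch(struct zswap_pool *pool)
        {
                struct crypto_acomp_ctx *ctx = raw_cpu_ptr(pool->acomp_batch_ctx);

                if (ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE)
                        return true;    /* already set up on this cpu */

                return !zswap_create_acomp_ctx(raw_smp_processor_id(), ctx,
                                               pool->tfm_name,
                                               SWAP_CRYPTO_BATCH_SIZE);
        }

so the SWAP_CRYPTO_BATCH_SIZE reqs/buffers only ever get allocated on
cpus that actually store large folios.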

Other ideas towards accomplishing this are very welcome.

Thanks,
Kanchana

> 
> Otherwise it looks good to me.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v3 11/13] mm: zswap: Allocate acomp_batch_ctx resources for a given zswap_pool.
  2024-11-07 17:31   ` Johannes Weiner
@ 2024-11-07 22:22     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 38+ messages in thread
From: Sridhar, Kanchana P @ 2024-11-07 22:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-kernel, linux-mm, yosryahmed, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, Huang, Ying, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P


> -----Original Message-----
> From: Johannes Weiner <hannes@cmpxchg.org>
> Sent: Thursday, November 7, 2024 9:31 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> 21cnbao@gmail.com; akpm@linux-foundation.org; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; zanussi@kernel.org; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v3 11/13] mm: zswap: Allocate acomp_batch_ctx
> resources for a given zswap_pool.
> 
> On Wed, Nov 06, 2024 at 11:21:03AM -0800, Kanchana P Sridhar wrote:
> > If the zswap_pool is associated with an acomp_alg/crypto_acomp that has
> > registered batch_compress() and batch_decompress() API, we can allocate
> the
> > necessary batching resources for the pool's acomp_batch_ctx.
> >
> > This patch makes the above determination on incurring the per-cpu memory
> > footprint cost for batching, and if so, goes ahead and allocates
> > SWAP_CRYPTO_BATCH_SIZE (i.e. 8) acomp_reqs/buffers for the
> > pool->acomp_batch_ctx on that specific cpu.
> >
> > It also "remembers" the pool's batching readiness as a result of the above,
> > through a new
> >
> >    	enum batch_comp_status can_batch_comp;
> >
> > member added to struct zswap_pool, for fast retrieval during
> > zswap_store().
> >
> > This allows us a way to only incur the memory footprint cost of the
> > pool->acomp_batch_ctx resources for a given cpu on which zswap_store()
> > needs to process a large folio.
> >
> > Suggested-by: Yosry Ahmed <yosryahmed@google.com>
> > Suggested-by: Ying Huang <ying.huang@intel.com>
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> 
> A general observation: this is a lot of code for a hardware specific
> feature that many CPUs and architectures do not support. Please keep
> the code self-contained, and wrap struct members and functions in a
> new CONFIG option, so that not everybody has to compile this in.

Thanks for this suggestion! Sure, I will address this in v4.
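
Concretely, I am thinking of something along these lines (the config
symbol name is a placeholder, to be finalized in v4):

        struct zswap_pool {
                struct zpool *zpool;
                struct crypto_acomp_ctx __percpu *acomp_ctx;
        #ifdef CONFIG_ZSWAP_BATCH
                struct crypto_acomp_ctx __percpu *acomp_batch_ctx;
        #endif
                /* ... */
        };

with the batching functions similarly wrapped, so that builds without
the config option compile none of this in.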

Thanks,
Kanchana



^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v3 12/13] mm: Add sysctl vm.compress-batching switch for compress batching during swapout.
  2024-11-07 17:34   ` Johannes Weiner
@ 2024-11-07 22:24     ` Sridhar, Kanchana P
  2024-11-08 20:23     ` Yosry Ahmed
  1 sibling, 0 replies; 38+ messages in thread
From: Sridhar, Kanchana P @ 2024-11-07 22:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-kernel, linux-mm, yosryahmed, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, Huang, Ying, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P


> -----Original Message-----
> From: Johannes Weiner <hannes@cmpxchg.org>
> Sent: Thursday, November 7, 2024 9:34 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> 21cnbao@gmail.com; akpm@linux-foundation.org; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; zanussi@kernel.org; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v3 12/13] mm: Add sysctl vm.compress-batching switch
> for compress batching during swapout.
> 
> On Wed, Nov 06, 2024 at 11:21:04AM -0800, Kanchana P Sridhar wrote:
> > The sysctl vm.compress-batching parameter is 0 by default. If the platform
> > has Intel IAA, the user can run experiments with IAA compress batching of
> > large folios in zswap_store() as follows:
> >
> > sysctl vm.compress-batching=1
> > echo deflate-iaa > /sys/module/zswap/parameters/compressor
> 
> A sysctl seems uncalled for. Can't the batching code be gated on
> deflate-iaa being the compressor? It can still be generalized later if
> another compressor is shown to benefit from batching.

That's a very valid point. I will gate the batching code on the
availability of the crypto batch_compress() and batch_decompress()
interfaces in the compressor. I agree that this can still be generalized
later to other compressors, potentially via this same batching API.
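
In other words, roughly (zswap_compressor_has_batching() is a stand-in
for whatever form the capability check ends up taking; the real test
will be whether the acomp registered batch_compress()/batch_decompress()):

        if (folio_test_large(folio) && zswap_compressor_has_batching(pool))
                return zswap_batch_store(folio);        /* parallel path */

        /* otherwise, fall through to the existing sequential store */

so no sysctl is needed, and another compressor that later registers the
batching API gets this for free.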

Thanks,
Kanchana


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v3 13/13] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios.
  2024-11-07 18:16   ` Johannes Weiner
@ 2024-11-07 22:32     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 38+ messages in thread
From: Sridhar, Kanchana P @ 2024-11-07 22:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-kernel, linux-mm, yosryahmed, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, Huang, Ying, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P


> -----Original Message-----
> From: Johannes Weiner <hannes@cmpxchg.org>
> Sent: Thursday, November 7, 2024 10:16 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> 21cnbao@gmail.com; akpm@linux-foundation.org; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; zanussi@kernel.org; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v3 13/13] mm: zswap: Compress batching with Intel IAA
> in zswap_store() of large folios.
> 
> On Wed, Nov 06, 2024 at 11:21:05AM -0800, Kanchana P Sridhar wrote:
> > If the system has Intel IAA, and if sysctl vm.compress-batching is set to
> > "1", zswap_store() will call crypto_acomp_batch_compress() to compress up
> > to SWAP_CRYPTO_BATCH_SIZE (i.e. 8) pages in large folios in parallel using
> > the multiple compress engines available in IAA hardware.
> >
> > On platforms with multiple IAA devices per socket, compress jobs from all
> > cores in a socket will be distributed among all IAA devices on the socket
> > by the iaa_crypto driver.
> >
> > With deflate-iaa configured as the zswap compressor, and
> > sysctl vm.compress-batching is enabled, the first time zswap_store() has to
> > swapout a large folio on any given cpu, it will allocate the
> > pool->acomp_batch_ctx resources on that cpu, and set pool-
> >can_batch_comp
> > to BATCH_COMP_ENABLED. It will then proceed to call the main
> > __zswap_store_batch_core() compress batching function. Subsequent calls
> to
> > zswap_store() on the same cpu will go ahead and use the acomp_batch_ctx
> by
> > checking the pool->can_batch_comp status.
> >
> > Hence, we allocate the per-cpu pool->acomp_batch_ctx resources only on
> an
> > as-needed basis, to reduce memory footprint cost. The cost is not incurred
> > on cores that never get to swapout a large folio.
> >
> > This patch introduces the main __zswap_store_batch_core() function for
> > compress batching. This interface represents the extensible compress
> > batching architecture that can potentially be called with a batch of
> > any-order folios from shrink_folio_list(). In other words, although
> > zswap_store() calls __zswap_store_batch_core() with exactly one large folio
> > in this patch, we can reuse this interface to reclaim a batch of folios, to
> > significantly improve the reclaim path efficiency due to IAA's parallel
> > compression capability.
> >
> > The newly added functions that implement batched stores follow the
> > general structure of zswap_store() of a large folio. Some amount of
> > restructuring and optimization is done to minimize failure points
> > for a batch, fail early and maximize the zswap store pipeline occupancy
> > with SWAP_CRYPTO_BATCH_SIZE pages, potentially from multiple
> > folios. This is intended to maximize reclaim throughput with the IAA
> > hardware parallel compressions.
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  include/linux/zswap.h |  84 ++++++
> >  mm/zswap.c            | 625
> ++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 709 insertions(+)
> >
> > diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> > index 9ad27ab3d222..6d3ef4780c69 100644
> > --- a/include/linux/zswap.h
> > +++ b/include/linux/zswap.h
> > @@ -31,6 +31,88 @@ struct zswap_lruvec_state {
> >  	atomic_long_t nr_disk_swapins;
> >  };
> >
> > +/*
> > + * struct zswap_store_sub_batch_page:
> > + *
> > + * This represents one "zswap batching element", namely, the
> > + * attributes associated with a page in a large folio that will
> > + * be compressed and stored in zswap. The term "batch" is reserved
> > + * for a conceptual "batch" of folios that can be sent to
> > + * zswap_store() by reclaim. The term "sub-batch" is used to describe
> > + * a collection of "zswap batching elements", i.e., an array of
> > + * "struct zswap_store_sub_batch_page *".
> > + *
> > + * The zswap compress sub-batch size is specified by
> > + * SWAP_CRYPTO_BATCH_SIZE, currently set as 8UL if the
> > + * platform has Intel IAA. This means zswap can store a large folio
> > + * by creating sub-batches of up to 8 pages and compressing this
> > + * batch using IAA to parallelize the 8 compress jobs in hardware.
> > + * For e.g., a 64KB folio can be compressed as 2 sub-batches of
> > + * 8 pages each. This can significantly improve the zswap_store()
> > + * performance for large folios.
> > + *
> > + * Although the page itself is represented directly, the structure
> > + * adds a "u8 batch_idx" to represent an index for the folio in a
> > + * conceptual "batch of folios" that can be passed to zswap_store().
> > + * Conceptually, this allows for up to 256 folios that can be passed
> > + * to zswap_store(). If this conceptual number of folios sent to
> > + * zswap_store() exceeds 256, the "batch_idx" needs to become u16.
> > + */
> > +struct zswap_store_sub_batch_page {
> > +	u8 batch_idx;
> > +	swp_entry_t swpentry;
> > +	struct obj_cgroup *objcg;
> > +	struct zswap_entry *entry;
> > +	int error; /* folio error status. */
> > +};
> > +
> > +/*
> > + * struct zswap_store_pipeline_state:
> > + *
> > + * This stores state during IAA compress batching of (conceptually, a batch
> of)
> > + * folios. The term pipelining in this context, refers to breaking down
> > + * the batch of folios being reclaimed into sub-batches of
> > + * SWAP_CRYPTO_BATCH_SIZE pages, batch compressing and storing the
> > + * sub-batch. This concept could be further evolved to use overlap of CPU
> > + * computes with IAA computes. For instance, we could stage the post-
> compress
> > + * computes for sub-batch "N-1" to happen in parallel with IAA batch
> > + * compression of sub-batch "N".
> > + *
> > + * We begin by developing the concept of compress batching. Pipelining
> with
> > + * overlap can be future work.
> > + *
> > + * @errors: The errors status for the batch of reclaim folios passed in from
> > + *          a higher mm layer such as swap_writepage().
> > + * @pool: A valid zswap_pool.
> > + * @acomp_ctx: The per-cpu pointer to the crypto_acomp_ctx for the
> @pool.
> > + * @sub_batch: This is an array that represents the sub-batch of up to
> > + *             SWAP_CRYPTO_BATCH_SIZE pages that are being stored
> > + *             in zswap.
> > + * @comp_dsts: The destination buffers for crypto_acomp_compress() for
> each
> > + *             page being compressed.
> > + * @comp_dlens: The destination buffers' lengths from
> crypto_acomp_compress()
> > + *              obtained after crypto_acomp_poll() returns completion status,
> > + *              for each page being compressed.
> > + * @comp_errors: Compression errors for each page being compressed.
> > + * @nr_comp_pages: Total number of pages in @sub_batch.
> > + *
> > + * Note:
> > + * The max sub-batch size is SWAP_CRYPTO_BATCH_SIZE, currently 8UL.
> > + * Hence, if SWAP_CRYPTO_BATCH_SIZE exceeds 256, some of the
> > + * u8 members (except @comp_dsts) need to become u16.
> > + */
> > +struct zswap_store_pipeline_state {
> > +	int *errors;
> > +	struct zswap_pool *pool;
> > +	struct crypto_acomp_ctx *acomp_ctx;
> > +	struct zswap_store_sub_batch_page *sub_batch;
> > +	struct page **comp_pages;
> > +	u8 **comp_dsts;
> > +	unsigned int *comp_dlens;
> > +	int *comp_errors;
> > +	u8 nr_comp_pages;
> > +};
> 
> Why are these in the public header?

Thanks Johannes, for the detailed code review comments! Yes, these don't
belong in the public header. I will move them to zswap.c.

> 
> >  unsigned long zswap_total_pages(void);
> >  bool zswap_store(struct folio *folio);
> >  bool zswap_load(struct folio *folio);
> > @@ -45,6 +127,8 @@ bool zswap_never_enabled(void);
> >  #else
> >
> >  struct zswap_lruvec_state {};
> > +struct zswap_store_sub_batch_page {};
> > +struct zswap_store_pipeline_state {};
> >
> >  static inline bool zswap_store(struct folio *folio)
> >  {
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 2af736e38213..538aac3fb552 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -255,6 +255,12 @@ static int zswap_create_acomp_ctx(unsigned int
> cpu,
> >  				  char *tfm_name,
> >  				  unsigned int nr_reqs);
> >
> > +static bool __zswap_store_batch_core(
> > +	int node_id,
> > +	struct folio **folios,
> > +	int *errors,
> > +	unsigned int nr_folios);
> > +
> 
> Please reorder the functions to avoid forward decls.

Sure.

> 
> >  /*********************************
> >  * pool functions
> >  **********************************/
> > @@ -1626,6 +1632,12 @@ static ssize_t zswap_store_page(struct page
> *page,
> >  	return -EINVAL;
> >  }
> >
> > +/*
> > + * Modified to use the IAA compress batching framework implemented in
> > + * __zswap_store_batch_core() if sysctl vm.compress-batching is 1.
> > + * The batching code is intended to significantly improve folio store
> > + * performance over the sequential code.
> 
> This isn't helpful, please delete.

Ok.

> 
> >  bool zswap_store(struct folio *folio)
> >  {
> >  	long nr_pages = folio_nr_pages(folio);
> > @@ -1638,6 +1650,38 @@ bool zswap_store(struct folio *folio)
> >  	bool ret = false;
> >  	long index;
> >
> > +	/*
> > +	 * Improve large folio zswap_store() latency with IAA compress
> batching,
> > +	 * if this is enabled by setting sysctl vm.compress-batching to "1".
> > +	 * If enabled, the large folio's pages are compressed in parallel in
> > +	 * batches of SWAP_CRYPTO_BATCH_SIZE pages. If disabled, every
> page in
> > +	 * the large folio is compressed sequentially.
> > +	 */
> 
> Same here. Reduce to "Try to batch compress large folios, fall back to
> processing individual subpages if that fails."

Ok.

> 
> > +	if (folio_test_large(folio) && READ_ONCE(compress_batching)) {
> > +		pool = zswap_pool_current_get();
> 
> There is an existing zswap_pool_current_get() in zswap_store(), please
> reorder the sequence so you don't need to add an extra one.

Ok, will do this. I was trying to make the code less messy, but will find
a cleaner way.

> 
> > +		if (!pool) {
> > +			pr_err("Cannot setup acomp_batch_ctx for compress
> batching: no current pool found\n");
> 
> This is unnecessary.

Ok.

> 
> > +			goto sequential_store;
> > +		}
> > +
> > +		if (zswap_pool_can_batch(pool)) {
> 
> This function is introduced in another patch, where it isn't
> used. Please add functions and callers in the same patch.

Ok. Unintended side effects of trying to break down the changes
into smaller commits. Will address this in v4.

> 
> > +			int error = -1;
> > +			bool store_batch = __zswap_store_batch_core(
> > +						folio_nid(folio),
> > +						&folio, &error, 1);
> > +
> > +			if (store_batch) {
> > +				zswap_pool_put(pool);
> > +				if (!error)
> > +					ret = true;
> > +				return ret;
> > +			}
> > +		}
> 
> Please don't future proof code like this, only implement what is
> strictly necessary for the functionality in this patch. You're only
> adding a single caller with nr_folios=1, so it shouldn't be a
> parameter, and the function shouldn't have a that batch_idx loop.

Ok.

> 
> > +		zswap_pool_put(pool);
> > +	}
> > +
> > +sequential_store:
> > +
> >  	VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >  	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> >
> > @@ -1724,6 +1768,587 @@ bool zswap_store(struct folio *folio)
> >  	return ret;
> >  }
> >
> > +/*
> > + * Note: If SWAP_CRYPTO_BATCH_SIZE exceeds 256, change the
> > + * u8 stack variables in the next several functions, to u16.
> > + */
> > +
> > +/*
> > + * Propagate the "sbp" error condition to other batch elements belonging
> to
> > + * the same folio as "sbp".
> > + */
> > +static __always_inline void zswap_store_propagate_errors(
> > +	struct zswap_store_pipeline_state *zst,
> > +	u8 error_batch_idx)
> > +{
> 
> Please observe surrounding coding style on how to wrap >80 col
> function signatures.

I see. Ok.

> 
> Don't use __always_inline unless there is a clear, spelled out
> performance reason. Since it's an error path, that's doubtful.

The motivation was incompressible pages, but sure, will address in v4.

> 
> Please use a consistent namespace for all this:
> 
> CONFIG_ZSWAP_BATCH
> zswap_batch_store()
> zswap_batch_alloc_entries()
> zswap_batch_add_folios()
> zswap_batch_compress()
> 
> etc.
> 
> Again, order to avoid forward decls.
> 
> Try to keep the overall sequence of events between zswap_store() and
> zswap_batch_store() similar as much as possible for readability.

Definitely.

Thanks,
Kanchana


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH v3 13/13] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios.
  2024-11-07 18:53   ` Johannes Weiner
@ 2024-11-07 22:50     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 38+ messages in thread
From: Sridhar, Kanchana P @ 2024-11-07 22:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-kernel, linux-mm, yosryahmed, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, Huang, Ying, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P


> -----Original Message-----
> From: Johannes Weiner <hannes@cmpxchg.org>
> Sent: Thursday, November 7, 2024 10:54 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> 21cnbao@gmail.com; akpm@linux-foundation.org; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; zanussi@kernel.org; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v3 13/13] mm: zswap: Compress batching with Intel IAA
> in zswap_store() of large folios.
> 
> On Wed, Nov 06, 2024 at 11:21:05AM -0800, Kanchana P Sridhar wrote:
> > +static void zswap_zpool_store_sub_batch(
> > +	struct zswap_store_pipeline_state *zst)
> 
> There is a zswap_store_sub_batch() below, which does something
> completely different. Naming is hard, but please invest a bit more
> time into this to make this readable.

Thanks Johannes, for the comments. Yes, I agree the naming could
be better.

> 
> > +{
> > +	u8 i;
> > +
> > +	for (i = 0; i < zst->nr_comp_pages; ++i) {
> > +		struct zswap_store_sub_batch_page *sbp = &zst-
> >sub_batch[i];
> > +		struct zpool *zpool;
> > +		unsigned long handle;
> > +		char *buf;
> > +		gfp_t gfp;
> > +		int err;
> > +
> > +		/* Skip pages that had compress errors. */
> > +		if (sbp->error)
> > +			continue;
> > +
> > +		zpool = zst->pool->zpool;
> > +		gfp = __GFP_NORETRY | __GFP_NOWARN |
> __GFP_KSWAPD_RECLAIM;
> > +		if (zpool_malloc_support_movable(zpool))
> > +			gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
> > +		err = zpool_malloc(zpool, zst->comp_dlens[i], gfp, &handle);
> > +
> > +		if (err) {
> > +			if (err == -ENOSPC)
> > +				zswap_reject_compress_poor++;
> > +			else
> > +				zswap_reject_alloc_fail++;
> > +
> > +			/*
> > +			 * An error should be propagated to other pages of
> the
> > +			 * same folio in the sub-batch, and zpool resources for
> > +			 * those pages (in sub-batch order prior to this zpool
> > +			 * error) should be de-allocated.
> > +			 */
> > +			zswap_store_propagate_errors(zst, sbp->batch_idx);
> > +			continue;
> > +		}
> > +
> > +		buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
> > +		memcpy(buf, zst->comp_dsts[i], zst->comp_dlens[i]);
> > +		zpool_unmap_handle(zpool, handle);
> > +
> > +		sbp->entry->handle = handle;
> > +		sbp->entry->length = zst->comp_dlens[i];
> > +	}
> > +}
> > +
> > +/*
> > + * Returns true if the entry was successfully
> > + * stored in the xarray, and false otherwise.
> > + */
> > +static bool zswap_store_entry(swp_entry_t page_swpentry,
> > +			      struct zswap_entry *entry)
> > +{
> > +	struct zswap_entry *old =
> xa_store(swap_zswap_tree(page_swpentry),
> > +					   swp_offset(page_swpentry),
> > +					   entry, GFP_KERNEL);
> > +	if (xa_is_err(old)) {
> > +		int err = xa_err(old);
> > +
> > +		WARN_ONCE(err != -ENOMEM, "unexpected xarray error:
> %d\n", err);
> > +		zswap_reject_alloc_fail++;
> > +		return false;
> > +	}
> > +
> > +	/*
> > +	 * We may have had an existing entry that became stale when
> > +	 * the folio was redirtied and now the new version is being
> > +	 * swapped out. Get rid of the old.
> > +	 */
> > +	if (old)
> > +		zswap_entry_free(old);
> > +
> > +	return true;
> > +}
> > +
> > +static void zswap_batch_compress_post_proc(
> > +	struct zswap_store_pipeline_state *zst)
> > +{
> > +	int nr_objcg_pages = 0, nr_pages = 0;
> > +	struct obj_cgroup *objcg = NULL;
> > +	size_t compressed_bytes = 0;
> > +	u8 i;
> > +
> > +	zswap_zpool_store_sub_batch(zst);
> > +
> > +	for (i = 0; i < zst->nr_comp_pages; ++i) {
> > +		struct zswap_store_sub_batch_page *sbp = &zst-
> >sub_batch[i];
> > +
> > +		if (sbp->error)
> > +			continue;
> > +
> > +		if (!zswap_store_entry(sbp->swpentry, sbp->entry)) {
> > +			zswap_store_propagate_errors(zst, sbp->batch_idx);
> > +			continue;
> > +		}
> > +
> > +		/*
> > +		 * The entry is successfully compressed and stored in the tree,
> > +		 * there is no further possibility of failure. Grab refs to the
> > +		 * pool and objcg. These refs will be dropped by
> > +		 * zswap_entry_free() when the entry is removed from the
> tree.
> > +		 */
> > +		zswap_pool_get(zst->pool);
> > +		if (sbp->objcg)
> > +			obj_cgroup_get(sbp->objcg);
> > +
> > +		/*
> > +		 * We finish initializing the entry while it's already in xarray.
> > +		 * This is safe because:
> > +		 *
> > +		 * 1. Concurrent stores and invalidations are excluded by folio
> > +		 *    lock.
> > +		 *
> > +		 * 2. Writeback is excluded by the entry not being on the LRU
> yet.
> > +		 *    The publishing order matters to prevent writeback from
> seeing
> > +		 *    an incoherent entry.
> > +		 */
> > +		sbp->entry->pool = zst->pool;
> > +		sbp->entry->swpentry = sbp->swpentry;
> > +		sbp->entry->objcg = sbp->objcg;
> > +		sbp->entry->referenced = true;
> > +		if (sbp->entry->length) {
> > +			INIT_LIST_HEAD(&sbp->entry->lru);
> > +			zswap_lru_add(&zswap_list_lru, sbp->entry);
> > +		}
> > +
> > +		if (!objcg && sbp->objcg) {
> > +			objcg = sbp->objcg;
> > +		} else if (objcg && sbp->objcg && (objcg != sbp->objcg)) {
> > +			obj_cgroup_charge_zswap(objcg,
> compressed_bytes);
> > +			count_objcg_events(objcg, ZSWPOUT,
> nr_objcg_pages);
> > +			compressed_bytes = 0;
> > +			nr_objcg_pages = 0;
> > +			objcg = sbp->objcg;
> > +		}
> > +
> > +		if (sbp->objcg) {
> > +			compressed_bytes += sbp->entry->length;
> > +			++nr_objcg_pages;
> > +		}
> > +
> > +		++nr_pages;
> > +	} /* for sub-batch pages. */
> > +
> > +	if (objcg) {
> > +		obj_cgroup_charge_zswap(objcg, compressed_bytes);
> > +		count_objcg_events(objcg, ZSWPOUT, nr_objcg_pages);
> > +	}
> > +
> > +	atomic_long_add(nr_pages, &zswap_stored_pages);
> > +	count_vm_events(ZSWPOUT, nr_pages);
> > +}
> > +
> > +static void zswap_store_sub_batch(struct zswap_store_pipeline_state
> *zst)
> > +{
> > +	u8 i;
> > +
> > +	for (i = 0; i < zst->nr_comp_pages; ++i) {
> > +		zst->comp_dsts[i] = zst->acomp_ctx->buffers[i];
> > +		zst->comp_dlens[i] = PAGE_SIZE;
> > +	} /* for sub-batch pages. */
> > +
> > +	/*
> > +	 * Batch compress sub-batch "N". If IAA is the compressor, the
> > +	 * hardware will compress multiple pages in parallel.
> > +	 */
> > +	zswap_compress_batch(zst);
> > +
> > +	zswap_batch_compress_post_proc(zst);
> 
> The control flow here is a mess. Keep loops over the same batch at the
> same function level. IOW, pull the nr_comp_pages loop out of
> zswap_batch_compress_post_proc() and call the function from the loop.

I see. Got it.

> 
> Also give it a more descriptive name. If that's hard to do, then
> you're probably doing too many different things in it. Create
> functions for a specific purpose, don't carve up sequences at
> arbitrary points.
> 
> My impression after trying to read this is that the existing
> zswap_store() sequence could be a subset of the batched store, where
> you can reuse most code to get the pool, charge the cgroup, allocate
> entries, store entries, bump the stats etc. for both cases. Alas, your
> naming choices make it a bit difficult to be sure.

Apologies for the naming choices; I will fix this. As I was trying to explain
in the commit log, my goal was to minimize failure points, since each failure
point requires unwinding state, which adds latency. Towards this goal, I tried
to allocate all entries upfront and fail early, to avoid having to unwind
state, especially since the upfront work done for the batch is needed in
any case (e.g. zswap_alloc_entries()).

This is why the trade-offs of treating the existing zswap_store()
sequence as a subset of the batched store are not entirely clear to me.
I tried to optimize the batched store for batching, while following the
logical structure of zswap_store().
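
For example, the entries for a whole sub-batch are allocated before any
compression is started, so a single allocation failure fails the batch
before IAA is ever invoked. A sketch of the idea (not the exact code;
"entries", "nr" and "nid" are whatever the caller has at hand):

        for (i = 0; i < nr; i++) {
                entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, nid);
                if (!entries[i])
                        goto free_entries;      /* nothing in zpool/tree to unwind yet */
        }

        /* batch compress, zpool store, tree insert, ... */

free_entries:
        while (i--)
                zswap_entry_cache_free(entries[i]);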

> 
> Please explore this direction. Don't worry about the CONFIG symbol for
> now, we can still look at this later.

Definitely, I will think some more about this.

> 
> Right now, it's basically
> 
> 	if (special case)
> 		lots of duplicative code in slightly different order
> 	regular store sequence
> 
> and that isn't going to be maintainable.
> 
> Look for a high-level sequence that makes sense for both cases. E.g.:
> 
> 	if (!zswap_enabled)
> 		goto check_old;
> 
> 	get objcg
> 
> 	check limits
> 
> 	allocate memcg list lru
> 
> 	for each batch {
> 		for each entry {
> 			allocate entry
> 			acquire objcg ref
> 			acquire pool ref
> 		}
> 		compress
> 		for each entry {
> 			store in tree
> 			add to lru
> 			bump stats and counters
> 		}
> 	}
> 
> 	put objcg
> 
> 	return true;
> 
> check_error:
> 	...
> 
> and then set up the two loops such that they also make sense when the
> folio is just a single page.

Thanks for this suggestion! I will explore this kind of structure. I hope
the above explains why I pursued the existing batching structure. One
other thing I wanted to add concerns the "future proofing" you alluded
to earlier (which I will fix): many of my design choices were motivated
by minimizing latency gaps (e.g. from unwinding state on errors) in the
batch compress pipeline when a reclaim batch of any-order folios is
potentially sent to zswap.

Thanks again,
Kanchana


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 08/13] mm: zswap: acomp_ctx mutex lock/unlock optimizations.
  2024-11-06 19:21 ` [PATCH v3 08/13] mm: zswap: acomp_ctx mutex lock/unlock optimizations Kanchana P Sridhar
@ 2024-11-08 20:14   ` Yosry Ahmed
  2024-11-08 21:34     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 38+ messages in thread
From: Yosry Ahmed @ 2024-11-08 20:14 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, ying.huang, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi, zanussi, wajdi.k.feghali, vinodh.gopal

On Wed, Nov 6, 2024 at 11:21 AM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> This patch implements two changes with respect to the acomp_ctx mutex lock:

The commit subject is misleading; one of these is definitely not an
optimization.

Also, if we are doing two unrelated things we should do them in two
separate commits.

>
> 1) The mutex lock is not acquired/released in zswap_compress(). Instead,
>    zswap_store() acquires the mutex lock once before compressing each page
>    in a large folio, and releases the lock once all pages in the folio have
>    been compressed. This should reduce some compute cycles in case of large
>    folio stores.

I understand how bouncing the mutex around can regress performance,
but I expect this to be more due to things like cacheline bouncing and
allowing reclaim to make meaningful progress before giving up the
mutex, rather than the actual cycles spent acquiring the mutex.

Do you have any numbers to support that this is a net improvement? We
usually base optimizations on data.

> 2) In zswap_decompress(), the mutex lock is released after the conditional
>    zpool_unmap_handle() based on "src != acomp_ctx->buffer" rather than
>    before. This ensures that the value of "src" obtained earlier does not
>    change. If the mutex lock is released before the comparison of "src" it
>    is possible that another call to reclaim by the same process could
>    obtain the mutex lock and over-write the value of "src".

This seems like a bug fix for 9c500835f279 ("mm: zswap: fix kernel BUG
in sg_init_one"). That commit changed checking acomp_ctx->is_sleepable
outside the mutex, which seems to be safe, to checking
acomp_ctx->buffer.

If my understanding is correct, this needs to be sent separately as a
hotfix, with a proper Fixes tag and CC stable. The side effect would
be that we never unmap the zpool handle and essentially leak the
memory, right?

>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/zswap.c | 19 +++++++++++++++----
>  1 file changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index f6316b66fb23..3e899fa61445 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -880,6 +880,9 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
>         return 0;
>  }
>
> +/*
> + * The acomp_ctx->mutex must be locked/unlocked in the calling procedure.
> + */
>  static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>                            struct zswap_pool *pool)
>  {
> @@ -895,8 +898,6 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>
>         acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
>
> -       mutex_lock(&acomp_ctx->mutex);
> -
>         dst = acomp_ctx->buffer;
>         sg_init_table(&input, 1);
>         sg_set_page(&input, page, PAGE_SIZE, 0);
> @@ -949,7 +950,6 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>         else if (alloc_ret)
>                 zswap_reject_alloc_fail++;
>
> -       mutex_unlock(&acomp_ctx->mutex);
>         return comp_ret == 0 && alloc_ret == 0;
>  }
>
> @@ -986,10 +986,16 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>         acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
>         BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
>         BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> -       mutex_unlock(&acomp_ctx->mutex);
>
>         if (src != acomp_ctx->buffer)
>                 zpool_unmap_handle(zpool, entry->handle);
> +
> +       /*
> +        * It is safer to unlock the mutex after the check for
> +        * "src != acomp_ctx->buffer" so that the value of "src"
> +        * does not change.
> +        */

This comment is unnecessary; we should only release the lock after we
are done accessing protected fields.

> +       mutex_unlock(&acomp_ctx->mutex);
>  }
>
>  /*********************************
> @@ -1487,6 +1493,7 @@ bool zswap_store(struct folio *folio)
>  {
>         long nr_pages = folio_nr_pages(folio);
>         swp_entry_t swp = folio->swap;
> +       struct crypto_acomp_ctx *acomp_ctx;
>         struct obj_cgroup *objcg = NULL;
>         struct mem_cgroup *memcg = NULL;
>         struct zswap_pool *pool;
> @@ -1526,6 +1533,9 @@ bool zswap_store(struct folio *folio)
>                 mem_cgroup_put(memcg);
>         }
>
> +       acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> +       mutex_lock(&acomp_ctx->mutex);
> +
>         for (index = 0; index < nr_pages; ++index) {
>                 struct page *page = folio_page(folio, index);
>                 ssize_t bytes;
> @@ -1547,6 +1557,7 @@ bool zswap_store(struct folio *folio)
>         ret = true;
>
>  put_pool:
> +       mutex_unlock(&acomp_ctx->mutex);
>         zswap_pool_put(pool);
>  put_objcg:
>         obj_cgroup_put(objcg);
> --
> 2.27.0
>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 09/13] mm: zswap: Modify struct crypto_acomp_ctx to be configurable in nr of acomp_reqs.
  2024-11-07 22:21     ` Sridhar, Kanchana P
@ 2024-11-08 20:22       ` Yosry Ahmed
  2024-11-08 21:39         ` Sridhar, Kanchana P
  0 siblings, 1 reply; 38+ messages in thread
From: Yosry Ahmed @ 2024-11-08 20:22 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: Johannes Weiner, linux-kernel, linux-mm, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, Huang, Ying, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, Feghali, Wajdi K, Gopal, Vinodh

On Thu, Nov 7, 2024 at 2:21 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi Johannes,
>
> > -----Original Message-----
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Sent: Thursday, November 7, 2024 9:21 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > yosryahmed@google.com; nphamcs@gmail.com;
> > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> > 21cnbao@gmail.com; akpm@linux-foundation.org; linux-
> > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > <kristen.c.accardi@intel.com>; zanussi@kernel.org; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v3 09/13] mm: zswap: Modify struct crypto_acomp_ctx
> > to be configurable in nr of acomp_reqs.
> >
> > On Wed, Nov 06, 2024 at 11:21:01AM -0800, Kanchana P Sridhar wrote:
> > > Modified the definition of "struct crypto_acomp_ctx" to represent a
> > > configurable number of acomp_reqs and the required number of buffers.
> > >
> > > Accordingly, refactored the code that allocates/deallocates the acomp_ctx
> > > resources, so that it can be called to create a regular acomp_ctx with
> > > exactly one acomp_req/buffer, for use in the existing non-batching
> > > zswap_store(), as well as to create a separate "batching acomp_ctx" with
> > > multiple acomp_reqs/buffers for IAA compress batching.
> > >
> > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > > ---
> > >  mm/zswap.c | 149 ++++++++++++++++++++++++++++++++++++++----------
> > -----
> > >  1 file changed, 107 insertions(+), 42 deletions(-)
> > >
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index 3e899fa61445..02e031122fdf 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -143,9 +143,10 @@ bool zswap_never_enabled(void)
> > >
> > >  struct crypto_acomp_ctx {
> > >     struct crypto_acomp *acomp;
> > > -   struct acomp_req *req;
> > > +   struct acomp_req **reqs;
> > > +   u8 **buffers;
> > > +   unsigned int nr_reqs;
> > >     struct crypto_wait wait;
> > > -   u8 *buffer;
> > >     struct mutex mutex;
> > >     bool is_sleepable;
> > >  };
> > > @@ -241,6 +242,11 @@ static inline struct xarray
> > *swap_zswap_tree(swp_entry_t swp)
> > >     pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name,         \
> > >              zpool_get_type((p)->zpool))
> > >
> > > +static int zswap_create_acomp_ctx(unsigned int cpu,
> > > +                             struct crypto_acomp_ctx *acomp_ctx,
> > > +                             char *tfm_name,
> > > +                             unsigned int nr_reqs);
> >
> > This looks unnecessary.
>
> Thanks for the code review comments. I will make sure to avoid the
> forward declarations.
>
> >
> > > +
> > >  /*********************************
> > >  * pool functions
> > >  **********************************/
> > > @@ -813,69 +819,128 @@ static void zswap_entry_free(struct
> > zswap_entry *entry)
> > >  /*********************************
> > >  * compressed storage functions
> > >  **********************************/
> > > -static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node
> > *node)
> > > +static int zswap_create_acomp_ctx(unsigned int cpu,
> > > +                             struct crypto_acomp_ctx *acomp_ctx,
> > > +                             char *tfm_name,
> > > +                             unsigned int nr_reqs)
> > >  {
> > > -   struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> > node);
> > > -   struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> > >acomp_ctx, cpu);
> > >     struct crypto_acomp *acomp;
> > > -   struct acomp_req *req;
> > > -   int ret;
> > > +   int ret = -ENOMEM;
> > > +   int i, j;
> > >
> > > +   acomp_ctx->nr_reqs = 0;
> > >     mutex_init(&acomp_ctx->mutex);
> > >
> > > -   acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
> > cpu_to_node(cpu));
> > > -   if (!acomp_ctx->buffer)
> > > -           return -ENOMEM;
> > > -
> > > -   acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0,
> > cpu_to_node(cpu));
> > > +   acomp = crypto_alloc_acomp_node(tfm_name, 0, 0,
> > cpu_to_node(cpu));
> > >     if (IS_ERR(acomp)) {
> > >             pr_err("could not alloc crypto acomp %s : %ld\n",
> > > -                           pool->tfm_name, PTR_ERR(acomp));
> > > -           ret = PTR_ERR(acomp);
> > > -           goto acomp_fail;
> > > +                           tfm_name, PTR_ERR(acomp));
> > > +           return PTR_ERR(acomp);
> > >     }
> > > +
> > >     acomp_ctx->acomp = acomp;
> > >     acomp_ctx->is_sleepable = acomp_is_async(acomp);
> > >
> > > -   req = acomp_request_alloc(acomp_ctx->acomp);
> > > -   if (!req) {
> > > -           pr_err("could not alloc crypto acomp_request %s\n",
> > > -                  pool->tfm_name);
> > > -           ret = -ENOMEM;
> > > +   acomp_ctx->buffers = kmalloc_node(nr_reqs * sizeof(u8 *),
> > > +                                     GFP_KERNEL, cpu_to_node(cpu));
> > > +   if (!acomp_ctx->buffers)
> > > +           goto buf_fail;
> > > +
> > > +   for (i = 0; i < nr_reqs; ++i) {
> > > +           acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2,
> > > +                                                GFP_KERNEL,
> > cpu_to_node(cpu));
> > > +           if (!acomp_ctx->buffers[i]) {
> > > +                   for (j = 0; j < i; ++j)
> > > +                           kfree(acomp_ctx->buffers[j]);
> > > +                   kfree(acomp_ctx->buffers);
> > > +                   ret = -ENOMEM;
> > > +                   goto buf_fail;
> > > +           }
> > > +   }
> > > +
> > > +   acomp_ctx->reqs = kmalloc_node(nr_reqs * sizeof(struct acomp_req
> > *),
> > > +                                  GFP_KERNEL, cpu_to_node(cpu));
> > > +   if (!acomp_ctx->reqs)
> > >             goto req_fail;
> > > +
> > > +   for (i = 0; i < nr_reqs; ++i) {
> > > +           acomp_ctx->reqs[i] = acomp_request_alloc(acomp_ctx-
> > >acomp);
> > > +           if (!acomp_ctx->reqs[i]) {
> > > +                   pr_err("could not alloc crypto acomp_request
> > reqs[%d] %s\n",
> > > +                          i, tfm_name);
> > > +                   for (j = 0; j < i; ++j)
> > > +                           acomp_request_free(acomp_ctx->reqs[j]);
> > > +                   kfree(acomp_ctx->reqs);
> > > +                   ret = -ENOMEM;
> > > +                   goto req_fail;
> > > +           }
> > >     }
> > > -   acomp_ctx->req = req;
> > >
> > > +   /*
> > > +    * The crypto_wait is used only in fully synchronous, i.e., with scomp
> > > +    * or non-poll mode of acomp, hence there is only one "wait" per
> > > +    * acomp_ctx, with callback set to reqs[0], under the assumption that
> > > +    * there is at least 1 request per acomp_ctx.
> > > +    */
> > >     crypto_init_wait(&acomp_ctx->wait);
> > >     /*
> > >      * if the backend of acomp is async zip, crypto_req_done() will
> > wakeup
> > >      * crypto_wait_req(); if the backend of acomp is scomp, the callback
> > >      * won't be called, crypto_wait_req() will return without blocking.
> > >      */
> > > -   acomp_request_set_callback(req,
> > CRYPTO_TFM_REQ_MAY_BACKLOG,
> > > +   acomp_request_set_callback(acomp_ctx->reqs[0],
> > CRYPTO_TFM_REQ_MAY_BACKLOG,
> > >                                crypto_req_done, &acomp_ctx->wait);
> > >
> > > +   acomp_ctx->nr_reqs = nr_reqs;
> > >     return 0;
> > >
> > >  req_fail:
> > > +   for (i = 0; i < nr_reqs; ++i)
> > > +           kfree(acomp_ctx->buffers[i]);
> > > +   kfree(acomp_ctx->buffers);
> > > +buf_fail:
> > >     crypto_free_acomp(acomp_ctx->acomp);
> > > -acomp_fail:
> > > -   kfree(acomp_ctx->buffer);
> > >     return ret;
> > >  }
> > >
> > > -static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node
> > *node)
> > > +static void zswap_delete_acomp_ctx(struct crypto_acomp_ctx
> > *acomp_ctx)
> > >  {
> > > -   struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> > node);
> > > -   struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> > >acomp_ctx, cpu);
> > > -
> > >     if (!IS_ERR_OR_NULL(acomp_ctx)) {
> > > -           if (!IS_ERR_OR_NULL(acomp_ctx->req))
> > > -                   acomp_request_free(acomp_ctx->req);
> > > +           int i;
> > > +
> > > +           for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> > > +                   if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
> > > +                           acomp_request_free(acomp_ctx->reqs[i]);
> > > +           kfree(acomp_ctx->reqs);
> > > +
> > > +           for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> > > +                   kfree(acomp_ctx->buffers[i]);
> > > +           kfree(acomp_ctx->buffers);
> > > +
> > >             if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> > >                     crypto_free_acomp(acomp_ctx->acomp);
> > > -           kfree(acomp_ctx->buffer);
> > > +
> > > +           acomp_ctx->nr_reqs = 0;
> > > +           acomp_ctx = NULL;
> > >     }
> > > +}
> > > +
> > > +static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node
> > *node)
> > > +{
> > > +   struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> > node);
> > > +   struct crypto_acomp_ctx *acomp_ctx;
> > > +
> > > +   acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> > > +   return zswap_create_acomp_ctx(cpu, acomp_ctx, pool->tfm_name,
> > 1);
> > > +}
> > > +
> > > +static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node
> > *node)
> > > +{
> > > +   struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> > node);
> > > +   struct crypto_acomp_ctx *acomp_ctx;
> > > +
> > > +   acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> > > +   zswap_delete_acomp_ctx(acomp_ctx);
> > >
> > >     return 0;
> > >  }
> >
> > There are no other callers to these functions. Just do the work
> > directly in the cpu callbacks here like it used to be.
>
> There will be other callers to zswap_create_acomp_ctx() and
> zswap_delete_acomp_ctx() in patches 10 and 11 of this series, when the
> per-cpu "acomp_batch_ctx" is introduced in struct zswap_pool. I was trying
> to modularize the code first, so as to split the changes into smaller commits.
>
> The per-cpu "acomp_batch_ctx" resources are allocated in patch 11 in the
> "zswap_pool_can_batch()" function, that allocates batching resources
> for this cpu. This was to address Yosry's earlier comment about minimizing
> the memory footprint cost of batching.
>
> The way I decided to do this is by reusing the code that allocates the de-facto
> pool->acomp_ctx for the selected compressor for all cpu's in zswap_pool_create().
> However, I did not want to add the acomp_batch_ctx multiple reqs/buffers
> allocation to the cpuhp_state_add_instance() code path which would incur the
> memory cost on all cpu's.
>
> Instead, the approach I chose to follow is to allocate the batching resources
> in patch 11 only as needed, on "a given cpu" that has to store a large folio. Hope
> this explains the purpose of the modularization better.
>
> Other ideas towards accomplishing this are very welcome.

If we remove the sysctl as suggested by Johannes, then we can just
allocate the number of buffers based on the compressor and whether it
supports batching during the pool initialization in the cpu callbacks
only.

Right?
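
Something like this in the cpu callback (a sketch; how exactly we detect
batching support is TBD, so zswap_compressor_can_batch() is hypothetical):

        static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
        {
                struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
                struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
                unsigned int nr_reqs = zswap_compressor_can_batch(pool->tfm_name) ?
                                       SWAP_CRYPTO_BATCH_SIZE : 1;

                return zswap_create_acomp_ctx(cpu, acomp_ctx, pool->tfm_name, nr_reqs);
        }

One acomp_ctx per cpu, sized once at pool creation, with no separate
acomp_batch_ctx and no lazy allocation needed.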

>
> Thanks,
> Kanchana
>
> >
> > Otherwise it looks good to me.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 10/13] mm: zswap: Add a per-cpu "acomp_batch_ctx" to struct zswap_pool.
  2024-11-06 19:21 ` [PATCH v3 10/13] mm: zswap: Add a per-cpu "acomp_batch_ctx" to struct zswap_pool Kanchana P Sridhar
@ 2024-11-08 20:23   ` Yosry Ahmed
  2024-11-09  1:04     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 38+ messages in thread
From: Yosry Ahmed @ 2024-11-08 20:23 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, ying.huang, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi, zanussi, wajdi.k.feghali, vinodh.gopal

On Wed, Nov 6, 2024 at 11:21 AM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> This patch adds a separate per-cpu batching acomp context "acomp_batch_ctx"
> to the zswap_pool. The per-cpu acomp_batch_ctx pointer is allocated at pool
> creation time, but no per-cpu resources are allocated for it.
>
> The idea is to not incur the memory footprint cost of multiple acomp_reqs
> and buffers in the existing "acomp_ctx" for cases where compress batching
> is not possible; for instance, with software compressor algorithms, on
> systems without IAA, on systems with IAA that want to run the existing
> non-batching implementation of zswap_store() of large folios.
>
> By creating a separate acomp_batch_ctx, we have the ability to allocate
> additional memory per-cpu only if the zswap compressor supports batching,
> and if the user wants to enable the use of compress batching in
> zswap_store() to improve swapout performance of large folios.
>
> Suggested-by: Yosry Ahmed <yosryahmed@google.com>

This isn't needed if the sysctl is removed and we just allocate the
number of buffers during pool initialization, right? Same for the next
patch.


> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/zswap.c | 22 +++++++++++++++++++++-
>  1 file changed, 21 insertions(+), 1 deletion(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 02e031122fdf..80a928cf0f7e 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -160,6 +160,7 @@ struct crypto_acomp_ctx {
>  struct zswap_pool {
>         struct zpool *zpool;
>         struct crypto_acomp_ctx __percpu *acomp_ctx;
> +       struct crypto_acomp_ctx __percpu *acomp_batch_ctx;
>         struct percpu_ref ref;
>         struct list_head list;
>         struct work_struct release_work;
> @@ -287,10 +288,14 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
>
>         pool->acomp_ctx = alloc_percpu(*pool->acomp_ctx);
>         if (!pool->acomp_ctx) {
> -               pr_err("percpu alloc failed\n");
> +               pr_err("percpu acomp_ctx alloc failed\n");
>                 goto error;
>         }
>
> +       pool->acomp_batch_ctx = alloc_percpu(*pool->acomp_batch_ctx);
> +       if (!pool->acomp_batch_ctx)
> +               pr_err("percpu acomp_batch_ctx alloc failed\n");
> +
>         ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
>                                        &pool->node);
>         if (ret)
> @@ -312,6 +317,8 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
>  ref_fail:
>         cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
>  error:
> +       if (pool->acomp_batch_ctx)
> +               free_percpu(pool->acomp_batch_ctx);
>         if (pool->acomp_ctx)
>                 free_percpu(pool->acomp_ctx);
>         if (pool->zpool)
> @@ -368,6 +375,8 @@ static void zswap_pool_destroy(struct zswap_pool *pool)
>
>         cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
>         free_percpu(pool->acomp_ctx);
> +       if (pool->acomp_batch_ctx)
> +               free_percpu(pool->acomp_batch_ctx);
>
>         zpool_destroy_pool(pool->zpool);
>         kfree(pool);
> @@ -930,6 +939,11 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>         struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
>         struct crypto_acomp_ctx *acomp_ctx;
>
> +       if (pool->acomp_batch_ctx) {
> +               acomp_ctx = per_cpu_ptr(pool->acomp_batch_ctx, cpu);
> +               acomp_ctx->nr_reqs = 0;
> +       }
> +
>         acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
>         return zswap_create_acomp_ctx(cpu, acomp_ctx, pool->tfm_name, 1);
>  }
> @@ -939,6 +953,12 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
>         struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
>         struct crypto_acomp_ctx *acomp_ctx;
>
> +       if (pool->acomp_batch_ctx) {
> +               acomp_ctx = per_cpu_ptr(pool->acomp_batch_ctx, cpu);
> +               if (!IS_ERR_OR_NULL(acomp_ctx) && (acomp_ctx->nr_reqs > 0))
> +                       zswap_delete_acomp_ctx(acomp_ctx);
> +       }
> +
>         acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
>         zswap_delete_acomp_ctx(acomp_ctx);
>
> --
> 2.27.0
>



* Re: [PATCH v3 12/13] mm: Add sysctl vm.compress-batching switch for compress batching during swapout.
  2024-11-07 17:34   ` Johannes Weiner
  2024-11-07 22:24     ` Sridhar, Kanchana P
@ 2024-11-08 20:23     ` Yosry Ahmed
  2024-11-09  1:05       ` Sridhar, Kanchana P
  1 sibling, 1 reply; 38+ messages in thread
From: Yosry Ahmed @ 2024-11-08 20:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Kanchana P Sridhar, linux-kernel, linux-mm, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi, wajdi.k.feghali,
	vinodh.gopal

On Thu, Nov 7, 2024 at 9:34 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Nov 06, 2024 at 11:21:04AM -0800, Kanchana P Sridhar wrote:
> > The sysctl vm.compress-batching parameter is 0 by default. If the platform
> > has Intel IAA, the user can run experiments with IAA compress batching of
> > large folios in zswap_store() as follows:
> >
> > sysctl vm.compress-batching=1
> > echo deflate-iaa > /sys/module/zswap/parameters/compressor
>
> A sysctl seems uncalled for. Can't the batching code be gated on
> deflate-iaa being the compressor? It can still be generalized later if
> another compressor is shown to benefit from batching.

+1
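
A minimal sketch of what gating on the compressor (with the sysctl dropped) could
look like; keying off the tfm name is just the simplest illustration, and whether a
capability flag on the crypto_acomp side would be preferable is left open:

/* Sketch: enable batching purely from the selected compressor. */
static bool zswap_compressor_can_batch(const char *tfm_name)
{
        return !strcmp(tfm_name, "deflate-iaa");
}

The result could be cached in the zswap_pool at creation time, or be implied by
nr_reqs > 1 in the per-CPU context, so that zswap_store() never consults a tunable.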



* RE: [PATCH v3 08/13] mm: zswap: acomp_ctx mutex lock/unlock optimizations.
  2024-11-08 20:14   ` Yosry Ahmed
@ 2024-11-08 21:34     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 38+ messages in thread
From: Sridhar, Kanchana P @ 2024-11-08 21:34 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, Huang, Ying, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P

Hi Yosry,

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Friday, November 8, 2024 12:14 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> linux-crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; zanussi@kernel.org; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v3 08/13] mm: zswap: acomp_ctx mutex lock/unlock
> optimizations.
> 
> On Wed, Nov 6, 2024 at 11:21 AM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > This patch implements two changes with respect to the acomp_ctx mutex
> lock:
> 
> The commit subject is misleading; one of these is definitely not an
> optimization.
> 
> Also, if we are doing two unrelated things we should do them in two
> separate commits.

Thanks for the code review comments. I agree, these should be two
separate commits.

> 
> >
> > 1) The mutex lock is not acquired/released in zswap_compress(). Instead,
> >    zswap_store() acquires the mutex lock once before compressing each
> page
> >    in a large folio, and releases the lock once all pages in the folio have
> >    been compressed. This should reduce some compute cycles in case of
> large
> >    folio stores.
> 
> I understand how bouncing the mutex around can regress performance,
> but I expect this to be more due to things like cacheline bouncing and
> allowing reclaim to make meaningful progress before giving up the
> mutex, rather than the actual cycles spent acquiring the mutex.
> 
> Do you have any numbers to support that this is a net improvement? We
> usually base optimizations on data.

Makes sense. I will gather the data to motivate this. In my internal validation,
I have been re-evaluating whether this acquire/release once per large folio store
still makes sense, because it risks introducing long-latency paths while holding
a sleeping mutex. I will quantify the benefits of this (if any) and update.

> 
> > 2) In zswap_decompress(), the mutex lock is released after the conditional
> >    zpool_unmap_handle() based on "src != acomp_ctx->buffer" rather than
> >    before. This ensures that the value of "src" obtained earlier does not
> >    change. If the mutex lock is released before the comparison of "src" it
> >    change. If the mutex lock is released before the comparison of "src", it
> >    is possible that another call to reclaim by the same process could
> >    obtain the mutex lock and overwrite the value of "src".
> This seems like a bug fix for 9c500835f279 ("mm: zswap: fix kernel BUG
> in sg_init_one"). That commit changed checking acomp_ctx->is_sleepable
> outside the mutex, which seems to be safe, to checking
> acomp_ctx->buffer.
> 
> If my understanding is correct, this needs to be sent separately as a
> hotfix, with a proper Fixes tag and CC stable. The side effect would
> be that we never unmap the zpool handle and essentially leak the
> memory, right?

Sure, I will send this separately as a hotfix. Yes, the side effect you
describe is correct.

Thanks,
Kanchana

> 
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/zswap.c | 19 +++++++++++++++----
> >  1 file changed, 15 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index f6316b66fb23..3e899fa61445 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -880,6 +880,9 @@ static int zswap_cpu_comp_dead(unsigned int cpu,
> struct hlist_node *node)
> >         return 0;
> >  }
> >
> > +/*
> > + * The acomp_ctx->mutex must be locked/unlocked in the calling
> procedure.
> > + */
> >  static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> >                            struct zswap_pool *pool)
> >  {
> > @@ -895,8 +898,6 @@ static bool zswap_compress(struct page *page,
> struct zswap_entry *entry,
> >
> >         acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> >
> > -       mutex_lock(&acomp_ctx->mutex);
> > -
> >         dst = acomp_ctx->buffer;
> >         sg_init_table(&input, 1);
> >         sg_set_page(&input, page, PAGE_SIZE, 0);
> > @@ -949,7 +950,6 @@ static bool zswap_compress(struct page *page,
> struct zswap_entry *entry,
> >         else if (alloc_ret)
> >                 zswap_reject_alloc_fail++;
> >
> > -       mutex_unlock(&acomp_ctx->mutex);
> >         return comp_ret == 0 && alloc_ret == 0;
> >  }
> >
> > @@ -986,10 +986,16 @@ static void zswap_decompress(struct
> zswap_entry *entry, struct folio *folio)
> >         acomp_request_set_params(acomp_ctx->req, &input, &output, entry-
> >length, PAGE_SIZE);
> >         BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx-
> >req), &acomp_ctx->wait));
> >         BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> > -       mutex_unlock(&acomp_ctx->mutex);
> >
> >         if (src != acomp_ctx->buffer)
> >                 zpool_unmap_handle(zpool, entry->handle);
> > +
> > +       /*
> > +        * It is safer to unlock the mutex after the check for
> > +        * "src != acomp_ctx->buffer" so that the value of "src"
> > +        * does not change.
> > +        */
> 
> This comment is unnecessary, we should only release the lock after we
> are done accessing protected fields.
> 
> > +       mutex_unlock(&acomp_ctx->mutex);
> >  }
> >
> >  /*********************************
> > @@ -1487,6 +1493,7 @@ bool zswap_store(struct folio *folio)
> >  {
> >         long nr_pages = folio_nr_pages(folio);
> >         swp_entry_t swp = folio->swap;
> > +       struct crypto_acomp_ctx *acomp_ctx;
> >         struct obj_cgroup *objcg = NULL;
> >         struct mem_cgroup *memcg = NULL;
> >         struct zswap_pool *pool;
> > @@ -1526,6 +1533,9 @@ bool zswap_store(struct folio *folio)
> >                 mem_cgroup_put(memcg);
> >         }
> >
> > +       acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> > +       mutex_lock(&acomp_ctx->mutex);
> > +
> >         for (index = 0; index < nr_pages; ++index) {
> >                 struct page *page = folio_page(folio, index);
> >                 ssize_t bytes;
> > @@ -1547,6 +1557,7 @@ bool zswap_store(struct folio *folio)
> >         ret = true;
> >
> >  put_pool:
> > +       mutex_unlock(&acomp_ctx->mutex);
> >         zswap_pool_put(pool);
> >  put_objcg:
> >         obj_cgroup_put(objcg);
> > --
> > 2.27.0
> >
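
For reference, the separate hotfix discussed above would carry tags along these
lines (commit hash and subject taken verbatim from the thread):

Fixes: 9c500835f279 ("mm: zswap: fix kernel BUG in sg_init_one")
Cc: stable@vger.kernel.org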


* RE: [PATCH v3 09/13] mm: zswap: Modify struct crypto_acomp_ctx to be configurable in nr of acomp_reqs.
  2024-11-08 20:22       ` Yosry Ahmed
@ 2024-11-08 21:39         ` Sridhar, Kanchana P
  2024-11-08 22:54           ` Yosry Ahmed
  0 siblings, 1 reply; 38+ messages in thread
From: Sridhar, Kanchana P @ 2024-11-08 21:39 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Johannes Weiner, linux-kernel, linux-mm, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, Huang, Ying, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P


> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Friday, November 8, 2024 12:22 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> linux-crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; zanussi@kernel.org; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v3 09/13] mm: zswap: Modify struct crypto_acomp_ctx
> to be configurable in nr of acomp_reqs.
> 
> On Thu, Nov 7, 2024 at 2:21 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi Johannes,
> >
> > > -----Original Message-----
> > > From: Johannes Weiner <hannes@cmpxchg.org>
> > > Sent: Thursday, November 7, 2024 9:21 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > yosryahmed@google.com; nphamcs@gmail.com;
> > > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > > ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> > > 21cnbao@gmail.com; akpm@linux-foundation.org; linux-
> > > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > > <kristen.c.accardi@intel.com>; zanussi@kernel.org; Feghali, Wajdi K
> > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v3 09/13] mm: zswap: Modify struct
> crypto_acomp_ctx
> > > to be configurable in nr of acomp_reqs.
> > >
> > > On Wed, Nov 06, 2024 at 11:21:01AM -0800, Kanchana P Sridhar wrote:
> > > > Modified the definition of "struct crypto_acomp_ctx" to represent a
> > > > configurable number of acomp_reqs and the required number of buffers.
> > > >
> > > > Accordingly, refactored the code that allocates/deallocates the
> acomp_ctx
> > > > resources, so that it can be called to create a regular acomp_ctx with
> > > > exactly one acomp_req/buffer, for use in the the existing non-batching
> > > > zswap_store(), as well as to create a separate "batching acomp_ctx"
> with
> > > > multiple acomp_reqs/buffers for IAA compress batching.
> > > >
> > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > > > ---
> > > >  mm/zswap.c | 149 ++++++++++++++++++++++++++++++++++++++-----
> -----
> > > -----
> > > >  1 file changed, 107 insertions(+), 42 deletions(-)
> > > >
> > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > index 3e899fa61445..02e031122fdf 100644
> > > > --- a/mm/zswap.c
> > > > +++ b/mm/zswap.c
> > > > @@ -143,9 +143,10 @@ bool zswap_never_enabled(void)
> > > >
> > > >  struct crypto_acomp_ctx {
> > > >     struct crypto_acomp *acomp;
> > > > -   struct acomp_req *req;
> > > > +   struct acomp_req **reqs;
> > > > +   u8 **buffers;
> > > > +   unsigned int nr_reqs;
> > > >     struct crypto_wait wait;
> > > > -   u8 *buffer;
> > > >     struct mutex mutex;
> > > >     bool is_sleepable;
> > > >  };
> > > > @@ -241,6 +242,11 @@ static inline struct xarray
> > > *swap_zswap_tree(swp_entry_t swp)
> > > >     pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name,         \
> > > >              zpool_get_type((p)->zpool))
> > > >
> > > > +static int zswap_create_acomp_ctx(unsigned int cpu,
> > > > +                             struct crypto_acomp_ctx *acomp_ctx,
> > > > +                             char *tfm_name,
> > > > +                             unsigned int nr_reqs);
> > >
> > > This looks unnecessary.
> >
> > Thanks for the code review comments. I will make sure to avoid the
> > forward declarations.
> >
> > >
> > > > +
> > > >  /*********************************
> > > >  * pool functions
> > > >  **********************************/
> > > > @@ -813,69 +819,128 @@ static void zswap_entry_free(struct
> > > zswap_entry *entry)
> > > >  /*********************************
> > > >  * compressed storage functions
> > > >  **********************************/
> > > > -static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node
> > > *node)
> > > > +static int zswap_create_acomp_ctx(unsigned int cpu,
> > > > +                             struct crypto_acomp_ctx *acomp_ctx,
> > > > +                             char *tfm_name,
> > > > +                             unsigned int nr_reqs)
> > > >  {
> > > > -   struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> > > node);
> > > > -   struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> > > >acomp_ctx, cpu);
> > > >     struct crypto_acomp *acomp;
> > > > -   struct acomp_req *req;
> > > > -   int ret;
> > > > +   int ret = -ENOMEM;
> > > > +   int i, j;
> > > >
> > > > +   acomp_ctx->nr_reqs = 0;
> > > >     mutex_init(&acomp_ctx->mutex);
> > > >
> > > > -   acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
> > > cpu_to_node(cpu));
> > > > -   if (!acomp_ctx->buffer)
> > > > -           return -ENOMEM;
> > > > -
> > > > -   acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0,
> > > cpu_to_node(cpu));
> > > > +   acomp = crypto_alloc_acomp_node(tfm_name, 0, 0,
> > > cpu_to_node(cpu));
> > > >     if (IS_ERR(acomp)) {
> > > >             pr_err("could not alloc crypto acomp %s : %ld\n",
> > > > -                           pool->tfm_name, PTR_ERR(acomp));
> > > > -           ret = PTR_ERR(acomp);
> > > > -           goto acomp_fail;
> > > > +                           tfm_name, PTR_ERR(acomp));
> > > > +           return PTR_ERR(acomp);
> > > >     }
> > > > +
> > > >     acomp_ctx->acomp = acomp;
> > > >     acomp_ctx->is_sleepable = acomp_is_async(acomp);
> > > >
> > > > -   req = acomp_request_alloc(acomp_ctx->acomp);
> > > > -   if (!req) {
> > > > -           pr_err("could not alloc crypto acomp_request %s\n",
> > > > -                  pool->tfm_name);
> > > > -           ret = -ENOMEM;
> > > > +   acomp_ctx->buffers = kmalloc_node(nr_reqs * sizeof(u8 *),
> > > > +                                     GFP_KERNEL, cpu_to_node(cpu));
> > > > +   if (!acomp_ctx->buffers)
> > > > +           goto buf_fail;
> > > > +
> > > > +   for (i = 0; i < nr_reqs; ++i) {
> > > > +           acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2,
> > > > +                                                GFP_KERNEL,
> > > cpu_to_node(cpu));
> > > > +           if (!acomp_ctx->buffers[i]) {
> > > > +                   for (j = 0; j < i; ++j)
> > > > +                           kfree(acomp_ctx->buffers[j]);
> > > > +                   kfree(acomp_ctx->buffers);
> > > > +                   ret = -ENOMEM;
> > > > +                   goto buf_fail;
> > > > +           }
> > > > +   }
> > > > +
> > > > +   acomp_ctx->reqs = kmalloc_node(nr_reqs * sizeof(struct acomp_req
> > > *),
> > > > +                                  GFP_KERNEL, cpu_to_node(cpu));
> > > > +   if (!acomp_ctx->reqs)
> > > >             goto req_fail;
> > > > +
> > > > +   for (i = 0; i < nr_reqs; ++i) {
> > > > +           acomp_ctx->reqs[i] = acomp_request_alloc(acomp_ctx-
> > > >acomp);
> > > > +           if (!acomp_ctx->reqs[i]) {
> > > > +                   pr_err("could not alloc crypto acomp_request
> > > reqs[%d] %s\n",
> > > > +                          i, tfm_name);
> > > > +                   for (j = 0; j < i; ++j)
> > > > +                           acomp_request_free(acomp_ctx->reqs[j]);
> > > > +                   kfree(acomp_ctx->reqs);
> > > > +                   ret = -ENOMEM;
> > > > +                   goto req_fail;
> > > > +           }
> > > >     }
> > > > -   acomp_ctx->req = req;
> > > >
> > > > +   /*
> > > > +    * The crypto_wait is used only in fully synchronous, i.e., with scomp
> > > > +    * or non-poll mode of acomp, hence there is only one "wait" per
> > > > +    * acomp_ctx, with callback set to reqs[0], under the assumption that
> > > > +    * there is at least 1 request per acomp_ctx.
> > > > +    */
> > > >     crypto_init_wait(&acomp_ctx->wait);
> > > >     /*
> > > >      * if the backend of acomp is async zip, crypto_req_done() will
> > > wakeup
> > > >      * crypto_wait_req(); if the backend of acomp is scomp, the callback
> > > >      * won't be called, crypto_wait_req() will return without blocking.
> > > >      */
> > > > -   acomp_request_set_callback(req,
> > > CRYPTO_TFM_REQ_MAY_BACKLOG,
> > > > +   acomp_request_set_callback(acomp_ctx->reqs[0],
> > > CRYPTO_TFM_REQ_MAY_BACKLOG,
> > > >                                crypto_req_done, &acomp_ctx->wait);
> > > >
> > > > +   acomp_ctx->nr_reqs = nr_reqs;
> > > >     return 0;
> > > >
> > > >  req_fail:
> > > > +   for (i = 0; i < nr_reqs; ++i)
> > > > +           kfree(acomp_ctx->buffers[i]);
> > > > +   kfree(acomp_ctx->buffers);
> > > > +buf_fail:
> > > >     crypto_free_acomp(acomp_ctx->acomp);
> > > > -acomp_fail:
> > > > -   kfree(acomp_ctx->buffer);
> > > >     return ret;
> > > >  }
> > > >
> > > > -static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node
> > > *node)
> > > > +static void zswap_delete_acomp_ctx(struct crypto_acomp_ctx
> > > *acomp_ctx)
> > > >  {
> > > > -   struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> > > node);
> > > > -   struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> > > >acomp_ctx, cpu);
> > > > -
> > > >     if (!IS_ERR_OR_NULL(acomp_ctx)) {
> > > > -           if (!IS_ERR_OR_NULL(acomp_ctx->req))
> > > > -                   acomp_request_free(acomp_ctx->req);
> > > > +           int i;
> > > > +
> > > > +           for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> > > > +                   if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
> > > > +                           acomp_request_free(acomp_ctx->reqs[i]);
> > > > +           kfree(acomp_ctx->reqs);
> > > > +
> > > > +           for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> > > > +                   kfree(acomp_ctx->buffers[i]);
> > > > +           kfree(acomp_ctx->buffers);
> > > > +
> > > >             if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> > > >                     crypto_free_acomp(acomp_ctx->acomp);
> > > > -           kfree(acomp_ctx->buffer);
> > > > +
> > > > +           acomp_ctx->nr_reqs = 0;
> > > > +           acomp_ctx = NULL;
> > > >     }
> > > > +}
> > > > +
> > > > +static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node
> > > *node)
> > > > +{
> > > > +   struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> > > node);
> > > > +   struct crypto_acomp_ctx *acomp_ctx;
> > > > +
> > > > +   acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> > > > +   return zswap_create_acomp_ctx(cpu, acomp_ctx, pool->tfm_name,
> > > 1);
> > > > +}
> > > > +
> > > > +static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node
> > > *node)
> > > > +{
> > > > +   struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> > > node);
> > > > +   struct crypto_acomp_ctx *acomp_ctx;
> > > > +
> > > > +   acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> > > > +   zswap_delete_acomp_ctx(acomp_ctx);
> > > >
> > > >     return 0;
> > > >  }
> > >
> > > There are no other callers to these functions. Just do the work
> > > directly in the cpu callbacks here like it used to be.
> >
> > There will be other callers to zswap_create_acomp_ctx() and
> > zswap_delete_acomp_ctx() in patches 10 and 11 of this series, when the
> > per-cpu "acomp_batch_ctx" is introduced in struct zswap_pool. I was trying
> > to modularize the code first, so as to split the changes into smaller commits.
> >
> > The per-cpu "acomp_batch_ctx" resources are allocated in patch 11 in the
> > "zswap_pool_can_batch()" function, that allocates batching resources
> > for this cpu. This was to address Yosry's earlier comment about minimizing
> > the memory footprint cost of batching.
> >
> > The way I decided to do this is by reusing the code that allocates the de-
> facto
> > pool->acomp_ctx for the selected compressor for all cpu's in
> zswap_pool_create().
> > However, I did not want to add the acomp_batch_ctx multiple reqs/buffers
> > allocation to the cpuhp_state_add_instance() code path which would incur
> the
> > memory cost on all cpu's.
> >
> > Instead, the approach I chose to follow is to allocate the batching resources
> > in patch 11 only as needed, on "a given cpu" that has to store a large folio.
> Hope
> > this explains the purpose of the modularization better.
> >
> > Other ideas towards accomplishing this are very welcome.
> 
> If we remove the sysctl as suggested by Johannes, then we can just
> allocate the number of buffers based on the compressor and whether it
> supports batching during the pool initialization in the cpu callbacks
> only.
> 
> Right?

Yes, we could do that if the sysctl is removed, as suggested by Johannes.
The only "drawback" of allocating the batching resources (assuming the
compressor allows batching) would be that the memory footprint penalty
would be incurred on every cpu. I was trying to further economize this
cost based on whether a given cpu actually needs to zswap_store() a
large folio, and only then allocate the batching resources. Although, I am
not sure if this would benefit any usage model.

If we agree the pool initialization is the best place to allocate the batching
resources, then I will make this change in v4.

Thanks,
Kanchana

> 
> >
> > Thanks,
> > Kanchana
> >
> > >
> > > Otherwise it looks good to me.
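
A rough sketch, purely for illustration, of the lazy scheme described above.
zswap_pool_can_batch() is named in the thread, but ZSWAP_BATCH_SIZE and the exact
shape are assumptions, and the reply that follows argues this complexity may not
be warranted:

static bool zswap_pool_can_batch(struct zswap_pool *pool)
{
        struct crypto_acomp_ctx *ctx;

        /* Patch 10 tolerates a failed percpu allocation; check for it here. */
        if (!pool->acomp_batch_ctx)
                return false;

        ctx = raw_cpu_ptr(pool->acomp_batch_ctx);
        if (ctx->nr_reqs)       /* this CPU already has batching resources */
                return true;

        /* First large-folio store on this CPU: allocate on demand. */
        return !zswap_create_acomp_ctx(raw_smp_processor_id(), ctx,
                                       pool->tfm_name, ZSWAP_BATCH_SIZE);
}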


* Re: [PATCH v3 09/13] mm: zswap: Modify struct crypto_acomp_ctx to be configurable in nr of acomp_reqs.
  2024-11-08 21:39         ` Sridhar, Kanchana P
@ 2024-11-08 22:54           ` Yosry Ahmed
  2024-11-09  1:03             ` Sridhar, Kanchana P
  0 siblings, 1 reply; 38+ messages in thread
From: Yosry Ahmed @ 2024-11-08 22:54 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: Johannes Weiner, linux-kernel, linux-mm, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, Huang, Ying, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, Feghali, Wajdi K, Gopal, Vinodh

[..]
> > > >
> > > > There are no other callers to these functions. Just do the work
> > > > directly in the cpu callbacks here like it used to be.
> > >
> > > There will be other callers to zswap_create_acomp_ctx() and
> > > zswap_delete_acomp_ctx() in patches 10 and 11 of this series, when the
> > > per-cpu "acomp_batch_ctx" is introduced in struct zswap_pool. I was trying
> > > to modularize the code first, so as to split the changes into smaller commits.
> > >
> > > The per-cpu "acomp_batch_ctx" resources are allocated in patch 11 in the
> > > "zswap_pool_can_batch()" function, that allocates batching resources
> > > for this cpu. This was to address Yosry's earlier comment about minimizing
> > > the memory footprint cost of batching.
> > >
> > > The way I decided to do this is by reusing the code that allocates the de-
> > facto
> > > pool->acomp_ctx for the selected compressor for all cpu's in
> > zswap_pool_create().
> > > However, I did not want to add the acomp_batch_ctx multiple reqs/buffers
> > > allocation to the cpuhp_state_add_instance() code path which would incur
> > the
> > > memory cost on all cpu's.
> > >
> > > Instead, the approach I chose to follow is to allocate the batching resources
> > > in patch 11 only as needed, on "a given cpu" that has to store a large folio.
> > Hope
> > > this explains the purpose of the modularization better.
> > >
> > > Other ideas towards accomplishing this are very welcome.
> >
> > If we remove the sysctl as suggested by Johannes, then we can just
> > allocate the number of buffers based on the compressor and whether it
> > supports batching during the pool initialization in the cpu callbacks
> > only.
> >
> > Right?
>
> Yes, we could do that if the sysctl is removed, as suggested by Johannes.
> The only "drawback" of allocating the batching resources (assuming the
> compressor allows batching) would be that the memory footprint penalty
> would be incurred on every cpu. I was trying to further economize this
> cost based on whether a given cpu actually needs to zswap_store() a
> large folio, and only then allocate the batching resources. Although, I am
> not sure if this would benefit any usage model.
>
> If we agree the pool initialization is the best place to allocate the batching
> resources, then I will make this change in v4.

IIUC the additional cost would apply if someone wants to use
deflate-iaa on hardware that supports batching but does not want to
use batching. I don't think catering to such a use case warrants the
complexity in advance, not until we have a user that genuinely cares.



* RE: [PATCH v3 09/13] mm: zswap: Modify struct crypto_acomp_ctx to be configurable in nr of acomp_reqs.
  2024-11-08 22:54           ` Yosry Ahmed
@ 2024-11-09  1:03             ` Sridhar, Kanchana P
  0 siblings, 0 replies; 38+ messages in thread
From: Sridhar, Kanchana P @ 2024-11-09  1:03 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Johannes Weiner, linux-kernel, linux-mm, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, Huang, Ying, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P


> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Friday, November 8, 2024 2:54 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> linux-crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; zanussi@kernel.org; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v3 09/13] mm: zswap: Modify struct crypto_acomp_ctx
> to be configurable in nr of acomp_reqs.
> 
> [..]
> > > > >
> > > > > There are no other callers to these functions. Just do the work
> > > > > directly in the cpu callbacks here like it used to be.
> > > >
> > > > There will be other callers to zswap_create_acomp_ctx() and
> > > > zswap_delete_acomp_ctx() in patches 10 and 11 of this series, when the
> > > > per-cpu "acomp_batch_ctx" is introduced in struct zswap_pool. I was
> trying
> > > > to modularize the code first, so as to split the changes into smaller
> commits.
> > > >
> > > > The per-cpu "acomp_batch_ctx" resources are allocated in patch 11 in
> the
> > > > "zswap_pool_can_batch()" function, that allocates batching resources
> > > > for this cpu. This was to address Yosry's earlier comment about
> minimizing
> > > > the memory footprint cost of batching.
> > > >
> > > > The way I decided to do this is by reusing the code that allocates the de-
> > > facto
> > > > pool->acomp_ctx for the selected compressor for all cpu's in
> > > zswap_pool_create().
> > > > However, I did not want to add the acomp_batch_ctx multiple
> reqs/buffers
> > > > allocation to the cpuhp_state_add_instance() code path which would
> incur
> > > the
> > > > memory cost on all cpu's.
> > > >
> > > > Instead, the approach I chose to follow is to allocate the batching
> resources
> > > > in patch 11 only as needed, on "a given cpu" that has to store a large
> folio.
> > > Hope
> > > > this explains the purpose of the modularization better.
> > > >
> > > > Other ideas towards accomplishing this are very welcome.
> > >
> > > If we remove the sysctl as suggested by Johannes, then we can just
> > > allocate the number of buffers based on the compressor and whether it
> > > supports batching during the pool initialization in the cpu callbacks
> > > only.
> > >
> > > Right?
> >
> > Yes, we could do that if the sysctl is removed, as suggested by Johannes.
> > The only "drawback" of allocating the batching resources (assuming the
> > compressor allows batching) would be that the memory footprint penalty
> > would be incurred on every cpu. I was trying to further economize this
> > cost based on whether a given cpu actually needs to zswap_store() a
> > large folio, and only then allocate the batching resources. Although, I am
> > not sure if this would benefit any usage model.
> >
> > If we agree the pool initialization is the best place to allocate the batching
> > resources, then I will make this change in v4.
> 
> IIUC the additional cost would apply if someone wants to use
> deflate-iaa on hardware that supports batching but does not want to
> use batching. I don't think catering to such a use case warrants the
> complexity in advance, not until we have a user that genuinely cares.

Sure, this makes sense. I will address this in v4.

Thanks,
Kanchana
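
With the sysctl gone and the requests/buffers sized at pool creation, the store
path no longer needs a tunable at all. A minimal sketch (helper name and placement
are assumptions) of how zswap_store() could decide, given the nr_reqs field added
in patch 9:

/* Sketch: batching is implied by how the per-CPU context was sized. */
static bool zswap_batching_enabled(struct crypto_acomp_ctx *acomp_ctx,
                                   struct folio *folio)
{
        return acomp_ctx->nr_reqs > 1 && folio_nr_pages(folio) > 1;
}

zswap_store() would call this on the per-CPU context it already looks up, taking
the batching path of patch 13 when it returns true and the existing per-page loop
otherwise.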


* RE: [PATCH v3 10/13] mm: zswap: Add a per-cpu "acomp_batch_ctx" to struct zswap_pool.
  2024-11-08 20:23   ` Yosry Ahmed
@ 2024-11-09  1:04     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 38+ messages in thread
From: Sridhar, Kanchana P @ 2024-11-09  1:04 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, Huang, Ying, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P


> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Friday, November 8, 2024 12:23 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> linux-crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; zanussi@kernel.org; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v3 10/13] mm: zswap: Add a per-cpu "acomp_batch_ctx"
> to struct zswap_pool.
> 
> On Wed, Nov 6, 2024 at 11:21 AM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > This patch adds a separate per-cpu batching acomp context
> "acomp_batch_ctx"
> > to the zswap_pool. The per-cpu acomp_batch_ctx pointer is allocated at
> pool
> > creation time, but no per-cpu resources are allocated for it.
> >
> > The idea is to not incur the memory footprint cost of multiple acomp_reqs
> > and buffers in the existing "acomp_ctx" for cases where compress batching
> > is not possible; for instance, with software compressor algorithms, on
> > systems without IAA, or on systems with IAA that want to run the existing
> > non-batching implementation of zswap_store() of large folios.
> >
> > By creating a separate acomp_batch_ctx, we have the ability to allocate
> > additional memory per-cpu only if the zswap compressor supports batching,
> > and if the user wants to enable the use of compress batching in
> > zswap_store() to improve swapout performance of large folios.
> >
> > Suggested-by: Yosry Ahmed <yosryahmed@google.com>
> 
> This isn't needed if the sysctl is removed and we just allocate the
> number of buffers during pool initialization, right? Same for the next
> patch.

That's correct.

Thanks,
Kanchana

> 
> 
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/zswap.c | 22 +++++++++++++++++++++-
> >  1 file changed, 21 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 02e031122fdf..80a928cf0f7e 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -160,6 +160,7 @@ struct crypto_acomp_ctx {
> >  struct zswap_pool {
> >         struct zpool *zpool;
> >         struct crypto_acomp_ctx __percpu *acomp_ctx;
> > +       struct crypto_acomp_ctx __percpu *acomp_batch_ctx;
> >         struct percpu_ref ref;
> >         struct list_head list;
> >         struct work_struct release_work;
> > @@ -287,10 +288,14 @@ static struct zswap_pool
> *zswap_pool_create(char *type, char *compressor)
> >
> >         pool->acomp_ctx = alloc_percpu(*pool->acomp_ctx);
> >         if (!pool->acomp_ctx) {
> > -               pr_err("percpu alloc failed\n");
> > +               pr_err("percpu acomp_ctx alloc failed\n");
> >                 goto error;
> >         }
> >
> > +       pool->acomp_batch_ctx = alloc_percpu(*pool->acomp_batch_ctx);
> > +       if (!pool->acomp_batch_ctx)
> > +               pr_err("percpu acomp_batch_ctx alloc failed\n");
> > +
> >         ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> >                                        &pool->node);
> >         if (ret)
> > @@ -312,6 +317,8 @@ static struct zswap_pool *zswap_pool_create(char
> *type, char *compressor)
> >  ref_fail:
> >         cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> &pool->node);
> >  error:
> > +       if (pool->acomp_batch_ctx)
> > +               free_percpu(pool->acomp_batch_ctx);
> >         if (pool->acomp_ctx)
> >                 free_percpu(pool->acomp_ctx);
> >         if (pool->zpool)
> > @@ -368,6 +375,8 @@ static void zswap_pool_destroy(struct zswap_pool
> *pool)
> >
> >         cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> &pool->node);
> >         free_percpu(pool->acomp_ctx);
> > +       if (pool->acomp_batch_ctx)
> > +               free_percpu(pool->acomp_batch_ctx);
> >
> >         zpool_destroy_pool(pool->zpool);
> >         kfree(pool);
> > @@ -930,6 +939,11 @@ static int zswap_cpu_comp_prepare(unsigned int
> cpu, struct hlist_node *node)
> >         struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> >         struct crypto_acomp_ctx *acomp_ctx;
> >
> > +       if (pool->acomp_batch_ctx) {
> > +               acomp_ctx = per_cpu_ptr(pool->acomp_batch_ctx, cpu);
> > +               acomp_ctx->nr_reqs = 0;
> > +       }
> > +
> >         acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> >         return zswap_create_acomp_ctx(cpu, acomp_ctx, pool->tfm_name, 1);
> >  }
> > @@ -939,6 +953,12 @@ static int zswap_cpu_comp_dead(unsigned int
> cpu, struct hlist_node *node)
> >         struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> >         struct crypto_acomp_ctx *acomp_ctx;
> >
> > +       if (pool->acomp_batch_ctx) {
> > +               acomp_ctx = per_cpu_ptr(pool->acomp_batch_ctx, cpu);
> > +               if (!IS_ERR_OR_NULL(acomp_ctx) && (acomp_ctx->nr_reqs > 0))
> > +                       zswap_delete_acomp_ctx(acomp_ctx);
> > +       }
> > +
> >         acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> >         zswap_delete_acomp_ctx(acomp_ctx);
> >
> > --
> > 2.27.0
> >


* RE: [PATCH v3 12/13] mm: Add sysctl vm.compress-batching switch for compress batching during swapout.
  2024-11-08 20:23     ` Yosry Ahmed
@ 2024-11-09  1:05       ` Sridhar, Kanchana P
  0 siblings, 0 replies; 38+ messages in thread
From: Sridhar, Kanchana P @ 2024-11-09  1:05 UTC (permalink / raw)
  To: Yosry Ahmed, Johannes Weiner
  Cc: linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642,
	ryan.roberts, Huang, Ying, 21cnbao, akpm, linux-crypto, herbert,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	zanussi, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Friday, November 8, 2024 12:24 PM
> To: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> 21cnbao@gmail.com; akpm@linux-foundation.org; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; zanussi@kernel.org; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v3 12/13] mm: Add sysctl vm.compress-batching switch
> for compress batching during swapout.
> 
> On Thu, Nov 7, 2024 at 9:34 AM Johannes Weiner <hannes@cmpxchg.org>
> wrote:
> >
> > On Wed, Nov 06, 2024 at 11:21:04AM -0800, Kanchana P Sridhar wrote:
> > > The sysctl vm.compress-batching parameter is 0 by default. If the platform
> > > has Intel IAA, the user can run experiments with IAA compress batching of
> > > large folios in zswap_store() as follows:
> > >
> > > sysctl vm.compress-batching=1
> > > echo deflate-iaa > /sys/module/zswap/parameters/compressor
> >
> > A sysctl seems uncalled for. Can't the batching code be gated on
> > deflate-iaa being the compressor? It can still be generalized later if
> > another compressor is shown to benefit from batching.
> 
> +1

Thanks Yosry & Johannes. Will proceed as suggested.

Thanks,
Kanchana


Thread overview: 38+ messages
2024-11-06 19:20 [PATCH v3 00/13] zswap IAA compress batching Kanchana P Sridhar
2024-11-06 19:20 ` [PATCH v3 01/13] crypto: acomp - Define two new interfaces for compress/decompress batching Kanchana P Sridhar
2024-11-06 19:20 ` [PATCH v3 02/13] crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable async mode Kanchana P Sridhar
2024-11-06 19:20 ` [PATCH v3 03/13] crypto: iaa - Implement compress/decompress batching API in iaa_crypto Kanchana P Sridhar
2024-11-06 19:20 ` [PATCH v3 04/13] crypto: iaa - Make async mode the default Kanchana P Sridhar
2024-11-06 19:20 ` [PATCH v3 05/13] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
2024-11-06 19:20 ` [PATCH v3 06/13] crypto: iaa - Change cpu-to-iaa mappings to evenly balance cores to IAAs Kanchana P Sridhar
2024-11-06 19:20 ` [PATCH v3 07/13] crypto: iaa - Distribute compress jobs to all IAA devices on a NUMA node Kanchana P Sridhar
2024-11-06 19:21 ` [PATCH v3 08/13] mm: zswap: acomp_ctx mutex lock/unlock optimizations Kanchana P Sridhar
2024-11-08 20:14   ` Yosry Ahmed
2024-11-08 21:34     ` Sridhar, Kanchana P
2024-11-06 19:21 ` [PATCH v3 09/13] mm: zswap: Modify struct crypto_acomp_ctx to be configurable in nr of acomp_reqs Kanchana P Sridhar
2024-11-07 17:20   ` Johannes Weiner
2024-11-07 22:21     ` Sridhar, Kanchana P
2024-11-08 20:22       ` Yosry Ahmed
2024-11-08 21:39         ` Sridhar, Kanchana P
2024-11-08 22:54           ` Yosry Ahmed
2024-11-09  1:03             ` Sridhar, Kanchana P
2024-11-06 19:21 ` [PATCH v3 10/13] mm: zswap: Add a per-cpu "acomp_batch_ctx" to struct zswap_pool Kanchana P Sridhar
2024-11-08 20:23   ` Yosry Ahmed
2024-11-09  1:04     ` Sridhar, Kanchana P
2024-11-06 19:21 ` [PATCH v3 11/13] mm: zswap: Allocate acomp_batch_ctx resources for a given zswap_pool Kanchana P Sridhar
2024-11-07 17:31   ` Johannes Weiner
2024-11-07 22:22     ` Sridhar, Kanchana P
2024-11-06 19:21 ` [PATCH v3 12/13] mm: Add sysctl vm.compress-batching switch for compress batching during swapout Kanchana P Sridhar
2024-11-06 20:17   ` Andrew Morton
2024-11-06 20:39     ` Sridhar, Kanchana P
2024-11-07 17:34   ` Johannes Weiner
2024-11-07 22:24     ` Sridhar, Kanchana P
2024-11-08 20:23     ` Yosry Ahmed
2024-11-09  1:05       ` Sridhar, Kanchana P
2024-11-06 19:21 ` [PATCH v3 13/13] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios Kanchana P Sridhar
2024-11-07 18:16   ` Johannes Weiner
2024-11-07 22:32     ` Sridhar, Kanchana P
2024-11-07 18:53   ` Johannes Weiner
2024-11-07 22:50     ` Sridhar, Kanchana P
2024-11-06 20:25 ` [PATCH v3 00/13] zswap IAA compress batching Andrew Morton
2024-11-06 20:44   ` Sridhar, Kanchana P
