* [PATCH v6 00/16] zswap IAA compress batching
@ 2025-02-06  7:20 Kanchana P Sridhar
  2025-02-06  7:20 ` [PATCH v6 01/16] crypto: acomp - Add synchronous/asynchronous acomp request chaining Kanchana P Sridhar
                   ` (16 more replies)
  0 siblings, 17 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-02-06  7:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

IAA Compression Batching:
=========================

This patch-series introduces the use of the Intel Analytics Accelerator
(IAA) for parallel batch compression of pages in large folios to improve
zswap swapout latency.

Improvements seen with IAA compress batching vs. IAA sequential:

 usemem30 with 64K folios:
 -------------------------
  59.1% higher throughput
  30.3% lower elapsed time
  36.2% lower sys time 

 usemem30 with 2M folios:
 ------------------------
  60.2% higher throughput
  26.7% lower elapsed time
  30.5% lower sys time 

There is no performance impact to zstd with v6.

The major focus for v6 was to fix the performance regressions observed in
v5, highlighted by Yosry (Thanks Yosry):

 1) zstd performance regression.
 2) IAA batching vs. IAA non-batching regression.

The patch-series is organized as follows:

 1) crypto acomp & iaa_crypto driver enablers for batching: Relevant
    patches are tagged with "crypto:" in the subject:

    Patch 1) Adds a new acomp request chaining framework and interfaces based
             on Herbert Xu's ahash reference implementation in "[PATCH 2/6]
             crypto: hash - Add request chaining API" [1]. acomp algorithms
             can use request chaining through these interfaces:

             Setup the request chain:
               acomp_reqchain_init()
               acomp_request_chain()

             Process the request chain:
               acomp_do_req_chain(): synchronously (sequentially)
               acomp_do_async_req_chain(): asynchronously using submit/poll
                                           ops (in parallel)

    Patch 2) Adds acomp_alg/crypto_acomp interfaces for batch_compress(),
             batch_decompress() and get_batch_size(), which swap modules can
             invoke through the new batching API crypto_acomp_batch_compress(),
             crypto_acomp_batch_decompress() and crypto_acomp_batch_size().
             Additionally, crypto acomp provides a new
             acomp_has_async_batching() interface to query for these APIs
             before allocating batching resources for a given compressor in
             zswap/zram.

    Patch 3) New CRYPTO_ACOMP_REQ_POLL acomp_req flag to act as a gate for
             async poll mode in iaa_crypto.

    Patch 4) Adds iaa_crypto driver implementations of sync/async
             crypto_acomp_batch_compress() and
             crypto_acomp_batch_decompress(), developed using request
             chaining. If the iaa_crypto driver is set up for 'async'
             sync_mode, these batching implementations use asynchronous
             request chaining. 'async' is the recommended mode for
             realizing the benefits of IAA parallelism. If iaa_crypto is
             set up for 'sync' sync_mode, the synchronous version of the
             request chaining API is used.
             
             The "iaa_acomp_fixed_deflate" algorithm registers these
             implementations for its "batch_compress" and "batch_decompress"
             interfaces respectively and opts in with CRYPTO_ALG_REQ_CHAIN.
             Further, iaa_crypto provides an implementation for the
             "get_batch_size" interface: this returns the
             IAA_CRYPTO_MAX_BATCH_SIZE constant that iaa_crypto defines
             currently as 8U for IAA compression algorithms (iaa_crypto can
             change this if needed as we optimize our batching algorithm).

    Patch 5) Modifies the default iaa_crypto driver mode to async, now that
             iaa_crypto provides a truly async mode that gives
             significantly better latency than sync mode for the batching
             use case.

    Patch 6) Disables verify_compress by default, making it easier for
             users to run IAA for comparison with software compressors.

    Patch 7) Reorganizes the iaa_crypto driver code into logically related
             sections and avoids forward declarations, in order to facilitate
             Patch 8. This patch makes no functional changes.

    Patch 8) Makes a major infrastructure change in the iaa_crypto driver,
             to map IAA devices/work-queues to cores based on packages
             instead of NUMA nodes. This doesn't impact performance on
             the Sapphire Rapids system used for performance
             testing. However, this change fixes functional problems we
             found on Granite Rapids in internal validation, where the
             number of NUMA nodes is greater than the number of packages:
             this was resulting in over-utilization of some IAA devices
             and non-usage of others under the existing NUMA-based
             mapping infrastructure.
             This patch also eliminates duplication of device wqs in
             per-cpu wq_tables, thereby saving 140MiB on a 384-core
             Granite Rapids server with 8 IAAs. Submitting this change now
             so that it can go through code review before being merged.

    Patch 9) Builds upon the new infrastructure for mapping IAAs to cores
             based on packages, and enables configuring a "global_wq" per
             IAA, which can be used as a global resource for compress jobs
             for the package. If the user configures 2 WQs per IAA device,
             the driver will distribute compress jobs from all cores on the
             package to the "global_wqs" of all the IAA devices on that
             package, in a round-robin manner. This can be used to improve
             compression throughput for workloads that see a lot of swapout
             activity.

   Patch 10) Makes an important change to the iaa_crypto driver's descriptor
             allocation, from blocking to non-blocking with
             retries/timeouts, and mitigations in case of timeouts during
             compress/decompress ops. This prevents tasks from getting
             blocked indefinitely, which was observed when 30 cores were
             running workloads with only 1 of the 4 IAA devices enabled on
             Sapphire Rapids. These timeouts, and the associated
             mitigations, are typically encountered only in configurations
             with 1 IAA device shared by 30+ cores.

   Patch 11) Fixes a bug with the "deflate_generic_tfm" global being
             accessed without locks in the software decomp fallback code.

 2) zswap modifications to enable compress batching in zswap_store()
    of large folios (including pmd-mappable folios):

   Patch 12) Defines a zswap-specific ZSWAP_MAX_BATCH_SIZE (currently set
             as 8U) to denote the maximum number of acomp_ctx batching
             resources. Further, the "struct crypto_acomp_ctx" is modified
             to contain a configurable number of acomp_reqs and buffers.
             The cpu hotplug onlining code will allocate up to
             ZSWAP_MAX_BATCH_SIZE requests/buffers in the per-cpu
             acomp_ctx, thereby limiting the memory usage in zswap, and
             ensuring that non-batching compressors incur no memory penalty.

   Patch 13) Restructures & simplifies zswap_store() to make it amenable
             for batching. Moves the loop over the folio's pages to a new
             zswap_store_folio(), which in turn allocates zswap entries
             for all folio pages upfront, then calls zswap_compress() for
             each folio page.

   Patch 14) Introduces zswap_compress_folio() to compress all pages in a
             folio. 

   Patch 15) Modifies zswap_compress_folio() to detect if the compressor
             supports batching. If so, "acomp_ctx->nr_reqs" becomes the
             batch size with which crypto_acomp_batch_compress() is called
             to compress multiple folio pages in parallel on the IAA
             hardware. Upon successful compression of a batch, the
             compressed buffers are stored in zpool.

             For compressors that don't support batching,
             zswap_compress_folio() calls zswap_compress() for each page
             in the folio. (An illustrative sketch of this branching
             follows at the end of this patch list.)

             Although this yields significantly better IAA batching
             performance/throughput, it also introduced a significant
             performance regression with zstd/2M folios, which is fixed
             in Patch 16.

             Based on the discussions in [2], patch 15 invokes
             crypto_acomp_batch_compress() with "NULL" for the @wait
             parameter. This causes iaa_crypto's iaa_comp_acompress_batch()
             to use asynchronous polling instead of async request chaining
             for now, until there is better clarity on request
             chaining. Further, testing with micro-benchmarks indicated a
             slight increase in latency with request chaining:

             crypto_acomp_batch_compress()  p05 (ns)   p50 (ns)  p99 (ns)
             -------------------------------------------------------------
             async polling                   5,279      5,589     8,875
             async request chaining          5,316      5,662     8,923
             -------------------------------------------------------------

   Patch 16) Fixes the zstd 2M regression. With this patch, there are no
             regressions with zstd, and impressive throughput/performance
             improvements with IAA batching vs. no-batching.
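
 As an illustration of the branching described in Patches 14-16, here is a
 minimal sketch of zswap_compress_folio(). The helper names (e.g.,
 zswap_batch_compress()) and exact signatures are illustrative placeholders
 and may not match the final code in these patches:

     /*
      * Illustrative sketch only: compress all pages of a folio, using the
      * compressor's batching API when the per-cpu acomp_ctx was set up
      * with more than one request (i.e., the compressor supports batching).
      */
     static bool zswap_compress_folio(struct folio *folio,
                                      struct zswap_entry *entries[],
                                      struct zswap_pool *pool)
     {
             struct crypto_acomp_ctx *acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
             long nr_pages = folio_nr_pages(folio);
             long index;

             /* Batching compressor: compress up to nr_reqs pages per call. */
             if (acomp_ctx->nr_reqs > 1)
                     return zswap_batch_compress(folio, nr_pages, entries,
                                                 pool, acomp_ctx);

             /* Non-batching compressor: one page at a time, as before. */
             for (index = 0; index < nr_pages; ++index) {
                     struct page *page = folio_page(folio, index);

                     if (!zswap_compress(page, entries[index], pool))
                             return false;
             }

             return true;
     }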


With v6 of this patch series, the IAA compress batching feature will be
enabled seamlessly on Intel platforms that have IAA by selecting
'deflate-iaa' as the zswap compressor, and using the iaa_crypto 'async'
sync_mode driver attribute.

[1]: https://lore.kernel.org/linux-crypto/677614fbdc70b31df2e26483c8d2cd1510c8af91.1730021644.git.herbert@gondor.apana.org.au/
[2]: https://patchwork.kernel.org/project/linux-mm/patch/20241221063119.29140-3-kanchana.p.sridhar@intel.com/


System setup for testing:
=========================
Testing of this patch-series was done with mm-unstable as of 2-1-2025,
commit 7de6fd8ab650, without and with this patch-series.
Data was gathered on a dual-socket Intel Sapphire Rapids (SPR) server with
56 cores per socket, 4 IAA devices per socket, 503 GiB RAM and a 525G SSD
disk partition used as swap. Core frequency was fixed at 2500MHz.

Other kernel configuration parameters:

    zswap compressor  : zstd, deflate-iaa
    zswap allocator   : zsmalloc
    vm.page-cluster   : 0

IAA "compression verification" is disabled and IAA is run in the async
mode (the defaults with this series).

I ran experiments with these workloads:

1) usemem, 30 processes, with these large folio sizes set to "always":
   - 64k
   - 2048k

2) Kernel compilation allmodconfig with 2G max memory, 32 threads, run in
   tmpfs, with these large folio sizes set to "always":
   - 64k

usemem30 and kernel compilation used different IAA WQs configurations:

  usemem30 IAA WQ configuration:
  ------------------------------
  1 WQ with 128 entries per device. Compress/decompress jobs are sent to
  the same WQ and IAA that is mapped to the cores. There is very little
  swapin activity in this workload, and allocating 2 WQs (one for decomps,
  one for comps, each with 64 entries) degrades compress batching latency.
  This IAA WQ configuration explains the insignificant performance gains
  seen with IAA batching in v5; with it, batching once again delivers the
  expected performance improvements.

  Kernel compilation IAA WQ configuration:
  ----------------------------------------
  2 WQs, with 64 entries each, are configured per IAA device. Compress jobs
  from all cores on a socket are distributed among all 4 IAA devices on the
  same socket.


Performance testing (usemem30):
===============================
The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
processes were run, each allocating and writing 10G of memory, then sleeping
for 10 sec before exiting:

usemem --init-time -w -O -b 1 -s 10 -n 30 10g

One important difference in v6's experiments is that the 30 usemem
processes are pinned to 30 consecutive cores on socket 0, which makes use
of the IAA devices only on socket 0.


 64K folios: usemem30: deflate-iaa:
 ==================================

 -------------------------------------------------------------------------------
                 mm-unstable-2-1-2025            v6             v6
 -------------------------------------------------------------------------------
 zswap compressor         deflate-iaa   deflate-iaa    deflate-iaa  IAA Batching
                                                                        vs.
                                                                      Sequential
 -------------------------------------------------------------------------------
 Total throughput (KB/s)    6,039,595     9,679,965      9,537,327       59.1%      
 Avg throughput (KB/s)        201,319       322,665        317,910                  
 elapsed time (sec)            100.74         69.05          71.43      -30.3%      
 sys time (sec)              2,446.53      1,526.71       1,596.23      -36.2%      
                                                                                    
 -------------------------------------------------------------------------------
 memcg_high                   909,501       961,527        964,010                  
 memcg_swap_fail                1,580           733          2,393                  
 zswpout                   58,342,295    61,542,432     61,715,737                  
 zswpin                           425            80            442                  
 pswpout                            0             0              0                  
 pswpin                             0             0              0                  
 thp_swpout                         0             0              0                  
 thp_swpout_fallback                0             0              0                  
 64kB_swpout_fallback           1,580           733          2,393                  
 pgmajfault                     3,311         2,860          3,220                  
 anon_fault_alloc_64kB      4,924,571     4,924,545      4,924,104                 
 ZSWPOUT-64kB               3,644,769     3,845,652      3,854,791 
 SWPOUT-64kB                        0             0              0 
 -------------------------------------------------------------------------------


 2M folios: usemem30: deflate-iaa:
 =================================


 -------------------------------------------------------------------------------
                 mm-unstable-2-1-2025             v6             v6
 -------------------------------------------------------------------------------
 zswap compressor         deflate-iaa    deflate-iaa    deflate-iaa IAA Batching
                                                                        vs.
                                                                     Sequential
 -------------------------------------------------------------------------------
 Total throughput (KB/s)     6,334,585     10,068,264    10,230,633      60.2%     
 Avg throughput (KB/s)         211,152        335,608       341,021                
 elapsed time (sec)              87.68          65.74         62.86     -26.7%     
 sys time (sec)               2,031.84       1,454.93      1,370.87     -30.5%     
                                                                                   
 -------------------------------------------------------------------------------
 memcg_high                    115,322        121,226       120,093                
 memcg_swap_fail                   568            350           301                
 zswpout                   559,323,303     62,474,427    61,907,590                
 zswpin                            518            463            14                
 pswpout                             0              0             0                
 pswpin                              0              0             0                
 thp_swpout                          0              0             0                
 thp_swpout_fallback               568            350           301                
 pgmajfault                      3,298          3,247         2,826                
 anon_fault_alloc_2048kB       153,734        153,734       153,737               
 ZSWPOUT-2048kB                115,321        121,672       120,614           
 SWPOUT-2048kB                       0              0             0           
 -------------------------------------------------------------------------------


 64K folios: usemem30: zstd:
 ===========================

 -------------------------------------------------------------------------------
                mm-unstable-2-1-2025            v6            v6           v6
                                          Patch 15      Patch 16     Patch 16
                                            (regr)       (fixed)      (fixed)
 -------------------------------------------------------------------------------
 zswap compressor               zstd          zstd          zstd         zstd
 -------------------------------------------------------------------------------
 Total throughput (KB/s)   6,929,741     6,975,265     7,003,546    6,953,025
 Avg throughput (KB/s)       230,991       232,508       233,451      231,767
 elapsed time (sec)            88.59         87.32         87.45        88.57
 sys time (sec)             2,188.83      2,136.52      2,133.41     2,178.23

 -------------------------------------------------------------------------------
 memcg_high                  764,423       764,174       764,420      764,476
 memcg_swap_fail               1,236            15         1,234           16
 zswpout                  48,928,758    48,908,998    48,928,536   48,928,551
 zswpin                          421            68           396          100
 pswpout                           0             0             0            0
 pswpin                            0             0             0            0
 thp_swpout                        0             0             0            0
 thp_swpout_fallback               0             0             0            0
 64kB_swpout_fallback          1,236            15         1,234           16
 pgmajfault                    3,196         2,875         3,570        3,284
 anon_fault_alloc_64kB     4,924,288     4,924,406     4,924,161    4,924,064
 ZSWPOUT-64kB              3,056,753     3,056,772     3,056,745    3,057,979
 SWPOUT-64kB                       0             0             0            0
 -------------------------------------------------------------------------------


 2M folios: usemem30: zstd:
 ==========================

 -------------------------------------------------------------------------------
                mm-unstable-2-1-2025            v6            v6           v6
                                          Patch 15      Patch 16     Patch 16
                                            (regr)       (fixed)      (fixed)
 -------------------------------------------------------------------------------
 zswap compressor               zstd          zstd          zstd         zstd
 -------------------------------------------------------------------------------
 Total throughput (KB/s)   7,712,462     7,235,682     7,716,994    7,745,378
 Avg throughput (KB/s)       257,082       241,189       257,233      258,179
 elapsed time (sec)            84.94         89.54         86.96        85.82
 sys time (sec)             2,008.19      2,141.90      2,059.80     2,039.96
                                                                             
 -------------------------------------------------------------------------------
 memcg_high                   93,036        94,792        93,137       93,100
 memcg_swap_fail                 143           169            32           11
 zswpout                  48,062,240    48,929,604    48,113,722   48,073,739
 zswpin                          439           438            71            9
 pswpout                           0             0             0            0
 pswpin                            0             0             0            0
 thp_swpout                        0             0             0            0
 thp_swpout_fallback             143           169            32           11
 pgmajfault                    3,246         3,645         3,248        2,775
 anon_fault_alloc_2048kB     153,739       153,738       153,740      153,733
 ZSWPOUT-2048kB               93,726        95,398        93,940       93,883
 SWPOUT-2048kB                     0             0             0            0
 -------------------------------------------------------------------------------

 zstd 2M regression fix details:
 -------------------------------
 Patch 16 essentially adapts the batching implementation of
 zswap_store_folio() to be sequential, i.e., so that its behavior matches
 the earlier zswap_store_page() iteration over the folio's pages, while
 preserving common code paths.

 I wasn't able to quantify why the Patch 15 implementation caused
 the zstd regression with the usual profiling methods such as
 tracepoints/bpftrace. My best hypothesis as to why Patch 16 resolves the
 regression is that it is due to a combination of branch mispredicts and
 the working set of the zswap_store_folio() code blocks having to load
 and iterate over 512 pages in the 3 loops. I gathered perf event counts
 that seem to back up this hypothesis:


 -------------------------------------------------------------------------------
 usemem30, zstd,              v6 Patch 15          v6 Patch 16        Change in
 2M Folios,            zstd 2M regression           Fixes zstd       PMU events
 perf stats                                      2M regression         with fix
 -------------------------------------------------------------------------------
 branch-misses              1,571,192,128        1,545,342,571       -25,849,557
 L1-dcache-stores       1,211,615,528,323    1,190,695,049,961   -20,920,478,362
 L1-dcache-loads        3,357,273,843,074    3,307,817,975,881   -49,455,867,193
 LLC-store-misses           3,357,428,475        3,340,252,023       -17,176,452
 branch-load-misses         1,567,824,197        1,546,321,034       -21,503,163
 branch-loads           1,463,632,526,371    1,449,551,102,173   -14,081,424,198
 mem-stores             1,211,399,592,024    1,191,473,855,029   -19,925,736,995
 dTLB-loads             3,367,449,558,533    3,308,475,712,698   -58,973,845,835
 LLC-loads                  1,867,235,354        1,773,790,017       -93,445,337
 node-load-misses               4,057,323            3,959,741           -97,582
 major-faults                         241                    0              -241
 L1-dcache-load-misses     22,339,515,994       24,381,783,235     2,042,267,241
 L1-icache-load-misses     21,182,690,283       26,504,876,405     5,322,186,122
 LLC-load-misses              224,000,082          258,495,328        34,495,246
 node-loads                   221,425,627          256,372,686        34,947,059
 mem-loads                              0                    0                 0
 dTLB-load-misses               4,886,686            8,672,079         3,785,393
 iTLB-load-misses               1,548,637            4,268,093         2,719,456
 cache-misses              10,831,533,095       10,834,598,425         3,065,330
 minor-faults                     155,246              155,707               461 
 -------------------------------------------------------------------------------



 4K folios: usemem30: Regression testing:
 ========================================

 -------------------------------------------------------------------------------
                        mm-unstable             v6   mm-unstable           v6
 -------------------------------------------------------------------------------
 zswap compressor       deflate-iaa    deflate-iaa          zstd         zstd   
 -------------------------------------------------------------------------------
 Total throughput (KB/s)  5,155,471      6,031,332     6,453,431    6,566,026
 Avg throughput (KB/s)      171,849        201,044       215,114      218,867
 elapsed time (sec)          108.35          92.61         95.50        88.99
 sys time (sec)            2,400.32       2,212.06      2,417.16     2,207.35
                                                                                
 -------------------------------------------------------------------------------
 memcg_high                 670,635      1,007,763       764,456      764,470
 memcg_swap_fail                  0              0             0            0
 zswpout                 62,098,929     64,507,508    48,928,772   48,928,690
 zswpin                         425             77           457          461
 pswpout                          0              0             0            0
 pswpin                           0              0             0            0
 thp_swpout                       0              0             0            0
 thp_swpout_fallback              0              0             0            0
 pgmajfault                   3,271          2,864         3,240        3,242
 -------------------------------------------------------------------------------


Performance testing (Kernel compilation, allmodconfig):
=======================================================

The kernel compilation experiments build "allmodconfig" with 32 threads in
tmpfs; each build takes ~12 minutes and has considerable swapout/swapin
activity. The cgroup's memory.max is set to 2G.


 64K folios: Kernel compilation/allmodconfig:
 ============================================

 -------------------------------------------------------------------------------
                     mm-unstable               v6   mm-unstable            v6
 -------------------------------------------------------------------------------
 zswap compressor    deflate-iaa      deflate-iaa          zstd          zstd   
 -------------------------------------------------------------------------------
 real_sec                 767.36           743.90        776.08        769.43
 user_sec              15,773.57        15,773.34     15,780.93     15,736.49
 sys_sec                4,209.63         4,013.51      5,392.85      5,046.05
 -------------------------------------------------------------------------------
 Max_Res_Set_Size_KB   1,874,680        1,873,776     1,874,244     1,873,456
 -------------------------------------------------------------------------------
 memcg_high                    0                0             0             0
 memcg_swap_fail               0                0             0             0
 zswpout             109,623,799      110,737,958    89,488,777    81,553,126
 zswpin               33,303,441       33,295,883    26,753,716    23,266,542
 pswpout                     315              151            99           116
 pswpin                       80               54            64            32
 thp_swpout                    0                0             0             0
 thp_swpout_fallback           0                0             0             0
 64kB_swpout_fallback          0              348             0             0
 pgmajfault           35,606,216       35,462,017    28,488,538    24,703,903
 ZSWPOUT-64kB          3,551,578        3,596,675     2,814,435     2,603,649
 SWPOUT-64kB                  19                5             5             7
 -------------------------------------------------------------------------------


With the iaa_crypto driver changes for non-blocking descriptor allocations,
no timeouts-with-mitigations were seen in compress/decompress jobs in any
of the above experiments.


Summary:
========
The performance data from the usemem 30-process and kernel compilation
tests show 60% throughput gains and a 36% sys time reduction (usemem30),
and a 5% sys time reduction (kernel compilation), when zswap_store() of
large folios uses IAA compress batching as compared to IAA sequential
compression. There is no performance regression for zstd.

We can expect even more significant performance and throughput improvements
if we use the parallelism offered by IAA for reclaim batching of 4K/large
folios (really, any-order folios), and use the zswap_store() high-throughput
compression to batch-compress the pages comprising these folios, not just
batching within large folios. This is the reclaim batching patch 13 in v1,
which will be submitted in a separate patch-series.

Our internal validation of IAA compress/decompress batching in highly
contended Sapphire Rapids server setups, with workloads running on 72 cores
for ~25 minutes under stringent memory limit constraints, has shown up to
50% reduction in sys time and 21.3% more memory savings with IAA, as
compared to zstd, for the same performance. IAA batching demonstrates more
than 2X the memory savings obtained by zstd for the same performance.


Changes since v5:
=================
1) Rebased to mm-unstable as of 2-1-2025, commit 7de6fd8ab650.

Several improvements, regression fixes and bug fixes, based on Yosry's
v5 comments (Thanks Yosry!):

2) Fix for zstd performance regression in v5.
3) Performance debug and fix for marginal improvements with IAA batching
   vs. sequential.
4) Performance testing data compares IAA with and without batching, instead
   of IAA batching against zstd.
5) Commit logs/zswap comments no longer mention crypto_acomp implementation
   details.
6) Delete the pr_info_once() when batching resources are allocated in
   zswap_cpu_comp_prepare().
7) Use kcalloc_node() for the multiple acomp_ctx buffers/reqs in
   zswap_cpu_comp_prepare().
8) Simplify and consolidate error handling cleanup code in
   zswap_cpu_comp_prepare().
9) Introduce zswap_compress_folio() in a separate patch.
10) Bug fix in zswap_store_folio(): an xa_store() failure could cause all
    compressed objects and entries to be freed, followed by a UAF when
    zswap_store() tries to free the entries that were already added to the
    xarray prior to the failure.
11) Deleted compressed_bytes/bytes. zswap_store_folio() also takes into
    account the recent fixes in commit bf5eaaaf7941 ("mm/zswap: fix
    inconsistency when zswap_store_page() fails") by Hyeonggon Yoo.

iaa_crypto improvements/fixes/changes:

12) Enables asynchronous mode and makes it the default. Previously, with
    commit 4ebd9a5ca478 ("crypto: iaa - Fix IAA disabling that occurs when
    sync_mode is set to 'async'"), 'async' mode was effectively synchronous.
    We now have true async support.
13) Change idxd descriptor allocations from blocking to non-blocking with
    timeouts, and mitigations for compress/decompress ops that fail to
    obtain a descriptor. This fixes "task blocked" errors seen in
    configurations where 30+ cores run workloads under high memory
    pressure while sending comps/decomps to 1 IAA device.
14) Fixes a bug with unprotected access of "deflate_generic_tfm" in
    deflate_generic_decompress(), which can cause data corruption and
    zswap_decompress() kernel crash.
15) zswap uses crypto_acomp_batch_compress() with async polling instead of
    request chaining for slightly better latency. However, the request
    chaining framework itself is unchanged, preserved from v5.


Changes since v4:
=================
1) Rebased to mm-unstable as of 12-20-2024, commit 5555a83c82d6.
2) Added acomp request chaining, as suggested by Herbert. Thanks Herbert!
3) Implemented IAA compress batching using request chaining.
4) zswap_store() batching simplifications suggested by Chengming, Yosry and
   Nhat, thanks to all!
   - New zswap_compress_folio() that is called by zswap_store().
   - Move the loop over folio's pages out of zswap_store() and into a
     zswap_store_folio() that stores all pages.
   - Allocate all zswap entries for the folio upfront.
   - Added zswap_batch_compress().
   - Branch to call zswap_compress() or zswap_batch_compress() inside
     zswap_compress_folio().
   - All iterations over pages kept in same function level.
   - No helpers other than the newly added zswap_store_folio() and
     zswap_compress_folio().


Changes since v3:
=================
1) Rebased to mm-unstable as of 11-18-2024, commit 5a7056135bb6.
2) Major re-write of iaa_crypto driver's mapping of IAA devices to cores,
   based on packages instead of NUMA nodes.
3) Added acomp_has_async_batching() API to crypto acomp, that allows
   zswap/zram to query if a crypto_acomp has registered batch_compress and
   batch_decompress interfaces.
4) Clear the poll bits on the acomp_reqs passed to
   iaa_comp_a[de]compress_batch() so that a module like zswap can be
   confident about the acomp_reqs[0] not having the poll bit set before
   calling the fully synchronous API crypto_acomp_[de]compress().
   Herbert, I would appreciate it if you can review changes 2-4, in patches
   1-8 of v4. I did not want to introduce too many iaa_crypto changes in
   v4, given that patch 7 is already making a major change. I plan to work
   on incorporating request chaining using the ahash interface in v5
   (I need to understand the basic crypto ahash better). Thanks Herbert!
5) Incorporated Johannes' suggestion to not have a sysctl to enable
   compress batching.
6) Incorporated Yosry's suggestion to allocate batching resources in the
   cpu hotplug onlining code, since there is no longer a sysctl to control
   batching. Thanks Yosry!
7) Incorporated Johannes' suggestions related to making the overall
   sequence of events between zswap_store() and zswap_batch_store() similar
   as much as possible for readability and control flow, better naming of
   procedures, avoiding forward declarations, not inlining error path
   procedures, deleting zswap internal details from zswap.h, etc. Thanks
   Johannes, really appreciate the direction!
   I have tried to explain the minimal future-proofing in terms of the
   zswap_batch_store() signature and the definition of "struct
   zswap_batch_store_sub_batch" in the comments for this struct. I hope the
   new code explains the control flow a bit better.


Changes since v2:
=================
1) Rebased to mm-unstable as of 11-5-2024, commit 7994b7ea6ac8.
2) Fixed an issue in zswap_create_acomp_ctx() with checking for NULL
   returned by kmalloc_node() for acomp_ctx->buffers and for
   acomp_ctx->reqs.
3) Fixed a bug in zswap_pool_can_batch() so that it returns true if
   pool->can_batch_comp is found to be equal to BATCH_COMP_ENABLED, and if
   the per-cpu acomp_batch_ctx has batching resources allocated on this
   cpu. Also, changed from per_cpu_ptr() to raw_cpu_ptr().
4) Incorporated the zswap_store_propagate_errors() compilation warning fix
   suggested by Dan Carpenter. Thanks Dan!
5) Replaced the references to SWAP_CRYPTO_SUB_BATCH_SIZE in comments in
   zswap.h, with SWAP_CRYPTO_BATCH_SIZE.

Changes since v1:
=================
1) Rebased to mm-unstable as of 11-1-2024, commit 5c4cf96cd702.
2) Incorporated Herbert's suggestions to use an acomp_req flag to indicate
   async/poll mode, and to encapsulate the polling functionality in the
   iaa_crypto driver. Thanks Herbert!
3) Incorporated Herbert's and Yosry's suggestions to implement the batching
   API in iaa_crypto and to make its use seamless from zswap's
   perspective. Thanks Herbert and Yosry!
4) Incorporated Yosry's suggestion to make it more convenient for the user
   to enable compress batching, while minimizing the memory footprint
   cost. Thanks Yosry!
5) Incorporated Yosry's suggestion to de-couple the shrink_folio_list()
   reclaim batching patch from this series, since it requires a broader
   discussion.


I would greatly appreciate code review comments for the iaa_crypto driver
and mm patches included in this series!

Thanks,
Kanchana



Kanchana P Sridhar (16):
  crypto: acomp - Add synchronous/asynchronous acomp request chaining.
  crypto: acomp - Define new interfaces for compress/decompress
    batching.
  crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable
    async mode.
  crypto: iaa - Implement batch_compress(), batch_decompress() API in
    iaa_crypto.
  crypto: iaa - Enable async mode and make it the default.
  crypto: iaa - Disable iaa_verify_compress by default.
  crypto: iaa - Re-organize the iaa_crypto driver code.
  crypto: iaa - Map IAA devices/wqs to cores based on packages instead
    of NUMA.
  crypto: iaa - Distribute compress jobs from all cores to all IAAs on a
    package.
  crypto: iaa - Descriptor allocation timeouts with mitigations in
    iaa_crypto.
  crypto: iaa - Fix for "deflate_generic_tfm" global being accessed
    without locks.
  mm: zswap: Allocate pool batching resources if the compressor supports
    batching.
  mm: zswap: Restructure & simplify zswap_store() to make it amenable
    for batching.
  mm: zswap: Introduce zswap_compress_folio() to compress all pages in a
    folio.
  mm: zswap: Compress batching with Intel IAA in zswap_store() of large
    folios.
  mm: zswap: Fix for zstd performance regression with 2M folios.

 .../driver-api/crypto/iaa/iaa-crypto.rst      |   11 +-
 crypto/acompress.c                            |  287 +++
 drivers/crypto/intel/iaa/iaa_crypto.h         |   30 +-
 drivers/crypto/intel/iaa/iaa_crypto_main.c    | 1724 ++++++++++++-----
 include/crypto/acompress.h                    |  157 ++
 include/crypto/algapi.h                       |   10 +
 include/crypto/internal/acompress.h           |   29 +
 include/linux/crypto.h                        |   31 +
 mm/zswap.c                                    |  449 ++++-
 9 files changed, 2170 insertions(+), 558 deletions(-)


base-commit: 7de6fd8ab65003f050aa58e705592745717ed318
-- 
2.27.0




* [PATCH v6 01/16] crypto: acomp - Add synchronous/asynchronous acomp request chaining.
  2025-02-06  7:20 [PATCH v6 00/16] zswap IAA compress batching Kanchana P Sridhar
@ 2025-02-06  7:20 ` Kanchana P Sridhar
  2025-02-06  7:20 ` [PATCH v6 02/16] crypto: acomp - Define new interfaces for compress/decompress batching Kanchana P Sridhar
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-02-06  7:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch is based on Herbert Xu's request chaining for ahash
("[PATCH 2/6] crypto: hash - Add request chaining API") [1]. The generic
framework for request chaining provided in the ahash implementation
has been used as a reference to develop a similar synchronous request
chaining framework for crypto_acomp.

Furthermore, this commit develops an asynchronous request chaining
framework and API that iaa_crypto can use for request chaining with
parallelism, in order to fully benefit from Intel IAA's multiple
compress/decompress engines in hardware. This allows us to gain significant
latency improvements with IAA batching as compared to synchronous request
chaining.

 Usage of acomp request chaining API:
 ====================================

 Any crypto_acomp compressor can avail of request chaining as follows:

 Step 1: Create request chain:

  Request 0 (the first req in the chain):

  void acomp_reqchain_init(struct acomp_req *req,
                           u32 flags, crypto_completion_t compl,
                           void *data);

  Subsequent requests:

  void acomp_request_chain(struct acomp_req *req,
                           struct acomp_req *head);

 Step 2: Process the request chain using the specified compress/decompress
         "op":

  2.a) Synchronous: the chain of requests is processed in series:

       int acomp_do_req_chain(struct acomp_req *req,
                              int (*op)(struct acomp_req *req));

  2.b) Asynchronous: the chain of requests is processed in parallel using a
       submit-poll paradigm:

       int acomp_do_async_req_chain(struct acomp_req *req,
                                    int (*op_submit)(struct acomp_req *req),
                                    int (*op_poll)(struct acomp_req *req));

Request chaining will be used in subsequent patches to implement
compress/decompress batching in the iaa_crypto driver for the two supported
IAA driver sync_modes:

  sync_mode = 'sync' will use (2.a),
  sync_mode = 'async' will use (2.b).
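
 As an illustration (not part of this patch), a driver that has per-request
 submit and poll callbacks could chain and process a batch of pre-initialized
 requests roughly as follows; my_submit()/my_poll() and the reqs[] setup are
 hypothetical:

  #include <crypto/acompress.h>
  #include <linux/crypto.h>

  /*
   * Sketch: chain nr_reqs pre-initialized requests and process them
   * asynchronously; my_submit()/my_poll() are hypothetical driver callbacks.
   */
  static int example_chain_and_process(struct acomp_req *reqs[], int nr_reqs,
                                       int (*my_submit)(struct acomp_req *req),
                                       int (*my_poll)(struct acomp_req *req))
  {
          DECLARE_CRYPTO_WAIT(wait);
          int i;

          /* reqs[0] heads the chain; its completion covers the whole chain. */
          acomp_reqchain_init(reqs[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
                              crypto_req_done, &wait);

          /* Link the remaining requests to the head request. */
          for (i = 1; i < nr_reqs; i++)
                  acomp_request_chain(reqs[i], reqs[0]);

          /* Submit all requests, then poll them to completion in parallel. */
          return acomp_do_async_req_chain(reqs[0], my_submit, my_poll);
  }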

The changes to these files are directly re-used from [1], which is not yet
merged:

include/crypto/algapi.h
include/linux/crypto.h

Hence, I am adding Herbert as the co-developer of this acomp request
chaining patch.

[1]: https://lore.kernel.org/linux-crypto/677614fbdc70b31df2e26483c8d2cd1510c8af91.1730021644.git.herbert@gondor.apana.org.au/

Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Co-developed-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by:
---
 crypto/acompress.c                  | 284 ++++++++++++++++++++++++++++
 include/crypto/acompress.h          |  41 ++++
 include/crypto/algapi.h             |  10 +
 include/crypto/internal/acompress.h |  10 +
 include/linux/crypto.h              |  31 +++
 5 files changed, 376 insertions(+)

diff --git a/crypto/acompress.c b/crypto/acompress.c
index 6fdf0ff9f3c0..cb6444d09dd7 100644
--- a/crypto/acompress.c
+++ b/crypto/acompress.c
@@ -23,6 +23,19 @@ struct crypto_scomp;
 
 static const struct crypto_type crypto_acomp_type;
 
+struct acomp_save_req_state {
+	struct list_head head;
+	struct acomp_req *req0;
+	struct acomp_req *cur;
+	int (*op)(struct acomp_req *req);
+	crypto_completion_t compl;
+	void *data;
+};
+
+static void acomp_reqchain_done(void *data, int err);
+static int acomp_save_req(struct acomp_req *req, crypto_completion_t cplt);
+static void acomp_restore_req(struct acomp_req *req);
+
 static inline struct acomp_alg *__crypto_acomp_alg(struct crypto_alg *alg)
 {
 	return container_of(alg, struct acomp_alg, calg.base);
@@ -123,6 +136,277 @@ struct crypto_acomp *crypto_alloc_acomp_node(const char *alg_name, u32 type,
 }
 EXPORT_SYMBOL_GPL(crypto_alloc_acomp_node);
 
+static int acomp_save_req(struct acomp_req *req, crypto_completion_t cplt)
+{
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
+	struct acomp_save_req_state *state;
+	gfp_t gfp;
+	u32 flags;
+
+	if (!acomp_is_async(tfm))
+		return 0;
+
+	flags = acomp_request_flags(req);
+	gfp = (flags & CRYPTO_TFM_REQ_MAY_SLEEP) ?  GFP_KERNEL : GFP_ATOMIC;
+	state = kmalloc(sizeof(*state), gfp);
+	if (!state)
+		return -ENOMEM;
+
+	state->compl = req->base.complete;
+	state->data = req->base.data;
+	state->req0 = req;
+
+	req->base.complete = cplt;
+	req->base.data = state;
+
+	return 0;
+}
+
+static void acomp_restore_req(struct acomp_req *req)
+{
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
+	struct acomp_save_req_state *state;
+
+	if (!acomp_is_async(tfm))
+		return;
+
+	state = req->base.data;
+
+	req->base.complete = state->compl;
+	req->base.data = state->data;
+	kfree(state);
+}
+
+static int acomp_reqchain_finish(struct acomp_save_req_state *state,
+				 int err, u32 mask)
+{
+	struct acomp_req *req0 = state->req0;
+	struct acomp_req *req = state->cur;
+	struct acomp_req *n;
+
+	req->base.err = err;
+
+	if (req == req0)
+		INIT_LIST_HEAD(&req->base.list);
+	else
+		list_add_tail(&req->base.list, &req0->base.list);
+
+	list_for_each_entry_safe(req, n, &state->head, base.list) {
+		list_del_init(&req->base.list);
+
+		req->base.flags &= mask;
+		req->base.complete = acomp_reqchain_done;
+		req->base.data = state;
+		state->cur = req;
+		err = state->op(req);
+
+		if (err == -EINPROGRESS) {
+			if (!list_empty(&state->head))
+				err = -EBUSY;
+			goto out;
+		}
+
+		if (err == -EBUSY)
+			goto out;
+
+		req->base.err = err;
+		list_add_tail(&req->base.list, &req0->base.list);
+	}
+
+	acomp_restore_req(req0);
+
+out:
+	return err;
+}
+
+static void acomp_reqchain_done(void *data, int err)
+{
+	struct acomp_save_req_state *state = data;
+	crypto_completion_t compl = state->compl;
+
+	data = state->data;
+
+	if (err == -EINPROGRESS) {
+		if (!list_empty(&state->head))
+			return;
+		goto notify;
+	}
+
+	err = acomp_reqchain_finish(state, err, CRYPTO_TFM_REQ_MAY_BACKLOG);
+	if (err == -EBUSY)
+		return;
+
+notify:
+	compl(data, err);
+}
+
+int acomp_do_req_chain(struct acomp_req *req,
+		       int (*op)(struct acomp_req *req))
+{
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
+	struct acomp_save_req_state *state;
+	struct acomp_save_req_state state0;
+	int err = 0;
+
+	if (!acomp_request_chained(req) || list_empty(&req->base.list) ||
+	    !crypto_acomp_req_chain(tfm))
+		return op(req);
+
+	state = &state0;
+
+	if (acomp_is_async(tfm)) {
+		err = acomp_save_req(req, acomp_reqchain_done);
+		if (err) {
+			struct acomp_req *r2;
+
+			req->base.err = err;
+			list_for_each_entry(r2, &req->base.list, base.list)
+				r2->base.err = err;
+
+			return err;
+		}
+
+		state = req->base.data;
+	}
+
+	state->op = op;
+	state->cur = req;
+	INIT_LIST_HEAD(&state->head);
+	list_splice(&req->base.list, &state->head);
+
+	err = op(req);
+	if (err == -EBUSY || err == -EINPROGRESS)
+		return -EBUSY;
+
+	return acomp_reqchain_finish(state, err, ~0);
+}
+EXPORT_SYMBOL_GPL(acomp_do_req_chain);
+
+static void acomp_async_reqchain_done(struct acomp_req *req0,
+				      struct list_head *state,
+				      int (*op_poll)(struct acomp_req *req))
+{
+	struct acomp_req *req, *n;
+	bool req0_done = false;
+	int err;
+
+	while (!list_empty(state)) {
+
+		if (!req0_done) {
+			err = op_poll(req0);
+			if (!(err == -EAGAIN || err == -EINPROGRESS || err == -EBUSY)) {
+				req0->base.err = err;
+				req0_done = true;
+			}
+		}
+
+		list_for_each_entry_safe(req, n, state, base.list) {
+			err = op_poll(req);
+
+			if (err == -EAGAIN || err == -EINPROGRESS || err == -EBUSY)
+				continue;
+
+			req->base.err = err;
+			list_del_init(&req->base.list);
+			list_add_tail(&req->base.list, &req0->base.list);
+		}
+	}
+
+	while (!req0_done) {
+		err = op_poll(req0);
+		if (!(err == -EAGAIN || err == -EINPROGRESS || err == -EBUSY)) {
+			req0->base.err = err;
+			break;
+		}
+	}
+}
+
+static int acomp_async_reqchain_finish(struct acomp_req *req0,
+				       struct list_head *state,
+				       int (*op_submit)(struct acomp_req *req),
+				       int (*op_poll)(struct acomp_req *req))
+{
+	struct acomp_req *req, *n;
+	int err = 0;
+
+	INIT_LIST_HEAD(&req0->base.list);
+
+	list_for_each_entry_safe(req, n, state, base.list) {
+		BUG_ON(req == req0);
+
+		err = op_submit(req);
+
+		if (!(err == -EINPROGRESS || err == -EBUSY)) {
+			req->base.err = err;
+			list_del_init(&req->base.list);
+			list_add_tail(&req->base.list, &req0->base.list);
+		}
+	}
+
+	acomp_async_reqchain_done(req0, state, op_poll);
+
+	return req0->base.err;
+}
+
+int acomp_do_async_req_chain(struct acomp_req *req,
+			     int (*op_submit)(struct acomp_req *req),
+			     int (*op_poll)(struct acomp_req *req))
+{
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
+	struct list_head state;
+	struct acomp_req *r2;
+	int err = 0;
+	void *req0_data = req->base.data;
+
+	if (!acomp_request_chained(req) || list_empty(&req->base.list) ||
+		!acomp_is_async(tfm) || !crypto_acomp_req_chain(tfm)) {
+
+		err = op_submit(req);
+
+		if (err == -EINPROGRESS || err == -EBUSY) {
+			bool req0_done = false;
+
+			while (!req0_done) {
+				err = op_poll(req);
+				if (!(err == -EAGAIN || err == -EINPROGRESS || err == -EBUSY)) {
+					req->base.err = err;
+					break;
+				}
+			}
+		} else {
+			req->base.err = err;
+		}
+
+		req->base.data = req0_data;
+		if (acomp_is_async(tfm))
+			req->base.complete(req->base.data, req->base.err);
+
+		return err;
+	}
+
+	err = op_submit(req);
+	req->base.err = err;
+
+	if (err && !(err == -EINPROGRESS || err == -EBUSY))
+		goto err_prop;
+
+	INIT_LIST_HEAD(&state);
+	list_splice(&req->base.list, &state);
+
+	err = acomp_async_reqchain_finish(req, &state, op_submit, op_poll);
+	req->base.data = req0_data;
+	req->base.complete(req->base.data, req->base.err);
+
+	return err;
+
+err_prop:
+	list_for_each_entry(r2, &req->base.list, base.list)
+		r2->base.err = err;
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(acomp_do_async_req_chain);
+
 struct acomp_req *acomp_request_alloc(struct crypto_acomp *acomp)
 {
 	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp);
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index 54937b615239..eadc24514056 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -206,6 +206,7 @@ static inline void acomp_request_set_callback(struct acomp_req *req,
 	req->base.data = data;
 	req->base.flags &= CRYPTO_ACOMP_ALLOC_OUTPUT;
 	req->base.flags |= flgs & ~CRYPTO_ACOMP_ALLOC_OUTPUT;
+	req->base.flags &= ~CRYPTO_TFM_REQ_CHAIN;
 }
 
 /**
@@ -237,6 +238,46 @@ static inline void acomp_request_set_params(struct acomp_req *req,
 		req->flags |= CRYPTO_ACOMP_ALLOC_OUTPUT;
 }
 
+static inline u32 acomp_request_flags(struct acomp_req *req)
+{
+	return req->base.flags;
+}
+
+static inline void acomp_reqchain_init(struct acomp_req *req,
+				       u32 flags, crypto_completion_t compl,
+				       void *data)
+{
+	acomp_request_set_callback(req, flags, compl, data);
+	crypto_reqchain_init(&req->base);
+}
+
+static inline void acomp_reqchain_clear(struct acomp_req *req, void *data)
+{
+	struct crypto_wait *wait = (struct crypto_wait *)data;
+	reinit_completion(&wait->completion);
+	crypto_reqchain_clear(&req->base);
+	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+				   crypto_req_done, data);
+}
+
+static inline void acomp_request_chain(struct acomp_req *req,
+				       struct acomp_req *head)
+{
+	crypto_request_chain(&req->base, &head->base);
+}
+
+int acomp_do_req_chain(struct acomp_req *req,
+		       int (*op)(struct acomp_req *req));
+
+int acomp_do_async_req_chain(struct acomp_req *req,
+			     int (*op_submit)(struct acomp_req *req),
+			     int (*op_poll)(struct acomp_req *req));
+
+static inline int acomp_request_err(struct acomp_req *req)
+{
+	return req->base.err;
+}
+
 /**
  * crypto_acomp_compress() -- Invoke asynchronous compress operation
  *
diff --git a/include/crypto/algapi.h b/include/crypto/algapi.h
index 156de41ca760..c5df380c7d08 100644
--- a/include/crypto/algapi.h
+++ b/include/crypto/algapi.h
@@ -271,4 +271,14 @@ static inline u32 crypto_tfm_alg_type(struct crypto_tfm *tfm)
 	return tfm->__crt_alg->cra_flags & CRYPTO_ALG_TYPE_MASK;
 }
 
+static inline bool crypto_request_chained(struct crypto_async_request *req)
+{
+	return req->flags & CRYPTO_TFM_REQ_CHAIN;
+}
+
+static inline bool crypto_tfm_req_chain(struct crypto_tfm *tfm)
+{
+	return tfm->__crt_alg->cra_flags & CRYPTO_ALG_REQ_CHAIN;
+}
+
 #endif	/* _CRYPTO_ALGAPI_H */
diff --git a/include/crypto/internal/acompress.h b/include/crypto/internal/acompress.h
index 8831edaafc05..53b4ef59b48c 100644
--- a/include/crypto/internal/acompress.h
+++ b/include/crypto/internal/acompress.h
@@ -84,6 +84,16 @@ static inline void __acomp_request_free(struct acomp_req *req)
 	kfree_sensitive(req);
 }
 
+static inline bool acomp_request_chained(struct acomp_req *req)
+{
+	return crypto_request_chained(&req->base);
+}
+
+static inline bool crypto_acomp_req_chain(struct crypto_acomp *tfm)
+{
+	return crypto_tfm_req_chain(&tfm->base);
+}
+
 /**
  * crypto_register_acomp() -- Register asynchronous compression algorithm
  *
diff --git a/include/linux/crypto.h b/include/linux/crypto.h
index b164da5e129e..7c27d557586c 100644
--- a/include/linux/crypto.h
+++ b/include/linux/crypto.h
@@ -13,6 +13,8 @@
 #define _LINUX_CRYPTO_H
 
 #include <linux/completion.h>
+#include <linux/errno.h>
+#include <linux/list.h>
 #include <linux/refcount.h>
 #include <linux/slab.h>
 #include <linux/types.h>
@@ -124,6 +126,9 @@
  */
 #define CRYPTO_ALG_FIPS_INTERNAL	0x00020000
 
+/* Set if the algorithm supports request chains. */
+#define CRYPTO_ALG_REQ_CHAIN		0x00040000
+
 /*
  * Transform masks and values (for crt_flags).
  */
@@ -133,6 +138,7 @@
 #define CRYPTO_TFM_REQ_FORBID_WEAK_KEYS	0x00000100
 #define CRYPTO_TFM_REQ_MAY_SLEEP	0x00000200
 #define CRYPTO_TFM_REQ_MAY_BACKLOG	0x00000400
+#define CRYPTO_TFM_REQ_CHAIN		0x00000800
 
 /*
  * Miscellaneous stuff.
@@ -174,6 +180,7 @@ struct crypto_async_request {
 	struct crypto_tfm *tfm;
 
 	u32 flags;
+	int err;
 };
 
 /**
@@ -540,5 +547,29 @@ int crypto_comp_decompress(struct crypto_comp *tfm,
 			   const u8 *src, unsigned int slen,
 			   u8 *dst, unsigned int *dlen);
 
+static inline void crypto_reqchain_init(struct crypto_async_request *req)
+{
+	req->err = -EINPROGRESS;
+	req->flags |= CRYPTO_TFM_REQ_CHAIN;
+	INIT_LIST_HEAD(&req->list);
+}
+
+static inline void crypto_reqchain_clear(struct crypto_async_request *req)
+{
+	req->flags &= ~CRYPTO_TFM_REQ_CHAIN;
+}
+
+static inline void crypto_request_chain(struct crypto_async_request *req,
+					struct crypto_async_request *head)
+{
+	req->err = -EINPROGRESS;
+	list_add_tail(&req->list, &head->list);
+}
+
+static inline bool crypto_tfm_is_async(struct crypto_tfm *tfm)
+{
+	return tfm->__crt_alg->cra_flags & CRYPTO_ALG_ASYNC;
+}
+
 #endif	/* _LINUX_CRYPTO_H */
 
-- 
2.27.0




* [PATCH v6 02/16] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-06  7:20 [PATCH v6 00/16] zswap IAA compress batching Kanchana P Sridhar
  2025-02-06  7:20 ` [PATCH v6 01/16] crypto: acomp - Add synchronous/asynchronous acomp request chaining Kanchana P Sridhar
@ 2025-02-06  7:20 ` Kanchana P Sridhar
  2025-02-16  5:10   ` Herbert Xu
  2025-02-06  7:20 ` [PATCH v6 03/16] crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable async mode Kanchana P Sridhar
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-02-06  7:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This commit adds get_batch_size(), batch_compress() and batch_decompress()
interfaces to:

  struct acomp_alg
  struct crypto_acomp

A crypto_acomp compression algorithm that supports batching of compressions
and decompressions must provide implementations for these interfaces.

A new helper function acomp_has_async_batching() can be invoked to query if
a crypto_acomp has registered these batching interfaces.

A higher level module like zswap can call acomp_has_async_batching() to
detect if the compressor supports batching, and if so, it can call
the new crypto_acomp_batch_size() to detect the maximum batch-size
supported by the compressor. Based on this, zswap can use the minimum of
any zswap-specific upper limits for batch-size and the compressor's max
batch-size, to allocate batching resources.

This allows the iaa_crypto Intel IAA driver to register implementations for
the get_batch_size(), batch_compress() and batch_decompress() acomp_alg
interfaces, which can subsequently be invoked from the kernel zswap/zram
modules to compress/decompress pages in parallel on the IAA hardware
accelerator, improving swapout/swapin performance through these newly added
corresponding crypto_acomp API:

  crypto_acomp_batch_size()
  crypto_acomp_batch_compress()
  crypto_acomp_batch_decompress()
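
For illustration, a module could size its per-CPU batching resources roughly
as follows (a sketch only; MY_MAX_BATCH_SIZE and the allocation step are
hypothetical, not part of this patch):

  #include <crypto/acompress.h>
  #include <linux/minmax.h>

  #define MY_MAX_BATCH_SIZE 8U	/* hypothetical module-specific upper limit */

  static unsigned int probe_nr_batch_reqs(struct crypto_acomp *acomp)
  {
          unsigned int nr_reqs = 1;

          /* Only compressors that register the batching ops report > 1. */
          if (acomp_has_async_batching(acomp))
                  nr_reqs = min(MY_MAX_BATCH_SIZE,
                                crypto_acomp_batch_size(acomp));

          /* Allocate nr_reqs acomp_reqs and destination buffers here. */
          return nr_reqs;
  }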

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 crypto/acompress.c                  |   3 +
 include/crypto/acompress.h          | 111 ++++++++++++++++++++++++++++
 include/crypto/internal/acompress.h |  19 +++++
 3 files changed, 133 insertions(+)

diff --git a/crypto/acompress.c b/crypto/acompress.c
index cb6444d09dd7..165559a8b9bd 100644
--- a/crypto/acompress.c
+++ b/crypto/acompress.c
@@ -84,6 +84,9 @@ static int crypto_acomp_init_tfm(struct crypto_tfm *tfm)
 
 	acomp->compress = alg->compress;
 	acomp->decompress = alg->decompress;
+	acomp->get_batch_size = alg->get_batch_size;
+	acomp->batch_compress = alg->batch_compress;
+	acomp->batch_decompress = alg->batch_decompress;
 	acomp->dst_free = alg->dst_free;
 	acomp->reqsize = alg->reqsize;
 
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index eadc24514056..8451ade70fd8 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -43,6 +43,10 @@ struct acomp_req {
  *
  * @compress:		Function performs a compress operation
  * @decompress:		Function performs a de-compress operation
+ * @get_batch_size:     Function returns the maximum batch-size for batching
+ *                      compress/decompress operations.
+ * @batch_compress:	Function performs a batch compress operation
+ * @batch_decompress:	Function performs a batch decompress operation
  * @dst_free:		Frees destination buffer if allocated inside the
  *			algorithm
  * @reqsize:		Context size for (de)compression requests
@@ -51,6 +55,21 @@ struct acomp_req {
 struct crypto_acomp {
 	int (*compress)(struct acomp_req *req);
 	int (*decompress)(struct acomp_req *req);
+	unsigned int (*get_batch_size)(void);
+	bool (*batch_compress)(struct acomp_req *reqs[],
+			       struct crypto_wait *wait,
+			       struct page *pages[],
+			       u8 *dsts[],
+			       unsigned int dlens[],
+			       int errors[],
+			       int nr_pages);
+	bool (*batch_decompress)(struct acomp_req *reqs[],
+				 struct crypto_wait *wait,
+				 u8 *srcs[],
+				 struct page *pages[],
+				 unsigned int slens[],
+				 int errors[],
+				 int nr_pages);
 	void (*dst_free)(struct scatterlist *dst);
 	unsigned int reqsize;
 	struct crypto_tfm base;
@@ -142,6 +161,13 @@ static inline bool acomp_is_async(struct crypto_acomp *tfm)
 	       CRYPTO_ALG_ASYNC;
 }
 
+static inline bool acomp_has_async_batching(struct crypto_acomp *tfm)
+{
+	return (acomp_is_async(tfm) &&
+		(crypto_comp_alg_common(tfm)->base.cra_flags & CRYPTO_ALG_TYPE_ACOMPRESS) &&
+		tfm->get_batch_size && tfm->batch_compress && tfm->batch_decompress);
+}
+
 static inline struct crypto_acomp *crypto_acomp_reqtfm(struct acomp_req *req)
 {
 	return __crypto_acomp_tfm(req->base.tfm);
@@ -306,4 +332,89 @@ static inline int crypto_acomp_decompress(struct acomp_req *req)
 	return crypto_acomp_reqtfm(req)->decompress(req);
 }
 
+/**
+ * crypto_acomp_batch_size() -- Get the algorithm's batch size
+ *
+ * Function returns the algorithm's batch size for batching operations
+ *
+ * @tfm:	ACOMPRESS tfm handle allocated with crypto_alloc_acomp()
+ *
+ * Return:	crypto_acomp's batch size.
+ */
+static inline unsigned int crypto_acomp_batch_size(struct crypto_acomp *tfm)
+{
+	if (acomp_has_async_batching(tfm))
+		return tfm->get_batch_size();
+
+	return 1;
+}
+
+/**
+ * crypto_acomp_batch_compress() -- Invoke asynchronous compress of
+ *                                  a batch of requests
+ *
+ * Function invokes the asynchronous batch compress operation
+ *
+ * @reqs: @nr_pages asynchronous compress requests.
+ * @wait: crypto_wait for acomp batch compress with synchronous/asynchronous
+ *        request chaining. If NULL, the driver must provide a way to process
+ *        request completions asynchronously.
+ * @pages: Pages to be compressed.
+ * @dsts: Pre-allocated destination buffers to store results of compression.
+ * @dlens: Will contain the compressed lengths.
+ * @errors: zero on successful compression of the corresponding
+ *          req, or error code in case of error.
+ * @nr_pages: The number of pages to be compressed.
+ *
+ * Returns true if all compress requests complete successfully,
+ * false otherwise.
+ */
+static inline bool crypto_acomp_batch_compress(struct acomp_req *reqs[],
+					       struct crypto_wait *wait,
+					       struct page *pages[],
+					       u8 *dsts[],
+					       unsigned int dlens[],
+					       int errors[],
+					       int nr_pages)
+{
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(reqs[0]);
+
+	return tfm->batch_compress(reqs, wait, pages, dsts,
+				   dlens, errors, nr_pages);
+}
+
+/**
+ * crypto_acomp_batch_decompress() -- Invoke asynchronous decompress of
+ *                                    a batch of requests
+ *
+ * Function invokes the asynchronous batch decompress operation
+ *
+ * @reqs: @nr_pages asynchronous decompress requests.
+ * @wait: crypto_wait for acomp batch decompress with synchronous/asynchronous
+ *        request chaining. If NULL, the driver must provide a way to process
+ *        request completions asynchronously.
+ * @srcs: The src buffers to be decompressed.
+ * @pages: The pages to store the decompressed buffers.
+ * @slens: Compressed lengths of @srcs.
+ * @errors: zero on successful decompression of the corresponding
+ *          req, or error code in case of error.
+ * @nr_pages: The number of pages to be decompressed.
+ *
+ * Returns true if all decompress requests complete successfully,
+ * false otherwise.
+ */
+static inline bool crypto_acomp_batch_decompress(struct acomp_req *reqs[],
+						 struct crypto_wait *wait,
+						 u8 *srcs[],
+						 struct page *pages[],
+						 unsigned int slens[],
+						 int errors[],
+						 int nr_pages)
+{
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(reqs[0]);
+
+	return tfm->batch_decompress(reqs, wait, srcs, pages,
+				     slens, errors, nr_pages);
+}
+
 #endif
diff --git a/include/crypto/internal/acompress.h b/include/crypto/internal/acompress.h
index 53b4ef59b48c..df0e192801ff 100644
--- a/include/crypto/internal/acompress.h
+++ b/include/crypto/internal/acompress.h
@@ -17,6 +17,10 @@
  *
  * @compress:	Function performs a compress operation
  * @decompress:	Function performs a de-compress operation
+ * @get_batch_size:     Function returns the maximum batch-size for batching
+ *                      compress/decompress operations.
+ * @batch_compress:	Function performs a batch compress operation
+ * @batch_decompress:	Function performs a batch decompress operation
  * @dst_free:	Frees destination buffer if allocated inside the algorithm
  * @init:	Initialize the cryptographic transformation object.
  *		This function is used to initialize the cryptographic
@@ -37,6 +41,21 @@
 struct acomp_alg {
 	int (*compress)(struct acomp_req *req);
 	int (*decompress)(struct acomp_req *req);
+	unsigned int (*get_batch_size)(void);
+	bool (*batch_compress)(struct acomp_req *reqs[],
+			       struct crypto_wait *wait,
+			       struct page *pages[],
+			       u8 *dsts[],
+			       unsigned int dlens[],
+			       int errors[],
+			       int nr_pages);
+	bool (*batch_decompress)(struct acomp_req *reqs[],
+				 struct crypto_wait *wait,
+				 u8 *srcs[],
+				 struct page *pages[],
+				 unsigned int slens[],
+				 int errors[],
+				 int nr_pages);
 	void (*dst_free)(struct scatterlist *dst);
 	int (*init)(struct crypto_acomp *tfm);
 	void (*exit)(struct crypto_acomp *tfm);
-- 
2.27.0



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v6 03/16] crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable async mode.
  2025-02-06  7:20 [PATCH v6 00/16] zswap IAA compress batching Kanchana P Sridhar
  2025-02-06  7:20 ` [PATCH v6 01/16] crypto: acomp - Add synchronous/asynchronous acomp request chaining Kanchana P Sridhar
  2025-02-06  7:20 ` [PATCH v6 02/16] crypto: acomp - Define new interfaces for compress/decompress batching Kanchana P Sridhar
@ 2025-02-06  7:20 ` Kanchana P Sridhar
  2025-02-06  7:20 ` [PATCH v6 04/16] crypto: iaa - Implement batch_compress(), batch_decompress() API in iaa_crypto Kanchana P Sridhar
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-02-06  7:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

If the iaa_crypto driver has async_mode set to true, and use_irq set to
false, it can still be forced to use synchronous mode by turning off the
CRYPTO_ACOMP_REQ_POLL flag in req->flags.

In other words, all three of the following need to be true for a request
to be processed in fully async poll mode (a usage sketch follows the list):

 1) async_mode should be "true"
 2) use_irq should be "false"
 3) req->flags & CRYPTO_ACOMP_REQ_POLL should be "true"
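
A minimal usage sketch (illustrative only; "req" is assumed to be an
already-allocated struct acomp_req bound to the iaa_crypto driver):

  /* Opt in to fully async submit-then-poll processing: */
  req->flags |= CRYPTO_ACOMP_REQ_POLL;

  /* Or force synchronous processing of this request: */
  req->flags &= ~CRYPTO_ACOMP_REQ_POLL;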

Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 11 ++++++++++-
 include/crypto/acompress.h                 |  5 +++++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index c3776b0de51d..d7983ab3c34a 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1520,6 +1520,10 @@ static int iaa_comp_acompress(struct acomp_req *req)
 		return -EINVAL;
 	}
 
+	/* If the caller has requested no polling, disable async. */
+	if (!(req->flags & CRYPTO_ACOMP_REQ_POLL))
+		disable_async = true;
+
 	cpu = get_cpu();
 	wq = wq_table_next_wq(cpu);
 	put_cpu();
@@ -1712,6 +1716,7 @@ static int iaa_comp_adecompress(struct acomp_req *req)
 {
 	struct crypto_tfm *tfm = req->base.tfm;
 	dma_addr_t src_addr, dst_addr;
+	bool disable_async = false;
 	int nr_sgs, cpu, ret = 0;
 	struct iaa_wq *iaa_wq;
 	struct device *dev;
@@ -1727,6 +1732,10 @@ static int iaa_comp_adecompress(struct acomp_req *req)
 		return -EINVAL;
 	}
 
+	/* If the caller has requested no polling, disable async. */
+	if (!(req->flags & CRYPTO_ACOMP_REQ_POLL))
+		disable_async = true;
+
 	if (!req->dst)
 		return iaa_comp_adecompress_alloc_dest(req);
 
@@ -1775,7 +1784,7 @@ static int iaa_comp_adecompress(struct acomp_req *req)
 		req->dst, req->dlen, sg_dma_len(req->dst));
 
 	ret = iaa_decompress(tfm, req, wq, src_addr, req->slen,
-			     dst_addr, &req->dlen, false);
+			     dst_addr, &req->dlen, disable_async);
 	if (ret == -EINPROGRESS)
 		return ret;
 
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index 8451ade70fd8..e090538e8406 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -14,6 +14,11 @@
 #include <linux/crypto.h>
 
 #define CRYPTO_ACOMP_ALLOC_OUTPUT	0x00000001
+/*
+ * If set, the driver must have a way to submit the req, then
+ * poll its completion status for success/error.
+ */
+#define CRYPTO_ACOMP_REQ_POLL		0x00000002
 #define CRYPTO_ACOMP_DST_MAX		131072
 
 /**
-- 
2.27.0



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v6 04/16] crypto: iaa - Implement batch_compress(), batch_decompress() API in iaa_crypto.
  2025-02-06  7:20 [PATCH v6 00/16] zswap IAA compress batching Kanchana P Sridhar
                   ` (2 preceding siblings ...)
  2025-02-06  7:20 ` [PATCH v6 03/16] crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable async mode Kanchana P Sridhar
@ 2025-02-06  7:20 ` Kanchana P Sridhar
  2025-02-06  7:20 ` [PATCH v6 05/16] crypto: iaa - Enable async mode and make it the default Kanchana P Sridhar
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-02-06  7:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch provides iaa_crypto driver implementations for the newly added
crypto_acomp batch_compress() and batch_decompress() interfaces using
acomp request chaining.

iaa_crypto also implements the new crypto_acomp get_batch_size() interface
that returns an iaa_crypto driver-specific constant,
IAA_CRYPTO_MAX_BATCH_SIZE (currently set to 8U).

This allows swap modules such as zswap/zram to allocate the required
batching resources and then invoke fully asynchronous, parallel batch
compression/decompression of pages on systems with Intel IAA, by
invoking these APIs, respectively:

 crypto_acomp_batch_size(...);
 crypto_acomp_batch_compress(...);
 crypto_acomp_batch_decompress(...);

This enables zswap compress batching code to be developed in
a manner similar to the current single-page synchronous calls to:

 crypto_acomp_compress(...);
 crypto_acomp_decompress(...);

thereby facilitating an encapsulated and modular hand-off between the
kernel zswap/zram code and the crypto_acomp layer.
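
For illustration, a minimal sketch of a swap-module-side call into the
batching API listed above; "tfm", "wait" and the arrays are hypothetical
caller-owned resources, not defined by this patch:

  /*
   * Compress up to the driver's batch size worth of pages in one call.
   * On return, dlens[i] and errors[i] hold the per-page results.
   */
  int nr = min_t(int, nr_pages, crypto_acomp_batch_size(tfm));

  if (!crypto_acomp_batch_compress(reqs, &wait, pages, dsts,
                                   dlens, errors, nr)) {
          /* At least one page failed; consult errors[] per page. */
  }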

Since iaa_crypto supports the use of acomp request chaining, this patch
also adds CRYPTO_ALG_REQ_CHAIN to the iaa_acomp_fixed_deflate algorithm's
cra_flags.

Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto.h      |   9 +
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 395 ++++++++++++++++++++-
 2 files changed, 403 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index 56985e395263..b3b67c44ec8a 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -39,6 +39,15 @@
 					 IAA_DECOMP_CHECK_FOR_EOB | \
 					 IAA_DECOMP_STOP_ON_EOB)
 
+/*
+ * The maximum compress/decompress batch size for IAA's implementation of
+ * the crypto_acomp batch_compress() and batch_decompress() interfaces.
+ * The IAA compression algorithms should provide the crypto_acomp
+ * get_batch_size() interface through a function that returns this
+ * constant.
+ */
+#define IAA_CRYPTO_MAX_BATCH_SIZE 8U
+
 /* Representation of IAA workqueue */
 struct iaa_wq {
 	struct list_head	list;
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index d7983ab3c34a..61134a7ad1da 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1807,6 +1807,396 @@ static void compression_ctx_init(struct iaa_compression_ctx *ctx)
 	ctx->use_irq = use_irq;
 }
 
+static int iaa_comp_poll(struct acomp_req *req)
+{
+	struct idxd_desc *idxd_desc;
+	struct idxd_device *idxd;
+	struct iaa_wq *iaa_wq;
+	struct pci_dev *pdev;
+	struct device *dev;
+	struct idxd_wq *wq;
+	bool compress_op;
+	int ret;
+
+	idxd_desc = req->base.data;
+	if (!idxd_desc)
+		return -EAGAIN;
+
+	compress_op = (idxd_desc->iax_hw->opcode == IAX_OPCODE_COMPRESS);
+	wq = idxd_desc->wq;
+	iaa_wq = idxd_wq_get_private(wq);
+	idxd = iaa_wq->iaa_device->idxd;
+	pdev = idxd->pdev;
+	dev = &pdev->dev;
+
+	ret = check_completion(dev, idxd_desc->iax_completion, true, true);
+	if (ret == -EAGAIN)
+		return ret;
+	if (ret)
+		goto out;
+
+	req->dlen = idxd_desc->iax_completion->output_size;
+
+	/* Update stats */
+	if (compress_op) {
+		update_total_comp_bytes_out(req->dlen);
+		update_wq_comp_bytes(wq, req->dlen);
+	} else {
+		update_total_decomp_bytes_in(req->slen);
+		update_wq_decomp_bytes(wq, req->slen);
+	}
+
+	if (iaa_verify_compress && (idxd_desc->iax_hw->opcode == IAX_OPCODE_COMPRESS)) {
+		struct crypto_tfm *tfm = req->base.tfm;
+		dma_addr_t src_addr, dst_addr;
+		u32 compression_crc;
+
+		compression_crc = idxd_desc->iax_completion->crc;
+
+		dma_sync_sg_for_device(dev, req->dst, 1, DMA_FROM_DEVICE);
+		dma_sync_sg_for_device(dev, req->src, 1, DMA_TO_DEVICE);
+
+		src_addr = sg_dma_address(req->src);
+		dst_addr = sg_dma_address(req->dst);
+
+		ret = iaa_compress_verify(tfm, req, wq, src_addr, req->slen,
+					  dst_addr, &req->dlen, compression_crc);
+	}
+out:
+	/* caller doesn't call crypto_wait_req, so no acomp_request_complete() */
+
+	dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
+	dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
+
+	idxd_free_desc(idxd_desc->wq, idxd_desc);
+
+	dev_dbg(dev, "%s: returning ret=%d\n", __func__, ret);
+
+	return ret;
+}
+
+static unsigned int iaa_comp_get_batch_size(void)
+{
+	return IAA_CRYPTO_MAX_BATCH_SIZE;
+}
+
+static void iaa_set_req_poll(
+	struct acomp_req *reqs[],
+	int nr_reqs,
+	bool set_flag)
+{
+	int i;
+
+	for (i = 0; i < nr_reqs; ++i) {
+		set_flag ? (reqs[i]->flags |= CRYPTO_ACOMP_REQ_POLL) :
+			   (reqs[i]->flags &= ~CRYPTO_ACOMP_REQ_POLL);
+	}
+}
+
+/**
+ * This API provides IAA compress batching functionality for use by swap
+ * modules.
+ *
+ * @reqs: @nr_pages asynchronous compress requests.
+ * @wait: crypto_wait for acomp batch compress implemented using request
+ *        chaining. Required if async_mode is "false". If async_mode is "true",
+ *        and @wait is NULL, the completions will be processed using
+ *        asynchronous polling of the requests' completion statuses.
+ * @pages: Pages to be compressed by IAA.
+ * @dsts: Pre-allocated destination buffers to store results of IAA
+ *        compression. Each element of @dsts must be of size "PAGE_SIZE * 2".
+ * @dlens: Will contain the compressed lengths.
+ * @errors: zero on successful compression of the corresponding
+ *          req, or error code in case of error.
+ * @nr_pages: The number of pages, up to IAA_CRYPTO_MAX_BATCH_SIZE,
+ *            to be compressed.
+ *
+ * Returns true if all compress requests complete successfully,
+ * false otherwise.
+ */
+static bool iaa_comp_acompress_batch(
+	struct acomp_req *reqs[],
+	struct crypto_wait *wait,
+	struct page *pages[],
+	u8 *dsts[],
+	unsigned int dlens[],
+	int errors[],
+	int nr_pages)
+{
+	struct scatterlist inputs[IAA_CRYPTO_MAX_BATCH_SIZE];
+	struct scatterlist outputs[IAA_CRYPTO_MAX_BATCH_SIZE];
+	bool compressions_done = false;
+	bool async = (async_mode && !use_irq);
+	bool async_poll = (async && !wait);
+	int i, err = 0;
+
+	BUG_ON(nr_pages > IAA_CRYPTO_MAX_BATCH_SIZE);
+	BUG_ON(!async && !wait);
+
+	if (async)
+		iaa_set_req_poll(reqs, nr_pages, true);
+	else
+		iaa_set_req_poll(reqs, nr_pages, false);
+
+	/*
+	 * Prepare and submit acomp_reqs to IAA. IAA will process these
+	 * compress jobs in parallel if async_mode is true.
+	 */
+	for (i = 0; i < nr_pages; ++i) {
+		sg_init_table(&inputs[i], 1);
+		sg_set_page(&inputs[i], pages[i], PAGE_SIZE, 0);
+
+		/*
+		 * Each dst buffer should be of size (PAGE_SIZE * 2).
+		 * Reflect same in sg_list.
+		 */
+		sg_init_one(&outputs[i], dsts[i], PAGE_SIZE * 2);
+		acomp_request_set_params(reqs[i], &inputs[i],
+					 &outputs[i], PAGE_SIZE, dlens[i]);
+
+		/*
+		 * As long as the API is called with a valid "wait", chain the
+		 * requests for synchronous/asynchronous compress ops.
+		 * If async_mode is in effect, but the API is called with a
+		 * NULL "wait", submit the requests first, and poll for
+		 * their completion status later, after all descriptors have
+		 * been submitted.
+		 */
+		if (!async_poll) {
+			/* acomp request chaining. */
+			if (i)
+				acomp_request_chain(reqs[i], reqs[0]);
+			else
+				acomp_reqchain_init(reqs[0], 0, crypto_req_done,
+						    wait);
+		} else {
+			errors[i] = iaa_comp_acompress(reqs[i]);
+
+			if (errors[i] != -EINPROGRESS) {
+				errors[i] = -EINVAL;
+				err = -EINVAL;
+			} else {
+				errors[i] = -EAGAIN;
+			}
+		}
+	}
+
+	if (!async_poll) {
+		if (async)
+			/* Process the request chain in parallel. */
+			err = crypto_wait_req(acomp_do_async_req_chain(reqs[0],
+					      iaa_comp_acompress, iaa_comp_poll),
+					      wait);
+		else
+			/* Process the request chain in series. */
+			err = crypto_wait_req(acomp_do_req_chain(reqs[0],
+					      iaa_comp_acompress), wait);
+
+		for (i = 0; i < nr_pages; ++i) {
+			errors[i] = acomp_request_err(reqs[i]);
+			if (errors[i]) {
+				err = -EINVAL;
+				pr_debug("Request chaining req %d compress error %d\n", i, errors[i]);
+			} else {
+				dlens[i] = reqs[i]->dlen;
+			}
+		}
+
+		goto reset_reqs;
+	}
+
+	/*
+	 * Asynchronously poll for and process IAA compress job completions.
+	 */
+	while (!compressions_done) {
+		compressions_done = true;
+
+		for (i = 0; i < nr_pages; ++i) {
+			/*
+			 * Skip, if the compression has already completed
+			 * successfully or with an error.
+			 */
+			if (errors[i] != -EAGAIN)
+				continue;
+
+			errors[i] = iaa_comp_poll(reqs[i]);
+
+			if (errors[i]) {
+				if (errors[i] == -EAGAIN)
+					compressions_done = false;
+				else
+					err = -EINVAL;
+			} else {
+				dlens[i] = reqs[i]->dlen;
+			}
+		}
+	}
+
+reset_reqs:
+	/*
+	 * For the same 'reqs[]' to be usable by
+	 * iaa_comp_acompress()/iaa_comp_adecompress(),
+	 * clear the CRYPTO_ACOMP_REQ_POLL bit on all acomp_reqs, and the
+	 * CRYPTO_TFM_REQ_CHAIN bit on the reqs[0].
+	 */
+	iaa_set_req_poll(reqs, nr_pages, false);
+	if (!async_poll)
+		acomp_reqchain_clear(reqs[0], wait);
+
+	return !err;
+}
+
+/**
+ * This API provides IAA decompress batching functionality for use by swap
+ * modules.
+ *
+ * @reqs: @nr_pages asynchronous decompress requests.
+ * @wait: crypto_wait for acomp batch decompress implemented using request
+ *        chaining. Required if async_mode is "false". If async_mode is "true",
+ *        and @wait is NULL, the completions will be processed using
+ *        asynchronous polling of the requests' completion statuses.
+ * @srcs: The src buffers to be decompressed by IAA.
+ * @pages: The pages to store the decompressed buffers.
+ * @slens: Compressed lengths of @srcs.
+ * @errors: zero on successful decompression of the corresponding
+ *          req, or error code in case of error.
+ * @nr_pages: The number of pages, up to IAA_CRYPTO_MAX_BATCH_SIZE,
+ *            to be decompressed.
+ *
+ * Returns true if all decompress requests complete successfully,
+ * false otherwise.
+ */
+static bool iaa_comp_adecompress_batch(
+	struct acomp_req *reqs[],
+	struct crypto_wait *wait,
+	u8 *srcs[],
+	struct page *pages[],
+	unsigned int slens[],
+	int errors[],
+	int nr_pages)
+{
+	struct scatterlist inputs[IAA_CRYPTO_MAX_BATCH_SIZE];
+	struct scatterlist outputs[IAA_CRYPTO_MAX_BATCH_SIZE];
+	unsigned int dlens[IAA_CRYPTO_MAX_BATCH_SIZE];
+	bool decompressions_done = false;
+	bool async = (async_mode && !use_irq);
+	bool async_poll = (async && !wait);
+	int i, err = 0;
+
+	BUG_ON(nr_pages > IAA_CRYPTO_MAX_BATCH_SIZE);
+	BUG_ON(!async && !wait);
+
+	if (async)
+		iaa_set_req_poll(reqs, nr_pages, true);
+	else
+		iaa_set_req_poll(reqs, nr_pages, false);
+
+	/*
+	 * Prepare and submit acomp_reqs to IAA. IAA will process these
+	 * decompress jobs in parallel if async_mode is true.
+	 */
+	for (i = 0; i < nr_pages; ++i) {
+		dlens[i] = PAGE_SIZE;
+		sg_init_one(&inputs[i], srcs[i], slens[i]);
+		sg_init_table(&outputs[i], 1);
+		sg_set_page(&outputs[i], pages[i], PAGE_SIZE, 0);
+		acomp_request_set_params(reqs[i], &inputs[i],
+					&outputs[i], slens[i], dlens[i]);
+
+		/*
+		 * As long as the API is called with a valid "wait", chain the
+		 * requests for synchronous/asynchronous decompress ops.
+		 * If async_mode is in effect, but the API is called with a
+		 * NULL "wait", submit the requests first, and poll for
+		 * their completion status later, after all descriptors have
+		 * been submitted.
+		 */
+		if (!async_poll) {
+			/* acomp request chaining. */
+			if (i)
+				acomp_request_chain(reqs[i], reqs[0]);
+			else
+				acomp_reqchain_init(reqs[0], 0, crypto_req_done,
+						    wait);
+		} else {
+			errors[i] = iaa_comp_adecompress(reqs[i]);
+
+			if (errors[i] != -EINPROGRESS) {
+				errors[i] = -EINVAL;
+				err = -EINVAL;
+			} else {
+				errors[i] = -EAGAIN;
+			}
+		}
+	}
+
+	if (!async_poll) {
+		if (async)
+			/* Process the request chain in parallel. */
+			err = crypto_wait_req(acomp_do_async_req_chain(reqs[0],
+					      iaa_comp_adecompress, iaa_comp_poll),
+					      wait);
+		else
+			/* Process the request chain in series. */
+			err = crypto_wait_req(acomp_do_req_chain(reqs[0],
+					      iaa_comp_adecompress), wait);
+
+		for (i = 0; i < nr_pages; ++i) {
+			errors[i] = acomp_request_err(reqs[i]);
+			if (errors[i]) {
+				err = -EINVAL;
+				pr_debug("Request chaining req %d decompress error %d\n", i, errors[i]);
+			} else {
+				dlens[i] = reqs[i]->dlen;
+				BUG_ON(dlens[i] != PAGE_SIZE);
+			}
+		}
+
+		goto reset_reqs;
+	}
+
+	/*
+	 * Asynchronously poll for and process IAA decompress job completions.
+	 */
+	while (!decompressions_done) {
+		decompressions_done = true;
+
+		for (i = 0; i < nr_pages; ++i) {
+			/*
+			 * Skip, if the decompression has already completed
+			 * successfully or with an error.
+			 */
+			if (errors[i] != -EAGAIN)
+				continue;
+
+			errors[i] = iaa_comp_poll(reqs[i]);
+
+			if (errors[i]) {
+				if (errors[i] == -EAGAIN)
+					decompressions_done = false;
+				else
+					err = -EINVAL;
+			} else {
+				dlens[i] = reqs[i]->dlen;
+				BUG_ON(dlens[i] != PAGE_SIZE);
+			}
+		}
+	}
+
+reset_reqs:
+	/*
+	 * For the same 'reqs[]' to be usable by
+	 * iaa_comp_acompress()/iaa_comp_adecompress(),
+	 * clear the CRYPTO_ACOMP_REQ_POLL bit on all acomp_reqs, and the
+	 * CRYPTO_TFM_REQ_CHAIN bit on the reqs[0].
+	 */
+	iaa_set_req_poll(reqs, nr_pages, false);
+	if (!async_poll)
+		acomp_reqchain_clear(reqs[0], wait);
+
+	return !err;
+}
+
 static int iaa_comp_init_fixed(struct crypto_acomp *acomp_tfm)
 {
 	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
@@ -1832,10 +2222,13 @@ static struct acomp_alg iaa_acomp_fixed_deflate = {
 	.compress		= iaa_comp_acompress,
 	.decompress		= iaa_comp_adecompress,
 	.dst_free               = dst_free,
+	.get_batch_size		= iaa_comp_get_batch_size,
+	.batch_compress		= iaa_comp_acompress_batch,
+	.batch_decompress	= iaa_comp_adecompress_batch,
 	.base			= {
 		.cra_name		= "deflate",
 		.cra_driver_name	= "deflate-iaa",
-		.cra_flags		= CRYPTO_ALG_ASYNC,
+		.cra_flags		= CRYPTO_ALG_ASYNC | CRYPTO_ALG_REQ_CHAIN,
 		.cra_ctxsize		= sizeof(struct iaa_compression_ctx),
 		.cra_module		= THIS_MODULE,
 		.cra_priority		= IAA_ALG_PRIORITY,
-- 
2.27.0



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v6 05/16] crypto: iaa - Enable async mode and make it the default.
  2025-02-06  7:20 [PATCH v6 00/16] zswap IAA compress batching Kanchana P Sridhar
                   ` (3 preceding siblings ...)
  2025-02-06  7:20 ` [PATCH v6 04/16] crypto: iaa - Implement batch_compress(), batch_decompress() API in iaa_crypto Kanchana P Sridhar
@ 2025-02-06  7:20 ` Kanchana P Sridhar
  2025-02-06  7:20 ` [PATCH v6 06/16] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-02-06  7:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch enables the 'async' sync_mode in the driver and makes it the
default. With this, the iaa_crypto driver loads in the most
efficient/recommended mode for parallel compressions/decompressions:
asynchronous submission of descriptors, followed by polling for job
completions, with or without request chaining. Previously, 'sync' was
the default mode.

This way, anyone who wants to use IAA for zswap/zram can do so right
after building the kernel, without having to go through these steps to
switch to async mode:

  1) disable all the IAA device/wq bindings that happen at boot time
  2) rmmod iaa_crypto
  3) modprobe iaa_crypto
  4) echo async > /sys/bus/dsa/drivers/crypto/sync_mode
  5) re-run initialization of the IAA devices and wqs

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 Documentation/driver-api/crypto/iaa/iaa-crypto.rst | 11 ++---------
 drivers/crypto/intel/iaa/iaa_crypto_main.c         |  4 ++--
 2 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
index 8e50b900d51c..782da5230fcd 100644
--- a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
+++ b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
@@ -272,7 +272,7 @@ The available attributes are:
       echo async_irq > /sys/bus/dsa/drivers/crypto/sync_mode
 
     Async mode without interrupts (caller must poll) can be enabled by
-    writing 'async' to it (please see Caveat)::
+    writing 'async' to it::
 
       echo async > /sys/bus/dsa/drivers/crypto/sync_mode
 
@@ -281,14 +281,7 @@ The available attributes are:
 
       echo sync > /sys/bus/dsa/drivers/crypto/sync_mode
 
-    The default mode is 'sync'.
-
-    Caveat: since the only mechanism that iaa_crypto currently implements
-    for async polling without interrupts is via the 'sync' mode as
-    described earlier, writing 'async' to
-    '/sys/bus/dsa/drivers/crypto/sync_mode' will internally enable the
-    'sync' mode. This is to ensure correct iaa_crypto behavior until true
-    async polling without interrupts is enabled in iaa_crypto.
+    The default mode is 'async'.
 
 .. _iaa_default_config:
 
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 61134a7ad1da..6d49f82165fe 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -153,7 +153,7 @@ static DRIVER_ATTR_RW(verify_compress);
  */
 
 /* Use async mode */
-static bool async_mode;
+static bool async_mode = true;
 /* Use interrupts */
 static bool use_irq;
 
@@ -173,7 +173,7 @@ static int set_iaa_sync_mode(const char *name)
 		async_mode = false;
 		use_irq = false;
 	} else if (sysfs_streq(name, "async")) {
-		async_mode = false;
+		async_mode = true;
 		use_irq = false;
 	} else if (sysfs_streq(name, "async_irq")) {
 		async_mode = true;
-- 
2.27.0



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v6 06/16] crypto: iaa - Disable iaa_verify_compress by default.
  2025-02-06  7:20 [PATCH v6 00/16] zswap IAA compress batching Kanchana P Sridhar
                   ` (4 preceding siblings ...)
  2025-02-06  7:20 ` [PATCH v6 05/16] crypto: iaa - Enable async mode and make it the default Kanchana P Sridhar
@ 2025-02-06  7:20 ` Kanchana P Sridhar
  2025-02-06  7:20 ` [PATCH v6 07/16] crypto: iaa - Re-organize the iaa_crypto driver code Kanchana P Sridhar
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-02-06  7:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch disables "iaa_verify_compress" by default, so that IAA
hardware acceleration in the iaa_crypto driver loads without compress
verification and can be compared on an equal footing with software
compressors (which also do not run compress verification by default).
Previously, iaa_crypto compress verification was enabled by default.

With this patch, if users want to enable compress verification, they can do
so with these steps:

  1) disable all the IAA device/wq bindings that happen at boot time
  2) rmmod iaa_crypto
  3) modprobe iaa_crypto
  4) echo 1 > /sys/bus/dsa/drivers/crypto/verify_compress
  5) re-run initialization of the IAA devices and wqs

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 6d49f82165fe..f4807a828034 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -94,7 +94,7 @@ static bool iaa_crypto_enabled;
 static bool iaa_crypto_registered;
 
 /* Verify results of IAA compress or not */
-static bool iaa_verify_compress = true;
+static bool iaa_verify_compress = false;
 
 static ssize_t verify_compress_show(struct device_driver *driver, char *buf)
 {
-- 
2.27.0



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v6 07/16] crypto: iaa - Re-organize the iaa_crypto driver code.
  2025-02-06  7:20 [PATCH v6 00/16] zswap IAA compress batching Kanchana P Sridhar
                   ` (5 preceding siblings ...)
  2025-02-06  7:20 ` [PATCH v6 06/16] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
@ 2025-02-06  7:20 ` Kanchana P Sridhar
  2025-02-06  7:20 ` [PATCH v6 08/16] crypto: iaa - Map IAA devices/wqs to cores based on packages instead of NUMA Kanchana P Sridhar
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-02-06  7:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch merely reorganizes the code in iaa_crypto_main.c so that the
functions are consolidated into logically related sub-sections of code.

This should make the code more maintainable and make it easier to
replace functional layers and/or add new features.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 540 +++++++++++----------
 1 file changed, 275 insertions(+), 265 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index f4807a828034..2c5b7ce041d6 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -24,6 +24,9 @@
 
 #define IAA_ALG_PRIORITY               300
 
+/**************************************
+ * Driver internal global variables.
+ **************************************/
 /* number of iaa instances probed */
 static unsigned int nr_iaa;
 static unsigned int nr_cpus;
@@ -36,55 +39,46 @@ static unsigned int cpus_per_iaa;
 static struct crypto_comp *deflate_generic_tfm;
 
 /* Per-cpu lookup table for balanced wqs */
-static struct wq_table_entry __percpu *wq_table;
+static struct wq_table_entry __percpu *wq_table = NULL;
 
-static struct idxd_wq *wq_table_next_wq(int cpu)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
-	if (++entry->cur_wq >= entry->n_wqs)
-		entry->cur_wq = 0;
-
-	if (!entry->wqs[entry->cur_wq])
-		return NULL;
-
-	pr_debug("%s: returning wq at idx %d (iaa wq %d.%d) from cpu %d\n", __func__,
-		 entry->cur_wq, entry->wqs[entry->cur_wq]->idxd->id,
-		 entry->wqs[entry->cur_wq]->id, cpu);
-
-	return entry->wqs[entry->cur_wq];
-}
-
-static void wq_table_add(int cpu, struct idxd_wq *wq)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
-	if (WARN_ON(entry->n_wqs == entry->max_wqs))
-		return;
-
-	entry->wqs[entry->n_wqs++] = wq;
-
-	pr_debug("%s: added iaa wq %d.%d to idx %d of cpu %d\n", __func__,
-		 entry->wqs[entry->n_wqs - 1]->idxd->id,
-		 entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
-}
-
-static void wq_table_free_entry(int cpu)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+/* Verify results of IAA compress or not */
+static bool iaa_verify_compress = false;
 
-	kfree(entry->wqs);
-	memset(entry, 0, sizeof(*entry));
-}
+/*
+ * The iaa crypto driver supports three 'sync' methods determining how
+ * compressions and decompressions are performed:
+ *
+ * - sync:      the compression or decompression completes before
+ *              returning.  This is the mode used by the async crypto
+ *              interface when the sync mode is set to 'sync' and by
+ *              the sync crypto interface regardless of setting.
+ *
+ * - async:     the compression or decompression is submitted and returns
+ *              immediately.  Completion interrupts are not used so
+ *              the caller is responsible for polling the descriptor
+ *              for completion.  This mode is applicable to only the
+ *              async crypto interface and is ignored for anything
+ *              else.
+ *
+ * - async_irq: the compression or decompression is submitted and
+ *              returns immediately.  Completion interrupts are
+ *              enabled so the caller can wait for the completion and
+ *              yield to other threads.  When the compression or
+ *              decompression completes, the completion is signaled
+ *              and the caller awakened.  This mode is applicable to
+ *              only the async crypto interface and is ignored for
+ *              anything else.
+ *
+ * These modes can be set using the iaa_crypto sync_mode driver
+ * attribute.
+ */
 
-static void wq_table_clear_entry(int cpu)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+/* Use async mode */
+static bool async_mode = true;
+/* Use interrupts */
+static bool use_irq;
 
-	entry->n_wqs = 0;
-	entry->cur_wq = 0;
-	memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
-}
+static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
 
 LIST_HEAD(iaa_devices);
 DEFINE_MUTEX(iaa_devices_lock);
@@ -93,9 +87,9 @@ DEFINE_MUTEX(iaa_devices_lock);
 static bool iaa_crypto_enabled;
 static bool iaa_crypto_registered;
 
-/* Verify results of IAA compress or not */
-static bool iaa_verify_compress = false;
-
+/**************************************************
+ * Driver attributes along with get/set functions.
+ **************************************************/
 static ssize_t verify_compress_show(struct device_driver *driver, char *buf)
 {
 	return sprintf(buf, "%d\n", iaa_verify_compress);
@@ -123,40 +117,6 @@ static ssize_t verify_compress_store(struct device_driver *driver,
 }
 static DRIVER_ATTR_RW(verify_compress);
 
-/*
- * The iaa crypto driver supports three 'sync' methods determining how
- * compressions and decompressions are performed:
- *
- * - sync:      the compression or decompression completes before
- *              returning.  This is the mode used by the async crypto
- *              interface when the sync mode is set to 'sync' and by
- *              the sync crypto interface regardless of setting.
- *
- * - async:     the compression or decompression is submitted and returns
- *              immediately.  Completion interrupts are not used so
- *              the caller is responsible for polling the descriptor
- *              for completion.  This mode is applicable to only the
- *              async crypto interface and is ignored for anything
- *              else.
- *
- * - async_irq: the compression or decompression is submitted and
- *              returns immediately.  Completion interrupts are
- *              enabled so the caller can wait for the completion and
- *              yield to other threads.  When the compression or
- *              decompression completes, the completion is signaled
- *              and the caller awakened.  This mode is applicable to
- *              only the async crypto interface and is ignored for
- *              anything else.
- *
- * These modes can be set using the iaa_crypto sync_mode driver
- * attribute.
- */
-
-/* Use async mode */
-static bool async_mode = true;
-/* Use interrupts */
-static bool use_irq;
-
 /**
  * set_iaa_sync_mode - Set IAA sync mode
  * @name: The name of the sync mode
@@ -219,8 +179,9 @@ static ssize_t sync_mode_store(struct device_driver *driver,
 }
 static DRIVER_ATTR_RW(sync_mode);
 
-static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
-
+/****************************
+ * Driver compression modes.
+ ****************************/
 static int find_empty_iaa_compression_mode(void)
 {
 	int i = -EINVAL;
@@ -411,11 +372,6 @@ static void free_device_compression_mode(struct iaa_device *iaa_device,
 						IDXD_OP_FLAG_WR_SRC2_AECS_COMP | \
 						IDXD_OP_FLAG_AECS_RW_TGLS)
 
-static int check_completion(struct device *dev,
-			    struct iax_completion_record *comp,
-			    bool compress,
-			    bool only_once);
-
 static int init_device_compression_mode(struct iaa_device *iaa_device,
 					struct iaa_compression_mode *mode,
 					int idx, struct idxd_wq *wq)
@@ -502,6 +458,10 @@ static void remove_device_compression_modes(struct iaa_device *iaa_device)
 	}
 }
 
+/***********************************************************
+ * Functions for use in crypto probe and remove interfaces:
+ * allocate/init/query/deallocate devices/wqs.
+ ***********************************************************/
 static struct iaa_device *iaa_device_alloc(void)
 {
 	struct iaa_device *iaa_device;
@@ -614,16 +574,6 @@ static void del_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
 	}
 }
 
-static void clear_wq_table(void)
-{
-	int cpu;
-
-	for (cpu = 0; cpu < nr_cpus; cpu++)
-		wq_table_clear_entry(cpu);
-
-	pr_debug("cleared wq table\n");
-}
-
 static void free_iaa_device(struct iaa_device *iaa_device)
 {
 	if (!iaa_device)
@@ -704,43 +654,6 @@ static int iaa_wq_put(struct idxd_wq *wq)
 	return ret;
 }
 
-static void free_wq_table(void)
-{
-	int cpu;
-
-	for (cpu = 0; cpu < nr_cpus; cpu++)
-		wq_table_free_entry(cpu);
-
-	free_percpu(wq_table);
-
-	pr_debug("freed wq table\n");
-}
-
-static int alloc_wq_table(int max_wqs)
-{
-	struct wq_table_entry *entry;
-	int cpu;
-
-	wq_table = alloc_percpu(struct wq_table_entry);
-	if (!wq_table)
-		return -ENOMEM;
-
-	for (cpu = 0; cpu < nr_cpus; cpu++) {
-		entry = per_cpu_ptr(wq_table, cpu);
-		entry->wqs = kcalloc(max_wqs, sizeof(struct wq *), GFP_KERNEL);
-		if (!entry->wqs) {
-			free_wq_table();
-			return -ENOMEM;
-		}
-
-		entry->max_wqs = max_wqs;
-	}
-
-	pr_debug("initialized wq table\n");
-
-	return 0;
-}
-
 static int save_iaa_wq(struct idxd_wq *wq)
 {
 	struct iaa_device *iaa_device, *found = NULL;
@@ -829,6 +742,87 @@ static void remove_iaa_wq(struct idxd_wq *wq)
 		cpus_per_iaa = 1;
 }
 
+/***************************************************************
+ * Mapping IAA devices and wqs to cores with per-cpu wq_tables.
+ ***************************************************************/
+static void wq_table_free_entry(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+	kfree(entry->wqs);
+	memset(entry, 0, sizeof(*entry));
+}
+
+static void wq_table_clear_entry(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+	entry->n_wqs = 0;
+	entry->cur_wq = 0;
+	memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
+}
+
+static void clear_wq_table(void)
+{
+	int cpu;
+
+	for (cpu = 0; cpu < nr_cpus; cpu++)
+		wq_table_clear_entry(cpu);
+
+	pr_debug("cleared wq table\n");
+}
+
+static void free_wq_table(void)
+{
+	int cpu;
+
+	for (cpu = 0; cpu < nr_cpus; cpu++)
+		wq_table_free_entry(cpu);
+
+	free_percpu(wq_table);
+
+	pr_debug("freed wq table\n");
+}
+
+static int alloc_wq_table(int max_wqs)
+{
+	struct wq_table_entry *entry;
+	int cpu;
+
+	wq_table = alloc_percpu(struct wq_table_entry);
+	if (!wq_table)
+		return -ENOMEM;
+
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		entry = per_cpu_ptr(wq_table, cpu);
+		entry->wqs = kcalloc(max_wqs, sizeof(struct wq *), GFP_KERNEL);
+		if (!entry->wqs) {
+			free_wq_table();
+			return -ENOMEM;
+		}
+
+		entry->max_wqs = max_wqs;
+	}
+
+	pr_debug("initialized wq table\n");
+
+	return 0;
+}
+
+static void wq_table_add(int cpu, struct idxd_wq *wq)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+	if (WARN_ON(entry->n_wqs == entry->max_wqs))
+		return;
+
+	entry->wqs[entry->n_wqs++] = wq;
+
+	pr_debug("%s: added iaa wq %d.%d to idx %d of cpu %d\n", __func__,
+		 entry->wqs[entry->n_wqs - 1]->idxd->id,
+		 entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
+}
+
 static int wq_table_add_wqs(int iaa, int cpu)
 {
 	struct iaa_device *iaa_device, *found_device = NULL;
@@ -939,6 +933,29 @@ static void rebalance_wq_table(void)
 	}
 }
 
+/***************************************************************
+ * Assign work-queues for driver ops using per-cpu wq_tables.
+ ***************************************************************/
+static struct idxd_wq *wq_table_next_wq(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+	if (++entry->cur_wq >= entry->n_wqs)
+		entry->cur_wq = 0;
+
+	if (!entry->wqs[entry->cur_wq])
+		return NULL;
+
+	pr_debug("%s: returning wq at idx %d (iaa wq %d.%d) from cpu %d\n", __func__,
+		 entry->cur_wq, entry->wqs[entry->cur_wq]->idxd->id,
+		 entry->wqs[entry->cur_wq]->id, cpu);
+
+	return entry->wqs[entry->cur_wq];
+}
+
+/*************************************************
+ * Core iaa_crypto compress/decompress functions.
+ *************************************************/
 static inline int check_completion(struct device *dev,
 				   struct iax_completion_record *comp,
 				   bool compress,
@@ -1020,13 +1037,130 @@ static int deflate_generic_decompress(struct acomp_req *req)
 
 static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
 				struct acomp_req *req,
-				dma_addr_t *src_addr, dma_addr_t *dst_addr);
+				dma_addr_t *src_addr, dma_addr_t *dst_addr)
+{
+	int ret = 0;
+	int nr_sgs;
+
+	dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
+	dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
+
+	nr_sgs = dma_map_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
+	if (nr_sgs <= 0 || nr_sgs > 1) {
+		dev_dbg(dev, "verify: couldn't map src sg for iaa device %d,"
+			" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
+			iaa_wq->wq->id, ret);
+		ret = -EIO;
+		goto out;
+	}
+	*src_addr = sg_dma_address(req->src);
+	dev_dbg(dev, "verify: dma_map_sg, src_addr %llx, nr_sgs %d, req->src %p,"
+		" req->slen %d, sg_dma_len(sg) %d\n", *src_addr, nr_sgs,
+		req->src, req->slen, sg_dma_len(req->src));
+
+	nr_sgs = dma_map_sg(dev, req->dst, sg_nents(req->dst), DMA_TO_DEVICE);
+	if (nr_sgs <= 0 || nr_sgs > 1) {
+		dev_dbg(dev, "verify: couldn't map dst sg for iaa device %d,"
+			" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
+			iaa_wq->wq->id, ret);
+		ret = -EIO;
+		dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
+		goto out;
+	}
+	*dst_addr = sg_dma_address(req->dst);
+	dev_dbg(dev, "verify: dma_map_sg, dst_addr %llx, nr_sgs %d, req->dst %p,"
+		" req->dlen %d, sg_dma_len(sg) %d\n", *dst_addr, nr_sgs,
+		req->dst, req->dlen, sg_dma_len(req->dst));
+out:
+	return ret;
+}
 
 static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
 			       struct idxd_wq *wq,
 			       dma_addr_t src_addr, unsigned int slen,
 			       dma_addr_t dst_addr, unsigned int *dlen,
-			       u32 compression_crc);
+			       u32 compression_crc)
+{
+	struct iaa_device_compression_mode *active_compression_mode;
+	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
+	struct iaa_device *iaa_device;
+	struct idxd_desc *idxd_desc;
+	struct iax_hw_desc *desc;
+	struct idxd_device *idxd;
+	struct iaa_wq *iaa_wq;
+	struct pci_dev *pdev;
+	struct device *dev;
+	int ret = 0;
+
+	iaa_wq = idxd_wq_get_private(wq);
+	iaa_device = iaa_wq->iaa_device;
+	idxd = iaa_device->idxd;
+	pdev = idxd->pdev;
+	dev = &pdev->dev;
+
+	active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
+
+	idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
+	if (IS_ERR(idxd_desc)) {
+		dev_dbg(dev, "idxd descriptor allocation failed\n");
+		dev_dbg(dev, "iaa compress failed: ret=%ld\n",
+			PTR_ERR(idxd_desc));
+		return PTR_ERR(idxd_desc);
+	}
+	desc = idxd_desc->iax_hw;
+
+	/* Verify (optional) - decompress and check crc, suppress dest write */
+
+	desc->flags = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CC;
+	desc->opcode = IAX_OPCODE_DECOMPRESS;
+	desc->decompr_flags = IAA_DECOMP_FLAGS | IAA_DECOMP_SUPPRESS_OUTPUT;
+	desc->priv = 0;
+
+	desc->src1_addr = (u64)dst_addr;
+	desc->src1_size = *dlen;
+	desc->dst_addr = (u64)src_addr;
+	desc->max_dst_size = slen;
+	desc->completion_addr = idxd_desc->compl_dma;
+
+	dev_dbg(dev, "(verify) compression mode %s,"
+		" desc->src1_addr %llx, desc->src1_size %d,"
+		" desc->dst_addr %llx, desc->max_dst_size %d,"
+		" desc->src2_addr %llx, desc->src2_size %d\n",
+		active_compression_mode->name,
+		desc->src1_addr, desc->src1_size, desc->dst_addr,
+		desc->max_dst_size, desc->src2_addr, desc->src2_size);
+
+	ret = idxd_submit_desc(wq, idxd_desc);
+	if (ret) {
+		dev_dbg(dev, "submit_desc (verify) failed ret=%d\n", ret);
+		goto err;
+	}
+
+	ret = check_completion(dev, idxd_desc->iax_completion, false, false);
+	if (ret) {
+		dev_dbg(dev, "(verify) check_completion failed ret=%d\n", ret);
+		goto err;
+	}
+
+	if (compression_crc != idxd_desc->iax_completion->crc) {
+		ret = -EINVAL;
+		dev_dbg(dev, "(verify) iaa comp/decomp crc mismatch:"
+			" comp=0x%x, decomp=0x%x\n", compression_crc,
+			idxd_desc->iax_completion->crc);
+		print_hex_dump(KERN_INFO, "cmp-rec: ", DUMP_PREFIX_OFFSET,
+			       8, 1, idxd_desc->iax_completion, 64, 0);
+		goto err;
+	}
+
+	idxd_free_desc(wq, idxd_desc);
+out:
+	return ret;
+err:
+	idxd_free_desc(wq, idxd_desc);
+	dev_dbg(dev, "iaa compress failed: ret=%d\n", ret);
+
+	goto out;
+}
 
 static void iaa_desc_complete(struct idxd_desc *idxd_desc,
 			      enum idxd_complete_type comp_type,
@@ -1245,133 +1379,6 @@ static int iaa_compress(struct crypto_tfm *tfm,	struct acomp_req *req,
 	goto out;
 }
 
-static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
-				struct acomp_req *req,
-				dma_addr_t *src_addr, dma_addr_t *dst_addr)
-{
-	int ret = 0;
-	int nr_sgs;
-
-	dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
-	dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
-
-	nr_sgs = dma_map_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
-	if (nr_sgs <= 0 || nr_sgs > 1) {
-		dev_dbg(dev, "verify: couldn't map src sg for iaa device %d,"
-			" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
-			iaa_wq->wq->id, ret);
-		ret = -EIO;
-		goto out;
-	}
-	*src_addr = sg_dma_address(req->src);
-	dev_dbg(dev, "verify: dma_map_sg, src_addr %llx, nr_sgs %d, req->src %p,"
-		" req->slen %d, sg_dma_len(sg) %d\n", *src_addr, nr_sgs,
-		req->src, req->slen, sg_dma_len(req->src));
-
-	nr_sgs = dma_map_sg(dev, req->dst, sg_nents(req->dst), DMA_TO_DEVICE);
-	if (nr_sgs <= 0 || nr_sgs > 1) {
-		dev_dbg(dev, "verify: couldn't map dst sg for iaa device %d,"
-			" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
-			iaa_wq->wq->id, ret);
-		ret = -EIO;
-		dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
-		goto out;
-	}
-	*dst_addr = sg_dma_address(req->dst);
-	dev_dbg(dev, "verify: dma_map_sg, dst_addr %llx, nr_sgs %d, req->dst %p,"
-		" req->dlen %d, sg_dma_len(sg) %d\n", *dst_addr, nr_sgs,
-		req->dst, req->dlen, sg_dma_len(req->dst));
-out:
-	return ret;
-}
-
-static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
-			       struct idxd_wq *wq,
-			       dma_addr_t src_addr, unsigned int slen,
-			       dma_addr_t dst_addr, unsigned int *dlen,
-			       u32 compression_crc)
-{
-	struct iaa_device_compression_mode *active_compression_mode;
-	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
-	struct iaa_device *iaa_device;
-	struct idxd_desc *idxd_desc;
-	struct iax_hw_desc *desc;
-	struct idxd_device *idxd;
-	struct iaa_wq *iaa_wq;
-	struct pci_dev *pdev;
-	struct device *dev;
-	int ret = 0;
-
-	iaa_wq = idxd_wq_get_private(wq);
-	iaa_device = iaa_wq->iaa_device;
-	idxd = iaa_device->idxd;
-	pdev = idxd->pdev;
-	dev = &pdev->dev;
-
-	active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
-
-	idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
-	if (IS_ERR(idxd_desc)) {
-		dev_dbg(dev, "idxd descriptor allocation failed\n");
-		dev_dbg(dev, "iaa compress failed: ret=%ld\n",
-			PTR_ERR(idxd_desc));
-		return PTR_ERR(idxd_desc);
-	}
-	desc = idxd_desc->iax_hw;
-
-	/* Verify (optional) - decompress and check crc, suppress dest write */
-
-	desc->flags = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CC;
-	desc->opcode = IAX_OPCODE_DECOMPRESS;
-	desc->decompr_flags = IAA_DECOMP_FLAGS | IAA_DECOMP_SUPPRESS_OUTPUT;
-	desc->priv = 0;
-
-	desc->src1_addr = (u64)dst_addr;
-	desc->src1_size = *dlen;
-	desc->dst_addr = (u64)src_addr;
-	desc->max_dst_size = slen;
-	desc->completion_addr = idxd_desc->compl_dma;
-
-	dev_dbg(dev, "(verify) compression mode %s,"
-		" desc->src1_addr %llx, desc->src1_size %d,"
-		" desc->dst_addr %llx, desc->max_dst_size %d,"
-		" desc->src2_addr %llx, desc->src2_size %d\n",
-		active_compression_mode->name,
-		desc->src1_addr, desc->src1_size, desc->dst_addr,
-		desc->max_dst_size, desc->src2_addr, desc->src2_size);
-
-	ret = idxd_submit_desc(wq, idxd_desc);
-	if (ret) {
-		dev_dbg(dev, "submit_desc (verify) failed ret=%d\n", ret);
-		goto err;
-	}
-
-	ret = check_completion(dev, idxd_desc->iax_completion, false, false);
-	if (ret) {
-		dev_dbg(dev, "(verify) check_completion failed ret=%d\n", ret);
-		goto err;
-	}
-
-	if (compression_crc != idxd_desc->iax_completion->crc) {
-		ret = -EINVAL;
-		dev_dbg(dev, "(verify) iaa comp/decomp crc mismatch:"
-			" comp=0x%x, decomp=0x%x\n", compression_crc,
-			idxd_desc->iax_completion->crc);
-		print_hex_dump(KERN_INFO, "cmp-rec: ", DUMP_PREFIX_OFFSET,
-			       8, 1, idxd_desc->iax_completion, 64, 0);
-		goto err;
-	}
-
-	idxd_free_desc(wq, idxd_desc);
-out:
-	return ret;
-err:
-	idxd_free_desc(wq, idxd_desc);
-	dev_dbg(dev, "iaa compress failed: ret=%d\n", ret);
-
-	goto out;
-}
-
 static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 			  struct idxd_wq *wq,
 			  dma_addr_t src_addr, unsigned int slen,
@@ -2197,6 +2204,9 @@ static bool iaa_comp_adecompress_batch(
 	return !err;
 }
 
+/*********************************************
+ * Interfaces to crypto_alg and crypto_acomp.
+ *********************************************/
 static int iaa_comp_init_fixed(struct crypto_acomp *acomp_tfm)
 {
 	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
-- 
2.27.0



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v6 08/16] crypto: iaa - Map IAA devices/wqs to cores based on packages instead of NUMA.
  2025-02-06  7:20 [PATCH v6 00/16] zswap IAA compress batching Kanchana P Sridhar
                   ` (6 preceding siblings ...)
  2025-02-06  7:20 ` [PATCH v6 07/16] crypto: iaa - Re-organize the iaa_crypto driver code Kanchana P Sridhar
@ 2025-02-06  7:20 ` Kanchana P Sridhar
  2025-02-06  7:20 ` [PATCH v6 09/16] crypto: iaa - Distribute compress jobs from all cores to all IAAs on a package Kanchana P Sridhar
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-02-06  7:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch modifies the algorithm for mapping available IAA devices and
wqs to cores, as they are discovered, based on packages instead of NUMA
nodes. This gives a more realistic mapping of IAA devices as
compression/decompression resources for a package, rather than for a
NUMA node. It also resolves problems that were observed during internal
validation on Intel platforms with many more NUMA nodes than packages:
in such cases, the earlier NUMA-based allocation caused some IAAs to be
over-subscribed while others were not utilized at all.
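
A simplified sketch of the resulting package-based distribution
arithmetic (illustrative only; it assumes IAA devices are spread evenly
across packages, and uses the global variable names introduced by this
patch):

  nr_iaa_per_package = nr_iaa / nr_packages;
  cpus_per_iaa = (nr_packages * nr_cpus_per_package) / nr_iaa;
  if (!cpus_per_iaa)
          cpus_per_iaa = 1;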

As a result of this change from NUMA nodes to packages, some of the core
functions used by the iaa_crypto driver's "probe" and "remove" APIs have
been rewritten. The new infrastructure maintains a static/global mapping
of "local wqs" per IAA device, in the "struct iaa_device" itself. The
earlier implementation allocated this data per-cpu, even though it never
changes once the IAA devices/wqs have been initialized.
Two main outcomes from this new iaa_crypto driver infrastructure are:

1) Resolves "task blocked for more than x seconds" errors observed during
   internal validation on Intel systems with the earlier NUMA node based
   mappings, which were root-caused to the non-optimal IAA-to-core
   mappings described earlier.

2) Results in a NUM_THREADS factor reduction in the memory footprint of
   initializing IAA devices/wqs, by eliminating the per-cpu copies of
   each IAA device's wqs. On a 384-core Intel Granite Rapids server with
   8 IAA devices, this saves 140MiB.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto.h      |  17 +-
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 276 ++++++++++++---------
 2 files changed, 171 insertions(+), 122 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index b3b67c44ec8a..74d25e62df12 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -55,6 +55,7 @@ struct iaa_wq {
 	struct idxd_wq		*wq;
 	int			ref;
 	bool			remove;
+	bool			mapped;
 
 	struct iaa_device	*iaa_device;
 
@@ -72,6 +73,13 @@ struct iaa_device_compression_mode {
 	dma_addr_t			aecs_comp_table_dma_addr;
 };
 
+struct wq_table_entry {
+	struct idxd_wq **wqs;
+	int	max_wqs;
+	int	n_wqs;
+	int	cur_wq;
+};
+
 /* Representation of IAA device with wqs, populated by probe */
 struct iaa_device {
 	struct list_head		list;
@@ -82,19 +90,14 @@ struct iaa_device {
 	int				n_wq;
 	struct list_head		wqs;
 
+	struct wq_table_entry		*iaa_local_wqs;
+
 	atomic64_t			comp_calls;
 	atomic64_t			comp_bytes;
 	atomic64_t			decomp_calls;
 	atomic64_t			decomp_bytes;
 };
 
-struct wq_table_entry {
-	struct idxd_wq **wqs;
-	int	max_wqs;
-	int	n_wqs;
-	int	cur_wq;
-};
-
 #define IAA_AECS_ALIGN			32
 
 /*
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 2c5b7ce041d6..418f78454875 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -30,8 +30,9 @@
 /* number of iaa instances probed */
 static unsigned int nr_iaa;
 static unsigned int nr_cpus;
-static unsigned int nr_nodes;
-static unsigned int nr_cpus_per_node;
+static unsigned int nr_packages;
+static unsigned int nr_cpus_per_package;
+static unsigned int nr_iaa_per_package;
 
 /* Number of physical cpus sharing each iaa instance */
 static unsigned int cpus_per_iaa;
@@ -462,17 +463,46 @@ static void remove_device_compression_modes(struct iaa_device *iaa_device)
  * Functions for use in crypto probe and remove interfaces:
  * allocate/init/query/deallocate devices/wqs.
  ***********************************************************/
-static struct iaa_device *iaa_device_alloc(void)
+static struct iaa_device *iaa_device_alloc(struct idxd_device *idxd)
 {
+	struct wq_table_entry *local;
 	struct iaa_device *iaa_device;
 
 	iaa_device = kzalloc(sizeof(*iaa_device), GFP_KERNEL);
 	if (!iaa_device)
-		return NULL;
+		goto err;
+
+	iaa_device->idxd = idxd;
+
+	/* IAA device's local wqs. */
+	iaa_device->iaa_local_wqs = kzalloc(sizeof(struct wq_table_entry), GFP_KERNEL);
+	if (!iaa_device->iaa_local_wqs)
+		goto err;
+
+	local = iaa_device->iaa_local_wqs;
+
+	local->wqs = kzalloc(iaa_device->idxd->max_wqs * sizeof(struct wq *), GFP_KERNEL);
+	if (!local->wqs)
+		goto err;
+
+	local->max_wqs = iaa_device->idxd->max_wqs;
+	local->n_wqs = 0;
 
 	INIT_LIST_HEAD(&iaa_device->wqs);
 
 	return iaa_device;
+
+err:
+	if (iaa_device) {
+		if (iaa_device->iaa_local_wqs) {
+			if (iaa_device->iaa_local_wqs->wqs)
+				kfree(iaa_device->iaa_local_wqs->wqs);
+			kfree(iaa_device->iaa_local_wqs);
+		}
+		kfree(iaa_device);
+	}
+
+	return NULL;
 }
 
 static bool iaa_has_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
@@ -491,12 +521,10 @@ static struct iaa_device *add_iaa_device(struct idxd_device *idxd)
 {
 	struct iaa_device *iaa_device;
 
-	iaa_device = iaa_device_alloc();
+	iaa_device = iaa_device_alloc(idxd);
 	if (!iaa_device)
 		return NULL;
 
-	iaa_device->idxd = idxd;
-
 	list_add_tail(&iaa_device->list, &iaa_devices);
 
 	nr_iaa++;
@@ -537,6 +565,7 @@ static int add_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq,
 	iaa_wq->wq = wq;
 	iaa_wq->iaa_device = iaa_device;
 	idxd_wq_set_private(wq, iaa_wq);
+	iaa_wq->mapped = false;
 
 	list_add_tail(&iaa_wq->list, &iaa_device->wqs);
 
@@ -580,6 +609,13 @@ static void free_iaa_device(struct iaa_device *iaa_device)
 		return;
 
 	remove_device_compression_modes(iaa_device);
+
+	if (iaa_device->iaa_local_wqs) {
+		if (iaa_device->iaa_local_wqs->wqs)
+			kfree(iaa_device->iaa_local_wqs->wqs);
+		kfree(iaa_device->iaa_local_wqs);
+	}
+
 	kfree(iaa_device);
 }
 
@@ -716,9 +752,14 @@ static int save_iaa_wq(struct idxd_wq *wq)
 	if (WARN_ON(nr_iaa == 0))
 		return -EINVAL;
 
-	cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
+	cpus_per_iaa = (nr_packages * nr_cpus_per_package) / nr_iaa;
 	if (!cpus_per_iaa)
 		cpus_per_iaa = 1;
+
+	nr_iaa_per_package = nr_iaa / nr_packages;
+	if (!nr_iaa_per_package)
+		nr_iaa_per_package = 1;
+
 out:
 	return 0;
 }
@@ -735,53 +776,45 @@ static void remove_iaa_wq(struct idxd_wq *wq)
 	}
 
 	if (nr_iaa) {
-		cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
+		cpus_per_iaa = (nr_packages * nr_cpus_per_package) / nr_iaa;
 		if (!cpus_per_iaa)
 			cpus_per_iaa = 1;
-	} else
+
+		nr_iaa_per_package = nr_iaa / nr_packages;
+		if (!nr_iaa_per_package)
+			nr_iaa_per_package = 1;
+	} else {
 		cpus_per_iaa = 1;
+		nr_iaa_per_package = 1;
+	}
 }
 
 /***************************************************************
  * Mapping IAA devices and wqs to cores with per-cpu wq_tables.
  ***************************************************************/
-static void wq_table_free_entry(int cpu)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
-	kfree(entry->wqs);
-	memset(entry, 0, sizeof(*entry));
-}
-
-static void wq_table_clear_entry(int cpu)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
-	entry->n_wqs = 0;
-	entry->cur_wq = 0;
-	memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
-}
-
-static void clear_wq_table(void)
+/*
+ * Given a cpu, find the closest IAA instance.  The idea is to try to
+ * choose the most appropriate IAA instance for a caller and spread
+ * available workqueues around to clients.
+ */
+static inline int cpu_to_iaa(int cpu)
 {
-	int cpu;
-
-	for (cpu = 0; cpu < nr_cpus; cpu++)
-		wq_table_clear_entry(cpu);
+	int package_id, base_iaa, iaa = 0;
 
-	pr_debug("cleared wq table\n");
-}
+	if (!nr_packages || !nr_iaa_per_package)
+		return 0;
 
-static void free_wq_table(void)
-{
-	int cpu;
+	package_id = topology_logical_package_id(cpu);
+	base_iaa = package_id * nr_iaa_per_package;
+	iaa = base_iaa + ((cpu % nr_cpus_per_package) / cpus_per_iaa);
 
-	for (cpu = 0; cpu < nr_cpus; cpu++)
-		wq_table_free_entry(cpu);
+	pr_debug("cpu = %d, package_id = %d, base_iaa = %d, iaa = %d",
+		 cpu, package_id, base_iaa, iaa);
 
-	free_percpu(wq_table);
+	if (iaa >= 0 && iaa < nr_iaa)
+		return iaa;
 
-	pr_debug("freed wq table\n");
+	return (nr_iaa - 1);
 }
 
 static int alloc_wq_table(int max_wqs)
@@ -795,13 +828,11 @@ static int alloc_wq_table(int max_wqs)
 
 	for (cpu = 0; cpu < nr_cpus; cpu++) {
 		entry = per_cpu_ptr(wq_table, cpu);
-		entry->wqs = kcalloc(max_wqs, sizeof(struct wq *), GFP_KERNEL);
-		if (!entry->wqs) {
-			free_wq_table();
-			return -ENOMEM;
-		}
 
+		entry->wqs = NULL;
 		entry->max_wqs = max_wqs;
+		entry->n_wqs = 0;
+		entry->cur_wq = 0;
 	}
 
 	pr_debug("initialized wq table\n");
@@ -809,33 +840,27 @@ static int alloc_wq_table(int max_wqs)
 	return 0;
 }
 
-static void wq_table_add(int cpu, struct idxd_wq *wq)
+static void wq_table_add(int cpu, struct wq_table_entry *iaa_local_wqs)
 {
 	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
 
-	if (WARN_ON(entry->n_wqs == entry->max_wqs))
-		return;
-
-	entry->wqs[entry->n_wqs++] = wq;
+	entry->wqs = iaa_local_wqs->wqs;
+	entry->max_wqs = iaa_local_wqs->max_wqs;
+	entry->n_wqs = iaa_local_wqs->n_wqs;
+	entry->cur_wq = 0;
 
-	pr_debug("%s: added iaa wq %d.%d to idx %d of cpu %d\n", __func__,
+	pr_debug("%s: cpu %d: added %d iaa local wqs up to wq %d.%d\n", __func__,
+		 cpu, entry->n_wqs,
 		 entry->wqs[entry->n_wqs - 1]->idxd->id,
-		 entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
+		 entry->wqs[entry->n_wqs - 1]->id);
 }
 
 static int wq_table_add_wqs(int iaa, int cpu)
 {
 	struct iaa_device *iaa_device, *found_device = NULL;
-	int ret = 0, cur_iaa = 0, n_wqs_added = 0;
-	struct idxd_device *idxd;
-	struct iaa_wq *iaa_wq;
-	struct pci_dev *pdev;
-	struct device *dev;
+	int ret = 0, cur_iaa = 0;
 
 	list_for_each_entry(iaa_device, &iaa_devices, list) {
-		idxd = iaa_device->idxd;
-		pdev = idxd->pdev;
-		dev = &pdev->dev;
 
 		if (cur_iaa != iaa) {
 			cur_iaa++;
@@ -843,7 +868,8 @@ static int wq_table_add_wqs(int iaa, int cpu)
 		}
 
 		found_device = iaa_device;
-		dev_dbg(dev, "getting wq from iaa_device %d, cur_iaa %d\n",
+		dev_dbg(&found_device->idxd->pdev->dev,
+			"getting wq from iaa_device %d, cur_iaa %d\n",
 			found_device->idxd->id, cur_iaa);
 		break;
 	}
@@ -858,29 +884,58 @@ static int wq_table_add_wqs(int iaa, int cpu)
 		}
 		cur_iaa = 0;
 
-		idxd = found_device->idxd;
-		pdev = idxd->pdev;
-		dev = &pdev->dev;
-		dev_dbg(dev, "getting wq from only iaa_device %d, cur_iaa %d\n",
+		dev_dbg(&found_device->idxd->pdev->dev,
+			"getting wq from only iaa_device %d, cur_iaa %d\n",
 			found_device->idxd->id, cur_iaa);
 	}
 
-	list_for_each_entry(iaa_wq, &found_device->wqs, list) {
-		wq_table_add(cpu, iaa_wq->wq);
-		pr_debug("rebalance: added wq for cpu=%d: iaa wq %d.%d\n",
-			 cpu, iaa_wq->wq->idxd->id, iaa_wq->wq->id);
-		n_wqs_added++;
+	wq_table_add(cpu, found_device->iaa_local_wqs);
+
+out:
+	return ret;
+}
+
+static int map_iaa_device_wqs(struct iaa_device *iaa_device)
+{
+	struct wq_table_entry *local;
+	int ret = 0, n_wqs_added = 0;
+	struct iaa_wq *iaa_wq;
+
+	local = iaa_device->iaa_local_wqs;
+
+	list_for_each_entry(iaa_wq, &iaa_device->wqs, list) {
+		if (iaa_wq->mapped && ++n_wqs_added)
+			continue;
+
+		pr_debug("iaa_device %px: processing wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+
+		if (WARN_ON(local->n_wqs == local->max_wqs))
+			break;
+
+		local->wqs[local->n_wqs++] = iaa_wq->wq;
+		pr_debug("iaa_device %px: added local wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+
+		iaa_wq->mapped = true;
+		++n_wqs_added;
 	}
 
-	if (!n_wqs_added) {
-		pr_debug("couldn't find any iaa wqs!\n");
+	if (!n_wqs_added && !iaa_device->n_wq) {
+		pr_debug("iaa_device %d: couldn't find any iaa wqs!\n", iaa_device->idxd->id);
 		ret = -EINVAL;
-		goto out;
 	}
-out:
+
 	return ret;
 }
 
+static void map_iaa_devices(void)
+{
+	struct iaa_device *iaa_device;
+
+	list_for_each_entry(iaa_device, &iaa_devices, list) {
+		BUG_ON(map_iaa_device_wqs(iaa_device));
+	}
+}
+
 /*
  * Rebalance the wq table so that given a cpu, it's easy to find the
  * closest IAA instance.  The idea is to try to choose the most
@@ -889,48 +944,42 @@ static int wq_table_add_wqs(int iaa, int cpu)
  */
 static void rebalance_wq_table(void)
 {
-	const struct cpumask *node_cpus;
-	int node, cpu, iaa = -1;
+	int cpu, iaa;
 
 	if (nr_iaa == 0)
 		return;
 
-	pr_debug("rebalance: nr_nodes=%d, nr_cpus %d, nr_iaa %d, cpus_per_iaa %d\n",
-		 nr_nodes, nr_cpus, nr_iaa, cpus_per_iaa);
+	map_iaa_devices();
 
-	clear_wq_table();
+	pr_debug("rebalance: nr_packages=%d, nr_cpus %d, nr_iaa %d, cpus_per_iaa %d\n",
+		 nr_packages, nr_cpus, nr_iaa, cpus_per_iaa);
 
-	if (nr_iaa == 1) {
-		for (cpu = 0; cpu < nr_cpus; cpu++) {
-			if (WARN_ON(wq_table_add_wqs(0, cpu))) {
-				pr_debug("could not add any wqs for iaa 0 to cpu %d!\n", cpu);
-				return;
-			}
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		iaa = cpu_to_iaa(cpu);
+		pr_debug("rebalance: cpu=%d iaa=%d\n", cpu, iaa);
+
+		if (WARN_ON(iaa == -1)) {
+			pr_debug("rebalance (cpu_to_iaa(%d)) failed!\n", cpu);
+			return;
 		}
 
-		return;
+		if (WARN_ON(wq_table_add_wqs(iaa, cpu))) {
+			pr_debug("could not add any wqs for iaa %d to cpu %d!\n", iaa, cpu);
+			return;
+		}
 	}
 
-	for_each_node_with_cpus(node) {
-		node_cpus = cpumask_of_node(node);
-
-		for (cpu = 0; cpu <  cpumask_weight(node_cpus); cpu++) {
-			int node_cpu = cpumask_nth(cpu, node_cpus);
-
-			if (WARN_ON(node_cpu >= nr_cpu_ids)) {
-				pr_debug("node_cpu %d doesn't exist!\n", node_cpu);
-				return;
-			}
-
-			if ((cpu % cpus_per_iaa) == 0)
-				iaa++;
+	pr_debug("Finished rebalance local wqs.");
+}
 
-			if (WARN_ON(wq_table_add_wqs(iaa, node_cpu))) {
-				pr_debug("could not add any wqs for iaa %d to cpu %d!\n", iaa, cpu);
-				return;
-			}
-		}
+static void free_wq_tables(void)
+{
+	if (wq_table) {
+		free_percpu(wq_table);
+		wq_table = NULL;
 	}
+
+	pr_debug("freed local wq table\n");
 }
 
 /***************************************************************
@@ -2347,7 +2396,7 @@ static int iaa_crypto_probe(struct idxd_dev *idxd_dev)
 	free_iaa_wq(idxd_wq_get_private(wq));
 err_save:
 	if (first_wq)
-		free_wq_table();
+		free_wq_tables();
 err_alloc:
 	mutex_unlock(&iaa_devices_lock);
 	idxd_drv_disable_wq(wq);
@@ -2397,7 +2446,9 @@ static void iaa_crypto_remove(struct idxd_dev *idxd_dev)
 
 	if (nr_iaa == 0) {
 		iaa_crypto_enabled = false;
-		free_wq_table();
+		free_wq_tables();
+		BUG_ON(!list_empty(&iaa_devices));
+		INIT_LIST_HEAD(&iaa_devices);
 		module_put(THIS_MODULE);
 
 		pr_info("iaa_crypto now DISABLED\n");
@@ -2423,16 +2474,11 @@ static struct idxd_device_driver iaa_crypto_driver = {
 static int __init iaa_crypto_init_module(void)
 {
 	int ret = 0;
-	int node;
+	INIT_LIST_HEAD(&iaa_devices);
 
 	nr_cpus = num_possible_cpus();
-	for_each_node_with_cpus(node)
-		nr_nodes++;
-	if (!nr_nodes) {
-		pr_err("IAA couldn't find any nodes with cpus\n");
-		return -ENODEV;
-	}
-	nr_cpus_per_node = nr_cpus / nr_nodes;
+	nr_cpus_per_package = topology_num_cores_per_package();
+	nr_packages = topology_max_packages();
 
 	if (crypto_has_comp("deflate-generic", 0, 0))
 		deflate_generic_tfm = crypto_alloc_comp("deflate-generic", 0, 0);
-- 
2.27.0



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v6 09/16] crypto: iaa - Distribute compress jobs from all cores to all IAAs on a package.
  2025-02-06  7:20 [PATCH v6 00/16] zswap IAA compress batching Kanchana P Sridhar
                   ` (7 preceding siblings ...)
  2025-02-06  7:20 ` [PATCH v6 08/16] crypto: iaa - Map IAA devices/wqs to cores based on packages instead of NUMA Kanchana P Sridhar
@ 2025-02-06  7:20 ` Kanchana P Sridhar
  2025-02-06  7:20 ` [PATCH v6 10/16] crypto: iaa - Descriptor allocation timeouts with mitigations in iaa_crypto Kanchana P Sridhar
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-02-06  7:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This change enables processes running on any logical core on a package to
use all the IAA devices enabled on that package for compress jobs. In
other words, compressions originating from any process in a package will be
distributed in a round-robin manner to the available IAA devices on the
same package.

This is not the default behavior, and it is recommended only for highly
contended scenarios with significant swapout/swapin activity. This commit
log describes how to enable the feature through driver parameters, but the
key thing to note is that it requires configuring 2 work-queues per IAA
device (each with 64 entries), with 1 WQ used solely for decompress jobs
and the other WQ used solely for compress jobs. Hence the above
recommendation.

The main premise behind this change is to ensure that no compress engine
on any IAA device is left idle, under-utilized or over-utilized. In other
words, the compress engines on all IAA devices are treated as a global
resource for the package, thus maximizing compression throughput.

This allows the use of all IAA devices present in a given package for
(batched) compressions originating from zswap/zram, from all cores
on this package.

A new per-cpu "global_wq_table" implements this in the iaa_crypto driver.
We can think of the global WQ per IAA as a WQ to which all cores on
that package can submit compress jobs.

To use this feature, the user must configure 2 WQs per IAA, which enables
distribution of compress jobs to multiple IAA devices.

Each IAA will have 2 WQs:
 wq.0 (local WQ):
   Used for decompress jobs from cores mapped by the cpu_to_iaa() "even
   balancing of logical cores to IAA devices" algorithm.

 wq.1 (global WQ):
   Used for compress jobs from *all* logical cores on that package.

The iaa_crypto driver will place all global WQs from all same-package IAA
devices in the global_wq_table per cpu on that package. When the driver
receives a compress job, it will look up the "next" global WQ in the cpu's
global_wq_table and submit the descriptor to it.

The starting wq in the global_wq_table for each cpu is the global wq
associated with the IAA nearest to it, so that we stagger the starting
global wq for each process. This results in very uniform usage of all IAAs
for compress jobs.
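
As a rough, self-contained sketch of this selection policy (not the
driver code; the struct and numbers are illustrative, and consec_limit
plays the role of the g_consec_descs_per_gwq parameter described below):

  #include <stdio.h>

  struct gwq_cursor {
          int n_wqs;          /* global wqs available on this package */
          int cur_wq;         /* current index into that list         */
          int consec;         /* descriptors already sent to cur_wq   */
          int consec_limit;   /* cf. g_consec_descs_per_gwq           */
  };

  static int next_global_wq(struct gwq_cursor *c)
  {
          if (c->consec == c->consec_limit) {
                  if (++c->cur_wq >= c->n_wqs)
                          c->cur_wq = 0;
                  c->consec = 0;
          }
          c->consec++;
          return c->cur_wq;
  }

  int main(void)
  {
          /* 4 global wqs; cur_wq = 1 emulates a staggered start. */
          struct gwq_cursor c = { 4, 1, 0, 1 };
          int i;

          for (i = 0; i < 6; i++)
                  printf("descriptor %d -> global wq %d\n",
                         i, next_global_wq(&c));
          return 0;
  }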

Two new driver module parameters are added for this feature:

g_wqs_per_iaa (default 0):

 /sys/bus/dsa/drivers/crypto/g_wqs_per_iaa

 This represents the number of global WQs that can be configured per IAA
 device. The recommended setting is 1 to enable the use of this feature
 once the user configures 2 WQs per IAA using higher level scripts as
 described in Documentation/driver-api/crypto/iaa/iaa-crypto.rst.

g_consec_descs_per_gwq (default 1):

 /sys/bus/dsa/drivers/crypto/g_consec_descs_per_gwq

 This represents the number of consecutive compress jobs that will be
 submitted to the same global WQ (i.e. to the same IAA device) from a given
 core, before moving to the next global WQ. The default is 1, which is also
 the recommended setting to avail of this feature.

The decompress jobs from any core will be sent to the "local" IAA, namely
the one that the driver assigns with the cpu_to_iaa() mapping algorithm
that evenly balances the assignment of logical cores to IAA devices on a
package.

On a 2-package Sapphire Rapids server where each package has 56 cores and
4 IAA devices, this is how the compress/decompress jobs will be mapped
when the user configures 2 WQs per IAA device (which implies wq.1 will
be added to the global WQ table for each logical core on that package):

 package(s):        2
 package0 CPU(s):   0-55,112-167
 package1 CPU(s):   56-111,168-223

 Compress jobs:
 --------------
 package 0:
 iaa_crypto will send compress jobs from all cpus (0-55,112-167) to all IAA
 devices on the package (iax1/iax3/iax5/iax7) in round-robin manner:
 iaa:   iax1           iax3           iax5           iax7

 package 1:
 iaa_crypto will send compress jobs from all cpus (56-111,168-223) to all
 IAA devices on the package (iax9/iax11/iax13/iax15) in round-robin manner:
 iaa:   iax9           iax11          iax13           iax15

 Decompress jobs:
 ----------------
 package 0:
 cpu   0-13,112-125   14-27,126-139  28-41,140-153  42-55,154-167
 iaa:  iax1           iax3           iax5           iax7

 package 1:
 cpu   56-69,168-181  70-83,182-195  84-97,196-209   98-111,210-223
 iaa:  iax9           iax11          iax13           iax15

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto.h      |   1 +
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 385 ++++++++++++++++++++-
 2 files changed, 378 insertions(+), 8 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index 74d25e62df12..c46c70ecf355 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -91,6 +91,7 @@ struct iaa_device {
 	struct list_head		wqs;
 
 	struct wq_table_entry		*iaa_local_wqs;
+	struct wq_table_entry		*iaa_global_wqs;
 
 	atomic64_t			comp_calls;
 	atomic64_t			comp_bytes;
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 418f78454875..4ca9028d6050 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -42,6 +42,18 @@ static struct crypto_comp *deflate_generic_tfm;
 /* Per-cpu lookup table for balanced wqs */
 static struct wq_table_entry __percpu *wq_table = NULL;
 
+static struct wq_table_entry **pkg_global_wq_tables = NULL;
+
+/* Per-cpu lookup table for global wqs shared by all cpus. */
+static struct wq_table_entry __percpu *global_wq_table = NULL;
+
+/*
+ * Per-cpu counter of consecutive descriptors allocated to
+ * the same wq in the global_wq_table, so that we know
+ * when to switch to the next wq in the global_wq_table.
+ */
+static int __percpu *num_consec_descs_per_wq = NULL;
+
 /* Verify results of IAA compress or not */
 static bool iaa_verify_compress = false;
 
@@ -79,6 +91,16 @@ static bool async_mode = true;
 /* Use interrupts */
 static bool use_irq;
 
+/* Number of global wqs per iaa*/
+static int g_wqs_per_iaa = 0;
+
+/*
+ * Number of consecutive descriptors to allocate from a
+ * given global wq before switching to the next wq in
+ * the global_wq_table.
+ */
+static int g_consec_descs_per_gwq = 1;
+
 static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
 
 LIST_HEAD(iaa_devices);
@@ -180,6 +202,60 @@ static ssize_t sync_mode_store(struct device_driver *driver,
 }
 static DRIVER_ATTR_RW(sync_mode);
 
+static ssize_t g_wqs_per_iaa_show(struct device_driver *driver, char *buf)
+{
+	return sprintf(buf, "%d\n", g_wqs_per_iaa);
+}
+
+static ssize_t g_wqs_per_iaa_store(struct device_driver *driver,
+				   const char *buf, size_t count)
+{
+	int ret = -EBUSY;
+
+	mutex_lock(&iaa_devices_lock);
+
+	if (iaa_crypto_enabled)
+		goto out;
+
+	ret = kstrtoint(buf, 10, &g_wqs_per_iaa);
+	if (ret)
+		goto out;
+
+	ret = count;
+out:
+	mutex_unlock(&iaa_devices_lock);
+
+	return ret;
+}
+static DRIVER_ATTR_RW(g_wqs_per_iaa);
+
+static ssize_t g_consec_descs_per_gwq_show(struct device_driver *driver, char *buf)
+{
+	return sprintf(buf, "%d\n", g_consec_descs_per_gwq);
+}
+
+static ssize_t g_consec_descs_per_gwq_store(struct device_driver *driver,
+					    const char *buf, size_t count)
+{
+	int ret = -EBUSY;
+
+	mutex_lock(&iaa_devices_lock);
+
+	if (iaa_crypto_enabled)
+		goto out;
+
+	ret = kstrtoint(buf, 10, &g_consec_descs_per_gwq);
+	if (ret)
+		goto out;
+
+	ret = count;
+out:
+	mutex_unlock(&iaa_devices_lock);
+
+	return ret;
+}
+static DRIVER_ATTR_RW(g_consec_descs_per_gwq);
+
 /****************************
  * Driver compression modes.
  ****************************/
@@ -465,7 +541,7 @@ static void remove_device_compression_modes(struct iaa_device *iaa_device)
  ***********************************************************/
 static struct iaa_device *iaa_device_alloc(struct idxd_device *idxd)
 {
-	struct wq_table_entry *local;
+	struct wq_table_entry *local, *global;
 	struct iaa_device *iaa_device;
 
 	iaa_device = kzalloc(sizeof(*iaa_device), GFP_KERNEL);
@@ -488,6 +564,20 @@ static struct iaa_device *iaa_device_alloc(struct idxd_device *idxd)
 	local->max_wqs = iaa_device->idxd->max_wqs;
 	local->n_wqs = 0;
 
+	/* IAA device's global wqs. */
+	iaa_device->iaa_global_wqs = kzalloc(sizeof(struct wq_table_entry), GFP_KERNEL);
+	if (!iaa_device->iaa_global_wqs)
+		goto err;
+
+	global = iaa_device->iaa_global_wqs;
+
+	global->wqs = kzalloc(iaa_device->idxd->max_wqs * sizeof(struct wq *), GFP_KERNEL);
+	if (!global->wqs)
+		goto err;
+
+	global->max_wqs = iaa_device->idxd->max_wqs;
+	global->n_wqs = 0;
+
 	INIT_LIST_HEAD(&iaa_device->wqs);
 
 	return iaa_device;
@@ -499,6 +589,8 @@ static struct iaa_device *iaa_device_alloc(struct idxd_device *idxd)
 				kfree(iaa_device->iaa_local_wqs->wqs);
 			kfree(iaa_device->iaa_local_wqs);
 		}
+		if (iaa_device->iaa_global_wqs)
+			kfree(iaa_device->iaa_global_wqs);
 		kfree(iaa_device);
 	}
 
@@ -616,6 +708,12 @@ static void free_iaa_device(struct iaa_device *iaa_device)
 		kfree(iaa_device->iaa_local_wqs);
 	}
 
+	if (iaa_device->iaa_global_wqs) {
+		if (iaa_device->iaa_global_wqs->wqs)
+			kfree(iaa_device->iaa_global_wqs->wqs);
+		kfree(iaa_device->iaa_global_wqs);
+	}
+
 	kfree(iaa_device);
 }
 
@@ -817,6 +915,58 @@ static inline int cpu_to_iaa(int cpu)
 	return (nr_iaa - 1);
 }
 
+static void free_global_wq_table(void)
+{
+	if (global_wq_table) {
+		free_percpu(global_wq_table);
+		global_wq_table = NULL;
+	}
+
+	if (num_consec_descs_per_wq) {
+		free_percpu(num_consec_descs_per_wq);
+		num_consec_descs_per_wq = NULL;
+	}
+
+	pr_debug("freed global wq table\n");
+}
+
+static int pkg_global_wq_tables_alloc(void)
+{
+	int i, j;
+
+	pkg_global_wq_tables = kzalloc(nr_packages * sizeof(*pkg_global_wq_tables), GFP_KERNEL);
+	if (!pkg_global_wq_tables)
+		return -ENOMEM;
+
+	for (i = 0; i < nr_packages; ++i) {
+		pkg_global_wq_tables[i] = kzalloc(sizeof(struct wq_table_entry), GFP_KERNEL);
+
+		if (!pkg_global_wq_tables[i]) {
+			for (j = 0; j < i; ++j)
+				kfree(pkg_global_wq_tables[j]);
+			kfree(pkg_global_wq_tables);
+			pkg_global_wq_tables = NULL;
+			return -ENOMEM;
+		}
+		pkg_global_wq_tables[i]->wqs = NULL;
+	}
+
+	return 0;
+}
+
+static void pkg_global_wq_tables_dealloc(void)
+{
+	int i;
+
+	for (i = 0; i < nr_packages; ++i) {
+		if (pkg_global_wq_tables[i]->wqs)
+			kfree(pkg_global_wq_tables[i]->wqs);
+		kfree(pkg_global_wq_tables[i]);
+	}
+	kfree(pkg_global_wq_tables);
+	pkg_global_wq_tables = NULL;
+}
+
 static int alloc_wq_table(int max_wqs)
 {
 	struct wq_table_entry *entry;
@@ -835,6 +985,35 @@ static int alloc_wq_table(int max_wqs)
 		entry->cur_wq = 0;
 	}
 
+	global_wq_table = alloc_percpu(struct wq_table_entry);
+	if (!global_wq_table)
+		return 0;
+
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		entry = per_cpu_ptr(global_wq_table, cpu);
+
+		entry->wqs = NULL;
+		entry->max_wqs = max_wqs;
+		entry->n_wqs = 0;
+		entry->cur_wq = 0;
+	}
+
+	num_consec_descs_per_wq = alloc_percpu(int);
+	if (!num_consec_descs_per_wq) {
+		free_global_wq_table();
+		return 0;
+	}
+
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		int *num_consec_descs = per_cpu_ptr(num_consec_descs_per_wq, cpu);
+		*num_consec_descs = 0;
+	}
+
+	if (pkg_global_wq_tables_alloc()) {
+		free_global_wq_table();
+		return 0;
+	}
+
 	pr_debug("initialized wq table\n");
 
 	return 0;
@@ -895,13 +1074,120 @@ static int wq_table_add_wqs(int iaa, int cpu)
 	return ret;
 }
 
+static void pkg_global_wq_tables_reinit(void)
+{
+	int i, cur_iaa = 0, pkg = 0, nr_pkg_wqs = 0;
+	struct iaa_device *iaa_device;
+	struct wq_table_entry *global;
+
+	if (!pkg_global_wq_tables)
+		return;
+
+	/* Reallocate per-package wqs. */
+	list_for_each_entry(iaa_device, &iaa_devices, list) {
+		global = iaa_device->iaa_global_wqs;
+		nr_pkg_wqs += global->n_wqs;
+
+		if (++cur_iaa == nr_iaa_per_package) {
+			nr_pkg_wqs = nr_pkg_wqs ? max_t(int, iaa_device->idxd->max_wqs, nr_pkg_wqs) : 0;
+
+			if (pkg_global_wq_tables[pkg]->wqs) {
+				kfree(pkg_global_wq_tables[pkg]->wqs);
+				pkg_global_wq_tables[pkg]->wqs = NULL;
+			}
+
+			if (nr_pkg_wqs)
+				pkg_global_wq_tables[pkg]->wqs = kzalloc(nr_pkg_wqs *
+									 sizeof(struct wq *),
+									 GFP_KERNEL);
+
+			pkg_global_wq_tables[pkg]->n_wqs = 0;
+			pkg_global_wq_tables[pkg]->cur_wq = 0;
+			pkg_global_wq_tables[pkg]->max_wqs = nr_pkg_wqs;
+
+			if (++pkg == nr_packages)
+				break;
+			cur_iaa = 0;
+			nr_pkg_wqs = 0;
+		}
+	}
+
+	pkg = 0;
+	cur_iaa = 0;
+
+	/* Re-initialize per-package wqs. */
+	list_for_each_entry(iaa_device, &iaa_devices, list) {
+		global = iaa_device->iaa_global_wqs;
+
+		if (pkg_global_wq_tables[pkg]->wqs)
+			for (i = 0; i < global->n_wqs; ++i)
+				pkg_global_wq_tables[pkg]->wqs[pkg_global_wq_tables[pkg]->n_wqs++] = global->wqs[i];
+
+		pr_debug("pkg_global_wq_tables[%d] has %d wqs", pkg, pkg_global_wq_tables[pkg]->n_wqs);
+
+		if (++cur_iaa == nr_iaa_per_package) {
+			if (++pkg == nr_packages)
+				break;
+			cur_iaa = 0;
+		}
+	}
+}
+
+static void global_wq_table_add(int cpu, struct wq_table_entry *pkg_global_wq_table)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(global_wq_table, cpu);
+
+	/* This could be NULL. */
+	entry->wqs = pkg_global_wq_table->wqs;
+	entry->max_wqs = pkg_global_wq_table->max_wqs;
+	entry->n_wqs = pkg_global_wq_table->n_wqs;
+	entry->cur_wq = 0;
+
+	if (entry->wqs)
+		pr_debug("%s: cpu %d: added %d iaa global wqs up to wq %d.%d\n", __func__,
+			 cpu, entry->n_wqs,
+			 entry->wqs[entry->n_wqs - 1]->idxd->id,
+			 entry->wqs[entry->n_wqs - 1]->id);
+}
+
+static void global_wq_table_set_start_wq(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(global_wq_table, cpu);
+	int start_wq = g_wqs_per_iaa * (cpu_to_iaa(cpu) % nr_iaa_per_package);
+
+	if ((start_wq >= 0) && (start_wq < entry->n_wqs))
+		entry->cur_wq = start_wq;
+}
+
+static void global_wq_table_add_wqs(void)
+{
+	int cpu;
+
+	if (!pkg_global_wq_tables)
+		return;
+
+	for (cpu = 0; cpu < nr_cpus; cpu += nr_cpus_per_package) {
+		/* cpu's on the same package get the same global_wq_table. */
+		int package_id = topology_logical_package_id(cpu);
+		int pkg_cpu;
+
+		for (pkg_cpu = cpu; pkg_cpu < cpu + nr_cpus_per_package; ++pkg_cpu) {
+			if (pkg_global_wq_tables[package_id]->n_wqs > 0) {
+				global_wq_table_add(pkg_cpu, pkg_global_wq_tables[package_id]);
+				global_wq_table_set_start_wq(pkg_cpu);
+			}
+		}
+	}
+}
+
 static int map_iaa_device_wqs(struct iaa_device *iaa_device)
 {
-	struct wq_table_entry *local;
+	struct wq_table_entry *local, *global;
 	int ret = 0, n_wqs_added = 0;
 	struct iaa_wq *iaa_wq;
 
 	local = iaa_device->iaa_local_wqs;
+	global = iaa_device->iaa_global_wqs;
 
 	list_for_each_entry(iaa_wq, &iaa_device->wqs, list) {
 		if (iaa_wq->mapped && ++n_wqs_added)
@@ -909,11 +1195,18 @@ static int map_iaa_device_wqs(struct iaa_device *iaa_device)
 
 		pr_debug("iaa_device %px: processing wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
 
-		if (WARN_ON(local->n_wqs == local->max_wqs))
-			break;
+		if ((!n_wqs_added || ((n_wqs_added + g_wqs_per_iaa) < iaa_device->n_wq)) &&
+			(local->n_wqs < local->max_wqs)) {
+
+			local->wqs[local->n_wqs++] = iaa_wq->wq;
+			pr_debug("iaa_device %px: added local wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+		} else {
+			if (WARN_ON(global->n_wqs == global->max_wqs))
+				break;
 
-		local->wqs[local->n_wqs++] = iaa_wq->wq;
-		pr_debug("iaa_device %px: added local wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+			global->wqs[global->n_wqs++] = iaa_wq->wq;
+			pr_debug("iaa_device %px: added global wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+		}
 
 		iaa_wq->mapped = true;
 		++n_wqs_added;
@@ -969,6 +1262,10 @@ static void rebalance_wq_table(void)
 		}
 	}
 
+	if (iaa_crypto_enabled && pkg_global_wq_tables) {
+		pkg_global_wq_tables_reinit();
+		global_wq_table_add_wqs();
+	}
 	pr_debug("Finished rebalance local wqs.");
 }
 
@@ -979,7 +1276,17 @@ static void free_wq_tables(void)
 		wq_table = NULL;
 	}
 
-	pr_debug("freed local wq table\n");
+	if (global_wq_table) {
+		free_percpu(global_wq_table);
+		global_wq_table = NULL;
+	}
+
+	if (num_consec_descs_per_wq) {
+		free_percpu(num_consec_descs_per_wq);
+		num_consec_descs_per_wq = NULL;
+	}
+
+	pr_debug("freed wq tables\n");
 }
 
 /***************************************************************
@@ -1002,6 +1309,35 @@ static struct idxd_wq *wq_table_next_wq(int cpu)
 	return entry->wqs[entry->cur_wq];
 }
 
+/*
+ * Caller should make sure to call only if the
+ * per_cpu_ptr "global_wq_table" is non-NULL
+ * and has at least one wq configured.
+ */
+static struct idxd_wq *global_wq_table_next_wq(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(global_wq_table, cpu);
+	int *num_consec_descs = per_cpu_ptr(num_consec_descs_per_wq, cpu);
+
+	/*
+	 * Fall-back to local IAA's wq if there were no global wqs configured
+	 * for any IAA device, or if there were problems in setting up global
+	 * wqs for this cpu's package.
+	 */
+	if (!entry->wqs)
+		return wq_table_next_wq(cpu);
+
+	if ((*num_consec_descs) == g_consec_descs_per_gwq) {
+		if (++entry->cur_wq >= entry->n_wqs)
+			entry->cur_wq = 0;
+		*num_consec_descs = 0;
+	}
+
+	++(*num_consec_descs);
+
+	return entry->wqs[entry->cur_wq];
+}
+
 /*************************************************
  * Core iaa_crypto compress/decompress functions.
  *************************************************/
@@ -1563,6 +1899,7 @@ static int iaa_comp_acompress(struct acomp_req *req)
 	struct idxd_wq *wq;
 	struct device *dev;
 	int order = -1;
+	struct wq_table_entry *entry;
 
 	compression_ctx = crypto_tfm_ctx(tfm);
 
@@ -1581,8 +1918,15 @@ static int iaa_comp_acompress(struct acomp_req *req)
 		disable_async = true;
 
 	cpu = get_cpu();
-	wq = wq_table_next_wq(cpu);
+	entry = per_cpu_ptr(global_wq_table, cpu);
+
+	if (!entry || !entry->wqs || entry->n_wqs == 0) {
+		wq = wq_table_next_wq(cpu);
+	} else {
+		wq = global_wq_table_next_wq(cpu);
+	}
 	put_cpu();
+
 	if (!wq) {
 		pr_debug("no wq configured for cpu=%d\n", cpu);
 		return -ENODEV;
@@ -2446,6 +2790,7 @@ static void iaa_crypto_remove(struct idxd_dev *idxd_dev)
 
 	if (nr_iaa == 0) {
 		iaa_crypto_enabled = false;
+		pkg_global_wq_tables_dealloc();
 		free_wq_tables();
 		BUG_ON(!list_empty(&iaa_devices));
 		INIT_LIST_HEAD(&iaa_devices);
@@ -2515,6 +2860,20 @@ static int __init iaa_crypto_init_module(void)
 		goto err_sync_attr_create;
 	}
 
+	ret = driver_create_file(&iaa_crypto_driver.drv,
+				&driver_attr_g_wqs_per_iaa);
+	if (ret) {
+		pr_debug("IAA g_wqs_per_iaa attr creation failed\n");
+		goto err_g_wqs_per_iaa_attr_create;
+	}
+
+	ret = driver_create_file(&iaa_crypto_driver.drv,
+				&driver_attr_g_consec_descs_per_gwq);
+	if (ret) {
+		pr_debug("IAA g_consec_descs_per_gwq attr creation failed\n");
+		goto err_g_consec_descs_per_gwq_attr_create;
+	}
+
 	if (iaa_crypto_debugfs_init())
 		pr_warn("debugfs init failed, stats not available\n");
 
@@ -2522,6 +2881,12 @@ static int __init iaa_crypto_init_module(void)
 out:
 	return ret;
 
+err_g_consec_descs_per_gwq_attr_create:
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_g_wqs_per_iaa);
+err_g_wqs_per_iaa_attr_create:
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_sync_mode);
 err_sync_attr_create:
 	driver_remove_file(&iaa_crypto_driver.drv,
 			   &driver_attr_verify_compress);
@@ -2545,6 +2910,10 @@ static void __exit iaa_crypto_cleanup_module(void)
 			   &driver_attr_sync_mode);
 	driver_remove_file(&iaa_crypto_driver.drv,
 			   &driver_attr_verify_compress);
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_g_wqs_per_iaa);
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_g_consec_descs_per_gwq);
 	idxd_driver_unregister(&iaa_crypto_driver);
 	iaa_aecs_cleanup_fixed();
 	crypto_free_comp(deflate_generic_tfm);
-- 
2.27.0



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v6 10/16] crypto: iaa - Descriptor allocation timeouts with mitigations in iaa_crypto.
  2025-02-06  7:20 [PATCH v6 00/16] zswap IAA compress batching Kanchana P Sridhar
                   ` (8 preceding siblings ...)
  2025-02-06  7:20 ` [PATCH v6 09/16] crypto: iaa - Distribute compress jobs from all cores to all IAAs on a package Kanchana P Sridhar
@ 2025-02-06  7:20 ` Kanchana P Sridhar
  2025-02-06  7:20 ` [PATCH v6 11/16] crypto: iaa - Fix for "deflate_generic_tfm" global being accessed without locks Kanchana P Sridhar
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-02-06  7:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch modifies the descriptor allocation from blocking to non-blocking
with bounded retries or "timeouts".

This is necessary to prevent "task blocked" errors with higher
compress/decompress batch-sizes in high contention scenarios, for instance,
when the platform has only 1 IAA device enabled. With 1 IAA device enabled
per package on a dual-package SPR with 56 cores/package, there are 112
logical cores mapped to this single IAA device. In this scenario, the task
blocked errors can occur because idxd_alloc_desc() is called with
IDXD_OP_BLOCK. Any process that is able to obtain IAA_CRYPTO_MAX_BATCH_SIZE
(8UL) descriptors will cause contention for allocating descriptors for all
other processes. Under IDXD_OP_BLOCK, this can cause compress/decompress
jobs to stall in stress test scenarios (e.g. zswap_store() of 2M folios).

To make the iaa_crypto driver more fail-safe, this commit implements the
following:

1) Change compress/decompress descriptor allocations to be non-blocking
   with retries ("timeouts").
2) Return compress error to zswap if descriptor allocation with timeouts
   fails during compress ops. zswap_store() will return an error and the
   folio gets stored in the backing swap device.
3) Fallback to software decompress if descriptor allocation with timeouts
   fails during decompress ops.
4) Bug fixes for freeing the descriptor consistently in all error cases.

With these fixes, no task blocked errors are seen under stress testing
conditions, and no performance degradation is observed.
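
As a minimal userspace sketch of the bounded non-blocking retry pattern in
item 1 above (not the driver code; try_alloc() and ALLOC_RETRY_LIMIT are
stand-ins for idxd_alloc_desc(wq, IDXD_OP_NONBLOCK) and the
IAA_ALLOC_DESC_*_TIMEOUT limits added by this patch):

  #include <errno.h>
  #include <stdio.h>

  #define ALLOC_RETRY_LIMIT 1000

  static int try_alloc(void)
  {
          static int attempts;

          /* Pretend the descriptor pool is contended for 3 attempts. */
          return (++attempts <= 3) ? -EAGAIN : 0;
  }

  int main(void)
  {
          int retries = 0, ret = -EAGAIN;

          while (ret == -EAGAIN && retries++ < ALLOC_RETRY_LIMIT)
                  ret = try_alloc();  /* the driver also calls cpu_relax() */

          if (ret)
                  printf("allocation failed (%d): return error or fall back\n",
                         ret);
          else
                  printf("allocated after %d attempts\n", retries);
          return 0;
  }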

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto.h      |  3 +
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 81 +++++++++++++---------
 2 files changed, 51 insertions(+), 33 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index c46c70ecf355..b175e9065025 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -21,6 +21,9 @@
 
 #define IAA_COMPLETION_TIMEOUT		1000000
 
+#define IAA_ALLOC_DESC_COMP_TIMEOUT	   1000
+#define IAA_ALLOC_DESC_DECOMP_TIMEOUT	    500
+
 #define IAA_ANALYTICS_ERROR		0x0a
 #define IAA_ERROR_DECOMP_BUF_OVERFLOW	0x0b
 #define IAA_ERROR_COMP_BUF_OVERFLOW	0x19
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 4ca9028d6050..5292d8f7ebd6 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1406,6 +1406,7 @@ static int deflate_generic_decompress(struct acomp_req *req)
 	void *src, *dst;
 	int ret;
 
+	req->dlen = PAGE_SIZE;
 	src = kmap_local_page(sg_page(req->src)) + req->src->offset;
 	dst = kmap_local_page(sg_page(req->dst)) + req->dst->offset;
 
@@ -1469,7 +1470,8 @@ static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
 	struct iaa_device_compression_mode *active_compression_mode;
 	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
 	struct iaa_device *iaa_device;
-	struct idxd_desc *idxd_desc;
+	struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
+	int alloc_desc_retries = 0;
 	struct iax_hw_desc *desc;
 	struct idxd_device *idxd;
 	struct iaa_wq *iaa_wq;
@@ -1485,7 +1487,11 @@ static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
 
 	active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
 
-	idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
+	while ((idxd_desc == ERR_PTR(-EAGAIN)) && (alloc_desc_retries++ < IAA_ALLOC_DESC_DECOMP_TIMEOUT)) {
+		idxd_desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
+		cpu_relax();
+	}
+
 	if (IS_ERR(idxd_desc)) {
 		dev_dbg(dev, "idxd descriptor allocation failed\n");
 		dev_dbg(dev, "iaa compress failed: ret=%ld\n",
@@ -1661,7 +1667,8 @@ static int iaa_compress(struct crypto_tfm *tfm,	struct acomp_req *req,
 	struct iaa_device_compression_mode *active_compression_mode;
 	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
 	struct iaa_device *iaa_device;
-	struct idxd_desc *idxd_desc;
+	struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
+	int alloc_desc_retries = 0;
 	struct iax_hw_desc *desc;
 	struct idxd_device *idxd;
 	struct iaa_wq *iaa_wq;
@@ -1677,7 +1684,11 @@ static int iaa_compress(struct crypto_tfm *tfm,	struct acomp_req *req,
 
 	active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
 
-	idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
+	while ((idxd_desc == ERR_PTR(-EAGAIN)) && (alloc_desc_retries++ < IAA_ALLOC_DESC_COMP_TIMEOUT)) {
+		idxd_desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
+		cpu_relax();
+	}
+
 	if (IS_ERR(idxd_desc)) {
 		dev_dbg(dev, "idxd descriptor allocation failed\n");
 		dev_dbg(dev, "iaa compress failed: ret=%ld\n", PTR_ERR(idxd_desc));
@@ -1753,15 +1764,10 @@ static int iaa_compress(struct crypto_tfm *tfm,	struct acomp_req *req,
 
 	*compression_crc = idxd_desc->iax_completion->crc;
 
-	if (!ctx->async_mode || disable_async)
-		idxd_free_desc(wq, idxd_desc);
-out:
-	return ret;
 err:
 	idxd_free_desc(wq, idxd_desc);
-	dev_dbg(dev, "iaa compress failed: ret=%d\n", ret);
-
-	goto out;
+out:
+	return ret;
 }
 
 static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
@@ -1773,7 +1779,8 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 	struct iaa_device_compression_mode *active_compression_mode;
 	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
 	struct iaa_device *iaa_device;
-	struct idxd_desc *idxd_desc;
+	struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
+	int alloc_desc_retries = 0;
 	struct iax_hw_desc *desc;
 	struct idxd_device *idxd;
 	struct iaa_wq *iaa_wq;
@@ -1789,12 +1796,18 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 
 	active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
 
-	idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
+	while ((idxd_desc == ERR_PTR(-EAGAIN)) && (alloc_desc_retries++ < IAA_ALLOC_DESC_DECOMP_TIMEOUT)) {
+		idxd_desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
+		cpu_relax();
+	}
+
 	if (IS_ERR(idxd_desc)) {
 		dev_dbg(dev, "idxd descriptor allocation failed\n");
 		dev_dbg(dev, "iaa decompress failed: ret=%ld\n",
 			PTR_ERR(idxd_desc));
-		return PTR_ERR(idxd_desc);
+		ret = PTR_ERR(idxd_desc);
+		idxd_desc = NULL;
+		goto fallback_software_decomp;
 	}
 	desc = idxd_desc->iax_hw;
 
@@ -1837,7 +1850,7 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 	ret = idxd_submit_desc(wq, idxd_desc);
 	if (ret) {
 		dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
-		goto err;
+		goto fallback_software_decomp;
 	}
 
 	/* Update stats */
@@ -1851,19 +1864,20 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 	}
 
 	ret = check_completion(dev, idxd_desc->iax_completion, false, false);
+
+fallback_software_decomp:
 	if (ret) {
-		dev_dbg(dev, "%s: check_completion failed ret=%d\n", __func__, ret);
-		if (idxd_desc->iax_completion->status == IAA_ANALYTICS_ERROR) {
+		dev_dbg(dev, "%s: desc allocation/submission/check_completion failed ret=%d\n", __func__, ret);
+		if (idxd_desc && idxd_desc->iax_completion->status == IAA_ANALYTICS_ERROR) {
 			pr_warn("%s: falling back to deflate-generic decompress, "
 				"analytics error code %x\n", __func__,
 				idxd_desc->iax_completion->error_code);
-			ret = deflate_generic_decompress(req);
-			if (ret) {
-				dev_dbg(dev, "%s: deflate-generic failed ret=%d\n",
-					__func__, ret);
-				goto err;
-			}
-		} else {
+		}
+
+		ret = deflate_generic_decompress(req);
+
+		if (ret) {
+			pr_err("%s: iaa decompress failed: fallback to deflate-generic software decompress error ret=%d\n", __func__, ret);
 			goto err;
 		}
 	} else {
@@ -1872,19 +1886,15 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 
 	*dlen = req->dlen;
 
-	if (!ctx->async_mode || disable_async)
-		idxd_free_desc(wq, idxd_desc);
-
 	/* Update stats */
 	update_total_decomp_bytes_in(slen);
 	update_wq_decomp_bytes(wq, slen);
+
+err:
+	if (idxd_desc)
+		idxd_free_desc(wq, idxd_desc);
 out:
 	return ret;
-err:
-	idxd_free_desc(wq, idxd_desc);
-	dev_dbg(dev, "iaa decompress failed: ret=%d\n", ret);
-
-	goto out;
 }
 
 static int iaa_comp_acompress(struct acomp_req *req)
@@ -2375,6 +2385,7 @@ static bool iaa_comp_acompress_batch(
 			if (errors[i] != -EINPROGRESS) {
 				errors[i] = -EINVAL;
 				err = -EINVAL;
+				pr_debug("%s desc was not submitted\n", __func__);
 			} else {
 				errors[i] = -EAGAIN;
 			}
@@ -2521,9 +2532,13 @@ static bool iaa_comp_adecompress_batch(
 		} else {
 			errors[i] = iaa_comp_adecompress(reqs[i]);
 
+			/*
+			 * If it failed desc allocation/submission, errors[i] can
+			 * be 0 or error value from software decompress.
+			 */
 			if (errors[i] != -EINPROGRESS) {
-				errors[i] = -EINVAL;
 				err = -EINVAL;
+				pr_debug("%s desc was not submitted\n", __func__);
 			} else {
 				errors[i] = -EAGAIN;
 			}
-- 
2.27.0



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v6 11/16] crypto: iaa - Fix for "deflate_generic_tfm" global being accessed without locks.
  2025-02-06  7:20 [PATCH v6 00/16] zswap IAA compress batching Kanchana P Sridhar
                   ` (9 preceding siblings ...)
  2025-02-06  7:20 ` [PATCH v6 10/16] crypto: iaa - Descriptor allocation timeouts with mitigations in iaa_crypto Kanchana P Sridhar
@ 2025-02-06  7:20 ` Kanchana P Sridhar
  2025-02-06  7:20 ` [PATCH v6 12/16] mm: zswap: Allocate pool batching resources if the compressor supports batching Kanchana P Sridhar
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-02-06  7:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

The mainline implementation of "deflate_generic_decompress" has a bug in
the usage of this global variable:

static struct crypto_comp *deflate_generic_tfm;

The "deflate_generic_tfm" is allocated at module init time, and freed
during module cleanup. Any calls to software decompress, for instance, if
descriptor allocation fails or job submission fails, will trigger this bug
in the deflate_generic_decompress() procedure. The problem is the
unprotected access of "deflate_generic_tfm" in this procedure. While
stress testing workloads under high memory pressure, with 1 IAA device
and "deflate-iaa" as the compressor, the descriptor allocation times out
and the software fallback route is taken. With multiple processes calling:

        ret = crypto_comp_decompress(deflate_generic_tfm,
                                     src, req->slen, dst, &req->dlen);

we end up with data corruption that results in req->dlen being larger
than PAGE_SIZE. zswap_decompress() subsequently trips a kernel BUG_ON().

This bug manifests with high likelihood under high contention and memory
pressure. It has been resolved by adding a mutex, which is locked before
accessing "deflate_generic_tfm" and unlocked after the crypto_comp call
completes.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 5292d8f7ebd6..6796c783dd16 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -105,6 +105,7 @@ static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
 
 LIST_HEAD(iaa_devices);
 DEFINE_MUTEX(iaa_devices_lock);
+DEFINE_MUTEX(deflate_generic_tfm_lock);
 
 /* If enabled, IAA hw crypto algos are registered, unavailable otherwise */
 static bool iaa_crypto_enabled;
@@ -1407,6 +1408,9 @@ static int deflate_generic_decompress(struct acomp_req *req)
 	int ret;
 
 	req->dlen = PAGE_SIZE;
+
+	mutex_lock(&deflate_generic_tfm_lock);
+
 	src = kmap_local_page(sg_page(req->src)) + req->src->offset;
 	dst = kmap_local_page(sg_page(req->dst)) + req->dst->offset;
 
@@ -1416,6 +1420,8 @@ static int deflate_generic_decompress(struct acomp_req *req)
 	kunmap_local(src);
 	kunmap_local(dst);
 
+	mutex_unlock(&deflate_generic_tfm_lock);
+
 	update_total_sw_decomp_calls();
 
 	return ret;
-- 
2.27.0



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v6 12/16] mm: zswap: Allocate pool batching resources if the compressor supports batching.
  2025-02-06  7:20 [PATCH v6 00/16] zswap IAA compress batching Kanchana P Sridhar
                   ` (10 preceding siblings ...)
  2025-02-06  7:20 ` [PATCH v6 11/16] crypto: iaa - Fix for "deflate_generic_tfm" global being accessed without locks Kanchana P Sridhar
@ 2025-02-06  7:20 ` Kanchana P Sridhar
  2025-02-06 18:55   ` Yosry Ahmed
  2025-02-06  7:20 ` [PATCH v6 13/16] mm: zswap: Restructure & simplify zswap_store() to make it amenable for batching Kanchana P Sridhar
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-02-06  7:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch adds support for the per-CPU acomp_ctx to track multiple
compression/decompression requests. The zswap_cpu_comp_prepare() cpu
onlining code will check if the compressor supports batching. If so, it
will allocate the necessary batching resources.

However, zswap does not use more than one request yet. Follow-up patches
will actually utilize the multiple acomp_ctx requests/buffers for batch
compression/decompression of multiple pages.

The newly added ZSWAP_MAX_BATCH_SIZE limits the amount of extra memory used
for batching. There is no extra memory usage for compressors that do not
support batching.
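
As a back-of-the-envelope sketch of this sizing rule (not zswap code; the
4 KiB page size and a compressor-advertised batch size of 8 are assumptions
for illustration, and the 2-page dst buffer per request matches
zswap_cpu_comp_prepare() below):

  #include <stdio.h>

  #define ZSWAP_MAX_BATCH_SIZE 8U
  #define PAGE_SIZE 4096U

  static unsigned int nr_reqs(unsigned int compressor_batch_size)
  {
          /* Non-batching compressors keep the single request/buffer. */
          if (compressor_batch_size <= 1)
                  return 1;
          return (compressor_batch_size < ZSWAP_MAX_BATCH_SIZE) ?
                  compressor_batch_size : ZSWAP_MAX_BATCH_SIZE;
  }

  int main(void)
  {
          unsigned int n = nr_reqs(8);

          printf("requests per cpu: %u\n", n);
          printf("dst buffer memory per cpu: %u KiB\n",
                 n * 2 * PAGE_SIZE / 1024);    /* 8 x 8 KiB = 64 KiB */
          return 0;
  }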

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 132 +++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 98 insertions(+), 34 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index a2baceed3bf9..dc7d1ff04b22 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -78,6 +78,16 @@ static bool zswap_pool_reached_full;
 
 #define ZSWAP_PARAM_UNSET ""
 
+/*
+ * For compression batching of large folios:
+ * Maximum number of acomp compress requests that will be processed
+ * in a batch, iff the zswap compressor supports batching.
+ * This limit exists because we preallocate enough requests and buffers
+ * in the per-cpu acomp_ctx accordingly. Hence, a higher limit means higher
+ * memory usage.
+ */
+#define ZSWAP_MAX_BATCH_SIZE 8U
+
 static int zswap_setup(void);
 
 /* Enable/disable zswap */
@@ -143,9 +153,10 @@ bool zswap_never_enabled(void)
 
 struct crypto_acomp_ctx {
 	struct crypto_acomp *acomp;
-	struct acomp_req *req;
+	struct acomp_req **reqs;
+	u8 **buffers;
+	unsigned int nr_reqs;
 	struct crypto_wait wait;
-	u8 *buffer;
 	struct mutex mutex;
 	bool is_sleepable;
 };
@@ -821,15 +832,13 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
 	struct crypto_acomp *acomp = NULL;
-	struct acomp_req *req = NULL;
-	u8 *buffer = NULL;
-	int ret;
+	unsigned int nr_reqs = 1;
+	int ret = -ENOMEM;
+	int i;
 
-	buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
-	if (!buffer) {
-		ret = -ENOMEM;
-		goto fail;
-	}
+	acomp_ctx->buffers = NULL;
+	acomp_ctx->reqs = NULL;
+	acomp_ctx->nr_reqs = 0;
 
 	acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
 	if (IS_ERR(acomp)) {
@@ -839,12 +848,30 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 		goto fail;
 	}
 
-	req = acomp_request_alloc(acomp);
-	if (!req) {
-		pr_err("could not alloc crypto acomp_request %s\n",
-		       pool->tfm_name);
-		ret = -ENOMEM;
+	if (acomp_has_async_batching(acomp))
+		nr_reqs = min(ZSWAP_MAX_BATCH_SIZE, crypto_acomp_batch_size(acomp));
+
+	acomp_ctx->buffers = kcalloc_node(nr_reqs, sizeof(u8 *), GFP_KERNEL, cpu_to_node(cpu));
+	if (!acomp_ctx->buffers)
+		goto fail;
+
+	for (i = 0; i < nr_reqs; ++i) {
+		acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
+		if (!acomp_ctx->buffers[i])
+			goto fail;
+	}
+
+	acomp_ctx->reqs = kcalloc_node(nr_reqs, sizeof(struct acomp_req *), GFP_KERNEL, cpu_to_node(cpu));
+	if (!acomp_ctx->reqs)
 		goto fail;
+
+	for (i = 0; i < nr_reqs; ++i) {
+		acomp_ctx->reqs[i] = acomp_request_alloc(acomp);
+		if (!acomp_ctx->reqs[i]) {
+			pr_err("could not alloc crypto acomp_request reqs[%d] %s\n",
+			       i, pool->tfm_name);
+			goto fail;
+		}
 	}
 
 	/*
@@ -853,6 +880,13 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	 * again resulting in a deadlock.
 	 */
 	mutex_lock(&acomp_ctx->mutex);
+
+	/*
+	 * The crypto_wait is used only in fully synchronous, i.e., with scomp
+	 * or non-poll mode of acomp, hence there is only one "wait" per
+	 * acomp_ctx, with callback set to reqs[0], under the assumption that
+	 * there is at least 1 request per acomp_ctx.
+	 */
 	crypto_init_wait(&acomp_ctx->wait);
 
 	/*
@@ -860,20 +894,33 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	 * crypto_wait_req(); if the backend of acomp is scomp, the callback
 	 * won't be called, crypto_wait_req() will return without blocking.
 	 */
-	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+	acomp_request_set_callback(acomp_ctx->reqs[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
 				   crypto_req_done, &acomp_ctx->wait);
 
-	acomp_ctx->buffer = buffer;
+	acomp_ctx->nr_reqs = nr_reqs;
 	acomp_ctx->acomp = acomp;
 	acomp_ctx->is_sleepable = acomp_is_async(acomp);
-	acomp_ctx->req = req;
 	mutex_unlock(&acomp_ctx->mutex);
 	return 0;
 
 fail:
+	if (acomp_ctx->buffers) {
+		for (i = 0; i < nr_reqs; ++i)
+			kfree(acomp_ctx->buffers[i]);
+		kfree(acomp_ctx->buffers);
+		acomp_ctx->buffers = NULL;
+	}
+
+	if (acomp_ctx->reqs) {
+		for (i = 0; i < nr_reqs; ++i)
+			if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
+				acomp_request_free(acomp_ctx->reqs[i]);
+		kfree(acomp_ctx->reqs);
+		acomp_ctx->reqs = NULL;
+	}
+
 	if (acomp)
 		crypto_free_acomp(acomp);
-	kfree(buffer);
 	return ret;
 }
 
@@ -883,14 +930,31 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
 
 	mutex_lock(&acomp_ctx->mutex);
+
 	if (!IS_ERR_OR_NULL(acomp_ctx)) {
-		if (!IS_ERR_OR_NULL(acomp_ctx->req))
-			acomp_request_free(acomp_ctx->req);
-		acomp_ctx->req = NULL;
+		int i;
+
+		if (acomp_ctx->reqs) {
+			for (i = 0; i < acomp_ctx->nr_reqs; ++i)
+				if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
+					acomp_request_free(acomp_ctx->reqs[i]);
+			kfree(acomp_ctx->reqs);
+			acomp_ctx->reqs = NULL;
+		}
+
+		if (acomp_ctx->buffers) {
+			for (i = 0; i < acomp_ctx->nr_reqs; ++i)
+				kfree(acomp_ctx->buffers[i]);
+			kfree(acomp_ctx->buffers);
+			acomp_ctx->buffers = NULL;
+		}
+
 		if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
 			crypto_free_acomp(acomp_ctx->acomp);
-		kfree(acomp_ctx->buffer);
+
+		acomp_ctx->nr_reqs = 0;
 	}
+
 	mutex_unlock(&acomp_ctx->mutex);
 
 	return 0;
@@ -903,7 +967,7 @@ static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
 	for (;;) {
 		acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
 		mutex_lock(&acomp_ctx->mutex);
-		if (likely(acomp_ctx->req))
+		if (likely(acomp_ctx->reqs))
 			return acomp_ctx;
 		/*
 		 * It is possible that we were migrated to a different CPU after
@@ -935,7 +999,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	u8 *dst;
 
 	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
-	dst = acomp_ctx->buffer;
+	dst = acomp_ctx->buffers[0];
 	sg_init_table(&input, 1);
 	sg_set_page(&input, page, PAGE_SIZE, 0);
 
@@ -945,7 +1009,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	 * giving the dst buffer with enough length to avoid buffer overflow.
 	 */
 	sg_init_one(&output, dst, PAGE_SIZE * 2);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
+	acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, PAGE_SIZE, dlen);
 
 	/*
 	 * it maybe looks a little bit silly that we send an asynchronous request,
@@ -959,8 +1023,8 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	 * but in different threads running on different cpu, we have different
 	 * acomp instance, so multiple threads can do (de)compression in parallel.
 	 */
-	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
-	dlen = acomp_ctx->req->dlen;
+	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
+	dlen = acomp_ctx->reqs[0]->dlen;
 	if (comp_ret)
 		goto unlock;
 
@@ -1011,19 +1075,19 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	 */
 	if ((acomp_ctx->is_sleepable && !zpool_can_sleep_mapped(zpool)) ||
 	    !virt_addr_valid(src)) {
-		memcpy(acomp_ctx->buffer, src, entry->length);
-		src = acomp_ctx->buffer;
+		memcpy(acomp_ctx->buffers[0], src, entry->length);
+		src = acomp_ctx->buffers[0];
 		zpool_unmap_handle(zpool, entry->handle);
 	}
 
 	sg_init_one(&input, src, entry->length);
 	sg_init_table(&output, 1);
 	sg_set_folio(&output, folio, PAGE_SIZE, 0);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
-	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
-	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
+	acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, entry->length, PAGE_SIZE);
+	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->reqs[0]), &acomp_ctx->wait));
+	BUG_ON(acomp_ctx->reqs[0]->dlen != PAGE_SIZE);
 
-	if (src != acomp_ctx->buffer)
+	if (src != acomp_ctx->buffers[0])
 		zpool_unmap_handle(zpool, entry->handle);
 	acomp_ctx_put_unlock(acomp_ctx);
 }
-- 
2.27.0



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v6 13/16] mm: zswap: Restructure & simplify zswap_store() to make it amenable for batching.
  2025-02-06  7:20 [PATCH v6 00/16] zswap IAA compress batching Kanchana P Sridhar
                   ` (11 preceding siblings ...)
  2025-02-06  7:20 ` [PATCH v6 12/16] mm: zswap: Allocate pool batching resources if the compressor supports batching Kanchana P Sridhar
@ 2025-02-06  7:20 ` Kanchana P Sridhar
  2025-02-06  7:21 ` [PATCH v6 14/16] mm: zswap: Introduce zswap_compress_folio() to compress all pages in a folio Kanchana P Sridhar
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-02-06  7:20 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch introduces zswap_store_folio(), which performs, for all the pages
in a folio, the work that zswap_store_page() previously did for a single
page. This allows us to move the loop over the folio's pages from
zswap_store() to zswap_store_folio().

zswap_store_folio() starts by allocating all zswap entries required to
store the folio. Next, it calls zswap_compress() for all pages in the
folio. Finally, it adds the entries to the xarray and LRU, charges zswap
memory and increments zswap stats.

The error handling and cleanup required for all failure scenarios that can
occur while storing a folio in zswap are now consolidated to a
"store_folio_failed" label in zswap_store_folio().

These changes facilitate developing support for compress batching in
zswap_store_folio().

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 164 ++++++++++++++++++++++++++++++++---------------------
 1 file changed, 98 insertions(+), 66 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index dc7d1ff04b22..af682bf0f690 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1509,81 +1509,116 @@ static void shrink_worker(struct work_struct *w)
 * main API
 **********************************/
 
-static bool zswap_store_page(struct page *page,
-			     struct obj_cgroup *objcg,
-			     struct zswap_pool *pool)
+/*
+ * Store all pages in a folio.
+ *
+ * The error handling from all failure points is consolidated to the
+ * "store_folio_failed" label, based on the initialization of the zswap entries'
+ * handles to ERR_PTR(-EINVAL) at allocation time, and the fact that the
+ * entry's handle is subsequently modified only upon a successful zpool_malloc()
+ * after the page is compressed.
+ */
+static bool zswap_store_folio(struct folio *folio,
+			       struct obj_cgroup *objcg,
+			       struct zswap_pool *pool)
 {
-	swp_entry_t page_swpentry = page_swap_entry(page);
-	struct zswap_entry *entry, *old;
+	long index, from_index = 0, nr_pages = folio_nr_pages(folio);
+	struct zswap_entry **entries = NULL;
+	int node_id = folio_nid(folio);
 
-	/* allocate entry */
-	entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
-	if (!entry) {
-		zswap_reject_kmemcache_fail++;
+	entries = kmalloc(nr_pages * sizeof(*entries), GFP_KERNEL);
+	if (!entries)
 		return false;
-	}
 
-	if (!zswap_compress(page, entry, pool))
-		goto compress_failed;
+	for (index = 0; index < nr_pages; ++index) {
+		entries[index] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
 
-	old = xa_store(swap_zswap_tree(page_swpentry),
-		       swp_offset(page_swpentry),
-		       entry, GFP_KERNEL);
-	if (xa_is_err(old)) {
-		int err = xa_err(old);
+		if (!entries[index]) {
+			zswap_reject_kmemcache_fail++;
+			nr_pages = index;
+			goto store_folio_failed;
+		}
 
-		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
-		zswap_reject_alloc_fail++;
-		goto store_failed;
+		entries[index]->handle = (unsigned long)ERR_PTR(-EINVAL);
 	}
 
-	/*
-	 * We may have had an existing entry that became stale when
-	 * the folio was redirtied and now the new version is being
-	 * swapped out. Get rid of the old.
-	 */
-	if (old)
-		zswap_entry_free(old);
+	for (index = 0; index < nr_pages; ++index) {
+		struct page *page = folio_page(folio, index);
 
-	/*
-	 * The entry is successfully compressed and stored in the tree, there is
-	 * no further possibility of failure. Grab refs to the pool and objcg,
-	 * charge zswap memory, and increment zswap_stored_pages.
-	 * The opposite actions will be performed by zswap_entry_free()
-	 * when the entry is removed from the tree.
-	 */
-	zswap_pool_get(pool);
-	if (objcg) {
-		obj_cgroup_get(objcg);
-		obj_cgroup_charge_zswap(objcg, entry->length);
+		if (!zswap_compress(page, entries[index], pool))
+			goto store_folio_failed;
 	}
-	atomic_long_inc(&zswap_stored_pages);
 
-	/*
-	 * We finish initializing the entry while it's already in xarray.
-	 * This is safe because:
-	 *
-	 * 1. Concurrent stores and invalidations are excluded by folio lock.
-	 *
-	 * 2. Writeback is excluded by the entry not being on the LRU yet.
-	 *    The publishing order matters to prevent writeback from seeing
-	 *    an incoherent entry.
-	 */
-	entry->pool = pool;
-	entry->swpentry = page_swpentry;
-	entry->objcg = objcg;
-	entry->referenced = true;
-	if (entry->length) {
-		INIT_LIST_HEAD(&entry->lru);
-		zswap_lru_add(&zswap_list_lru, entry);
+	for (index = 0; index < nr_pages; ++index) {
+		swp_entry_t page_swpentry = page_swap_entry(folio_page(folio, index));
+		struct zswap_entry *old, *entry = entries[index];
+
+		old = xa_store(swap_zswap_tree(page_swpentry),
+			       swp_offset(page_swpentry),
+			       entry, GFP_KERNEL);
+		if (xa_is_err(old)) {
+			int err = xa_err(old);
+
+			WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
+			zswap_reject_alloc_fail++;
+			from_index = index;
+			goto store_folio_failed;
+		}
+
+		/*
+		 * We may have had an existing entry that became stale when
+		 * the folio was redirtied and now the new version is being
+		 * swapped out. Get rid of the old.
+		 */
+		if (old)
+			zswap_entry_free(old);
+
+		/*
+		 * The entry is successfully compressed and stored in the tree, there is
+		 * no further possibility of failure. Grab refs to the pool and objcg,
+		 * charge zswap memory, and increment zswap_stored_pages.
+		 * The opposite actions will be performed by zswap_entry_free()
+		 * when the entry is removed from the tree.
+		 */
+		zswap_pool_get(pool);
+		if (objcg) {
+			obj_cgroup_get(objcg);
+			obj_cgroup_charge_zswap(objcg, entry->length);
+		}
+		atomic_long_inc(&zswap_stored_pages);
+
+		/*
+		 * We finish initializing the entry while it's already in xarray.
+		 * This is safe because:
+		 *
+		 * 1. Concurrent stores and invalidations are excluded by folio lock.
+		 *
+		 * 2. Writeback is excluded by the entry not being on the LRU yet.
+		 *    The publishing order matters to prevent writeback from seeing
+		 *    an incoherent entry.
+		 */
+		entry->pool = pool;
+		entry->swpentry = page_swpentry;
+		entry->objcg = objcg;
+		entry->referenced = true;
+		if (entry->length) {
+			INIT_LIST_HEAD(&entry->lru);
+			zswap_lru_add(&zswap_list_lru, entry);
+		}
 	}
 
+	kfree(entries);
 	return true;
 
-store_failed:
-	zpool_free(pool->zpool, entry->handle);
-compress_failed:
-	zswap_entry_cache_free(entry);
+store_folio_failed:
+	for (index = from_index; index < nr_pages; ++index) {
+		if (!IS_ERR_VALUE(entries[index]->handle))
+			zpool_free(pool->zpool, entries[index]->handle);
+
+		zswap_entry_cache_free(entries[index]);
+	}
+
+	kfree(entries);
 	return false;
 }
 
@@ -1595,7 +1630,6 @@ bool zswap_store(struct folio *folio)
 	struct mem_cgroup *memcg = NULL;
 	struct zswap_pool *pool;
 	bool ret = false;
-	long index;
 
 	VM_WARN_ON_ONCE(!folio_test_locked(folio));
 	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
@@ -1629,12 +1663,9 @@ bool zswap_store(struct folio *folio)
 		mem_cgroup_put(memcg);
 	}
 
-	for (index = 0; index < nr_pages; ++index) {
-		struct page *page = folio_page(folio, index);
+	if (!zswap_store_folio(folio, objcg, pool))
+		goto put_pool;
 
-		if (!zswap_store_page(page, objcg, pool))
-			goto put_pool;
-	}
 
 	if (objcg)
 		count_objcg_events(objcg, ZSWPOUT, nr_pages);
@@ -1661,6 +1692,7 @@ bool zswap_store(struct folio *folio)
 		pgoff_t offset = swp_offset(swp);
 		struct zswap_entry *entry;
 		struct xarray *tree;
+		long index;
 
 		for (index = 0; index < nr_pages; ++index) {
 			tree = swap_zswap_tree(swp_entry(type, offset + index));
-- 
2.27.0



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v6 14/16] mm: zswap: Introduce zswap_compress_folio() to compress all pages in a folio.
  2025-02-06  7:20 [PATCH v6 00/16] zswap IAA compress batching Kanchana P Sridhar
                   ` (12 preceding siblings ...)
  2025-02-06  7:20 ` [PATCH v6 13/16] mm: zswap: Restructure & simplify zswap_store() to make it amenable for batching Kanchana P Sridhar
@ 2025-02-06  7:21 ` Kanchana P Sridhar
  2025-02-06  7:21 ` [PATCH v6 15/16] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios Kanchana P Sridhar
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-02-06  7:21 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch encapsulates the work of compressing the pages in a folio and
storing the compressed memory in the zpool into a distinct
zswap_compress_folio() procedure.

For now, zswap_compress_folio() simply calls zswap_compress() for each page
in the folio it is called with.

This facilitates adding compress batching in subsequent patches.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index af682bf0f690..6563d12e907b 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1509,6 +1509,22 @@ static void shrink_worker(struct work_struct *w)
 * main API
 **********************************/
 
+static bool zswap_compress_folio(struct folio *folio,
+				 struct zswap_entry *entries[],
+				 struct zswap_pool *pool)
+{
+	long index, nr_pages = folio_nr_pages(folio);
+
+	for (index = 0; index < nr_pages; ++index) {
+		struct page *page = folio_page(folio, index);
+
+		if (!zswap_compress(page, entries[index], pool))
+			return false;
+	}
+
+	return true;
+}
+
 /*
  * Store all pages in a folio.
  *
@@ -1542,12 +1558,8 @@ static bool zswap_store_folio(struct folio *folio,
 		entries[index]->handle = (unsigned long)ERR_PTR(-EINVAL);
 	}
 
-	for (index = 0; index < nr_pages; ++index) {
-		struct page *page = folio_page(folio, index);
-
-		if (!zswap_compress(page, entries[index], pool))
-			goto store_folio_failed;
-	}
+	if (!zswap_compress_folio(folio, entries, pool))
+		goto store_folio_failed;
 
 	for (index = 0; index < nr_pages; ++index) {
 		swp_entry_t page_swpentry = page_swap_entry(folio_page(folio, index));
-- 
2.27.0



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v6 15/16] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios.
  2025-02-06  7:20 [PATCH v6 00/16] zswap IAA compress batching Kanchana P Sridhar
                   ` (13 preceding siblings ...)
  2025-02-06  7:21 ` [PATCH v6 14/16] mm: zswap: Introduce zswap_compress_folio() to compress all pages in a folio Kanchana P Sridhar
@ 2025-02-06  7:21 ` Kanchana P Sridhar
  2025-02-06 19:10   ` Yosry Ahmed
  2025-02-06  7:21 ` [PATCH v6 16/16] mm: zswap: Fix for zstd performance regression with 2M folios Kanchana P Sridhar
  2025-02-11 17:05 ` [PATCH v6 00/16] zswap IAA compress batching Eric Biggers
  16 siblings, 1 reply; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-02-06  7:21 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

zswap_compress_folio() is modified to detect whether the pool's acomp_ctx
has "nr_reqs" greater than one, which will be the case if the CPU onlining
code has allocated multiple batching resources in the acomp_ctx. If so,
compress batching can be used with a batch size of "acomp_ctx->nr_reqs".

If compress batching can be used, zswap_compress_folio() will invoke the
newly added zswap_batch_compress() procedure to compress and store the
folio in batches of "acomp_ctx->nr_reqs" pages.

With Intel IAA, the iaa_crypto driver will compress each batch of pages in
parallel in hardware.

Hence, zswap_batch_compress() performs the same work for a batch that
zswap_compress() does for a single page, and returns true if the batch was
successfully compressed and stored, and false otherwise.

If the pool does not support compress batching, or the folio has only one
page, zswap_compress_folio() calls zswap_compress() for each individual
page in the folio, as before.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 122 +++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 113 insertions(+), 9 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 6563d12e907b..f1cba77eda62 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -985,10 +985,11 @@ static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx)
 	mutex_unlock(&acomp_ctx->mutex);
 }
 
+/* The per-cpu @acomp_ctx mutex should be locked/unlocked in the caller. */
 static bool zswap_compress(struct page *page, struct zswap_entry *entry,
-			   struct zswap_pool *pool)
+			   struct zswap_pool *pool,
+			   struct crypto_acomp_ctx *acomp_ctx)
 {
-	struct crypto_acomp_ctx *acomp_ctx;
 	struct scatterlist input, output;
 	int comp_ret = 0, alloc_ret = 0;
 	unsigned int dlen = PAGE_SIZE;
@@ -998,7 +999,6 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	gfp_t gfp;
 	u8 *dst;
 
-	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
 	dst = acomp_ctx->buffers[0];
 	sg_init_table(&input, 1);
 	sg_set_page(&input, page, PAGE_SIZE, 0);
@@ -1051,7 +1051,6 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	else if (alloc_ret)
 		zswap_reject_alloc_fail++;
 
-	acomp_ctx_put_unlock(acomp_ctx);
 	return comp_ret == 0 && alloc_ret == 0;
 }
 
@@ -1509,20 +1508,125 @@ static void shrink_worker(struct work_struct *w)
 * main API
 **********************************/
 
+/* The per-cpu @acomp_ctx mutex should be locked/unlocked in the caller. */
+static bool zswap_batch_compress(struct folio *folio,
+				 long index,
+				 unsigned int batch_size,
+				 struct zswap_entry *entries[],
+				 struct zswap_pool *pool,
+				 struct crypto_acomp_ctx *acomp_ctx)
+{
+	int comp_errors[ZSWAP_MAX_BATCH_SIZE] = { 0 };
+	unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
+	struct page *pages[ZSWAP_MAX_BATCH_SIZE];
+	unsigned int i, nr_batch_pages;
+	bool ret = true;
+
+	nr_batch_pages = min((unsigned int)(folio_nr_pages(folio) - index), batch_size);
+
+	for (i = 0; i < nr_batch_pages; ++i) {
+		pages[i] = folio_page(folio, index + i);
+		dlens[i] = PAGE_SIZE;
+	}
+
+	/*
+	 * Batch compress @nr_batch_pages. If IAA is the compressor, the
+	 * hardware will compress @nr_batch_pages in parallel.
+	 */
+	ret = crypto_acomp_batch_compress(
+		acomp_ctx->reqs,
+		NULL,
+		pages,
+		acomp_ctx->buffers,
+		dlens,
+		comp_errors,
+		nr_batch_pages);
+
+	if (ret) {
+		/*
+		 * All batch pages were successfully compressed.
+		 * Store the pages in zpool.
+		 */
+		struct zpool *zpool = pool->zpool;
+		gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
+
+		if (zpool_malloc_support_movable(zpool))
+			gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
+
+		for (i = 0; i < nr_batch_pages; ++i) {
+			unsigned long handle;
+			char *buf;
+			int err;
+
+			err = zpool_malloc(zpool, dlens[i], gfp, &handle);
+
+			if (err) {
+				if (err == -ENOSPC)
+					zswap_reject_compress_poor++;
+				else
+					zswap_reject_alloc_fail++;
+
+				ret = false;
+				break;
+			}
+
+			buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
+			memcpy(buf, acomp_ctx->buffers[i], dlens[i]);
+			zpool_unmap_handle(zpool, handle);
+
+			entries[i]->handle = handle;
+			entries[i]->length = dlens[i];
+		}
+	} else {
+		/* Some batch pages had compression errors. */
+		for (i = 0; i < nr_batch_pages; ++i) {
+			if (comp_errors[i]) {
+				if (comp_errors[i] == -ENOSPC)
+					zswap_reject_compress_poor++;
+				else
+					zswap_reject_compress_fail++;
+			}
+		}
+	}
+
+	return ret;
+}
+
 static bool zswap_compress_folio(struct folio *folio,
 				 struct zswap_entry *entries[],
 				 struct zswap_pool *pool)
 {
 	long index, nr_pages = folio_nr_pages(folio);
+	struct crypto_acomp_ctx *acomp_ctx;
+	unsigned int batch_size;
+	bool ret = true;
 
-	for (index = 0; index < nr_pages; ++index) {
-		struct page *page = folio_page(folio, index);
+	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
+	batch_size = acomp_ctx->nr_reqs;
+
+	if ((batch_size > 1) && (nr_pages > 1)) {
+		for (index = 0; index < nr_pages; index += batch_size) {
+
+			if (!zswap_batch_compress(folio, index, batch_size,
+						  &entries[index], pool, acomp_ctx)) {
+				ret = false;
+				goto unlock_acomp_ctx;
+			}
+		}
+	} else {
+		for (index = 0; index < nr_pages; ++index) {
+			struct page *page = folio_page(folio, index);
 
-		if (!zswap_compress(page, entries[index], pool))
-			return false;
+			if (!zswap_compress(page, entries[index], pool, acomp_ctx)) {
+				ret = false;
+				goto unlock_acomp_ctx;
+			}
+		}
 	}
 
-	return true;
+unlock_acomp_ctx:
+	acomp_ctx_put_unlock(acomp_ctx);
+	return ret;
 }
 
 /*
-- 
2.27.0



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v6 16/16] mm: zswap: Fix for zstd performance regression with 2M folios.
  2025-02-06  7:20 [PATCH v6 00/16] zswap IAA compress batching Kanchana P Sridhar
                   ` (14 preceding siblings ...)
  2025-02-06  7:21 ` [PATCH v6 15/16] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios Kanchana P Sridhar
@ 2025-02-06  7:21 ` Kanchana P Sridhar
  2025-02-06 19:15   ` Yosry Ahmed
  2025-02-20 23:28   ` Nhat Pham
  2025-02-11 17:05 ` [PATCH v6 00/16] zswap IAA compress batching Eric Biggers
  16 siblings, 2 replies; 30+ messages in thread
From: Kanchana P Sridhar @ 2025-02-06  7:21 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

With the previous patch, which enables support for batch compression in
zswap_compress_folio(), a 6.2% throughput regression was seen with zstd and
2M folios, using vm-scalability/usemem.

For compressors that don't support batching, this was root-caused to the
following zswap_store_folio() structure:

 Batched stores:
 ---------------
 - Allocate all entries,
 - Compress all entries,
 - Store all entries in xarray/LRU.

Hence, the above structure is maintained only for batched stores, while the
following structure is implemented for sequential stores of large folio pages
(see the sketch after the list below); this fixes the zstd regression while
preserving common code paths for batched and sequential stores of a folio:

 Sequential stores:
 ------------------
 For each page in folio:
  - allocate an entry,
  - compress the page,
  - store the entry in xarray/LRU.
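As a rough illustration only (not the actual code; the real implementation is
in the diff below), the unified loop in zswap_store_folio() effectively
becomes:

  nr_pages = (batch_size > 1) ? nr_folio_pages : 1;

  while (1) {
          /* allocate entries for pages [from_index, nr_pages) */
          /* compress pages [from_index, nr_pages) */
          /* store entries [from_index, nr_pages) in the xarray/LRU */

          /*
           * Batched stores: nr_pages == nr_folio_pages, so the loop body
           * runs exactly once for the whole folio.
           * Sequential stores: nr_pages starts at 1 and advances by one,
           * so each iteration handles a single page.
           */
          from_index = nr_pages++;
          if (nr_pages > nr_folio_pages)
                  break;
  }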

This is submitted as a separate patch only for code review purposes. I will
squash this with the previous commit in subsequent versions of this
patch-series.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 193 ++++++++++++++++++++++++++++++-----------------------
 1 file changed, 111 insertions(+), 82 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index f1cba77eda62..7bfc720a6201 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1592,40 +1592,32 @@ static bool zswap_batch_compress(struct folio *folio,
 	return ret;
 }
 
-static bool zswap_compress_folio(struct folio *folio,
-				 struct zswap_entry *entries[],
-				 struct zswap_pool *pool)
+static __always_inline bool zswap_compress_folio(struct folio *folio,
+						 struct zswap_entry *entries[],
+						 long from_index,
+						 struct zswap_pool *pool,
+						 unsigned int batch_size,
+						 struct crypto_acomp_ctx *acomp_ctx)
 {
-	long index, nr_pages = folio_nr_pages(folio);
-	struct crypto_acomp_ctx *acomp_ctx;
-	unsigned int batch_size;
+	long index = from_index, nr_pages = folio_nr_pages(folio);
 	bool ret = true;
 
-	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
-	batch_size = acomp_ctx->nr_reqs;
-
 	if ((batch_size > 1) && (nr_pages > 1)) {
-		for (index = 0; index < nr_pages; index += batch_size) {
+		for (; index < nr_pages; index += batch_size) {
 
 			if (!zswap_batch_compress(folio, index, batch_size,
 						  &entries[index], pool, acomp_ctx)) {
 				ret = false;
-				goto unlock_acomp_ctx;
+				break;
 			}
 		}
 	} else {
-		for (index = 0; index < nr_pages; ++index) {
-			struct page *page = folio_page(folio, index);
+		struct page *page = folio_page(folio, index);
 
-			if (!zswap_compress(page, entries[index], pool, acomp_ctx)) {
-				ret = false;
-				goto unlock_acomp_ctx;
-			}
-		}
+		if (!zswap_compress(page, entries[index], pool, acomp_ctx))
+			ret = false;
 	}
 
-unlock_acomp_ctx:
-	acomp_ctx_put_unlock(acomp_ctx);
 	return ret;
 }
 
@@ -1637,92 +1629,128 @@ static bool zswap_compress_folio(struct folio *folio,
  * handles to ERR_PTR(-EINVAL) at allocation time, and the fact that the
  * entry's handle is subsequently modified only upon a successful zpool_malloc()
  * after the page is compressed.
+ *
+ * For compressors that don't support batching, the following structure
+ * showed a performance regression with zstd/2M folios:
+ *
+ * Batched stores:
+ * ---------------
+ *  - Allocate all entries,
+ *  - Compress all entries,
+ *  - Store all entries in xarray/LRU.
+ *
+ * Hence, the above structure is maintained only for batched stores, and the
+ * following structure is implemented for sequential stores of large folio pages,
+ * that fixes the regression, while preserving common code paths for batched
+ * and sequential stores of a folio:
+ *
+ * Sequential stores:
+ * ------------------
+ * For each page in folio:
+ *  - allocate an entry,
+ *  - compress the page,
+ *  - store the entry in xarray/LRU.
  */
 static bool zswap_store_folio(struct folio *folio,
 			       struct obj_cgroup *objcg,
 			       struct zswap_pool *pool)
 {
-	long index, from_index = 0, nr_pages = folio_nr_pages(folio);
+	long index = 0, from_index = 0, nr_pages, nr_folio_pages = folio_nr_pages(folio);
 	struct zswap_entry **entries = NULL;
+	struct crypto_acomp_ctx *acomp_ctx;
 	int node_id = folio_nid(folio);
+	unsigned int batch_size;
 
-	entries = kmalloc(nr_pages * sizeof(*entries), GFP_KERNEL);
+	entries = kmalloc(nr_folio_pages * sizeof(*entries), GFP_KERNEL);
 	if (!entries)
 		return false;
 
-	for (index = 0; index < nr_pages; ++index) {
-		entries[index] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
+	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
+	batch_size = acomp_ctx->nr_reqs;
 
-		if (!entries[index]) {
-			zswap_reject_kmemcache_fail++;
-			nr_pages = index;
-			goto store_folio_failed;
+	nr_pages = (batch_size > 1) ? nr_folio_pages : 1;
+
+	while (1) {
+		for (index = from_index; index < nr_pages; ++index) {
+			entries[index] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
+
+			if (!entries[index]) {
+				zswap_reject_kmemcache_fail++;
+				nr_pages = index;
+				goto store_folio_failed;
+			}
+
+			entries[index]->handle = (unsigned long)ERR_PTR(-EINVAL);
 		}
 
-		entries[index]->handle = (unsigned long)ERR_PTR(-EINVAL);
-	}
+		if (!zswap_compress_folio(folio, entries, from_index, pool, batch_size, acomp_ctx))
+			goto store_folio_failed;
 
-	if (!zswap_compress_folio(folio, entries, pool))
-		goto store_folio_failed;
+		for (index = from_index; index < nr_pages; ++index) {
+			swp_entry_t page_swpentry = page_swap_entry(folio_page(folio, index));
+			struct zswap_entry *old, *entry = entries[index];
 
-	for (index = 0; index < nr_pages; ++index) {
-		swp_entry_t page_swpentry = page_swap_entry(folio_page(folio, index));
-		struct zswap_entry *old, *entry = entries[index];
+			old = xa_store(swap_zswap_tree(page_swpentry),
+				swp_offset(page_swpentry),
+				entry, GFP_KERNEL);
+			if (xa_is_err(old)) {
+				int err = xa_err(old);
 
-		old = xa_store(swap_zswap_tree(page_swpentry),
-			       swp_offset(page_swpentry),
-			       entry, GFP_KERNEL);
-		if (xa_is_err(old)) {
-			int err = xa_err(old);
+				WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
+				zswap_reject_alloc_fail++;
+				from_index = index;
+				goto store_folio_failed;
+			}
 
-			WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
-			zswap_reject_alloc_fail++;
-			from_index = index;
-			goto store_folio_failed;
-		}
+			/*
+			 * We may have had an existing entry that became stale when
+			 * the folio was redirtied and now the new version is being
+			 * swapped out. Get rid of the old.
+			 */
+			if (old)
+				zswap_entry_free(old);
 
-		/*
-		 * We may have had an existing entry that became stale when
-		 * the folio was redirtied and now the new version is being
-		 * swapped out. Get rid of the old.
-		 */
-		if (old)
-			zswap_entry_free(old);
+			/*
+			 * The entry is successfully compressed and stored in the tree, there is
+			 * no further possibility of failure. Grab refs to the pool and objcg,
+			 * charge zswap memory, and increment zswap_stored_pages.
+			 * The opposite actions will be performed by zswap_entry_free()
+			 * when the entry is removed from the tree.
+			 */
+			zswap_pool_get(pool);
+			if (objcg) {
+				obj_cgroup_get(objcg);
+				obj_cgroup_charge_zswap(objcg, entry->length);
+			}
+			atomic_long_inc(&zswap_stored_pages);
 
-		/*
-		 * The entry is successfully compressed and stored in the tree, there is
-		 * no further possibility of failure. Grab refs to the pool and objcg,
-		 * charge zswap memory, and increment zswap_stored_pages.
-		 * The opposite actions will be performed by zswap_entry_free()
-		 * when the entry is removed from the tree.
-		 */
-		zswap_pool_get(pool);
-		if (objcg) {
-			obj_cgroup_get(objcg);
-			obj_cgroup_charge_zswap(objcg, entry->length);
+			/*
+			 * We finish initializing the entry while it's already in xarray.
+			 * This is safe because:
+			 *
+			 * 1. Concurrent stores and invalidations are excluded by folio lock.
+			 *
+			 * 2. Writeback is excluded by the entry not being on the LRU yet.
+			 *    The publishing order matters to prevent writeback from seeing
+			 *    an incoherent entry.
+			 */
+			entry->pool = pool;
+			entry->swpentry = page_swpentry;
+			entry->objcg = objcg;
+			entry->referenced = true;
+			if (entry->length) {
+				INIT_LIST_HEAD(&entry->lru);
+				zswap_lru_add(&zswap_list_lru, entry);
+			}
 		}
-		atomic_long_inc(&zswap_stored_pages);
 
-		/*
-		 * We finish initializing the entry while it's already in xarray.
-		 * This is safe because:
-		 *
-		 * 1. Concurrent stores and invalidations are excluded by folio lock.
-		 *
-		 * 2. Writeback is excluded by the entry not being on the LRU yet.
-		 *    The publishing order matters to prevent writeback from seeing
-		 *    an incoherent entry.
-		 */
-		entry->pool = pool;
-		entry->swpentry = page_swpentry;
-		entry->objcg = objcg;
-		entry->referenced = true;
-		if (entry->length) {
-			INIT_LIST_HEAD(&entry->lru);
-			zswap_lru_add(&zswap_list_lru, entry);
-		}
+		from_index = nr_pages++;
+
+		if (nr_pages > nr_folio_pages)
+			break;
 	}
 
+	acomp_ctx_put_unlock(acomp_ctx);
 	kfree(entries);
 	return true;
 
@@ -1734,6 +1762,7 @@ static bool zswap_store_folio(struct folio *folio,
 		zswap_entry_cache_free(entries[index]);
 	}
 
+	acomp_ctx_put_unlock(acomp_ctx);
 	kfree(entries);
 	return false;
 }
-- 
2.27.0



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 12/16] mm: zswap: Allocate pool batching resources if the compressor supports batching.
  2025-02-06  7:20 ` [PATCH v6 12/16] mm: zswap: Allocate pool batching resources if the compressor supports batching Kanchana P Sridhar
@ 2025-02-06 18:55   ` Yosry Ahmed
  2025-02-28 10:00     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 30+ messages in thread
From: Yosry Ahmed @ 2025-02-06 18:55 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, akpm, linux-crypto, herbert,
	davem, clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
	wajdi.k.feghali, vinodh.gopal

On Wed, Feb 05, 2025 at 11:20:58PM -0800, Kanchana P Sridhar wrote:
> This patch adds support for the per-CPU acomp_ctx to track multiple
> compression/decompression requests. The zswap_cpu_comp_prepare() cpu

nit: s/cpu/CPU

> onlining code will check if the compressor supports batching. If so, it
> will allocate the necessary batching resources.
> 
> However, zswap does not use more than one request yet. Follow-up patches
> will actually utilize the multiple acomp_ctx requests/buffers for batch
> compression/decompression of multiple pages.
> 
> The newly added ZSWAP_MAX_BATCH_SIZE limits the amount of extra memory used
> for batching. There is no extra memory usage for compressors that do not
> support batching.

That's not entirely accurate; there's a tiny bit of extra overhead to
allocate the arrays. It can be avoided, but I am not sure it's worth the
complexity.

> 
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/zswap.c | 132 +++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 98 insertions(+), 34 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index a2baceed3bf9..dc7d1ff04b22 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -78,6 +78,16 @@ static bool zswap_pool_reached_full;
>  
>  #define ZSWAP_PARAM_UNSET ""
>  
> +/*
> + * For compression batching of large folios:
> + * Maximum number of acomp compress requests that will be processed
> + * in a batch, iff the zswap compressor supports batching.
> + * This limit exists because we preallocate enough requests and buffers
> + * in the per-cpu acomp_ctx accordingly. Hence, a higher limit means higher
> + * memory usage.
> + */
> +#define ZSWAP_MAX_BATCH_SIZE 8U
> +
>  static int zswap_setup(void);
>  
>  /* Enable/disable zswap */
> @@ -143,9 +153,10 @@ bool zswap_never_enabled(void)
>  
>  struct crypto_acomp_ctx {
>  	struct crypto_acomp *acomp;
> -	struct acomp_req *req;
> +	struct acomp_req **reqs;
> +	u8 **buffers;
> +	unsigned int nr_reqs;
>  	struct crypto_wait wait;
> -	u8 *buffer;
>  	struct mutex mutex;
>  	bool is_sleepable;
>  };
> @@ -821,15 +832,13 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>  	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
>  	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
>  	struct crypto_acomp *acomp = NULL;
> -	struct acomp_req *req = NULL;
> -	u8 *buffer = NULL;
> -	int ret;
> +	unsigned int nr_reqs = 1;
> +	int ret = -ENOMEM;
> +	int i;
>  
> -	buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
> -	if (!buffer) {
> -		ret = -ENOMEM;
> -		goto fail;
> -	}
> +	acomp_ctx->buffers = NULL;
> +	acomp_ctx->reqs = NULL;
> +	acomp_ctx->nr_reqs = 0;
>  
>  	acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
>  	if (IS_ERR(acomp)) {
> @@ -839,12 +848,30 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>  		goto fail;
>  	}
>  
> -	req = acomp_request_alloc(acomp);
> -	if (!req) {
> -		pr_err("could not alloc crypto acomp_request %s\n",
> -		       pool->tfm_name);
> -		ret = -ENOMEM;
> +	if (acomp_has_async_batching(acomp))
> +		nr_reqs = min(ZSWAP_MAX_BATCH_SIZE, crypto_acomp_batch_size(acomp));

Do we need to check acomp_has_async_batching() here? Shouldn't
crypto_acomp_batch_size() just return 1 if batching is not supported?

> +
> +	acomp_ctx->buffers = kcalloc_node(nr_reqs, sizeof(u8 *), GFP_KERNEL, cpu_to_node(cpu));
> +	if (!acomp_ctx->buffers)
> +		goto fail;
> +
> +	for (i = 0; i < nr_reqs; ++i) {
> +		acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
> +		if (!acomp_ctx->buffers[i])
> +			goto fail;
> +	}
> +
> +	acomp_ctx->reqs = kcalloc_node(nr_reqs, sizeof(struct acomp_req *), GFP_KERNEL, cpu_to_node(cpu));
> +	if (!acomp_ctx->reqs)
>  		goto fail;
> +
> +	for (i = 0; i < nr_reqs; ++i) {
> +		acomp_ctx->reqs[i] = acomp_request_alloc(acomp);
> +		if (!acomp_ctx->reqs[i]) {
> +			pr_err("could not alloc crypto acomp_request reqs[%d] %s\n",
> +			       i, pool->tfm_name);
> +			goto fail;
> +		}
>  	}
>  
>  	/*
> @@ -853,6 +880,13 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>  	 * again resulting in a deadlock.
>  	 */
>  	mutex_lock(&acomp_ctx->mutex);

I had moved all the acomp_ctx initializations under the mutex to keep
its state always fully initialized or uninitialized for anyone holding
the lock. With this change, acomp_ctx->reqs will be set to non-NULL
before the mutex is held and the acomp_ctx is fully initialized.

The code in the compression/decompression path uses acomp_ctx->reqs to
check if the acomp_ctx can be used. While I don't believe it's currently
possible for zswap_cpu_comp_prepare() to race with these paths, I did
this to be future proof. I don't want the code to end up initializing
some of the struct under the lock and some of it without it.

So I think there are two options here:
- Do the due diligence check that holding the mutex is not required when
  initializing acomp_ctx here, and remove the mutex locking here
  completely.
- Keep the initializations in the lock critical section (i.e. allocate
  everything first, then initialize under the lock); a rough sketch follows
  below.
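For the second option, something like this (just an untested sketch; it
assumes the reqs/buffers arrays are first allocated into local variables):

	/* Allocate acomp, reqs[] and buffers[] first, outside the mutex. */
	acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
	/* ... allocate the local reqs[] and buffers[] arrays as above ... */

	/* Then publish the fully initialized state under the mutex. */
	mutex_lock(&acomp_ctx->mutex);
	crypto_init_wait(&acomp_ctx->wait);
	acomp_request_set_callback(reqs[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
				   crypto_req_done, &acomp_ctx->wait);
	acomp_ctx->acomp = acomp;
	acomp_ctx->buffers = buffers;
	acomp_ctx->nr_reqs = nr_reqs;
	acomp_ctx->is_sleepable = acomp_is_async(acomp);
	acomp_ctx->reqs = reqs;
	mutex_unlock(&acomp_ctx->mutex);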

> +
> +	/*
> +	 * The crypto_wait is used only in fully synchronous, i.e., with scomp
> +	 * or non-poll mode of acomp, hence there is only one "wait" per
> +	 * acomp_ctx, with callback set to reqs[0], under the assumption that
> +	 * there is at least 1 request per acomp_ctx.
> +	 */
>  	crypto_init_wait(&acomp_ctx->wait);
>  
>  	/*
> @@ -860,20 +894,33 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>  	 * crypto_wait_req(); if the backend of acomp is scomp, the callback
>  	 * won't be called, crypto_wait_req() will return without blocking.
>  	 */
> -	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
> +	acomp_request_set_callback(acomp_ctx->reqs[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
>  				   crypto_req_done, &acomp_ctx->wait);
>  
> -	acomp_ctx->buffer = buffer;
> +	acomp_ctx->nr_reqs = nr_reqs;
>  	acomp_ctx->acomp = acomp;
>  	acomp_ctx->is_sleepable = acomp_is_async(acomp);
> -	acomp_ctx->req = req;
>  	mutex_unlock(&acomp_ctx->mutex);
>  	return 0;
>  
>  fail:
> +	if (acomp_ctx->buffers) {
> +		for (i = 0; i < nr_reqs; ++i)
> +			kfree(acomp_ctx->buffers[i]);
> +		kfree(acomp_ctx->buffers);
> +		acomp_ctx->buffers = NULL;
> +	}
> +
> +	if (acomp_ctx->reqs) {
> +		for (i = 0; i < nr_reqs; ++i)
> +			if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
> +				acomp_request_free(acomp_ctx->reqs[i]);
> +		kfree(acomp_ctx->reqs);
> +		acomp_ctx->reqs = NULL;
> +	}
> +
>  	if (acomp)
>  		crypto_free_acomp(acomp);
> -	kfree(buffer);
>  	return ret;
>  }
>  
> @@ -883,14 +930,31 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
>  	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
>  
>  	mutex_lock(&acomp_ctx->mutex);
> +
>  	if (!IS_ERR_OR_NULL(acomp_ctx)) {
> -		if (!IS_ERR_OR_NULL(acomp_ctx->req))
> -			acomp_request_free(acomp_ctx->req);
> -		acomp_ctx->req = NULL;
> +		int i;
> +
> +		if (acomp_ctx->reqs) {
> +			for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> +				if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
> +					acomp_request_free(acomp_ctx->reqs[i]);
> +			kfree(acomp_ctx->reqs);
> +			acomp_ctx->reqs = NULL;
> +		}
> +
> +		if (acomp_ctx->buffers) {
> +			for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> +				kfree(acomp_ctx->buffers[i]);
> +			kfree(acomp_ctx->buffers);
> +			acomp_ctx->buffers = NULL;
> +		}
> +

The code here seems to be almost exactly like the failure path in
zswap_cpu_comp_prepare(); would it be better to put it in a helper?

>  		if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
>  			crypto_free_acomp(acomp_ctx->acomp);
> -		kfree(acomp_ctx->buffer);
> +
> +		acomp_ctx->nr_reqs = 0;
>  	}
> +
>  	mutex_unlock(&acomp_ctx->mutex);
>  
>  	return 0;
> @@ -903,7 +967,7 @@ static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
>  	for (;;) {
>  		acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
>  		mutex_lock(&acomp_ctx->mutex);
> -		if (likely(acomp_ctx->req))
> +		if (likely(acomp_ctx->reqs))
>  			return acomp_ctx;
>  		/*
>  		 * It is possible that we were migrated to a different CPU after
> @@ -935,7 +999,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>  	u8 *dst;
>  
>  	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> -	dst = acomp_ctx->buffer;
> +	dst = acomp_ctx->buffers[0];
>  	sg_init_table(&input, 1);
>  	sg_set_page(&input, page, PAGE_SIZE, 0);
>  
> @@ -945,7 +1009,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>  	 * giving the dst buffer with enough length to avoid buffer overflow.
>  	 */
>  	sg_init_one(&output, dst, PAGE_SIZE * 2);
> -	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
> +	acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, PAGE_SIZE, dlen);
>  
>  	/*
>  	 * it maybe looks a little bit silly that we send an asynchronous request,
> @@ -959,8 +1023,8 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>  	 * but in different threads running on different cpu, we have different
>  	 * acomp instance, so multiple threads can do (de)compression in parallel.
>  	 */
> -	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> -	dlen = acomp_ctx->req->dlen;
> +	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
> +	dlen = acomp_ctx->reqs[0]->dlen;
>  	if (comp_ret)
>  		goto unlock;
>  
> @@ -1011,19 +1075,19 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>  	 */
>  	if ((acomp_ctx->is_sleepable && !zpool_can_sleep_mapped(zpool)) ||
>  	    !virt_addr_valid(src)) {
> -		memcpy(acomp_ctx->buffer, src, entry->length);
> -		src = acomp_ctx->buffer;
> +		memcpy(acomp_ctx->buffers[0], src, entry->length);
> +		src = acomp_ctx->buffers[0];
>  		zpool_unmap_handle(zpool, entry->handle);
>  	}
>  
>  	sg_init_one(&input, src, entry->length);
>  	sg_init_table(&output, 1);
>  	sg_set_folio(&output, folio, PAGE_SIZE, 0);
> -	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> -	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> -	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> +	acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, entry->length, PAGE_SIZE);
> +	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->reqs[0]), &acomp_ctx->wait));
> +	BUG_ON(acomp_ctx->reqs[0]->dlen != PAGE_SIZE);
>  
> -	if (src != acomp_ctx->buffer)
> +	if (src != acomp_ctx->buffers[0])
>  		zpool_unmap_handle(zpool, entry->handle);
>  	acomp_ctx_put_unlock(acomp_ctx);
>  }
> -- 
> 2.27.0
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 15/16] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios.
  2025-02-06  7:21 ` [PATCH v6 15/16] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios Kanchana P Sridhar
@ 2025-02-06 19:10   ` Yosry Ahmed
  2025-02-06 19:24     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 30+ messages in thread
From: Yosry Ahmed @ 2025-02-06 19:10 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, akpm, linux-crypto, herbert,
	davem, clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
	wajdi.k.feghali, vinodh.gopal

On Wed, Feb 05, 2025 at 11:21:01PM -0800, Kanchana P Sridhar wrote:
> zswap_compress_folio() is modified to detect if the pool's acomp_ctx has
> more than one "nr_reqs", which will be the case if the cpu onlining code
> has allocated multiple batching resources in the acomp_ctx. If so, it means
> compress batching can be used with a batch-size of "acomp_ctx->nr_reqs".
> 
> If compress batching can be used, zswap_compress_folio() will invoke the
> newly added zswap_batch_compress() procedure to compress and store the
> folio in batches of "acomp_ctx->nr_reqs" pages.
> 
> With Intel IAA, the iaa_crypto driver will compress each batch of pages in
> parallel in hardware.
> 
> Hence, zswap_batch_compress() does the same computes for a batch, as
> zswap_compress() does for a page; and returns true if the batch was
> successfully compressed/stored, and false otherwise.
> 
> If the pool does not support compress batching, or the folio has only one
> page, zswap_compress_folio() calls zswap_compress() for each individual
> page in the folio, as before.
> 
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/zswap.c | 122 +++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 113 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 6563d12e907b..f1cba77eda62 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -985,10 +985,11 @@ static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx)
>  	mutex_unlock(&acomp_ctx->mutex);
>  }
>  
> +/* The per-cpu @acomp_ctx mutex should be locked/unlocked in the caller. */

Please use lockdep assertions rather than comments for internal locking rules.

>  static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> -			   struct zswap_pool *pool)
> +			   struct zswap_pool *pool,
> +			   struct crypto_acomp_ctx *acomp_ctx)
>  {
> -	struct crypto_acomp_ctx *acomp_ctx;
>  	struct scatterlist input, output;
>  	int comp_ret = 0, alloc_ret = 0;
>  	unsigned int dlen = PAGE_SIZE;
> @@ -998,7 +999,6 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>  	gfp_t gfp;
>  	u8 *dst;
>  
> -	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
>  	dst = acomp_ctx->buffers[0];
>  	sg_init_table(&input, 1);
>  	sg_set_page(&input, page, PAGE_SIZE, 0);
> @@ -1051,7 +1051,6 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>  	else if (alloc_ret)
>  		zswap_reject_alloc_fail++;
>  
> -	acomp_ctx_put_unlock(acomp_ctx);
>  	return comp_ret == 0 && alloc_ret == 0;
>  }
>  
> @@ -1509,20 +1508,125 @@ static void shrink_worker(struct work_struct *w)
>  * main API
>  **********************************/
>  
> +/* The per-cpu @acomp_ctx mutex should be locked/unlocked in the caller. */
> +static bool zswap_batch_compress(struct folio *folio,
> +				 long index,
> +				 unsigned int batch_size,
> +				 struct zswap_entry *entries[],
> +				 struct zswap_pool *pool,
> +				 struct crypto_acomp_ctx *acomp_ctx)
> +{
> +	int comp_errors[ZSWAP_MAX_BATCH_SIZE] = { 0 };
> +	unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
> +	struct page *pages[ZSWAP_MAX_BATCH_SIZE];
> +	unsigned int i, nr_batch_pages;
> +	bool ret = true;
> +
> +	nr_batch_pages = min((unsigned int)(folio_nr_pages(folio) - index), batch_size);
> +
> +	for (i = 0; i < nr_batch_pages; ++i) {
> +		pages[i] = folio_page(folio, index + i);
> +		dlens[i] = PAGE_SIZE;
> +	}
> +
> +	/*
> +	 * Batch compress @nr_batch_pages. If IAA is the compressor, the
> +	 * hardware will compress @nr_batch_pages in parallel.
> +	 */

Please do not specifically mention IAA in zswap.c, as batching could be
supported in the future by other compressors.

> +	ret = crypto_acomp_batch_compress(
> +		acomp_ctx->reqs,
> +		NULL,
> +		pages,
> +		acomp_ctx->buffers,
> +		dlens,
> +		comp_errors,
> +		nr_batch_pages);

Does crypto_acomp_batch_compress() not require calling
crypto_wait_req()?

> +
> +	if (ret) {
> +		/*
> +		 * All batch pages were successfully compressed.
> +		 * Store the pages in zpool.
> +		 */
> +		struct zpool *zpool = pool->zpool;
> +		gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
> +
> +		if (zpool_malloc_support_movable(zpool))
> +			gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
> +
> +		for (i = 0; i < nr_batch_pages; ++i) {
> +			unsigned long handle;
> +			char *buf;
> +			int err;
> +
> +			err = zpool_malloc(zpool, dlens[i], gfp, &handle);
> +
> +			if (err) {
> +				if (err == -ENOSPC)
> +					zswap_reject_compress_poor++;
> +				else
> +					zswap_reject_alloc_fail++;
> +
> +				ret = false;
> +				break;
> +			}
> +
> +			buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
> +			memcpy(buf, acomp_ctx->buffers[i], dlens[i]);
> +			zpool_unmap_handle(zpool, handle);
> +
> +			entries[i]->handle = handle;
> +			entries[i]->length = dlens[i];
> +		}
> +	} else {
> +		/* Some batch pages had compression errors. */
> +		for (i = 0; i < nr_batch_pages; ++i) {
> +			if (comp_errors[i]) {
> +				if (comp_errors[i] == -ENOSPC)
> +					zswap_reject_compress_poor++;
> +				else
> +					zswap_reject_compress_fail++;
> +			}
> +		}
> +	}

This function is awfully close to zswap_compress(). It's essentially a
vectorized version and uses crypto_acomp_batch_compress() instead of
crypto_acomp_compress().

My questions are:
- Can we use crypto_acomp_batch_compress() for the non-batched case as
  well to unify the code? Does it cause any regressions?

- If we have to use different compression APIs, can we at least reuse
  the rest of the code? We can abstract the compression call into a
  helper that chooses the appropriate API based on the batch size (a
  rough sketch follows below). The rest should be the same AFAICT.
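For instance (untested; the helper name and signature are made up just to
illustrate the idea, and it expects dlens[] to be initialized to PAGE_SIZE):

/* Hypothetical helper: dispatch to the batched or single-request API. */
static bool zswap_acomp_compress(struct crypto_acomp_ctx *acomp_ctx,
				 struct page *pages[], unsigned int dlens[],
				 int errors[], unsigned int nr_pages)
{
	struct scatterlist input, output;

	/* Batched path: the driver compresses all pages in one call. */
	if (nr_pages > 1)
		return crypto_acomp_batch_compress(acomp_ctx->reqs, NULL,
						   pages, acomp_ctx->buffers,
						   dlens, errors, nr_pages);

	/* Single-request path: the same steps zswap_compress() does today. */
	sg_init_table(&input, 1);
	sg_set_page(&input, pages[0], PAGE_SIZE, 0);
	sg_init_one(&output, acomp_ctx->buffers[0], PAGE_SIZE * 2);
	acomp_request_set_params(acomp_ctx->reqs[0], &input, &output,
				 PAGE_SIZE, dlens[0]);
	errors[0] = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]),
				    &acomp_ctx->wait);
	dlens[0] = acomp_ctx->reqs[0]->dlen;
	return !errors[0];
}

The zpool_malloc()/memcpy()/stats handling that follows could then be shared
between the two cases.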

> +
> +	return ret;
> +}
> +
>  static bool zswap_compress_folio(struct folio *folio,
>  				 struct zswap_entry *entries[],
>  				 struct zswap_pool *pool)
>  {
>  	long index, nr_pages = folio_nr_pages(folio);
> +	struct crypto_acomp_ctx *acomp_ctx;
> +	unsigned int batch_size;
> +	bool ret = true;
>  
> -	for (index = 0; index < nr_pages; ++index) {
> -		struct page *page = folio_page(folio, index);
> +	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> +	batch_size = acomp_ctx->nr_reqs;
> +
> +	if ((batch_size > 1) && (nr_pages > 1)) {
> +		for (index = 0; index < nr_pages; index += batch_size) {
> +
> +			if (!zswap_batch_compress(folio, index, batch_size,
> +						  &entries[index], pool, acomp_ctx)) {
> +				ret = false;
> +				goto unlock_acomp_ctx;
> +			}
> +		}
> +	} else {
> +		for (index = 0; index < nr_pages; ++index) {
> +			struct page *page = folio_page(folio, index);
>  
> -		if (!zswap_compress(page, entries[index], pool))
> -			return false;
> +			if (!zswap_compress(page, entries[index], pool, acomp_ctx)) {
> +				ret = false;
> +				goto unlock_acomp_ctx;
> +			}
> +		}
>  	}
>  
> -	return true;
> +unlock_acomp_ctx:
> +	acomp_ctx_put_unlock(acomp_ctx);
> +	return ret;
>  }
>  
>  /*
> -- 
> 2.27.0
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 16/16] mm: zswap: Fix for zstd performance regression with 2M folios.
  2025-02-06  7:21 ` [PATCH v6 16/16] mm: zswap: Fix for zstd performance regression with 2M folios Kanchana P Sridhar
@ 2025-02-06 19:15   ` Yosry Ahmed
  2025-02-28 10:00     ` Sridhar, Kanchana P
  2025-02-20 23:28   ` Nhat Pham
  1 sibling, 1 reply; 30+ messages in thread
From: Yosry Ahmed @ 2025-02-06 19:15 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, akpm, linux-crypto, herbert,
	davem, clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
	wajdi.k.feghali, vinodh.gopal

On Wed, Feb 05, 2025 at 11:21:02PM -0800, Kanchana P Sridhar wrote:
> With the previous patch that enables support for batch compressions in
> zswap_compress_folio(), a 6.2% throughput regression was seen with zstd and
> 2M folios, using vm-scalability/usemem.
> 
> For compressors that don't support batching, this was root-caused to the
> following zswap_store_folio() structure:
> 
>  Batched stores:
>  ---------------
>  - Allocate all entries,
>  - Compress all entries,
>  - Store all entries in xarray/LRU.
> 
> Hence, the above structure is maintained only for batched stores, and the
> following structure is implemented for sequential stores of large folio pages,
> that fixes the zstd regression, while preserving common code paths for batched
> and sequential stores of a folio:
> 
>  Sequential stores:
>  ------------------
>  For each page in folio:
>   - allocate an entry,
>   - compress the page,
>   - store the entry in xarray/LRU.
> 
> This is submitted as a separate patch only for code review purposes. I will
> squash this with the previous commit in subsequent versions of this
> patch-series.

Could it be the cache locality?

I wonder if we should do what Chengming initially suggested and batch
everything at ZSWAP_MAX_BATCH_SIZE instead. Instead of
zswap_compress_folio() operating on the entire folio, we can operate on
batches of size ZSWAP_MAX_BATCH_SIZE, regardless of whether the
underlying compressor supports batching.

If we do this, instead of:
- Allocate all entries
- Compress all entries
- Store all entries

We can do:
  - For each batch (8 entries)
  	- Allocate all entries
	- Compress all entries
	- Store all entries

This should help unify the code, and I suspect it may also fix the zstd
regression. We can also skip the entries array allocation and use one on
the stack.
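Roughly (untested; error handling and cleanup elided), zswap_store_folio()
could then be structured as:

	struct zswap_entry *entries[ZSWAP_MAX_BATCH_SIZE];
	long nr_pages = folio_nr_pages(folio);
	long start, i, nr;

	for (start = 0; start < nr_pages; start += ZSWAP_MAX_BATCH_SIZE) {
		nr = min(nr_pages - start, (long)ZSWAP_MAX_BATCH_SIZE);

		/*
		 * 1. Allocate the zswap entries for this batch (on-stack
		 *    array, no kmalloc'ed entries[] needed).
		 */
		for (i = 0; i < nr; i++)
			entries[i] = zswap_entry_cache_alloc(GFP_KERNEL,
							     folio_nid(folio));

		/*
		 * 2. Compress the batch: one batched call if the compressor
		 *    supports batching, page by page otherwise.
		 */

		/*
		 * 3. Add the batch's entries to the xarray/LRU, charge the
		 *    objcg and bump the stats, exactly as today.
		 */
	}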


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH v6 15/16] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios.
  2025-02-06 19:10   ` Yosry Ahmed
@ 2025-02-06 19:24     ` Sridhar, Kanchana P
  2025-02-28 10:00       ` Sridhar, Kanchana P
  0 siblings, 1 reply; 30+ messages in thread
From: Sridhar, Kanchana P @ 2025-02-06 19:24 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, akpm, linux-crypto, herbert,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Thursday, February 6, 2025 11:11 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v6 15/16] mm: zswap: Compress batching with Intel IAA
> in zswap_store() of large folios.
> 
> On Wed, Feb 05, 2025 at 11:21:01PM -0800, Kanchana P Sridhar wrote:
> > zswap_compress_folio() is modified to detect if the pool's acomp_ctx has
> > more than one "nr_reqs", which will be the case if the cpu onlining code
> > has allocated multiple batching resources in the acomp_ctx. If so, it means
> > compress batching can be used with a batch-size of "acomp_ctx->nr_reqs".
> >
> > If compress batching can be used, zswap_compress_folio() will invoke the
> > newly added zswap_batch_compress() procedure to compress and store the
> > folio in batches of "acomp_ctx->nr_reqs" pages.
> >
> > With Intel IAA, the iaa_crypto driver will compress each batch of pages in
> > parallel in hardware.
> >
> > Hence, zswap_batch_compress() does the same computes for a batch, as
> > zswap_compress() does for a page; and returns true if the batch was
> > successfully compressed/stored, and false otherwise.
> >
> > If the pool does not support compress batching, or the folio has only one
> > page, zswap_compress_folio() calls zswap_compress() for each individual
> > page in the folio, as before.
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/zswap.c | 122
> +++++++++++++++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 113 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 6563d12e907b..f1cba77eda62 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -985,10 +985,11 @@ static void acomp_ctx_put_unlock(struct
> crypto_acomp_ctx *acomp_ctx)
> >  	mutex_unlock(&acomp_ctx->mutex);
> >  }
> >
> > +/* The per-cpu @acomp_ctx mutex should be locked/unlocked in the
> caller. */
> 
> Please use lockdep assertions rather than comments for internal locking rules.

Sure. Thanks for the suggestion.

> 
> >  static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> > -			   struct zswap_pool *pool)
> > +			   struct zswap_pool *pool,
> > +			   struct crypto_acomp_ctx *acomp_ctx)
> >  {
> > -	struct crypto_acomp_ctx *acomp_ctx;
> >  	struct scatterlist input, output;
> >  	int comp_ret = 0, alloc_ret = 0;
> >  	unsigned int dlen = PAGE_SIZE;
> > @@ -998,7 +999,6 @@ static bool zswap_compress(struct page *page,
> struct zswap_entry *entry,
> >  	gfp_t gfp;
> >  	u8 *dst;
> >
> > -	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> >  	dst = acomp_ctx->buffers[0];
> >  	sg_init_table(&input, 1);
> >  	sg_set_page(&input, page, PAGE_SIZE, 0);
> > @@ -1051,7 +1051,6 @@ static bool zswap_compress(struct page *page,
> struct zswap_entry *entry,
> >  	else if (alloc_ret)
> >  		zswap_reject_alloc_fail++;
> >
> > -	acomp_ctx_put_unlock(acomp_ctx);
> >  	return comp_ret == 0 && alloc_ret == 0;
> >  }
> >
> > @@ -1509,20 +1508,125 @@ static void shrink_worker(struct work_struct
> *w)
> >  * main API
> >  **********************************/
> >
> > +/* The per-cpu @acomp_ctx mutex should be locked/unlocked in the
> caller. */
> > +static bool zswap_batch_compress(struct folio *folio,
> > +				 long index,
> > +				 unsigned int batch_size,
> > +				 struct zswap_entry *entries[],
> > +				 struct zswap_pool *pool,
> > +				 struct crypto_acomp_ctx *acomp_ctx)
> > +{
> > +	int comp_errors[ZSWAP_MAX_BATCH_SIZE] = { 0 };
> > +	unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
> > +	struct page *pages[ZSWAP_MAX_BATCH_SIZE];
> > +	unsigned int i, nr_batch_pages;
> > +	bool ret = true;
> > +
> > +	nr_batch_pages = min((unsigned int)(folio_nr_pages(folio) - index),
> batch_size);
> > +
> > +	for (i = 0; i < nr_batch_pages; ++i) {
> > +		pages[i] = folio_page(folio, index + i);
> > +		dlens[i] = PAGE_SIZE;
> > +	}
> > +
> > +	/*
> > +	 * Batch compress @nr_batch_pages. If IAA is the compressor, the
> > +	 * hardware will compress @nr_batch_pages in parallel.
> > +	 */
> 
> Please do not specifically mention IAA in zswap.c, as batching could be
> supported in the future by other compressors.

Ok.

> 
> > +	ret = crypto_acomp_batch_compress(
> > +		acomp_ctx->reqs,
> > +		NULL,
> > +		pages,
> > +		acomp_ctx->buffers,
> > +		dlens,
> > +		comp_errors,
> > +		nr_batch_pages);
> 
> Does crypto_acomp_batch_compress() not require calling
> crypto_wait_req()?

It actually doesn't. If the crypto_wait parameter is NULL, the API requires
the driver to provide a way to process request completions asynchronously,
as described in patch 2, which adds the crypto batching API.
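
Roughly, the two calling conventions are (sketch only; see patch 2 for
the exact semantics):

        /*
         * Fully asynchronous batching, as used in zswap_batch_compress()
         * above: with a NULL crypto_wait, the driver is responsible for
         * processing the request completions (e.g. by polling).
         */
        ret = crypto_acomp_batch_compress(acomp_ctx->reqs, NULL, pages,
                                          acomp_ctx->buffers, dlens,
                                          comp_errors, nr_batch_pages);

        /*
         * Synchronous usage: pass a crypto_wait instead, and completions
         * are awaited through it.
         */
        ret = crypto_acomp_batch_compress(acomp_ctx->reqs, &acomp_ctx->wait,
                                          pages, acomp_ctx->buffers, dlens,
                                          comp_errors, nr_batch_pages);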

> 
> > +
> > +	if (ret) {
> > +		/*
> > +		 * All batch pages were successfully compressed.
> > +		 * Store the pages in zpool.
> > +		 */
> > +		struct zpool *zpool = pool->zpool;
> > +		gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
> > +
> > +		if (zpool_malloc_support_movable(zpool))
> > +			gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
> > +
> > +		for (i = 0; i < nr_batch_pages; ++i) {
> > +			unsigned long handle;
> > +			char *buf;
> > +			int err;
> > +
> > +			err = zpool_malloc(zpool, dlens[i], gfp, &handle);
> > +
> > +			if (err) {
> > +				if (err == -ENOSPC)
> > +					zswap_reject_compress_poor++;
> > +				else
> > +					zswap_reject_alloc_fail++;
> > +
> > +				ret = false;
> > +				break;
> > +			}
> > +
> > +			buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
> > +			memcpy(buf, acomp_ctx->buffers[i], dlens[i]);
> > +			zpool_unmap_handle(zpool, handle);
> > +
> > +			entries[i]->handle = handle;
> > +			entries[i]->length = dlens[i];
> > +		}
> > +	} else {
> > +		/* Some batch pages had compression errors. */
> > +		for (i = 0; i < nr_batch_pages; ++i) {
> > +			if (comp_errors[i]) {
> > +				if (comp_errors[i] == -ENOSPC)
> > +					zswap_reject_compress_poor++;
> > +				else
> > +					zswap_reject_compress_fail++;
> > +			}
> > +		}
> > +	}
> 
> This function is awfully close to zswap_compress(). It's essentially a
> vectorized version and uses crypto_acomp_batch_compress() instead of
> crypto_acomp_compress().
> 
> My questions are:
> - Can we use crypto_acomp_batch_compress() for the non-batched case as
>   well to unify the code? Does it cause any regressions?
> 
> - If we have to use different compressions APIs, can we at least reuse
>   the rest of the code? We can abstract the compression call into a
>   helper that chooses the appropriate API based on the batch size. The
>   rest should be the same AFAICT.

All good ideas. Let me think about this some more, and gather some data.
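
To make the suggestion concrete, a rough, untested sketch of such a
helper (the name is made up) could be:

        static bool zswap_compress_pages(struct page **pages, u8 **buffers,
                                         unsigned int *dlens, int *errors,
                                         unsigned int nr_pages,
                                         struct crypto_acomp_ctx *acomp_ctx)
        {
                struct scatterlist input, output;
                int err;

                /* Batching compressor: compress the whole batch in one call. */
                if (nr_pages > 1)
                        return crypto_acomp_batch_compress(acomp_ctx->reqs, NULL,
                                                           pages, buffers, dlens,
                                                           errors, nr_pages);

                /* Single page: same steps as zswap_compress() today. */
                sg_init_table(&input, 1);
                sg_set_page(&input, pages[0], PAGE_SIZE, 0);
                sg_init_one(&output, buffers[0], PAGE_SIZE * 2);
                acomp_request_set_params(acomp_ctx->reqs[0], &input, &output,
                                         PAGE_SIZE, dlens[0]);
                err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]),
                                      &acomp_ctx->wait);
                dlens[0] = acomp_ctx->reqs[0]->dlen;
                errors[0] = err;
                return !err;
        }

The zpool_malloc()/map/memcpy and error-accounting code could then be
shared by both paths.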

Thanks,
Kanchana

> 
> > +
> > +	return ret;
> > +}
> > +
> >  static bool zswap_compress_folio(struct folio *folio,
> >  				 struct zswap_entry *entries[],
> >  				 struct zswap_pool *pool)
> >  {
> >  	long index, nr_pages = folio_nr_pages(folio);
> > +	struct crypto_acomp_ctx *acomp_ctx;
> > +	unsigned int batch_size;
> > +	bool ret = true;
> >
> > -	for (index = 0; index < nr_pages; ++index) {
> > -		struct page *page = folio_page(folio, index);
> > +	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> > +	batch_size = acomp_ctx->nr_reqs;
> > +
> > +	if ((batch_size > 1) && (nr_pages > 1)) {
> > +		for (index = 0; index < nr_pages; index += batch_size) {
> > +
> > +			if (!zswap_batch_compress(folio, index, batch_size,
> > +						  &entries[index], pool, acomp_ctx)) {
> > +				ret = false;
> > +				goto unlock_acomp_ctx;
> > +			}
> > +		}
> > +	} else {
> > +		for (index = 0; index < nr_pages; ++index) {
> > +			struct page *page = folio_page(folio, index);
> >
> > -		if (!zswap_compress(page, entries[index], pool))
> > -			return false;
> > +			if (!zswap_compress(page, entries[index], pool, acomp_ctx)) {
> > +				ret = false;
> > +				goto unlock_acomp_ctx;
> > +			}
> > +		}
> >  	}
> >
> > -	return true;
> > +unlock_acomp_ctx:
> > +	acomp_ctx_put_unlock(acomp_ctx);
> > +	return ret;
> >  }
> >
> >  /*
> > --
> > 2.27.0
> >


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 00/16] zswap IAA compress batching
  2025-02-06  7:20 [PATCH v6 00/16] zswap IAA compress batching Kanchana P Sridhar
                   ` (15 preceding siblings ...)
  2025-02-06  7:21 ` [PATCH v6 16/16] mm: zswap: Fix for zstd performance regression with 2M folios Kanchana P Sridhar
@ 2025-02-11 17:05 ` Eric Biggers
  2025-02-11 17:52   ` Nhat Pham
  16 siblings, 1 reply; 30+ messages in thread
From: Eric Biggers @ 2025-02-11 17:05 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, surenb,
	kristen.c.accardi, wajdi.k.feghali, vinodh.gopal

On Wed, Feb 05, 2025 at 11:20:46PM -0800, Kanchana P Sridhar wrote:
> IAA Compression Batching:
> =========================
> 
> This patch-series introduces the use of the Intel Analytics Accelerator
> (IAA) for parallel batch compression of pages in large folios to improve
> zswap swapout latency.

So, zswap is passed a large folio to swap out, and it divides it into 4K pages
and compresses each independently.  The performance improvement in this patchset
comes entirely from compressing the folio's pages in parallel, synchronously,
using IAA.

Before even considering IAA and going through all the pain of supporting
batching with an off-CPU offload, wouldn't it make a lot more sense to try just
compressing each folio in software as a single unit?  Compared to the existing
approach of compressing the folio in 4K chunks, that should be much faster and
produce a much better compression ratio.  Compression algorithms are very much
designed for larger amounts of data, so that they can find more matches.

It looks like the mm subsystem used to always break up folios when swapping them
out, but that has now been fixed.  It looks like zswap just hasn't been updated
to do otherwise yet?

FWIW, here are some speed and compression ratio results I collected in a
compression benchmark module that tests feeding vmlinux (uncompressed_size:
26624 KiB) though zstd in 4 KiB page or 2 MiB folio-sized chunks:

zstd level 3, 4K chunks: 86 ms; compressed_size 9429 KiB
zstd level 3, 2M chunks: 57 ms; compressed_size 8251 KiB
zstd level 1, 4K chunks: 65 ms; compressed_size 9806 KiB
zstd level 1, 2M chunks: 34 ms; compressed_size 8878 KiB

The current zswap parameterization is "zstd level 3, 4K chunks".  I would
recommend "zstd level 1, 2M chunks", which would be 2.5 times as fast and give a
6% better compression ratio.
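
For illustration, a userspace analogue of this chunk-size comparison
using the zstd library (a sketch, not the benchmark module mentioned
above) would be:

    #include <stdlib.h>
    #include <zstd.h>

    /*
     * Compress `size` bytes of `src` in fixed-size chunks and return the
     * total compressed size, mimicking per-chunk compression in zswap.
     * Returns 0 on allocation or compression failure.
     */
    static size_t compress_in_chunks(const char *src, size_t size,
                                     size_t chunk, int level)
    {
            size_t cap = ZSTD_compressBound(chunk);
            char *dst = malloc(cap);
            size_t total = 0, off;

            if (!dst)
                    return 0;
            for (off = 0; off < size; off += chunk) {
                    size_t n = size - off < chunk ? size - off : chunk;
                    size_t c = ZSTD_compress(dst, cap, src + off, n, level);

                    if (ZSTD_isError(c)) {
                            total = 0;
                            break;
                    }
                    total += c;
            }
            free(dst);
            return total;
    }

Calling it on the same buffer with chunk sizes of 4096 and
(size_t)2 << 20, and levels 3 and 1, with timing around the calls,
roughly reproduces the comparison above.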

What is preventing zswap from compressing whole folios?

- Eric


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 00/16] zswap IAA compress batching
  2025-02-11 17:05 ` [PATCH v6 00/16] zswap IAA compress batching Eric Biggers
@ 2025-02-11 17:52   ` Nhat Pham
  0 siblings, 0 replies; 30+ messages in thread
From: Nhat Pham @ 2025-02-11 17:52 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, yosry.ahmed,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, surenb,
	kristen.c.accardi, wajdi.k.feghali, vinodh.gopal

On Tue, Feb 11, 2025 at 9:05 AM Eric Biggers <ebiggers@kernel.org> wrote:
>
> On Wed, Feb 05, 2025 at 11:20:46PM -0800, Kanchana P Sridhar wrote:
>
> So, zswap is passed a large folio to swap out, and it divides it into 4K pages
> and compresses each independently.  The performance improvement in this patchset
> comes entirely from compressing the folio's pages in parallel, synchronously,
> using IAA.
>
> Before even considering IAA and going through all the pain of supporting
> batching with an off-CPU offload, wouldn't it make a lot more sense to try just
> compressing each folio in software as a single unit?  Compared to the existing
> approach of compressing the folio in 4K chunks, that should be much faster and
> produce a much better compression ratio.  Compression algorithms are very much
> designed for larger amounts of data, so that they can find more matches.
>
> It looks like the mm subsystem used to always break up folios when swapping them
> out, but that is now been fixed.  It looks like zswap just hasn't been updated
> to do otherwise yet?
>
> FWIW, here are some speed and compression ratio results I collected in a
> compression benchmark module that tests feeding vmlinux (uncompressed_size:
> 26624 KiB) though zstd in 4 KiB page or 2 MiB folio-sized chunks:
>
> zstd level 3, 4K chunks: 86 ms; compressed_size 9429 KiB
> zstd level 3, 2M chunks: 57 ms; compressed_size 8251 KiB
> zstd level 1, 4K chunks: 65 ms; compressed_size 9806 KiB
> zstd level 1, 2M chunks: 34 ms; compressed_size 8878 KiB
>
> The current zswap parameterization is "zstd level 3, 4K chunks".  I would
> recommend "zstd level 1, 2M chunks", which would be 2.5 times as fast and give a
> 6% better compression ratio.
>
> What is preventing zswap from compressing whole folios?

Thanks for the input, Eric! That was one of the directions we have
been exploring for zswap and zram. Here's what's going on:

The first issue is that zsmalloc, the backend memory allocator for
zswap, currently does not support larger-than-4K object sizes. Barry
Song is working on this:

https://lore.kernel.org/linux-mm/20241121222521.83458-1-21cnbao@gmail.com/

Performance-wise, compressing whole folios also means that at swap-in
time, you have to decompress and load the entire folio/chunk. This can
create extra memory pressure (for example, you have to either allocate
a huge page or multiple small pages for the folio/chunk), which is
particularly bad when the system is already in trouble :) I believe
that is one of the blockers for the above patch series as well.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 02/16] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-06  7:20 ` [PATCH v6 02/16] crypto: acomp - Define new interfaces for compress/decompress batching Kanchana P Sridhar
@ 2025-02-16  5:10   ` Herbert Xu
  2025-02-28 10:00     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 30+ messages in thread
From: Herbert Xu @ 2025-02-16  5:10 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi, wajdi.k.feghali, vinodh.gopal

On Wed, Feb 05, 2025 at 11:20:48PM -0800, Kanchana P Sridhar wrote:
> This commit adds get_batch_size(), batch_compress() and batch_decompress()
> interfaces to:
> 
>   struct acomp_alg
>   struct crypto_acomp
> 
> A crypto_acomp compression algorithm that supports batching of compressions
> and decompressions must provide implementations for these API.
> 
> A new helper function acomp_has_async_batching() can be invoked to query if
> a crypto_acomp has registered these batching interfaces.
> 
> A higher level module like zswap can call acomp_has_async_batching() to
> detect if the compressor supports batching, and if so, it can call
> the new crypto_acomp_batch_size() to detect the maximum batch-size
> supported by the compressor. Based on this, zswap can use the minimum of
> any zswap-specific upper limits for batch-size and the compressor's max
> batch-size, to allocate batching resources.
> 
> This allows the iaa_crypto Intel IAA driver to register implementations for
> the get_batch_size(), batch_compress() and batch_decompress() acomp_alg
> interfaces, that can subsequently be invoked from the kernel zswap/zram
> modules to compress/decompress pages in parallel in the IAA hardware
> accelerator to improve swapout/swapin performance through these newly added
> corresponding crypto_acomp API:
> 
>   crypto_acomp_batch_size()
>   crypto_acomp_batch_compress()
>   crypto_acomp_batch_decompress()
> 
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  crypto/acompress.c                  |   3 +
>  include/crypto/acompress.h          | 111 ++++++++++++++++++++++++++++
>  include/crypto/internal/acompress.h |  19 +++++
>  3 files changed, 133 insertions(+)

Please get rid of these batch interfaces.  The whole point of
request chaining is to remove the distinction between batching
and normal requests.  IOW, if a request is chained then it is
automatically processed as a batch.  If it's a singleton then
normal processing will occur.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v6 16/16] mm: zswap: Fix for zstd performance regression with 2M folios.
  2025-02-06  7:21 ` [PATCH v6 16/16] mm: zswap: Fix for zstd performance regression with 2M folios Kanchana P Sridhar
  2025-02-06 19:15   ` Yosry Ahmed
@ 2025-02-20 23:28   ` Nhat Pham
  2025-02-21  3:24     ` Sridhar, Kanchana P
  1 sibling, 1 reply; 30+ messages in thread
From: Nhat Pham @ 2025-02-20 23:28 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosry.ahmed, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, akpm, linux-crypto, herbert,
	davem, clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
	wajdi.k.feghali, vinodh.gopal

On Wed, Feb 5, 2025 at 11:21 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> With the previous patch that enables support for batch compressions in
> zswap_compress_folio(), a 6.2% throughput regression was seen with zstd and
> 2M folios, using vm-scalability/usemem.
>
> For compressors that don't support batching, this was root-caused to the
> following zswap_store_folio() structure:
>
>  Batched stores:
>  ---------------
>  - Allocate all entries,
>  - Compress all entries,
>  - Store all entries in xarray/LRU.

Can you clarify why the above structure leads to performance regression?

>
> Hence, the above structure is maintained only for batched stores, and the
> following structure is implemented for sequential stores of large folio pages,
> that fixes the zstd regression, while preserving common code paths for batched
> and sequential stores of a folio:
>
>  Sequential stores:
>  ------------------
>  For each page in folio:
>   - allocate an entry,
>   - compress the page,
>   - store the entry in xarray/LRU.
>
> This is submitted as a separate patch only for code review purposes. I will
> squash this with the previous commit in subsequent versions of this
> patch-series.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/zswap.c | 193 ++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 111 insertions(+), 82 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index f1cba77eda62..7bfc720a6201 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1592,40 +1592,32 @@ static bool zswap_batch_compress(struct folio *folio,
>         return ret;
>  }
>
> -static bool zswap_compress_folio(struct folio *folio,
> -                                struct zswap_entry *entries[],
> -                                struct zswap_pool *pool)
> +static __always_inline bool zswap_compress_folio(struct folio *folio,
> +                                                struct zswap_entry *entries[],
> +                                                long from_index,
> +                                                struct zswap_pool *pool,
> +                                                unsigned int batch_size,
> +                                                struct crypto_acomp_ctx *acomp_ctx)
>  {
> -       long index, nr_pages = folio_nr_pages(folio);
> -       struct crypto_acomp_ctx *acomp_ctx;
> -       unsigned int batch_size;
> +       long index = from_index, nr_pages = folio_nr_pages(folio);
>         bool ret = true;
>
> -       acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> -       batch_size = acomp_ctx->nr_reqs;
> -
>         if ((batch_size > 1) && (nr_pages > 1)) {
> -               for (index = 0; index < nr_pages; index += batch_size) {
> +               for (; index < nr_pages; index += batch_size) {
>
>                         if (!zswap_batch_compress(folio, index, batch_size,
>                                                   &entries[index], pool, acomp_ctx)) {
>                                 ret = false;
> -                               goto unlock_acomp_ctx;
> +                               break;
>                         }
>                 }
>         } else {
> -               for (index = 0; index < nr_pages; ++index) {
> -                       struct page *page = folio_page(folio, index);
> +               struct page *page = folio_page(folio, index);
>
> -                       if (!zswap_compress(page, entries[index], pool, acomp_ctx)) {
> -                               ret = false;
> -                               goto unlock_acomp_ctx;
> -                       }
> -               }
> +               if (!zswap_compress(page, entries[index], pool, acomp_ctx))
> +                       ret = false;
>         }
>
> -unlock_acomp_ctx:
> -       acomp_ctx_put_unlock(acomp_ctx);
>         return ret;
>  }
>
> @@ -1637,92 +1629,128 @@ static bool zswap_compress_folio(struct folio *folio,
>   * handles to ERR_PTR(-EINVAL) at allocation time, and the fact that the
>   * entry's handle is subsequently modified only upon a successful zpool_malloc()
>   * after the page is compressed.
> + *
> + * For compressors that don't support batching, the following structure
> + * showed a performance regression with zstd/2M folios:
> + *
> + * Batched stores:
> + * ---------------
> + *  - Allocate all entries,
> + *  - Compress all entries,
> + *  - Store all entries in xarray/LRU.
> + *
> + * Hence, the above structure is maintained only for batched stores, and the
> + * following structure is implemented for sequential stores of large folio pages,
> + * that fixes the regression, while preserving common code paths for batched
> + * and sequential stores of a folio:
> + *
> + * Sequential stores:
> + * ------------------
> + * For each page in folio:
> + *  - allocate an entry,
> + *  - compress the page,
> + *  - store the entry in xarray/LRU.
>   */
>  static bool zswap_store_folio(struct folio *folio,
>                                struct obj_cgroup *objcg,
>                                struct zswap_pool *pool)
>  {
> -       long index, from_index = 0, nr_pages = folio_nr_pages(folio);
> +       long index = 0, from_index = 0, nr_pages, nr_folio_pages = folio_nr_pages(folio);
>         struct zswap_entry **entries = NULL;
> +       struct crypto_acomp_ctx *acomp_ctx;
>         int node_id = folio_nid(folio);
> +       unsigned int batch_size;
>
> -       entries = kmalloc(nr_pages * sizeof(*entries), GFP_KERNEL);
> +       entries = kmalloc(nr_folio_pages * sizeof(*entries), GFP_KERNEL);
>         if (!entries)
>                 return false;
>
> -       for (index = 0; index < nr_pages; ++index) {
> -               entries[index] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
> +       acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> +       batch_size = acomp_ctx->nr_reqs;
>
> -               if (!entries[index]) {
> -                       zswap_reject_kmemcache_fail++;
> -                       nr_pages = index;
> -                       goto store_folio_failed;
> +       nr_pages = (batch_size > 1) ? nr_folio_pages : 1;
> +
> +       while (1) {
> +               for (index = from_index; index < nr_pages; ++index) {
> +                       entries[index] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
> +
> +                       if (!entries[index]) {
> +                               zswap_reject_kmemcache_fail++;
> +                               nr_pages = index;
> +                               goto store_folio_failed;
> +                       }
> +
> +                       entries[index]->handle = (unsigned long)ERR_PTR(-EINVAL);
>                 }
>
> -               entries[index]->handle = (unsigned long)ERR_PTR(-EINVAL);
> -       }
> +               if (!zswap_compress_folio(folio, entries, from_index, pool, batch_size, acomp_ctx))
> +                       goto store_folio_failed;
>
> -       if (!zswap_compress_folio(folio, entries, pool))
> -               goto store_folio_failed;
> +               for (index = from_index; index < nr_pages; ++index) {
> +                       swp_entry_t page_swpentry = page_swap_entry(folio_page(folio, index));
> +                       struct zswap_entry *old, *entry = entries[index];
>
> -       for (index = 0; index < nr_pages; ++index) {
> -               swp_entry_t page_swpentry = page_swap_entry(folio_page(folio, index));
> -               struct zswap_entry *old, *entry = entries[index];
> +                       old = xa_store(swap_zswap_tree(page_swpentry),
> +                               swp_offset(page_swpentry),
> +                               entry, GFP_KERNEL);
> +                       if (xa_is_err(old)) {
> +                               int err = xa_err(old);
>
> -               old = xa_store(swap_zswap_tree(page_swpentry),
> -                              swp_offset(page_swpentry),
> -                              entry, GFP_KERNEL);
> -               if (xa_is_err(old)) {
> -                       int err = xa_err(old);
> +                               WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> +                               zswap_reject_alloc_fail++;
> +                               from_index = index;
> +                               goto store_folio_failed;
> +                       }
>
> -                       WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> -                       zswap_reject_alloc_fail++;
> -                       from_index = index;
> -                       goto store_folio_failed;
> -               }
> +                       /*
> +                        * We may have had an existing entry that became stale when
> +                        * the folio was redirtied and now the new version is being
> +                        * swapped out. Get rid of the old.
> +                        */
> +                       if (old)
> +                               zswap_entry_free(old);
>
> -               /*
> -                * We may have had an existing entry that became stale when
> -                * the folio was redirtied and now the new version is being
> -                * swapped out. Get rid of the old.
> -                */
> -               if (old)
> -                       zswap_entry_free(old);
> +                       /*
> +                        * The entry is successfully compressed and stored in the tree, there is
> +                        * no further possibility of failure. Grab refs to the pool and objcg,
> +                        * charge zswap memory, and increment zswap_stored_pages.
> +                        * The opposite actions will be performed by zswap_entry_free()
> +                        * when the entry is removed from the tree.
> +                        */
> +                       zswap_pool_get(pool);
> +                       if (objcg) {
> +                               obj_cgroup_get(objcg);
> +                               obj_cgroup_charge_zswap(objcg, entry->length);
> +                       }
> +                       atomic_long_inc(&zswap_stored_pages);
>
> -               /*
> -                * The entry is successfully compressed and stored in the tree, there is
> -                * no further possibility of failure. Grab refs to the pool and objcg,
> -                * charge zswap memory, and increment zswap_stored_pages.
> -                * The opposite actions will be performed by zswap_entry_free()
> -                * when the entry is removed from the tree.
> -                */
> -               zswap_pool_get(pool);
> -               if (objcg) {
> -                       obj_cgroup_get(objcg);
> -                       obj_cgroup_charge_zswap(objcg, entry->length);
> +                       /*
> +                        * We finish initializing the entry while it's already in xarray.
> +                        * This is safe because:
> +                        *
> +                        * 1. Concurrent stores and invalidations are excluded by folio lock.
> +                        *
> +                        * 2. Writeback is excluded by the entry not being on the LRU yet.
> +                        *    The publishing order matters to prevent writeback from seeing
> +                        *    an incoherent entry.
> +                        */
> +                       entry->pool = pool;
> +                       entry->swpentry = page_swpentry;
> +                       entry->objcg = objcg;
> +                       entry->referenced = true;
> +                       if (entry->length) {
> +                               INIT_LIST_HEAD(&entry->lru);
> +                               zswap_lru_add(&zswap_list_lru, entry);
> +                       }
>                 }
> -               atomic_long_inc(&zswap_stored_pages);
>
> -               /*
> -                * We finish initializing the entry while it's already in xarray.
> -                * This is safe because:
> -                *
> -                * 1. Concurrent stores and invalidations are excluded by folio lock.
> -                *
> -                * 2. Writeback is excluded by the entry not being on the LRU yet.
> -                *    The publishing order matters to prevent writeback from seeing
> -                *    an incoherent entry.
> -                */
> -               entry->pool = pool;
> -               entry->swpentry = page_swpentry;
> -               entry->objcg = objcg;
> -               entry->referenced = true;
> -               if (entry->length) {
> -                       INIT_LIST_HEAD(&entry->lru);
> -                       zswap_lru_add(&zswap_list_lru, entry);
> -               }
> +               from_index = nr_pages++;
> +
> +               if (nr_pages > nr_folio_pages)
> +                       break;
>         }
>
> +       acomp_ctx_put_unlock(acomp_ctx);
>         kfree(entries);
>         return true;
>
> @@ -1734,6 +1762,7 @@ static bool zswap_store_folio(struct folio *folio,
>                 zswap_entry_cache_free(entries[index]);
>         }
>
> +       acomp_ctx_put_unlock(acomp_ctx);
>         kfree(entries);
>         return false;
>  }
> --
> 2.27.0
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH v6 16/16] mm: zswap: Fix for zstd performance regression with 2M folios.
  2025-02-20 23:28   ` Nhat Pham
@ 2025-02-21  3:24     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 30+ messages in thread
From: Sridhar, Kanchana P @ 2025-02-21  3:24 UTC (permalink / raw)
  To: Nhat Pham
  Cc: linux-kernel, linux-mm, hannes, yosry.ahmed, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, akpm, linux-crypto, herbert,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Thursday, February 20, 2025 3:29 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v6 16/16] mm: zswap: Fix for zstd performance
> regression with 2M folios.
> 
> On Wed, Feb 5, 2025 at 11:21 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > With the previous patch that enables support for batch compressions in
> > zswap_compress_folio(), a 6.2% throughput regression was seen with zstd
> and
> > 2M folios, using vm-scalability/usemem.
> >
> > For compressors that don't support batching, this was root-caused to the
> > following zswap_store_folio() structure:
> >
> >  Batched stores:
> >  ---------------
> >  - Allocate all entries,
> >  - Compress all entries,
> >  - Store all entries in xarray/LRU.
> 
> Can you clarify why the above structure leads to performance regression?

Good question. My best hypothesis is that it is a combination of branch
mis-predicts and the larger working set, since the zswap_store_folio() code
blocks have to load and iterate over 512 pages in the three loops.

There could also be caching issues involved as Yosry pointed out, since there
are more LLC-store-misses in patch 15 (that has the regression) as compared to
patch 16 (that fixes the regression).

The v6 cover letter has a section "zstd 2M regression fix details" in which
I have posted the patch 15 and patch 16 perf event counters that seem to back up
this hypothesis.

All I can say for certain is that, for compressors that don't support batching
(e.g. zstd), this code flow resolves the regression (I have verified this through
10 runs in each case, to rule out variance):

Sequential stores:
 ------------------
 For each page in folio:
  - allocate an entry,
  - compress the page,
  - store the entry in xarray/LRU.

This explains the implementation in patch 16. In any case, please do let me
know if you have other insights based on the perf event counts.
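
In skeleton form, the two flows being compared are roughly as follows
(the helper names are made up for illustration; the real code is in
patches 15 and 16):

        /* Batched stores: three passes over all 512 entries. */
        for (i = 0; i < nr_pages; i++)
                alloc_entry(i);
        for (i = 0; i < nr_pages; i += batch_size)
                compress_batch(i, batch_size);
        for (i = 0; i < nr_pages; i++)
                publish_entry(i);       /* xarray + LRU */

        /* Sequential stores: one pass, one page at a time. */
        for (i = 0; i < nr_pages; i++) {
                alloc_entry(i);
                compress_page(i);
                publish_entry(i);       /* xarray + LRU */
        }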

Also, I have been incorporating the suggestions from Yosry (some of which have
to do with processing large folios in ZSWAP_MAX_BATCH_SIZE increments to see
if that resolves the regression), and from Herbert with regards to request chaining.
I hope to share the results soon.

Thanks,
Kanchana

> 
> >
> > Hence, the above structure is maintained only for batched stores, and the
> > following structure is implemented for sequential stores of large folio pages,
> > that fixes the zstd regression, while preserving common code paths for
> batched
> > and sequential stores of a folio:
> >
> >  Sequential stores:
> >  ------------------
> >  For each page in folio:
> >   - allocate an entry,
> >   - compress the page,
> >   - store the entry in xarray/LRU.
> >
> > This is submitted as a separate patch only for code review purposes. I will
> > squash this with the previous commit in subsequent versions of this
> > patch-series.
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/zswap.c | 193 ++++++++++++++++++++++++++++++----------------------
> -
> >  1 file changed, 111 insertions(+), 82 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index f1cba77eda62..7bfc720a6201 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1592,40 +1592,32 @@ static bool zswap_batch_compress(struct folio
> *folio,
> >         return ret;
> >  }
> >
> > -static bool zswap_compress_folio(struct folio *folio,
> > -                                struct zswap_entry *entries[],
> > -                                struct zswap_pool *pool)
> > +static __always_inline bool zswap_compress_folio(struct folio *folio,
> > +                                                struct zswap_entry *entries[],
> > +                                                long from_index,
> > +                                                struct zswap_pool *pool,
> > +                                                unsigned int batch_size,
> > +                                                struct crypto_acomp_ctx *acomp_ctx)
> >  {
> > -       long index, nr_pages = folio_nr_pages(folio);
> > -       struct crypto_acomp_ctx *acomp_ctx;
> > -       unsigned int batch_size;
> > +       long index = from_index, nr_pages = folio_nr_pages(folio);
> >         bool ret = true;
> >
> > -       acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> > -       batch_size = acomp_ctx->nr_reqs;
> > -
> >         if ((batch_size > 1) && (nr_pages > 1)) {
> > -               for (index = 0; index < nr_pages; index += batch_size) {
> > +               for (; index < nr_pages; index += batch_size) {
> >
> >                         if (!zswap_batch_compress(folio, index, batch_size,
> >                                                   &entries[index], pool, acomp_ctx)) {
> >                                 ret = false;
> > -                               goto unlock_acomp_ctx;
> > +                               break;
> >                         }
> >                 }
> >         } else {
> > -               for (index = 0; index < nr_pages; ++index) {
> > -                       struct page *page = folio_page(folio, index);
> > +               struct page *page = folio_page(folio, index);
> >
> > -                       if (!zswap_compress(page, entries[index], pool, acomp_ctx)) {
> > -                               ret = false;
> > -                               goto unlock_acomp_ctx;
> > -                       }
> > -               }
> > +               if (!zswap_compress(page, entries[index], pool, acomp_ctx))
> > +                       ret = false;
> >         }
> >
> > -unlock_acomp_ctx:
> > -       acomp_ctx_put_unlock(acomp_ctx);
> >         return ret;
> >  }
> >
> > @@ -1637,92 +1629,128 @@ static bool zswap_compress_folio(struct folio
> *folio,
> >   * handles to ERR_PTR(-EINVAL) at allocation time, and the fact that the
> >   * entry's handle is subsequently modified only upon a successful
> zpool_malloc()
> >   * after the page is compressed.
> > + *
> > + * For compressors that don't support batching, the following structure
> > + * showed a performance regression with zstd/2M folios:
> > + *
> > + * Batched stores:
> > + * ---------------
> > + *  - Allocate all entries,
> > + *  - Compress all entries,
> > + *  - Store all entries in xarray/LRU.
> > + *
> > + * Hence, the above structure is maintained only for batched stores, and
> the
> > + * following structure is implemented for sequential stores of large folio
> pages,
> > + * that fixes the regression, while preserving common code paths for
> batched
> > + * and sequential stores of a folio:
> > + *
> > + * Sequential stores:
> > + * ------------------
> > + * For each page in folio:
> > + *  - allocate an entry,
> > + *  - compress the page,
> > + *  - store the entry in xarray/LRU.
> >   */
> >  static bool zswap_store_folio(struct folio *folio,
> >                                struct obj_cgroup *objcg,
> >                                struct zswap_pool *pool)
> >  {
> > -       long index, from_index = 0, nr_pages = folio_nr_pages(folio);
> > +       long index = 0, from_index = 0, nr_pages, nr_folio_pages =
> folio_nr_pages(folio);
> >         struct zswap_entry **entries = NULL;
> > +       struct crypto_acomp_ctx *acomp_ctx;
> >         int node_id = folio_nid(folio);
> > +       unsigned int batch_size;
> >
> > -       entries = kmalloc(nr_pages * sizeof(*entries), GFP_KERNEL);
> > +       entries = kmalloc(nr_folio_pages * sizeof(*entries), GFP_KERNEL);
> >         if (!entries)
> >                 return false;
> >
> > -       for (index = 0; index < nr_pages; ++index) {
> > -               entries[index] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
> > +       acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> > +       batch_size = acomp_ctx->nr_reqs;
> >
> > -               if (!entries[index]) {
> > -                       zswap_reject_kmemcache_fail++;
> > -                       nr_pages = index;
> > -                       goto store_folio_failed;
> > +       nr_pages = (batch_size > 1) ? nr_folio_pages : 1;
> > +
> > +       while (1) {
> > +               for (index = from_index; index < nr_pages; ++index) {
> > +                       entries[index] = zswap_entry_cache_alloc(GFP_KERNEL,
> node_id);
> > +
> > +                       if (!entries[index]) {
> > +                               zswap_reject_kmemcache_fail++;
> > +                               nr_pages = index;
> > +                               goto store_folio_failed;
> > +                       }
> > +
> > +                       entries[index]->handle = (unsigned long)ERR_PTR(-EINVAL);
> >                 }
> >
> > -               entries[index]->handle = (unsigned long)ERR_PTR(-EINVAL);
> > -       }
> > +               if (!zswap_compress_folio(folio, entries, from_index, pool,
> batch_size, acomp_ctx))
> > +                       goto store_folio_failed;
> >
> > -       if (!zswap_compress_folio(folio, entries, pool))
> > -               goto store_folio_failed;
> > +               for (index = from_index; index < nr_pages; ++index) {
> > +                       swp_entry_t page_swpentry =
> page_swap_entry(folio_page(folio, index));
> > +                       struct zswap_entry *old, *entry = entries[index];
> >
> > -       for (index = 0; index < nr_pages; ++index) {
> > -               swp_entry_t page_swpentry = page_swap_entry(folio_page(folio,
> index));
> > -               struct zswap_entry *old, *entry = entries[index];
> > +                       old = xa_store(swap_zswap_tree(page_swpentry),
> > +                               swp_offset(page_swpentry),
> > +                               entry, GFP_KERNEL);
> > +                       if (xa_is_err(old)) {
> > +                               int err = xa_err(old);
> >
> > -               old = xa_store(swap_zswap_tree(page_swpentry),
> > -                              swp_offset(page_swpentry),
> > -                              entry, GFP_KERNEL);
> > -               if (xa_is_err(old)) {
> > -                       int err = xa_err(old);
> > +                               WARN_ONCE(err != -ENOMEM, "unexpected xarray
> error: %d\n", err);
> > +                               zswap_reject_alloc_fail++;
> > +                               from_index = index;
> > +                               goto store_folio_failed;
> > +                       }
> >
> > -                       WARN_ONCE(err != -ENOMEM, "unexpected xarray error:
> %d\n", err);
> > -                       zswap_reject_alloc_fail++;
> > -                       from_index = index;
> > -                       goto store_folio_failed;
> > -               }
> > +                       /*
> > +                        * We may have had an existing entry that became stale when
> > +                        * the folio was redirtied and now the new version is being
> > +                        * swapped out. Get rid of the old.
> > +                        */
> > +                       if (old)
> > +                               zswap_entry_free(old);
> >
> > -               /*
> > -                * We may have had an existing entry that became stale when
> > -                * the folio was redirtied and now the new version is being
> > -                * swapped out. Get rid of the old.
> > -                */
> > -               if (old)
> > -                       zswap_entry_free(old);
> > +                       /*
> > +                        * The entry is successfully compressed and stored in the tree,
> there is
> > +                        * no further possibility of failure. Grab refs to the pool and
> objcg,
> > +                        * charge zswap memory, and increment
> zswap_stored_pages.
> > +                        * The opposite actions will be performed by
> zswap_entry_free()
> > +                        * when the entry is removed from the tree.
> > +                        */
> > +                       zswap_pool_get(pool);
> > +                       if (objcg) {
> > +                               obj_cgroup_get(objcg);
> > +                               obj_cgroup_charge_zswap(objcg, entry->length);
> > +                       }
> > +                       atomic_long_inc(&zswap_stored_pages);
> >
> > -               /*
> > -                * The entry is successfully compressed and stored in the tree,
> there is
> > -                * no further possibility of failure. Grab refs to the pool and objcg,
> > -                * charge zswap memory, and increment zswap_stored_pages.
> > -                * The opposite actions will be performed by zswap_entry_free()
> > -                * when the entry is removed from the tree.
> > -                */
> > -               zswap_pool_get(pool);
> > -               if (objcg) {
> > -                       obj_cgroup_get(objcg);
> > -                       obj_cgroup_charge_zswap(objcg, entry->length);
> > +                       /*
> > +                        * We finish initializing the entry while it's already in xarray.
> > +                        * This is safe because:
> > +                        *
> > +                        * 1. Concurrent stores and invalidations are excluded by folio
> lock.
> > +                        *
> > +                        * 2. Writeback is excluded by the entry not being on the LRU
> yet.
> > +                        *    The publishing order matters to prevent writeback from
> seeing
> > +                        *    an incoherent entry.
> > +                        */
> > +                       entry->pool = pool;
> > +                       entry->swpentry = page_swpentry;
> > +                       entry->objcg = objcg;
> > +                       entry->referenced = true;
> > +                       if (entry->length) {
> > +                               INIT_LIST_HEAD(&entry->lru);
> > +                               zswap_lru_add(&zswap_list_lru, entry);
> > +                       }
> >                 }
> > -               atomic_long_inc(&zswap_stored_pages);
> >
> > -               /*
> > -                * We finish initializing the entry while it's already in xarray.
> > -                * This is safe because:
> > -                *
> > -                * 1. Concurrent stores and invalidations are excluded by folio lock.
> > -                *
> > -                * 2. Writeback is excluded by the entry not being on the LRU yet.
> > -                *    The publishing order matters to prevent writeback from seeing
> > -                *    an incoherent entry.
> > -                */
> > -               entry->pool = pool;
> > -               entry->swpentry = page_swpentry;
> > -               entry->objcg = objcg;
> > -               entry->referenced = true;
> > -               if (entry->length) {
> > -                       INIT_LIST_HEAD(&entry->lru);
> > -                       zswap_lru_add(&zswap_list_lru, entry);
> > -               }
> > +               from_index = nr_pages++;
> > +
> > +               if (nr_pages > nr_folio_pages)
> > +                       break;
> >         }
> >
> > +       acomp_ctx_put_unlock(acomp_ctx);
> >         kfree(entries);
> >         return true;
> >
> > @@ -1734,6 +1762,7 @@ static bool zswap_store_folio(struct folio *folio,
> >                 zswap_entry_cache_free(entries[index]);
> >         }
> >
> > +       acomp_ctx_put_unlock(acomp_ctx);
> >         kfree(entries);
> >         return false;
> >  }
> > --
> > 2.27.0
> >

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH v6 12/16] mm: zswap: Allocate pool batching resources if the compressor supports batching.
  2025-02-06 18:55   ` Yosry Ahmed
@ 2025-02-28 10:00     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 30+ messages in thread
From: Sridhar, Kanchana P @ 2025-02-28 10:00 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, akpm, linux-crypto, herbert,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P

Hi Yosry,

> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Thursday, February 6, 2025 10:55 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v6 12/16] mm: zswap: Allocate pool batching resources if
> the compressor supports batching.
> 
> On Wed, Feb 05, 2025 at 11:20:58PM -0800, Kanchana P Sridhar wrote:
> > This patch adds support for the per-CPU acomp_ctx to track multiple
> > compression/decompression requests. The zswap_cpu_comp_prepare() cpu
> 
> nit: s/cpu/CPU

Thanks, this is fixed in v7.

> 
> > onlining code will check if the compressor supports batching. If so, it
> > will allocate the necessary batching resources.
> >
> > However, zswap does not use more than one request yet. Follow-up
> patches
> > will actually utilize the multiple acomp_ctx requests/buffers for batch
> > compression/decompression of multiple pages.
> >
> > The newly added ZSWAP_MAX_BATCH_SIZE limits the amount of extra
> memory used
> > for batching. There is no extra memory usage for compressors that do not
> > support batching.
> 
> That's not entirely accurate; there's a tiny bit of extra overhead to
> allocate the arrays. It can be avoided, but I am not sure it's worth the
> complexity.

You're right. The v7 commit log has been amended accordingly.

> 
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/zswap.c | 132 +++++++++++++++++++++++++++++++++++++++--------
> ------
> >  1 file changed, 98 insertions(+), 34 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index a2baceed3bf9..dc7d1ff04b22 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -78,6 +78,16 @@ static bool zswap_pool_reached_full;
> >
> >  #define ZSWAP_PARAM_UNSET ""
> >
> > +/*
> > + * For compression batching of large folios:
> > + * Maximum number of acomp compress requests that will be processed
> > + * in a batch, iff the zswap compressor supports batching.
> > + * This limit exists because we preallocate enough requests and buffers
> > + * in the per-cpu acomp_ctx accordingly. Hence, a higher limit means
> higher
> > + * memory usage.
> > + */
> > +#define ZSWAP_MAX_BATCH_SIZE 8U
> > +
> >  static int zswap_setup(void);
> >
> >  /* Enable/disable zswap */
> > @@ -143,9 +153,10 @@ bool zswap_never_enabled(void)
> >
> >  struct crypto_acomp_ctx {
> >  	struct crypto_acomp *acomp;
> > -	struct acomp_req *req;
> > +	struct acomp_req **reqs;
> > +	u8 **buffers;
> > +	unsigned int nr_reqs;
> >  	struct crypto_wait wait;
> > -	u8 *buffer;
> >  	struct mutex mutex;
> >  	bool is_sleepable;
> >  };
> > @@ -821,15 +832,13 @@ static int zswap_cpu_comp_prepare(unsigned int
> cpu, struct hlist_node *node)
> >  	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> node);
> >  	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> >acomp_ctx, cpu);
> >  	struct crypto_acomp *acomp = NULL;
> > -	struct acomp_req *req = NULL;
> > -	u8 *buffer = NULL;
> > -	int ret;
> > +	unsigned int nr_reqs = 1;
> > +	int ret = -ENOMEM;
> > +	int i;
> >
> > -	buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
> cpu_to_node(cpu));
> > -	if (!buffer) {
> > -		ret = -ENOMEM;
> > -		goto fail;
> > -	}
> > +	acomp_ctx->buffers = NULL;
> > +	acomp_ctx->reqs = NULL;
> > +	acomp_ctx->nr_reqs = 0;
> >
> >  	acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0,
> cpu_to_node(cpu));
> >  	if (IS_ERR(acomp)) {
> > @@ -839,12 +848,30 @@ static int zswap_cpu_comp_prepare(unsigned int
> cpu, struct hlist_node *node)
> >  		goto fail;
> >  	}
> >
> > -	req = acomp_request_alloc(acomp);
> > -	if (!req) {
> > -		pr_err("could not alloc crypto acomp_request %s\n",
> > -		       pool->tfm_name);
> > -		ret = -ENOMEM;
> > +	if (acomp_has_async_batching(acomp))
> > +		nr_reqs = min(ZSWAP_MAX_BATCH_SIZE,
> crypto_acomp_batch_size(acomp));
> 
> Do we need to check acomp_has_async_batching() here? Shouldn't
> crypto_acomp_batch_size() just return 1 if batching is not supported?

Good catch. This is fixed in v7.
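
For reference, the direction taken is roughly (sketch, not the exact
v7 code):

        /* v6, as posted above: */
        if (acomp_has_async_batching(acomp))
                nr_reqs = min(ZSWAP_MAX_BATCH_SIZE, crypto_acomp_batch_size(acomp));

        /*
         * Simplification, relying on crypto_acomp_batch_size() returning 1
         * when batching is not supported, as you suggest:
         */
        nr_reqs = min(ZSWAP_MAX_BATCH_SIZE, crypto_acomp_batch_size(acomp));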

> 
> > +
> > +	acomp_ctx->buffers = kcalloc_node(nr_reqs, sizeof(u8 *),
> GFP_KERNEL, cpu_to_node(cpu));
> > +	if (!acomp_ctx->buffers)
> > +		goto fail;
> > +
> > +	for (i = 0; i < nr_reqs; ++i) {
> > +		acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2,
> GFP_KERNEL, cpu_to_node(cpu));
> > +		if (!acomp_ctx->buffers[i])
> > +			goto fail;
> > +	}
> > +
> > +	acomp_ctx->reqs = kcalloc_node(nr_reqs, sizeof(struct acomp_req *),
> GFP_KERNEL, cpu_to_node(cpu));
> > +	if (!acomp_ctx->reqs)
> >  		goto fail;
> > +
> > +	for (i = 0; i < nr_reqs; ++i) {
> > +		acomp_ctx->reqs[i] = acomp_request_alloc(acomp);
> > +		if (!acomp_ctx->reqs[i]) {
> > +			pr_err("could not alloc crypto acomp_request
> reqs[%d] %s\n",
> > +			       i, pool->tfm_name);
> > +			goto fail;
> > +		}
> >  	}
> >
> >  	/*
> > @@ -853,6 +880,13 @@ static int zswap_cpu_comp_prepare(unsigned int
> cpu, struct hlist_node *node)
> >  	 * again resulting in a deadlock.
> >  	 */
> >  	mutex_lock(&acomp_ctx->mutex);
> 
> I had moved all the acomp_ctx initializations under the mutex to keep
> its state always fully initialized or uninitialized for anyone holding
> the lock. With this change, acomp_ctx->reqs will be set to non-NULL
> before the mutex is held and the acomp_ctx is fully initialized.
> 
> The code in the compression/decompression path uses acomp_ctx->reqes to
> check if the acomp_ctx can be used. While I don't believe it's currently
> possible for zswap_cpu_comp_prepare() to race with these paths, I did
> this to be future proof. I don't want the code to end up initializing
> some of the struct under the lock and some of it without it.
> 
> So I think there's two options here:
> - Do the due diligence check that holding the mutex is not required when
>   initializing acomp_ctx here, and remove the mutex locking here
>   completely.
> - Keep the initializations in the lock critical section (i.e. allocate
>   everything first, then initialize under the lock).

Thanks again, another great catch! In v7, I have attempted to simplify
the acomp_ctx resource allocation/deallocation with respect to CPU onlining
and offlining. The main idea is that the acomp_ctx resources should live
from pool creation to pool deletion.
The v7 code comments and commit log provide more details. I would greatly
appreciate your review comments on this approach.
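
Conceptually, the direction is (a hypothetical sketch;
acomp_ctx_alloc_resources()/acomp_ctx_dealloc_resources() are made-up
names for the alloc/free paths in this patch):

        int cpu, ret;

        /* At pool creation (e.g. zswap_pool_create()): allocate once per CPU. */
        for_each_possible_cpu(cpu) {
                ret = acomp_ctx_alloc_resources(per_cpu_ptr(pool->acomp_ctx, cpu),
                                                pool->tfm_name, cpu);
                if (ret)
                        break;  /* unwind and fail pool creation */
        }

        /* At pool deletion (e.g. zswap_pool_destroy()): free the resources. */
        for_each_possible_cpu(cpu)
                acomp_ctx_dealloc_resources(per_cpu_ptr(pool->acomp_ctx, cpu));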

> 
> > +
> > +	/*
> > +	 * The crypto_wait is used only in fully synchronous, i.e., with scomp
> > +	 * or non-poll mode of acomp, hence there is only one "wait" per
> > +	 * acomp_ctx, with callback set to reqs[0], under the assumption that
> > +	 * there is at least 1 request per acomp_ctx.
> > +	 */
> >  	crypto_init_wait(&acomp_ctx->wait);
> >
> >  	/*
> > @@ -860,20 +894,33 @@ static int zswap_cpu_comp_prepare(unsigned int
> cpu, struct hlist_node *node)
> >  	 * crypto_wait_req(); if the backend of acomp is scomp, the callback
> >  	 * won't be called, crypto_wait_req() will return without blocking.
> >  	 */
> > -	acomp_request_set_callback(req,
> CRYPTO_TFM_REQ_MAY_BACKLOG,
> > +	acomp_request_set_callback(acomp_ctx->reqs[0],
> CRYPTO_TFM_REQ_MAY_BACKLOG,
> >  				   crypto_req_done, &acomp_ctx->wait);
> >
> > -	acomp_ctx->buffer = buffer;
> > +	acomp_ctx->nr_reqs = nr_reqs;
> >  	acomp_ctx->acomp = acomp;
> >  	acomp_ctx->is_sleepable = acomp_is_async(acomp);
> > -	acomp_ctx->req = req;
> >  	mutex_unlock(&acomp_ctx->mutex);
> >  	return 0;
> >
> >  fail:
> > +	if (acomp_ctx->buffers) {
> > +		for (i = 0; i < nr_reqs; ++i)
> > +			kfree(acomp_ctx->buffers[i]);
> > +		kfree(acomp_ctx->buffers);
> > +		acomp_ctx->buffers = NULL;
> > +	}
> > +
> > +	if (acomp_ctx->reqs) {
> > +		for (i = 0; i < nr_reqs; ++i)
> > +			if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
> > +				acomp_request_free(acomp_ctx->reqs[i]);
> > +		kfree(acomp_ctx->reqs);
> > +		acomp_ctx->reqs = NULL;
> > +	}
> > +
> >  	if (acomp)
> >  		crypto_free_acomp(acomp);
> > -	kfree(buffer);
> >  	return ret;
> >  }
> >
> > @@ -883,14 +930,31 @@ static int zswap_cpu_comp_dead(unsigned int
> cpu, struct hlist_node *node)
> >  	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> >acomp_ctx, cpu);
> >
> >  	mutex_lock(&acomp_ctx->mutex);
> > +
> >  	if (!IS_ERR_OR_NULL(acomp_ctx)) {
> > -		if (!IS_ERR_OR_NULL(acomp_ctx->req))
> > -			acomp_request_free(acomp_ctx->req);
> > -		acomp_ctx->req = NULL;
> > +		int i;
> > +
> > +		if (acomp_ctx->reqs) {
> > +			for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> > +				if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
> > +					acomp_request_free(acomp_ctx-
> >reqs[i]);
> > +			kfree(acomp_ctx->reqs);
> > +			acomp_ctx->reqs = NULL;
> > +		}
> > +
> > +		if (acomp_ctx->buffers) {
> > +			for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> > +				kfree(acomp_ctx->buffers[i]);
> > +			kfree(acomp_ctx->buffers);
> > +			acomp_ctx->buffers = NULL;
> > +		}
> > +
> 
> The code here seems to be almost exactly like the failure path in
> zswap_cpu_comp_prepare(), would it be better to put it in a helper?

Makes sense, and addressed in v7.
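
Roughly, such a helper could look like this (a sketch with a made-up
name; the v7 code may differ):

        static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx,
                                      unsigned int nr_reqs)
        {
                int i;

                if (acomp_ctx->reqs) {
                        for (i = 0; i < nr_reqs; ++i)
                                if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
                                        acomp_request_free(acomp_ctx->reqs[i]);
                        kfree(acomp_ctx->reqs);
                        acomp_ctx->reqs = NULL;
                }

                if (acomp_ctx->buffers) {
                        for (i = 0; i < nr_reqs; ++i)
                                kfree(acomp_ctx->buffers[i]);
                        kfree(acomp_ctx->buffers);
                        acomp_ctx->buffers = NULL;
                }

                if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
                        crypto_free_acomp(acomp_ctx->acomp);

                acomp_ctx->nr_reqs = 0;
        }

Passing nr_reqs explicitly lets the same helper serve the error path in
zswap_cpu_comp_prepare(), where acomp_ctx->nr_reqs has not been set yet.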

> 
> >  		if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> >  			crypto_free_acomp(acomp_ctx->acomp);
> > -		kfree(acomp_ctx->buffer);
> > +
> > +		acomp_ctx->nr_reqs = 0;
> >  	}
> > +
> >  	mutex_unlock(&acomp_ctx->mutex);
> >
> >  	return 0;
> > @@ -903,7 +967,7 @@ static struct crypto_acomp_ctx
> *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
> >  	for (;;) {
> >  		acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> >  		mutex_lock(&acomp_ctx->mutex);
> > -		if (likely(acomp_ctx->req))
> > +		if (likely(acomp_ctx->reqs))
> >  			return acomp_ctx;
> >  		/*
> >  		 * It is possible that we were migrated to a different CPU
> after
> > @@ -935,7 +999,7 @@ static bool zswap_compress(struct page *page,
> struct zswap_entry *entry,
> >  	u8 *dst;
> >
> >  	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> > -	dst = acomp_ctx->buffer;
> > +	dst = acomp_ctx->buffers[0];
> >  	sg_init_table(&input, 1);
> >  	sg_set_page(&input, page, PAGE_SIZE, 0);
> >
> > @@ -945,7 +1009,7 @@ static bool zswap_compress(struct page *page,
> struct zswap_entry *entry,
> >  	 * giving the dst buffer with enough length to avoid buffer overflow.
> >  	 */
> >  	sg_init_one(&output, dst, PAGE_SIZE * 2);
> > -	acomp_request_set_params(acomp_ctx->req, &input, &output,
> PAGE_SIZE, dlen);
> > +	acomp_request_set_params(acomp_ctx->reqs[0], &input, &output,
> PAGE_SIZE, dlen);
> >
> >  	/*
> >  	 * it maybe looks a little bit silly that we send an asynchronous
> request,
> > @@ -959,8 +1023,8 @@ static bool zswap_compress(struct page *page,
> struct zswap_entry *entry,
> >  	 * but in different threads running on different cpu, we have different
> >  	 * acomp instance, so multiple threads can do (de)compression in
> parallel.
> >  	 */
> > -	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx-
> >req), &acomp_ctx->wait);
> > -	dlen = acomp_ctx->req->dlen;
> > +	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx-
> >reqs[0]), &acomp_ctx->wait);
> > +	dlen = acomp_ctx->reqs[0]->dlen;
> >  	if (comp_ret)
> >  		goto unlock;
> >
> > @@ -1011,19 +1075,19 @@ static void zswap_decompress(struct
> zswap_entry *entry, struct folio *folio)
> >  	 */
> >  	if ((acomp_ctx->is_sleepable && !zpool_can_sleep_mapped(zpool))
> ||
> >  	    !virt_addr_valid(src)) {
> > -		memcpy(acomp_ctx->buffer, src, entry->length);
> > -		src = acomp_ctx->buffer;
> > +		memcpy(acomp_ctx->buffers[0], src, entry->length);
> > +		src = acomp_ctx->buffers[0];
> >  		zpool_unmap_handle(zpool, entry->handle);
> >  	}
> >
> >  	sg_init_one(&input, src, entry->length);
> >  	sg_init_table(&output, 1);
> >  	sg_set_folio(&output, folio, PAGE_SIZE, 0);
> > -	acomp_request_set_params(acomp_ctx->req, &input, &output,
> entry->length, PAGE_SIZE);
> > -	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx-
> >req), &acomp_ctx->wait));
> > -	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> > +	acomp_request_set_params(acomp_ctx->reqs[0], &input, &output,
> entry->length, PAGE_SIZE);
> > +	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->reqs[0]), &acomp_ctx->wait));
> > +	BUG_ON(acomp_ctx->reqs[0]->dlen != PAGE_SIZE);
> >
> > -	if (src != acomp_ctx->buffer)
> > +	if (src != acomp_ctx->buffers[0])
> >  		zpool_unmap_handle(zpool, entry->handle);
> >  	acomp_ctx_put_unlock(acomp_ctx);
> >  }
> > --
> > 2.27.0
> >


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH v6 15/16] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios.
  2025-02-06 19:24     ` Sridhar, Kanchana P
@ 2025-02-28 10:00       ` Sridhar, Kanchana P
  0 siblings, 0 replies; 30+ messages in thread
From: Sridhar, Kanchana P @ 2025-02-28 10:00 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, akpm, linux-crypto, herbert,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Thursday, February 6, 2025 11:24 AM
> To: Yosry Ahmed <yosry.ahmed@linux.dev>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>; Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com>
> Subject: RE: [PATCH v6 15/16] mm: zswap: Compress batching with Intel IAA
> in zswap_store() of large folios.
> 
> 
> > -----Original Message-----
> > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > Sent: Thursday, February 6, 2025 11:11 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> > herbert@gondor.apana.org.au; davem@davemloft.net;
> > clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> > surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v6 15/16] mm: zswap: Compress batching with Intel IAA
> > in zswap_store() of large folios.
> >
> > On Wed, Feb 05, 2025 at 11:21:01PM -0800, Kanchana P Sridhar wrote:
> > > zswap_compress_folio() is modified to detect if the pool's acomp_ctx has
> > > more than one "nr_reqs", which will be the case if the cpu onlining code
> > > has allocated multiple batching resources in the acomp_ctx. If so, it means
> > > compress batching can be used with a batch-size of "acomp_ctx->nr_reqs".
> > >
> > > If compress batching can be used, zswap_compress_folio() will invoke the
> > > newly added zswap_batch_compress() procedure to compress and store the
> > > folio in batches of "acomp_ctx->nr_reqs" pages.
> > >
> > > With Intel IAA, the iaa_crypto driver will compress each batch of pages in
> > > parallel in hardware.
> > >
> > > Hence, zswap_batch_compress() does the same computes for a batch, as
> > > zswap_compress() does for a page; and returns true if the batch was
> > > successfully compressed/stored, and false otherwise.
> > >
> > > If the pool does not support compress batching, or the folio has only one
> > > page, zswap_compress_folio() calls zswap_compress() for each individual
> > > page in the folio, as before.
> > >
> > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > > ---
> > >  mm/zswap.c | 122 +++++++++++++++++++++++++++++++++++++++++++++++++----
> > >  1 file changed, 113 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index 6563d12e907b..f1cba77eda62 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -985,10 +985,11 @@ static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx)
> > >  	mutex_unlock(&acomp_ctx->mutex);
> > >  }
> > >
> > > +/* The per-cpu @acomp_ctx mutex should be locked/unlocked in the caller. */
> >
> > Please use lockdep assertions rather than comments for internal locking rules.
> 
> Sure. Thanks for the suggestion.
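For v7, the plan is to replace the comment with a lockdep annotation; a
minimal sketch, assuming the per-CPU mutex remains acomp_ctx->mutex as in this
series:

	/* At the top of zswap_compress() and zswap_batch_compress(): */
	lockdep_assert_held(&acomp_ctx->mutex);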
> 
> >
> > >  static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> > > -			   struct zswap_pool *pool)
> > > +			   struct zswap_pool *pool,
> > > +			   struct crypto_acomp_ctx *acomp_ctx)
> > >  {
> > > -	struct crypto_acomp_ctx *acomp_ctx;
> > >  	struct scatterlist input, output;
> > >  	int comp_ret = 0, alloc_ret = 0;
> > >  	unsigned int dlen = PAGE_SIZE;
> > > @@ -998,7 +999,6 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> > >  	gfp_t gfp;
> > >  	u8 *dst;
> > >
> > > -	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> > >  	dst = acomp_ctx->buffers[0];
> > >  	sg_init_table(&input, 1);
> > >  	sg_set_page(&input, page, PAGE_SIZE, 0);
> > > @@ -1051,7 +1051,6 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> > >  	else if (alloc_ret)
> > >  		zswap_reject_alloc_fail++;
> > >
> > > -	acomp_ctx_put_unlock(acomp_ctx);
> > >  	return comp_ret == 0 && alloc_ret == 0;
> > >  }
> > >
> > > @@ -1509,20 +1508,125 @@ static void shrink_worker(struct work_struct *w)
> > >  * main API
> > >  **********************************/
> > >
> > > +/* The per-cpu @acomp_ctx mutex should be locked/unlocked in the caller. */
> > > +static bool zswap_batch_compress(struct folio *folio,
> > > +				 long index,
> > > +				 unsigned int batch_size,
> > > +				 struct zswap_entry *entries[],
> > > +				 struct zswap_pool *pool,
> > > +				 struct crypto_acomp_ctx *acomp_ctx)
> > > +{
> > > +	int comp_errors[ZSWAP_MAX_BATCH_SIZE] = { 0 };
> > > +	unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
> > > +	struct page *pages[ZSWAP_MAX_BATCH_SIZE];
> > > +	unsigned int i, nr_batch_pages;
> > > +	bool ret = true;
> > > +
> > > +	nr_batch_pages = min((unsigned int)(folio_nr_pages(folio) - index), batch_size);
> > > +
> > > +	for (i = 0; i < nr_batch_pages; ++i) {
> > > +		pages[i] = folio_page(folio, index + i);
> > > +		dlens[i] = PAGE_SIZE;
> > > +	}
> > > +
> > > +	/*
> > > +	 * Batch compress @nr_batch_pages. If IAA is the compressor, the
> > > +	 * hardware will compress @nr_batch_pages in parallel.
> > > +	 */
> >
> > Please do not specifically mention IAA in zswap.c, as batching could be
> > supported in the future by other compressors.
> 
> Ok.
> 
> >
> > > +	ret = crypto_acomp_batch_compress(
> > > +		acomp_ctx->reqs,
> > > +		NULL,
> > > +		pages,
> > > +		acomp_ctx->buffers,
> > > +		dlens,
> > > +		comp_errors,
> > > +		nr_batch_pages);
> >
> > Does crypto_acomp_batch_compress() not require calling
> > crypto_wait_req()?
> 
> It actually doesn't. If the crypto_wait parameter is NULL, the API requires
> the driver to provide a way to process request completions asynchronously,
> as described in patch 2 that adds the crypto batching API.
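To illustrate, based on the v6 signature quoted above (the &acomp_ctx->wait
variant is shown purely as a hypothetical example of what a waiting caller
would look like, not as code from this series):

	/* Async: no crypto_wait is passed, so the driver itself polls for
	 * completions; this is the call zswap_batch_compress() makes above. */
	ret = crypto_acomp_batch_compress(acomp_ctx->reqs, NULL, pages,
					  acomp_ctx->buffers, dlens,
					  comp_errors, nr_batch_pages);

	/* Hypothetical waiting variant: a crypto_wait would let the API block
	 * until the whole batch completes. */
	ret = crypto_acomp_batch_compress(acomp_ctx->reqs, &acomp_ctx->wait,
					  pages, acomp_ctx->buffers, dlens,
					  comp_errors, nr_batch_pages);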
> 
> >
> > > +
> > > +	if (ret) {
> > > +		/*
> > > +		 * All batch pages were successfully compressed.
> > > +		 * Store the pages in zpool.
> > > +		 */
> > > +		struct zpool *zpool = pool->zpool;
> > > +		gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
> > > +
> > > +		if (zpool_malloc_support_movable(zpool))
> > > +			gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
> > > +
> > > +		for (i = 0; i < nr_batch_pages; ++i) {
> > > +			unsigned long handle;
> > > +			char *buf;
> > > +			int err;
> > > +
> > > +			err = zpool_malloc(zpool, dlens[i], gfp, &handle);
> > > +
> > > +			if (err) {
> > > +				if (err == -ENOSPC)
> > > +					zswap_reject_compress_poor++;
> > > +				else
> > > +					zswap_reject_alloc_fail++;
> > > +
> > > +				ret = false;
> > > +				break;
> > > +			}
> > > +
> > > +			buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
> > > +			memcpy(buf, acomp_ctx->buffers[i], dlens[i]);
> > > +			zpool_unmap_handle(zpool, handle);
> > > +
> > > +			entries[i]->handle = handle;
> > > +			entries[i]->length = dlens[i];
> > > +		}
> > > +	} else {
> > > +		/* Some batch pages had compression errors. */
> > > +		for (i = 0; i < nr_batch_pages; ++i) {
> > > +			if (comp_errors[i]) {
> > > +				if (comp_errors[i] == -ENOSPC)
> > > +					zswap_reject_compress_poor++;
> > > +				else
> > > +					zswap_reject_compress_fail++;
> > > +			}
> > > +		}
> > > +	}
> >
> > This function is awfully close to zswap_compress(). It's essentially a
> > vectorized version and uses crypto_acomp_batch_compress() instead of
> > crypto_acomp_compress().
> >
> > My questions are:
> > - Can we use crypto_acomp_batch_compress() for the non-batched case as
> >   well to unify the code? Does it cause any regressions?
> >
> > - If we have to use different compressions APIs, can we at least reuse
> >   the rest of the code? We can abstract the compression call into a
> >   helper that chooses the appropriate API based on the batch size. The
> >   rest should be the same AFAICT.
> 
> All good ideas. Let me think about this some more, and gather some data.

Based on Herbert's suggestion, in v7 I have separated out the compress batching
API into a new zswap_batch_compress() that uses request chaining.
zswap_compress() remains in its current form; the only difference is that the
mutex lock is now taken once, in zswap_store_folio(), which decides whether or
not to batch.
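As a rough sketch of what the request-chaining path looks like on the zswap
side (the chaining helper and its signature here are assumptions based on this
series, not the actual v7 code):

	/* Build one chain of requests for the batch, then submit it once; a
	 * single-request "chain" degenerates to today's sequential behavior. */
	for (i = 0; i < nr_batch_pages; ++i) {
		sg_init_table(&inputs[i], 1);
		sg_set_page(&inputs[i], pages[i], PAGE_SIZE, 0);
		sg_init_one(&outputs[i], acomp_ctx->buffers[i], PAGE_SIZE * 2);
		acomp_request_set_params(acomp_ctx->reqs[i], &inputs[i],
					 &outputs[i], PAGE_SIZE, PAGE_SIZE);
		if (i)
			acomp_request_chain(acomp_ctx->reqs[i], acomp_ctx->reqs[0]);
	}

	err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]),
			      &acomp_ctx->wait);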

I will provide more details in response to your comments about exploring
smaller batches to address the zstd regression seen in v5 and fixed in v6.

Thanks,
Kanchana

> 
> Thanks,
> Kanchana
> 
> >
> > > +
> > > +	return ret;
> > > +}
> > > +
> > >  static bool zswap_compress_folio(struct folio *folio,
> > >  				 struct zswap_entry *entries[],
> > >  				 struct zswap_pool *pool)
> > >  {
> > >  	long index, nr_pages = folio_nr_pages(folio);
> > > +	struct crypto_acomp_ctx *acomp_ctx;
> > > +	unsigned int batch_size;
> > > +	bool ret = true;
> > >
> > > -	for (index = 0; index < nr_pages; ++index) {
> > > -		struct page *page = folio_page(folio, index);
> > > +	acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> > > +	batch_size = acomp_ctx->nr_reqs;
> > > +
> > > +	if ((batch_size > 1) && (nr_pages > 1)) {
> > > +		for (index = 0; index < nr_pages; index += batch_size) {
> > > +
> > > +			if (!zswap_batch_compress(folio, index, batch_size,
> > > +						  &entries[index], pool, acomp_ctx)) {
> > > +				ret = false;
> > > +				goto unlock_acomp_ctx;
> > > +			}
> > > +		}
> > > +	} else {
> > > +		for (index = 0; index < nr_pages; ++index) {
> > > +			struct page *page = folio_page(folio, index);
> > >
> > > -		if (!zswap_compress(page, entries[index], pool))
> > > -			return false;
> > > +			if (!zswap_compress(page, entries[index], pool, acomp_ctx)) {
> > > +				ret = false;
> > > +				goto unlock_acomp_ctx;
> > > +			}
> > > +		}
> > >  	}
> > >
> > > -	return true;
> > > +unlock_acomp_ctx:
> > > +	acomp_ctx_put_unlock(acomp_ctx);
> > > +	return ret;
> > >  }
> > >
> > >  /*
> > > --
> > > 2.27.0
> > >


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH v6 16/16] mm: zswap: Fix for zstd performance regression with 2M folios.
  2025-02-06 19:15   ` Yosry Ahmed
@ 2025-02-28 10:00     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 30+ messages in thread
From: Sridhar, Kanchana P @ 2025-02-28 10:00 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, akpm, linux-crypto, herbert,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P

Hi Yosry, Nhat,

> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Thursday, February 6, 2025 11:15 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v6 16/16] mm: zswap: Fix for zstd performance
> regression with 2M folios.
> 
> On Wed, Feb 05, 2025 at 11:21:02PM -0800, Kanchana P Sridhar wrote:
> > With the previous patch that enables support for batch compressions in
> > zswap_compress_folio(), a 6.2% throughput regression was seen with zstd and
> > 2M folios, using vm-scalability/usemem.
> >
> > For compressors that don't support batching, this was root-caused to the
> > following zswap_store_folio() structure:
> >
> >  Batched stores:
> >  ---------------
> >  - Allocate all entries,
> >  - Compress all entries,
> >  - Store all entries in xarray/LRU.
> >
> > Hence, the above structure is maintained only for batched stores, and the
> > following structure is implemented for sequential stores of large folio pages,
> > that fixes the zstd regression, while preserving common code paths for batched
> > and sequential stores of a folio:
> >
> >  Sequential stores:
> >  ------------------
> >  For each page in folio:
> >   - allocate an entry,
> >   - compress the page,
> >   - store the entry in xarray/LRU.
> >
> > This is submitted as a separate patch only for code review purposes. I will
> > squash this with the previous commit in subsequent versions of this
> > patch-series.
> 
> Could it be the cache locality?

It is possible, since there are more LLC-store-misses with v6 patch 15 (which
has the regression) than with v6 patch 16 (which fixes the regression).

> 
> I wonder if we should do what Chengming initially suggested and batch
> everything at ZSWAP_MAX_BATCH_SIZE instead. Instead of
> zswap_compress_folio() operating on the entire folio, we can operate on
> batches of size ZSWAP_MAX_BATCH_SIZE, regardless of whether the
> underlying compressor supports batching.
> 
> If we do this, instead of:
> - Allocate all entries
> - Compress all entries
> - Store all entries
> 
> We can do:
>   - For each batch (8 entries)
>   	- Allocate all entries
> 	- Compress all entries
> 	- Store all entries
> 
> This should help unify the code, and I suspect it may also fix the zstd
> regression. We can also skip the entries array allocation and use one on
> the stack.

I tried the above, as well as additional approaches to unify the code, and all
of them caused a regression in zstd throughput and latency in the usemem
30-process experiments:

1) Operate on batches of size ZSWAP_MAX_BATCH_SIZE, regardless of
   whether the underlying compressor supports batching. I tried this without
   and with upfront allocation of all entries.
2) Unify the post-compression zpool malloc, handle mapping and memcpy steps
   done in zswap_compress() and zswap_batch_compress() by moving them
   into a procedure called from both functions. I tried this without
   and with inlining that procedure.
3) With the v7 changes that make the acomp_ctx resources' lifetime span from
   pool creation to pool deletion, I wanted to make another functional change
   in acomp_ctx_get_cpu_lock(), as described in the code comments in v7.
   However, I found that making just this change in zswap_decompress() caused
   a zstd regression:

		do {
			acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool);
		} while (!acomp_ctx);

		/*
		 * If the CPU onlining code successfully allocates acomp_ctx resources,
		 * it sets acomp_ctx->initialized to true. Until this happens, we have
		 * two options:
		 *
		 * 1. Return NULL and fail all stores on this CPU.
		 * 2. Retry, until onlining has finished allocating resources.
		 *
		 * In theory, option 1 could be more appropriate, because it
		 * allows the calling procedure to decide how it wants to handle
		 * reclaim racing with CPU hotplug. For instance, it might be Ok
		 * for compress to return an error for the backing swap device
		 * to store the folio. Decompress could wait until we get a
		 * valid and locked mutex after onlining has completed. For now,
		 * we go with option 2 because adding a do-while in
		 * zswap_decompress() adds latency for software compressors.
		*/ 

My main learning from these attempts to resolve the zstd regression through
alternate approaches, and to unify common computation into shared procedures
that zswap_compress() and zswap_batch_compress() would each call, is this:
zstd is extremely sensitive to any and all compute overheads. The above
approaches quite consistently yield approximately these zstd numbers with
64K folios (I will summarize the mm-unstable data in v7):

Total throughput (KB/s)   :  6,909,838
Average throughput (KB/s) :    230,327
elapsed time (sec)        :      94.58
sys time (sec)            :   2,358.36

The v7 implementation of zswap_store_folios() consistently gives better
throughput and latency:

Total throughput (KB/s)   :  6,939,253
Average throughput (KB/s) :    231,308
elapsed time (sec)        :      88.64
sys time (sec)            :   2,197.54

My main goal in v7 has been to avoid regressing zstd by introducing compute
overheads, while still deriving the benefits of batching with request chaining,
and preserving the v6 structure of zswap_store_folio(), which aims to strike a
balance between:

1) Common code for the batching and sequential paths.
2) Making that common code "behave" sequentially so that zstd does not
   see a latency regression.

I would appreciate it if you could review the v7 implementation and let me
know whether this would be acceptable.

Thanks,
Kanchana


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH v6 02/16] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-16  5:10   ` Herbert Xu
@ 2025-02-28 10:00     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 30+ messages in thread
From: Sridhar, Kanchana P @ 2025-02-28 10:00 UTC (permalink / raw)
  To: Herbert Xu
  Cc: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, davem, clabbe, ardb, ebiggers, surenb, Accardi,
	Kristen C, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P

Hi Herbert,

> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Saturday, February 15, 2025 9:11 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com; akpm@linux-foundation.org;
> linux-crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> Kristen C <kristen.c.accardi@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v6 02/16] crypto: acomp - Define new interfaces for
> compress/decompress batching.
> 
> On Wed, Feb 05, 2025 at 11:20:48PM -0800, Kanchana P Sridhar wrote:
> > This commit adds get_batch_size(), batch_compress() and
> batch_decompress()
> > interfaces to:
> >
> >   struct acomp_alg
> >   struct crypto_acomp
> >
> > A crypto_acomp compression algorithm that supports batching of compressions
> > and decompressions must provide implementations for these API.
> >
> > A new helper function acomp_has_async_batching() can be invoked to query if
> > a crypto_acomp has registered these batching interfaces.
> >
> > A higher level module like zswap can call acomp_has_async_batching() to
> > detect if the compressor supports batching, and if so, it can call
> > the new crypto_acomp_batch_size() to detect the maximum batch-size
> > supported by the compressor. Based on this, zswap can use the minimum of
> > any zswap-specific upper limits for batch-size and the compressor's max
> > batch-size, to allocate batching resources.
> >
> > This allows the iaa_crypto Intel IAA driver to register implementations for
> > the get_batch_size(), batch_compress() and batch_decompress() acomp_alg
> > interfaces, that can subsequently be invoked from the kernel zswap/zram
> > modules to compress/decompress pages in parallel in the IAA hardware
> > accelerator to improve swapout/swapin performance through these newly added
> > corresponding crypto_acomp API:
> >
> >   crypto_acomp_batch_size()
> >   crypto_acomp_batch_compress()
> >   crypto_acomp_batch_decompress()
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  crypto/acompress.c                  |   3 +
> >  include/crypto/acompress.h          | 111 ++++++++++++++++++++++++++++
> >  include/crypto/internal/acompress.h |  19 +++++
> >  3 files changed, 133 insertions(+)
> 
> Please get rid of these batch interfaces.  The whole point of
> request chaining is to remove the distinction between batching
> and normal requests.  IOW, if a request is chained then it is
> automatically processed as a batch.  If it's a singleton then
> normal processing will occur.

Thanks. I have incorporated this suggestion in v7. I would appreciate it if
you could review patch 1 in v7, which I will be submitting shortly.
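To restate the model being followed (a sketch only; the exact chaining helpers
and their signatures in v7 may differ): a singleton request keeps today's
behavior, while chaining further requests onto it turns the same submission
into a batch:

	/* Singleton request: processed exactly as before. */
	err = crypto_wait_req(crypto_acomp_compress(req0), &wait);

	/* Chained requests: linking req1/req2 onto req0 makes the same
	 * submission process all three as one batch (req0..req2 and wait
	 * are illustrative names). */
	acomp_request_chain(req1, req0);
	acomp_request_chain(req2, req0);
	err = crypto_wait_req(crypto_acomp_compress(req0), &wait);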

Thanks,
Kanchana

> 
> Thanks,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2025-02-28 10:00 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-06  7:20 [PATCH v6 00/16] zswap IAA compress batching Kanchana P Sridhar
2025-02-06  7:20 ` [PATCH v6 01/16] crypto: acomp - Add synchronous/asynchronous acomp request chaining Kanchana P Sridhar
2025-02-06  7:20 ` [PATCH v6 02/16] crypto: acomp - Define new interfaces for compress/decompress batching Kanchana P Sridhar
2025-02-16  5:10   ` Herbert Xu
2025-02-28 10:00     ` Sridhar, Kanchana P
2025-02-06  7:20 ` [PATCH v6 03/16] crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable async mode Kanchana P Sridhar
2025-02-06  7:20 ` [PATCH v6 04/16] crypto: iaa - Implement batch_compress(), batch_decompress() API in iaa_crypto Kanchana P Sridhar
2025-02-06  7:20 ` [PATCH v6 05/16] crypto: iaa - Enable async mode and make it the default Kanchana P Sridhar
2025-02-06  7:20 ` [PATCH v6 06/16] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
2025-02-06  7:20 ` [PATCH v6 07/16] crypto: iaa - Re-organize the iaa_crypto driver code Kanchana P Sridhar
2025-02-06  7:20 ` [PATCH v6 08/16] crypto: iaa - Map IAA devices/wqs to cores based on packages instead of NUMA Kanchana P Sridhar
2025-02-06  7:20 ` [PATCH v6 09/16] crypto: iaa - Distribute compress jobs from all cores to all IAAs on a package Kanchana P Sridhar
2025-02-06  7:20 ` [PATCH v6 10/16] crypto: iaa - Descriptor allocation timeouts with mitigations in iaa_crypto Kanchana P Sridhar
2025-02-06  7:20 ` [PATCH v6 11/16] crypto: iaa - Fix for "deflate_generic_tfm" global being accessed without locks Kanchana P Sridhar
2025-02-06  7:20 ` [PATCH v6 12/16] mm: zswap: Allocate pool batching resources if the compressor supports batching Kanchana P Sridhar
2025-02-06 18:55   ` Yosry Ahmed
2025-02-28 10:00     ` Sridhar, Kanchana P
2025-02-06  7:20 ` [PATCH v6 13/16] mm: zswap: Restructure & simplify zswap_store() to make it amenable for batching Kanchana P Sridhar
2025-02-06  7:21 ` [PATCH v6 14/16] mm: zswap: Introduce zswap_compress_folio() to compress all pages in a folio Kanchana P Sridhar
2025-02-06  7:21 ` [PATCH v6 15/16] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios Kanchana P Sridhar
2025-02-06 19:10   ` Yosry Ahmed
2025-02-06 19:24     ` Sridhar, Kanchana P
2025-02-28 10:00       ` Sridhar, Kanchana P
2025-02-06  7:21 ` [PATCH v6 16/16] mm: zswap: Fix for zstd performance regression with 2M folios Kanchana P Sridhar
2025-02-06 19:15   ` Yosry Ahmed
2025-02-28 10:00     ` Sridhar, Kanchana P
2025-02-20 23:28   ` Nhat Pham
2025-02-21  3:24     ` Sridhar, Kanchana P
2025-02-11 17:05 ` [PATCH v6 00/16] zswap IAA compress batching Eric Biggers
2025-02-11 17:52   ` Nhat Pham
