* [PATCH v5 00/12] zswap IAA compress batching
@ 2024-12-21  6:31 Kanchana P Sridhar
  2024-12-21  6:31 ` [PATCH v5 01/12] crypto: acomp - Add synchronous/asynchronous acomp request chaining Kanchana P Sridhar
                   ` (12 more replies)
  0 siblings, 13 replies; 55+ messages in thread
From: Kanchana P Sridhar @ 2024-12-21  6:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar


IAA Compression Batching with acomp Request Chaining:
=====================================================

This patch-series introduces the use of the Intel In-Memory Analytics
Accelerator (IAA) for parallel batch compression of pages in large folios
to improve zswap swapout latency. Compared to zstd, IAA compress batching
reduces sys time by 22% (usemem30) and 27% (kernel compilation), and
increases usemem30 throughput by 30%.

The patch-series is organized as follows:

 1) crypto acomp & iaa_crypto driver enablers for batching: Relevant
    patches are tagged with "crypto:" in the subject:

    Patch 1) Adds a new acomp request chaining framework and interfaces
             based on Herbert Xu's ahash reference implementation in
             "[PATCH 2/6] crypto: hash - Add request chaining API" [1].
             acomp algorithms can use request chaining through these
             interfaces:

             Setup the request chain:
               acomp_reqchain_init()
               acomp_request_chain()

             Process the request chain:
               acomp_do_req_chain(): synchronously (sequentially)
               acomp_do_async_req_chain(): asynchronously using submit/poll
                                           ops (in parallel)

    Patch 2) Adds acomp_alg/crypto_acomp interfaces for batch_compress(),
             batch_decompress() and get_batch_size(), which swap modules can
             invoke using the new batching API crypto_acomp_batch_compress(),
             crypto_acomp_batch_decompress() and crypto_acomp_batch_size().
             Additionally, crypto acomp provides a new
             acomp_has_async_batching() interface to query for these APIs
             before allocating batching resources for a given compressor in
             zswap/zram.

    Patch 3) New CRYPTO_ACOMP_REQ_POLL acomp_req flag to act as a gate for
             async poll mode in iaa_crypto.

    Patch 4) iaa-crypto driver implementations for sync/async
             crypto_acomp_batch_compress() and
             crypto_acomp_batch_decompress() developed using request
             chaining. If the iaa_crypto driver is set up for 'async'
             sync_mode, these batching implementations deploy the
             asynchronous request chaining implementation. 'async' is the
             recommended mode for realizing the benefits of IAA parallelism.
             If iaa_crypto is set up for 'sync' sync_mode, the synchronous
             version of the request chaining API is used.
             
             The "iaa_acomp_fixed_deflate" algorithm registers these
             implementations for its "batch_compress" and "batch_decompress"
             interfaces respectively and opts in with CRYPTO_ALG_REQ_CHAIN.
             Further, iaa_crypto provides an implementation for the
             "get_batch_size" interface: this returns the
             IAA_CRYPTO_MAX_BATCH_SIZE constant that iaa_crypto defines
             currently as 8U for IAA compression algorithms (iaa_crypto can
             change this if needed as we optimize our batching algorithm).

    Patch 5) Modifies the default iaa_crypto driver mode to async, now that
             iaa_crypto provides a truly async mode that gives
             significantly better latency than sync mode for the batching
             use case.

    Patch 6) Disables verify_compress by default, to make it easier for
             users to run IAA for comparison with software compressors.

    Patch 7) Reorganizes the iaa_crypto driver code into logically related
             sections and avoids forward declarations, in order to facilitate
             Patch 8. This patch makes no functional changes.

    Patch 8) Makes a major infrastructure change in the iaa_crypto driver,
             to map IAA devices/work-queues to cores based on packages
             instead of NUMA nodes. This doesn't impact performance on the
             Sapphire Rapids system used for performance testing. However,
             it fixes functional problems we found in internal validation
             on Granite Rapids, where the number of NUMA nodes is greater
             than the number of packages: under the existing NUMA-based
             mapping, some IAA devices were over-utilized while others were
             not used at all.
             This patch also eliminates duplication of device wqs in
             per-cpu wq_tables, thereby saving 140MiB on a 384-core
             Granite Rapids server with 8 IAAs. This change is being
             submitted now so that it can go through code review before
             being merged.

    Patch 9) Builds upon the new infrastructure for mapping IAAs to cores
             based on packages, and enables configuring a "global_wq" per
             IAA, which can be used as a global resource for compress jobs
             for the package. If the user configures 2 WQs per IAA device,
             the driver will distribute compress jobs from all cores on the
             package to the "global_wqs" of all the IAA devices on that
             package, in a round-robin manner. This can be used to improve
             compression throughput for workloads that see a lot of swapout
             activity.

 2) zswap modifications to enable compress batching in zswap_store()
    of large folios (including pmd-mappable folios):

    Patch 10) Defines a zswap-specific ZSWAP_MAX_BATCH_SIZE (currently set
              as 8U) to denote the maximum number of acomp_ctx batching
              resources. Further, the "struct crypto_acomp_ctx" is modified
              to contain a configurable number of acomp_reqs and buffers.
              The cpu hotplug onlining code queries
              acomp_has_async_batching(); if batching is supported, it
              obtains the compressor's maximum batch size and allocates
              acomp_reqs/buffers equal to the minimum of zswap's upper
              limit and the compressor's maximum, otherwise it allocates
              1 acomp_req/buffer.

    Patch 11) Restructures & simplifies zswap_store() to make it amenable
              for batching. Moves the loop over the folio's pages to a new
              zswap_store_folio(), which in turn allocates zswap entries
              for all folio pages upfront, before proceeding to call a
              newly added zswap_compress_folio(), which simply calls
              zswap_compress() for each folio page.

    Patch 12) Finally, this patch modifies zswap_compress_folio() to detect
              if the pool's acomp_ctx has batching resources. If so, the
              "acomp_ctx->nr_reqs" becomes the batch size to use to call
              crypto_acomp_batch_compress() for every "acomp_ctx->nr_reqs"
              pages in the large folio. The crypto API calls into the new
              iaa_crypto "iaa_comp_acompress_batch()" that does batching
              with request chaining. Upon successful compression of a
              batch, the compressed buffers are stored in zpool. A sketch
              of this batched dispatch is shown after this patch list.
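
For reference, here is a minimal sketch of the zswap-side batched dispatch
described in Patches 10 and 12. It is illustrative only, not the exact code
in the patches: zswap_compress_folio_sketch() and the dsts/dlens/errors
arrays are placeholders, while ZSWAP_MAX_BATCH_SIZE, acomp_ctx->nr_reqs,
acomp_ctx->reqs, acomp_ctx->wait and crypto_acomp_batch_compress() are names
introduced by this series:

static bool zswap_compress_folio_sketch(struct folio *folio,
					struct crypto_acomp_ctx *acomp_ctx,
					u8 *dsts[], unsigned int dlens[],
					int errors[])
{
	/* nr_reqs = min(ZSWAP_MAX_BATCH_SIZE, compressor batch size): Patch 10 */
	unsigned int batch = acomp_ctx->nr_reqs;
	long nr_pages = folio_nr_pages(folio);
	long i;

	for (i = 0; i < nr_pages; i += batch) {
		struct page *pages[ZSWAP_MAX_BATCH_SIZE];
		unsigned int n = min_t(long, batch, nr_pages - i);
		unsigned int j;

		for (j = 0; j < n; ++j)
			pages[j] = folio_page(folio, i + j);

		/* Compress up to "batch" pages in parallel in IAA. */
		if (!crypto_acomp_batch_compress(acomp_ctx->reqs,
						 &acomp_ctx->wait, pages,
						 dsts, dlens, errors, n))
			return false;

		/* ... store the n compressed buffers in zpool ... */
	}

	return true;
}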

With v5 of this patch series, the IAA compress batching feature will be
enabled seamlessly on Intel platforms that have IAA by selecting
'deflate-iaa' as the zswap compressor, and using the iaa_crypto 'async'
sync_mode driver attribute.

[1]: https://lore.kernel.org/linux-crypto/677614fbdc70b31df2e26483c8d2cd1510c8af91.1730021644.git.herbert@gondor.apana.org.au/


System setup for testing:
=========================
Testing of this patch-series was done with mm-unstable as of 12-20-2024,
commit 5555a83c82d6, with and without this patch-series.
Data was gathered on a dual-socket Intel Sapphire Rapids server with 56
cores per socket, 4 IAA devices per socket, 503 GiB RAM and a 525G SSD
swap partition. Core frequency was fixed at 2500 MHz.

Other kernel configuration parameters:

    zswap compressor  : zstd, deflate-iaa
    zswap allocator   : zsmalloc
    vm.page-cluster   : 0, 2

IAA "compression verification" is disabled and IAA is run in the async
mode (the defaults with this series). 2WQs are configured per IAA
device. Compress jobs from all cores on a socket are distributed among all
4 IAA devices on the same socket.

I ran experiments with these workloads:

1) usemem with 30 processes, with these large folio sizes set to "always":
   - 16k/32k/64k
   - 2048k

2) Kernel compilation allmodconfig with 2G max memory, 32 threads, run in
   tmpfs, with these large folio sizes set to "always":
   - 16k/32k/64k


IAA compress batching performance: sync vs. async request chaining:
===================================================================
The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
processes were run, each allocating and writing 10G of memory, and sleeping
for 10 sec before exiting:

usemem --init-time -w -O -s 10 -n 30 10g

"async polling" here refers to the v4 implementation of batch compression
without request chaining, which is used as baseline to compare the request
chaining implementations in v5.

These are the latencies measured using bcc profiling with bpftrace for the
various iaa_crypto modes:

 -------------------------------------------------------------------------------
 usemem30: 16k/32k/64k Folios         crypto_acomp_batch_compress() latency
 
 iaa_crypto batching          count     mean       p50       p99
 implementation                         (ns)      (ns)      (ns)
 -------------------------------------------------------------------------------

 async polling            5,210,702    10,083     9,675   17,488
                                                                
 sync request chaining    5,396,532    33,391    32,977   39,426
                                                                
 async request chaining   5,509,777     9,959     9,611   16,590

 -------------------------------------------------------------------------------

This demonstrates that async request chaining doesn't cause an IAA compress
batching performance regression with respect to the v4 implementation
without request chaining.


Performance testing (usemem30):
===============================
The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
processes were run, each allocating and writing 10G of memory, and sleeping
for 10 sec before exiting:

usemem --init-time -w -O -s 10 -n 30 10g


 16k/32k/64k folios: usemem30: zstd:
 ===================================

 -------------------------------------------------------------------------------
                        mm-unstable-12-20-2024   v5 of this patch-series
 -------------------------------------------------------------------------------
 zswap compressor                      zstd             zstd  
 vm.page-cluster                          2                2 
                                                             
 -------------------------------------------------------------------------------
 Total throughput (KB/s)          6,143,774        6,180,657  
 Avg throughput (KB/s)              204,792          206,021  
 elapsed time (sec)                  110.45           112.02  
 sys time (sec)                    2,628.55         2,684.53  
                                                              
 -------------------------------------------------------------------------------
 memcg_high                         469,269          481,665  
 memcg_swap_fail                      1,198              910  
 zswpout                         48,932,319       48,931,447  
 zswpin                                 384              398  
 pswpout                                  0                0  
 pswpin                                   0                0  
 thp_swpout                               0                0  
 thp_swpout_fallback                      0                0  
 16kB_swpout_fallback                     0                0
 32kB_swpout_fallback                     0                0  
 64kB_swpout_fallback                 1,198              910  
 pgmajfault                           3,459            3,090  
 swap_ra                                 96              100  
 swap_ra_hit                             48               54  
 ZSWPOUT-16kB                             2                2  
 ZSWPOUT-32kB                             2                0  
 ZSWPOUT-64kB                     3,057,060        3,057,286  
 SWPOUT-16kB                              0                0  
 SWPOUT-32kB                              0                0  
 SWPOUT-64kB                              0                0  
 -------------------------------------------------------------------------------


 16k/32k/64k folios: usemem30: deflate-iaa:
 ==========================================

 -------------------------------------------------------------------------------
                    mm-unstable-12-20-2024     v5 of this patch-series
 -------------------------------------------------------------------------------
 zswap compressor             deflate-iaa        deflate-iaa      IAA Batching          
 vm.page-cluster                        2                  2       vs.     vs.
                                                                   Seq    zstd
 -------------------------------------------------------------------------------
 Total throughput (KB/s)        7,679,064         8,027,314         5%    30%
 Avg throughput (KB/s)            255,968           267,577         5%    30%
 elapsed time (sec)                 90.82             87.53        -4%   -22%
 sys time (sec)                  2,205.73          2,099.80        -5%   -22%
                                                                    
 -------------------------------------------------------------------------------
 memcg_high                       716,670           722,693         
 memcg_swap_fail                    1,187             1,251         
 zswpout                       64,511,695        64,510,499         
 zswpin                               483               477         
 pswpout                                0                 0         
 pswpin                                 0                 0         
 thp_swpout                             0                 0         
 thp_swpout_fallback                    0                 0         
 16kB_swpout_fallback                   0                 0
 32kB_swpout_fallback                   0                 0         
 64kB_swpout_fallback               1,187             1,251         
 pgmajfault                         3,180             3,187         
 swap_ra                              175               155         
 swap_ra_hit                          114                76         
 ZSWPOUT-16kB                           5                 3         
 ZSWPOUT-32kB                           1                 2         
 ZSWPOUT-64kB                   4,030,709         4,030,573         
 SWPOUT-16kB                            0                 0         
 SWPOUT-32kB                            0                 0         
 SWPOUT-64kB                            0                 0         
 -------------------------------------------------------------------------------


 2M folios: usemem30: zstd:
 ==========================

 -------------------------------------------------------------------------------
               mm-unstable-12-20-2024   v5 of this patch-series
 -------------------------------------------------------------------------------
 zswap compressor               zstd             zstd  
 vm.page-cluster                   2                2  
                                                        
 -------------------------------------------------------------------------------
 Total throughput (KB/s)   6,643,427        6,534,525     
 Avg throughput (KB/s)       221,447          217,817     
 elapsed time (sec)           102.92           104.44     
 sys time (sec)             2,332.67         2,415.00     
                                                     
 -------------------------------------------------------------------------------
 memcg_high                   61,999           60,770
 memcg_swap_fail                  37               47
 zswpout                  48,934,491       48,934,952
 zswpin                          386              404
 pswpout                           0                0
 pswpin                            0                0
 thp_swpout                        0                0
 thp_swpout_fallback              37               47
 pgmajfault                    5,010            4,646
 swap_ra                       5,836            4,692
 swap_ra_hit                   5,790            4,640
 ZSWPOUT-2048kB               95,529           95,520
 SWPOUT-2048kB                     0                0
 -------------------------------------------------------------------------------


 2M folios: usemem30: deflate-iaa:
 =================================

 -------------------------------------------------------------------------------
                 mm-unstable-12-20-2024        v5 of this patch-series
 -------------------------------------------------------------------------------
 zswap compressor           deflate-iaa      deflate-iaa     IAA Batching          
 vm.page-cluster                      2                2      vs.     vs.
                                                              Seq    zstd
 -------------------------------------------------------------------------------
 Total throughput (KB/s)      8,197,457        8,427,981       3%     29%
 Avg throughput (KB/s)          273,248          280,932       3%     29%
 elapsed time (sec)               86.79            83.45      -4%    -20%
 sys time (sec)                2,044.02         1,925.84      -6%    -20%
                                                                
 -------------------------------------------------------------------------------
 memcg_high                      94,008           88,809        
 memcg_swap_fail                     50               57        
 zswpout                     64,521,910       64,520,405        
 zswpin                             421              452        
 pswpout                              0                0        
 pswpin                               0                0        
 thp_swpout                           0                0        
 thp_swpout_fallback                 50               57        
 pgmajfault                       9,658            8,958        
 swap_ra                         19,633           17,341        
 swap_ra_hit                     19,579           17,278        
 ZSWPOUT-2048kB                 125,916          125,913        
 SWPOUT-2048kB                        0                0        
 -------------------------------------------------------------------------------


Performance testing (Kernel compilation, allmodconfig):
=======================================================

The kernel compilation experiments use "allmodconfig" with 32 threads,
built in tmpfs; each run takes ~12 minutes and has considerable swapout
activity. The cgroup's memory.max is set to 2G.


 16k/32k/64k folios: Kernel compilation/allmodconfig:
 ====================================================
 w/o: mm-unstable-12-20-2024

 -------------------------------------------------------------------------------
                               w/o            v5            w/o             v5
 -------------------------------------------------------------------------------
 zswap compressor             zstd          zstd    deflate-iaa    deflate-iaa          
 vm.page-cluster                 0             0              0              0
                                                                              
 -------------------------------------------------------------------------------
 real_sec                   792.04        793.92         783.43         766.93
 user_sec                15,781.73     15,772.48      15,753.22      15,766.53
 sys_sec                  5,302.83      5,308.05       3,982.30       3,853.21
 -------------------------------------------------------------------------------
 Max_Res_Set_Size_KB     1,871,908     1,873,368      1,871,836      1,873,168
 -------------------------------------------------------------------------------
 memcg_high                      0             0              0              0
 memcg_swap_fail                 0             0              0              0
 zswpout                90,775,917    91,653,816    106,964,482    110,380,500
 zswpin                 26,099,486    26,611,908     31,598,420     32,618,221
 pswpout                        48            96            331            331
 pswpin                         48            89            320            310
 thp_swpout                      0             0              0              0
 thp_swpout_fallback             0             0              0              0
 16kB_swpout_fallback            0             0              0              0                         
 32kB_swpout_fallback            0             0              0              0
 64kB_swpout_fallback            0         2,337          7,943          5,512
 pgmajfault             27,858,798    28,438,518     33,970,455     34,999,918
 swap_ra                         0             0              0              0
 swap_ra_hit                 2,173         2,913          2,192          5,248
 ZSWPOUT-16kB            1,292,865     1,306,214      1,463,397      1,483,056
 ZSWPOUT-32kB              695,446       705,451        830,676        829,992
 ZSWPOUT-64kB            2,938,716     2,958,250      3,520,199      3,634,972
 SWPOUT-16kB                     0             0              0              0
 SWPOUT-32kB                     0             0              0              0
 SWPOUT-64kB                     3             6             20             19
 -------------------------------------------------------------------------------



Summary:
========
The performance testing data with the usemem 30-process and kernel
compilation tests shows a 30% throughput gain and 22% sys time reduction
(usemem30), and a 27% sys time reduction (kernel compilation), with
zswap_store() of large folios using IAA compress batching as compared to
zstd.

The iaa_crypto wq stats will show almost the same number of compress calls
for wq.1 of all IAA devices. wq.0 will handle decompress calls exclusively.
We see a latency reduction of 2.5% by distributing compress jobs among all
IAA devices on the socket (based on v1 data).

We can expect even more significant performance and throughput
improvements if we use the parallelism offered by IAA to do reclaim
batching of 4K/large folios (really, any-order folios), and use the
zswap_store() high-throughput compression to batch-compress the pages
comprising these folios, rather than batching only within large folios.
This is the reclaim batching patch 13 in v1, which will be submitted in a
separate patch-series.

Our internal validation of IAA compress/decompress batching in highly
contended Sapphire Rapids server setups, with workloads running on 72 cores
for ~25 minutes under stringent memory limit constraints, has shown up to a
50% reduction in sys time and a 3.5% reduction in workload run time as
compared to software compressors.


Changes since v4:
=================
1) Rebased to mm-unstable as of 12-20-2024, commit 5555a83c82d6.
2) Added acomp request chaining, as suggested by Herbert. Thanks Herbert!
3) Implemented IAA compress batching using request chaining.
4) zswap_store() batching simplifications suggested by Chengming, Yosry and
   Nhat, thanks to all!
   - New zswap_compress_folio() that is called by zswap_store().
   - Move the loop over folio's pages out of zswap_store() and into a
     zswap_store_folio() that stores all pages.
   - Allocate all zswap entries for the folio upfront.
   - Added zswap_batch_compress().
   - Branch to call zswap_compress() or zswap_batch_compress() inside
     zswap_compress_folio().
   - All iterations over pages kept in same function level.
   - No helpers other than the newly added zswap_store_folio() and
     zswap_compress_folio().


Changes since v3:
=================
1) Rebased to mm-unstable as of 11-18-2024, commit 5a7056135bb6.
2) Major re-write of iaa_crypto driver's mapping of IAA devices to cores,
   based on packages instead of NUMA nodes.
3) Added acomp_has_async_batching() API to crypto acomp, that allows
   zswap/zram to query if a crypto_acomp has registered batch_compress and
   batch_decompress interfaces.
4) Clear the poll bits on the acomp_reqs passed to
   iaa_comp_a[de]compress_batch() so that a module like zswap can be
   confident about the acomp_reqs[0] not having the poll bit set before
   calling the fully synchronous API crypto_acomp_[de]compress().
   Herbert, I would appreciate it if you could review changes 2-4 in
   patches 1-8 of v4. I did not want to introduce too many iaa_crypto changes in
   v4, given that patch 7 is already making a major change. I plan to work
   on incorporating the request chaining using the ahash interface in v5
   (I need to understand the basic crypto ahash better). Thanks Herbert!
5) Incorporated Johannes' suggestion to not have a sysctl to enable
   compress batching.
6) Incorporated Yosry's suggestion to allocate batching resources in the
   cpu hotplug onlining code, since there is no longer a sysctl to control
   batching. Thanks Yosry!
7) Incorporated Johannes' suggestions related to making the overall
   sequence of events between zswap_store() and zswap_batch_store() similar
   as much as possible for readability and control flow, better naming of
   procedures, avoiding forward declarations, not inlining error path
   procedures, deleting zswap internal details from zswap.h, etc. Thanks
   Johannes, really appreciate the direction!
   I have tried to explain the minimal future-proofing in terms of the
   zswap_batch_store() signature and the definition of "struct
   zswap_batch_store_sub_batch" in the comments for this struct. I hope the
   new code explains the control flow a bit better.


Changes since v2:
=================
1) Rebased to mm-unstable as of 11-5-2024, commit 7994b7ea6ac8.
2) Fixed an issue in zswap_create_acomp_ctx() with checking for NULL
   returned by kmalloc_node() for acomp_ctx->buffers and for
   acomp_ctx->reqs.
3) Fixed a bug in zswap_pool_can_batch() for returning true if
   pool->can_batch_comp is found to be equal to BATCH_COMP_ENABLED, and if
   the per-cpu acomp_batch_ctx tests true for batching resources having
   been allocated on this cpu. Also, changed from per_cpu_ptr() to
   raw_cpu_ptr().
4) Incorporated the zswap_store_propagate_errors() compilation warning fix
   suggested by Dan Carpenter. Thanks Dan!
5) Replaced the references to SWAP_CRYPTO_SUB_BATCH_SIZE in comments in
   zswap.h, with SWAP_CRYPTO_BATCH_SIZE.

Changes since v1:
=================
1) Rebased to mm-unstable as of 11-1-2024, commit 5c4cf96cd702.
2) Incorporated Herbert's suggestions to use an acomp_req flag to indicate
   async/poll mode, and to encapsulate the polling functionality in the
   iaa_crypto driver. Thanks Herbert!
3) Incorporated Herbert's and Yosry's suggestions to implement the batching
   API in iaa_crypto and to make its use seamless from zswap's
   perspective. Thanks Herbert and Yosry!
4) Incorporated Yosry's suggestion to make it more convenient for the user
   to enable compress batching, while minimizing the memory footprint
   cost. Thanks Yosry!
5) Incorporated Yosry's suggestion to de-couple the shrink_folio_list()
   reclaim batching patch from this series, since it requires a broader
   discussion.


I would greatly appreciate code review comments for the iaa_crypto driver
and mm patches included in this series!

Thanks,
Kanchana



Kanchana P Sridhar (12):
  crypto: acomp - Add synchronous/asynchronous acomp request chaining.
  crypto: acomp - Define new interfaces for compress/decompress
    batching.
  crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable
    async mode.
  crypto: iaa - Implement batch_compress(), batch_decompress() API in
    iaa_crypto.
  crypto: iaa - Make async mode the default.
  crypto: iaa - Disable iaa_verify_compress by default.
  crypto: iaa - Re-organize the iaa_crypto driver code.
  crypto: iaa - Map IAA devices/wqs to cores based on packages instead
    of NUMA.
  crypto: iaa - Distribute compress jobs from all cores to all IAAs on a
    package.
  mm: zswap: Allocate pool batching resources if the crypto_alg supports
    batching.
  mm: zswap: Restructure & simplify zswap_store() to make it amenable
    for batching.
  mm: zswap: Compress batching with Intel IAA in zswap_store() of large
    folios.

 crypto/acompress.c                         |  287 ++++
 drivers/crypto/intel/iaa/iaa_crypto.h      |   27 +-
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 1697 +++++++++++++++-----
 include/crypto/acompress.h                 |  157 ++
 include/crypto/algapi.h                    |   10 +
 include/crypto/internal/acompress.h        |   29 +
 include/linux/crypto.h                     |   31 +
 mm/zswap.c                                 |  406 +++--
 8 files changed, 2103 insertions(+), 541 deletions(-)


base-commit: 5555a83c82d66729e4abaf16ae28d6bd81f9a64a
-- 
2.27.0




* [PATCH v5 01/12] crypto: acomp - Add synchronous/asynchronous acomp request chaining.
  2024-12-21  6:31 [PATCH v5 00/12] zswap IAA compress batching Kanchana P Sridhar
@ 2024-12-21  6:31 ` Kanchana P Sridhar
  2024-12-21  6:31 ` [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching Kanchana P Sridhar
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Kanchana P Sridhar @ 2024-12-21  6:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch is based on Herbert Xu's request chaining for ahash
("[PATCH 2/6] crypto: hash - Add request chaining API") [1]. The generic
framework for request chaining that's provided in the ahash implementation
has been used as reference to develop a similar synchronous request
chaining framework for crypto_acomp.

Furthermore, this commit develops an asynchronous request chaining
framework and API that iaa_crypto can use for request chaining with
parallelism, in order to fully benefit from Intel IAA's multiple
compress/decompress engines in hardware. This allows us to gain significant
latency improvements with IAA batching as compared to synchronous request
chaining.

 Usage of acomp request chaining API:
 ====================================

 Any crypto_acomp compressor can make use of request chaining as follows:

 Step 1: Create request chain:

  Request 0 (the first req in the chain):

  void acomp_reqchain_init(struct acomp_req *req,
                           u32 flags, crypto_completion_t compl,
                           void *data);

  Subsequent requests:

  void acomp_request_chain(struct acomp_req *req,
                           struct acomp_req *head);

 Step 2: Process the request chain using the specified compress/decompress
         "op":

  2.a) Synchronous: the chain of requests is processed in series:

       int acomp_do_req_chain(struct acomp_req *req,
                              int (*op)(struct acomp_req *req));

  2.b) Asynchronous: the chain of requests is processed in parallel using a
       submit-poll paradigm:

       int acomp_do_async_req_chain(struct acomp_req *req,
                                    int (*op_submit)(struct acomp_req *req),
                                    int (*op_poll)(struct acomp_req *req));

Request chaining will be used in subsequent patches to implement
compress/decompress batching in the iaa_crypto driver for the two supported
IAA driver sync_modes:

  sync_mode = 'sync' will use (2.a),
  sync_mode = 'async' will use (2.b).
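
A minimal usage sketch (not part of this patch) of how a driver-side
batching implementation might use these interfaces is shown below.
my_drv_compress(), my_drv_submit() and my_drv_poll() are hypothetical
driver callbacks standing in for the real per-driver ops (e.g. the
iaa_crypto ops added later in this series):

static int my_drv_compress_batch(struct acomp_req *reqs[], int nr_reqs,
				 struct crypto_wait *wait, bool use_async)
{
	int i, err;

	/* Request 0 heads the chain and carries the completion callback. */
	acomp_reqchain_init(reqs[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
			    crypto_req_done, wait);

	/* Link the remaining requests to the head request. */
	for (i = 1; i < nr_reqs; ++i)
		acomp_request_chain(reqs[i], reqs[0]);

	if (use_async)
		/* Submit all requests, then poll them to completion. */
		err = acomp_do_async_req_chain(reqs[0], my_drv_submit,
					       my_drv_poll);
	else
		/* Process the chain sequentially, waiting on the head req. */
		err = crypto_wait_req(acomp_do_req_chain(reqs[0],
							 my_drv_compress),
				      wait);

	/* Per-request status can be read back with acomp_request_err(). */
	for (i = 0; i < nr_reqs; ++i)
		if (acomp_request_err(reqs[i]))
			err = acomp_request_err(reqs[i]);

	return err;
}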

These files are directly re-used from [1] which is not yet merged:

include/crypto/algapi.h
include/linux/crypto.h

Hence, I am adding Herbert as the co-developer of this acomp request
chaining patch.

[1]: https://lore.kernel.org/linux-crypto/677614fbdc70b31df2e26483c8d2cd1510c8af91.1730021644.git.herbert@gondor.apana.org.au/

Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Co-developed-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by:
---
 crypto/acompress.c                  | 284 ++++++++++++++++++++++++++++
 include/crypto/acompress.h          |  41 ++++
 include/crypto/algapi.h             |  10 +
 include/crypto/internal/acompress.h |  10 +
 include/linux/crypto.h              |  31 +++
 5 files changed, 376 insertions(+)

diff --git a/crypto/acompress.c b/crypto/acompress.c
index 6fdf0ff9f3c0..cb6444d09dd7 100644
--- a/crypto/acompress.c
+++ b/crypto/acompress.c
@@ -23,6 +23,19 @@ struct crypto_scomp;
 
 static const struct crypto_type crypto_acomp_type;
 
+struct acomp_save_req_state {
+	struct list_head head;
+	struct acomp_req *req0;
+	struct acomp_req *cur;
+	int (*op)(struct acomp_req *req);
+	crypto_completion_t compl;
+	void *data;
+};
+
+static void acomp_reqchain_done(void *data, int err);
+static int acomp_save_req(struct acomp_req *req, crypto_completion_t cplt);
+static void acomp_restore_req(struct acomp_req *req);
+
 static inline struct acomp_alg *__crypto_acomp_alg(struct crypto_alg *alg)
 {
 	return container_of(alg, struct acomp_alg, calg.base);
@@ -123,6 +136,277 @@ struct crypto_acomp *crypto_alloc_acomp_node(const char *alg_name, u32 type,
 }
 EXPORT_SYMBOL_GPL(crypto_alloc_acomp_node);
 
+static int acomp_save_req(struct acomp_req *req, crypto_completion_t cplt)
+{
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
+	struct acomp_save_req_state *state;
+	gfp_t gfp;
+	u32 flags;
+
+	if (!acomp_is_async(tfm))
+		return 0;
+
+	flags = acomp_request_flags(req);
+	gfp = (flags & CRYPTO_TFM_REQ_MAY_SLEEP) ?  GFP_KERNEL : GFP_ATOMIC;
+	state = kmalloc(sizeof(*state), gfp);
+	if (!state)
+		return -ENOMEM;
+
+	state->compl = req->base.complete;
+	state->data = req->base.data;
+	state->req0 = req;
+
+	req->base.complete = cplt;
+	req->base.data = state;
+
+	return 0;
+}
+
+static void acomp_restore_req(struct acomp_req *req)
+{
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
+	struct acomp_save_req_state *state;
+
+	if (!acomp_is_async(tfm))
+		return;
+
+	state = req->base.data;
+
+	req->base.complete = state->compl;
+	req->base.data = state->data;
+	kfree(state);
+}
+
+static int acomp_reqchain_finish(struct acomp_save_req_state *state,
+				 int err, u32 mask)
+{
+	struct acomp_req *req0 = state->req0;
+	struct acomp_req *req = state->cur;
+	struct acomp_req *n;
+
+	req->base.err = err;
+
+	if (req == req0)
+		INIT_LIST_HEAD(&req->base.list);
+	else
+		list_add_tail(&req->base.list, &req0->base.list);
+
+	list_for_each_entry_safe(req, n, &state->head, base.list) {
+		list_del_init(&req->base.list);
+
+		req->base.flags &= mask;
+		req->base.complete = acomp_reqchain_done;
+		req->base.data = state;
+		state->cur = req;
+		err = state->op(req);
+
+		if (err == -EINPROGRESS) {
+			if (!list_empty(&state->head))
+				err = -EBUSY;
+			goto out;
+		}
+
+		if (err == -EBUSY)
+			goto out;
+
+		req->base.err = err;
+		list_add_tail(&req->base.list, &req0->base.list);
+	}
+
+	acomp_restore_req(req0);
+
+out:
+	return err;
+}
+
+static void acomp_reqchain_done(void *data, int err)
+{
+	struct acomp_save_req_state *state = data;
+	crypto_completion_t compl = state->compl;
+
+	data = state->data;
+
+	if (err == -EINPROGRESS) {
+		if (!list_empty(&state->head))
+			return;
+		goto notify;
+	}
+
+	err = acomp_reqchain_finish(state, err, CRYPTO_TFM_REQ_MAY_BACKLOG);
+	if (err == -EBUSY)
+		return;
+
+notify:
+	compl(data, err);
+}
+
+int acomp_do_req_chain(struct acomp_req *req,
+		       int (*op)(struct acomp_req *req))
+{
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
+	struct acomp_save_req_state *state;
+	struct acomp_save_req_state state0;
+	int err = 0;
+
+	if (!acomp_request_chained(req) || list_empty(&req->base.list) ||
+	    !crypto_acomp_req_chain(tfm))
+		return op(req);
+
+	state = &state0;
+
+	if (acomp_is_async(tfm)) {
+		err = acomp_save_req(req, acomp_reqchain_done);
+		if (err) {
+			struct acomp_req *r2;
+
+			req->base.err = err;
+			list_for_each_entry(r2, &req->base.list, base.list)
+				r2->base.err = err;
+
+			return err;
+		}
+
+		state = req->base.data;
+	}
+
+	state->op = op;
+	state->cur = req;
+	INIT_LIST_HEAD(&state->head);
+	list_splice(&req->base.list, &state->head);
+
+	err = op(req);
+	if (err == -EBUSY || err == -EINPROGRESS)
+		return -EBUSY;
+
+	return acomp_reqchain_finish(state, err, ~0);
+}
+EXPORT_SYMBOL_GPL(acomp_do_req_chain);
+
+static void acomp_async_reqchain_done(struct acomp_req *req0,
+				      struct list_head *state,
+				      int (*op_poll)(struct acomp_req *req))
+{
+	struct acomp_req *req, *n;
+	bool req0_done = false;
+	int err;
+
+	while (!list_empty(state)) {
+
+		if (!req0_done) {
+			err = op_poll(req0);
+			if (!(err == -EAGAIN || err == -EINPROGRESS || err == -EBUSY)) {
+				req0->base.err = err;
+				req0_done = true;
+			}
+		}
+
+		list_for_each_entry_safe(req, n, state, base.list) {
+			err = op_poll(req);
+
+			if (err == -EAGAIN || err == -EINPROGRESS || err == -EBUSY)
+				continue;
+
+			req->base.err = err;
+			list_del_init(&req->base.list);
+			list_add_tail(&req->base.list, &req0->base.list);
+		}
+	}
+
+	while (!req0_done) {
+		err = op_poll(req0);
+		if (!(err == -EAGAIN || err == -EINPROGRESS || err == -EBUSY)) {
+			req0->base.err = err;
+			break;
+		}
+	}
+}
+
+static int acomp_async_reqchain_finish(struct acomp_req *req0,
+				       struct list_head *state,
+				       int (*op_submit)(struct acomp_req *req),
+				       int (*op_poll)(struct acomp_req *req))
+{
+	struct acomp_req *req, *n;
+	int err = 0;
+
+	INIT_LIST_HEAD(&req0->base.list);
+
+	list_for_each_entry_safe(req, n, state, base.list) {
+		BUG_ON(req == req0);
+
+		err = op_submit(req);
+
+		if (!(err == -EINPROGRESS || err == -EBUSY)) {
+			req->base.err = err;
+			list_del_init(&req->base.list);
+			list_add_tail(&req->base.list, &req0->base.list);
+		}
+	}
+
+	acomp_async_reqchain_done(req0, state, op_poll);
+
+	return req0->base.err;
+}
+
+int acomp_do_async_req_chain(struct acomp_req *req,
+			     int (*op_submit)(struct acomp_req *req),
+			     int (*op_poll)(struct acomp_req *req))
+{
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
+	struct list_head state;
+	struct acomp_req *r2;
+	int err = 0;
+	void *req0_data = req->base.data;
+
+	if (!acomp_request_chained(req) || list_empty(&req->base.list) ||
+		!acomp_is_async(tfm) || !crypto_acomp_req_chain(tfm)) {
+
+		err = op_submit(req);
+
+		if (err == -EINPROGRESS || err == -EBUSY) {
+			bool req0_done = false;
+
+			while (!req0_done) {
+				err = op_poll(req);
+				if (!(err == -EAGAIN || err == -EINPROGRESS || err == -EBUSY)) {
+					req->base.err = err;
+					break;
+				}
+			}
+		} else {
+			req->base.err = err;
+		}
+
+		req->base.data = req0_data;
+		if (acomp_is_async(tfm))
+			req->base.complete(req->base.data, req->base.err);
+
+		return err;
+	}
+
+	err = op_submit(req);
+	req->base.err = err;
+
+	if (err && !(err == -EINPROGRESS || err == -EBUSY))
+		goto err_prop;
+
+	INIT_LIST_HEAD(&state);
+	list_splice(&req->base.list, &state);
+
+	err = acomp_async_reqchain_finish(req, &state, op_submit, op_poll);
+	req->base.data = req0_data;
+	req->base.complete(req->base.data, req->base.err);
+
+	return err;
+
+err_prop:
+	list_for_each_entry(r2, &req->base.list, base.list)
+		r2->base.err = err;
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(acomp_do_async_req_chain);
+
 struct acomp_req *acomp_request_alloc(struct crypto_acomp *acomp)
 {
 	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp);
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index 54937b615239..eadc24514056 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -206,6 +206,7 @@ static inline void acomp_request_set_callback(struct acomp_req *req,
 	req->base.data = data;
 	req->base.flags &= CRYPTO_ACOMP_ALLOC_OUTPUT;
 	req->base.flags |= flgs & ~CRYPTO_ACOMP_ALLOC_OUTPUT;
+	req->base.flags &= ~CRYPTO_TFM_REQ_CHAIN;
 }
 
 /**
@@ -237,6 +238,46 @@ static inline void acomp_request_set_params(struct acomp_req *req,
 		req->flags |= CRYPTO_ACOMP_ALLOC_OUTPUT;
 }
 
+static inline u32 acomp_request_flags(struct acomp_req *req)
+{
+	return req->base.flags;
+}
+
+static inline void acomp_reqchain_init(struct acomp_req *req,
+				       u32 flags, crypto_completion_t compl,
+				       void *data)
+{
+	acomp_request_set_callback(req, flags, compl, data);
+	crypto_reqchain_init(&req->base);
+}
+
+static inline void acomp_reqchain_clear(struct acomp_req *req, void *data)
+{
+	struct crypto_wait *wait = (struct crypto_wait *)data;
+	reinit_completion(&wait->completion);
+	crypto_reqchain_clear(&req->base);
+	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+				   crypto_req_done, data);
+}
+
+static inline void acomp_request_chain(struct acomp_req *req,
+				       struct acomp_req *head)
+{
+	crypto_request_chain(&req->base, &head->base);
+}
+
+int acomp_do_req_chain(struct acomp_req *req,
+		       int (*op)(struct acomp_req *req));
+
+int acomp_do_async_req_chain(struct acomp_req *req,
+			     int (*op_submit)(struct acomp_req *req),
+			     int (*op_poll)(struct acomp_req *req));
+
+static inline int acomp_request_err(struct acomp_req *req)
+{
+	return req->base.err;
+}
+
 /**
  * crypto_acomp_compress() -- Invoke asynchronous compress operation
  *
diff --git a/include/crypto/algapi.h b/include/crypto/algapi.h
index 156de41ca760..c5df380c7d08 100644
--- a/include/crypto/algapi.h
+++ b/include/crypto/algapi.h
@@ -271,4 +271,14 @@ static inline u32 crypto_tfm_alg_type(struct crypto_tfm *tfm)
 	return tfm->__crt_alg->cra_flags & CRYPTO_ALG_TYPE_MASK;
 }
 
+static inline bool crypto_request_chained(struct crypto_async_request *req)
+{
+	return req->flags & CRYPTO_TFM_REQ_CHAIN;
+}
+
+static inline bool crypto_tfm_req_chain(struct crypto_tfm *tfm)
+{
+	return tfm->__crt_alg->cra_flags & CRYPTO_ALG_REQ_CHAIN;
+}
+
 #endif	/* _CRYPTO_ALGAPI_H */
diff --git a/include/crypto/internal/acompress.h b/include/crypto/internal/acompress.h
index 8831edaafc05..53b4ef59b48c 100644
--- a/include/crypto/internal/acompress.h
+++ b/include/crypto/internal/acompress.h
@@ -84,6 +84,16 @@ static inline void __acomp_request_free(struct acomp_req *req)
 	kfree_sensitive(req);
 }
 
+static inline bool acomp_request_chained(struct acomp_req *req)
+{
+	return crypto_request_chained(&req->base);
+}
+
+static inline bool crypto_acomp_req_chain(struct crypto_acomp *tfm)
+{
+	return crypto_tfm_req_chain(&tfm->base);
+}
+
 /**
  * crypto_register_acomp() -- Register asynchronous compression algorithm
  *
diff --git a/include/linux/crypto.h b/include/linux/crypto.h
index b164da5e129e..7c27d557586c 100644
--- a/include/linux/crypto.h
+++ b/include/linux/crypto.h
@@ -13,6 +13,8 @@
 #define _LINUX_CRYPTO_H
 
 #include <linux/completion.h>
+#include <linux/errno.h>
+#include <linux/list.h>
 #include <linux/refcount.h>
 #include <linux/slab.h>
 #include <linux/types.h>
@@ -124,6 +126,9 @@
  */
 #define CRYPTO_ALG_FIPS_INTERNAL	0x00020000
 
+/* Set if the algorithm supports request chains. */
+#define CRYPTO_ALG_REQ_CHAIN		0x00040000
+
 /*
  * Transform masks and values (for crt_flags).
  */
@@ -133,6 +138,7 @@
 #define CRYPTO_TFM_REQ_FORBID_WEAK_KEYS	0x00000100
 #define CRYPTO_TFM_REQ_MAY_SLEEP	0x00000200
 #define CRYPTO_TFM_REQ_MAY_BACKLOG	0x00000400
+#define CRYPTO_TFM_REQ_CHAIN		0x00000800
 
 /*
  * Miscellaneous stuff.
@@ -174,6 +180,7 @@ struct crypto_async_request {
 	struct crypto_tfm *tfm;
 
 	u32 flags;
+	int err;
 };
 
 /**
@@ -540,5 +547,29 @@ int crypto_comp_decompress(struct crypto_comp *tfm,
 			   const u8 *src, unsigned int slen,
 			   u8 *dst, unsigned int *dlen);
 
+static inline void crypto_reqchain_init(struct crypto_async_request *req)
+{
+	req->err = -EINPROGRESS;
+	req->flags |= CRYPTO_TFM_REQ_CHAIN;
+	INIT_LIST_HEAD(&req->list);
+}
+
+static inline void crypto_reqchain_clear(struct crypto_async_request *req)
+{
+	req->flags &= ~CRYPTO_TFM_REQ_CHAIN;
+}
+
+static inline void crypto_request_chain(struct crypto_async_request *req,
+					struct crypto_async_request *head)
+{
+	req->err = -EINPROGRESS;
+	list_add_tail(&req->list, &head->list);
+}
+
+static inline bool crypto_tfm_is_async(struct crypto_tfm *tfm)
+{
+	return tfm->__crt_alg->cra_flags & CRYPTO_ALG_ASYNC;
+}
+
 #endif	/* _LINUX_CRYPTO_H */
 
-- 
2.27.0




* [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2024-12-21  6:31 [PATCH v5 00/12] zswap IAA compress batching Kanchana P Sridhar
  2024-12-21  6:31 ` [PATCH v5 01/12] crypto: acomp - Add synchronous/asynchronous acomp request chaining Kanchana P Sridhar
@ 2024-12-21  6:31 ` Kanchana P Sridhar
  2024-12-28 11:46   ` Herbert Xu
  2024-12-21  6:31 ` [PATCH v5 03/12] crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable async mode Kanchana P Sridhar
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 55+ messages in thread
From: Kanchana P Sridhar @ 2024-12-21  6:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This commit adds get_batch_size(), batch_compress() and batch_decompress()
interfaces to:

  struct acomp_alg
  struct crypto_acomp

A crypto_acomp compression algorithm that supports batching of compressions
and decompressions must provide implementations for these APIs.

A new helper function acomp_has_async_batching() can be invoked to query if
a crypto_acomp has registered these batching interfaces.

A higher level module like zswap can call acomp_has_async_batching() to
detect if the compressor supports batching, and if so, it can call
the new crypto_acomp_batch_size() to detect the maximum batch-size
supported by the compressor. Based on this, zswap can use the minimum of
any zswap-specific upper limits for batch-size and the compressor's max
batch-size, to allocate batching resources.

This allows the iaa_crypto Intel IAA driver to register implementations of
the get_batch_size(), batch_compress() and batch_decompress() acomp_alg
interfaces. These can subsequently be invoked from the kernel zswap/zram
modules to compress/decompress pages in parallel in the IAA hardware
accelerator, improving swapout/swapin performance, through the newly added
corresponding crypto_acomp API:

  crypto_acomp_batch_size()
  crypto_acomp_batch_compress()
  crypto_acomp_batch_decompress()
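
As a rough illustration (not part of this patch), a higher level module
might size its batching resources as follows; POOL_MAX_BATCH_SIZE is a
hypothetical caller-side upper limit standing in for something like zswap's
ZSWAP_MAX_BATCH_SIZE:

static unsigned int pool_nr_batch_reqs(struct crypto_acomp *acomp)
{
	unsigned int nr_reqs = 1;

	/* Allocate batching resources only if the compressor supports it. */
	if (acomp_has_async_batching(acomp))
		nr_reqs = min(POOL_MAX_BATCH_SIZE,
			      crypto_acomp_batch_size(acomp));

	/* Number of acomp_reqs/buffers to allocate per acomp_ctx. */
	return nr_reqs;
}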

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 crypto/acompress.c                  |   3 +
 include/crypto/acompress.h          | 111 ++++++++++++++++++++++++++++
 include/crypto/internal/acompress.h |  19 +++++
 3 files changed, 133 insertions(+)

diff --git a/crypto/acompress.c b/crypto/acompress.c
index cb6444d09dd7..165559a8b9bd 100644
--- a/crypto/acompress.c
+++ b/crypto/acompress.c
@@ -84,6 +84,9 @@ static int crypto_acomp_init_tfm(struct crypto_tfm *tfm)
 
 	acomp->compress = alg->compress;
 	acomp->decompress = alg->decompress;
+	acomp->get_batch_size = alg->get_batch_size;
+	acomp->batch_compress = alg->batch_compress;
+	acomp->batch_decompress = alg->batch_decompress;
 	acomp->dst_free = alg->dst_free;
 	acomp->reqsize = alg->reqsize;
 
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index eadc24514056..8451ade70fd8 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -43,6 +43,10 @@ struct acomp_req {
  *
  * @compress:		Function performs a compress operation
  * @decompress:		Function performs a de-compress operation
+ * @get_batch_size:     Maximum batch-size for batching compress/decompress
+ *                      operations.
+ * @batch_compress:	Function performs a batch compress operation
+ * @batch_decompress:	Function performs a batch decompress operation
  * @dst_free:		Frees destination buffer if allocated inside the
  *			algorithm
  * @reqsize:		Context size for (de)compression requests
@@ -51,6 +55,21 @@ struct acomp_req {
 struct crypto_acomp {
 	int (*compress)(struct acomp_req *req);
 	int (*decompress)(struct acomp_req *req);
+	unsigned int (*get_batch_size)(void);
+	bool (*batch_compress)(struct acomp_req *reqs[],
+			       struct crypto_wait *wait,
+			       struct page *pages[],
+			       u8 *dsts[],
+			       unsigned int dlens[],
+			       int errors[],
+			       int nr_pages);
+	bool (*batch_decompress)(struct acomp_req *reqs[],
+				 struct crypto_wait *wait,
+				 u8 *srcs[],
+				 struct page *pages[],
+				 unsigned int slens[],
+				 int errors[],
+				 int nr_pages);
 	void (*dst_free)(struct scatterlist *dst);
 	unsigned int reqsize;
 	struct crypto_tfm base;
@@ -142,6 +161,13 @@ static inline bool acomp_is_async(struct crypto_acomp *tfm)
 	       CRYPTO_ALG_ASYNC;
 }
 
+static inline bool acomp_has_async_batching(struct crypto_acomp *tfm)
+{
+	return (acomp_is_async(tfm) &&
+		(crypto_comp_alg_common(tfm)->base.cra_flags & CRYPTO_ALG_TYPE_ACOMPRESS) &&
+		tfm->get_batch_size && tfm->batch_compress && tfm->batch_decompress);
+}
+
 static inline struct crypto_acomp *crypto_acomp_reqtfm(struct acomp_req *req)
 {
 	return __crypto_acomp_tfm(req->base.tfm);
@@ -306,4 +332,89 @@ static inline int crypto_acomp_decompress(struct acomp_req *req)
 	return crypto_acomp_reqtfm(req)->decompress(req);
 }
 
+/**
+ * crypto_acomp_batch_size() -- Get the algorithm's batch size
+ *
+ * Function returns the algorithm's batch size for batching operations
+ *
+ * @tfm:	ACOMPRESS tfm handle allocated with crypto_alloc_acomp()
+ *
+ * Return:	crypto_acomp's batch size.
+ */
+static inline unsigned int crypto_acomp_batch_size(struct crypto_acomp *tfm)
+{
+	if (acomp_has_async_batching(tfm))
+		return tfm->get_batch_size();
+
+	return 1;
+}
+
+/**
+ * crypto_acomp_batch_compress() -- Invoke asynchronous compress of
+ *                                  a batch of requests
+ *
+ * Function invokes the asynchronous batch compress operation
+ *
+ * @reqs: @nr_pages asynchronous compress requests.
+ * @wait: crypto_wait for acomp batch compress with synchronous/asynchronous
+ *        request chaining. If NULL, the driver must provide a way to process
+ *        request completions asynchronously.
+ * @pages: Pages to be compressed.
+ * @dsts: Pre-allocated destination buffers to store results of compression.
+ * @dlens: Will contain the compressed lengths.
+ * @errors: zero on successful compression of the corresponding
+ *          req, or error code in case of error.
+ * @nr_pages: The number of pages to be compressed.
+ *
+ * Returns true if all compress requests complete successfully,
+ * false otherwise.
+ */
+static inline bool crypto_acomp_batch_compress(struct acomp_req *reqs[],
+					       struct crypto_wait *wait,
+					       struct page *pages[],
+					       u8 *dsts[],
+					       unsigned int dlens[],
+					       int errors[],
+					       int nr_pages)
+{
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(reqs[0]);
+
+	return tfm->batch_compress(reqs, wait, pages, dsts,
+				   dlens, errors, nr_pages);
+}
+
+/**
+ * crypto_acomp_batch_decompress() -- Invoke asynchronous decompress of
+ *                                    a batch of requests
+ *
+ * Function invokes the asynchronous batch decompress operation
+ *
+ * @reqs: @nr_pages asynchronous decompress requests.
+ * @wait: crypto_wait for acomp batch decompress with synchronous/asynchronous
+ *        request chaining. If NULL, the driver must provide a way to process
+ *        request completions asynchronously.
+ * @srcs: The src buffers to be decompressed.
+ * @pages: The pages to store the decompressed buffers.
+ * @slens: Compressed lengths of @srcs.
+ * @errors: zero on successful compression of the corresponding
+ *          req, or error code in case of error.
+ * @nr_pages: The number of pages to be decompressed.
+ *
+ * Returns true if all decompress requests complete successfully,
+ * false otherwise.
+ */
+static inline bool crypto_acomp_batch_decompress(struct acomp_req *reqs[],
+						 struct crypto_wait *wait,
+						 u8 *srcs[],
+						 struct page *pages[],
+						 unsigned int slens[],
+						 int errors[],
+						 int nr_pages)
+{
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(reqs[0]);
+
+	return tfm->batch_decompress(reqs, wait, srcs, pages,
+				     slens, errors, nr_pages);
+}
+
 #endif
diff --git a/include/crypto/internal/acompress.h b/include/crypto/internal/acompress.h
index 53b4ef59b48c..df0e192801ff 100644
--- a/include/crypto/internal/acompress.h
+++ b/include/crypto/internal/acompress.h
@@ -17,6 +17,10 @@
  *
  * @compress:	Function performs a compress operation
  * @decompress:	Function performs a de-compress operation
+ * @get_batch_size:     Maximum batch-size for batching compress/decompress
+ *                      operations.
+ * @batch_compress:	Function performs a batch compress operation
+ * @batch_decompress:	Function performs a batch decompress operation
  * @dst_free:	Frees destination buffer if allocated inside the algorithm
  * @init:	Initialize the cryptographic transformation object.
  *		This function is used to initialize the cryptographic
@@ -37,6 +41,21 @@
 struct acomp_alg {
 	int (*compress)(struct acomp_req *req);
 	int (*decompress)(struct acomp_req *req);
+	unsigned int (*get_batch_size)(void);
+	bool (*batch_compress)(struct acomp_req *reqs[],
+			       struct crypto_wait *wait,
+			       struct page *pages[],
+			       u8 *dsts[],
+			       unsigned int dlens[],
+			       int errors[],
+			       int nr_pages);
+	bool (*batch_decompress)(struct acomp_req *reqs[],
+				 struct crypto_wait *wait,
+				 u8 *srcs[],
+				 struct page *pages[],
+				 unsigned int slens[],
+				 int errors[],
+				 int nr_pages);
 	void (*dst_free)(struct scatterlist *dst);
 	int (*init)(struct crypto_acomp *tfm);
 	void (*exit)(struct crypto_acomp *tfm);
-- 
2.27.0




* [PATCH v5 03/12] crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable async mode.
  2024-12-21  6:31 [PATCH v5 00/12] zswap IAA compress batching Kanchana P Sridhar
  2024-12-21  6:31 ` [PATCH v5 01/12] crypto: acomp - Add synchronous/asynchronous acomp request chaining Kanchana P Sridhar
  2024-12-21  6:31 ` [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching Kanchana P Sridhar
@ 2024-12-21  6:31 ` Kanchana P Sridhar
  2024-12-21  6:31 ` [PATCH v5 04/12] crypto: iaa - Implement batch_compress(), batch_decompress() API in iaa_crypto Kanchana P Sridhar
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Kanchana P Sridhar @ 2024-12-21  6:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

If the iaa_crypto driver has async_mode set to true, and use_irq set to
false, it can still be forced to use synchronous mode by turning off the
CRYPTO_ACOMP_REQ_POLL flag in req->flags.

In other words, all three of the following need to be true for a request
to be processed in fully async poll mode:

 1) async_mode should be "true"
 2) use_irq should be "false"
 3) req->flags & CRYPTO_ACOMP_REQ_POLL should be "true"
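
Expressed as a single predicate over the driver settings and the
per-request flag (an illustrative sketch of the condition, not a literal
excerpt from the driver code):

  /*
   * A request is handled in fully async submit-then-poll mode only when
   * all three conditions hold:
   */
  bool fully_async_poll = async_mode && !use_irq &&
                          (req->flags & CRYPTO_ACOMP_REQ_POLL);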

Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 11 ++++++++++-
 include/crypto/acompress.h                 |  5 +++++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 9e557649e5d0..29d03df39fab 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1520,6 +1520,10 @@ static int iaa_comp_acompress(struct acomp_req *req)
 		return -EINVAL;
 	}
 
+	/* If the caller has requested no polling, disable async. */
+	if (!(req->flags & CRYPTO_ACOMP_REQ_POLL))
+		disable_async = true;
+
 	cpu = get_cpu();
 	wq = wq_table_next_wq(cpu);
 	put_cpu();
@@ -1712,6 +1716,7 @@ static int iaa_comp_adecompress(struct acomp_req *req)
 {
 	struct crypto_tfm *tfm = req->base.tfm;
 	dma_addr_t src_addr, dst_addr;
+	bool disable_async = false;
 	int nr_sgs, cpu, ret = 0;
 	struct iaa_wq *iaa_wq;
 	struct device *dev;
@@ -1727,6 +1732,10 @@ static int iaa_comp_adecompress(struct acomp_req *req)
 		return -EINVAL;
 	}
 
+	/* If the caller has requested no polling, disable async. */
+	if (!(req->flags & CRYPTO_ACOMP_REQ_POLL))
+		disable_async = true;
+
 	if (!req->dst)
 		return iaa_comp_adecompress_alloc_dest(req);
 
@@ -1775,7 +1784,7 @@ static int iaa_comp_adecompress(struct acomp_req *req)
 		req->dst, req->dlen, sg_dma_len(req->dst));
 
 	ret = iaa_decompress(tfm, req, wq, src_addr, req->slen,
-			     dst_addr, &req->dlen, false);
+			     dst_addr, &req->dlen, disable_async);
 	if (ret == -EINPROGRESS)
 		return ret;
 
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index 8451ade70fd8..e090538e8406 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -14,6 +14,11 @@
 #include <linux/crypto.h>
 
 #define CRYPTO_ACOMP_ALLOC_OUTPUT	0x00000001
+/*
+ * If set, the driver must have a way to submit the req, then
+ * poll its completion status for success/error.
+ */
+#define CRYPTO_ACOMP_REQ_POLL		0x00000002
 #define CRYPTO_ACOMP_DST_MAX		131072
 
 /**
-- 
2.27.0




* [PATCH v5 04/12] crypto: iaa - Implement batch_compress(), batch_decompress() API in iaa_crypto.
  2024-12-21  6:31 [PATCH v5 00/12] zswap IAA compress batching Kanchana P Sridhar
                   ` (2 preceding siblings ...)
  2024-12-21  6:31 ` [PATCH v5 03/12] crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable async mode Kanchana P Sridhar
@ 2024-12-21  6:31 ` Kanchana P Sridhar
  2024-12-22  4:07   ` kernel test robot
  2024-12-21  6:31 ` [PATCH v5 05/12] crypto: iaa - Make async mode the default Kanchana P Sridhar
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 55+ messages in thread
From: Kanchana P Sridhar @ 2024-12-21  6:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch provides iaa_crypto driver implementations for the newly added
crypto_acomp batch_compress() and batch_decompress() interfaces using
acomp request chaining.

iaa_crypto also implements the new crypto_acomp get_batch_size() interface
that returns an iaa_crypto driver-specific constant, IAA_CRYPTO_MAX_BATCH_SIZE
(currently set to 8U).

This allows swap modules such as zswap/zram to allocate the required batching
resources and then invoke fully asynchronous, parallel batch
compression/decompression of pages on systems with Intel IAA, using these
APIs respectively:

 crypto_acomp_batch_size(...);
 crypto_acomp_batch_compress(...);
 crypto_acomp_batch_decompress(...);

This enables zswap compress batching code to be developed in
a manner similar to the current single-page synchronous calls to:

 crypto_acomp_compress(...);
 crypto_acomp_decompress(...);

thereby facilitating an encapsulated and modular hand-off between the kernel
zswap/zram code and the crypto_acomp layer.
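
As an illustrative sketch of the intended calling convention (the tfm
argument, array names, the MAX_BATCH bound and the fallback below are
assumptions made for this example, not code from this series):

  unsigned int nr_reqs = crypto_acomp_batch_size(acomp);
  int errors[MAX_BATCH];
  DECLARE_CRYPTO_WAIT(wait);

  /*
   * reqs[], pages[], dsts[] and dlens[] are sized and populated by the
   * caller for nr_reqs entries (nr_reqs <= MAX_BATCH).
   */
  if (!crypto_acomp_batch_compress(reqs, &wait, pages, dsts, dlens,
                                   errors, nr_reqs)) {
          /*
           * Inspect errors[] per page, or fall back to single-page
           * crypto_acomp_compress() calls.
           */
  }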

Since iaa_crypto supports the use of acomp request chaining, this patch
also adds CRYPTO_ALG_REQ_CHAIN to the iaa_acomp_fixed_deflate algorithm's
cra_flags.

Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto.h      |   9 +
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 395 ++++++++++++++++++++-
 2 files changed, 403 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index 56985e395263..b3b67c44ec8a 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -39,6 +39,15 @@
 					 IAA_DECOMP_CHECK_FOR_EOB | \
 					 IAA_DECOMP_STOP_ON_EOB)
 
+/*
+ * The maximum compress/decompress batch size for IAA's implementation of
+ * the crypto_acomp batch_compress() and batch_decompress() interfaces.
+ * The IAA compression algorithms should provide the crypto_acomp
+ * get_batch_size() interface through a function that returns this
+ * constant.
+ */
+#define IAA_CRYPTO_MAX_BATCH_SIZE 8U
+
 /* Representation of IAA workqueue */
 struct iaa_wq {
 	struct list_head	list;
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 29d03df39fab..b51b0b4b9ac3 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1807,6 +1807,396 @@ static void compression_ctx_init(struct iaa_compression_ctx *ctx)
 	ctx->use_irq = use_irq;
 }
 
+static int iaa_comp_poll(struct acomp_req *req)
+{
+	struct idxd_desc *idxd_desc;
+	struct idxd_device *idxd;
+	struct iaa_wq *iaa_wq;
+	struct pci_dev *pdev;
+	struct device *dev;
+	struct idxd_wq *wq;
+	bool compress_op;
+	int ret;
+
+	idxd_desc = req->base.data;
+	if (!idxd_desc)
+		return -EAGAIN;
+
+	compress_op = (idxd_desc->iax_hw->opcode == IAX_OPCODE_COMPRESS);
+	wq = idxd_desc->wq;
+	iaa_wq = idxd_wq_get_private(wq);
+	idxd = iaa_wq->iaa_device->idxd;
+	pdev = idxd->pdev;
+	dev = &pdev->dev;
+
+	ret = check_completion(dev, idxd_desc->iax_completion, true, true);
+	if (ret == -EAGAIN)
+		return ret;
+	if (ret)
+		goto out;
+
+	req->dlen = idxd_desc->iax_completion->output_size;
+
+	/* Update stats */
+	if (compress_op) {
+		update_total_comp_bytes_out(req->dlen);
+		update_wq_comp_bytes(wq, req->dlen);
+	} else {
+		update_total_decomp_bytes_in(req->slen);
+		update_wq_decomp_bytes(wq, req->slen);
+	}
+
+	if (iaa_verify_compress && (idxd_desc->iax_hw->opcode == IAX_OPCODE_COMPRESS)) {
+		struct crypto_tfm *tfm = req->base.tfm;
+		dma_addr_t src_addr, dst_addr;
+		u32 compression_crc;
+
+		compression_crc = idxd_desc->iax_completion->crc;
+
+		dma_sync_sg_for_device(dev, req->dst, 1, DMA_FROM_DEVICE);
+		dma_sync_sg_for_device(dev, req->src, 1, DMA_TO_DEVICE);
+
+		src_addr = sg_dma_address(req->src);
+		dst_addr = sg_dma_address(req->dst);
+
+		ret = iaa_compress_verify(tfm, req, wq, src_addr, req->slen,
+					  dst_addr, &req->dlen, compression_crc);
+	}
+out:
+	/* caller doesn't call crypto_wait_req, so no acomp_request_complete() */
+
+	dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
+	dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
+
+	idxd_free_desc(idxd_desc->wq, idxd_desc);
+
+	dev_dbg(dev, "%s: returning ret=%d\n", __func__, ret);
+
+	return ret;
+}
+
+static unsigned int iaa_comp_get_batch_size(void)
+{
+	return IAA_CRYPTO_MAX_BATCH_SIZE;
+}
+
+static void iaa_set_req_poll(
+	struct acomp_req *reqs[],
+	int nr_reqs,
+	bool set_flag)
+{
+	int i;
+
+	for (i = 0; i < nr_reqs; ++i) {
+		set_flag ? (reqs[i]->flags |= CRYPTO_ACOMP_REQ_POLL) :
+			   (reqs[i]->flags &= ~CRYPTO_ACOMP_REQ_POLL);
+	}
+}
+
+/**
+ * iaa_comp_acompress_batch() - Provide IAA compress batching for use by
+ * swap modules.
+ *
+ * @reqs: @nr_pages asynchronous compress requests.
+ * @wait: crypto_wait for acomp batch compress implemented using request
+ *        chaining. Required if async_mode is "false". If async_mode is "true",
+ *        and @wait is NULL, the completions will be processed using
+ *        asynchronous polling of the requests' completion statuses.
+ * @pages: Pages to be compressed by IAA.
+ * @dsts: Pre-allocated destination buffers to store results of IAA
+ *        compression. Each element of @dsts must be of size "PAGE_SIZE * 2".
+ * @dlens: Will contain the compressed lengths.
+ * @errors: zero on successful compression of the corresponding
+ *          req, or error code in case of error.
+ * @nr_pages: The number of pages, up to IAA_CRYPTO_MAX_BATCH_SIZE,
+ *            to be compressed.
+ *
+ * Returns true if all compress requests complete successfully,
+ * false otherwise.
+ */
+static bool iaa_comp_acompress_batch(
+	struct acomp_req *reqs[],
+	struct crypto_wait *wait,
+	struct page *pages[],
+	u8 *dsts[],
+	unsigned int dlens[],
+	int errors[],
+	int nr_pages)
+{
+	struct scatterlist inputs[IAA_CRYPTO_MAX_BATCH_SIZE];
+	struct scatterlist outputs[IAA_CRYPTO_MAX_BATCH_SIZE];
+	bool compressions_done = false;
+	bool async = (async_mode && !use_irq);
+	bool async_poll = (async && !wait);
+	int i, err = 0;
+
+	BUG_ON(nr_pages > IAA_CRYPTO_MAX_BATCH_SIZE);
+	BUG_ON(!async && !wait);
+
+	if (async)
+		iaa_set_req_poll(reqs, nr_pages, true);
+	else
+		iaa_set_req_poll(reqs, nr_pages, false);
+
+	/*
+	 * Prepare and submit acomp_reqs to IAA. IAA will process these
+	 * compress jobs in parallel if async_mode is true.
+	 */
+	for (i = 0; i < nr_pages; ++i) {
+		sg_init_table(&inputs[i], 1);
+		sg_set_page(&inputs[i], pages[i], PAGE_SIZE, 0);
+
+		/*
+		 * Each dst buffer should be of size (PAGE_SIZE * 2).
+		 * Reflect same in sg_list.
+		 */
+		sg_init_one(&outputs[i], dsts[i], PAGE_SIZE * 2);
+		acomp_request_set_params(reqs[i], &inputs[i],
+					 &outputs[i], PAGE_SIZE, dlens[i]);
+
+		/*
+		 * As long as the API is called with a valid "wait", chain the
+		 * requests for synchronous/asynchronous compress ops.
+		 * If async_mode is in effect, but the API is called with a
+		 * NULL "wait", submit the requests first, and poll for
+		 * their completion status later, after all descriptors have
+		 * been submitted.
+		 */
+		if (!async_poll) {
+			/* acomp request chaining. */
+			if (i)
+				acomp_request_chain(reqs[i], reqs[0]);
+			else
+				acomp_reqchain_init(reqs[0], 0, crypto_req_done,
+						    wait);
+		} else {
+			errors[i] = iaa_comp_acompress(reqs[i]);
+
+			if (errors[i] != -EINPROGRESS) {
+				errors[i] = -EINVAL;
+				err = -EINVAL;
+			} else {
+				errors[i] = -EAGAIN;
+			}
+		}
+	}
+
+	if (!async_poll) {
+		if (async)
+			/* Process the request chain in parallel. */
+			err = crypto_wait_req(acomp_do_async_req_chain(reqs[0],
+					      iaa_comp_acompress, iaa_comp_poll),
+					      wait);
+		else
+			/* Process the request chain in series. */
+			err = crypto_wait_req(acomp_do_req_chain(reqs[0],
+					      iaa_comp_acompress), wait);
+
+		for (i = 0; i < nr_pages; ++i) {
+			errors[i] = acomp_request_err(reqs[i]);
+			if (errors[i]) {
+				err = -EINVAL;
+				pr_debug("Request chaining req %d compress error %d\n", i, errors[i]);
+			} else {
+				dlens[i] = reqs[i]->dlen;
+			}
+		}
+
+		goto reset_reqs;
+	}
+
+	/*
+	 * Asynchronously poll for and process IAA compress job completions.
+	 */
+	while (!compressions_done) {
+		compressions_done = true;
+
+		for (i = 0; i < nr_pages; ++i) {
+			/*
+			 * Skip, if the compression has already completed
+			 * successfully or with an error.
+			 */
+			if (errors[i] != -EAGAIN)
+				continue;
+
+			errors[i] = iaa_comp_poll(reqs[i]);
+
+			if (errors[i]) {
+				if (errors[i] == -EAGAIN)
+					compressions_done = false;
+				else
+					err = -EINVAL;
+			} else {
+				dlens[i] = reqs[i]->dlen;
+			}
+		}
+	}
+
+reset_reqs:
+	/*
+	 * For the same 'reqs[]' to be usable by
+	 * iaa_comp_acompress()/iaa_comp_adecompress(),
+	 * clear the CRYPTO_ACOMP_REQ_POLL bit on all acomp_reqs, and the
+	 * CRYPTO_TFM_REQ_CHAIN bit on the reqs[0].
+	 */
+	iaa_set_req_poll(reqs, nr_pages, false);
+	if (!async_poll)
+		acomp_reqchain_clear(reqs[0], wait);
+
+	return !err;
+}
+
+/**
+ * iaa_comp_adecompress_batch() - Provide IAA decompress batching for use
+ * by swap modules.
+ *
+ * @reqs: @nr_pages asynchronous decompress requests.
+ * @wait: crypto_wait for acomp batch decompress implemented using request
+ *        chaining. Required if async_mode is "false". If async_mode is "true",
+ *        and @wait is NULL, the completions will be processed using
+ *        asynchronous polling of the requests' completion statuses.
+ * @srcs: The src buffers to be decompressed by IAA.
+ * @pages: The pages to store the decompressed buffers.
+ * @slens: Compressed lengths of @srcs.
+ * @errors: zero on successful decompression of the corresponding
+ *          req, or error code in case of error.
+ * @nr_pages: The number of pages, up to IAA_CRYPTO_MAX_BATCH_SIZE,
+ *            to be decompressed.
+ *
+ * Returns true if all decompress requests complete successfully,
+ * false otherwise.
+ */
+static bool iaa_comp_adecompress_batch(
+	struct acomp_req *reqs[],
+	struct crypto_wait *wait,
+	u8 *srcs[],
+	struct page *pages[],
+	unsigned int slens[],
+	int errors[],
+	int nr_pages)
+{
+	struct scatterlist inputs[IAA_CRYPTO_MAX_BATCH_SIZE];
+	struct scatterlist outputs[IAA_CRYPTO_MAX_BATCH_SIZE];
+	unsigned int dlens[IAA_CRYPTO_MAX_BATCH_SIZE];
+	bool decompressions_done = false;
+	bool async = (async_mode && !use_irq);
+	bool async_poll = (async && !wait);
+	int i, err = 0;
+
+	BUG_ON(nr_pages > IAA_CRYPTO_MAX_BATCH_SIZE);
+	BUG_ON(!async && !wait);
+
+	if (async)
+		iaa_set_req_poll(reqs, nr_pages, true);
+	else
+		iaa_set_req_poll(reqs, nr_pages, false);
+
+	/*
+	 * Prepare and submit acomp_reqs to IAA. IAA will process these
+	 * decompress jobs in parallel if async_mode is true.
+	 */
+	for (i = 0; i < nr_pages; ++i) {
+		dlens[i] = PAGE_SIZE;
+		sg_init_one(&inputs[i], srcs[i], slens[i]);
+		sg_init_table(&outputs[i], 1);
+		sg_set_page(&outputs[i], pages[i], PAGE_SIZE, 0);
+		acomp_request_set_params(reqs[i], &inputs[i],
+					&outputs[i], slens[i], dlens[i]);
+
+		/*
+		 * As long as the API is called with a valid "wait", chain the
+		 * requests for synchronous/asynchronous decompress ops.
+		 * If async_mode is in effect, but the API is called with a
+		 * NULL "wait", submit the requests first, and poll for
+		 * their completion status later, after all descriptors have
+		 * been submitted.
+		 */
+		if (!async_poll) {
+			/* acomp request chaining. */
+			if (i)
+				acomp_request_chain(reqs[i], reqs[0]);
+			else
+				acomp_reqchain_init(reqs[0], 0, crypto_req_done,
+						    wait);
+		} else {
+			errors[i] = iaa_comp_adecompress(reqs[i]);
+
+			if (errors[i] != -EINPROGRESS) {
+				errors[i] = -EINVAL;
+				err = -EINVAL;
+			} else {
+				errors[i] = -EAGAIN;
+			}
+		}
+	}
+
+	if (!async_poll) {
+		if (async)
+			/* Process the request chain in parallel. */
+			err = crypto_wait_req(acomp_do_async_req_chain(reqs[0],
+					      iaa_comp_adecompress, iaa_comp_poll),
+					      wait);
+		else
+			/* Process the request chain in series. */
+			err = crypto_wait_req(acomp_do_req_chain(reqs[0],
+					      iaa_comp_adecompress), wait);
+
+		for (i = 0; i < nr_pages; ++i) {
+			errors[i] = acomp_request_err(reqs[i]);
+			if (errors[i]) {
+				err = -EINVAL;
+				pr_debug("Request chaining req %d decompress error %d\n", i, errors[i]);
+			} else {
+				dlens[i] = reqs[i]->dlen;
+				BUG_ON(dlens[i] != PAGE_SIZE);
+			}
+		}
+
+		goto reset_reqs;
+	}
+
+	/*
+	 * Asynchronously poll for and process IAA decompress job completions.
+	 */
+	while (!decompressions_done) {
+		decompressions_done = true;
+
+		for (i = 0; i < nr_pages; ++i) {
+			/*
+			 * Skip, if the decompression has already completed
+			 * successfully or with an error.
+			 */
+			if (errors[i] != -EAGAIN)
+				continue;
+
+			errors[i] = iaa_comp_poll(reqs[i]);
+
+			if (errors[i]) {
+				if (errors[i] == -EAGAIN)
+					decompressions_done = false;
+				else
+					err = -EINVAL;
+			} else {
+				dlens[i] = reqs[i]->dlen;
+				BUG_ON(dlens[i] != PAGE_SIZE);
+			}
+		}
+	}
+
+reset_reqs:
+	/*
+	 * For the same 'reqs[]' to be usable by
+	 * iaa_comp_acompress()/iaa_comp_adecompress(),
+	 * clear the CRYPTO_ACOMP_REQ_POLL bit on all acomp_reqs, and the
+	 * CRYPTO_TFM_REQ_CHAIN bit on the reqs[0].
+	 */
+	iaa_set_req_poll(reqs, nr_pages, false);
+	if (!async_poll)
+		acomp_reqchain_clear(reqs[0], wait);
+
+	return !err;
+}
+
 static int iaa_comp_init_fixed(struct crypto_acomp *acomp_tfm)
 {
 	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
@@ -1832,10 +2222,13 @@ static struct acomp_alg iaa_acomp_fixed_deflate = {
 	.compress		= iaa_comp_acompress,
 	.decompress		= iaa_comp_adecompress,
 	.dst_free               = dst_free,
+	.get_batch_size		= iaa_comp_get_batch_size,
+	.batch_compress		= iaa_comp_acompress_batch,
+	.batch_decompress	= iaa_comp_adecompress_batch,
 	.base			= {
 		.cra_name		= "deflate",
 		.cra_driver_name	= "deflate-iaa",
-		.cra_flags		= CRYPTO_ALG_ASYNC,
+		.cra_flags		= CRYPTO_ALG_ASYNC | CRYPTO_ALG_REQ_CHAIN,
 		.cra_ctxsize		= sizeof(struct iaa_compression_ctx),
 		.cra_module		= THIS_MODULE,
 		.cra_priority		= IAA_ALG_PRIORITY,
-- 
2.27.0




* [PATCH v5 05/12] crypto: iaa - Make async mode the default.
  2024-12-21  6:31 [PATCH v5 00/12] zswap IAA compress batching Kanchana P Sridhar
                   ` (3 preceding siblings ...)
  2024-12-21  6:31 ` [PATCH v5 04/12] crypto: iaa - Implement batch_compress(), batch_decompress() API in iaa_crypto Kanchana P Sridhar
@ 2024-12-21  6:31 ` Kanchana P Sridhar
  2024-12-21  6:31 ` [PATCH v5 06/12] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Kanchana P Sridhar @ 2024-12-21  6:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch makes "async" the default sync_mode for the iaa_crypto driver,
so that IAA hardware acceleration is loaded by default in the most
efficient/recommended mode for parallel compressions/decompressions, namely
asynchronous submission of descriptors, followed by polling for job
completions with request chaining. Earlier, "sync" used to be the default.

This way, anyone who wants to use IAA for zswap/zram can do so after
building the kernel, and without having to go through these steps to use
async request chaining:

  1) disable all the IAA device/wq bindings that happen at boot time
  2) rmmod iaa_crypto
  3) modprobe iaa_crypto
  4) echo async > /sys/bus/dsa/drivers/crypto/sync_mode
  5) re-run initialization of the IAA devices and wqs

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index b51b0b4b9ac3..6d49f82165fe 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -153,7 +153,7 @@ static DRIVER_ATTR_RW(verify_compress);
  */
 
 /* Use async mode */
-static bool async_mode;
+static bool async_mode = true;
 /* Use interrupts */
 static bool use_irq;
 
-- 
2.27.0




* [PATCH v5 06/12] crypto: iaa - Disable iaa_verify_compress by default.
  2024-12-21  6:31 [PATCH v5 00/12] zswap IAA compress batching Kanchana P Sridhar
                   ` (4 preceding siblings ...)
  2024-12-21  6:31 ` [PATCH v5 05/12] crypto: iaa - Make async mode the default Kanchana P Sridhar
@ 2024-12-21  6:31 ` Kanchana P Sridhar
  2024-12-21  6:31 ` [PATCH v5 07/12] crypto: iaa - Re-organize the iaa_crypto driver code Kanchana P Sridhar
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Kanchana P Sridhar @ 2024-12-21  6:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch disables "iaa_verify_compress" by default, so that IAA hardware
acceleration in the iaa_crypto driver is loaded without compress
verification, to facilitate performance comparisons with software
compressors (which also do not run compress verification by default).
Earlier, iaa_crypto compress verification used to be enabled by default.

With this patch, if users want to enable compress verification, they can do
so with these steps:

  1) disable all the IAA device/wq bindings that happen at boot time
  2) rmmod iaa_crypto
  3) modprobe iaa_crypto
  4) echo 1 > /sys/bus/dsa/drivers/crypto/verify_compress
  5) re-run initialization of the IAA devices and wqs

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 6d49f82165fe..f4807a828034 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -94,7 +94,7 @@ static bool iaa_crypto_enabled;
 static bool iaa_crypto_registered;
 
 /* Verify results of IAA compress or not */
-static bool iaa_verify_compress = true;
+static bool iaa_verify_compress = false;
 
 static ssize_t verify_compress_show(struct device_driver *driver, char *buf)
 {
-- 
2.27.0




* [PATCH v5 07/12] crypto: iaa - Re-organize the iaa_crypto driver code.
  2024-12-21  6:31 [PATCH v5 00/12] zswap IAA compress batching Kanchana P Sridhar
                   ` (5 preceding siblings ...)
  2024-12-21  6:31 ` [PATCH v5 06/12] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
@ 2024-12-21  6:31 ` Kanchana P Sridhar
  2024-12-21  6:31 ` [PATCH v5 08/12] crypto: iaa - Map IAA devices/wqs to cores based on packages instead of NUMA Kanchana P Sridhar
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Kanchana P Sridhar @ 2024-12-21  6:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch merely reorganizes the code in iaa_crypto_main.c, so that
the functions are consolidated into logically related sub-sections of
code.

This is expected to make the code more maintainable and to make it easier
to replace functional layers and/or add new features.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 540 +++++++++++----------
 1 file changed, 275 insertions(+), 265 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index f4807a828034..2c5b7ce041d6 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -24,6 +24,9 @@
 
 #define IAA_ALG_PRIORITY               300
 
+/**************************************
+ * Driver internal global variables.
+ **************************************/
 /* number of iaa instances probed */
 static unsigned int nr_iaa;
 static unsigned int nr_cpus;
@@ -36,55 +39,46 @@ static unsigned int cpus_per_iaa;
 static struct crypto_comp *deflate_generic_tfm;
 
 /* Per-cpu lookup table for balanced wqs */
-static struct wq_table_entry __percpu *wq_table;
+static struct wq_table_entry __percpu *wq_table = NULL;
 
-static struct idxd_wq *wq_table_next_wq(int cpu)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
-	if (++entry->cur_wq >= entry->n_wqs)
-		entry->cur_wq = 0;
-
-	if (!entry->wqs[entry->cur_wq])
-		return NULL;
-
-	pr_debug("%s: returning wq at idx %d (iaa wq %d.%d) from cpu %d\n", __func__,
-		 entry->cur_wq, entry->wqs[entry->cur_wq]->idxd->id,
-		 entry->wqs[entry->cur_wq]->id, cpu);
-
-	return entry->wqs[entry->cur_wq];
-}
-
-static void wq_table_add(int cpu, struct idxd_wq *wq)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
-	if (WARN_ON(entry->n_wqs == entry->max_wqs))
-		return;
-
-	entry->wqs[entry->n_wqs++] = wq;
-
-	pr_debug("%s: added iaa wq %d.%d to idx %d of cpu %d\n", __func__,
-		 entry->wqs[entry->n_wqs - 1]->idxd->id,
-		 entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
-}
-
-static void wq_table_free_entry(int cpu)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+/* Verify results of IAA compress or not */
+static bool iaa_verify_compress = false;
 
-	kfree(entry->wqs);
-	memset(entry, 0, sizeof(*entry));
-}
+/*
+ * The iaa crypto driver supports three 'sync' methods determining how
+ * compressions and decompressions are performed:
+ *
+ * - sync:      the compression or decompression completes before
+ *              returning.  This is the mode used by the async crypto
+ *              interface when the sync mode is set to 'sync' and by
+ *              the sync crypto interface regardless of setting.
+ *
+ * - async:     the compression or decompression is submitted and returns
+ *              immediately.  Completion interrupts are not used so
+ *              the caller is responsible for polling the descriptor
+ *              for completion.  This mode is applicable to only the
+ *              async crypto interface and is ignored for anything
+ *              else.
+ *
+ * - async_irq: the compression or decompression is submitted and
+ *              returns immediately.  Completion interrupts are
+ *              enabled so the caller can wait for the completion and
+ *              yield to other threads.  When the compression or
+ *              decompression completes, the completion is signaled
+ *              and the caller awakened.  This mode is applicable to
+ *              only the async crypto interface and is ignored for
+ *              anything else.
+ *
+ * These modes can be set using the iaa_crypto sync_mode driver
+ * attribute.
+ */
 
-static void wq_table_clear_entry(int cpu)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+/* Use async mode */
+static bool async_mode = true;
+/* Use interrupts */
+static bool use_irq;
 
-	entry->n_wqs = 0;
-	entry->cur_wq = 0;
-	memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
-}
+static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
 
 LIST_HEAD(iaa_devices);
 DEFINE_MUTEX(iaa_devices_lock);
@@ -93,9 +87,9 @@ DEFINE_MUTEX(iaa_devices_lock);
 static bool iaa_crypto_enabled;
 static bool iaa_crypto_registered;
 
-/* Verify results of IAA compress or not */
-static bool iaa_verify_compress = false;
-
+/**************************************************
+ * Driver attributes along with get/set functions.
+ **************************************************/
 static ssize_t verify_compress_show(struct device_driver *driver, char *buf)
 {
 	return sprintf(buf, "%d\n", iaa_verify_compress);
@@ -123,40 +117,6 @@ static ssize_t verify_compress_store(struct device_driver *driver,
 }
 static DRIVER_ATTR_RW(verify_compress);
 
-/*
- * The iaa crypto driver supports three 'sync' methods determining how
- * compressions and decompressions are performed:
- *
- * - sync:      the compression or decompression completes before
- *              returning.  This is the mode used by the async crypto
- *              interface when the sync mode is set to 'sync' and by
- *              the sync crypto interface regardless of setting.
- *
- * - async:     the compression or decompression is submitted and returns
- *              immediately.  Completion interrupts are not used so
- *              the caller is responsible for polling the descriptor
- *              for completion.  This mode is applicable to only the
- *              async crypto interface and is ignored for anything
- *              else.
- *
- * - async_irq: the compression or decompression is submitted and
- *              returns immediately.  Completion interrupts are
- *              enabled so the caller can wait for the completion and
- *              yield to other threads.  When the compression or
- *              decompression completes, the completion is signaled
- *              and the caller awakened.  This mode is applicable to
- *              only the async crypto interface and is ignored for
- *              anything else.
- *
- * These modes can be set using the iaa_crypto sync_mode driver
- * attribute.
- */
-
-/* Use async mode */
-static bool async_mode = true;
-/* Use interrupts */
-static bool use_irq;
-
 /**
  * set_iaa_sync_mode - Set IAA sync mode
  * @name: The name of the sync mode
@@ -219,8 +179,9 @@ static ssize_t sync_mode_store(struct device_driver *driver,
 }
 static DRIVER_ATTR_RW(sync_mode);
 
-static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
-
+/****************************
+ * Driver compression modes.
+ ****************************/
 static int find_empty_iaa_compression_mode(void)
 {
 	int i = -EINVAL;
@@ -411,11 +372,6 @@ static void free_device_compression_mode(struct iaa_device *iaa_device,
 						IDXD_OP_FLAG_WR_SRC2_AECS_COMP | \
 						IDXD_OP_FLAG_AECS_RW_TGLS)
 
-static int check_completion(struct device *dev,
-			    struct iax_completion_record *comp,
-			    bool compress,
-			    bool only_once);
-
 static int init_device_compression_mode(struct iaa_device *iaa_device,
 					struct iaa_compression_mode *mode,
 					int idx, struct idxd_wq *wq)
@@ -502,6 +458,10 @@ static void remove_device_compression_modes(struct iaa_device *iaa_device)
 	}
 }
 
+/***********************************************************
+ * Functions for use in crypto probe and remove interfaces:
+ * allocate/init/query/deallocate devices/wqs.
+ ***********************************************************/
 static struct iaa_device *iaa_device_alloc(void)
 {
 	struct iaa_device *iaa_device;
@@ -614,16 +574,6 @@ static void del_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
 	}
 }
 
-static void clear_wq_table(void)
-{
-	int cpu;
-
-	for (cpu = 0; cpu < nr_cpus; cpu++)
-		wq_table_clear_entry(cpu);
-
-	pr_debug("cleared wq table\n");
-}
-
 static void free_iaa_device(struct iaa_device *iaa_device)
 {
 	if (!iaa_device)
@@ -704,43 +654,6 @@ static int iaa_wq_put(struct idxd_wq *wq)
 	return ret;
 }
 
-static void free_wq_table(void)
-{
-	int cpu;
-
-	for (cpu = 0; cpu < nr_cpus; cpu++)
-		wq_table_free_entry(cpu);
-
-	free_percpu(wq_table);
-
-	pr_debug("freed wq table\n");
-}
-
-static int alloc_wq_table(int max_wqs)
-{
-	struct wq_table_entry *entry;
-	int cpu;
-
-	wq_table = alloc_percpu(struct wq_table_entry);
-	if (!wq_table)
-		return -ENOMEM;
-
-	for (cpu = 0; cpu < nr_cpus; cpu++) {
-		entry = per_cpu_ptr(wq_table, cpu);
-		entry->wqs = kcalloc(max_wqs, sizeof(struct wq *), GFP_KERNEL);
-		if (!entry->wqs) {
-			free_wq_table();
-			return -ENOMEM;
-		}
-
-		entry->max_wqs = max_wqs;
-	}
-
-	pr_debug("initialized wq table\n");
-
-	return 0;
-}
-
 static int save_iaa_wq(struct idxd_wq *wq)
 {
 	struct iaa_device *iaa_device, *found = NULL;
@@ -829,6 +742,87 @@ static void remove_iaa_wq(struct idxd_wq *wq)
 		cpus_per_iaa = 1;
 }
 
+/***************************************************************
+ * Mapping IAA devices and wqs to cores with per-cpu wq_tables.
+ ***************************************************************/
+static void wq_table_free_entry(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+	kfree(entry->wqs);
+	memset(entry, 0, sizeof(*entry));
+}
+
+static void wq_table_clear_entry(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+	entry->n_wqs = 0;
+	entry->cur_wq = 0;
+	memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
+}
+
+static void clear_wq_table(void)
+{
+	int cpu;
+
+	for (cpu = 0; cpu < nr_cpus; cpu++)
+		wq_table_clear_entry(cpu);
+
+	pr_debug("cleared wq table\n");
+}
+
+static void free_wq_table(void)
+{
+	int cpu;
+
+	for (cpu = 0; cpu < nr_cpus; cpu++)
+		wq_table_free_entry(cpu);
+
+	free_percpu(wq_table);
+
+	pr_debug("freed wq table\n");
+}
+
+static int alloc_wq_table(int max_wqs)
+{
+	struct wq_table_entry *entry;
+	int cpu;
+
+	wq_table = alloc_percpu(struct wq_table_entry);
+	if (!wq_table)
+		return -ENOMEM;
+
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		entry = per_cpu_ptr(wq_table, cpu);
+		entry->wqs = kcalloc(max_wqs, sizeof(struct wq *), GFP_KERNEL);
+		if (!entry->wqs) {
+			free_wq_table();
+			return -ENOMEM;
+		}
+
+		entry->max_wqs = max_wqs;
+	}
+
+	pr_debug("initialized wq table\n");
+
+	return 0;
+}
+
+static void wq_table_add(int cpu, struct idxd_wq *wq)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+	if (WARN_ON(entry->n_wqs == entry->max_wqs))
+		return;
+
+	entry->wqs[entry->n_wqs++] = wq;
+
+	pr_debug("%s: added iaa wq %d.%d to idx %d of cpu %d\n", __func__,
+		 entry->wqs[entry->n_wqs - 1]->idxd->id,
+		 entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
+}
+
 static int wq_table_add_wqs(int iaa, int cpu)
 {
 	struct iaa_device *iaa_device, *found_device = NULL;
@@ -939,6 +933,29 @@ static void rebalance_wq_table(void)
 	}
 }
 
+/***************************************************************
+ * Assign work-queues for driver ops using per-cpu wq_tables.
+ ***************************************************************/
+static struct idxd_wq *wq_table_next_wq(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+	if (++entry->cur_wq >= entry->n_wqs)
+		entry->cur_wq = 0;
+
+	if (!entry->wqs[entry->cur_wq])
+		return NULL;
+
+	pr_debug("%s: returning wq at idx %d (iaa wq %d.%d) from cpu %d\n", __func__,
+		 entry->cur_wq, entry->wqs[entry->cur_wq]->idxd->id,
+		 entry->wqs[entry->cur_wq]->id, cpu);
+
+	return entry->wqs[entry->cur_wq];
+}
+
+/*************************************************
+ * Core iaa_crypto compress/decompress functions.
+ *************************************************/
 static inline int check_completion(struct device *dev,
 				   struct iax_completion_record *comp,
 				   bool compress,
@@ -1020,13 +1037,130 @@ static int deflate_generic_decompress(struct acomp_req *req)
 
 static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
 				struct acomp_req *req,
-				dma_addr_t *src_addr, dma_addr_t *dst_addr);
+				dma_addr_t *src_addr, dma_addr_t *dst_addr)
+{
+	int ret = 0;
+	int nr_sgs;
+
+	dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
+	dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
+
+	nr_sgs = dma_map_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
+	if (nr_sgs <= 0 || nr_sgs > 1) {
+		dev_dbg(dev, "verify: couldn't map src sg for iaa device %d,"
+			" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
+			iaa_wq->wq->id, ret);
+		ret = -EIO;
+		goto out;
+	}
+	*src_addr = sg_dma_address(req->src);
+	dev_dbg(dev, "verify: dma_map_sg, src_addr %llx, nr_sgs %d, req->src %p,"
+		" req->slen %d, sg_dma_len(sg) %d\n", *src_addr, nr_sgs,
+		req->src, req->slen, sg_dma_len(req->src));
+
+	nr_sgs = dma_map_sg(dev, req->dst, sg_nents(req->dst), DMA_TO_DEVICE);
+	if (nr_sgs <= 0 || nr_sgs > 1) {
+		dev_dbg(dev, "verify: couldn't map dst sg for iaa device %d,"
+			" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
+			iaa_wq->wq->id, ret);
+		ret = -EIO;
+		dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
+		goto out;
+	}
+	*dst_addr = sg_dma_address(req->dst);
+	dev_dbg(dev, "verify: dma_map_sg, dst_addr %llx, nr_sgs %d, req->dst %p,"
+		" req->dlen %d, sg_dma_len(sg) %d\n", *dst_addr, nr_sgs,
+		req->dst, req->dlen, sg_dma_len(req->dst));
+out:
+	return ret;
+}
 
 static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
 			       struct idxd_wq *wq,
 			       dma_addr_t src_addr, unsigned int slen,
 			       dma_addr_t dst_addr, unsigned int *dlen,
-			       u32 compression_crc);
+			       u32 compression_crc)
+{
+	struct iaa_device_compression_mode *active_compression_mode;
+	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
+	struct iaa_device *iaa_device;
+	struct idxd_desc *idxd_desc;
+	struct iax_hw_desc *desc;
+	struct idxd_device *idxd;
+	struct iaa_wq *iaa_wq;
+	struct pci_dev *pdev;
+	struct device *dev;
+	int ret = 0;
+
+	iaa_wq = idxd_wq_get_private(wq);
+	iaa_device = iaa_wq->iaa_device;
+	idxd = iaa_device->idxd;
+	pdev = idxd->pdev;
+	dev = &pdev->dev;
+
+	active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
+
+	idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
+	if (IS_ERR(idxd_desc)) {
+		dev_dbg(dev, "idxd descriptor allocation failed\n");
+		dev_dbg(dev, "iaa compress failed: ret=%ld\n",
+			PTR_ERR(idxd_desc));
+		return PTR_ERR(idxd_desc);
+	}
+	desc = idxd_desc->iax_hw;
+
+	/* Verify (optional) - decompress and check crc, suppress dest write */
+
+	desc->flags = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CC;
+	desc->opcode = IAX_OPCODE_DECOMPRESS;
+	desc->decompr_flags = IAA_DECOMP_FLAGS | IAA_DECOMP_SUPPRESS_OUTPUT;
+	desc->priv = 0;
+
+	desc->src1_addr = (u64)dst_addr;
+	desc->src1_size = *dlen;
+	desc->dst_addr = (u64)src_addr;
+	desc->max_dst_size = slen;
+	desc->completion_addr = idxd_desc->compl_dma;
+
+	dev_dbg(dev, "(verify) compression mode %s,"
+		" desc->src1_addr %llx, desc->src1_size %d,"
+		" desc->dst_addr %llx, desc->max_dst_size %d,"
+		" desc->src2_addr %llx, desc->src2_size %d\n",
+		active_compression_mode->name,
+		desc->src1_addr, desc->src1_size, desc->dst_addr,
+		desc->max_dst_size, desc->src2_addr, desc->src2_size);
+
+	ret = idxd_submit_desc(wq, idxd_desc);
+	if (ret) {
+		dev_dbg(dev, "submit_desc (verify) failed ret=%d\n", ret);
+		goto err;
+	}
+
+	ret = check_completion(dev, idxd_desc->iax_completion, false, false);
+	if (ret) {
+		dev_dbg(dev, "(verify) check_completion failed ret=%d\n", ret);
+		goto err;
+	}
+
+	if (compression_crc != idxd_desc->iax_completion->crc) {
+		ret = -EINVAL;
+		dev_dbg(dev, "(verify) iaa comp/decomp crc mismatch:"
+			" comp=0x%x, decomp=0x%x\n", compression_crc,
+			idxd_desc->iax_completion->crc);
+		print_hex_dump(KERN_INFO, "cmp-rec: ", DUMP_PREFIX_OFFSET,
+			       8, 1, idxd_desc->iax_completion, 64, 0);
+		goto err;
+	}
+
+	idxd_free_desc(wq, idxd_desc);
+out:
+	return ret;
+err:
+	idxd_free_desc(wq, idxd_desc);
+	dev_dbg(dev, "iaa compress failed: ret=%d\n", ret);
+
+	goto out;
+}
 
 static void iaa_desc_complete(struct idxd_desc *idxd_desc,
 			      enum idxd_complete_type comp_type,
@@ -1245,133 +1379,6 @@ static int iaa_compress(struct crypto_tfm *tfm,	struct acomp_req *req,
 	goto out;
 }
 
-static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
-				struct acomp_req *req,
-				dma_addr_t *src_addr, dma_addr_t *dst_addr)
-{
-	int ret = 0;
-	int nr_sgs;
-
-	dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
-	dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
-
-	nr_sgs = dma_map_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
-	if (nr_sgs <= 0 || nr_sgs > 1) {
-		dev_dbg(dev, "verify: couldn't map src sg for iaa device %d,"
-			" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
-			iaa_wq->wq->id, ret);
-		ret = -EIO;
-		goto out;
-	}
-	*src_addr = sg_dma_address(req->src);
-	dev_dbg(dev, "verify: dma_map_sg, src_addr %llx, nr_sgs %d, req->src %p,"
-		" req->slen %d, sg_dma_len(sg) %d\n", *src_addr, nr_sgs,
-		req->src, req->slen, sg_dma_len(req->src));
-
-	nr_sgs = dma_map_sg(dev, req->dst, sg_nents(req->dst), DMA_TO_DEVICE);
-	if (nr_sgs <= 0 || nr_sgs > 1) {
-		dev_dbg(dev, "verify: couldn't map dst sg for iaa device %d,"
-			" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
-			iaa_wq->wq->id, ret);
-		ret = -EIO;
-		dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
-		goto out;
-	}
-	*dst_addr = sg_dma_address(req->dst);
-	dev_dbg(dev, "verify: dma_map_sg, dst_addr %llx, nr_sgs %d, req->dst %p,"
-		" req->dlen %d, sg_dma_len(sg) %d\n", *dst_addr, nr_sgs,
-		req->dst, req->dlen, sg_dma_len(req->dst));
-out:
-	return ret;
-}
-
-static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
-			       struct idxd_wq *wq,
-			       dma_addr_t src_addr, unsigned int slen,
-			       dma_addr_t dst_addr, unsigned int *dlen,
-			       u32 compression_crc)
-{
-	struct iaa_device_compression_mode *active_compression_mode;
-	struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
-	struct iaa_device *iaa_device;
-	struct idxd_desc *idxd_desc;
-	struct iax_hw_desc *desc;
-	struct idxd_device *idxd;
-	struct iaa_wq *iaa_wq;
-	struct pci_dev *pdev;
-	struct device *dev;
-	int ret = 0;
-
-	iaa_wq = idxd_wq_get_private(wq);
-	iaa_device = iaa_wq->iaa_device;
-	idxd = iaa_device->idxd;
-	pdev = idxd->pdev;
-	dev = &pdev->dev;
-
-	active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
-
-	idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
-	if (IS_ERR(idxd_desc)) {
-		dev_dbg(dev, "idxd descriptor allocation failed\n");
-		dev_dbg(dev, "iaa compress failed: ret=%ld\n",
-			PTR_ERR(idxd_desc));
-		return PTR_ERR(idxd_desc);
-	}
-	desc = idxd_desc->iax_hw;
-
-	/* Verify (optional) - decompress and check crc, suppress dest write */
-
-	desc->flags = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CC;
-	desc->opcode = IAX_OPCODE_DECOMPRESS;
-	desc->decompr_flags = IAA_DECOMP_FLAGS | IAA_DECOMP_SUPPRESS_OUTPUT;
-	desc->priv = 0;
-
-	desc->src1_addr = (u64)dst_addr;
-	desc->src1_size = *dlen;
-	desc->dst_addr = (u64)src_addr;
-	desc->max_dst_size = slen;
-	desc->completion_addr = idxd_desc->compl_dma;
-
-	dev_dbg(dev, "(verify) compression mode %s,"
-		" desc->src1_addr %llx, desc->src1_size %d,"
-		" desc->dst_addr %llx, desc->max_dst_size %d,"
-		" desc->src2_addr %llx, desc->src2_size %d\n",
-		active_compression_mode->name,
-		desc->src1_addr, desc->src1_size, desc->dst_addr,
-		desc->max_dst_size, desc->src2_addr, desc->src2_size);
-
-	ret = idxd_submit_desc(wq, idxd_desc);
-	if (ret) {
-		dev_dbg(dev, "submit_desc (verify) failed ret=%d\n", ret);
-		goto err;
-	}
-
-	ret = check_completion(dev, idxd_desc->iax_completion, false, false);
-	if (ret) {
-		dev_dbg(dev, "(verify) check_completion failed ret=%d\n", ret);
-		goto err;
-	}
-
-	if (compression_crc != idxd_desc->iax_completion->crc) {
-		ret = -EINVAL;
-		dev_dbg(dev, "(verify) iaa comp/decomp crc mismatch:"
-			" comp=0x%x, decomp=0x%x\n", compression_crc,
-			idxd_desc->iax_completion->crc);
-		print_hex_dump(KERN_INFO, "cmp-rec: ", DUMP_PREFIX_OFFSET,
-			       8, 1, idxd_desc->iax_completion, 64, 0);
-		goto err;
-	}
-
-	idxd_free_desc(wq, idxd_desc);
-out:
-	return ret;
-err:
-	idxd_free_desc(wq, idxd_desc);
-	dev_dbg(dev, "iaa compress failed: ret=%d\n", ret);
-
-	goto out;
-}
-
 static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 			  struct idxd_wq *wq,
 			  dma_addr_t src_addr, unsigned int slen,
@@ -2197,6 +2204,9 @@ static bool iaa_comp_adecompress_batch(
 	return !err;
 }
 
+/*********************************************
+ * Interfaces to crypto_alg and crypto_acomp.
+ *********************************************/
 static int iaa_comp_init_fixed(struct crypto_acomp *acomp_tfm)
 {
 	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
-- 
2.27.0




* [PATCH v5 08/12] crypto: iaa - Map IAA devices/wqs to cores based on packages instead of NUMA.
  2024-12-21  6:31 [PATCH v5 00/12] zswap IAA compress batching Kanchana P Sridhar
                   ` (6 preceding siblings ...)
  2024-12-21  6:31 ` [PATCH v5 07/12] crypto: iaa - Re-organize the iaa_crypto driver code Kanchana P Sridhar
@ 2024-12-21  6:31 ` Kanchana P Sridhar
  2024-12-21  6:31 ` [PATCH v5 09/12] crypto: iaa - Distribute compress jobs from all cores to all IAAs on a package Kanchana P Sridhar
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Kanchana P Sridhar @ 2024-12-21  6:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch modifies the algorithm for mapping available IAA devices and
wqs to cores, as they are being discovered, based on packages instead of
NUMA nodes. This leads to a more realistic mapping of IAA devices as
compression/decompression resources for a package, rather than for a NUMA
node. This also resolves problems that were observed during internal
validation on Intel platforms with many more NUMA nodes than packages: for
such cases, the earlier NUMA-based allocation caused some IAAs to be
over-subscribed and some to not be utilized at all.

As a result of this change from NUMA to packages, some of the core
functions used by the iaa_crypto driver's "probe" and "remove" APIs
have been rewritten. The new infrastructure maintains a static/global
mapping of "local wqs" per IAA device, in the "struct iaa_device" itself.
The earlier implementation would allocate memory per-cpu for this data,
which never changes once the IAA devices/wqs have been initialized.

Two main outcomes from this new iaa_crypto driver infrastructure are:

1) Resolves "task blocked for more than x seconds" errors observed during
   internal validation on Intel systems with the earlier NUMA node-based
   mappings, which were root-caused to the non-optimal IAA-to-core mappings
   described earlier.

2) Results in a NUM_THREADS factor reduction in memory footprint cost of
   initializing IAA devices/wqs, due to eliminating the per-cpu copies of
   each IAA device's wqs. On a 384 cores Intel Granite Rapids server with
   8 IAA devices, this saves 140MiB.
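
As a worked example of the new package-based mapping (the 2-package
topology and contiguous per-package CPU numbering are assumptions made for
this illustration): with 384 CPUs and 8 IAA devices,

  nr_cpus_per_package = 384 / 2 = 192
  nr_iaa_per_package  = 8 / 2   = 4
  cpus_per_iaa        = 384 / 8 = 48

so CPU 200 on package 1 maps to IAA 4 + ((200 % 192) / 48) = IAA 4, i.e.,
each package's CPUs are spread evenly across that package's own IAAs.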

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto.h      |  17 +-
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 276 ++++++++++++---------
 2 files changed, 171 insertions(+), 122 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index b3b67c44ec8a..74d25e62df12 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -55,6 +55,7 @@ struct iaa_wq {
 	struct idxd_wq		*wq;
 	int			ref;
 	bool			remove;
+	bool			mapped;
 
 	struct iaa_device	*iaa_device;
 
@@ -72,6 +73,13 @@ struct iaa_device_compression_mode {
 	dma_addr_t			aecs_comp_table_dma_addr;
 };
 
+struct wq_table_entry {
+	struct idxd_wq **wqs;
+	int	max_wqs;
+	int	n_wqs;
+	int	cur_wq;
+};
+
 /* Representation of IAA device with wqs, populated by probe */
 struct iaa_device {
 	struct list_head		list;
@@ -82,19 +90,14 @@ struct iaa_device {
 	int				n_wq;
 	struct list_head		wqs;
 
+	struct wq_table_entry		*iaa_local_wqs;
+
 	atomic64_t			comp_calls;
 	atomic64_t			comp_bytes;
 	atomic64_t			decomp_calls;
 	atomic64_t			decomp_bytes;
 };
 
-struct wq_table_entry {
-	struct idxd_wq **wqs;
-	int	max_wqs;
-	int	n_wqs;
-	int	cur_wq;
-};
-
 #define IAA_AECS_ALIGN			32
 
 /*
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 2c5b7ce041d6..418f78454875 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -30,8 +30,9 @@
 /* number of iaa instances probed */
 static unsigned int nr_iaa;
 static unsigned int nr_cpus;
-static unsigned int nr_nodes;
-static unsigned int nr_cpus_per_node;
+static unsigned int nr_packages;
+static unsigned int nr_cpus_per_package;
+static unsigned int nr_iaa_per_package;
 
 /* Number of physical cpus sharing each iaa instance */
 static unsigned int cpus_per_iaa;
@@ -462,17 +463,46 @@ static void remove_device_compression_modes(struct iaa_device *iaa_device)
  * Functions for use in crypto probe and remove interfaces:
  * allocate/init/query/deallocate devices/wqs.
  ***********************************************************/
-static struct iaa_device *iaa_device_alloc(void)
+static struct iaa_device *iaa_device_alloc(struct idxd_device *idxd)
 {
+	struct wq_table_entry *local;
 	struct iaa_device *iaa_device;
 
 	iaa_device = kzalloc(sizeof(*iaa_device), GFP_KERNEL);
 	if (!iaa_device)
-		return NULL;
+		goto err;
+
+	iaa_device->idxd = idxd;
+
+	/* IAA device's local wqs. */
+	iaa_device->iaa_local_wqs = kzalloc(sizeof(struct wq_table_entry), GFP_KERNEL);
+	if (!iaa_device->iaa_local_wqs)
+		goto err;
+
+	local = iaa_device->iaa_local_wqs;
+
+	local->wqs = kzalloc(iaa_device->idxd->max_wqs * sizeof(struct wq *), GFP_KERNEL);
+	if (!local->wqs)
+		goto err;
+
+	local->max_wqs = iaa_device->idxd->max_wqs;
+	local->n_wqs = 0;
 
 	INIT_LIST_HEAD(&iaa_device->wqs);
 
 	return iaa_device;
+
+err:
+	if (iaa_device) {
+		if (iaa_device->iaa_local_wqs) {
+			if (iaa_device->iaa_local_wqs->wqs)
+				kfree(iaa_device->iaa_local_wqs->wqs);
+			kfree(iaa_device->iaa_local_wqs);
+		}
+		kfree(iaa_device);
+	}
+
+	return NULL;
 }
 
 static bool iaa_has_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
@@ -491,12 +521,10 @@ static struct iaa_device *add_iaa_device(struct idxd_device *idxd)
 {
 	struct iaa_device *iaa_device;
 
-	iaa_device = iaa_device_alloc();
+	iaa_device = iaa_device_alloc(idxd);
 	if (!iaa_device)
 		return NULL;
 
-	iaa_device->idxd = idxd;
-
 	list_add_tail(&iaa_device->list, &iaa_devices);
 
 	nr_iaa++;
@@ -537,6 +565,7 @@ static int add_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq,
 	iaa_wq->wq = wq;
 	iaa_wq->iaa_device = iaa_device;
 	idxd_wq_set_private(wq, iaa_wq);
+	iaa_wq->mapped = false;
 
 	list_add_tail(&iaa_wq->list, &iaa_device->wqs);
 
@@ -580,6 +609,13 @@ static void free_iaa_device(struct iaa_device *iaa_device)
 		return;
 
 	remove_device_compression_modes(iaa_device);
+
+	if (iaa_device->iaa_local_wqs) {
+		if (iaa_device->iaa_local_wqs->wqs)
+			kfree(iaa_device->iaa_local_wqs->wqs);
+		kfree(iaa_device->iaa_local_wqs);
+	}
+
 	kfree(iaa_device);
 }
 
@@ -716,9 +752,14 @@ static int save_iaa_wq(struct idxd_wq *wq)
 	if (WARN_ON(nr_iaa == 0))
 		return -EINVAL;
 
-	cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
+	cpus_per_iaa = (nr_packages * nr_cpus_per_package) / nr_iaa;
 	if (!cpus_per_iaa)
 		cpus_per_iaa = 1;
+
+	nr_iaa_per_package = nr_iaa / nr_packages;
+	if (!nr_iaa_per_package)
+		nr_iaa_per_package = 1;
+
 out:
 	return 0;
 }
@@ -735,53 +776,45 @@ static void remove_iaa_wq(struct idxd_wq *wq)
 	}
 
 	if (nr_iaa) {
-		cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
+		cpus_per_iaa = (nr_packages * nr_cpus_per_package) / nr_iaa;
 		if (!cpus_per_iaa)
 			cpus_per_iaa = 1;
-	} else
+
+		nr_iaa_per_package = nr_iaa / nr_packages;
+		if (!nr_iaa_per_package)
+			nr_iaa_per_package = 1;
+	} else {
 		cpus_per_iaa = 1;
+		nr_iaa_per_package = 1;
+	}
 }
 
 /***************************************************************
  * Mapping IAA devices and wqs to cores with per-cpu wq_tables.
  ***************************************************************/
-static void wq_table_free_entry(int cpu)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
-	kfree(entry->wqs);
-	memset(entry, 0, sizeof(*entry));
-}
-
-static void wq_table_clear_entry(int cpu)
-{
-	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
-	entry->n_wqs = 0;
-	entry->cur_wq = 0;
-	memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
-}
-
-static void clear_wq_table(void)
+/*
+ * Given a cpu, find the closest IAA instance.  The idea is to try to
+ * choose the most appropriate IAA instance for a caller and spread
+ * available workqueues around to clients.
+ */
+static inline int cpu_to_iaa(int cpu)
 {
-	int cpu;
-
-	for (cpu = 0; cpu < nr_cpus; cpu++)
-		wq_table_clear_entry(cpu);
+	int package_id, base_iaa, iaa = 0;
 
-	pr_debug("cleared wq table\n");
-}
+	if (!nr_packages || !nr_iaa_per_package)
+		return 0;
 
-static void free_wq_table(void)
-{
-	int cpu;
+	package_id = topology_logical_package_id(cpu);
+	base_iaa = package_id * nr_iaa_per_package;
+	iaa = base_iaa + ((cpu % nr_cpus_per_package) / cpus_per_iaa);
 
-	for (cpu = 0; cpu < nr_cpus; cpu++)
-		wq_table_free_entry(cpu);
+	pr_debug("cpu = %d, package_id = %d, base_iaa = %d, iaa = %d",
+		 cpu, package_id, base_iaa, iaa);
 
-	free_percpu(wq_table);
+	if (iaa >= 0 && iaa < nr_iaa)
+		return iaa;
 
-	pr_debug("freed wq table\n");
+	return (nr_iaa - 1);
 }
 
 static int alloc_wq_table(int max_wqs)
@@ -795,13 +828,11 @@ static int alloc_wq_table(int max_wqs)
 
 	for (cpu = 0; cpu < nr_cpus; cpu++) {
 		entry = per_cpu_ptr(wq_table, cpu);
-		entry->wqs = kcalloc(max_wqs, sizeof(struct wq *), GFP_KERNEL);
-		if (!entry->wqs) {
-			free_wq_table();
-			return -ENOMEM;
-		}
 
+		entry->wqs = NULL;
 		entry->max_wqs = max_wqs;
+		entry->n_wqs = 0;
+		entry->cur_wq = 0;
 	}
 
 	pr_debug("initialized wq table\n");
@@ -809,33 +840,27 @@ static int alloc_wq_table(int max_wqs)
 	return 0;
 }
 
-static void wq_table_add(int cpu, struct idxd_wq *wq)
+static void wq_table_add(int cpu, struct wq_table_entry *iaa_local_wqs)
 {
 	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
 
-	if (WARN_ON(entry->n_wqs == entry->max_wqs))
-		return;
-
-	entry->wqs[entry->n_wqs++] = wq;
+	entry->wqs = iaa_local_wqs->wqs;
+	entry->max_wqs = iaa_local_wqs->max_wqs;
+	entry->n_wqs = iaa_local_wqs->n_wqs;
+	entry->cur_wq = 0;
 
-	pr_debug("%s: added iaa wq %d.%d to idx %d of cpu %d\n", __func__,
+	pr_debug("%s: cpu %d: added %d iaa local wqs up to wq %d.%d\n", __func__,
+		 cpu, entry->n_wqs,
 		 entry->wqs[entry->n_wqs - 1]->idxd->id,
-		 entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
+		 entry->wqs[entry->n_wqs - 1]->id);
 }
 
 static int wq_table_add_wqs(int iaa, int cpu)
 {
 	struct iaa_device *iaa_device, *found_device = NULL;
-	int ret = 0, cur_iaa = 0, n_wqs_added = 0;
-	struct idxd_device *idxd;
-	struct iaa_wq *iaa_wq;
-	struct pci_dev *pdev;
-	struct device *dev;
+	int ret = 0, cur_iaa = 0;
 
 	list_for_each_entry(iaa_device, &iaa_devices, list) {
-		idxd = iaa_device->idxd;
-		pdev = idxd->pdev;
-		dev = &pdev->dev;
 
 		if (cur_iaa != iaa) {
 			cur_iaa++;
@@ -843,7 +868,8 @@ static int wq_table_add_wqs(int iaa, int cpu)
 		}
 
 		found_device = iaa_device;
-		dev_dbg(dev, "getting wq from iaa_device %d, cur_iaa %d\n",
+		dev_dbg(&found_device->idxd->pdev->dev,
+			"getting wq from iaa_device %d, cur_iaa %d\n",
 			found_device->idxd->id, cur_iaa);
 		break;
 	}
@@ -858,29 +884,58 @@ static int wq_table_add_wqs(int iaa, int cpu)
 		}
 		cur_iaa = 0;
 
-		idxd = found_device->idxd;
-		pdev = idxd->pdev;
-		dev = &pdev->dev;
-		dev_dbg(dev, "getting wq from only iaa_device %d, cur_iaa %d\n",
+		dev_dbg(&found_device->idxd->pdev->dev,
+			"getting wq from only iaa_device %d, cur_iaa %d\n",
 			found_device->idxd->id, cur_iaa);
 	}
 
-	list_for_each_entry(iaa_wq, &found_device->wqs, list) {
-		wq_table_add(cpu, iaa_wq->wq);
-		pr_debug("rebalance: added wq for cpu=%d: iaa wq %d.%d\n",
-			 cpu, iaa_wq->wq->idxd->id, iaa_wq->wq->id);
-		n_wqs_added++;
+	wq_table_add(cpu, found_device->iaa_local_wqs);
+
+out:
+	return ret;
+}
+
+static int map_iaa_device_wqs(struct iaa_device *iaa_device)
+{
+	struct wq_table_entry *local;
+	int ret = 0, n_wqs_added = 0;
+	struct iaa_wq *iaa_wq;
+
+	local = iaa_device->iaa_local_wqs;
+
+	list_for_each_entry(iaa_wq, &iaa_device->wqs, list) {
+		if (iaa_wq->mapped && ++n_wqs_added)
+			continue;
+
+		pr_debug("iaa_device %px: processing wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+
+		if (WARN_ON(local->n_wqs == local->max_wqs))
+			break;
+
+		local->wqs[local->n_wqs++] = iaa_wq->wq;
+		pr_debug("iaa_device %px: added local wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+
+		iaa_wq->mapped = true;
+		++n_wqs_added;
 	}
 
-	if (!n_wqs_added) {
-		pr_debug("couldn't find any iaa wqs!\n");
+	if (!n_wqs_added && !iaa_device->n_wq) {
+		pr_debug("iaa_device %d: couldn't find any iaa wqs!\n", iaa_device->idxd->id);
 		ret = -EINVAL;
-		goto out;
 	}
-out:
+
 	return ret;
 }
 
+static void map_iaa_devices(void)
+{
+	struct iaa_device *iaa_device;
+
+	list_for_each_entry(iaa_device, &iaa_devices, list) {
+		BUG_ON(map_iaa_device_wqs(iaa_device));
+	}
+}
+
 /*
  * Rebalance the wq table so that given a cpu, it's easy to find the
  * closest IAA instance.  The idea is to try to choose the most
@@ -889,48 +944,42 @@ static int wq_table_add_wqs(int iaa, int cpu)
  */
 static void rebalance_wq_table(void)
 {
-	const struct cpumask *node_cpus;
-	int node, cpu, iaa = -1;
+	int cpu, iaa;
 
 	if (nr_iaa == 0)
 		return;
 
-	pr_debug("rebalance: nr_nodes=%d, nr_cpus %d, nr_iaa %d, cpus_per_iaa %d\n",
-		 nr_nodes, nr_cpus, nr_iaa, cpus_per_iaa);
+	map_iaa_devices();
 
-	clear_wq_table();
+	pr_debug("rebalance: nr_packages=%d, nr_cpus %d, nr_iaa %d, cpus_per_iaa %d\n",
+		 nr_packages, nr_cpus, nr_iaa, cpus_per_iaa);
 
-	if (nr_iaa == 1) {
-		for (cpu = 0; cpu < nr_cpus; cpu++) {
-			if (WARN_ON(wq_table_add_wqs(0, cpu))) {
-				pr_debug("could not add any wqs for iaa 0 to cpu %d!\n", cpu);
-				return;
-			}
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		iaa = cpu_to_iaa(cpu);
+		pr_debug("rebalance: cpu=%d iaa=%d\n", cpu, iaa);
+
+		if (WARN_ON(iaa == -1)) {
+			pr_debug("rebalance (cpu_to_iaa(%d)) failed!\n", cpu);
+			return;
 		}
 
-		return;
+		if (WARN_ON(wq_table_add_wqs(iaa, cpu))) {
+			pr_debug("could not add any wqs for iaa %d to cpu %d!\n", iaa, cpu);
+			return;
+		}
 	}
 
-	for_each_node_with_cpus(node) {
-		node_cpus = cpumask_of_node(node);
-
-		for (cpu = 0; cpu <  cpumask_weight(node_cpus); cpu++) {
-			int node_cpu = cpumask_nth(cpu, node_cpus);
-
-			if (WARN_ON(node_cpu >= nr_cpu_ids)) {
-				pr_debug("node_cpu %d doesn't exist!\n", node_cpu);
-				return;
-			}
-
-			if ((cpu % cpus_per_iaa) == 0)
-				iaa++;
+	pr_debug("Finished rebalance local wqs.");
+}
 
-			if (WARN_ON(wq_table_add_wqs(iaa, node_cpu))) {
-				pr_debug("could not add any wqs for iaa %d to cpu %d!\n", iaa, cpu);
-				return;
-			}
-		}
+static void free_wq_tables(void)
+{
+	if (wq_table) {
+		free_percpu(wq_table);
+		wq_table = NULL;
 	}
+
+	pr_debug("freed local wq table\n");
 }
 
 /***************************************************************
@@ -2347,7 +2396,7 @@ static int iaa_crypto_probe(struct idxd_dev *idxd_dev)
 	free_iaa_wq(idxd_wq_get_private(wq));
 err_save:
 	if (first_wq)
-		free_wq_table();
+		free_wq_tables();
 err_alloc:
 	mutex_unlock(&iaa_devices_lock);
 	idxd_drv_disable_wq(wq);
@@ -2397,7 +2446,9 @@ static void iaa_crypto_remove(struct idxd_dev *idxd_dev)
 
 	if (nr_iaa == 0) {
 		iaa_crypto_enabled = false;
-		free_wq_table();
+		free_wq_tables();
+		BUG_ON(!list_empty(&iaa_devices));
+		INIT_LIST_HEAD(&iaa_devices);
 		module_put(THIS_MODULE);
 
 		pr_info("iaa_crypto now DISABLED\n");
@@ -2423,16 +2474,11 @@ static struct idxd_device_driver iaa_crypto_driver = {
 static int __init iaa_crypto_init_module(void)
 {
 	int ret = 0;
-	int node;
+	INIT_LIST_HEAD(&iaa_devices);
 
 	nr_cpus = num_possible_cpus();
-	for_each_node_with_cpus(node)
-		nr_nodes++;
-	if (!nr_nodes) {
-		pr_err("IAA couldn't find any nodes with cpus\n");
-		return -ENODEV;
-	}
-	nr_cpus_per_node = nr_cpus / nr_nodes;
+	nr_cpus_per_package = topology_num_cores_per_package();
+	nr_packages = topology_max_packages();
 
 	if (crypto_has_comp("deflate-generic", 0, 0))
 		deflate_generic_tfm = crypto_alloc_comp("deflate-generic", 0, 0);
-- 
2.27.0



^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH v5 09/12] crypto: iaa - Distribute compress jobs from all cores to all IAAs on a package.
  2024-12-21  6:31 [PATCH v5 00/12] zswap IAA compress batching Kanchana P Sridhar
                   ` (7 preceding siblings ...)
  2024-12-21  6:31 ` [PATCH v5 08/12] crypto: iaa - Map IAA devices/wqs to cores based on packages instead of NUMA Kanchana P Sridhar
@ 2024-12-21  6:31 ` Kanchana P Sridhar
  2024-12-21  6:31 ` [PATCH v5 10/12] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching Kanchana P Sridhar
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 55+ messages in thread
From: Kanchana P Sridhar @ 2024-12-21  6:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This change enables processes running on any logical core on a package to
use all the IAA devices enabled on that package for compress jobs. In
other words, compressions originating from any process in a package will be
distributed in a round-robin manner to the available IAA devices on the
same package.

The main premise behind this change is to ensure that the compress engines
on every IAA device are neither left idle nor unevenly loaded. In other
words, the compress engines on all IAA devices are treated as a global
resource for that package, thus maximizing compression throughput.

This allows the use of all IAA devices present in a given package for
(batched) compressions originating from zswap/zram, from all cores
on this package.

A new per-cpu "global_wq_table" implements this in the iaa_crypto driver.
We can think of the global WQ per IAA as a WQ to which all cores on
that package can submit compress jobs.

To use this feature, the user must configure 2 WQs per IAA, which enables
distribution of compress jobs to multiple IAA devices.

Each IAA will have 2 WQs:
 wq.0 (local WQ):
   Used for decompress jobs from cores mapped by the cpu_to_iaa() "even
   balancing of logical cores to IAA devices" algorithm.

 wq.1 (global WQ):
   Used for compress jobs from *all* logical cores on that package.

The iaa_crypto driver will place all global WQs from all same-package IAA
devices in the global_wq_table per cpu on that package. When the driver
receives a compress job, it will look up the "next" global WQ in the cpu's
global_wq_table to submit the descriptor.

The starting wq in the global_wq_table for each cpu is the global wq
associated with the IAA nearest to it, so that we stagger the starting
global wq for each process. This results in very uniform usage of all IAAs
for compress jobs.
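
For illustration, the staggered starting index can be sketched as follows
(variable names mirror the patch below; this is a simplified sketch, not
the exact driver code):

    /* Offset into this package's global wq list where the cpu starts. */
    start_wq = g_wqs_per_iaa * (cpu_to_iaa(cpu) % nr_iaa_per_package);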

Two new driver module parameters are added for this feature:

g_wqs_per_iaa (default 0):

 /sys/bus/dsa/drivers/crypto/g_wqs_per_iaa

 This represents the number of global WQs that can be configured per IAA
 device. The recommended setting is 1 to enable the use of this feature
 once the user configures 2 WQs per IAA using higher level scripts as
 described in Documentation/driver-api/crypto/iaa/iaa-crypto.rst.

g_consec_descs_per_gwq (default 1):

 /sys/bus/dsa/drivers/crypto/g_consec_descs_per_gwq

 This represents the number of consecutive compress jobs that will be
 submitted to the same global WQ (i.e. to the same IAA device) from a given
 core, before moving to the next global WQ. The default is 1, which is also
 the recommended setting for this feature.
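
The per-cpu round-robin described above can be modeled with a minimal,
self-contained sketch (hypothetical names; in the driver, the equivalent
state lives in the per-cpu global_wq_table and num_consec_descs_per_wq):

    struct gwq_rr_state {
            int cur_wq;             /* index into the package's global wqs */
            int n_wqs;              /* number of global wqs on the package */
            int consec_descs;       /* jobs issued to cur_wq so far */
    };

    static int next_global_wq_index(struct gwq_rr_state *s, int consec_per_gwq)
    {
            if (s->consec_descs == consec_per_gwq) {
                    if (++s->cur_wq >= s->n_wqs)
                            s->cur_wq = 0;
                    s->consec_descs = 0;
            }
            s->consec_descs++;

            return s->cur_wq;
    }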

The decompress jobs from any core will be sent to the "local" IAA, namely
the one that the driver assigns with the cpu_to_iaa() mapping algorithm
that evenly balances the assignment of logical cores to IAA devices on a
package.

On a 2-package Sapphire Rapids server where each package has 56 cores and
4 IAA devices, this is how the compress/decompress jobs will be mapped
when the user configures 2 WQs per IAA device (which implies wq.1 will
be added to the global WQ table for each logical core on that package):

 package(s):        2
 package0 CPU(s):   0-55,112-167
 package1 CPU(s):   56-111,168-223

 Compress jobs:
 --------------
 package 0:
 iaa_crypto will send compress jobs from all cpus (0-55,112-167) to all IAA
 devices on the package (iax1/iax3/iax5/iax7) in a round-robin manner:
 iaa:   iax1           iax3           iax5           iax7

 package 1:
 iaa_crypto will send compress jobs from all cpus (56-111,168-223) to all
 IAA devices on the package (iax9/iax11/iax13/iax15) in a round-robin manner:
 iaa:   iax9           iax11          iax13           iax15

 Decompress jobs:
 ----------------
 package 0:
 cpu   0-13,112-125   14-27,126-139  28-41,140-153  42-55,154-167
 iaa:  iax1           iax3           iax5           iax7

 package 1:
 cpu   56-69,168-181  70-83,182-195  84-97,196-209   98-111,210-223
 iaa:  iax9           iax11          iax13           iax15
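
The decompress mapping above follows the even-balancing formula introduced
in the previous patch; a minimal model using the values of this example
topology (56 cores and 4 IAA devices per package, hence cpus_per_iaa = 14)
is shown below. This is a sketch only; in this example, base_iaa is 0 for
package 0 and 4 for package 1.

    static int cpu_to_iaa_model(int cpu, int base_iaa)
    {
            int nr_cpus_per_package = 56, cpus_per_iaa = 14;

            return base_iaa + ((cpu % nr_cpus_per_package) / cpus_per_iaa);
    }

    /*
     * e.g. cpu 14  -> 0 + (14  % 56) / 14 = 1  (iax3)
     *      cpu 126 -> 0 + (126 % 56) / 14 = 1  (iax3, SMT sibling of cpu 14)
     *      cpu 56  -> 4 + (56  % 56) / 14 = 4  (iax9 on package 1)
     */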

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto.h      |   1 +
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 385 ++++++++++++++++++++-
 2 files changed, 378 insertions(+), 8 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index 74d25e62df12..c46c70ecf355 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -91,6 +91,7 @@ struct iaa_device {
 	struct list_head		wqs;
 
 	struct wq_table_entry		*iaa_local_wqs;
+	struct wq_table_entry		*iaa_global_wqs;
 
 	atomic64_t			comp_calls;
 	atomic64_t			comp_bytes;
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 418f78454875..4ca9028d6050 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -42,6 +42,18 @@ static struct crypto_comp *deflate_generic_tfm;
 /* Per-cpu lookup table for balanced wqs */
 static struct wq_table_entry __percpu *wq_table = NULL;
 
+static struct wq_table_entry **pkg_global_wq_tables = NULL;
+
+/* Per-cpu lookup table for global wqs shared by all cpus. */
+static struct wq_table_entry __percpu *global_wq_table = NULL;
+
+/*
+ * Per-cpu counter of consecutive descriptors allocated to
+ * the same wq in the global_wq_table, so that we know
+ * when to switch to the next wq in the global_wq_table.
+ */
+static int __percpu *num_consec_descs_per_wq = NULL;
+
 /* Verify results of IAA compress or not */
 static bool iaa_verify_compress = false;
 
@@ -79,6 +91,16 @@ static bool async_mode = true;
 /* Use interrupts */
 static bool use_irq;
 
+/* Number of global wqs per iaa*/
+static int g_wqs_per_iaa = 0;
+
+/*
+ * Number of consecutive descriptors to allocate from a
+ * given global wq before switching to the next wq in
+ * the global_wq_table.
+ */
+static int g_consec_descs_per_gwq = 1;
+
 static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
 
 LIST_HEAD(iaa_devices);
@@ -180,6 +202,60 @@ static ssize_t sync_mode_store(struct device_driver *driver,
 }
 static DRIVER_ATTR_RW(sync_mode);
 
+static ssize_t g_wqs_per_iaa_show(struct device_driver *driver, char *buf)
+{
+	return sprintf(buf, "%d\n", g_wqs_per_iaa);
+}
+
+static ssize_t g_wqs_per_iaa_store(struct device_driver *driver,
+				   const char *buf, size_t count)
+{
+	int ret = -EBUSY;
+
+	mutex_lock(&iaa_devices_lock);
+
+	if (iaa_crypto_enabled)
+		goto out;
+
+	ret = kstrtoint(buf, 10, &g_wqs_per_iaa);
+	if (ret)
+		goto out;
+
+	ret = count;
+out:
+	mutex_unlock(&iaa_devices_lock);
+
+	return ret;
+}
+static DRIVER_ATTR_RW(g_wqs_per_iaa);
+
+static ssize_t g_consec_descs_per_gwq_show(struct device_driver *driver, char *buf)
+{
+	return sprintf(buf, "%d\n", g_consec_descs_per_gwq);
+}
+
+static ssize_t g_consec_descs_per_gwq_store(struct device_driver *driver,
+					    const char *buf, size_t count)
+{
+	int ret = -EBUSY;
+
+	mutex_lock(&iaa_devices_lock);
+
+	if (iaa_crypto_enabled)
+		goto out;
+
+	ret = kstrtoint(buf, 10, &g_consec_descs_per_gwq);
+	if (ret)
+		goto out;
+
+	ret = count;
+out:
+	mutex_unlock(&iaa_devices_lock);
+
+	return ret;
+}
+static DRIVER_ATTR_RW(g_consec_descs_per_gwq);
+
 /****************************
  * Driver compression modes.
  ****************************/
@@ -465,7 +541,7 @@ static void remove_device_compression_modes(struct iaa_device *iaa_device)
  ***********************************************************/
 static struct iaa_device *iaa_device_alloc(struct idxd_device *idxd)
 {
-	struct wq_table_entry *local;
+	struct wq_table_entry *local, *global;
 	struct iaa_device *iaa_device;
 
 	iaa_device = kzalloc(sizeof(*iaa_device), GFP_KERNEL);
@@ -488,6 +564,20 @@ static struct iaa_device *iaa_device_alloc(struct idxd_device *idxd)
 	local->max_wqs = iaa_device->idxd->max_wqs;
 	local->n_wqs = 0;
 
+	/* IAA device's global wqs. */
+	iaa_device->iaa_global_wqs = kzalloc(sizeof(struct wq_table_entry), GFP_KERNEL);
+	if (!iaa_device->iaa_global_wqs)
+		goto err;
+
+	global = iaa_device->iaa_global_wqs;
+
+	global->wqs = kzalloc(iaa_device->idxd->max_wqs * sizeof(struct wq *), GFP_KERNEL);
+	if (!global->wqs)
+		goto err;
+
+	global->max_wqs = iaa_device->idxd->max_wqs;
+	global->n_wqs = 0;
+
 	INIT_LIST_HEAD(&iaa_device->wqs);
 
 	return iaa_device;
@@ -499,6 +589,8 @@ static struct iaa_device *iaa_device_alloc(struct idxd_device *idxd)
 				kfree(iaa_device->iaa_local_wqs->wqs);
 			kfree(iaa_device->iaa_local_wqs);
 		}
+		if (iaa_device->iaa_global_wqs)
+			kfree(iaa_device->iaa_global_wqs);
 		kfree(iaa_device);
 	}
 
@@ -616,6 +708,12 @@ static void free_iaa_device(struct iaa_device *iaa_device)
 		kfree(iaa_device->iaa_local_wqs);
 	}
 
+	if (iaa_device->iaa_global_wqs) {
+		if (iaa_device->iaa_global_wqs->wqs)
+			kfree(iaa_device->iaa_global_wqs->wqs);
+		kfree(iaa_device->iaa_global_wqs);
+	}
+
 	kfree(iaa_device);
 }
 
@@ -817,6 +915,58 @@ static inline int cpu_to_iaa(int cpu)
 	return (nr_iaa - 1);
 }
 
+static void free_global_wq_table(void)
+{
+	if (global_wq_table) {
+		free_percpu(global_wq_table);
+		global_wq_table = NULL;
+	}
+
+	if (num_consec_descs_per_wq) {
+		free_percpu(num_consec_descs_per_wq);
+		num_consec_descs_per_wq = NULL;
+	}
+
+	pr_debug("freed global wq table\n");
+}
+
+static int pkg_global_wq_tables_alloc(void)
+{
+	int i, j;
+
+	pkg_global_wq_tables = kzalloc(nr_packages * sizeof(*pkg_global_wq_tables), GFP_KERNEL);
+	if (!pkg_global_wq_tables)
+		return -ENOMEM;
+
+	for (i = 0; i < nr_packages; ++i) {
+		pkg_global_wq_tables[i] = kzalloc(sizeof(struct wq_table_entry), GFP_KERNEL);
+
+		if (!pkg_global_wq_tables[i]) {
+			for (j = 0; j < i; ++j)
+				kfree(pkg_global_wq_tables[j]);
+			kfree(pkg_global_wq_tables);
+			pkg_global_wq_tables = NULL;
+			return -ENOMEM;
+		}
+		pkg_global_wq_tables[i]->wqs = NULL;
+	}
+
+	return 0;
+}
+
+static void pkg_global_wq_tables_dealloc(void)
+{
+	int i;
+
+	for (i = 0; i < nr_packages; ++i) {
+		if (pkg_global_wq_tables[i]->wqs)
+			kfree(pkg_global_wq_tables[i]->wqs);
+		kfree(pkg_global_wq_tables[i]);
+	}
+	kfree(pkg_global_wq_tables);
+	pkg_global_wq_tables = NULL;
+}
+
 static int alloc_wq_table(int max_wqs)
 {
 	struct wq_table_entry *entry;
@@ -835,6 +985,35 @@ static int alloc_wq_table(int max_wqs)
 		entry->cur_wq = 0;
 	}
 
+	global_wq_table = alloc_percpu(struct wq_table_entry);
+	if (!global_wq_table)
+		return 0;
+
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		entry = per_cpu_ptr(global_wq_table, cpu);
+
+		entry->wqs = NULL;
+		entry->max_wqs = max_wqs;
+		entry->n_wqs = 0;
+		entry->cur_wq = 0;
+	}
+
+	num_consec_descs_per_wq = alloc_percpu(int);
+	if (!num_consec_descs_per_wq) {
+		free_global_wq_table();
+		return 0;
+	}
+
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		int *num_consec_descs = per_cpu_ptr(num_consec_descs_per_wq, cpu);
+		*num_consec_descs = 0;
+	}
+
+	if (pkg_global_wq_tables_alloc()) {
+		free_global_wq_table();
+		return 0;
+	}
+
 	pr_debug("initialized wq table\n");
 
 	return 0;
@@ -895,13 +1074,120 @@ static int wq_table_add_wqs(int iaa, int cpu)
 	return ret;
 }
 
+static void pkg_global_wq_tables_reinit(void)
+{
+	int i, cur_iaa = 0, pkg = 0, nr_pkg_wqs = 0;
+	struct iaa_device *iaa_device;
+	struct wq_table_entry *global;
+
+	if (!pkg_global_wq_tables)
+		return;
+
+	/* Reallocate per-package wqs. */
+	list_for_each_entry(iaa_device, &iaa_devices, list) {
+		global = iaa_device->iaa_global_wqs;
+		nr_pkg_wqs += global->n_wqs;
+
+		if (++cur_iaa == nr_iaa_per_package) {
+			nr_pkg_wqs = nr_pkg_wqs ? max_t(int, iaa_device->idxd->max_wqs, nr_pkg_wqs) : 0;
+
+			if (pkg_global_wq_tables[pkg]->wqs) {
+				kfree(pkg_global_wq_tables[pkg]->wqs);
+				pkg_global_wq_tables[pkg]->wqs = NULL;
+			}
+
+			if (nr_pkg_wqs)
+				pkg_global_wq_tables[pkg]->wqs = kzalloc(nr_pkg_wqs *
+									 sizeof(struct wq *),
+									 GFP_KERNEL);
+
+			pkg_global_wq_tables[pkg]->n_wqs = 0;
+			pkg_global_wq_tables[pkg]->cur_wq = 0;
+			pkg_global_wq_tables[pkg]->max_wqs = nr_pkg_wqs;
+
+			if (++pkg == nr_packages)
+				break;
+			cur_iaa = 0;
+			nr_pkg_wqs = 0;
+		}
+	}
+
+	pkg = 0;
+	cur_iaa = 0;
+
+	/* Re-initialize per-package wqs. */
+	list_for_each_entry(iaa_device, &iaa_devices, list) {
+		global = iaa_device->iaa_global_wqs;
+
+		if (pkg_global_wq_tables[pkg]->wqs)
+			for (i = 0; i < global->n_wqs; ++i)
+				pkg_global_wq_tables[pkg]->wqs[pkg_global_wq_tables[pkg]->n_wqs++] = global->wqs[i];
+
+		pr_debug("pkg_global_wq_tables[%d] has %d wqs", pkg, pkg_global_wq_tables[pkg]->n_wqs);
+
+		if (++cur_iaa == nr_iaa_per_package) {
+			if (++pkg == nr_packages)
+				break;
+			cur_iaa = 0;
+		}
+	}
+}
+
+static void global_wq_table_add(int cpu, struct wq_table_entry *pkg_global_wq_table)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(global_wq_table, cpu);
+
+	/* This could be NULL. */
+	entry->wqs = pkg_global_wq_table->wqs;
+	entry->max_wqs = pkg_global_wq_table->max_wqs;
+	entry->n_wqs = pkg_global_wq_table->n_wqs;
+	entry->cur_wq = 0;
+
+	if (entry->wqs)
+		pr_debug("%s: cpu %d: added %d iaa global wqs up to wq %d.%d\n", __func__,
+			 cpu, entry->n_wqs,
+			 entry->wqs[entry->n_wqs - 1]->idxd->id,
+			 entry->wqs[entry->n_wqs - 1]->id);
+}
+
+static void global_wq_table_set_start_wq(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(global_wq_table, cpu);
+	int start_wq = g_wqs_per_iaa * (cpu_to_iaa(cpu) % nr_iaa_per_package);
+
+	if ((start_wq >= 0) && (start_wq < entry->n_wqs))
+		entry->cur_wq = start_wq;
+}
+
+static void global_wq_table_add_wqs(void)
+{
+	int cpu;
+
+	if (!pkg_global_wq_tables)
+		return;
+
+	for (cpu = 0; cpu < nr_cpus; cpu += nr_cpus_per_package) {
+		/* cpu's on the same package get the same global_wq_table. */
+		int package_id = topology_logical_package_id(cpu);
+		int pkg_cpu;
+
+		for (pkg_cpu = cpu; pkg_cpu < cpu + nr_cpus_per_package; ++pkg_cpu) {
+			if (pkg_global_wq_tables[package_id]->n_wqs > 0) {
+				global_wq_table_add(pkg_cpu, pkg_global_wq_tables[package_id]);
+				global_wq_table_set_start_wq(pkg_cpu);
+			}
+		}
+	}
+}
+
 static int map_iaa_device_wqs(struct iaa_device *iaa_device)
 {
-	struct wq_table_entry *local;
+	struct wq_table_entry *local, *global;
 	int ret = 0, n_wqs_added = 0;
 	struct iaa_wq *iaa_wq;
 
 	local = iaa_device->iaa_local_wqs;
+	global = iaa_device->iaa_global_wqs;
 
 	list_for_each_entry(iaa_wq, &iaa_device->wqs, list) {
 		if (iaa_wq->mapped && ++n_wqs_added)
@@ -909,11 +1195,18 @@ static int map_iaa_device_wqs(struct iaa_device *iaa_device)
 
 		pr_debug("iaa_device %px: processing wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
 
-		if (WARN_ON(local->n_wqs == local->max_wqs))
-			break;
+		if ((!n_wqs_added || ((n_wqs_added + g_wqs_per_iaa) < iaa_device->n_wq)) &&
+			(local->n_wqs < local->max_wqs)) {
+
+			local->wqs[local->n_wqs++] = iaa_wq->wq;
+			pr_debug("iaa_device %px: added local wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+		} else {
+			if (WARN_ON(global->n_wqs == global->max_wqs))
+				break;
 
-		local->wqs[local->n_wqs++] = iaa_wq->wq;
-		pr_debug("iaa_device %px: added local wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+			global->wqs[global->n_wqs++] = iaa_wq->wq;
+			pr_debug("iaa_device %px: added global wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+		}
 
 		iaa_wq->mapped = true;
 		++n_wqs_added;
@@ -969,6 +1262,10 @@ static void rebalance_wq_table(void)
 		}
 	}
 
+	if (iaa_crypto_enabled && pkg_global_wq_tables) {
+		pkg_global_wq_tables_reinit();
+		global_wq_table_add_wqs();
+	}
 	pr_debug("Finished rebalance local wqs.");
 }
 
@@ -979,7 +1276,17 @@ static void free_wq_tables(void)
 		wq_table = NULL;
 	}
 
-	pr_debug("freed local wq table\n");
+	if (global_wq_table) {
+		free_percpu(global_wq_table);
+		global_wq_table = NULL;
+	}
+
+	if (num_consec_descs_per_wq) {
+		free_percpu(num_consec_descs_per_wq);
+		num_consec_descs_per_wq = NULL;
+	}
+
+	pr_debug("freed wq tables\n");
 }
 
 /***************************************************************
@@ -1002,6 +1309,35 @@ static struct idxd_wq *wq_table_next_wq(int cpu)
 	return entry->wqs[entry->cur_wq];
 }
 
+/*
+ * Caller should make sure to call only if the
+ * per_cpu_ptr "global_wq_table" is non-NULL
+ * and has at least one wq configured.
+ */
+static struct idxd_wq *global_wq_table_next_wq(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(global_wq_table, cpu);
+	int *num_consec_descs = per_cpu_ptr(num_consec_descs_per_wq, cpu);
+
+	/*
+	 * Fall-back to local IAA's wq if there were no global wqs configured
+	 * for any IAA device, or if there were problems in setting up global
+	 * wqs for this cpu's package.
+	 */
+	if (!entry->wqs)
+		return wq_table_next_wq(cpu);
+
+	if ((*num_consec_descs) == g_consec_descs_per_gwq) {
+		if (++entry->cur_wq >= entry->n_wqs)
+			entry->cur_wq = 0;
+		*num_consec_descs = 0;
+	}
+
+	++(*num_consec_descs);
+
+	return entry->wqs[entry->cur_wq];
+}
+
 /*************************************************
  * Core iaa_crypto compress/decompress functions.
  *************************************************/
@@ -1563,6 +1899,7 @@ static int iaa_comp_acompress(struct acomp_req *req)
 	struct idxd_wq *wq;
 	struct device *dev;
 	int order = -1;
+	struct wq_table_entry *entry;
 
 	compression_ctx = crypto_tfm_ctx(tfm);
 
@@ -1581,8 +1918,15 @@ static int iaa_comp_acompress(struct acomp_req *req)
 		disable_async = true;
 
 	cpu = get_cpu();
-	wq = wq_table_next_wq(cpu);
+	entry = per_cpu_ptr(global_wq_table, cpu);
+
+	if (!entry || !entry->wqs || entry->n_wqs == 0) {
+		wq = wq_table_next_wq(cpu);
+	} else {
+		wq = global_wq_table_next_wq(cpu);
+	}
 	put_cpu();
+
 	if (!wq) {
 		pr_debug("no wq configured for cpu=%d\n", cpu);
 		return -ENODEV;
@@ -2446,6 +2790,7 @@ static void iaa_crypto_remove(struct idxd_dev *idxd_dev)
 
 	if (nr_iaa == 0) {
 		iaa_crypto_enabled = false;
+		pkg_global_wq_tables_dealloc();
 		free_wq_tables();
 		BUG_ON(!list_empty(&iaa_devices));
 		INIT_LIST_HEAD(&iaa_devices);
@@ -2515,6 +2860,20 @@ static int __init iaa_crypto_init_module(void)
 		goto err_sync_attr_create;
 	}
 
+	ret = driver_create_file(&iaa_crypto_driver.drv,
+				&driver_attr_g_wqs_per_iaa);
+	if (ret) {
+		pr_debug("IAA g_wqs_per_iaa attr creation failed\n");
+		goto err_g_wqs_per_iaa_attr_create;
+	}
+
+	ret = driver_create_file(&iaa_crypto_driver.drv,
+				&driver_attr_g_consec_descs_per_gwq);
+	if (ret) {
+		pr_debug("IAA g_consec_descs_per_gwq attr creation failed\n");
+		goto err_g_consec_descs_per_gwq_attr_create;
+	}
+
 	if (iaa_crypto_debugfs_init())
 		pr_warn("debugfs init failed, stats not available\n");
 
@@ -2522,6 +2881,12 @@ static int __init iaa_crypto_init_module(void)
 out:
 	return ret;
 
+err_g_consec_descs_per_gwq_attr_create:
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_g_wqs_per_iaa);
+err_g_wqs_per_iaa_attr_create:
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_sync_mode);
 err_sync_attr_create:
 	driver_remove_file(&iaa_crypto_driver.drv,
 			   &driver_attr_verify_compress);
@@ -2545,6 +2910,10 @@ static void __exit iaa_crypto_cleanup_module(void)
 			   &driver_attr_sync_mode);
 	driver_remove_file(&iaa_crypto_driver.drv,
 			   &driver_attr_verify_compress);
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_g_wqs_per_iaa);
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_g_consec_descs_per_gwq);
 	idxd_driver_unregister(&iaa_crypto_driver);
 	iaa_aecs_cleanup_fixed();
 	crypto_free_comp(deflate_generic_tfm);
-- 
2.27.0



^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH v5 10/12] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching.
  2024-12-21  6:31 [PATCH v5 00/12] zswap IAA compress batching Kanchana P Sridhar
                   ` (8 preceding siblings ...)
  2024-12-21  6:31 ` [PATCH v5 09/12] crypto: iaa - Distribute compress jobs from all cores to all IAAs on a package Kanchana P Sridhar
@ 2024-12-21  6:31 ` Kanchana P Sridhar
  2025-01-07  0:58   ` Yosry Ahmed
  2024-12-21  6:31 ` [PATCH v5 11/12] mm: zswap: Restructure & simplify zswap_store() to make it amenable for batching Kanchana P Sridhar
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 55+ messages in thread
From: Kanchana P Sridhar @ 2024-12-21  6:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch does the following:

1) Defines ZSWAP_MAX_BATCH_SIZE to denote the maximum number of acomp_ctx
   batching resources (acomp_reqs and buffers) to allocate if the zswap
   compressor supports batching. Currently, ZSWAP_MAX_BATCH_SIZE is set to
   8U.

2) Modifies the definition of "struct crypto_acomp_ctx" to represent a
   configurable number of acomp_reqs and buffers. Adds an "nr_reqs" member
   to "struct crypto_acomp_ctx" to record the number of resources that will
   be allocated in the cpu hotplug onlining code.

3) The zswap_cpu_comp_prepare() cpu onlining code will detect if the
   crypto_acomp created for the zswap pool (in other words, the zswap
   compression algorithm) has registered implementations for
   batch_compress() and batch_decompress(). If so, it will query the
   crypto_acomp for the maximum batch size supported by the compressor, and
   set "nr_reqs" to the minimum of this compressor-specific max batch size
   and ZSWAP_MAX_BATCH_SIZE. Finally, it will allocate "nr_reqs"
   reqs/buffers, and set the acomp_ctx->nr_reqs accordingly.

4) If the crypto_acomp does not support batching, "nr_reqs" defaults to 1.
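
In outline (the complete zswap_cpu_comp_prepare() changes are in the patch
below), the selection of "nr_reqs" is:

    unsigned int nr_reqs = 1;

    if (acomp_has_async_batching(acomp))
            nr_reqs = min(ZSWAP_MAX_BATCH_SIZE, crypto_acomp_batch_size(acomp));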

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 122 +++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 90 insertions(+), 32 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 9718c33f8192..99cd78891fd0 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -78,6 +78,13 @@ static bool zswap_pool_reached_full;
 
 #define ZSWAP_PARAM_UNSET ""
 
+/*
+ * For compression batching of large folios:
+ * Maximum number of acomp compress requests that will be processed
+ * in a batch, iff the zswap compressor supports batching.
+ */
+#define ZSWAP_MAX_BATCH_SIZE 8U
+
 static int zswap_setup(void);
 
 /* Enable/disable zswap */
@@ -143,9 +150,10 @@ bool zswap_never_enabled(void)
 
 struct crypto_acomp_ctx {
 	struct crypto_acomp *acomp;
-	struct acomp_req *req;
+	struct acomp_req **reqs;
+	u8 **buffers;
+	unsigned int nr_reqs;
 	struct crypto_wait wait;
-	u8 *buffer;
 	struct mutex mutex;
 	bool is_sleepable;
 };
@@ -818,49 +826,88 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
 	struct crypto_acomp *acomp;
-	struct acomp_req *req;
-	int ret;
+	unsigned int nr_reqs = 1;
+	int ret = -ENOMEM;
+	int i, j;
 
 	mutex_init(&acomp_ctx->mutex);
-
-	acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
-	if (!acomp_ctx->buffer)
-		return -ENOMEM;
+	acomp_ctx->nr_reqs = 0;
 
 	acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
 	if (IS_ERR(acomp)) {
 		pr_err("could not alloc crypto acomp %s : %ld\n",
 				pool->tfm_name, PTR_ERR(acomp));
-		ret = PTR_ERR(acomp);
-		goto acomp_fail;
+		return PTR_ERR(acomp);
 	}
 	acomp_ctx->acomp = acomp;
 	acomp_ctx->is_sleepable = acomp_is_async(acomp);
 
-	req = acomp_request_alloc(acomp_ctx->acomp);
-	if (!req) {
-		pr_err("could not alloc crypto acomp_request %s\n",
-		       pool->tfm_name);
-		ret = -ENOMEM;
+	/*
+	 * Create the necessary batching resources if the crypto acomp alg
+	 * implements the batch_compress and batch_decompress API.
+	 */
+	if (acomp_has_async_batching(acomp)) {
+		nr_reqs = min(ZSWAP_MAX_BATCH_SIZE, crypto_acomp_batch_size(acomp));
+		pr_info_once("Creating acomp_ctx with %d reqs/buffers for batching since crypto acomp\n%s has registered batch_compress() and batch_decompress().\n",
+			nr_reqs, pool->tfm_name);
+	}
+
+	acomp_ctx->buffers = kmalloc_node(nr_reqs * sizeof(u8 *), GFP_KERNEL, cpu_to_node(cpu));
+	if (!acomp_ctx->buffers)
+		goto buf_fail;
+
+	for (i = 0; i < nr_reqs; ++i) {
+		acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
+		if (!acomp_ctx->buffers[i]) {
+			for (j = 0; j < i; ++j)
+				kfree(acomp_ctx->buffers[j]);
+			kfree(acomp_ctx->buffers);
+			ret = -ENOMEM;
+			goto buf_fail;
+		}
+	}
+
+	acomp_ctx->reqs = kmalloc_node(nr_reqs * sizeof(struct acomp_req *), GFP_KERNEL, cpu_to_node(cpu));
+	if (!acomp_ctx->reqs)
 		goto req_fail;
+
+	for (i = 0; i < nr_reqs; ++i) {
+		acomp_ctx->reqs[i] = acomp_request_alloc(acomp_ctx->acomp);
+		if (!acomp_ctx->reqs[i]) {
+			pr_err("could not alloc crypto acomp_request reqs[%d] %s\n",
+			       i, pool->tfm_name);
+			for (j = 0; j < i; ++j)
+				acomp_request_free(acomp_ctx->reqs[j]);
+			kfree(acomp_ctx->reqs);
+			ret = -ENOMEM;
+			goto req_fail;
+		}
 	}
-	acomp_ctx->req = req;
 
+	/*
+	 * The crypto_wait is used only in fully synchronous, i.e., with scomp
+	 * or non-poll mode of acomp, hence there is only one "wait" per
+	 * acomp_ctx, with callback set to reqs[0], under the assumption that
+	 * there is at least 1 request per acomp_ctx.
+	 */
 	crypto_init_wait(&acomp_ctx->wait);
 	/*
 	 * if the backend of acomp is async zip, crypto_req_done() will wakeup
 	 * crypto_wait_req(); if the backend of acomp is scomp, the callback
 	 * won't be called, crypto_wait_req() will return without blocking.
 	 */
-	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+	acomp_request_set_callback(acomp_ctx->reqs[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
 				   crypto_req_done, &acomp_ctx->wait);
 
+	acomp_ctx->nr_reqs = nr_reqs;
 	return 0;
 
 req_fail:
+	for (i = 0; i < nr_reqs; ++i)
+		kfree(acomp_ctx->buffers[i]);
+	kfree(acomp_ctx->buffers);
+buf_fail:
 	crypto_free_acomp(acomp_ctx->acomp);
-acomp_fail:
-	kfree(acomp_ctx->buffer);
 	return ret;
 }
 
@@ -870,11 +917,22 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
 
 	if (!IS_ERR_OR_NULL(acomp_ctx)) {
-		if (!IS_ERR_OR_NULL(acomp_ctx->req))
-			acomp_request_free(acomp_ctx->req);
+		int i;
+
+		for (i = 0; i < acomp_ctx->nr_reqs; ++i)
+			if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
+				acomp_request_free(acomp_ctx->reqs[i]);
+		kfree(acomp_ctx->reqs);
+
+		for (i = 0; i < acomp_ctx->nr_reqs; ++i)
+			kfree(acomp_ctx->buffers[i]);
+		kfree(acomp_ctx->buffers);
+
 		if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
 			crypto_free_acomp(acomp_ctx->acomp);
-		kfree(acomp_ctx->buffer);
+
+		acomp_ctx->nr_reqs = 0;
+		acomp_ctx = NULL;
 	}
 
 	return 0;
@@ -897,7 +955,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 
 	mutex_lock(&acomp_ctx->mutex);
 
-	dst = acomp_ctx->buffer;
+	dst = acomp_ctx->buffers[0];
 	sg_init_table(&input, 1);
 	sg_set_page(&input, page, PAGE_SIZE, 0);
 
@@ -907,7 +965,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	 * giving the dst buffer with enough length to avoid buffer overflow.
 	 */
 	sg_init_one(&output, dst, PAGE_SIZE * 2);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
+	acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, PAGE_SIZE, dlen);
 
 	/*
 	 * it maybe looks a little bit silly that we send an asynchronous request,
@@ -921,8 +979,8 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	 * but in different threads running on different cpu, we have different
 	 * acomp instance, so multiple threads can do (de)compression in parallel.
 	 */
-	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
-	dlen = acomp_ctx->req->dlen;
+	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
+	dlen = acomp_ctx->reqs[0]->dlen;
 	if (comp_ret)
 		goto unlock;
 
@@ -975,20 +1033,20 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	 */
 	if ((acomp_ctx->is_sleepable && !zpool_can_sleep_mapped(zpool)) ||
 	    !virt_addr_valid(src)) {
-		memcpy(acomp_ctx->buffer, src, entry->length);
-		src = acomp_ctx->buffer;
+		memcpy(acomp_ctx->buffers[0], src, entry->length);
+		src = acomp_ctx->buffers[0];
 		zpool_unmap_handle(zpool, entry->handle);
 	}
 
 	sg_init_one(&input, src, entry->length);
 	sg_init_table(&output, 1);
 	sg_set_folio(&output, folio, PAGE_SIZE, 0);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
-	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
-	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
+	acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, entry->length, PAGE_SIZE);
+	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->reqs[0]), &acomp_ctx->wait));
+	BUG_ON(acomp_ctx->reqs[0]->dlen != PAGE_SIZE);
 	mutex_unlock(&acomp_ctx->mutex);
 
-	if (src != acomp_ctx->buffer)
+	if (src != acomp_ctx->buffers[0])
 		zpool_unmap_handle(zpool, entry->handle);
 }
 
-- 
2.27.0



^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH v5 11/12] mm: zswap: Restructure & simplify zswap_store() to make it amenable for batching.
  2024-12-21  6:31 [PATCH v5 00/12] zswap IAA compress batching Kanchana P Sridhar
                   ` (9 preceding siblings ...)
  2024-12-21  6:31 ` [PATCH v5 10/12] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching Kanchana P Sridhar
@ 2024-12-21  6:31 ` Kanchana P Sridhar
  2025-01-07  1:16   ` Yosry Ahmed
  2024-12-21  6:31 ` [PATCH v5 12/12] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios Kanchana P Sridhar
  2025-01-07  1:44 ` [PATCH v5 00/12] zswap IAA compress batching Yosry Ahmed
  12 siblings, 1 reply; 55+ messages in thread
From: Kanchana P Sridhar @ 2024-12-21  6:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch introduces zswap_store_folio(), which performs, for all the pages
in a folio, the work that zswap_store_page() previously did for a single
page. This allows us to move the loop over the folio's pages from
zswap_store() to zswap_store_folio().

A distinct zswap_compress_folio() is also added, which simply calls
zswap_compress() for each page of the folio it is given.

zswap_store_folio() starts by allocating all zswap entries required to
store the folio. Next, it calls zswap_compress_folio() and finally, adds
the entries to the xarray and LRU.

The error handling and cleanup required for all failure scenarios that can
occur while storing a folio in zswap are now consolidated under a
"store_folio_failed" label in zswap_store_folio().

These changes facilitate developing support for compress batching in
zswap_store_folio().

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 183 +++++++++++++++++++++++++++++++++--------------------
 1 file changed, 116 insertions(+), 67 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 99cd78891fd0..1be0f1807bfc 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1467,77 +1467,129 @@ static void shrink_worker(struct work_struct *w)
 * main API
 **********************************/
 
-static ssize_t zswap_store_page(struct page *page,
-				struct obj_cgroup *objcg,
-				struct zswap_pool *pool)
+static bool zswap_compress_folio(struct folio *folio,
+				 struct zswap_entry *entries[],
+				 struct zswap_pool *pool)
 {
-	swp_entry_t page_swpentry = page_swap_entry(page);
-	struct zswap_entry *entry, *old;
+	long index, nr_pages = folio_nr_pages(folio);
 
-	/* allocate entry */
-	entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
-	if (!entry) {
-		zswap_reject_kmemcache_fail++;
-		return -EINVAL;
+	for (index = 0; index < nr_pages; ++index) {
+		struct page *page = folio_page(folio, index);
+
+		if (!zswap_compress(page, entries[index], pool))
+			return false;
 	}
 
-	if (!zswap_compress(page, entry, pool))
-		goto compress_failed;
+	return true;
+}
 
-	old = xa_store(swap_zswap_tree(page_swpentry),
-		       swp_offset(page_swpentry),
-		       entry, GFP_KERNEL);
-	if (xa_is_err(old)) {
-		int err = xa_err(old);
+/*
+ * Store all pages in a folio.
+ *
+ * The error handling from all failure points is consolidated to the
+ * "store_folio_failed" label, based on the initialization of the zswap entries'
+ * handles to ERR_PTR(-EINVAL) at allocation time, and the fact that the
+ * entry's handle is subsequently modified only upon a successful zpool_malloc()
+ * after the page is compressed.
+ */
+static ssize_t zswap_store_folio(struct folio *folio,
+				 struct obj_cgroup *objcg,
+				 struct zswap_pool *pool)
+{
+	long index, nr_pages = folio_nr_pages(folio);
+	struct zswap_entry **entries = NULL;
+	int node_id = folio_nid(folio);
+	size_t compressed_bytes = 0;
 
-		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
-		zswap_reject_alloc_fail++;
-		goto store_failed;
+	entries = kmalloc(nr_pages * sizeof(*entries), GFP_KERNEL);
+	if (!entries)
+		return -ENOMEM;
+
+	/* allocate entries */
+	for (index = 0; index < nr_pages; ++index) {
+		entries[index] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
+
+		if (!entries[index]) {
+			zswap_reject_kmemcache_fail++;
+			nr_pages = index;
+			goto store_folio_failed;
+		}
+
+		entries[index]->handle = (unsigned long)ERR_PTR(-EINVAL);
 	}
 
-	/*
-	 * We may have had an existing entry that became stale when
-	 * the folio was redirtied and now the new version is being
-	 * swapped out. Get rid of the old.
-	 */
-	if (old)
-		zswap_entry_free(old);
+	if (!zswap_compress_folio(folio, entries, pool))
+		goto store_folio_failed;
 
-	/*
-	 * The entry is successfully compressed and stored in the tree, there is
-	 * no further possibility of failure. Grab refs to the pool and objcg.
-	 * These refs will be dropped by zswap_entry_free() when the entry is
-	 * removed from the tree.
-	 */
-	zswap_pool_get(pool);
-	if (objcg)
-		obj_cgroup_get(objcg);
+	for (index = 0; index < nr_pages; ++index) {
+		swp_entry_t page_swpentry = page_swap_entry(folio_page(folio, index));
+		struct zswap_entry *old, *entry = entries[index];
+
+		old = xa_store(swap_zswap_tree(page_swpentry),
+			       swp_offset(page_swpentry),
+			       entry, GFP_KERNEL);
+		if (xa_is_err(old)) {
+			int err = xa_err(old);
+
+			WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
+			zswap_reject_alloc_fail++;
+			goto store_folio_failed;
+		}
 
-	/*
-	 * We finish initializing the entry while it's already in xarray.
-	 * This is safe because:
-	 *
-	 * 1. Concurrent stores and invalidations are excluded by folio lock.
-	 *
-	 * 2. Writeback is excluded by the entry not being on the LRU yet.
-	 *    The publishing order matters to prevent writeback from seeing
-	 *    an incoherent entry.
-	 */
-	entry->pool = pool;
-	entry->swpentry = page_swpentry;
-	entry->objcg = objcg;
-	entry->referenced = true;
-	if (entry->length) {
-		INIT_LIST_HEAD(&entry->lru);
-		zswap_lru_add(&zswap_list_lru, entry);
+		/*
+		 * We may have had an existing entry that became stale when
+		 * the folio was redirtied and now the new version is being
+		 * swapped out. Get rid of the old.
+		 */
+		if (old)
+			zswap_entry_free(old);
+
+		/*
+		 * The entry is successfully compressed and stored in the tree, there is
+		 * no further possibility of failure. Grab refs to the pool and objcg.
+		 * These refs will be dropped by zswap_entry_free() when the entry is
+		 * removed from the tree.
+		 */
+		zswap_pool_get(pool);
+		if (objcg)
+			obj_cgroup_get(objcg);
+
+		/*
+		 * We finish initializing the entry while it's already in xarray.
+		 * This is safe because:
+		 *
+		 * 1. Concurrent stores and invalidations are excluded by folio lock.
+		 *
+		 * 2. Writeback is excluded by the entry not being on the LRU yet.
+		 *    The publishing order matters to prevent writeback from seeing
+		 *    an incoherent entry.
+		 */
+		entry->pool = pool;
+		entry->swpentry = page_swpentry;
+		entry->objcg = objcg;
+		entry->referenced = true;
+		if (entry->length) {
+			INIT_LIST_HEAD(&entry->lru);
+			zswap_lru_add(&zswap_list_lru, entry);
+		}
+
+		compressed_bytes += entry->length;
 	}
 
-	return entry->length;
+	kfree(entries);
+
+	return compressed_bytes;
+
+store_folio_failed:
+	for (index = 0; index < nr_pages; ++index) {
+		if (!IS_ERR_VALUE(entries[index]->handle))
+			zpool_free(pool->zpool, entries[index]->handle);
+
+		zswap_entry_cache_free(entries[index]);
+	}
+
+	kfree(entries);
 
-store_failed:
-	zpool_free(pool->zpool, entry->handle);
-compress_failed:
-	zswap_entry_cache_free(entry);
 	return -EINVAL;
 }
 
@@ -1549,8 +1601,8 @@ bool zswap_store(struct folio *folio)
 	struct mem_cgroup *memcg = NULL;
 	struct zswap_pool *pool;
 	size_t compressed_bytes = 0;
+	ssize_t bytes;
 	bool ret = false;
-	long index;
 
 	VM_WARN_ON_ONCE(!folio_test_locked(folio));
 	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
@@ -1584,15 +1636,11 @@ bool zswap_store(struct folio *folio)
 		mem_cgroup_put(memcg);
 	}
 
-	for (index = 0; index < nr_pages; ++index) {
-		struct page *page = folio_page(folio, index);
-		ssize_t bytes;
+	bytes = zswap_store_folio(folio, objcg, pool);
+	if (bytes < 0)
+		goto put_pool;
 
-		bytes = zswap_store_page(page, objcg, pool);
-		if (bytes < 0)
-			goto put_pool;
-		compressed_bytes += bytes;
-	}
+	compressed_bytes = bytes;
 
 	if (objcg) {
 		obj_cgroup_charge_zswap(objcg, compressed_bytes);
@@ -1622,6 +1670,7 @@ bool zswap_store(struct folio *folio)
 		pgoff_t offset = swp_offset(swp);
 		struct zswap_entry *entry;
 		struct xarray *tree;
+		long index;
 
 		for (index = 0; index < nr_pages; ++index) {
 			tree = swap_zswap_tree(swp_entry(type, offset + index));
-- 
2.27.0



^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH v5 12/12] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios.
  2024-12-21  6:31 [PATCH v5 00/12] zswap IAA compress batching Kanchana P Sridhar
                   ` (10 preceding siblings ...)
  2024-12-21  6:31 ` [PATCH v5 11/12] mm: zswap: Restructure & simplify zswap_store() to make it amenable for batching Kanchana P Sridhar
@ 2024-12-21  6:31 ` Kanchana P Sridhar
  2025-01-07  1:19   ` Yosry Ahmed
  2025-01-07  1:44 ` [PATCH v5 00/12] zswap IAA compress batching Yosry Ahmed
  12 siblings, 1 reply; 55+ messages in thread
From: Kanchana P Sridhar @ 2024-12-21  6:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

zswap_compress_folio() is modified to detect if the pool's acomp_ctx has
"nr_reqs" greater than one, which will be the case if the cpu onlining code
has allocated batching resources in the acomp_ctx based on the queries to
acomp_has_async_batching() and crypto_acomp_batch_size(). If so, compress
batching can be used with a batch size of "acomp_ctx->nr_reqs".

If compress batching can be used with the given zswap pool,
zswap_compress_folio() will invoke the newly added zswap_batch_compress()
procedure to compress and store the folio in batches of
"acomp_ctx->nr_reqs" pages.

zswap_batch_compress() calls crypto_acomp_batch_compress() to compress each
batch of (up to) "acomp_ctx->nr_reqs" pages. The iaa_crypto driver
will compress each batch of pages in parallel in the Intel IAA hardware
with 'async' mode and request chaining.

Hence, zswap_batch_compress() performs the same computations for a batch
that zswap_compress() performs for a page, and returns true if the batch was
successfully compressed/stored, and false otherwise.

If the pool does not support compress batching, zswap_compress_folio()
calls zswap_compress() for each individual page in the folio, as before.
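
In outline (condensed from the patch below), the dispatch in
zswap_compress_folio() becomes:

    batch_size = acomp_ctx->nr_reqs;

    if (batch_size > 1 && nr_pages > 1) {
            for (index = 0; index < nr_pages; index += batch_size)
                    if (!zswap_batch_compress(folio, index, batch_size,
                                              &entries[index], pool, acomp_ctx))
                            return false;
    } else {
            for (index = 0; index < nr_pages; ++index)
                    if (!zswap_compress(folio_page(folio, index),
                                        entries[index], pool))
                            return false;
    }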

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 109 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 105 insertions(+), 4 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 1be0f1807bfc..f336fafe24c4 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1467,17 +1467,118 @@ static void shrink_worker(struct work_struct *w)
 * main API
 **********************************/
 
+static bool zswap_batch_compress(struct folio *folio,
+				 long index,
+				 unsigned int batch_size,
+				 struct zswap_entry *entries[],
+				 struct zswap_pool *pool,
+				 struct crypto_acomp_ctx *acomp_ctx)
+{
+	int comp_errors[ZSWAP_MAX_BATCH_SIZE] = { 0 };
+	unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
+	struct page *pages[ZSWAP_MAX_BATCH_SIZE];
+	unsigned int i, nr_batch_pages;
+	bool ret = true;
+
+	nr_batch_pages = min((unsigned int)(folio_nr_pages(folio) - index), batch_size);
+
+	for (i = 0; i < nr_batch_pages; ++i) {
+		pages[i] = folio_page(folio, index + i);
+		dlens[i] = PAGE_SIZE;
+	}
+
+	mutex_lock(&acomp_ctx->mutex);
+
+	/*
+	 * Batch compress @nr_batch_pages. If IAA is the compressor, the
+	 * hardware will compress @nr_batch_pages in parallel.
+	 */
+	ret = crypto_acomp_batch_compress(
+		acomp_ctx->reqs,
+		&acomp_ctx->wait,
+		pages,
+		acomp_ctx->buffers,
+		dlens,
+		comp_errors,
+		nr_batch_pages);
+
+	if (ret) {
+		/*
+		 * All batch pages were successfully compressed.
+		 * Store the pages in zpool.
+		 */
+		struct zpool *zpool = pool->zpool;
+		gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
+
+		if (zpool_malloc_support_movable(zpool))
+			gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
+
+		for (i = 0; i < nr_batch_pages; ++i) {
+			unsigned long handle;
+			char *buf;
+			int err;
+
+			err = zpool_malloc(zpool, dlens[i], gfp, &handle);
+
+			if (err) {
+				if (err == -ENOSPC)
+					zswap_reject_compress_poor++;
+				else
+					zswap_reject_alloc_fail++;
+
+				ret = false;
+				break;
+			}
+
+			buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
+			memcpy(buf, acomp_ctx->buffers[i], dlens[i]);
+			zpool_unmap_handle(zpool, handle);
+
+			entries[i]->handle = handle;
+			entries[i]->length = dlens[i];
+		}
+	} else {
+		/* Some batch pages had compression errors. */
+		for (i = 0; i < nr_batch_pages; ++i) {
+			if (comp_errors[i]) {
+				if (comp_errors[i] == -ENOSPC)
+					zswap_reject_compress_poor++;
+				else
+					zswap_reject_compress_fail++;
+			}
+		}
+	}
+
+	mutex_unlock(&acomp_ctx->mutex);
+
+	return ret;
+}
+
 static bool zswap_compress_folio(struct folio *folio,
 				 struct zswap_entry *entries[],
 				 struct zswap_pool *pool)
 {
 	long index, nr_pages = folio_nr_pages(folio);
+	struct crypto_acomp_ctx *acomp_ctx;
+	unsigned int batch_size;
 
-	for (index = 0; index < nr_pages; ++index) {
-		struct page *page = folio_page(folio, index);
+	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
+	batch_size = acomp_ctx->nr_reqs;
 
-		if (!zswap_compress(page, entries[index], pool))
-			return false;
+	if ((batch_size > 1) && (nr_pages > 1)) {
+		for (index = 0; index < nr_pages; index += batch_size) {
+
+			if (!zswap_batch_compress(folio, index, batch_size,
+						  &entries[index], pool, acomp_ctx))
+				return false;
+		}
+	} else {
+		for (index = 0; index < nr_pages; ++index) {
+			struct page *page = folio_page(folio, index);
+
+			if (!zswap_compress(page, entries[index], pool))
+				return false;
+		}
 	}
 
 	return true;
-- 
2.27.0



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 04/12] crypto: iaa - Implement batch_compress(), batch_decompress() API in iaa_crypto.
  2024-12-21  6:31 ` [PATCH v5 04/12] crypto: iaa - Implement batch_compress(), batch_decompress() API in iaa_crypto Kanchana P Sridhar
@ 2024-12-22  4:07   ` kernel test robot
  0 siblings, 0 replies; 55+ messages in thread
From: kernel test robot @ 2024-12-22  4:07 UTC (permalink / raw)
  To: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, yosryahmed,
	nphamcs, chengming.zhou, usamaarif642, ryan.roberts, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi
  Cc: llvm, oe-kbuild-all, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Hi Kanchana,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 5555a83c82d66729e4abaf16ae28d6bd81f9a64a]

url:    https://github.com/intel-lab-lkp/linux/commits/Kanchana-P-Sridhar/crypto-acomp-Add-synchronous-asynchronous-acomp-request-chaining/20241221-143254
base:   5555a83c82d66729e4abaf16ae28d6bd81f9a64a
patch link:    https://lore.kernel.org/r/20241221063119.29140-5-kanchana.p.sridhar%40intel.com
patch subject: [PATCH v5 04/12] crypto: iaa - Implement batch_compress(), batch_decompress() API in iaa_crypto.
config: x86_64-allyesconfig (https://download.01.org/0day-ci/archive/20241222/202412221117.i9BKx0mV-lkp@intel.com/config)
compiler: clang version 19.1.3 (https://github.com/llvm/llvm-project ab51eccf88f5321e7c60591c5546b254b6afab99)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241222/202412221117.i9BKx0mV-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412221117.i9BKx0mV-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> drivers/crypto/intel/iaa/iaa_crypto_main.c:1897: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
    * This API provides IAA compress batching functionality for use by swap
   drivers/crypto/intel/iaa/iaa_crypto_main.c:2050: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
    * This API provides IAA decompress batching functionality for use by swap


vim +1897 drivers/crypto/intel/iaa/iaa_crypto_main.c

  1895	
  1896	/**
> 1897	 * This API provides IAA compress batching functionality for use by swap
  1898	 * modules.
  1899	 *
  1900	 * @reqs: @nr_pages asynchronous compress requests.
  1901	 * @wait: crypto_wait for acomp batch compress implemented using request
  1902	 *        chaining. Required if async_mode is "false". If async_mode is "true",
  1903	 *        and @wait is NULL, the completions will be processed using
  1904	 *        asynchronous polling of the requests' completion statuses.
  1905	 * @pages: Pages to be compressed by IAA.
  1906	 * @dsts: Pre-allocated destination buffers to store results of IAA
  1907	 *        compression. Each element of @dsts must be of size "PAGE_SIZE * 2".
  1908	 * @dlens: Will contain the compressed lengths.
  1909	 * @errors: zero on successful compression of the corresponding
  1910	 *          req, or error code in case of error.
  1911	 * @nr_pages: The number of pages, up to IAA_CRYPTO_MAX_BATCH_SIZE,
  1912	 *            to be compressed.
  1913	 *
  1914	 * Returns true if all compress requests complete successfully,
  1915	 * false otherwise.
  1916	 */
  1917	static bool iaa_comp_acompress_batch(
  1918		struct acomp_req *reqs[],
  1919		struct crypto_wait *wait,
  1920		struct page *pages[],
  1921		u8 *dsts[],
  1922		unsigned int dlens[],
  1923		int errors[],
  1924		int nr_pages)
  1925	{
  1926		struct scatterlist inputs[IAA_CRYPTO_MAX_BATCH_SIZE];
  1927		struct scatterlist outputs[IAA_CRYPTO_MAX_BATCH_SIZE];
  1928		bool compressions_done = false;
  1929		bool async = (async_mode && !use_irq);
  1930		bool async_poll = (async && !wait);
  1931		int i, err = 0;
  1932	
  1933		BUG_ON(nr_pages > IAA_CRYPTO_MAX_BATCH_SIZE);
  1934		BUG_ON(!async && !wait);
  1935	
  1936		if (async)
  1937			iaa_set_req_poll(reqs, nr_pages, true);
  1938		else
  1939			iaa_set_req_poll(reqs, nr_pages, false);
  1940	
  1941		/*
  1942		 * Prepare and submit acomp_reqs to IAA. IAA will process these
  1943		 * compress jobs in parallel if async_mode is true.
  1944		 */
  1945		for (i = 0; i < nr_pages; ++i) {
  1946			sg_init_table(&inputs[i], 1);
  1947			sg_set_page(&inputs[i], pages[i], PAGE_SIZE, 0);
  1948	
  1949			/*
  1950			 * Each dst buffer should be of size (PAGE_SIZE * 2).
  1951			 * Reflect same in sg_list.
  1952			 */
  1953			sg_init_one(&outputs[i], dsts[i], PAGE_SIZE * 2);
  1954			acomp_request_set_params(reqs[i], &inputs[i],
  1955						 &outputs[i], PAGE_SIZE, dlens[i]);
  1956	
  1957			/*
  1958			 * As long as the API is called with a valid "wait", chain the
  1959			 * requests for synchronous/asynchronous compress ops.
  1960			 * If async_mode is in effect, but the API is called with a
  1961			 * NULL "wait", submit the requests first, and poll for
  1962			 * their completion status later, after all descriptors have
  1963			 * been submitted.
  1964			 */
  1965			if (!async_poll) {
  1966				/* acomp request chaining. */
  1967				if (i)
  1968					acomp_request_chain(reqs[i], reqs[0]);
  1969				else
  1970					acomp_reqchain_init(reqs[0], 0, crypto_req_done,
  1971							    wait);
  1972			} else {
  1973				errors[i] = iaa_comp_acompress(reqs[i]);
  1974	
  1975				if (errors[i] != -EINPROGRESS) {
  1976					errors[i] = -EINVAL;
  1977					err = -EINVAL;
  1978				} else {
  1979					errors[i] = -EAGAIN;
  1980				}
  1981			}
  1982		}
  1983	
  1984		if (!async_poll) {
  1985			if (async)
  1986				/* Process the request chain in parallel. */
  1987				err = crypto_wait_req(acomp_do_async_req_chain(reqs[0],
  1988						      iaa_comp_acompress, iaa_comp_poll),
  1989						      wait);
  1990			else
  1991				/* Process the request chain in series. */
  1992				err = crypto_wait_req(acomp_do_req_chain(reqs[0],
  1993						      iaa_comp_acompress), wait);
  1994	
  1995			for (i = 0; i < nr_pages; ++i) {
  1996				errors[i] = acomp_request_err(reqs[i]);
  1997				if (errors[i]) {
  1998					err = -EINVAL;
  1999					pr_debug("Request chaining req %d compress error %d\n", i, errors[i]);
  2000				} else {
  2001					dlens[i] = reqs[i]->dlen;
  2002				}
  2003			}
  2004	
  2005			goto reset_reqs;
  2006		}
  2007	
  2008		/*
  2009		 * Asynchronously poll for and process IAA compress job completions.
  2010		 */
  2011		while (!compressions_done) {
  2012			compressions_done = true;
  2013	
  2014			for (i = 0; i < nr_pages; ++i) {
  2015				/*
  2016				 * Skip, if the compression has already completed
  2017				 * successfully or with an error.
  2018				 */
  2019				if (errors[i] != -EAGAIN)
  2020					continue;
  2021	
  2022				errors[i] = iaa_comp_poll(reqs[i]);
  2023	
  2024				if (errors[i]) {
  2025					if (errors[i] == -EAGAIN)
  2026						compressions_done = false;
  2027					else
  2028						err = -EINVAL;
  2029				} else {
  2030					dlens[i] = reqs[i]->dlen;
  2031				}
  2032			}
  2033		}
  2034	
  2035	reset_reqs:
  2036		/*
  2037		 * For the same 'reqs[]' to be usable by
  2038		 * iaa_comp_acompress()/iaa_comp_deacompress(),
  2039		 * clear the CRYPTO_ACOMP_REQ_POLL bit on all acomp_reqs, and the
  2040		 * CRYPTO_TFM_REQ_CHAIN bit on the reqs[0].
  2041		 */
  2042		iaa_set_req_poll(reqs, nr_pages, false);
  2043		if (!async_poll)
  2044			acomp_reqchain_clear(reqs[0], wait);
  2045	
  2046		return !err;
  2047	}
  2048	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2024-12-21  6:31 ` [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching Kanchana P Sridhar
@ 2024-12-28 11:46   ` Herbert Xu
  2025-01-06 17:37     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2024-12-28 11:46 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi, wajdi.k.feghali, vinodh.gopal

On Fri, Dec 20, 2024 at 10:31:09PM -0800, Kanchana P Sridhar wrote:
> This commit adds get_batch_size(), batch_compress() and batch_decompress()
> interfaces to:

First of all we don't need a batch compress/decompress interface
because the whole point of request chaining is to supply the data
in batches.

I'm also against having a get_batch_size because the user should
be supplying as much data as they're comfortable with.  In other
words if the user is happy to give us 8 requests for iaa then it
should be happy to give us 8 requests for every implementation.

The request chaining interface should be such that processing
8 requests is always better than doing 1 request at a time as
the cost is amortised.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2024-12-28 11:46   ` Herbert Xu
@ 2025-01-06 17:37     ` Sridhar, Kanchana P
  2025-01-06 23:24       ` Yosry Ahmed
  2025-01-07  2:04       ` Herbert Xu
  0 siblings, 2 replies; 55+ messages in thread
From: Sridhar, Kanchana P @ 2025-01-06 17:37 UTC (permalink / raw)
  To: Herbert Xu
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, davem, clabbe, ardb, ebiggers, surenb, Accardi,
	Kristen C, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P

Hi Herbert,

> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Saturday, December 28, 2024 3:46 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com; akpm@linux-foundation.org;
> linux-crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> Kristen C <kristen.c.accardi@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for
> compress/decompress batching.
> 
> On Fri, Dec 20, 2024 at 10:31:09PM -0800, Kanchana P Sridhar wrote:
> > This commit adds get_batch_size(), batch_compress() and
> batch_decompress()
> > interfaces to:
> 
> First of all we don't need a batch compress/decompress interface
> because the whole point of request chaining is to supply the data
> in batches.
> 
> I'm also against having a get_batch_size because the user should
> be supplying as much data as they're comfortable with.  In other
> words if the user is happy to give us 8 requests for iaa then it
> should be happy to give us 8 requests for every implementation.
> 
> The request chaining interface should be such that processing
> 8 requests is always better than doing 1 request at a time as
> the cost is amortised.

Thanks for your comments. Can you please elaborate on how
request chaining would enable cost amortization for software
compressors? With the current implementation, a module like
zswap would need to do the following to invoke request chaining
for software compressors (in addition to pushing the chaining
to the user layer for IAA, as per your suggestion on not needing a
batch compress/decompress interface):

zswap_batch_compress():
   for (i = 0; i < nr_pages_in_batch; ++i) {
      /* set up the acomp_req "reqs[i]". */
      [ ... ]
      if (i)
	acomp_request_chain(reqs[i], reqs[0]);
      else
	acomp_reqchain_init(reqs[0], 0, crypto_req_done, crypto_wait);
   }

   /* Process the request chain in series. */
   err = crypto_wait_req(acomp_do_req_chain(reqs[0], crypto_acomp_compress), crypto_wait);

Internally, acomp_do_req_chain() would sequentially process the
request chain by:
1) adding all requests to a list "state"
2) call "crypto_acomp_compress()" for the next list element
3) when this request completes, dequeue it from the list "state"
4) repeat for all requests in "state"
5) When the last request in "state" completes, call "reqs[0]->base.complete()",
    which notifies crypto_wait.

From what I can understand, the latency cost should be the same for
processing a request chain in series vs. processing each request as it is
done today in zswap, by calling:

  comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);

It is not clear to me if there is a cost amortization benefit for software
compressors. One of the requirements from Yosry was that there should
be no change for the software compressors, which is what I have
attempted to do in v5.

Can you please help us understand if there is room for optimizing
the implementation of the synchronous "acomp_do_req_chain()" API?
I would also like to get inputs from the zswap maintainers on using
request chaining for a batching implementation for software compressors.

Thanks,
Kanchana

> 
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-01-06 17:37     ` Sridhar, Kanchana P
@ 2025-01-06 23:24       ` Yosry Ahmed
  2025-01-07  1:36         ` Sridhar, Kanchana P
  2025-01-07  2:04       ` Herbert Xu
  1 sibling, 1 reply; 55+ messages in thread
From: Yosry Ahmed @ 2025-01-06 23:24 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: Herbert Xu, linux-kernel, linux-mm, hannes, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, davem, clabbe, ardb, ebiggers, surenb, Accardi,
	Kristen C, Feghali, Wajdi K, Gopal, Vinodh

On Mon, Jan 6, 2025 at 9:37 AM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi Herbert,
>
> > -----Original Message-----
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Sent: Saturday, December 28, 2024 3:46 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > ryan.roberts@arm.com; 21cnbao@gmail.com; akpm@linux-foundation.org;
> > linux-crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> > ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> > Kristen C <kristen.c.accardi@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for
> > compress/decompress batching.
> >
> > On Fri, Dec 20, 2024 at 10:31:09PM -0800, Kanchana P Sridhar wrote:
> > > This commit adds get_batch_size(), batch_compress() and
> > batch_decompress()
> > > interfaces to:
> >
> > First of all we don't need a batch compress/decompress interface
> > because the whole point of request chaining is to supply the data
> > in batches.
> >
> > I'm also against having a get_batch_size because the user should
> > be supplying as much data as they're comfortable with.  In other
> > words if the user is happy to give us 8 requests for iaa then it
> > should be happy to give us 8 requests for every implementation.
> >
> > The request chaining interface should be such that processing
> > 8 requests is always better than doing 1 request at a time as
> > the cost is amortised.
>
> Thanks for your comments. Can you please elaborate on how
> request chaining would enable cost amortization for software
> compressors? With the current implementation, a module like
> zswap would need to do the following to invoke request chaining
> for software compressors (in addition to pushing the chaining
> to the user layer for IAA, as per your suggestion on not needing a
> batch compress/decompress interface):
>
> zswap_batch_compress():
>    for (i = 0; i < nr_pages_in_batch; ++i) {
>       /* set up the acomp_req "reqs[i]". */
>       [ ... ]
>       if (i)
>         acomp_request_chain(reqs[i], reqs[0]);
>       else
>         acomp_reqchain_init(reqs[0], 0, crypto_req_done, crypto_wait);
>    }
>
>    /* Process the request chain in series. */
>    err = crypto_wait_req(acomp_do_req_chain(reqs[0], crypto_acomp_compress), crypto_wait);
>
> Internally, acomp_do_req_chain() would sequentially process the
> request chain by:
> 1) adding all requests to a list "state"
> 2) call "crypto_acomp_compress()" for the next list element
> 3) when this request completes, dequeue it from the list "state"
> 4) repeat for all requests in "state"
> 5) When the last request in "state" completes, call "reqs[0]->base.complete()",
>     which notifies crypto_wait.
>
> From what I can understand, the latency cost should be the same for
> processing a request chain in series vs. processing each request as it is
> done today in zswap, by calling:
>
>   comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
>
> It is not clear to me if there is a cost amortization benefit for software
> compressors. One of the requirements from Yosry was that there should
> be no change for the software compressors, which is what I have
> attempted to do in v5.
>
> Can you please help us understand if there is room for optimizing
> the implementation of the synchronous "acomp_do_req_chain()" API?
> I would also like to get inputs from the zswap maintainers on using
> request chaining for a batching implementation for software compressors.

Is there a functional change in doing so, or just using different
interfaces to accomplish the same thing we do today?


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 10/12] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching.
  2024-12-21  6:31 ` [PATCH v5 10/12] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching Kanchana P Sridhar
@ 2025-01-07  0:58   ` Yosry Ahmed
  2025-01-08  3:26     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 55+ messages in thread
From: Yosry Ahmed @ 2025-01-07  0:58 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, akpm, linux-crypto, herbert,
	davem, clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
	wajdi.k.feghali, vinodh.gopal

On Fri, Dec 20, 2024 at 10:31 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> This patch does the following:
>
> 1) Defines ZSWAP_MAX_BATCH_SIZE to denote the maximum number of acomp_ctx
>    batching resources (acomp_reqs and buffers) to allocate if the zswap
>    compressor supports batching. Currently, ZSWAP_MAX_BATCH_SIZE is set to
>    8U.
>
> 2) Modifies the definition of "struct crypto_acomp_ctx" to represent a
>    configurable number of acomp_reqs and buffers. Adds a "nr_reqs" to
>    "struct crypto_acomp_ctx" to contain the number of resources that will
>    be allocated in the cpu hotplug onlining code.
>
> 3) The zswap_cpu_comp_prepare() cpu onlining code will detect if the
>    crypto_acomp created for the zswap pool (in other words, the zswap
>    compression algorithm) has registered implementations for
>    batch_compress() and batch_decompress().

This is an implementation detail that is not visible to the zswap
code. Please do not refer to batch_compress() and batch_decompress()
here, just mention that we check if the compressor supports batching.

> If so, it will query the
>    crypto_acomp for the maximum batch size supported by the compressor, and
>    set "nr_reqs" to the minimum of this compressor-specific max batch size
>    and ZSWAP_MAX_BATCH_SIZE. Finally, it will allocate "nr_reqs"
>    reqs/buffers, and set the acomp_ctx->nr_reqs accordingly.
>
> 4) If the crypto_acomp does not support batching, "nr_reqs" defaults to 1.

General note, some implementation details are obvious from the code
and do not need to be explained in the commit log. It's mostly useful
to explain what you are doing from a high level, and why you are doing
it.

In this case, we should mainly describe that we are adding support for
the per-CPU acomp_ctx to track multiple compression/decompression
requests but are not actually using more than one request yet. Mention
that followup changes will actually utilize this to batch
compression/decompression of multiple pages, and highlight important
implementation details (such as ZSWAP_MAX_BATCH_SIZE limiting the
amount of extra memory we are using for this, and that there is no
extra memory usage for compressors that do not use batching).

>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/zswap.c | 122 +++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 90 insertions(+), 32 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 9718c33f8192..99cd78891fd0 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -78,6 +78,13 @@ static bool zswap_pool_reached_full;
>
>  #define ZSWAP_PARAM_UNSET ""
>
> +/*
> + * For compression batching of large folios:
> + * Maximum number of acomp compress requests that will be processed
> + * in a batch, iff the zswap compressor supports batching.
> + */

Please mention that this limit exists because we preallocate enough
requests and buffers accordingly, so a higher limit means higher
memory usage.
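
For example, something along these lines (the wording is only a
suggestion):

/*
 * ZSWAP_MAX_BATCH_SIZE limits the number of acomp requests and buffers
 * that are preallocated per CPU when the compressor supports batching,
 * so a larger value directly means more preallocated memory.
 */
#define ZSWAP_MAX_BATCH_SIZE	8U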

> +#define ZSWAP_MAX_BATCH_SIZE 8U
> +
>  static int zswap_setup(void);
>
>  /* Enable/disable zswap */
> @@ -143,9 +150,10 @@ bool zswap_never_enabled(void)
>
>  struct crypto_acomp_ctx {
>         struct crypto_acomp *acomp;
> -       struct acomp_req *req;
> +       struct acomp_req **reqs;
> +       u8 **buffers;
> +       unsigned int nr_reqs;
>         struct crypto_wait wait;
> -       u8 *buffer;
>         struct mutex mutex;
>         bool is_sleepable;
>  };
> @@ -818,49 +826,88 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>         struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
>         struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
>         struct crypto_acomp *acomp;
> -       struct acomp_req *req;
> -       int ret;
> +       unsigned int nr_reqs = 1;
> +       int ret = -ENOMEM;
> +       int i, j;
>
>         mutex_init(&acomp_ctx->mutex);
> -
> -       acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
> -       if (!acomp_ctx->buffer)
> -               return -ENOMEM;
> +       acomp_ctx->nr_reqs = 0;
>
>         acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
>         if (IS_ERR(acomp)) {
>                 pr_err("could not alloc crypto acomp %s : %ld\n",
>                                 pool->tfm_name, PTR_ERR(acomp));
> -               ret = PTR_ERR(acomp);
> -               goto acomp_fail;
> +               return PTR_ERR(acomp);
>         }
>         acomp_ctx->acomp = acomp;
>         acomp_ctx->is_sleepable = acomp_is_async(acomp);
>
> -       req = acomp_request_alloc(acomp_ctx->acomp);
> -       if (!req) {
> -               pr_err("could not alloc crypto acomp_request %s\n",
> -                      pool->tfm_name);
> -               ret = -ENOMEM;
> +       /*
> +        * Create the necessary batching resources if the crypto acomp alg
> +        * implements the batch_compress and batch_decompress API.

No mention of the internal implementation of acomp_has_async_batching() please.

> +        */
> +       if (acomp_has_async_batching(acomp)) {
> +               nr_reqs = min(ZSWAP_MAX_BATCH_SIZE, crypto_acomp_batch_size(acomp));
> +               pr_info_once("Creating acomp_ctx with %d reqs/buffers for batching since crypto acomp\n%s has registered batch_compress() and batch_decompress().\n",
> +                       nr_reqs, pool->tfm_name);

This will only be printed once, so if the compressor changes, the
information will no longer be up-to-date on all CPUs. I think we
should just drop it.

> +       }
> +
> +       acomp_ctx->buffers = kmalloc_node(nr_reqs * sizeof(u8 *), GFP_KERNEL, cpu_to_node(cpu));

Can we use kcalloc_node() here?
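
i.e., something like (just a sketch of the same allocation):

	acomp_ctx->buffers = kcalloc_node(nr_reqs, sizeof(u8 *), GFP_KERNEL,
					  cpu_to_node(cpu));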

> +       if (!acomp_ctx->buffers)
> +               goto buf_fail;
> +
> +       for (i = 0; i < nr_reqs; ++i) {
> +               acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
> +               if (!acomp_ctx->buffers[i]) {
> +                       for (j = 0; j < i; ++j)
> +                               kfree(acomp_ctx->buffers[j]);
> +                       kfree(acomp_ctx->buffers);
> +                       ret = -ENOMEM;
> +                       goto buf_fail;
> +               }
> +       }
> +
> +       acomp_ctx->reqs = kmalloc_node(nr_reqs * sizeof(struct acomp_req *), GFP_KERNEL, cpu_to_node(cpu));

Ditto.

> +       if (!acomp_ctx->reqs)
>                 goto req_fail;
> +
> +       for (i = 0; i < nr_reqs; ++i) {
> +               acomp_ctx->reqs[i] = acomp_request_alloc(acomp_ctx->acomp);
> +               if (!acomp_ctx->reqs[i]) {
> +                       pr_err("could not alloc crypto acomp_request reqs[%d] %s\n",
> +                              i, pool->tfm_name);
> +                       for (j = 0; j < i; ++j)
> +                               acomp_request_free(acomp_ctx->reqs[j]);
> +                       kfree(acomp_ctx->reqs);
> +                       ret = -ENOMEM;
> +                       goto req_fail;
> +               }
>         }
> -       acomp_ctx->req = req;
>
> +       /*
> +        * The crypto_wait is used only in fully synchronous, i.e., with scomp
> +        * or non-poll mode of acomp, hence there is only one "wait" per
> +        * acomp_ctx, with callback set to reqs[0], under the assumption that
> +        * there is at least 1 request per acomp_ctx.
> +        */
>         crypto_init_wait(&acomp_ctx->wait);
>         /*
>          * if the backend of acomp is async zip, crypto_req_done() will wakeup
>          * crypto_wait_req(); if the backend of acomp is scomp, the callback
>          * won't be called, crypto_wait_req() will return without blocking.
>          */
> -       acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
> +       acomp_request_set_callback(acomp_ctx->reqs[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
>                                    crypto_req_done, &acomp_ctx->wait);
>
> +       acomp_ctx->nr_reqs = nr_reqs;
>         return 0;
>
>  req_fail:
> +       for (i = 0; i < nr_reqs; ++i)
> +               kfree(acomp_ctx->buffers[i]);
> +       kfree(acomp_ctx->buffers);

The cleanup code is all over the place. Sometimes it's done in the
loops allocating the memory and sometimes here. It's a bit hard to
follow. Please have all the cleanups here. You can just initialize the
arrays to 0s, and then if the array is not-NULL you can free any
non-NULL elements (kfree() will handle NULLs gracefully).

There may even be potential for code reuse with zswap_cpu_comp_dead().
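
A rough sketch of the consolidated error handling (illustrative only; it
reuses the allocation calls from this patch, while the single "fail"
label and the -ENOMEM return are my own shorthand):

	acomp_ctx->buffers = kcalloc_node(nr_reqs, sizeof(u8 *), GFP_KERNEL,
					  cpu_to_node(cpu));
	acomp_ctx->reqs = kcalloc_node(nr_reqs, sizeof(struct acomp_req *),
				       GFP_KERNEL, cpu_to_node(cpu));
	if (!acomp_ctx->buffers || !acomp_ctx->reqs)
		goto fail;

	for (i = 0; i < nr_reqs; ++i) {
		acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
						     cpu_to_node(cpu));
		acomp_ctx->reqs[i] = acomp_request_alloc(acomp_ctx->acomp);
		if (!acomp_ctx->buffers[i] || !acomp_ctx->reqs[i])
			goto fail;
	}

	[ ... rest of the setup as in this patch ... ]

	return 0;

fail:
	/* Zero-initialized arrays let a single error path do NULL checks. */
	for (i = 0; i < nr_reqs; ++i) {
		if (acomp_ctx->reqs && acomp_ctx->reqs[i])
			acomp_request_free(acomp_ctx->reqs[i]);
		if (acomp_ctx->buffers)
			kfree(acomp_ctx->buffers[i]);
	}
	kfree(acomp_ctx->reqs);
	kfree(acomp_ctx->buffers);
	crypto_free_acomp(acomp_ctx->acomp);
	return -ENOMEM;

The same free loop is essentially what zswap_cpu_comp_dead() needs to
do, so it could likely be factored into a small shared helper.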

> +buf_fail:
>         crypto_free_acomp(acomp_ctx->acomp);
> -acomp_fail:
> -       kfree(acomp_ctx->buffer);
>         return ret;
>  }
>
> @@ -870,11 +917,22 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
>         struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
>
>         if (!IS_ERR_OR_NULL(acomp_ctx)) {
> -               if (!IS_ERR_OR_NULL(acomp_ctx->req))
> -                       acomp_request_free(acomp_ctx->req);
> +               int i;
> +
> +               for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> +                       if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
> +                               acomp_request_free(acomp_ctx->reqs[i]);
> +               kfree(acomp_ctx->reqs);
> +
> +               for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> +                       kfree(acomp_ctx->buffers[i]);
> +               kfree(acomp_ctx->buffers);
> +
>                 if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
>                         crypto_free_acomp(acomp_ctx->acomp);
> -               kfree(acomp_ctx->buffer);
> +
> +               acomp_ctx->nr_reqs = 0;
> +               acomp_ctx = NULL;
>         }
>
>         return 0;
> @@ -897,7 +955,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>
>         mutex_lock(&acomp_ctx->mutex);
>
> -       dst = acomp_ctx->buffer;
> +       dst = acomp_ctx->buffers[0];
>         sg_init_table(&input, 1);
>         sg_set_page(&input, page, PAGE_SIZE, 0);
>
> @@ -907,7 +965,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>          * giving the dst buffer with enough length to avoid buffer overflow.
>          */
>         sg_init_one(&output, dst, PAGE_SIZE * 2);
> -       acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
> +       acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, PAGE_SIZE, dlen);
>
>         /*
>          * it maybe looks a little bit silly that we send an asynchronous request,
> @@ -921,8 +979,8 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>          * but in different threads running on different cpu, we have different
>          * acomp instance, so multiple threads can do (de)compression in parallel.
>          */
> -       comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> -       dlen = acomp_ctx->req->dlen;
> +       comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
> +       dlen = acomp_ctx->reqs[0]->dlen;
>         if (comp_ret)
>                 goto unlock;
>
> @@ -975,20 +1033,20 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>          */
>         if ((acomp_ctx->is_sleepable && !zpool_can_sleep_mapped(zpool)) ||
>             !virt_addr_valid(src)) {
> -               memcpy(acomp_ctx->buffer, src, entry->length);
> -               src = acomp_ctx->buffer;
> +               memcpy(acomp_ctx->buffers[0], src, entry->length);
> +               src = acomp_ctx->buffers[0];
>                 zpool_unmap_handle(zpool, entry->handle);
>         }
>
>         sg_init_one(&input, src, entry->length);
>         sg_init_table(&output, 1);
>         sg_set_folio(&output, folio, PAGE_SIZE, 0);
> -       acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> -       BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> -       BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> +       acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, entry->length, PAGE_SIZE);
> +       BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->reqs[0]), &acomp_ctx->wait));
> +       BUG_ON(acomp_ctx->reqs[0]->dlen != PAGE_SIZE);
>         mutex_unlock(&acomp_ctx->mutex);
>
> -       if (src != acomp_ctx->buffer)
> +       if (src != acomp_ctx->buffers[0])
>                 zpool_unmap_handle(zpool, entry->handle);
>  }
>
> --
> 2.27.0
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 11/12] mm: zswap: Restructure & simplify zswap_store() to make it amenable for batching.
  2024-12-21  6:31 ` [PATCH v5 11/12] mm: zswap: Restructure & simplify zswap_store() to make it amenable for batching Kanchana P Sridhar
@ 2025-01-07  1:16   ` Yosry Ahmed
  2025-01-08  3:57     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 55+ messages in thread
From: Yosry Ahmed @ 2025-01-07  1:16 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, akpm, linux-crypto, herbert,
	davem, clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
	wajdi.k.feghali, vinodh.gopal

On Fri, Dec 20, 2024 at 10:31 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> This patch introduces zswap_store_folio() that implements all the computes
> done earlier in zswap_store_page() for a single-page, for all the pages in
> a folio. This allows us to move the loop over the folio's pages from
> zswap_store() to zswap_store_folio().
>
> A distinct zswap_compress_folio() is also added, that simply calls
> zswap_compress() for each page in the folio it is called with.

The git diff looks funky; it may make things clearer to introduce
zswap_compress_folio() in a separate patch.

>
> zswap_store_folio() starts by allocating all zswap entries required to
> store the folio. Next, it calls zswap_compress_folio() and finally, adds
> the entries to the xarray and LRU.
>
> The error handling and cleanup required for all failure scenarios that can
> occur while storing a folio in zswap is now consolidated to a
> "store_folio_failed" label in zswap_store_folio().
>
> These changes facilitate developing support for compress batching in
> zswap_store_folio().
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/zswap.c | 183 +++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 116 insertions(+), 67 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 99cd78891fd0..1be0f1807bfc 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1467,77 +1467,129 @@ static void shrink_worker(struct work_struct *w)
>  * main API
>  **********************************/
>
> -static ssize_t zswap_store_page(struct page *page,
> -                               struct obj_cgroup *objcg,
> -                               struct zswap_pool *pool)
> +static bool zswap_compress_folio(struct folio *folio,
> +                                struct zswap_entry *entries[],
> +                                struct zswap_pool *pool)
>  {
> -       swp_entry_t page_swpentry = page_swap_entry(page);
> -       struct zswap_entry *entry, *old;
> +       long index, nr_pages = folio_nr_pages(folio);
>
> -       /* allocate entry */
> -       entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
> -       if (!entry) {
> -               zswap_reject_kmemcache_fail++;
> -               return -EINVAL;
> +       for (index = 0; index < nr_pages; ++index) {
> +               struct page *page = folio_page(folio, index);
> +
> +               if (!zswap_compress(page, entries[index], pool))
> +                       return false;
>         }
>
> -       if (!zswap_compress(page, entry, pool))
> -               goto compress_failed;
> +       return true;
> +}
>
> -       old = xa_store(swap_zswap_tree(page_swpentry),
> -                      swp_offset(page_swpentry),
> -                      entry, GFP_KERNEL);
> -       if (xa_is_err(old)) {
> -               int err = xa_err(old);
> +/*
> + * Store all pages in a folio.
> + *
> + * The error handling from all failure points is consolidated to the
> + * "store_folio_failed" label, based on the initialization of the zswap entries'
> + * handles to ERR_PTR(-EINVAL) at allocation time, and the fact that the
> + * entry's handle is subsequently modified only upon a successful zpool_malloc()
> + * after the page is compressed.
> + */
> +static ssize_t zswap_store_folio(struct folio *folio,
> +                                struct obj_cgroup *objcg,
> +                                struct zswap_pool *pool)
> +{
> +       long index, nr_pages = folio_nr_pages(folio);
> +       struct zswap_entry **entries = NULL;
> +       int node_id = folio_nid(folio);
> +       size_t compressed_bytes = 0;
>
> -               WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> -               zswap_reject_alloc_fail++;
> -               goto store_failed;
> +       entries = kmalloc(nr_pages * sizeof(*entries), GFP_KERNEL);

We can probably use kcalloc() here.
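
i.e. (sketch):

	entries = kcalloc(nr_pages, sizeof(*entries), GFP_KERNEL);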

> +       if (!entries)
> +               return -ENOMEM;
> +
> +       /* allocate entries */

This comment can be dropped.

> +       for (index = 0; index < nr_pages; ++index) {
> +               entries[index] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
> +
> +               if (!entries[index]) {
> +                       zswap_reject_kmemcache_fail++;
> +                       nr_pages = index;
> +                       goto store_folio_failed;
> +               }
> +
> +               entries[index]->handle = (unsigned long)ERR_PTR(-EINVAL);
>         }
>
> -       /*
> -        * We may have had an existing entry that became stale when
> -        * the folio was redirtied and now the new version is being
> -        * swapped out. Get rid of the old.
> -        */
> -       if (old)
> -               zswap_entry_free(old);
> +       if (!zswap_compress_folio(folio, entries, pool))
> +               goto store_folio_failed;
>
> -       /*
> -        * The entry is successfully compressed and stored in the tree, there is
> -        * no further possibility of failure. Grab refs to the pool and objcg.
> -        * These refs will be dropped by zswap_entry_free() when the entry is
> -        * removed from the tree.
> -        */
> -       zswap_pool_get(pool);
> -       if (objcg)
> -               obj_cgroup_get(objcg);
> +       for (index = 0; index < nr_pages; ++index) {
> +               swp_entry_t page_swpentry = page_swap_entry(folio_page(folio, index));
> +               struct zswap_entry *old, *entry = entries[index];
> +
> +               old = xa_store(swap_zswap_tree(page_swpentry),
> +                              swp_offset(page_swpentry),
> +                              entry, GFP_KERNEL);
> +               if (xa_is_err(old)) {
> +                       int err = xa_err(old);
> +
> +                       WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> +                       zswap_reject_alloc_fail++;
> +                       goto store_folio_failed;
> +               }
>
> -       /*
> -        * We finish initializing the entry while it's already in xarray.
> -        * This is safe because:
> -        *
> -        * 1. Concurrent stores and invalidations are excluded by folio lock.
> -        *
> -        * 2. Writeback is excluded by the entry not being on the LRU yet.
> -        *    The publishing order matters to prevent writeback from seeing
> -        *    an incoherent entry.
> -        */
> -       entry->pool = pool;
> -       entry->swpentry = page_swpentry;
> -       entry->objcg = objcg;
> -       entry->referenced = true;
> -       if (entry->length) {
> -               INIT_LIST_HEAD(&entry->lru);
> -               zswap_lru_add(&zswap_list_lru, entry);
> +               /*
> +                * We may have had an existing entry that became stale when
> +                * the folio was redirtied and now the new version is being
> +                * swapped out. Get rid of the old.
> +                */
> +               if (old)
> +                       zswap_entry_free(old);
> +
> +               /*
> +                * The entry is successfully compressed and stored in the tree, there is
> +                * no further possibility of failure. Grab refs to the pool and objcg.
> +                * These refs will be dropped by zswap_entry_free() when the entry is
> +                * removed from the tree.
> +                */
> +               zswap_pool_get(pool);
> +               if (objcg)
> +                       obj_cgroup_get(objcg);
> +
> +               /*
> +                * We finish initializing the entry while it's already in xarray.
> +                * This is safe because:
> +                *
> +                * 1. Concurrent stores and invalidations are excluded by folio lock.
> +                *
> +                * 2. Writeback is excluded by the entry not being on the LRU yet.
> +                *    The publishing order matters to prevent writeback from seeing
> +                *    an incoherent entry.
> +                */
> +               entry->pool = pool;
> +               entry->swpentry = page_swpentry;
> +               entry->objcg = objcg;
> +               entry->referenced = true;
> +               if (entry->length) {
> +                       INIT_LIST_HEAD(&entry->lru);
> +                       zswap_lru_add(&zswap_list_lru, entry);
> +               }
> +
> +               compressed_bytes += entry->length;
>         }
>
> -       return entry->length;
> +       kfree(entries);
> +
> +       return compressed_bytes;
> +
> +store_folio_failed:
> +       for (index = 0; index < nr_pages; ++index) {
> +               if (!IS_ERR_VALUE(entries[index]->handle))
> +                       zpool_free(pool->zpool, entries[index]->handle);
> +
> +               zswap_entry_cache_free(entries[index]);
> +       }

If there is a failure in xa_store() halfway through the entries, this
loop will free all the compressed objects and entries. But, some of
the entries are already in the xarray, and zswap_store() will try to
free them again. This seems like a bug, or did I miss something here?

> +
> +       kfree(entries);
>
> -store_failed:
> -       zpool_free(pool->zpool, entry->handle);
> -compress_failed:
> -       zswap_entry_cache_free(entry);
>         return -EINVAL;
>  }
>
> @@ -1549,8 +1601,8 @@ bool zswap_store(struct folio *folio)
>         struct mem_cgroup *memcg = NULL;
>         struct zswap_pool *pool;
>         size_t compressed_bytes = 0;
> +       ssize_t bytes;
>         bool ret = false;
> -       long index;
>
>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
>         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> @@ -1584,15 +1636,11 @@ bool zswap_store(struct folio *folio)
>                 mem_cgroup_put(memcg);
>         }
>
> -       for (index = 0; index < nr_pages; ++index) {
> -               struct page *page = folio_page(folio, index);
> -               ssize_t bytes;
> +       bytes = zswap_store_folio(folio, objcg, pool);
> +       if (bytes < 0)
> +               goto put_pool;
>
> -               bytes = zswap_store_page(page, objcg, pool);
> -               if (bytes < 0)
> -                       goto put_pool;
> -               compressed_bytes += bytes;
> -       }
> +       compressed_bytes = bytes;

What's the point of having both compressed_bytes and bytes now?

>
>         if (objcg) {
>                 obj_cgroup_charge_zswap(objcg, compressed_bytes);
> @@ -1622,6 +1670,7 @@ bool zswap_store(struct folio *folio)
>                 pgoff_t offset = swp_offset(swp);
>                 struct zswap_entry *entry;
>                 struct xarray *tree;
> +               long index;
>
>                 for (index = 0; index < nr_pages; ++index) {
>                         tree = swap_zswap_tree(swp_entry(type, offset + index));
> --
> 2.27.0
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 12/12] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios.
  2024-12-21  6:31 ` [PATCH v5 12/12] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios Kanchana P Sridhar
@ 2025-01-07  1:19   ` Yosry Ahmed
  0 siblings, 0 replies; 55+ messages in thread
From: Yosry Ahmed @ 2025-01-07  1:19 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, akpm, linux-crypto, herbert,
	davem, clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
	wajdi.k.feghali, vinodh.gopal

On Fri, Dec 20, 2024 at 10:31 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> zswap_compress_folio() is modified to detect if the pool's acomp_ctx has
> more than one "nr_reqs", which will be the case if the cpu onlining code
> has allocated batching resources in the acomp_ctx based on the queries to
> acomp_has_async_batching() and crypto_acomp_batch_size(). If multiple
> "nr_reqs" are available in the acomp_ctx, it means compress batching can be
> used with a batch-size of "acomp_ctx->nr_reqs".
>
> If compress batching can be used with the given zswap pool,
> zswap_compress_folio() will invoke the newly added zswap_batch_compress()
> procedure to compress and store the folio in batches of
> "acomp_ctx->nr_reqs" pages. The batch size is effectively
> "acomp_ctx->nr_reqs".
>
> zswap_batch_compress() calls crypto_acomp_batch_compress() to compress each
> batch of (up to) "acomp_ctx->nr_reqs" pages. The iaa_crypto driver
> will compress each batch of pages in parallel in the Intel IAA hardware
> with 'async' mode and request chaining.
>
> Hence, zswap_batch_compress() does the same computes for a batch, as
> zswap_compress() does for a page; and returns true if the batch was
> successfully compressed/stored, and false otherwise.
>
> If the pool does not support compress batching, zswap_compress_folio()
> calls zswap_compress() for each individual page in the folio, as before.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/zswap.c | 109 +++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 105 insertions(+), 4 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 1be0f1807bfc..f336fafe24c4 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1467,17 +1467,118 @@ static void shrink_worker(struct work_struct *w)
>  * main API
>  **********************************/
>
> +static bool zswap_batch_compress(struct folio *folio,
> +                                long index,
> +                                unsigned int batch_size,
> +                                struct zswap_entry *entries[],
> +                                struct zswap_pool *pool,
> +                                struct crypto_acomp_ctx *acomp_ctx)
> +{
> +       int comp_errors[ZSWAP_MAX_BATCH_SIZE] = { 0 };
> +       unsigned int dlens[ZSWAP_MAX_BATCH_SIZE];
> +       struct page *pages[ZSWAP_MAX_BATCH_SIZE];
> +       unsigned int i, nr_batch_pages;
> +       bool ret = true;
> +
> +       nr_batch_pages = min((unsigned int)(folio_nr_pages(folio) - index), batch_size);
> +
> +       for (i = 0; i < nr_batch_pages; ++i) {
> +               pages[i] = folio_page(folio, index + i);
> +               dlens[i] = PAGE_SIZE;
> +       }
> +
> +       mutex_lock(&acomp_ctx->mutex);
> +
> +       /*
> +        * Batch compress @nr_batch_pages. If IAA is the compressor, the
> +        * hardware will compress @nr_batch_pages in parallel.
> +        */
> +       ret = crypto_acomp_batch_compress(
> +               acomp_ctx->reqs,
> +               &acomp_ctx->wait,
> +               pages,
> +               acomp_ctx->buffers,
> +               dlens,
> +               comp_errors,
> +               nr_batch_pages);

I will hold off on reviewing this patch until the acomp interface is
settled, but I am wondering if this can be a vectorization of
zswap_compress() instead, since there's a lot of common code.
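
(Purely a hypothetical sketch of that direction; the helper name and
signature below are invented for illustration and are not part of this
series:)

	/*
	 * One routine that compresses 1..nr pages under the acomp_ctx mutex:
	 * zswap_compress() becomes the nr == 1 case and batching the nr > 1
	 * case, instead of two largely duplicated code paths.
	 */
	static bool zswap_compress_pages(struct page *pages[],
					 struct zswap_entry *entries[],
					 unsigned int nr,
					 struct zswap_pool *pool);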

> +
> +       if (ret) {
> +               /*
> +                * All batch pages were successfully compressed.
> +                * Store the pages in zpool.
> +                */
> +               struct zpool *zpool = pool->zpool;
> +               gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
> +
> +               if (zpool_malloc_support_movable(zpool))
> +                       gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
> +
> +               for (i = 0; i < nr_batch_pages; ++i) {
> +                       unsigned long handle;
> +                       char *buf;
> +                       int err;
> +
> +                       err = zpool_malloc(zpool, dlens[i], gfp, &handle);
> +
> +                       if (err) {
> +                               if (err == -ENOSPC)
> +                                       zswap_reject_compress_poor++;
> +                               else
> +                                       zswap_reject_alloc_fail++;
> +
> +                               ret = false;
> +                               break;
> +                       }
> +
> +                       buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
> +                       memcpy(buf, acomp_ctx->buffers[i], dlens[i]);
> +                       zpool_unmap_handle(zpool, handle);
> +
> +                       entries[i]->handle = handle;
> +                       entries[i]->length = dlens[i];
> +               }
> +       } else {
> +               /* Some batch pages had compression errors. */
> +               for (i = 0; i < nr_batch_pages; ++i) {
> +                       if (comp_errors[i]) {
> +                               if (comp_errors[i] == -ENOSPC)
> +                                       zswap_reject_compress_poor++;
> +                               else
> +                                       zswap_reject_compress_fail++;
> +                       }
> +               }
> +       }
> +
> +       mutex_unlock(&acomp_ctx->mutex);
> +
> +       return ret;
> +}
> +
>  static bool zswap_compress_folio(struct folio *folio,
>                                  struct zswap_entry *entries[],
>                                  struct zswap_pool *pool)
>  {
>         long index, nr_pages = folio_nr_pages(folio);
> +       struct crypto_acomp_ctx *acomp_ctx;
> +       unsigned int batch_size;
>
> -       for (index = 0; index < nr_pages; ++index) {
> -               struct page *page = folio_page(folio, index);
> +       acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> +       batch_size = acomp_ctx->nr_reqs;
>
> -               if (!zswap_compress(page, entries[index], pool))
> -                       return false;
> +       if ((batch_size > 1) && (nr_pages > 1)) {
> +               for (index = 0; index < nr_pages; index += batch_size) {
> +
> +                       if (!zswap_batch_compress(folio, index, batch_size,
> +                                                 &entries[index], pool, acomp_ctx))
> +                               return false;
> +               }
> +       } else {
> +               for (index = 0; index < nr_pages; ++index) {
> +                       struct page *page = folio_page(folio, index);
> +
> +                       if (!zswap_compress(page, entries[index], pool))
> +                               return false;
> +               }
>         }
>
>         return true;
> --
> 2.27.0
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-01-06 23:24       ` Yosry Ahmed
@ 2025-01-07  1:36         ` Sridhar, Kanchana P
  2025-01-07  1:46           ` Yosry Ahmed
  0 siblings, 1 reply; 55+ messages in thread
From: Sridhar, Kanchana P @ 2025-01-07  1:36 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Herbert Xu, linux-kernel, linux-mm, hannes, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, davem, clabbe, ardb, ebiggers, surenb, Accardi,
	Kristen C, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P

Hi Yosry,

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Monday, January 6, 2025 3:24 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for
> compress/decompress batching.
> 
> On Mon, Jan 6, 2025 at 9:37 AM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi Herbert,
> >
> > > -----Original Message-----
> > > From: Herbert Xu <herbert@gondor.apana.org.au>
> > > Sent: Saturday, December 28, 2024 3:46 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> > > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > > ryan.roberts@arm.com; 21cnbao@gmail.com; akpm@linux-
> foundation.org;
> > > linux-crypto@vger.kernel.org; davem@davemloft.net;
> clabbe@baylibre.com;
> > > ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> > > Kristen C <kristen.c.accardi@intel.com>; Feghali, Wajdi K
> > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for
> > > compress/decompress batching.
> > >
> > > On Fri, Dec 20, 2024 at 10:31:09PM -0800, Kanchana P Sridhar wrote:
> > > > This commit adds get_batch_size(), batch_compress() and
> > > batch_decompress()
> > > > interfaces to:
> > >
> > > First of all we don't need a batch compress/decompress interface
> > > because the whole point of request chaining is to supply the data
> > > in batches.
> > >
> > > I'm also against having a get_batch_size because the user should
> > > be supplying as much data as they're comfortable with.  In other
> > > words if the user is happy to give us 8 requests for iaa then it
> > > should be happy to give us 8 requests for every implementation.
> > >
> > > The request chaining interface should be such that processing
> > > 8 requests is always better than doing 1 request at a time as
> > > the cost is amortised.
> >
> > Thanks for your comments. Can you please elaborate on how
> > request chaining would enable cost amortization for software
> > compressors? With the current implementation, a module like
> > zswap would need to do the following to invoke request chaining
> > for software compressors (in addition to pushing the chaining
> > to the user layer for IAA, as per your suggestion on not needing a
> > batch compress/decompress interface):
> >
> > zswap_batch_compress():
> >    for (i = 0; i < nr_pages_in_batch; ++i) {
> >       /* set up the acomp_req "reqs[i]". */
> >       [ ... ]
> >       if (i)
> >         acomp_request_chain(reqs[i], reqs[0]);
> >       else
> >         acomp_reqchain_init(reqs[0], 0, crypto_req_done, crypto_wait);
> >    }
> >
> >    /* Process the request chain in series. */
> >    err = crypto_wait_req(acomp_do_req_chain(reqs[0],
> crypto_acomp_compress), crypto_wait);
> >
> > Internally, acomp_do_req_chain() would sequentially process the
> > request chain by:
> > 1) adding all requests to a list "state"
> > 2) call "crypto_acomp_compress()" for the next list element
> > 3) when this request completes, dequeue it from the list "state"
> > 4) repeat for all requests in "state"
> > 5) When the last request in "state" completes, call "reqs[0]-
> >base.complete()",
> >     which notifies crypto_wait.
> >
> > From what I can understand, the latency cost should be the same for
> > processing a request chain in series vs. processing each request as it is
> > done today in zswap, by calling:
> >
> >   comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx-
> >reqs[0]), &acomp_ctx->wait);
> >
> > It is not clear to me if there is a cost amortization benefit for software
> > compressors. One of the requirements from Yosry was that there should
> > be no change for the software compressors, which is what I have
> > attempted to do in v5.
> >
> > Can you please help us understand if there is room for optimizing
> > the implementation of the synchronous "acomp_do_req_chain()" API?
> > I would also like to get inputs from the zswap maintainers on using
> > request chaining for a batching implementation for software compressors.
> 
> Is there a functional change in doing so, or just using different
> interfaces to accomplish the same thing we do today?

The code paths for software compressors are considerably different between
these two scenarios:

1) Given a batch of 8 pages: for each page, call zswap_compress() that does this:

    	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);

2) Given a batch of 8 pages:
     a) Create a request chain of 8 acomp_reqs, starting with reqs[0], as
         described earlier.
     b) Process the request chain by calling:

              err = crypto_wait_req(acomp_do_req_chain(reqs[0], crypto_acomp_compress), &acomp_ctx->wait);
	/* Get each req's error status. */
	for (i = 0; i < nr_pages; ++i) {
		errors[i] = acomp_request_err(reqs[i]);
		if (errors[i]) {
			pr_debug("Request chaining req %d compress error %d\n", i, errors[i]);
		} else {
			dlens[i] = reqs[i]->dlen;
		}
	}

What I mean by considerably different code paths is that request chaining
internally overwrites the req's base.complete and base.data (after saving the
original values) to implement the algorithm described earlier. Basically, the
chain is processed in series by getting the next req in the chain, setting its
completion function to "acomp_reqchain_done()", which gets called when
the "op" (crypto_acomp_compress()) is completed for that req.
acomp_reqchain_done() will cause the next req to be processed in the
same manner. If this next req happens to be the last req to be processed,
it will notify the original completion function of reqs[0], with the crypto_wait
that zswap sets up in zswap_cpu_comp_prepare():

	acomp_request_set_callback(acomp_ctx->reqs[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
				   crypto_req_done, &acomp_ctx->wait);

Patch [1] in v5 of this series has the full implementation of acomp_do_req_chain()
in case you want to understand this in more detail.

The "functional change" wrt request chaining is limited to the above.

[1]: https://patchwork.kernel.org/project/linux-mm/patch/20241221063119.29140-2-kanchana.p.sridhar@intel.com/

Thanks,
Kanchana


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 00/12] zswap IAA compress batching
  2024-12-21  6:31 [PATCH v5 00/12] zswap IAA compress batching Kanchana P Sridhar
                   ` (11 preceding siblings ...)
  2024-12-21  6:31 ` [PATCH v5 12/12] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios Kanchana P Sridhar
@ 2025-01-07  1:44 ` Yosry Ahmed
  12 siblings, 0 replies; 55+ messages in thread
From: Yosry Ahmed @ 2025-01-07  1:44 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, akpm, linux-crypto, herbert,
	davem, clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
	wajdi.k.feghali, vinodh.gopal

On Fri, Dec 20, 2024 at 10:31 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
>
> IAA Compression Batching with acomp Request Chaining:
> =====================================================
>
> This patch-series introduces the use of the Intel Analytics Accelerator
> (IAA) for parallel batch compression of pages in large folios to improve
> zswap swapout latency, resulting in sys time reduction by 22% (usemem30)
> and by 27% (kernel compilation); as well as a 30% increase in usemem30
> throughput with IAA batching as compared to zstd.
>
> The patch-series is organized as follows:
>
>  1) crypto acomp & iaa_crypto driver enablers for batching: Relevant
>     patches are tagged with "crypto:" in the subject:
>
>     Patch 1) Adds new acomp request chaining framework and interface based
>              on Herbert Xu's ahash reference implementation in "[PATCH 2/6]
>              crypto: hash - Add request chaining API" [1]. acomp algorithms
>              can use request chaining through these interfaces:
>
>              Setup the request chain:
>                acomp_reqchain_init()
>                acomp_request_chain()
>
>              Process the request chain:
>                acomp_do_req_chain(): synchronously (sequentially)
>                acomp_do_async_req_chain(): asynchronously using submit/poll
>                                            ops (in parallel)
>
>     Patch 2) Adds acomp_alg/crypto_acomp interfaces for batch_compress(),
>              batch_decompress() and get_batch_size(), that swap modules can
>              invoke using the new batching API crypto_acomp_batch_compress(),
>              crypto_acomp_batch_decompress() and crypto_acomp_batch_size().
>              Additionally, crypto acomp provides a new
>              acomp_has_async_batching() interface to query for these API
>              before allocating batching resources for a given compressor in
>              zswap/zram.
>
>     Patch 3) New CRYPTO_ACOMP_REQ_POLL acomp_req flag to act as a gate for
>              async poll mode in iaa_crypto.
>
>     Patch 4) iaa-crypto driver implementations for sync/async
>              crypto_acomp_batch_compress() and
>              crypto_acomp_batch_decompress() developed using request
>              chaining. If the iaa_crypto driver is set up for 'async'
>              sync_mode, these batching implementations deploy the
>              asynchronous request chaining implementation. 'async' is the
>              recommended mode for realizing the benefits of IAA parallelism.
>              If iaa_crypto is set up for 'sync' sync_mode, the synchronous
>              version of the request chaining API is used.
>
>              The "iaa_acomp_fixed_deflate" algorithm registers these
>              implementations for its "batch_compress" and "batch_decompress"
>              interfaces respectively and opts in with CRYPTO_ALG_REQ_CHAIN.
>              Further, iaa_crypto provides an implementation for the
>              "get_batch_size" interface: this returns the
>              IAA_CRYPTO_MAX_BATCH_SIZE constant that iaa_crypto defines
>              currently as 8U for IAA compression algorithms (iaa_crypto can
>              change this if needed as we optimize our batching algorithm).
>
>     Patch 5) Modifies the default iaa_crypto driver mode to async, now that
>              iaa_crypto provides a truly async mode that gives
>              significantly better latency than sync mode for the batching
>              use case.
>
>     Patch 6) Disables verify_compress by default, to facilitate users to
>              run IAA easily for comparison with software compressors.
>
>     Patch 7) Reorganizes the iaa_crypto driver code into logically related
>              sections and avoids forward declarations, in order to facilitate
>              Patch 8. This patch makes no functional changes.
>
>     Patch 8) Makes a major infrastructure change in the iaa_crypto driver,
>              to map IAA devices/work-queues to cores based on packages
>              instead of NUMA nodes. This doesn't impact performance on
>              the Sapphire Rapids system used for performance
>              testing. However, this change fixes functional problems we
>              found on Granite Rapids in internal validation, where the
>              number of NUMA nodes is greater than the number of packages,
>              which was resulting in over-utilization of some IAA devices
>              and non-usage of other IAA devices as per the current NUMA
>              based mapping infrastructure.
>              This patch also eliminates duplication of device wqs in
>              per-cpu wq_tables, thereby saving 140MiB on a 384 cores
>              Granite Rapids server with 8 IAAs. Submitting this change now
>              so that it can go through code reviews before it can be merged.
>
>     Patch 9) Builds upon the new infrastructure for mapping IAAs to cores
>              based on packages, and enables configuring a "global_wq" per
>              IAA, which can be used as a global resource for compress jobs
>              for the package. If the user configures 2WQs per IAA device,
>              the driver will distribute compress jobs from all cores on the
>              package to the "global_wqs" of all the IAA devices on that
>              package, in a round-robin manner. This can be used to improve
>              compression throughput for workloads that see a lot of swapout
>              activity.
>
>  2) zswap modifications to enable compress batching in zswap_store()
>     of large folios (including pmd-mappable folios):
>
>     Patch 10) Defines a zswap-specific ZSWAP_MAX_BATCH_SIZE (currently set
>               to 8U) to denote the maximum number of acomp_ctx batching
>               resources. Further, the "struct crypto_acomp_ctx" is modified
>               to contain a configurable number of acomp_reqs and buffers.
>               The cpu hotplug onlining code queries
>               acomp_has_async_batching(); if this returns "true", it gets
>               the compressor-defined maximum batch size and allocates the
>               minimum of zswap's upper limit and the compressor's maximum
>               batch size as the number of acomp_reqs/buffers. If the acomp
>               does not support batching, 1 acomp_req/buffer is allocated
>               (see the sketch below).
>
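>               A minimal sketch of this sizing decision, using the
>               interfaces named above (illustrative only, not the exact
>               patch code; error handling omitted):
>
>                   unsigned int nr_reqs = 1;
>
>                   if (acomp_has_async_batching(acomp))
>                           nr_reqs = min(ZSWAP_MAX_BATCH_SIZE,
>                                         crypto_acomp_batch_size(acomp));
>
>                   /* nr_reqs acomp_reqs and PAGE_SIZE * 2 sized buffers
>                    * are then allocated per CPU; 1 of each when the
>                    * compressor does not support batching.
>                    */
>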
>     Patch 11) Restructures & simplifies zswap_store() to make it amenable
>               for batching. Moves the loop over the folio's pages to a new
>               zswap_store_folio(), which in turn allocates zswap entries
>               for all folio pages upfront, before proceeding to call a
>               newly added zswap_compress_folio(), which simply calls
>               zswap_compress() for each folio page.
>
>     Patch 12) Finally, this patch modifies zswap_compress_folio() to detect
>               whether the pool's acomp_ctx has batching resources. If so,
>               "acomp_ctx->nr_reqs" becomes the batch size, and
>               crypto_acomp_batch_compress() is called for every
>               "acomp_ctx->nr_reqs" pages in the large folio. The crypto API
>               calls into the new iaa_crypto "iaa_comp_acompress_batch()"
>               that does batching with request chaining. Upon successful
>               compression of a batch, the compressed buffers are stored in
>               zpool (see the sketch below).
>
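>               A rough sketch of the batching walk described above, showing
>               the chunking logic only; store_in_batches() and
>               compress_batch() are hypothetical stand-ins and do not
>               reflect the actual crypto_acomp_batch_compress() signature:
>
>                   static bool store_in_batches(struct folio *folio,
>                                                unsigned int nr_reqs)
>                   {
>                           long nr_pages = folio_nr_pages(folio);
>                           long i;
>
>                           for (i = 0; i < nr_pages; i += nr_reqs) {
>                                   unsigned int batch = min_t(unsigned int,
>                                                              nr_reqs,
>                                                              nr_pages - i);
>
>                                   /* compress "batch" pages starting at i */
>                                   if (!compress_batch(folio, i, batch))
>                                           return false;
>                           }
>
>                           return true;
>                   }
>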
> With v5 of this patch series, the IAA compress batching feature will be
> enabled seamlessly on Intel platforms that have IAA by selecting
> 'deflate-iaa' as the zswap compressor, and using the iaa_crypto 'async'
> sync_mode driver attribute.
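>
> For example, on such a platform this amounts to something like the
> following (sysfs paths as documented for zswap and the iaa_crypto driver;
> 'async' sync_mode and disabled verification are the defaults after this
> series):
>
>     echo deflate-iaa > /sys/module/zswap/parameters/compressor
>     echo async > /sys/bus/dsa/drivers/crypto/sync_mode
>     echo 0 > /sys/bus/dsa/drivers/crypto/verify_compress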
>
> [1]: https://lore.kernel.org/linux-crypto/677614fbdc70b31df2e26483c8d2cd1510c8af91.1730021644.git.herbert@gondor.apana.org.au/
>
>
> System setup for testing:
> =========================
> Testing was done with mm-unstable as of 12-20-2024, commit 5555a83c82d6,
> with and without this patch-series.
> Data was gathered on an Intel Sapphire Rapids server: dual-socket, 56 cores
> per socket, 4 IAA devices per socket, 503 GiB RAM, and a 525G SSD swap
> partition. Core frequency was fixed at 2500MHz.
>
> Other kernel configuration parameters:
>
>     zswap compressor  : zstd, deflate-iaa
>     zswap allocator   : zsmalloc
>     vm.page-cluster   : 0, 2
>
> IAA "compression verification" is disabled and IAA is run in async mode
> (the defaults with this series). 2 WQs are configured per IAA device.
> Compress jobs from all cores on a socket are distributed among all
> 4 IAA devices on the same socket.
>
> I ran experiments with these workloads:
>
> 1) usemem 30 processes with these large folios enabled to "always":
>    - 16k/32k/64k
>    - 2048k
>
> 2) Kernel compilation allmodconfig with 2G max memory, 32 threads, run in
>    tmpfs with these large folios enabled to "always":
>    - 16k/32k/64k
>
>
> IAA compress batching performance: sync vs. async request chaining:
> ===================================================================
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
> processes were run, each allocating and writing 10G of memory, and sleeping
> for 10 sec before exiting:
>
> usemem --init-time -w -O -s 10 -n 30 10g
>
> "async polling" here refers to the v4 implementation of batch compression
> without request chaining, which is used as the baseline for comparing the
> request chaining implementations in v5.
>
> These are the latencies measured using bcc profiling with bpftrace for the
> various iaa_crypto modes:
>
>  -------------------------------------------------------------------------------
>  usemem30: 16k/32k/64k Folios         crypto_acomp_batch_compress() latency
>
>  iaa_crypto batching          count     mean       p50       p99
>  implementation                         (ns)      (ns)      (ns)
>  -------------------------------------------------------------------------------
>
>  async polling            5,210,702    10,083     9,675   17,488
>
>  sync request chaining    5,396,532    33,391    32,977   39,426
>
>  async request chaining   5,509,777     9,959     9,611   16,590
>
>  -------------------------------------------------------------------------------
>
> This demonstrates that async request chaining does not regress IAA compress
> batching performance relative to the v4 implementation without request
> chaining.
>
>
> Performance testing (usemem30):
> ===============================
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
> processes were run, each allocating and writing 10G of memory, and sleeping
> for 10 sec before exiting:
>
> usemem --init-time -w -O -s 10 -n 30 10g
>
>
>  16k/32k/64k folios: usemem30: zstd:
>  ===================================
>
>  -------------------------------------------------------------------------------
>                         mm-unstable-12-20-2024   v5 of this patch-series
>  -------------------------------------------------------------------------------
>  zswap compressor                      zstd             zstd
>  vm.page-cluster                          2                2
>
>  -------------------------------------------------------------------------------
>  Total throughput (KB/s)          6,143,774        6,180,657
>  Avg throughput (KB/s)              204,792          206,021
>  elapsed time (sec)                  110.45           112.02
>  sys time (sec)                    2,628.55         2,684.53
>
>  -------------------------------------------------------------------------------
>  memcg_high                         469,269          481,665
>  memcg_swap_fail                      1,198              910
>  zswpout                         48,932,319       48,931,447
>  zswpin                                 384              398
>  pswpout                                  0                0
>  pswpin                                   0                0
>  thp_swpout                               0                0
>  thp_swpout_fallback                      0                0
>  16kB-swpout_fallback                     0                0
>  32kB_swpout_fallback                     0                0
>  64kB_swpout_fallback                 1,198              910
>  pgmajfault                           3,459            3,090
>  swap_ra                                 96              100
>  swap_ra_hit                             48               54
>  ZSWPOUT-16kB                             2                2
>  ZSWPOUT-32kB                             2                0
>  ZSWPOUT-64kB                     3,057,060        3,057,286
>  SWPOUT-16kB                              0                0
>  SWPOUT-32kB                              0                0
>  SWPOUT-64kB                              0                0
>  -------------------------------------------------------------------------------
>
>
>  16k/32k/64k folios: usemem30: deflate-iaa:
>  ==========================================
>
>  -------------------------------------------------------------------------------
>                     mm-unstable-12-20-2024     v5 of this patch-series
>  -------------------------------------------------------------------------------
>  zswap compressor             deflate-iaa        deflate-iaa      IAA Batching
>  vm.page-cluster                        2                  2       vs.     vs.
>                                                                    Seq    zstd
>  -------------------------------------------------------------------------------
>  Total throughput (KB/s)        7,679,064         8,027,314         5%    30%
>  Avg throughput (KB/s)            255,968           267,577         5%    30%
>  elapsed time (sec)                 90.82             87.53        -4%   -22%
>  sys time (sec)                  2,205.73          2,099.80        -5%   -22%
>
>  -------------------------------------------------------------------------------
>  memcg_high                       716,670           722,693
>  memcg_swap_fail                    1,187             1,251
>  zswpout                       64,511,695        64,510,499
>  zswpin                               483               477
>  pswpout                                0                 0
>  pswpin                                 0                 0
>  thp_swpout                             0                 0
>  thp_swpout_fallback                    0                 0
>  16kB-swpout_fallback                   0                 0
>  32kB_swpout_fallback                   0                 0
>  64kB_swpout_fallback               1,187             1,251
>  pgmajfault                         3,180             3,187
>  swap_ra                              175               155
>  swap_ra_hit                          114                76
>  ZSWPOUT-16kB                           5                 3
>  ZSWPOUT-32kB                           1                 2
>  ZSWPOUT-64kB                   4,030,709         4,030,573
>  SWPOUT-16kB                            0                 0
>  SWPOUT-32kB                            0                 0
>  SWPOUT-64kB                            0                 0
>  -------------------------------------------------------------------------------
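>
>  In the "IAA Batching vs. Seq" column, v5 deflate-iaa (batching) is
>  compared against mm-unstable deflate-iaa (sequential compression); in the
>  "vs. zstd" column it is compared against v5 zstd. For instance, for total
>  throughput: 8,027,314 / 7,679,064 ~= 1.05 (5%) and
>  8,027,314 / 6,180,657 ~= 1.30 (30%).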
>
>
>  2M folios: usemem30: zstd:
>  ==========================
>
>  -------------------------------------------------------------------------------
>                mm-unstable-12-20-2024   v5 of this patch-series
>  -------------------------------------------------------------------------------
>  zswap compressor               zstd             zstd
>  vm.page-cluster                   2                2
>
>  -------------------------------------------------------------------------------
>  Total throughput (KB/s)   6,643,427        6,534,525
>  Avg throughput (KB/s)       221,447          217,817
>  elapsed time (sec)           102.92           104.44
>  sys time (sec)             2,332.67         2,415.00
>

Seems like there's a regression for zstd here?

>  -------------------------------------------------------------------------------
>  memcg_high                   61,999           60,770
>  memcg_swap_fail                  37               47
>  zswpout                  48,934,491       48,934,952
>  zswpin                          386              404
>  pswpout                           0                0
>  pswpin                            0                0
>  thp_swpout                        0                0
>  thp_swpout_fallback              37               47
>  pgmajfault                    5,010            4,646
>  swap_ra                       5,836            4,692
>  swap_ra_hit                   5,790            4,640
>  ZSWPOUT-2048kB               95,529           95,520
>  SWPOUT-2048kB                     0                0
>  -------------------------------------------------------------------------------
>
>
>  2M folios: usemem30: deflate-iaa:
>  =================================
>
>  -------------------------------------------------------------------------------
>                  mm-unstable-12-20-2024        v5 of this patch-series
>  -------------------------------------------------------------------------------
>  zswap compressor           deflate-iaa      deflate-iaa     IAA Batching
>  vm.page-cluster                      2                2      vs.     vs.
>                                                               Seq    zstd
>  -------------------------------------------------------------------------------
>  Total throughput (KB/s)      8,197,457        8,427,981       3%     29%
>  Avg throughput (KB/s)          273,248          280,932       3%     29%
>  elapsed time (sec)               86.79            83.45      -4%    -20%
>  sys time (sec)                2,044.02         1,925.84      -6%    -20%
>
>  -------------------------------------------------------------------------------
>  memcg_high                      94,008           88,809
>  memcg_swap_fail                     50               57
>  zswpout                     64,521,910       64,520,405
>  zswpin                             421              452
>  pswpout                              0                0
>  pswpin                               0                0
>  thp_swpout                           0                0
>  thp_swpout_fallback                 50               57
>  pgmajfault                       9,658            8,958
>  swap_ra                         19,633           17,341
>  swap_ra_hit                     19,579           17,278
>  ZSWPOUT-2048kB                 125,916          125,913
>  SWPOUT-2048kB                        0                0
>  -------------------------------------------------------------------------------
>
>
> Performance testing (Kernel compilation, allmodconfig):
> =======================================================
>
> The kernel compilation experiments use "allmodconfig" with 32 threads, run
> in tmpfs; a build takes ~12 minutes and has considerable swapout activity.
> The cgroup's memory.max is set to 2G.
>
>
>  16k/32k/64k folios: Kernel compilation/allmodconfig:
>  ====================================================
>  w/o: mm-unstable-12-20-2024
>
>  -------------------------------------------------------------------------------
>                                w/o            v5            w/o             v5
>  -------------------------------------------------------------------------------
>  zswap compressor             zstd          zstd    deflate-iaa    deflate-iaa
>  vm.page-cluster                 0             0              0              0
>
>  -------------------------------------------------------------------------------
>  real_sec                   792.04        793.92         783.43         766.93
>  user_sec                15,781.73     15,772.48      15,753.22      15,766.53
>  sys_sec                  5,302.83      5,308.05       3,982.30       3,853.21
>  -------------------------------------------------------------------------------
>  Max_Res_Set_Size_KB     1,871,908     1,873,368      1,871,836      1,873,168
>  -------------------------------------------------------------------------------
>  memcg_high                      0             0              0              0
>  memcg_swap_fail                 0             0              0              0
>  zswpout                90,775,917    91,653,816    106,964,482    110,380,500
>  zswpin                 26,099,486    26,611,908     31,598,420     32,618,221
>  pswpout                        48            96            331            331
>  pswpin                         48            89            320            310
>  thp_swpout                      0             0              0              0
>  thp_swpout_fallback             0             0              0              0
>  16kB_swpout_fallback            0             0              0              0
>  32kB_swpout_fallback            0             0              0              0
>  64kB_swpout_fallback            0         2,337          7,943          5,512
>  pgmajfault             27,858,798    28,438,518     33,970,455     34,999,918
>  swap_ra                         0             0              0              0
>  swap_ra_hit                 2,173         2,913          2,192          5,248
>  ZSWPOUT-16kB            1,292,865     1,306,214      1,463,397      1,483,056
>  ZSWPOUT-32kB              695,446       705,451        830,676        829,992
>  ZSWPOUT-64kB            2,938,716     2,958,250      3,520,199      3,634,972
>  SWPOUT-16kB                     0             0              0              0
>  SWPOUT-32kB                     0             0              0              0
>  SWPOUT-64kB                     3             6             20             19
>  -------------------------------------------------------------------------------
>
>
>
> Summary:
> ========
> The performance data from the usemem 30-process and kernel compilation
> tests show a 30% throughput gain and a 22% sys time reduction (usemem30),
> and a 27% sys time reduction (kernel compilation), for zswap_store() of
> large folios using IAA compress batching as compared to zstd.

These improvements seem to mostly come from the change from zstd to
deflate-iaa, not from batching (which is the main purpose of the patch
series). The gains from the batching seem to be rather low and
definitely below what I expected. We should be compressing 8 pages in
parallel instead of in series, so why are the gains from the batching
marginal?

I think the comparison with zstd is rather confusing. Testing with
zstd before and after the series to check for regressions is good, but
comparing zstd vs IAA with batching is rather confusing/misleading.
The impact of this series should be measured by IAA with and without
batching.

>
> The iaa_crypto wq stats will show almost the same number of compress calls
> for wq.1 of all IAA devices. wq.0 will handle decompress calls exclusively.
> We see a latency reduction of 2.5% by distributing compress jobs among all
> IAA devices on the socket (based on v1 data).
>
> We can expect even more significant performance and throughput
> improvements if we use the parallelism offered by IAA to do reclaim
> batching of 4K/large folios (really, any-order folios), and use the
> zswap_store() high-throughput compression to batch-compress the pages
> comprising these folios, rather than batching only within large folios.
> This is the reclaim batching patch 13 in v1, which will be submitted in a
> separate patch-series.
>
> Our internal validation of IAA compress/decompress batching in highly
> contended Sapphire Rapids server setups with workloads running on 72 cores
> for ~25 minutes under stringent memory limit constraints has shown up to
> 50% reduction in sys time and 3.5% reduction in workload run time as
> compared to software compressors.
>
>
> Changes since v4:
> =================
> 1) Rebased to mm-unstable as of 12-20-2024, commit 5555a83c82d6.
> 2) Added acomp request chaining, as suggested by Herbert. Thanks Herbert!
> 3) Implemented IAA compress batching using request chaining.
> 4) zswap_store() batching simplifications suggested by Chengming, Yosry and
>    Nhat, thanks to all!
>    - New zswap_compress_folio() that is called by zswap_store().
>    - Move the loop over folio's pages out of zswap_store() and into a
>      zswap_store_folio() that stores all pages.
>    - Allocate all zswap entries for the folio upfront.
>    - Added zswap_batch_compress().
>    - Branch to call zswap_compress() or zswap_batch_compress() inside
>      zswap_compress_folio().
>    - All iterations over pages kept in same function level.
>    - No helpers other than the newly added zswap_store_folio() and
>      zswap_compress_folio().
>
>
> Changes since v3:
> =================
> 1) Rebased to mm-unstable as of 11-18-2024, commit 5a7056135bb6.
> 2) Major re-write of iaa_crypto driver's mapping of IAA devices to cores,
>    based on packages instead of NUMA nodes.
> 3) Added acomp_has_async_batching() API to crypto acomp, that allows
>    zswap/zram to query if a crypto_acomp has registered batch_compress and
>    batch_decompress interfaces.
> 4) Clear the poll bits on the acomp_reqs passed to
>    iaa_comp_a[de]compress_batch() so that a module like zswap can be
>    confident about the acomp_reqs[0] not having the poll bit set before
>    calling the fully synchronous API crypto_acomp_[de]compress().
>    Herbert, I would appreciate it if you could review changes 2-4 in patches
>    1-8 of v4. I did not want to introduce too many iaa_crypto changes in
>    v4, given that patch 7 is already making a major change. I plan to work
>    on incorporating the request chaining using the ahash interface in v5
>    (I need to understand the basic crypto ahash better). Thanks Herbert!
> 5) Incorporated Johannes' suggestion to not have a sysctl to enable
>    compress batching.
> 6) Incorporated Yosry's suggestion to allocate batching resources in the
>    cpu hotplug onlining code, since there is no longer a sysctl to control
>    batching. Thanks Yosry!
> 7) Incorporated Johannes' suggestions related to making the overall
>    sequence of events between zswap_store() and zswap_batch_store() similar
>    as much as possible for readability and control flow, better naming of
>    procedures, avoiding forward declarations, not inlining error path
>    procedures, deleting zswap internal details from zswap.h, etc. Thanks
>    Johannes, really appreciate the direction!
>    I have tried to explain the minimal future-proofing in terms of the
>    zswap_batch_store() signature and the definition of "struct
>    zswap_batch_store_sub_batch" in the comments for this struct. I hope the
>    new code explains the control flow a bit better.
>
>
> Changes since v2:
> =================
> 1) Rebased to mm-unstable as of 11-5-2024, commit 7994b7ea6ac8.
> 2) Fixed an issue in zswap_create_acomp_ctx() with checking for NULL
>    returned by kmalloc_node() for acomp_ctx->buffers and for
>    acomp_ctx->reqs.
> 3) Fixed a bug in zswap_pool_can_batch() for returning true if
>    pool->can_batch_comp is found to be equal to BATCH_COMP_ENABLED, and if
>    the per-cpu acomp_batch_ctx tests true for batching resources having
>    been allocated on this cpu. Also, changed from per_cpu_ptr() to
>    raw_cpu_ptr().
> 4) Incorporated the zswap_store_propagate_errors() compilation warning fix
>    suggested by Dan Carpenter. Thanks Dan!
> 5) Replaced the references to SWAP_CRYPTO_SUB_BATCH_SIZE in comments in
>    zswap.h, with SWAP_CRYPTO_BATCH_SIZE.
>
> Changes since v1:
> =================
> 1) Rebased to mm-unstable as of 11-1-2024, commit 5c4cf96cd702.
> 2) Incorporated Herbert's suggestions to use an acomp_req flag to indicate
>    async/poll mode, and to encapsulate the polling functionality in the
>    iaa_crypto driver. Thanks Herbert!
> 3) Incorporated Herbert's and Yosry's suggestions to implement the batching
>    API in iaa_crypto and to make its use seamless from zswap's
>    perspective. Thanks Herbert and Yosry!
> 4) Incorporated Yosry's suggestion to make it more convenient for the user
>    to enable compress batching, while minimizing the memory footprint
>    cost. Thanks Yosry!
> 5) Incorporated Yosry's suggestion to de-couple the shrink_folio_list()
>    reclaim batching patch from this series, since it requires a broader
>    discussion.
>
>
> I would greatly appreciate code review comments for the iaa_crypto driver
> and mm patches included in this series!
>
> Thanks,
> Kanchana
>
>
>
> Kanchana P Sridhar (12):
>   crypto: acomp - Add synchronous/asynchronous acomp request chaining.
>   crypto: acomp - Define new interfaces for compress/decompress
>     batching.
>   crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable
>     async mode.
>   crypto: iaa - Implement batch_compress(), batch_decompress() API in
>     iaa_crypto.
>   crypto: iaa - Make async mode the default.
>   crypto: iaa - Disable iaa_verify_compress by default.
>   crypto: iaa - Re-organize the iaa_crypto driver code.
>   crypto: iaa - Map IAA devices/wqs to cores based on packages instead
>     of NUMA.
>   crypto: iaa - Distribute compress jobs from all cores to all IAAs on a
>     package.
>   mm: zswap: Allocate pool batching resources if the crypto_alg supports
>     batching.
>   mm: zswap: Restructure & simplify zswap_store() to make it amenable
>     for batching.
>   mm: zswap: Compress batching with Intel IAA in zswap_store() of large
>     folios.
>
>  crypto/acompress.c                         |  287 ++++
>  drivers/crypto/intel/iaa/iaa_crypto.h      |   27 +-
>  drivers/crypto/intel/iaa/iaa_crypto_main.c | 1697 +++++++++++++++-----
>  include/crypto/acompress.h                 |  157 ++
>  include/crypto/algapi.h                    |   10 +
>  include/crypto/internal/acompress.h        |   29 +
>  include/linux/crypto.h                     |   31 +
>  mm/zswap.c                                 |  406 +++--
>  8 files changed, 2103 insertions(+), 541 deletions(-)
>
>
> base-commit: 5555a83c82d66729e4abaf16ae28d6bd81f9a64a
> --
> 2.27.0
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-01-07  1:36         ` Sridhar, Kanchana P
@ 2025-01-07  1:46           ` Yosry Ahmed
  2025-01-07  2:06             ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Yosry Ahmed @ 2025-01-07  1:46 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: Herbert Xu, linux-kernel, linux-mm, hannes, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, davem, clabbe, ardb, ebiggers, surenb, Accardi,
	Kristen C, Feghali, Wajdi K, Gopal, Vinodh

On Mon, Jan 6, 2025 at 5:38 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi Yosry,
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@google.com>
> > Sent: Monday, January 6, 2025 3:24 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: Herbert Xu <herbert@gondor.apana.org.au>; linux-
> > kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> > nphamcs@gmail.com; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> > Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for
> > compress/decompress batching.
> >
> > On Mon, Jan 6, 2025 at 9:37 AM Sridhar, Kanchana P
> > <kanchana.p.sridhar@intel.com> wrote:
> > >
> > > Hi Herbert,
> > >
> > > > -----Original Message-----
> > > > From: Herbert Xu <herbert@gondor.apana.org.au>
> > > > Sent: Saturday, December 28, 2024 3:46 AM
> > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > > hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> > > > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > > > ryan.roberts@arm.com; 21cnbao@gmail.com; akpm@linux-
> > foundation.org;
> > > > linux-crypto@vger.kernel.org; davem@davemloft.net;
> > clabbe@baylibre.com;
> > > > ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> > > > Kristen C <kristen.c.accardi@intel.com>; Feghali, Wajdi K
> > > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > > > Subject: Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for
> > > > compress/decompress batching.
> > > >
> > > > On Fri, Dec 20, 2024 at 10:31:09PM -0800, Kanchana P Sridhar wrote:
> > > > > This commit adds get_batch_size(), batch_compress() and
> > > > batch_decompress()
> > > > > interfaces to:
> > > >
> > > > First of all we don't need a batch compress/decompress interface
> > > > because the whole point of request chaining is to supply the data
> > > > in batches.
> > > >
> > > > I'm also against having a get_batch_size because the user should
> > > > be supplying as much data as they're comfortable with.  In other
> > > > words if the user is happy to give us 8 requests for iaa then it
> > > > should be happy to give us 8 requests for every implementation.
> > > >
> > > > The request chaining interface should be such that processing
> > > > 8 requests is always better than doing 1 request at a time as
> > > > the cost is amortised.
> > >
> > > Thanks for your comments. Can you please elaborate on how
> > > request chaining would enable cost amortization for software
> > > compressors? With the current implementation, a module like
> > > zswap would need to do the following to invoke request chaining
> > > for software compressors (in addition to pushing the chaining
> > > to the user layer for IAA, as per your suggestion on not needing a
> > > batch compress/decompress interface):
> > >
> > > zswap_batch_compress():
> > >    for (i = 0; i < nr_pages_in_batch; ++i) {
> > >       /* set up the acomp_req "reqs[i]". */
> > >       [ ... ]
> > >       if (i)
> > >         acomp_request_chain(reqs[i], reqs[0]);
> > >       else
> > >         acomp_reqchain_init(reqs[0], 0, crypto_req_done, crypto_wait);
> > >    }
> > >
> > >    /* Process the request chain in series. */
> > >    err = crypto_wait_req(acomp_do_req_chain(reqs[0],
> > crypto_acomp_compress), crypto_wait);
> > >
> > > Internally, acomp_do_req_chain() would sequentially process the
> > > request chain by:
> > > 1) adding all requests to a list "state"
> > > 2) call "crypto_acomp_compress()" for the next list element
> > > 3) when this request completes, dequeue it from the list "state"
> > > 4) repeat for all requests in "state"
> > > 5) When the last request in "state" completes, call "reqs[0]-
> > >base.complete()",
> > >     which notifies crypto_wait.
> > >
> > > From what I can understand, the latency cost should be the same for
> > > processing a request chain in series vs. processing each request as it is
> > > done today in zswap, by calling:
> > >
> > >   comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx-
> > >reqs[0]), &acomp_ctx->wait);
> > >
> > > It is not clear to me if there is a cost amortization benefit for software
> > > compressors. One of the requirements from Yosry was that there should
> > > be no change for the software compressors, which is what I have
> > > attempted to do in v5.
> > >
> > > Can you please help us understand if there is a room for optimizing
> > > the implementation of the synchronous "acomp_do_req_chain()" API?
> > > I would also like to get inputs from the zswap maintainers on using
> > > request chaining for a batching implementation for software compressors.
> >
> > Is there a functional change in doing so, or just using different
> > interfaces to accomplish the same thing we do today?
>
> The code paths for software compressors are considerably different between
> these two scenarios:
>
> 1) Given a batch of 8 pages: for each page, call zswap_compress() that does this:
>
>         comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
>
> 2) Given a batch of 8 pages:
>      a) Create a request chain of 8 acomp_reqs, starting with reqs[0], as
>          described earlier.
>      b) Process the request chain by calling:
>
>               err = crypto_wait_req(acomp_do_req_chain(reqs[0], crypto_acomp_compress), &acomp_ctx->wait);
>         /* Get each req's error status. */
>         for (i = 0; i < nr_pages; ++i) {
>                 errors[i] = acomp_request_err(reqs[i]);
>                 if (errors[i]) {
>                         pr_debug("Request chaining req %d compress error %d\n", i, errors[i]);
>                 } else {
>                         dlens[i] = reqs[i]->dlen;
>                 }
>         }
>
> What I mean by considerably different code paths is that request chaining
> internally overwrites the req's base.complete and base.data (after saving the
> original values) to implement the algorithm described earlier. Basically, the
> chain is processed in series by getting the next req in the chain, setting its
> completion function to "acomp_reqchain_done()", which gets called when
> the "op" (crypto_acomp_compress()) is completed for that req.
> acomp_reqchain_done() will cause the next req to be processed in the
> same manner. If this next req happens to be the last req to be processed,
> it will notify the original completion function of reqs[0], with the crypto_wait
> that zswap sets up in zswap_cpu_comp_prepare():
>
>         acomp_request_set_callback(acomp_ctx->reqs[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
>                                    crypto_req_done, &acomp_ctx->wait);
>
> Patch [1] in v5 of this series has the full implementation of acomp_do_req_chain()
> in case you want to understand this in more detail.
>
> The "functional change" wrt request chaining is limited to the above.

For software compressors, the batch size should be 1. In that
scenario, from a zswap perspective (without going into the acomp
implementation details please), is there a functional difference? If
not, we can just use the request chaining API regardless of batching
if that is what Herbert means.

>
> [1]: https://patchwork.kernel.org/project/linux-mm/patch/20241221063119.29140-2-kanchana.p.sridhar@intel.com/
>
> Thanks,
> Kanchana
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-01-06 17:37     ` Sridhar, Kanchana P
  2025-01-06 23:24       ` Yosry Ahmed
@ 2025-01-07  2:04       ` Herbert Xu
  1 sibling, 0 replies; 55+ messages in thread
From: Herbert Xu @ 2025-01-07  2:04 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, davem, clabbe, ardb, ebiggers, surenb, Accardi,
	Kristen C, Feghali, Wajdi K, Gopal, Vinodh

On Mon, Jan 06, 2025 at 05:37:07PM +0000, Sridhar, Kanchana P wrote:
>
> Internally, acomp_do_req_chain() would sequentially process the
> request chain by:

acomp_do_req_chain is just interim scaffolding.  It will disappear
once we convert the underlying algorithms to acomp and support
chaining natively.  For example, the ahash version looked like this:

https://lore.kernel.org/all/6fc95eb867115e898fb6cca4a9470d147a5587bd.1730021644.git.herbert@gondor.apana.org.au/

Its final form, the user will supply a chained request that goes
directly to the algorithm which can then process it in one go.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-01-07  1:46           ` Yosry Ahmed
@ 2025-01-07  2:06             ` Herbert Xu
  2025-01-07  3:10               ` Yosry Ahmed
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2025-01-07  2:06 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Sridhar, Kanchana P, linux-kernel, linux-mm, hannes, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, davem, clabbe, ardb, ebiggers, surenb, Accardi,
	Kristen C, Feghali, Wajdi K, Gopal, Vinodh

On Mon, Jan 06, 2025 at 05:46:01PM -0800, Yosry Ahmed wrote:
>
> For software compressors, the batch size should be 1. In that
> scenario, from a zswap perspective (without going into the acomp
> implementation details please), is there a functional difference? If
> not, we can just use the request chaining API regardless of batching
> if that is what Herbert means.

If you can supply a batch size of 8 for iaa, there is no reason
why you can't do it for software algorithms.  It's the same
reason that we have GSO in the TCP stack, regardless of whether
the hardware can handle TSO.

The amortisation of the segmentation cost means that it will be
a win over-all.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-01-07  2:06             ` Herbert Xu
@ 2025-01-07  3:10               ` Yosry Ahmed
  2025-01-08  1:38                 ` Herbert Xu
  2025-02-16  5:17                 ` Herbert Xu
  0 siblings, 2 replies; 55+ messages in thread
From: Yosry Ahmed @ 2025-01-07  3:10 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Sridhar, Kanchana P, linux-kernel, linux-mm, hannes, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, davem, clabbe, ardb, ebiggers, surenb, Accardi,
	Kristen C, Feghali, Wajdi K, Gopal, Vinodh

On Mon, Jan 6, 2025 at 6:06 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Mon, Jan 06, 2025 at 05:46:01PM -0800, Yosry Ahmed wrote:
> >
> > For software compressors, the batch size should be 1. In that
> > scenario, from a zswap perspective (without going into the acomp
> > implementation details please), is there a functional difference? If
> > not, we can just use the request chaining API regardless of batching
> > if that is what Herbert means.
>
> If you can supply a batch size of 8 for iaa, there is no reason
> why you can't do it for software algorithms.  It's the same
> reason that we have GSO in the TCP stack, regardless of whether
> the hardware can handle TSO.

The main problem is memory usage. Zswap needs a PAGE_SIZE*2-sized
buffer for each request on each CPU. We preallocate these buffers to
avoid trying to allocate this much memory in the reclaim path (i.e.
potentially allocating two pages to reclaim one).

With batching, we need to preallocate N PAGE_SIZE*2-sized buffers on
each CPU instead. For N=8, we are allocating PAGE_SIZE*14 extra memory
on each CPU (56 KB on x86). That cost may be acceptable with IAA
hardware accelerated batching, but not for software compressors that
will end up processing the batch serially anyway.

Does this make sense to you or did I miss something?

>
> The amortisation of the segmentation cost means that it will be
> a win over-all.
>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-01-07  3:10               ` Yosry Ahmed
@ 2025-01-08  1:38                 ` Herbert Xu
  2025-01-08  1:43                   ` Yosry Ahmed
  2025-02-16  5:17                 ` Herbert Xu
  1 sibling, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2025-01-08  1:38 UTC (permalink / raw)
  To: Yosry Ahmed, Eric Biggers
  Cc: Sridhar, Kanchana P, linux-kernel, linux-mm, hannes, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, davem, clabbe, ardb, ebiggers, surenb, Accardi,
	Kristen C, Feghali, Wajdi K, Gopal, Vinodh

On Mon, Jan 06, 2025 at 07:10:53PM -0800, Yosry Ahmed wrote:
>
> The main problem is memory usage. Zswap needs a PAGE_SIZE*2-sized
> buffer for each request on each CPU. We preallocate these buffers to
> avoid trying to allocate this much memory in the reclaim path (i.e.
> potentially allocating two pages to reclaim one).

What if we allowed each acomp request to take a whole folio?
That would mean you'd only need to allocate one request per
folio, regardless of how big it is.

Eric, we could do something similar with ahash.  Allow the
user to supply a folio (or scatterlist entry) instead of a
single page, and then cut it up based on a unit-size supplied
by the user (e.g., 512 bytes for sector-based users).  That
would mean just a single request object as long as your input
is a folio or something similar.

Is this something that you could use in fs/verity? You'd still
need to allocate enough memory to store the output hashes.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-01-08  1:38                 ` Herbert Xu
@ 2025-01-08  1:43                   ` Yosry Ahmed
  0 siblings, 0 replies; 55+ messages in thread
From: Yosry Ahmed @ 2025-01-08  1:43 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Eric Biggers, Sridhar, Kanchana P, linux-kernel, linux-mm,
	hannes, nphamcs, chengming.zhou, usamaarif642, ryan.roberts,
	21cnbao, akpm, linux-crypto, davem, clabbe, ardb, ebiggers,
	surenb, Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh

On Tue, Jan 7, 2025 at 5:39 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Mon, Jan 06, 2025 at 07:10:53PM -0800, Yosry Ahmed wrote:
> >
> > The main problem is memory usage. Zswap needs a PAGE_SIZE*2-sized
> > buffer for each request on each CPU. We preallocate these buffers to
> > avoid trying to allocate this much memory in the reclaim path (i.e.
> > potentially allocating two pages to reclaim one).
>
> What if we allowed each acomp request to take a whole folio?
> That would mean you'd only need to allocate one request per
> folio, regardless of how big it is.

Hmm this means we need to allocate a single request instead of N
requests, but the source of overhead is the output buffers not the
requests. We need PAGE_SIZE*2 for each page in the folio in the output
buffer on each CPU. Preallocating this unnecessarily adds up to a lot
of memory.

Did I miss something?

>
> Eric, we could do something similar with ahash.  Allow the
> user to supply a folio (or scatterlist entry) instead of a
> single page, and then cut it up based on a unit-size supplied
> by the user (e.g., 512 bytes for sector-based users).  That
> would mean just a single request object as long as your input
> is a folio or something similar.
>
> Is this something that you could use in fs/verity? You'd still
> need to allocate enough memory to store the output hashes.
>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH v5 10/12] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching.
  2025-01-07  0:58   ` Yosry Ahmed
@ 2025-01-08  3:26     ` Sridhar, Kanchana P
  2025-01-08  4:16       ` Yosry Ahmed
  0 siblings, 1 reply; 55+ messages in thread
From: Sridhar, Kanchana P @ 2025-01-08  3:26 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, akpm, linux-crypto, herbert,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Monday, January 6, 2025 4:59 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v5 10/12] mm: zswap: Allocate pool batching resources if
> the crypto_alg supports batching.
> 
> On Fri, Dec 20, 2024 at 10:31 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > This patch does the following:
> >
> > 1) Defines ZSWAP_MAX_BATCH_SIZE to denote the maximum number of
> acomp_ctx
> >    batching resources (acomp_reqs and buffers) to allocate if the zswap
> >    compressor supports batching. Currently, ZSWAP_MAX_BATCH_SIZE is set
> to
> >    8U.
> >
> > 2) Modifies the definition of "struct crypto_acomp_ctx" to represent a
> >    configurable number of acomp_reqs and buffers. Adds a "nr_reqs" to
> >    "struct crypto_acomp_ctx" to contain the number of resources that will
> >    be allocated in the cpu hotplug onlining code.
> >
> > 3) The zswap_cpu_comp_prepare() cpu onlining code will detect if the
> >    crypto_acomp created for the zswap pool (in other words, the zswap
> >    compression algorithm) has registered implementations for
> >    batch_compress() and batch_decompress().
> 
> This is an implementation detail that is not visible to the zswap
> code. Please do not refer to batch_compress() and batch_decompress()
> here, just mention that we check if the compressor supports batching.

Thanks for the suggestions. Sure, I will modify the commit log accordingly.

> 
> > If so, it will query the
> >    crypto_acomp for the maximum batch size supported by the compressor,
> and
> >    set "nr_reqs" to the minimum of this compressor-specific max batch size
> >    and ZSWAP_MAX_BATCH_SIZE. Finally, it will allocate "nr_reqs"
> >    reqs/buffers, and set the acomp_ctx->nr_reqs accordingly.
> >
> > 4) If the crypto_acomp does not support batching, "nr_reqs" defaults to 1.
> 
> General note, some implementation details are obvious from the code
> and do not need to be explained in the commit log. It's mostly useful
> to explain what you are doing from a high level, and why you are doing
> it.
> 
> In this case, we should mainly describe that we are adding support for
> the per-CPU acomp_ctx to track multiple compression/decompression
> requests but are not actually using more than one request yet. Mention
> that followup changes will actually utilize this to batch
> compression/decompression of multiple pages, and highlight important
> implementation details (such as ZSWAP_MAX_BATCH_SIZE limiting the
> amount of extra memory we are using for this, and that there is no
> extra memory usage for compressors that do not use batching).

Sure, will do so.

> 
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/zswap.c | 122 +++++++++++++++++++++++++++++++++++++++--------
> ------
> >  1 file changed, 90 insertions(+), 32 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 9718c33f8192..99cd78891fd0 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -78,6 +78,13 @@ static bool zswap_pool_reached_full;
> >
> >  #define ZSWAP_PARAM_UNSET ""
> >
> > +/*
> > + * For compression batching of large folios:
> > + * Maximum number of acomp compress requests that will be processed
> > + * in a batch, iff the zswap compressor supports batching.
> > + */
> 
> Please mention that this limit exists because we preallocate enough
> requests and buffers accordingly, so a higher limit means higher
> memory usage.

Ok.

> 
> > +#define ZSWAP_MAX_BATCH_SIZE 8U
> > +
> >  static int zswap_setup(void);
> >
> >  /* Enable/disable zswap */
> > @@ -143,9 +150,10 @@ bool zswap_never_enabled(void)
> >
> >  struct crypto_acomp_ctx {
> >         struct crypto_acomp *acomp;
> > -       struct acomp_req *req;
> > +       struct acomp_req **reqs;
> > +       u8 **buffers;
> > +       unsigned int nr_reqs;
> >         struct crypto_wait wait;
> > -       u8 *buffer;
> >         struct mutex mutex;
> >         bool is_sleepable;
> >  };
> > @@ -818,49 +826,88 @@ static int zswap_cpu_comp_prepare(unsigned int
> cpu, struct hlist_node *node)
> >         struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> >         struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx,
> cpu);
> >         struct crypto_acomp *acomp;
> > -       struct acomp_req *req;
> > -       int ret;
> > +       unsigned int nr_reqs = 1;
> > +       int ret = -ENOMEM;
> > +       int i, j;
> >
> >         mutex_init(&acomp_ctx->mutex);
> > -
> > -       acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
> cpu_to_node(cpu));
> > -       if (!acomp_ctx->buffer)
> > -               return -ENOMEM;
> > +       acomp_ctx->nr_reqs = 0;
> >
> >         acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0,
> cpu_to_node(cpu));
> >         if (IS_ERR(acomp)) {
> >                 pr_err("could not alloc crypto acomp %s : %ld\n",
> >                                 pool->tfm_name, PTR_ERR(acomp));
> > -               ret = PTR_ERR(acomp);
> > -               goto acomp_fail;
> > +               return PTR_ERR(acomp);
> >         }
> >         acomp_ctx->acomp = acomp;
> >         acomp_ctx->is_sleepable = acomp_is_async(acomp);
> >
> > -       req = acomp_request_alloc(acomp_ctx->acomp);
> > -       if (!req) {
> > -               pr_err("could not alloc crypto acomp_request %s\n",
> > -                      pool->tfm_name);
> > -               ret = -ENOMEM;
> > +       /*
> > +        * Create the necessary batching resources if the crypto acomp alg
> > +        * implements the batch_compress and batch_decompress API.
> 
> No mention of the internal implementation of acomp_has_async_batching()
> please.

Ok.

> 
> > +        */
> > +       if (acomp_has_async_batching(acomp)) {
> > +               nr_reqs = min(ZSWAP_MAX_BATCH_SIZE,
> crypto_acomp_batch_size(acomp));
> > +               pr_info_once("Creating acomp_ctx with %d reqs/buffers for
> batching since crypto acomp\n%s has registered batch_compress() and
> batch_decompress().\n",
> > +                       nr_reqs, pool->tfm_name);
> 
> This will only be printed once, so if the compressor changes the
> information will no longer be up-to-date on all CPUs. I think we
> should just drop it.

Yes, makes sense.

> 
> > +       }
> > +
> > +       acomp_ctx->buffers = kmalloc_node(nr_reqs * sizeof(u8 *),
> GFP_KERNEL, cpu_to_node(cpu));
> 
> Can we use kcalloc_node() here?

I was wondering if the performance penalty of the kcalloc_node() is acceptable
because the cpu onlining happens infrequently? If so, it appears zero-initializing
the allocated memory will help in the cleanup code suggestion in your subsequent
comment.

> 
> > +       if (!acomp_ctx->buffers)
> > +               goto buf_fail;
> > +
> > +       for (i = 0; i < nr_reqs; ++i) {
> > +               acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2,
> GFP_KERNEL, cpu_to_node(cpu));
> > +               if (!acomp_ctx->buffers[i]) {
> > +                       for (j = 0; j < i; ++j)
> > +                               kfree(acomp_ctx->buffers[j]);
> > +                       kfree(acomp_ctx->buffers);
> > +                       ret = -ENOMEM;
> > +                       goto buf_fail;
> > +               }
> > +       }
> > +
> > +       acomp_ctx->reqs = kmalloc_node(nr_reqs * sizeof(struct acomp_req
> *), GFP_KERNEL, cpu_to_node(cpu));
> 
> Ditto.

Sure.

> 
> > +       if (!acomp_ctx->reqs)
> >                 goto req_fail;
> > +
> > +       for (i = 0; i < nr_reqs; ++i) {
> > +               acomp_ctx->reqs[i] = acomp_request_alloc(acomp_ctx->acomp);
> > +               if (!acomp_ctx->reqs[i]) {
> > +                       pr_err("could not alloc crypto acomp_request reqs[%d]
> %s\n",
> > +                              i, pool->tfm_name);
> > +                       for (j = 0; j < i; ++j)
> > +                               acomp_request_free(acomp_ctx->reqs[j]);
> > +                       kfree(acomp_ctx->reqs);
> > +                       ret = -ENOMEM;
> > +                       goto req_fail;
> > +               }
> >         }
> > -       acomp_ctx->req = req;
> >
> > +       /*
> > +        * The crypto_wait is used only in fully synchronous, i.e., with scomp
> > +        * or non-poll mode of acomp, hence there is only one "wait" per
> > +        * acomp_ctx, with callback set to reqs[0], under the assumption that
> > +        * there is at least 1 request per acomp_ctx.
> > +        */
> >         crypto_init_wait(&acomp_ctx->wait);
> >         /*
> >          * if the backend of acomp is async zip, crypto_req_done() will wakeup
> >          * crypto_wait_req(); if the backend of acomp is scomp, the callback
> >          * won't be called, crypto_wait_req() will return without blocking.
> >          */
> > -       acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
> > +       acomp_request_set_callback(acomp_ctx->reqs[0],
> CRYPTO_TFM_REQ_MAY_BACKLOG,
> >                                    crypto_req_done, &acomp_ctx->wait);
> >
> > +       acomp_ctx->nr_reqs = nr_reqs;
> >         return 0;
> >
> >  req_fail:
> > +       for (i = 0; i < nr_reqs; ++i)
> > +               kfree(acomp_ctx->buffers[i]);
> > +       kfree(acomp_ctx->buffers);
> 
> The cleanup code is all over the place. Sometimes it's done in the
> loops allocating the memory and sometimes here. It's a bit hard to
> follow. Please have all the cleanups here. You can just initialize the
> arrays to 0s, and then if the array is not-NULL you can free any
> non-NULL elements (kfree() will handle NULLs gracefully).

Sure, if performance of kzalloc_node() is an acceptable trade-off for the
cleanup code simplification.

> 
> There may be even potential for code reuse with zswap_cpu_comp_dead().

I assume the reuse will be through copy-and-pasting the same lines of code,
as opposed to a common procedure being called by zswap_cpu_comp_prepare()
and zswap_cpu_comp_dead()?

> 
> > +buf_fail:
> >         crypto_free_acomp(acomp_ctx->acomp);
> > -acomp_fail:
> > -       kfree(acomp_ctx->buffer);
> >         return ret;
> >  }
> >
> > @@ -870,11 +917,22 @@ static int zswap_cpu_comp_dead(unsigned int
> cpu, struct hlist_node *node)
> >         struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx,
> cpu);
> >
> >         if (!IS_ERR_OR_NULL(acomp_ctx)) {
> > -               if (!IS_ERR_OR_NULL(acomp_ctx->req))
> > -                       acomp_request_free(acomp_ctx->req);
> > +               int i;
> > +
> > +               for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> > +                       if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
> > +                               acomp_request_free(acomp_ctx->reqs[i]);
> > +               kfree(acomp_ctx->reqs);
> > +
> > +               for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> > +                       kfree(acomp_ctx->buffers[i]);
> > +               kfree(acomp_ctx->buffers);
> > +
> >                 if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> >                         crypto_free_acomp(acomp_ctx->acomp);
> > -               kfree(acomp_ctx->buffer);
> > +
> > +               acomp_ctx->nr_reqs = 0;
> > +               acomp_ctx = NULL;
> >         }
> >
> >         return 0;
> > @@ -897,7 +955,7 @@ static bool zswap_compress(struct page *page,
> struct zswap_entry *entry,
> >
> >         mutex_lock(&acomp_ctx->mutex);
> >
> > -       dst = acomp_ctx->buffer;
> > +       dst = acomp_ctx->buffers[0];
> >         sg_init_table(&input, 1);
> >         sg_set_page(&input, page, PAGE_SIZE, 0);
> >
> > @@ -907,7 +965,7 @@ static bool zswap_compress(struct page *page,
> struct zswap_entry *entry,
> >          * giving the dst buffer with enough length to avoid buffer overflow.
> >          */
> >         sg_init_one(&output, dst, PAGE_SIZE * 2);
> > -       acomp_request_set_params(acomp_ctx->req, &input, &output,
> PAGE_SIZE, dlen);
> > +       acomp_request_set_params(acomp_ctx->reqs[0], &input, &output,
> PAGE_SIZE, dlen);
> >
> >         /*
> >          * it maybe looks a little bit silly that we send an asynchronous request,
> > @@ -921,8 +979,8 @@ static bool zswap_compress(struct page *page,
> struct zswap_entry *entry,
> >          * but in different threads running on different cpu, we have different
> >          * acomp instance, so multiple threads can do (de)compression in
> parallel.
> >          */
> > -       comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx-
> >req), &acomp_ctx->wait);
> > -       dlen = acomp_ctx->req->dlen;
> > +       comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx-
> >reqs[0]), &acomp_ctx->wait);
> > +       dlen = acomp_ctx->reqs[0]->dlen;
> >         if (comp_ret)
> >                 goto unlock;
> >
> > @@ -975,20 +1033,20 @@ static void zswap_decompress(struct
> zswap_entry *entry, struct folio *folio)
> >          */
> >         if ((acomp_ctx->is_sleepable && !zpool_can_sleep_mapped(zpool)) ||
> >             !virt_addr_valid(src)) {
> > -               memcpy(acomp_ctx->buffer, src, entry->length);
> > -               src = acomp_ctx->buffer;
> > +               memcpy(acomp_ctx->buffers[0], src, entry->length);
> > +               src = acomp_ctx->buffers[0];
> >                 zpool_unmap_handle(zpool, entry->handle);
> >         }
> >
> >         sg_init_one(&input, src, entry->length);
> >         sg_init_table(&output, 1);
> >         sg_set_folio(&output, folio, PAGE_SIZE, 0);
> > -       acomp_request_set_params(acomp_ctx->req, &input, &output, entry-
> >length, PAGE_SIZE);
> > -       BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx-
> >req), &acomp_ctx->wait));
> > -       BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> > +       acomp_request_set_params(acomp_ctx->reqs[0], &input, &output,
> entry->length, PAGE_SIZE);
> > +       BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx-
> >reqs[0]), &acomp_ctx->wait));
> > +       BUG_ON(acomp_ctx->reqs[0]->dlen != PAGE_SIZE);
> >         mutex_unlock(&acomp_ctx->mutex);
> >
> > -       if (src != acomp_ctx->buffer)
> > +       if (src != acomp_ctx->buffers[0])
> >                 zpool_unmap_handle(zpool, entry->handle);
> >  }
> >
> > --
> > 2.27.0
> >

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH v5 11/12] mm: zswap: Restructure & simplify zswap_store() to make it amenable for batching.
  2025-01-07  1:16   ` Yosry Ahmed
@ 2025-01-08  3:57     ` Sridhar, Kanchana P
  2025-01-08  4:22       ` Yosry Ahmed
  0 siblings, 1 reply; 55+ messages in thread
From: Sridhar, Kanchana P @ 2025-01-08  3:57 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, akpm, linux-crypto, herbert,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Monday, January 6, 2025 5:17 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v5 11/12] mm: zswap: Restructure & simplify
> zswap_store() to make it amenable for batching.
> 
> On Fri, Dec 20, 2024 at 10:31 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > This patch introduces zswap_store_folio() that implements all the computes
> > done earlier in zswap_store_page() for a single-page, for all the pages in
> > a folio. This allows us to move the loop over the folio's pages from
> > zswap_store() to zswap_store_folio().
> >
> > A distinct zswap_compress_folio() is also added, that simply calls
> > zswap_compress() for each page in the folio it is called with.
> 
> The git diff looks funky; it may make things clearer to introduce
> zswap_compress_folio() in a separate patch.

Ok, will do so.

> 
> >
> > zswap_store_folio() starts by allocating all zswap entries required to
> > store the folio. Next, it calls zswap_compress_folio() and finally, adds
> > the entries to the xarray and LRU.
> >
> > The error handling and cleanup required for all failure scenarios that can
> > occur while storing a folio in zswap is now consolidated to a
> > "store_folio_failed" label in zswap_store_folio().
> >
> > These changes facilitate developing support for compress batching in
> > zswap_store_folio().
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/zswap.c | 183 +++++++++++++++++++++++++++++++++-----------------
> ---
> >  1 file changed, 116 insertions(+), 67 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 99cd78891fd0..1be0f1807bfc 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1467,77 +1467,129 @@ static void shrink_worker(struct work_struct
> *w)
> >  * main API
> >  **********************************/
> >
> > -static ssize_t zswap_store_page(struct page *page,
> > -                               struct obj_cgroup *objcg,
> > -                               struct zswap_pool *pool)
> > +static bool zswap_compress_folio(struct folio *folio,
> > +                                struct zswap_entry *entries[],
> > +                                struct zswap_pool *pool)
> >  {
> > -       swp_entry_t page_swpentry = page_swap_entry(page);
> > -       struct zswap_entry *entry, *old;
> > +       long index, nr_pages = folio_nr_pages(folio);
> >
> > -       /* allocate entry */
> > -       entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
> > -       if (!entry) {
> > -               zswap_reject_kmemcache_fail++;
> > -               return -EINVAL;
> > +       for (index = 0; index < nr_pages; ++index) {
> > +               struct page *page = folio_page(folio, index);
> > +
> > +               if (!zswap_compress(page, entries[index], pool))
> > +                       return false;
> >         }
> >
> > -       if (!zswap_compress(page, entry, pool))
> > -               goto compress_failed;
> > +       return true;
> > +}
> >
> > -       old = xa_store(swap_zswap_tree(page_swpentry),
> > -                      swp_offset(page_swpentry),
> > -                      entry, GFP_KERNEL);
> > -       if (xa_is_err(old)) {
> > -               int err = xa_err(old);
> > +/*
> > + * Store all pages in a folio.
> > + *
> > + * The error handling from all failure points is consolidated to the
> > + * "store_folio_failed" label, based on the initialization of the zswap
> entries'
> > + * handles to ERR_PTR(-EINVAL) at allocation time, and the fact that the
> > + * entry's handle is subsequently modified only upon a successful
> zpool_malloc()
> > + * after the page is compressed.
> > + */
> > +static ssize_t zswap_store_folio(struct folio *folio,
> > +                                struct obj_cgroup *objcg,
> > +                                struct zswap_pool *pool)
> > +{
> > +       long index, nr_pages = folio_nr_pages(folio);
> > +       struct zswap_entry **entries = NULL;
> > +       int node_id = folio_nid(folio);
> > +       size_t compressed_bytes = 0;
> >
> > -               WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n",
> err);
> > -               zswap_reject_alloc_fail++;
> > -               goto store_failed;
> > +       entries = kmalloc(nr_pages * sizeof(*entries), GFP_KERNEL);
> 
> We can probably use kcalloc() here.

I am a little worried about the latency penalty of kcalloc() in the reclaim path,
especially since I am not relying on zero-initialized memory for "entries"..

> 
> > +       if (!entries)
> > +               return -ENOMEM;
> > +
> > +       /* allocate entries */
> 
> This comment can be dropped.

Sure.

> 
> > +       for (index = 0; index < nr_pages; ++index) {
> > +               entries[index] = zswap_entry_cache_alloc(GFP_KERNEL,
> node_id);
> > +
> > +               if (!entries[index]) {
> > +                       zswap_reject_kmemcache_fail++;
> > +                       nr_pages = index;
> > +                       goto store_folio_failed;
> > +               }
> > +
> > +               entries[index]->handle = (unsigned long)ERR_PTR(-EINVAL);
> >         }
> >
> > -       /*
> > -        * We may have had an existing entry that became stale when
> > -        * the folio was redirtied and now the new version is being
> > -        * swapped out. Get rid of the old.
> > -        */
> > -       if (old)
> > -               zswap_entry_free(old);
> > +       if (!zswap_compress_folio(folio, entries, pool))
> > +               goto store_folio_failed;
> >
> > -       /*
> > -        * The entry is successfully compressed and stored in the tree, there is
> > -        * no further possibility of failure. Grab refs to the pool and objcg.
> > -        * These refs will be dropped by zswap_entry_free() when the entry is
> > -        * removed from the tree.
> > -        */
> > -       zswap_pool_get(pool);
> > -       if (objcg)
> > -               obj_cgroup_get(objcg);
> > +       for (index = 0; index < nr_pages; ++index) {
> > +               swp_entry_t page_swpentry = page_swap_entry(folio_page(folio,
> index));
> > +               struct zswap_entry *old, *entry = entries[index];
> > +
> > +               old = xa_store(swap_zswap_tree(page_swpentry),
> > +                              swp_offset(page_swpentry),
> > +                              entry, GFP_KERNEL);
> > +               if (xa_is_err(old)) {
> > +                       int err = xa_err(old);
> > +
> > +                       WARN_ONCE(err != -ENOMEM, "unexpected xarray error:
> %d\n", err);
> > +                       zswap_reject_alloc_fail++;
> > +                       goto store_folio_failed;
> > +               }
> >
> > -       /*
> > -        * We finish initializing the entry while it's already in xarray.
> > -        * This is safe because:
> > -        *
> > -        * 1. Concurrent stores and invalidations are excluded by folio lock.
> > -        *
> > -        * 2. Writeback is excluded by the entry not being on the LRU yet.
> > -        *    The publishing order matters to prevent writeback from seeing
> > -        *    an incoherent entry.
> > -        */
> > -       entry->pool = pool;
> > -       entry->swpentry = page_swpentry;
> > -       entry->objcg = objcg;
> > -       entry->referenced = true;
> > -       if (entry->length) {
> > -               INIT_LIST_HEAD(&entry->lru);
> > -               zswap_lru_add(&zswap_list_lru, entry);
> > +               /*
> > +                * We may have had an existing entry that became stale when
> > +                * the folio was redirtied and now the new version is being
> > +                * swapped out. Get rid of the old.
> > +                */
> > +               if (old)
> > +                       zswap_entry_free(old);
> > +
> > +               /*
> > +                * The entry is successfully compressed and stored in the tree,
> there is
> > +                * no further possibility of failure. Grab refs to the pool and objcg.
> > +                * These refs will be dropped by zswap_entry_free() when the
> entry is
> > +                * removed from the tree.
> > +                */
> > +               zswap_pool_get(pool);
> > +               if (objcg)
> > +                       obj_cgroup_get(objcg);
> > +
> > +               /*
> > +                * We finish initializing the entry while it's already in xarray.
> > +                * This is safe because:
> > +                *
> > +                * 1. Concurrent stores and invalidations are excluded by folio
> lock.
> > +                *
> > +                * 2. Writeback is excluded by the entry not being on the LRU yet.
> > +                *    The publishing order matters to prevent writeback from seeing
> > +                *    an incoherent entry.
> > +                */
> > +               entry->pool = pool;
> > +               entry->swpentry = page_swpentry;
> > +               entry->objcg = objcg;
> > +               entry->referenced = true;
> > +               if (entry->length) {
> > +                       INIT_LIST_HEAD(&entry->lru);
> > +                       zswap_lru_add(&zswap_list_lru, entry);
> > +               }
> > +
> > +               compressed_bytes += entry->length;
> >         }
> >
> > -       return entry->length;
> > +       kfree(entries);
> > +
> > +       return compressed_bytes;
> > +
> > +store_folio_failed:
> > +       for (index = 0; index < nr_pages; ++index) {
> > +               if (!IS_ERR_VALUE(entries[index]->handle))
> > +                       zpool_free(pool->zpool, entries[index]->handle);
> > +
> > +               zswap_entry_cache_free(entries[index]);
> > +       }
> 
> If there is a failure in xa_store() halfway through the entries, this
> loop will free all the compressed objects and entries. But, some of
> the entries are already in the xarray, and zswap_store() will try to
> free them again. This seems like a bug, or did I miss something here?

Thanks, great catch! Yes, this is a bug. I have a simple fix implemented
that I am currently testing and will include in v6.
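One possible shape for such a fix (an illustrative sketch only, not
necessarily what v6 will do): track how many entries have already been
published in the xarray, and release only the ones that have not, since
zswap_store()'s existing error path already walks the tree and frees the
entries it finds there. Assuming a local "nr_stored" counter that is
incremented after each successful xa_store():

store_folio_failed:
	/*
	 * Entries [0, nr_stored) are in the xarray and will be freed by
	 * zswap_store()'s cleanup; only release the ones that never made
	 * it into the tree.
	 */
	for (index = nr_stored; index < nr_pages; ++index) {
		if (!IS_ERR_VALUE(entries[index]->handle))
			zpool_free(pool->zpool, entries[index]->handle);

		zswap_entry_cache_free(entries[index]);
	}

	kfree(entries);

	return -EINVAL;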

> 
> > +
> > +       kfree(entries);
> >
> > -store_failed:
> > -       zpool_free(pool->zpool, entry->handle);
> > -compress_failed:
> > -       zswap_entry_cache_free(entry);
> >         return -EINVAL;
> >  }
> >
> > @@ -1549,8 +1601,8 @@ bool zswap_store(struct folio *folio)
> >         struct mem_cgroup *memcg = NULL;
> >         struct zswap_pool *pool;
> >         size_t compressed_bytes = 0;
> > +       ssize_t bytes;
> >         bool ret = false;
> > -       long index;
> >
> >         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > @@ -1584,15 +1636,11 @@ bool zswap_store(struct folio *folio)
> >                 mem_cgroup_put(memcg);
> >         }
> >
> > -       for (index = 0; index < nr_pages; ++index) {
> > -               struct page *page = folio_page(folio, index);
> > -               ssize_t bytes;
> > +       bytes = zswap_store_folio(folio, objcg, pool);
> > +       if (bytes < 0)
> > +               goto put_pool;
> >
> > -               bytes = zswap_store_page(page, objcg, pool);
> > -               if (bytes < 0)
> > -                       goto put_pool;
> > -               compressed_bytes += bytes;
> > -       }
> > +       compressed_bytes = bytes;
> 
> What's the point of having both compressed_bytes and bytes now?

The main reason was to cleanly handle a negative error value returned in "bytes"
(declared as ssize_t), as against a true total "compressed_bytes" (declared as size_t)
for the folio to use for objcg charging. This is similar to the current mainline
code where zswap_store() calls zswap_store_page(). I was hoping to avoid potential
issues with overflow/underflow, and for maintainability. Let me know if this is Ok.

Thanks,
Kanchana

> 
> >
> >         if (objcg) {
> >                 obj_cgroup_charge_zswap(objcg, compressed_bytes);
> > @@ -1622,6 +1670,7 @@ bool zswap_store(struct folio *folio)
> >                 pgoff_t offset = swp_offset(swp);
> >                 struct zswap_entry *entry;
> >                 struct xarray *tree;
> > +               long index;
> >
> >                 for (index = 0; index < nr_pages; ++index) {
> >                         tree = swap_zswap_tree(swp_entry(type, offset + index));
> > --
> > 2.27.0
> >

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 10/12] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching.
  2025-01-08  3:26     ` Sridhar, Kanchana P
@ 2025-01-08  4:16       ` Yosry Ahmed
  0 siblings, 0 replies; 55+ messages in thread
From: Yosry Ahmed @ 2025-01-08  4:16 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, akpm, linux-crypto, herbert,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh

[..]
> >
> > > +       }
> > > +
> > > +       acomp_ctx->buffers = kmalloc_node(nr_reqs * sizeof(u8 *),
> > GFP_KERNEL, cpu_to_node(cpu));
> >
> > Can we use kcalloc_node() here?
>
> I was wondering if the performance penalty of kcalloc_node() is acceptable
> because CPU onlining happens infrequently. If so, it appears zero-initializing
> the allocated memory will help with the cleanup code simplification in your
> subsequent comment.

I don't think zeroing in this path would be a problem.

>
> >
> > > +       if (!acomp_ctx->buffers)
> > > +               goto buf_fail;
> > > +
> > > +       for (i = 0; i < nr_reqs; ++i) {
> > > +               acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2,
> > GFP_KERNEL, cpu_to_node(cpu));
> > > +               if (!acomp_ctx->buffers[i]) {
> > > +                       for (j = 0; j < i; ++j)
> > > +                               kfree(acomp_ctx->buffers[j]);
> > > +                       kfree(acomp_ctx->buffers);
> > > +                       ret = -ENOMEM;
> > > +                       goto buf_fail;
> > > +               }
> > > +       }
> > > +
> > > +       acomp_ctx->reqs = kmalloc_node(nr_reqs * sizeof(struct acomp_req
> > *), GFP_KERNEL, cpu_to_node(cpu));
> >
> > Ditto.
>
> Sure.
>
> >
> > > +       if (!acomp_ctx->reqs)
> > >                 goto req_fail;
> > > +
> > > +       for (i = 0; i < nr_reqs; ++i) {
> > > +               acomp_ctx->reqs[i] = acomp_request_alloc(acomp_ctx->acomp);
> > > +               if (!acomp_ctx->reqs[i]) {
> > > +                       pr_err("could not alloc crypto acomp_request reqs[%d]
> > %s\n",
> > > +                              i, pool->tfm_name);
> > > +                       for (j = 0; j < i; ++j)
> > > +                               acomp_request_free(acomp_ctx->reqs[j]);
> > > +                       kfree(acomp_ctx->reqs);
> > > +                       ret = -ENOMEM;
> > > +                       goto req_fail;
> > > +               }
> > >         }
> > > -       acomp_ctx->req = req;
> > >
> > > +       /*
> > > +        * The crypto_wait is used only in fully synchronous, i.e., with scomp
> > > +        * or non-poll mode of acomp, hence there is only one "wait" per
> > > +        * acomp_ctx, with callback set to reqs[0], under the assumption that
> > > +        * there is at least 1 request per acomp_ctx.
> > > +        */
> > >         crypto_init_wait(&acomp_ctx->wait);
> > >         /*
> > >          * if the backend of acomp is async zip, crypto_req_done() will wakeup
> > >          * crypto_wait_req(); if the backend of acomp is scomp, the callback
> > >          * won't be called, crypto_wait_req() will return without blocking.
> > >          */
> > > -       acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
> > > +       acomp_request_set_callback(acomp_ctx->reqs[0],
> > CRYPTO_TFM_REQ_MAY_BACKLOG,
> > >                                    crypto_req_done, &acomp_ctx->wait);
> > >
> > > +       acomp_ctx->nr_reqs = nr_reqs;
> > >         return 0;
> > >
> > >  req_fail:
> > > +       for (i = 0; i < nr_reqs; ++i)
> > > +               kfree(acomp_ctx->buffers[i]);
> > > +       kfree(acomp_ctx->buffers);
> >
> > The cleanup code is all over the place. Sometimes it's done in the
> > loops allocating the memory and sometimes here. It's a bit hard to
> > follow. Please have all the cleanups here. You can just initialize the
> > arrays to 0s, and then if the array is not-NULL you can free any
> > non-NULL elements (kfree() will handle NULLs gracefully).
>
> Sure, if performance of kzalloc_node() is an acceptable trade-off for the
> cleanup code simplification.
>
> >
> > There may be even potential for code reuse with zswap_cpu_comp_dead().
>
> I assume the reuse will be through copying and pasting the same lines of code,
> as opposed to a common procedure being called by zswap_cpu_comp_prepare()
> and zswap_cpu_comp_dead()?

Well, I meant we can possibly introduce the helper that will be used
by both zswap_cpu_comp_prepare() and zswap_cpu_comp_dead() (for
example see __mem_cgroup_free() called from both the freeing path and
the allocation path to do cleanup).

I didn't look too closely into it though; maybe it's best to keep them
separate, depending on how the code ends up looking.
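
Roughly, something like the below (a hypothetical sketch with made-up
naming; it assumes the reqs/buffers arrays are allocated zero-initialized,
e.g. with kcalloc_node(), so the same teardown can run from both the
prepare error path and the CPU-dead callback):

static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx, int nr_reqs)
{
	int i;

	if (IS_ERR_OR_NULL(acomp_ctx))
		return;

	if (acomp_ctx->reqs) {
		for (i = 0; i < nr_reqs; ++i)
			if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
				acomp_request_free(acomp_ctx->reqs[i]);
		kfree(acomp_ctx->reqs);
		acomp_ctx->reqs = NULL;
	}

	if (acomp_ctx->buffers) {
		for (i = 0; i < nr_reqs; ++i)
			kfree(acomp_ctx->buffers[i]);
		kfree(acomp_ctx->buffers);
		acomp_ctx->buffers = NULL;
	}

	if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
		crypto_free_acomp(acomp_ctx->acomp);
}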


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 11/12] mm: zswap: Restructure & simplify zswap_store() to make it amenable for batching.
  2025-01-08  3:57     ` Sridhar, Kanchana P
@ 2025-01-08  4:22       ` Yosry Ahmed
  0 siblings, 0 replies; 55+ messages in thread
From: Yosry Ahmed @ 2025-01-08  4:22 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, 21cnbao, akpm, linux-crypto, herbert,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh

[..]
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index 99cd78891fd0..1be0f1807bfc 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -1467,77 +1467,129 @@ static void shrink_worker(struct work_struct
> > *w)
> > >  * main API
> > >  **********************************/
> > >
> > > -static ssize_t zswap_store_page(struct page *page,
> > > -                               struct obj_cgroup *objcg,
> > > -                               struct zswap_pool *pool)
> > > +static bool zswap_compress_folio(struct folio *folio,
> > > +                                struct zswap_entry *entries[],
> > > +                                struct zswap_pool *pool)
> > >  {
> > > -       swp_entry_t page_swpentry = page_swap_entry(page);
> > > -       struct zswap_entry *entry, *old;
> > > +       long index, nr_pages = folio_nr_pages(folio);
> > >
> > > -       /* allocate entry */
> > > -       entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
> > > -       if (!entry) {
> > > -               zswap_reject_kmemcache_fail++;
> > > -               return -EINVAL;
> > > +       for (index = 0; index < nr_pages; ++index) {
> > > +               struct page *page = folio_page(folio, index);
> > > +
> > > +               if (!zswap_compress(page, entries[index], pool))
> > > +                       return false;
> > >         }
> > >
> > > -       if (!zswap_compress(page, entry, pool))
> > > -               goto compress_failed;
> > > +       return true;
> > > +}
> > >
> > > -       old = xa_store(swap_zswap_tree(page_swpentry),
> > > -                      swp_offset(page_swpentry),
> > > -                      entry, GFP_KERNEL);
> > > -       if (xa_is_err(old)) {
> > > -               int err = xa_err(old);
> > > +/*
> > > + * Store all pages in a folio.
> > > + *
> > > + * The error handling from all failure points is consolidated to the
> > > + * "store_folio_failed" label, based on the initialization of the zswap
> > entries'
> > > + * handles to ERR_PTR(-EINVAL) at allocation time, and the fact that the
> > > + * entry's handle is subsequently modified only upon a successful
> > zpool_malloc()
> > > + * after the page is compressed.
> > > + */
> > > +static ssize_t zswap_store_folio(struct folio *folio,
> > > +                                struct obj_cgroup *objcg,
> > > +                                struct zswap_pool *pool)
> > > +{
> > > +       long index, nr_pages = folio_nr_pages(folio);
> > > +       struct zswap_entry **entries = NULL;
> > > +       int node_id = folio_nid(folio);
> > > +       size_t compressed_bytes = 0;
> > >
> > > -               WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n",
> > err);
> > > -               zswap_reject_alloc_fail++;
> > > -               goto store_failed;
> > > +       entries = kmalloc(nr_pages * sizeof(*entries), GFP_KERNEL);
> >
> > We can probably use kcalloc() here.
>
> I am a little worried about the latency penalty of kcalloc() in the reclaim path,
> especially since I am not relying on zero-initialized memory for "entries"..

Hmm good point, for a 2M THP we could be allocating an entire page here.

[..]
> > > @@ -1549,8 +1601,8 @@ bool zswap_store(struct folio *folio)
> > >         struct mem_cgroup *memcg = NULL;
> > >         struct zswap_pool *pool;
> > >         size_t compressed_bytes = 0;
> > > +       ssize_t bytes;
> > >         bool ret = false;
> > > -       long index;
> > >
> > >         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > >         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > > @@ -1584,15 +1636,11 @@ bool zswap_store(struct folio *folio)
> > >                 mem_cgroup_put(memcg);
> > >         }
> > >
> > > -       for (index = 0; index < nr_pages; ++index) {
> > > -               struct page *page = folio_page(folio, index);
> > > -               ssize_t bytes;
> > > +       bytes = zswap_store_folio(folio, objcg, pool);
> > > +       if (bytes < 0)
> > > +               goto put_pool;
> > >
> > > -               bytes = zswap_store_page(page, objcg, pool);
> > > -               if (bytes < 0)
> > > -                       goto put_pool;
> > > -               compressed_bytes += bytes;
> > > -       }
> > > +       compressed_bytes = bytes;
> >
> > What's the point of having both compressed_bytes and bytes now?
>
> The main reason was to cleanly handle a negative error value returned in "bytes"
> (declared as ssize_t), as against a true total "compressed_bytes" (declared as size_t)
> for the folio to use for objcg charging. This is similar to the current mainline
> code where zswap_store() calls zswap_store_page(). I was hoping to avoid potential
> issues with overflow/underflow, and for maintainability. Let me know if this is Ok.

It makes sense in the current mainline because we store the return
value of each call to zswap_store_page() in 'bytes', then check if
it's an error value, then add it to 'compressed_bytes'. Now we have a
single call to zswap_store_folio() and a single return value. AFAICT,
there is currently no benefit to storing it in 'bytes', checking it,
then moving it to 'compressed_bytes'. The compiler will probably
optimize the variable away anyway, but it looks weird.
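
In other words, something like the below (a minimal sketch of the shape
being suggested; whether zswap_store() keeps a separate size_t for the
objcg charge is of course up to you):

	ssize_t bytes;

	/* ... */

	bytes = zswap_store_folio(folio, objcg, pool);
	if (bytes < 0)
		goto put_pool;

	if (objcg)
		obj_cgroup_charge_zswap(objcg, bytes);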


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-01-07  3:10               ` Yosry Ahmed
  2025-01-08  1:38                 ` Herbert Xu
@ 2025-02-16  5:17                 ` Herbert Xu
  2025-02-20 17:32                   ` Yosry Ahmed
  1 sibling, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2025-02-16  5:17 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Sridhar, Kanchana P, linux-kernel, linux-mm, hannes, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, davem, clabbe, ardb, ebiggers, surenb, Accardi,
	Kristen C, Feghali, Wajdi K, Gopal, Vinodh

On Mon, Jan 06, 2025 at 07:10:53PM -0800, Yosry Ahmed wrote:
>
> The main problem is memory usage. Zswap needs a PAGE_SIZE*2-sized
> buffer for each request on each CPU. We preallocate these buffers to
> avoid trying to allocate this much memory in the reclaim path (i.e.
> potentially allocating two pages to reclaim one).

Actually this PAGE_SIZE * 2 thing baffles me.  Why would you
allocate more memory than the input? The comment says that it's
because certain hardware accelerators will disregard the output
buffer length, but surely that's just a bug in the driver?

Which driver does this? We should fix it or remove it if it's
writing output with no regard to the maximum length.

You should only ever need PAGE_SIZE for the output buffer, if
the output exceeds that then just fail the compression.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-16  5:17                 ` Herbert Xu
@ 2025-02-20 17:32                   ` Yosry Ahmed
  2025-02-22  6:26                     ` Barry Song
  0 siblings, 1 reply; 55+ messages in thread
From: Yosry Ahmed @ 2025-02-20 17:32 UTC (permalink / raw)
  To: Herbert Xu, Barry Song
  Cc: Sridhar, Kanchana P, linux-kernel, linux-mm, hannes, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, akpm,
	linux-crypto, davem, clabbe, ardb, ebiggers, surenb, Accardi,
	Kristen C, Feghali, Wajdi K, Gopal, Vinodh

On Sun, Feb 16, 2025 at 01:17:59PM +0800, Herbert Xu wrote:
> On Mon, Jan 06, 2025 at 07:10:53PM -0800, Yosry Ahmed wrote:
> >
> > The main problem is memory usage. Zswap needs a PAGE_SIZE*2-sized
> > buffer for each request on each CPU. We preallocate these buffers to
> > avoid trying to allocate this much memory in the reclaim path (i.e.
> > potentially allocating two pages to reclaim one).
> 
> Actually this PAGE_SIZE * 2 thing baffles me.  Why would you
> allocate more memory than the input? The comment says that it's
> because certain hardware accelerators will disregard the output
> buffer length, but surely that's just a bug in the driver?
> 
> Which driver does this? We should fix it or remove it if it's
> writing output with no regard to the maximum length.
> 
> You should only ever need PAGE_SIZE for the output buffer, if
> the output exceeds that then just fail the compression.

I agree this should be fixed if it can be. This was discussed before
here:
https://lore.kernel.org/lkml/CAGsJ_4wuTZcGurby9h4PU2DwFaiEKB4bxuycaeyz3bPw3jSX3A@mail.gmail.com/

Barry is the one who brought up why we need PAGE_SIZE*2. Barry, could
you please chime in here?

> 
> Cheers,
> -- 
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-20 17:32                   ` Yosry Ahmed
@ 2025-02-22  6:26                     ` Barry Song
  2025-02-22  6:34                       ` Herbert Xu
  2025-02-22 12:31                       ` Sergey Senozhatsky
  0 siblings, 2 replies; 55+ messages in thread
From: Barry Song @ 2025-02-22  6:26 UTC (permalink / raw)
  To: Yosry Ahmed, Minchan Kim, Sergey Senozhatsky
  Cc: Herbert Xu, Sridhar, Kanchana P, linux-kernel, linux-mm, hannes,
	nphamcs, chengming.zhou, usamaarif642, ryan.roberts, akpm,
	linux-crypto, davem, clabbe, ardb, ebiggers, surenb, Accardi,
	Kristen C, Feghali, Wajdi K, Gopal, Vinodh

On Fri, Feb 21, 2025 at 6:32 AM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
>
> On Sun, Feb 16, 2025 at 01:17:59PM +0800, Herbert Xu wrote:
> > On Mon, Jan 06, 2025 at 07:10:53PM -0800, Yosry Ahmed wrote:
> > >
> > > The main problem is memory usage. Zswap needs a PAGE_SIZE*2-sized
> > > buffer for each request on each CPU. We preallocate these buffers to
> > > avoid trying to allocate this much memory in the reclaim path (i.e.
> > > potentially allocating two pages to reclaim one).
> >
> > Actually this PAGE_SIZE * 2 thing baffles me.  Why would you
> > allocate more memory than the input? The comment says that it's
> > because certain hardware accelerators will disregard the output
> > buffer length, but surely that's just a bug in the driver?
> >
> > Which driver does this? We should fix it or remove it if it's
> > writing output with no regard to the maximum length.
> >
> > You should only ever need PAGE_SIZE for the output buffer, if
> > the output exceeds that then just fail the compression.
>
> I agree this should be fixed if it can be. This was discussed before
> here:
> https://lore.kernel.org/lkml/CAGsJ_4wuTZcGurby9h4PU2DwFaiEKB4bxuycaeyz3bPw3jSX3A@mail.gmail.com/
>
> Barry is the one who brought up why we need PAGE_SIZE*2. Barry, could
> you please chime in here?

I'm not sure if any real hardware driver fails to return -ERRNO, but judging
from the previous code comment, there could be another reason why zRAM
doesn't want -ERRNO:
"When we receive -ERRNO from the compression backend, there's nothing more
we can do":

int zcomp_compress(struct zcomp_strm *zstrm,
                const void *src, unsigned int *dst_len)
{
        /*
         * Our dst memory (zstrm->buffer) is always `2 * PAGE_SIZE' sized
         * because sometimes we can endup having a bigger compressed data
         * due to various reasons: for example compression algorithms tend
         * to add some padding to the compressed buffer. Speaking of padding,
         * comp algorithm `842' pads the compressed length to multiple of 8
         * and returns -ENOSP when the dst memory is not big enough, which
         * is not something that ZRAM wants to see. We can handle the
         * `compressed_size > PAGE_SIZE' case easily in ZRAM, but when we
         * receive -ERRNO from the compressing backend we can't help it
         * anymore. To make `842' happy we need to tell the exact size of
         * the dst buffer, zram_drv will take care of the fact that
         * compressed buffer is too big.
         */
        *dst_len = PAGE_SIZE * 2;

        return crypto_comp_compress(zstrm->tfm,
                        src, PAGE_SIZE,
                        zstrm->buffer, dst_len);
}

After reviewing the zRAM code, I don't see why zram_write_page() needs
to rely on
comp_len to call write_incompressible_page().

zram_write_page()
{
        ret = zcomp_compress(zram->comps[ZRAM_PRIMARY_COMP], zstrm,
                             mem, &comp_len);
        kunmap_local(mem);

        if (unlikely(ret)) {
                zcomp_stream_put(zstrm);
                pr_err("Compression failed! err=%d\n", ret);
                return ret;
        }

        if (comp_len >= huge_class_size) {
                zcomp_stream_put(zstrm);
                return write_incompressible_page(zram, page, index);
        }
}

I mean, why can't we change it to the below:

zram_write_page()
{
        ret = zcomp_compress(zram->comps[ZRAM_PRIMARY_COMP], zstrm,
                             mem, &comp_len);
        kunmap_local(mem);

        if (unlikely(ret && ret != -ENOSP)) {
                zcomp_stream_put(zstrm);
                pr_err("Compression failed! err=%d\n", ret);
                return ret;
        }

        if (comp_len >= huge_class_size || ret) {
                zcomp_stream_put(zstrm);
                return write_incompressible_page(zram, page, index);
        }
}

As long as crypto drivers consistently return -ENOSP or a specific error
code for dst_buf overflow, we should be able to eliminate the
2*PAGE_SIZE buffer.

My point is:
1. All drivers must be capable of handling dst_buf overflow.
2. All drivers must return a consistent and dedicated error code for
dst_buf overflow.

+Minchan, Sergey,
Do you think we can implement this change in zRAM by using PAGE_SIZE instead
of 2 * PAGE_SIZE?
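
For concreteness, the zcomp side of such a change might look roughly like
the below (an untested sketch; it assumes the backends report a dedicated
dst_buf-overflow error, which zram_write_page() then maps onto
write_incompressible_page() as in the snippet above):

int zcomp_compress(struct zcomp_strm *zstrm,
                const void *src, unsigned int *dst_len)
{
        /*
         * Cap the destination at PAGE_SIZE; a page that does not fit is
         * reported back as a dst_buf overflow and treated as
         * incompressible by the caller, instead of being absorbed by a
         * 2 * PAGE_SIZE scratch buffer.
         */
        *dst_len = PAGE_SIZE;

        return crypto_comp_compress(zstrm->tfm,
                        src, PAGE_SIZE,
                        zstrm->buffer, dst_len);
}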

>
> >
> > Cheers,
> > --
> > Email: Herbert Xu <herbert@gondor.apana.org.au>
> > Home Page: http://gondor.apana.org.au/~herbert/
> > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> >

Thanks
Barry


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-22  6:26                     ` Barry Song
@ 2025-02-22  6:34                       ` Herbert Xu
  2025-02-22  6:41                         ` Barry Song
  2025-02-22 12:31                       ` Sergey Senozhatsky
  1 sibling, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2025-02-22  6:34 UTC (permalink / raw)
  To: Barry Song
  Cc: Yosry Ahmed, Minchan Kim, Sergey Senozhatsky, Sridhar,
	Kanchana P, linux-kernel, linux-mm, hannes, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, akpm, linux-crypto,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh

On Sat, Feb 22, 2025 at 07:26:43PM +1300, Barry Song wrote:
>
> After reviewing the zRAM code, I don't see why zram_write_page() needs
> to rely on
> comp_len to call write_incompressible_page().
> 
> zram_write_page()
> {
>         ret = zcomp_compress(zram->comps[ZRAM_PRIMARY_COMP], zstrm,
>                              mem, &comp_len);
>         kunmap_local(mem);
> 
>         if (unlikely(ret)) {
>                 zcomp_stream_put(zstrm);
>                 pr_err("Compression failed! err=%d\n", ret);
>                 return ret;
>         }
> 
>         if (comp_len >= huge_class_size) {
>                 zcomp_stream_put(zstrm);
>                 return write_incompressible_page(zram, page, index);
>         }
> }

Surely any compression error should just be treated as an
incompressible page?

I mean we might wish to report unusual errors in case the
admin or developer can do something about it, but for the
system as a whole it should still continue as if the page
was simply incompressible.

> As long as crypto drivers consistently return -ENOSP or a specific error
> code for dst_buf overflow, we should be able to eliminate the
> 2*PAGE_SIZE buffer.

Yes we could certainly do that.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-22  6:34                       ` Herbert Xu
@ 2025-02-22  6:41                         ` Barry Song
  2025-02-22  6:52                           ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Barry Song @ 2025-02-22  6:41 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Yosry Ahmed, Minchan Kim, Sergey Senozhatsky, Sridhar,
	Kanchana P, linux-kernel, linux-mm, hannes, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, akpm, linux-crypto,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh

On Sat, Feb 22, 2025 at 7:34 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Sat, Feb 22, 2025 at 07:26:43PM +1300, Barry Song wrote:
> >
> > After reviewing the zRAM code, I don't see why zram_write_page() needs
> > to rely on
> > comp_len to call write_incompressible_page().
> >
> > zram_write_page()
> > {
> >         ret = zcomp_compress(zram->comps[ZRAM_PRIMARY_COMP], zstrm,
> >                              mem, &comp_len);
> >         kunmap_local(mem);
> >
> >         if (unlikely(ret)) {
> >                 zcomp_stream_put(zstrm);
> >                 pr_err("Compression failed! err=%d\n", ret);
> >                 return ret;
> >         }
> >
> >         if (comp_len >= huge_class_size) {
> >                 zcomp_stream_put(zstrm);
> >                 return write_incompressible_page(zram, page, index);
> >         }
> > }
>
> Surely any compression error should just be treated as an
> incompressible page?

Probably not, as an incompressible page might become compressible
after changing the algorithm. This is possible: users may switch to another
algorithm to compress an incompressible page in the background.

Errors other than dst_buf overflow are a completely different matter
though :-)

>
> I mean we might wish to report unusual errors in case the
> admin or developer can do something about it, but for the
> system as a whole it should still continue as if the page
> was simply incompressible.
>
> > As long as crypto drivers consistently return -ENOSP or a specific error
> > code for dst_buf overflow, we should be able to eliminate the
> > 2*PAGE_SIZE buffer.
>
> Yes we could certainly do that.
>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
>

Thanks
barry


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-22  6:41                         ` Barry Song
@ 2025-02-22  6:52                           ` Herbert Xu
  2025-02-22  7:13                             ` Barry Song
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2025-02-22  6:52 UTC (permalink / raw)
  To: Barry Song
  Cc: Yosry Ahmed, Minchan Kim, Sergey Senozhatsky, Sridhar,
	Kanchana P, linux-kernel, linux-mm, hannes, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, akpm, linux-crypto,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh

On Sat, Feb 22, 2025 at 07:41:54PM +1300, Barry Song wrote:
>
> Probably not, as an incompressible page might become compressible
> after changing the algorithm. This is possible: users may switch to another
> algorithm to compress an incompressible page in the background.

I don't understand the difference.  If something is wrong with
the system causing the compression algorithm to fail, shouldn't
zswap just hobble along as if the page was incompressible?

In fact it would be quite reasonable to try to recompress it if
the admin did change the algorithm later, because the error may
have been specific to the previous algorithm implementation.

Of course I totally agree that there should be a reporting
mechanism to catch errors that admins/developers should know
about.  But apart from reporting that error there should be
no difference between an inherently incompressible page vs.
buggy algorithm/broken hardware failing to compress the page.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-22  6:52                           ` Herbert Xu
@ 2025-02-22  7:13                             ` Barry Song
  2025-02-22  7:22                               ` Herbert Xu
  2025-02-24 21:49                               ` Yosry Ahmed
  0 siblings, 2 replies; 55+ messages in thread
From: Barry Song @ 2025-02-22  7:13 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Yosry Ahmed, Minchan Kim, Sergey Senozhatsky, Sridhar,
	Kanchana P, linux-kernel, linux-mm, hannes, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, akpm, linux-crypto,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh

On Sat, Feb 22, 2025 at 7:52 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Sat, Feb 22, 2025 at 07:41:54PM +1300, Barry Song wrote:
> >
> > Probably not, as an incompressible page might become compressible
> > after changing the algorithm. This is possible: users may switch to another
> > algorithm to compress an incompressible page in the background.
>
> I don't understand the difference.  If something is wrong with
> the system causing the compression algorithm to fail, shouldn't
> zswap just hobble along as if the page was incompressible?
>
> In fact it would be quite reasonable to try to recompress it if
> the admin did change the algorithm later, because the error may
> have been specific to the previous algorithm implementation.
>

Somehow, I find your comment reasonable. Another point I want
to mention is the semantic difference. For example, in a system
with only one algorithm, a dst_buf overflow still means a successful
swap-out. However, other errors actually indicate an I/O failure.
In such cases, vmscan.c will log the relevant error in pageout() to
notify the user.

Anyway, I'm not an authority on this, so I’d like to see comments
from Minchan, Sergey, and Yosry.

> Of course I totally agree that there should be a reporting
> mechanism to catch errors that admins/developers should know
> about.  But apart from reporting that error there should be
> no difference between an inherently incompressible page vs.
> buggy algorithm/broken hardware failing to compress the page.
>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
>

Thanks
Barry


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-22  7:13                             ` Barry Song
@ 2025-02-22  7:22                               ` Herbert Xu
  2025-02-22  8:21                                 ` Barry Song
  2025-02-24 21:49                               ` Yosry Ahmed
  1 sibling, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2025-02-22  7:22 UTC (permalink / raw)
  To: Barry Song
  Cc: Yosry Ahmed, Minchan Kim, Sergey Senozhatsky, Sridhar,
	Kanchana P, linux-kernel, linux-mm, hannes, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, akpm, linux-crypto,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh

On Sat, Feb 22, 2025 at 08:13:13PM +1300, Barry Song wrote:
>
> Somehow, I find your comment reasonable. Another point I want
> to mention is the semantic difference. For example, in a system
> with only one algorithm, a dst_buf overflow still means a successful
> swap-out. However, other errors actually indicate an I/O failure.
> In such cases, vmscan.c will log the relevant error in pageout() to
> notify the user.

I'm talking specifically about the error from the Crypto API,
not any other error.  So if you were using some sort of
offload device to do the compression, that could indeed fail
due to an IO error (perhaps the PCI bus is on fire :)

But because that's reported through the Crypto API, it should
not be treated any differently than an incompressible page,
except for reporting purposes.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-22  7:22                               ` Herbert Xu
@ 2025-02-22  8:21                                 ` Barry Song
  0 siblings, 0 replies; 55+ messages in thread
From: Barry Song @ 2025-02-22  8:21 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Yosry Ahmed, Minchan Kim, Sergey Senozhatsky, Sridhar,
	Kanchana P, linux-kernel, linux-mm, hannes, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, akpm, linux-crypto,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh

On Sat, Feb 22, 2025 at 8:23 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Sat, Feb 22, 2025 at 08:13:13PM +1300, Barry Song wrote:
> >
> > Somehow, I find your comment reasonable. Another point I want
> > to mention is the semantic difference. For example, in a system
> > with only one algorithm, a dst_buf overflow still means a successful
> > swap-out. However, other errors actually indicate an I/O failure.
> > In such cases, vmscan.c will log the relevant error in pageout() to
> > notify the user.
>
> I'm talking specifically about the error from the Crypto API,
> not any other error.  So if you were using some sort of
> offload device to do the compression, that could indeed fail
> due to an IO error (perhaps the PCI bus is on fire :)
>
> But because that's reported through the Crypto API, it should
> not be treated any differently than an incompressible page,
> except for reporting purposes.

I'm referring more to the mm subsystem :-)

Let me provide a concrete example. Below is a small program that will
swap out 16MB of memory to zRAM:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MB (1024 * 1024)
#define SIZE (16 * MB)

int main() {
    void *addr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                      MAP_ANON | MAP_PRIVATE, -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap failed");
        return 1;
    }

    for (size_t i = 0; i < SIZE / sizeof(int); i++) {
        ((int*)addr)[i] = rand();
    }

    if (madvise(addr, SIZE, MADV_PAGEOUT) != 0) {
        perror("madvise failed");
        return 1;
    }

    while (1);

    return 0;
}

For errors other than dst_buf overflow, we receive:

/ # ./a.out &
/ # free
               total        used        free      shared  buff/cache   available
Mem:          341228       77036      251872           0       20600      264192
Swap:        2703356           0     2703356
[1]+  Done                       ./a.out

/ # cat /proc/vmstat | grep swp
pswpin 0
pswpout 0
...

No memory has been swapped out, the swap-out counter is zero, and
the swap file is not used at all.

If this is an incompressible page (I mean a dst_buf overflow error), there is
no actual issue, and we get the following:

/ #
/ # free
               total        used        free      shared  buff/cache   available
Mem:          341228       92948      236248           0       20372      248280
Swap:        2703356       16384     2686972

/ # cat /proc/vmstat | grep swp
pswpin 0
pswpout 4096
...


>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
>

Thanks
Barry


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-22  6:26                     ` Barry Song
  2025-02-22  6:34                       ` Herbert Xu
@ 2025-02-22 12:31                       ` Sergey Senozhatsky
  2025-02-22 14:27                         ` Sergey Senozhatsky
                                           ` (2 more replies)
  1 sibling, 3 replies; 55+ messages in thread
From: Sergey Senozhatsky @ 2025-02-22 12:31 UTC (permalink / raw)
  To: Barry Song
  Cc: Yosry Ahmed, Minchan Kim, Sergey Senozhatsky, Herbert Xu,
	Sridhar, Kanchana P, linux-kernel, linux-mm, hannes, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, akpm, linux-crypto,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh

On (25/02/22 19:26), Barry Song wrote:
> After reviewing the zRAM code, I don't see why zram_write_page() needs
> to rely on
> comp_len to call write_incompressible_page().

[..]

> zram_write_page()
> {
>         ret = zcomp_compress(zram->comps[ZRAM_PRIMARY_COMP], zstrm,
>                              mem, &comp_len);
>         kunmap_local(mem);
> 
>         if (unlikely(ret && ret != -ENOSP)) {
>                 zcomp_stream_put(zstrm);
>                 pr_err("Compression failed! err=%d\n", ret);
>                 return ret;
>         }
> 
>         if (comp_len >= huge_class_size || ret) {
>                 zcomp_stream_put(zstrm);
>                 return write_incompressible_page(zram, page, index);
>         }
> }

Sorry, I'm slower than usual now, but why should we?  Shouldn't compression
algorithms just never fail, even on 3D videos, because otherwise they won't
be able to validate their Weissman score or something :)

On a serious note - what is the use-case here?  Is the failure here due to
some random "cosmic rays" that taint the compression H/W?  If so, then what
makes us believe that it's uni-directional?  What if it's decompression
that gets busted, and then you can't decompress anything previously
compressed and stored in zsmalloc?  Wouldn't it be better in this case
to turn the computer off and on again?

The idea behind zram's code is that incompressible pages are not unusual;
they are quite usual, in fact.  It's not necessarily that the data grew
in size after compression; the data is incompressible from zsmalloc's PoV.
That is, the algorithm wasn't able to compress a PAGE_SIZE buffer to an
object smaller than zsmalloc's huge-class watermark (around 3600 bytes,
depending on zspage chain size).  That's why we look at the comp-len.
Anything else is an error, perhaps a pretty catastrophic error.

> As long as crypto drivers consistently return -ENOSP or a specific error
> code for dst_buf overflow, we should be able to eliminate the
> 2*PAGE_SIZE buffer.
> 
> My point is:
> 1. All drivers must be capable of handling dst_buf overflow.
> 2. All drivers must return a consistent and dedicated error code for
> dst_buf overflow.

Sorry, where do these rules come from?

> +Minchan, Sergey,
> Do you think we can implement this change in zRAM by using PAGE_SIZE instead
> of 2 * PAGE_SIZE?

Sorry again, what problem are you solving?


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-22 12:31                       ` Sergey Senozhatsky
@ 2025-02-22 14:27                         ` Sergey Senozhatsky
  2025-02-23  0:14                           ` Herbert Xu
  2025-02-22 16:24                         ` Barry Song
  2025-02-23  0:24                         ` Herbert Xu
  2 siblings, 1 reply; 55+ messages in thread
From: Sergey Senozhatsky @ 2025-02-22 14:27 UTC (permalink / raw)
  To: Barry Song
  Cc: Yosry Ahmed, Minchan Kim, Herbert Xu, Sridhar, Kanchana P,
	linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, akpm, linux-crypto, davem, clabbe,
	ardb, ebiggers, surenb, Accardi, Kristen C, Feghali, Wajdi K,
	Gopal, Vinodh, Sergey Senozhatsky

On (25/02/22 21:31), Sergey Senozhatsky wrote:
> > As long as crypto drivers consistently return -ENOSP or a specific error
> > code for dst_buf overflow, we should be able to eliminate the
> > 2*PAGE_SIZE buffer.
> > 
> > My point is:
> > 1. All drivers must be capable of handling dst_buf overflow.
> > 2. All drivers must return a consistent and dedicated error code for
> > dst_buf overflow.

So I didn't look at all of them, but at least S/W lzo1 doesn't even
have a notion of max-output-len.  lzo1x_1_compress() accepts a pointer
to out_len which reports the size of the output stream (the algorithm is
free to produce any size), so there is no dst_buf overflow as far as lzo1
is concerned.  Unless I'm missing something or misunderstanding your points.
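
For reference, the in-kernel prototype (from include/linux/lzo.h, roughly)
makes the point visible: dst_len is purely an output parameter, so the
caller has no way to cap how much the algorithm writes:

/*
 * out_len only reports how many bytes were produced; there is no
 * "maximum destination length" input, so the only safety contract is
 * that dst is sized for worst-case expansion.
 */
int lzo1x_1_compress(const unsigned char *src, size_t src_len,
		     unsigned char *dst, size_t *dst_len,
		     void *wrkmem);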


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-22 12:31                       ` Sergey Senozhatsky
  2025-02-22 14:27                         ` Sergey Senozhatsky
@ 2025-02-22 16:24                         ` Barry Song
  2025-02-23  0:24                         ` Herbert Xu
  2 siblings, 0 replies; 55+ messages in thread
From: Barry Song @ 2025-02-22 16:24 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Yosry Ahmed, Minchan Kim, Herbert Xu, Sridhar, Kanchana P,
	linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, akpm, linux-crypto, davem, clabbe,
	ardb, ebiggers, surenb, Accardi, Kristen C, Feghali, Wajdi K,
	Gopal, Vinodh

On Sun, Feb 23, 2025 at 1:31 AM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
>
> On (25/02/22 19:26), Barry Song wrote:
> > After reviewing the zRAM code, I don't see why zram_write_page() needs
> > to rely on
> > comp_len to call write_incompressible_page().
>
> [..]
>
> > zram_write_page()
> > {
> >         ret = zcomp_compress(zram->comps[ZRAM_PRIMARY_COMP], zstrm,
> >                              mem, &comp_len);
> >         kunmap_local(mem);
> >
> >         if (unlikely(ret && ret != -ENOSP)) {
> >                 zcomp_stream_put(zstrm);
> >                 pr_err("Compression failed! err=%d\n", ret);
> >                 return ret;
> >         }
> >
> >         if (comp_len >= huge_class_size || ret) {
> >                 zcomp_stream_put(zstrm);
> >                 return write_incompressible_page(zram, page, index);
> >         }
> > }
>
> Sorry, I'm slower than usual now, but why should we?  Shouldn't compression
> algorithms just never fail, even on 3D videos, because otherwise they won't
> be able to validate their Weissman score or something :)
>
> On a serious note - what is the use-case here?  Is the failure here due to
> some random "cosmic rays" that taint the compression H/W?  If so, then what
> makes us believe that it's uni-directional?  What if it's decompression
> that gets busted, and then you can't decompress anything previously
> compressed and stored in zsmalloc?  Wouldn't it be better in this case
> to turn the computer off and on again?
>
> The idea behind zram's code is that incompressible pages are not unusual;
> they are quite usual, in fact.  It's not necessarily that the data grew
> in size after compression; the data is incompressible from zsmalloc's PoV.
> That is, the algorithm wasn't able to compress a PAGE_SIZE buffer to an
> object smaller than zsmalloc's huge-class watermark (around 3600 bytes,
> depending on zspage chain size).  That's why we look at the comp-len.
> Anything else is an error, perhaps a pretty catastrophic error.
>
> > As long as crypto drivers consistently return -ENOSP or a specific error
> > code for dst_buf overflow, we should be able to eliminate the
> > 2*PAGE_SIZE buffer.
> >
> > My point is:
> > 1. All drivers must be capable of handling dst_buf overflow.
> > 2. All drivers must return a consistent and dedicated error code for
> > dst_buf overflow.
>
> Sorry, where do these rules come from?
>
> > +Minchan, Sergey,
> > Do you think we can implement this change in zRAM by using PAGE_SIZE instead
> > of 2 * PAGE_SIZE?
>
> Sorry again, what problem are you solving?

The context is that both zswap and zRAM currently use a destination buffer of
2 * PAGE_SIZE instead of just PAGE_SIZE. Herbert, Chengming, and Yosry are
questioning why it hasn't been reduced to a single PAGE_SIZE, and some
attempts have been made to do so [1][2].

The rules are based on my thoughts on feasibility if we aim to reduce it to a
single PAGE_SIZE.

[1] https://lore.kernel.org/linux-mm/Z7F1B_blIbByYBzz@gondor.apana.org.au/
[2] https://lore.kernel.org/lkml/20231213-zswap-dstmem-v4-1-f228b059dd89@bytedance.com/

Thanks
Barry


^ permalink raw reply	[flat|nested] 55+ messages in thread
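
A rough user-space illustration, not taken from the patch-series, of why
2 * PAGE_SIZE is such a generous bound: the documented worst-case output
for a single 4 KiB page is only slightly above PAGE_SIZE for the common
algorithms. Assumes liblz4 and zlib are installed; the LZO figure is the
formula from its documentation. Build with "cc bound_demo.c -llz4 -lz".

#include <stdio.h>
#include <lz4.h>	/* LZ4_compressBound() */
#include <zlib.h>	/* compressBound() */

#define PAGE_SIZE 4096UL

int main(void)
{
	/* Worst-case compressed sizes for one page of input. */
	unsigned long lz4_worst  = LZ4_compressBound(PAGE_SIZE);
	unsigned long zlib_worst = compressBound(PAGE_SIZE);
	/* LZO documents its worst case as len + len/16 + 64 + 3. */
	unsigned long lzo_worst  = PAGE_SIZE + PAGE_SIZE / 16 + 64 + 3;

	printf("input size     : %lu\n", PAGE_SIZE);
	printf("lz4 worst case : %lu\n", lz4_worst);
	printf("zlib worst case: %lu\n", zlib_worst);
	printf("lzo worst case : %lu\n", lzo_worst);
	printf("2 * PAGE_SIZE  : %lu\n", 2 * PAGE_SIZE);
	return 0;
}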

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-22 14:27                         ` Sergey Senozhatsky
@ 2025-02-23  0:14                           ` Herbert Xu
  2025-02-23  2:09                             ` Sergey Senozhatsky
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2025-02-23  0:14 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Barry Song, Yosry Ahmed, Minchan Kim, Sridhar, Kanchana P,
	linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, akpm, linux-crypto, davem, clabbe,
	ardb, ebiggers, surenb, Accardi, Kristen C, Feghali, Wajdi K,
	Gopal, Vinodh

On Sat, Feb 22, 2025 at 11:27:49PM +0900, Sergey Senozhatsky wrote:
>
> So I didn't look at all of them, but at least S/W lzo1 doesn't even
> have a notion of max-output-len.  lzo1x_1_compress() accepts a pointer
> to out_len which tells the size of output stream (the algorithm is free
> to produce any), so there is no dst_buf overflow as far as lzo1 is
> concerned.  Unless I'm missing something or misunderstanding your points.

I just looked at deflate/zstd and they seem to be doing the right
things.

But yes lzo is a gaping security hole on the compression side.

The API has always specified a maximum output length and it needs
to be respected for both compression and decompression.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 55+ messages in thread
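
A minimal user-space illustration of the lzo point above, assuming liblzo2
is installed ("cc lzo_demo.c -llzo2"): lzo1x_1_compress() takes only an
output pointer, never an output capacity, so the caller has to provide the
documented worst-case buffer or risk exactly the overrun being discussed.

#include <stdio.h>
#include <stdlib.h>
#include <lzo/lzo1x.h>

#define PAGE_SIZE 4096UL

int main(void)
{
	static unsigned char src[PAGE_SIZE];
	/* No capacity argument exists, so only the documented worst case
	 * (len + len/16 + 64 + 3) is a safe destination size. */
	static unsigned char dst[PAGE_SIZE + PAGE_SIZE / 16 + 64 + 3];
	static unsigned char wrkmem[LZO1X_1_MEM_COMPRESS];
	lzo_uint dst_len = 0;
	size_t i;

	if (lzo_init() != LZO_E_OK)
		return 1;

	srand(0);
	for (i = 0; i < PAGE_SIZE; i++)
		src[i] = rand() & 0xff;		/* incompressible input */

	if (lzo1x_1_compress(src, PAGE_SIZE, dst, &dst_len, wrkmem) != LZO_E_OK)
		return 1;

	/* For incompressible data the output typically exceeds PAGE_SIZE,
	 * i.e. a bare PAGE_SIZE destination would have been overrun. */
	printf("in=%lu out=%lu\n", PAGE_SIZE, (unsigned long)dst_len);
	return 0;
}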

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-22 12:31                       ` Sergey Senozhatsky
  2025-02-22 14:27                         ` Sergey Senozhatsky
  2025-02-22 16:24                         ` Barry Song
@ 2025-02-23  0:24                         ` Herbert Xu
  2025-02-23  1:57                           ` Sergey Senozhatsky
  2 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2025-02-23  0:24 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Barry Song, Yosry Ahmed, Minchan Kim, Sridhar, Kanchana P,
	linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, akpm, linux-crypto, davem, clabbe,
	ardb, ebiggers, surenb, Accardi, Kristen C, Feghali, Wajdi K,
	Gopal, Vinodh

On Sat, Feb 22, 2025 at 09:31:41PM +0900, Sergey Senozhatsky wrote:
>
> The idea behind zram's code is that incompressible pages are not unusual,
> > they are quite usual, in fact.  It's not necessarily that the data grew
> > in size after compression; the data is incompressible from zsmalloc's PoV.
> > That is, the algorithm wasn't able to compress a PAGE_SIZE buffer to an
> object smaller than zsmalloc's huge-class-watermark (around 3600 bytes,
> depending on zspage chain size).  That's why we look at the comp-len.
> Anything else is an error, perhaps a pretty catastrophic error.

If you're rejecting everything above the watermark then you should
simply pass the watermark as the output length to the algorithm so
that it can stop doing useless work once it gets past that point.

> > +Minchan, Sergey,
> > Do you think we can implement this change in zRAM by using PAGE_SIZE instead
> > of 2 * PAGE_SIZE?
> 
> Sorry again, what problem are you solving?

For compression, there is no point in allocating a destination buffer
that is bigger than the original.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 55+ messages in thread
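
A user-space sketch of the flow suggested above, with zlib standing in for
the compressor and 3600 bytes standing in for zsmalloc's huge-class
watermark; both stand-ins are assumptions of this sketch, not taken from
the patches. Build with "cc watermark_demo.c -lz".

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

#define PAGE_SIZE	4096
#define WATERMARK	3600	/* illustrative huge-class watermark */

/* Returns the compressed length, or 0 if the page should be kept as-is. */
static unsigned int compress_to_watermark(unsigned char *src, unsigned char *dst)
{
	z_stream strm = { 0 };
	unsigned int out = 0;

	/* Raw deflate (negative windowBits), similar to the kernel wrapper. */
	if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
			 -MAX_WBITS, 8, Z_DEFAULT_STRATEGY) != Z_OK)
		return 0;

	strm.next_in   = src;
	strm.avail_in  = PAGE_SIZE;
	strm.next_out  = dst;
	strm.avail_out = WATERMARK;	/* the watermark is the output limit */

	/* Anything other than Z_STREAM_END means the result did not fit
	 * below the watermark; treat the page as incompressible instead
	 * of retrying with a bigger buffer. */
	if (deflate(&strm, Z_FINISH) == Z_STREAM_END)
		out = WATERMARK - strm.avail_out;

	deflateEnd(&strm);
	return out;
}

int main(void)
{
	static unsigned char page[PAGE_SIZE], dst[WATERMARK];
	unsigned int n, i;

	srand(0);
	for (i = 0; i < PAGE_SIZE; i++)		/* half trivial, half random */
		page[i] = (i < PAGE_SIZE / 2) ? 'A' : (rand() & 0xff);

	n = compress_to_watermark(page, dst);
	if (n)
		printf("compressed to %u bytes\n", n);
	else
		printf("did not fit below the watermark, store the page as-is\n");
	return 0;
}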

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-23  0:24                         ` Herbert Xu
@ 2025-02-23  1:57                           ` Sergey Senozhatsky
  0 siblings, 0 replies; 55+ messages in thread
From: Sergey Senozhatsky @ 2025-02-23  1:57 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Sergey Senozhatsky, Barry Song, Yosry Ahmed, Minchan Kim,
	Sridhar, Kanchana P, linux-kernel, linux-mm, hannes, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, akpm, linux-crypto,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh

On (25/02/23 08:24), Herbert Xu wrote:
> On Sat, Feb 22, 2025 at 09:31:41PM +0900, Sergey Senozhatsky wrote:
> >
> > The idea behind zram's code is that incompressible pages are not unusual,
> > they are quite usual, in fact.  It's not necessarily that the data grew
> > in size after compression; the data is incompressible from zsmalloc's PoV.
> > That is, the algorithm wasn't able to compress a PAGE_SIZE buffer to an
> > object smaller than zsmalloc's huge-class-watermark (around 3600 bytes,
> > depending on zspage chain size).  That's why we look at the comp-len.
> > Anything else is an error, perhaps a pretty catastrophic error.
> 
> If you're rejecting everything above the watermark then you should
> simply pass the watermark as the output length to the algorithm so
> that it can stop doing useless work once it gets past that point.

Makes sense.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-23  0:14                           ` Herbert Xu
@ 2025-02-23  2:09                             ` Sergey Senozhatsky
  2025-02-23  2:52                               ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Sergey Senozhatsky @ 2025-02-23  2:09 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Sergey Senozhatsky, Barry Song, Yosry Ahmed, Minchan Kim,
	Sridhar, Kanchana P, linux-kernel, linux-mm, hannes, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, akpm, linux-crypto,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh

On (25/02/23 08:14), Herbert Xu wrote:
> On Sat, Feb 22, 2025 at 11:27:49PM +0900, Sergey Senozhatsky wrote:
> >
> > So I didn't look at all of them, but at least S/W lzo1 doesn't even
> > have a notion of max-output-len.  lzo1x_1_compress() accepts a pointer
> > to out_len which tells the size of output stream (the algorithm is free
> > to produce any), so there is no dst_buf overflow as far as lzo1 is
> > concerned.  Unless I'm missing something or misunderstanding your points.
> 
> I just looked at deflate/zstd and they seem to be doing the right
> things.
> 
> But yes lzo is a gaping security hole on the compression side.

Right, for lzo/lzo-rle we need a safety page.

It also seems that there is no common way of reporting dst_buf overflow.
Some algos return -ENOSPC immediately, some don't return anything at all,
and deflate does its own thing - there are these places where they see
they are out of output space but they Z_OK it

if (s->pending != 0) {
	flush_pending(strm);
	if (strm->avail_out == 0) {
		/* Since avail_out is 0, deflate will be called again with
		 * more output space, but possibly with both pending and
		 * avail_in equal to zero. There won't be anything to do,
		 * but this is not an error situation so make sure we
		 * return OK instead of BUF_ERROR at next call of deflate:
		 */
		s->last_flush = -1;
		return Z_OK;
	}
}


^ permalink raw reply	[flat|nested] 55+ messages in thread
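
For reference, a tiny user-space reproduction of the zlib behaviour quoted
above (assumes zlib; build with "cc zok_demo.c -lz"): with incompressible
input and the output capped at PAGE_SIZE, deflate(..., Z_FINISH) comes back
with Z_OK rather than an error code, and it is the missing Z_STREAM_END
that the caller has to interpret as "did not fit".

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

#define PAGE_SIZE 4096

int main(void)
{
	static unsigned char src[PAGE_SIZE], dst[PAGE_SIZE];
	z_stream strm = { 0 };
	int i, ret;

	srand(0);
	for (i = 0; i < PAGE_SIZE; i++)
		src[i] = rand() & 0xff;		/* incompressible input */

	if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
			 -MAX_WBITS, 8, Z_DEFAULT_STRATEGY) != Z_OK)
		return 1;

	strm.next_in   = src;
	strm.avail_in  = PAGE_SIZE;
	strm.next_out  = dst;
	strm.avail_out = PAGE_SIZE;	/* deliberately too small */

	ret = deflate(&strm, Z_FINISH);
	/* Expected: Z_OK (0), i.e. "call me again with more output space",
	 * rather than Z_STREAM_END (1) or a negative error code. */
	printf("deflate(Z_FINISH) = %d, Z_OK = %d, Z_STREAM_END = %d\n",
	       ret, Z_OK, Z_STREAM_END);

	deflateEnd(&strm);
	return 0;
}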

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-23  2:09                             ` Sergey Senozhatsky
@ 2025-02-23  2:52                               ` Herbert Xu
  2025-02-23  3:12                                 ` Sergey Senozhatsky
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2025-02-23  2:52 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Barry Song, Yosry Ahmed, Minchan Kim, Sridhar, Kanchana P,
	linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, akpm, linux-crypto, davem, clabbe,
	ardb, ebiggers, surenb, Accardi, Kristen C, Feghali, Wajdi K,
	Gopal, Vinodh

On Sun, Feb 23, 2025 at 11:09:32AM +0900, Sergey Senozhatsky wrote:
>
> Right, for lzo/lzo-rle we need a safety page.

We should fix it because it's a security hole for anyone who calls
it through the Crypto API.

> It also seems that there is no common way of reporting dst_buf overflow.
> Some algos return -ENOSPC immediately, some don't return anything at all,
> and deflate does its own thing - there are these places where they see
> they are out of output space but they Z_OK it
> 
> if (s->pending != 0) {
> 	flush_pending(strm);
> 	if (strm->avail_out == 0) {
> 		/* Since avail_out is 0, deflate will be called again with
> 		 * more output space, but possibly with both pending and
> 		 * avail_in equal to zero. There won't be anything to do,
> 		 * but this is not an error situation so make sure we
> 		 * return OK instead of BUF_ERROR at next call of deflate:
> 		 */
> 		s->last_flush = -1;
> 		return Z_OK;
> 	}
> }

Z_OK is actually an error, see crypto/deflate.c:

	ret = zlib_deflate(stream, Z_FINISH);
	if (ret != Z_STREAM_END) {
		ret = -EINVAL;
		goto out;
	}

We could change this to ENOSPC for consistency.

If you do find anything that returns 0 through the Crypto API please
let me know.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-23  2:52                               ` Herbert Xu
@ 2025-02-23  3:12                                 ` Sergey Senozhatsky
  2025-02-23  3:38                                   ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Sergey Senozhatsky @ 2025-02-23  3:12 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Sergey Senozhatsky, Barry Song, Yosry Ahmed, Minchan Kim,
	Sridhar, Kanchana P, linux-kernel, linux-mm, hannes, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, akpm, linux-crypto,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh

On (25/02/23 10:52), Herbert Xu wrote:
> On Sun, Feb 23, 2025 at 11:09:32AM +0900, Sergey Senozhatsky wrote:
> >
> > Right, for lzo/lzo-rle we need a safety page.
> 
> We should fix it because it's a security hole for anyone who calls
> it through the Crypto API.

Yeah, I don't disagree.

> > It also seems that there is no common way of reporting dst_buf overflow.
> > Some algos return -ENOSPC immediately, some don't return anything at all,
> > and deflate does its own thing - there are these places where they see
> > they are out of output space but they Z_OK it
> > 
> > if (s->pending != 0) {
> > 	flush_pending(strm);
> > 	if (strm->avail_out == 0) {
> > 		/* Since avail_out is 0, deflate will be called again with
> > 		 * more output space, but possibly with both pending and
> > 		 * avail_in equal to zero. There won't be anything to do,
> > 		 * but this is not an error situation so make sure we
> > 		 * return OK instead of BUF_ERROR at next call of deflate:
> > 		 */
> > 		s->last_flush = -1;
> > 		return Z_OK;
> > 	}
> > }
> 
> Z_OK is actually an error, see crypto/deflate.c:

I saw Z_STREAM_END, but deflate states "this is not an error" and
there are more places like this.

> 	ret = zlib_deflate(stream, Z_FINISH);
> 	if (ret != Z_STREAM_END) {
> 		ret = -EINVAL;
> 		goto out;
> 	}
> 
> We could change this to ENOSPC for consistency.

So it will ENOSPC all errors, not sure how good that is.  We also
have lz4/lz4hc that return the number of bytes "(((char *)op) - dest)"
if successful and 0 otherwise.  So any error is 0. dst_buf overrun
is also 0, impossible to tell the difference, again not sure if we
can just ENOSPC.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-23  3:12                                 ` Sergey Senozhatsky
@ 2025-02-23  3:38                                   ` Herbert Xu
  2025-02-23  4:02                                     ` Sergey Senozhatsky
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2025-02-23  3:38 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Barry Song, Yosry Ahmed, Minchan Kim, Sridhar, Kanchana P,
	linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, akpm, linux-crypto, davem, clabbe,
	ardb, ebiggers, surenb, Accardi, Kristen C, Feghali, Wajdi K,
	Gopal, Vinodh

On Sun, Feb 23, 2025 at 12:12:47PM +0900, Sergey Senozhatsky wrote:
>
> > > It also seems that there is no common way of reporting dst_buf overflow.
> > > Some algos return -ENOSPC immediately, some don't return anything at all,
> > > and deflate does its own thing - there are these places where they see
> > > they are out of output space but they Z_OK it
> > > 
> > > if (s->pending != 0) {
> > > 	flush_pending(strm);
> > > 	if (strm->avail_out == 0) {
> > > 		/* Since avail_out is 0, deflate will be called again with
> > > 		 * more output space, but possibly with both pending and
> > > 		 * avail_in equal to zero. There won't be anything to do,
> > > 		 * but this is not an error situation so make sure we
> > > 		 * return OK instead of BUF_ERROR at next call of deflate:
> > > 		 */
> > > 		s->last_flush = -1;
> > > 		return Z_OK;
> > > 	}
> > > }
> > 
> > Z_OK is actually an error, see crypto/deflate.c:
> 
> I saw Z_STREAM_END, but deflate states "this is not an error" and
> there are more places like this.

That would be a serious bug in deflate.  Where did you see it
return Z_STREAM_END in case of an overrun or error?

> So it will ENOSPC all errors, not sure how good that is.  We also
> have lz4/lz4hc that return the number of bytes "(((char *)op) - dest)"
> if successful and 0 otherwise.  So any error is 0. dst_buf overrun
> is also 0, impossible to tell the difference, again not sure if we
> can just ENOSPC.

I'm talking about the Crypto API calling convention.  Individual
compression libraries obviously have vastly different calling
conventions.

In the Crypto API, lz4 will return -EINVAL:

	int out_len = LZ4_compress_default(src, dst,
		slen, *dlen, ctx);

	if (!out_len)
		return -EINVAL;

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 55+ messages in thread
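
A user-space counterpart to the snippet above, assuming liblz4 is installed
("cc lz4_demo.c -llz4"): LZ4_compress_default() reports a too-small
destination the same way it reports any other failure, as a return value of
0, which is why the kernel wrapper can only map everything to one errno.

#include <stdio.h>
#include <stdlib.h>
#include <lz4.h>

#define PAGE_SIZE 4096

int main(void)
{
	static char src[PAGE_SIZE], dst[PAGE_SIZE];
	int i, out_len;

	srand(0);
	for (i = 0; i < PAGE_SIZE; i++)
		src[i] = rand() & 0xff;		/* incompressible input */

	/* dstCapacity == PAGE_SIZE is too small for incompressible data,
	 * so this returns 0 - the same value as any other failure. */
	out_len = LZ4_compress_default(src, dst, PAGE_SIZE, PAGE_SIZE);
	printf("LZ4_compress_default() = %d%s\n", out_len,
	       out_len == 0 ? "  (overrun or other error - caller cannot tell)" : "");
	return 0;
}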

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-23  3:38                                   ` Herbert Xu
@ 2025-02-23  4:02                                     ` Sergey Senozhatsky
  2025-02-23  6:04                                       ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Sergey Senozhatsky @ 2025-02-23  4:02 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Sergey Senozhatsky, Barry Song, Yosry Ahmed, Minchan Kim,
	Sridhar, Kanchana P, linux-kernel, linux-mm, hannes, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, akpm, linux-crypto,
	davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C,
	Feghali, Wajdi K, Gopal, Vinodh

On (25/02/23 11:38), Herbert Xu wrote:
> On Sun, Feb 23, 2025 at 12:12:47PM +0900, Sergey Senozhatsky wrote:
> >
> > > > It also seems that there is no common way of reporting dst_buf overflow.
> > > > Some algos return -ENOSPC immediately, some don't return anything at all,
> > > > and deflate does its own thing - there are these places where they see
> > > > they are out of output space but they Z_OK it
> > > > 
> > > > if (s->pending != 0) {
> > > > 	flush_pending(strm);
> > > > 	if (strm->avail_out == 0) {
> > > > 		/* Since avail_out is 0, deflate will be called again with
> > > > 		 * more output space, but possibly with both pending and
> > > > 		 * avail_in equal to zero. There won't be anything to do,
> > > > 		 * but this is not an error situation so make sure we
> > > > 		 * return OK instead of BUF_ERROR at next call of deflate:
> > > > 		 */
> > > > 		s->last_flush = -1;
> > > > 		return Z_OK;
> > > > 	}
> > > > }
> > > 
> > > Z_OK is actually an error, see crypto/deflate.c:
> > 
> > I saw Z_STREAM_END, but deflate states "this is not an error" and
> > there are more places like this.
> 
> That would be a serious bug in deflate.  Where did you see it
> return Z_STREAM_END in case of an overrun or error?

Oh, sorry for the confusion, I was talking about Z_OK for overruns.

> > So it will ENOSPC all errors, not sure how good that is.  We also
> > have lz4/lz4hc that return the number of bytes "(((char *)op) - dest)"
> > if successful and 0 otherwise.  So any error is 0. dst_buf overrun
> > is also 0, impossible to tell the difference, again not sure if we
> > can just ENOSPC.
> 
> I'm talking about the Crypto API calling convention.  Individual
> compression libraries obviously have vastly different calling
> conventions.
> 
> In the Crypto API, lz4 will return -EINVAL:
> 
> 	int out_len = LZ4_compress_default(src, dst,
> 		slen, *dlen, ctx);
> 
> 	if (!out_len)
> 		return -EINVAL;

Right, so you said that for deflate it could be

       ret = zlib_deflate(stream, Z_FINISH);
       if (ret != Z_STREAM_END) {
               ret = -ENOSPC;          // and not -EINVAL
               goto out;
       }

if I understood it correctly.  That would make it: return 0 on success
or -ENOSPC otherwise.  So if the crypto API wants consistency and returns
-ENOSPC for buffer overruns, then for lz4/lz4hc it also becomes binary:
either 0 or -ENOSPC.  The current -EINVAL return looks better to me, both
for deflate and for lz4/lz4hc.  -ENOSPC is an actionable error code: a user
can double the dst_out size and retry compression etc., while in reality it
could be some SW/HW issue that is misreported as -ENOSPC.



So re-iterating Barry's points:

> My point is:
> 1. All drivers must be capable of handling dst_buf overflow.

Not the case.

> 2. All drivers must return a consistent and dedicated error code for
> dst_buf overflow.

Not the case.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-23  4:02                                     ` Sergey Senozhatsky
@ 2025-02-23  6:04                                       ` Herbert Xu
  0 siblings, 0 replies; 55+ messages in thread
From: Herbert Xu @ 2025-02-23  6:04 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Barry Song, Yosry Ahmed, Minchan Kim, Sridhar, Kanchana P,
	linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, akpm, linux-crypto, davem, clabbe,
	ardb, ebiggers, surenb, Accardi, Kristen C, Feghali, Wajdi K,
	Gopal, Vinodh

On Sun, Feb 23, 2025 at 01:02:41PM +0900, Sergey Senozhatsky wrote:
>
> if I understood it correctly.  That would make it: return 0 on success
> or -ENOSPC otherwise.  So if the crypto API wants consistency and returns
> -ENOSPC for buffer overruns, then for lz4/lz4hc it also becomes binary:
> either 0 or -ENOSPC.  The current -EINVAL return looks better to me, both
> for deflate and for lz4/lz4hc.  -ENOSPC is an actionable error code: a user
> can double the dst_out size and retry compression etc., while in reality it
> could be some SW/HW issue that is misreported as -ENOSPC.

When you're compressing you're trying to make it smaller.  It's
always better to not compress something rather than doubling the
buffer on ENOSPC.

In any case, no software compression algorithm should ever fail
for a reason other than ENOSPC.

Hardware offload devices can fail of course.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-22  7:13                             ` Barry Song
  2025-02-22  7:22                               ` Herbert Xu
@ 2025-02-24 21:49                               ` Yosry Ahmed
  2025-02-27  3:05                                 ` Barry Song
  1 sibling, 1 reply; 55+ messages in thread
From: Yosry Ahmed @ 2025-02-24 21:49 UTC (permalink / raw)
  To: Barry Song
  Cc: Herbert Xu, Minchan Kim, Sergey Senozhatsky, Sridhar, Kanchana P,
	linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, akpm, linux-crypto, davem, clabbe,
	ardb, ebiggers, surenb, Accardi, Kristen C, Feghali, Wajdi K,
	Gopal, Vinodh

On Sat, Feb 22, 2025 at 08:13:13PM +1300, Barry Song wrote:
> On Sat, Feb 22, 2025 at 7:52 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
> >
> > On Sat, Feb 22, 2025 at 07:41:54PM +1300, Barry Song wrote:
> > >
> > > probably no, as an incompressible page might become compressible
> > > after changing an algorithm. This is possible: users may switch an
> > > algorithm to compress an incompressible page in the background.
> >
> > I don't understand the difference.  If something is wrong with
> > the system causing the compression algorithm to fail, shouldn't
> > zswap just hobble along as if the page was incompressible?
> >
> > In fact it would be quite reasonable to try to recompress it if
> > the admin did change the algorithm later because the error may
> > have been specific to the previous algorithm implementation.
> >
> 
> Somehow, I find your comment reasonable. Another point I want
> to mention is the semantic difference. For example, in a system
> with only one algorithm, a dst_buf overflow still means a successful
> swap-out. However, other errors actually indicate an I/O failure.
> In such cases, vmscan.c will log the relevant error in pageout() to
> notify the user.
> 
> Anyway, I'm not an authority on this, so I’d like to see comments
> from Minchan, Sergey, and Yosry.

From a zswap perspective, things are a bit simpler. Currently zswap
handles compression errors and pages compressing to above PAGE_SIZE in
the same way (because zs_pool_malloc() will fail for sizes larger than
PAGE_SIZE). In both cases, zswap_store() will err out, and the page will
either go to the underlying swap disk or reclaim of that page will fail
if writeback is disabled for this cgroup.

Zswap currently does not do anything special about incompressible pages,
it just passes them along to disk. So if the Crypto API can guarantee
that compression never writes past PAGE_SIZE, the main benefit for
zswap would be reducing the buffer size from PAGE_SIZE*2 to PAGE_SIZE.

If/when zswap develops handling of incompressible memory (to avoid LRU
inversion), I imagine we would handle compression errors and
incompressible pages similarly. In both cases we'd store the page as-is
and move th LRU along to write more pages to disk. There is no point to
fail the reclaim operation in this case, because unlike zram we do have
a choice :)


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching.
  2025-02-24 21:49                               ` Yosry Ahmed
@ 2025-02-27  3:05                                 ` Barry Song
  0 siblings, 0 replies; 55+ messages in thread
From: Barry Song @ 2025-02-27  3:05 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Herbert Xu, Minchan Kim, Sergey Senozhatsky, Sridhar, Kanchana P,
	linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, akpm, linux-crypto, davem, clabbe,
	ardb, ebiggers, surenb, Accardi, Kristen C, Feghali, Wajdi K,
	Gopal, Vinodh

On Tue, Feb 25, 2025 at 10:49 AM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
>
> On Sat, Feb 22, 2025 at 08:13:13PM +1300, Barry Song wrote:
> > On Sat, Feb 22, 2025 at 7:52 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > >
> > > On Sat, Feb 22, 2025 at 07:41:54PM +1300, Barry Song wrote:
> > > >
> > > > probably no, as an incompressible page might become compressible
> > > > after changing an algorithm. This is possible: users may switch an
> > > > algorithm to compress an incompressible page in the background.
> > >
> > > I don't understand the difference.  If something is wrong with
> > > the system causing the compression algorithm to fail, shouldn't
> > > zswap just hobble along as if the page was incompressible?
> > >
> > > In fact it would be quite reasonable to try to recompress it if
> > > the admin did change the algorithm later because the error may
> > > have been specific to the previous algorithm implementation.
> > >
> >
> > Somehow, I find your comment reasonable. Another point I want
> > to mention is the semantic difference. For example, in a system
> > with only one algorithm, a dst_buf overflow still means a successful
> > swap-out. However, other errors actually indicate an I/O failure.
> > In such cases, vmscan.c will log the relevant error in pageout() to
> > notify the user.
> >
> > Anyway, I'm not an authority on this, so I’d like to see comments
> > from Minchan, Sergey, and Yosry.
>
> From a zswap perspective, things are a bit simpler. Currently zswap
> handles compression errors and pages compressing to above PAGE_SIZE in
> the same way (because zs_pool_malloc() will fail for sizes larger than
> PAGE_SIZE). In both cases, zswap_store() will err out, and the page will
> either go to the underlying swap disk or reclaim of that page will fail
> if writeback is disabled for this cgroup.
>
> Zswap currently does not do anything special about incompressible pages,
> it just passes them along to disk. So if the Crypto API can guarantee
> that compression never writes past PAGE_SIZE, the main benefit for
> zswap would be reducing the buffer size from PAGE_SIZE*2 to PAGE_SIZE.
>
> If/when zswap develops handling of incompressible memory (to avoid LRU
> inversion), I imagine we would handle compression errors and
> incompressible pages similarly. In both cases we'd store the page as-is
> and move the LRU along to write more pages to disk. There is no point in
> failing the reclaim operation in this case, because unlike zram we do have
> a choice :)

Yes. For zswap, I suppose we just need to wait until all driver issues are
resolved, such as:
crypto: lzo - Fix compression buffer overrun
https://lore.kernel.org/lkml/Z7_JOAgi-Ej3CCic@gondor.apana.org.au/

For zswap, we just need to address point 1, which is not the case yet:

"
> 1. All drivers must be capable of handling dst_buf overflow.

Not the case.
"

Thanks
Barry


^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2025-02-27  3:05 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-21  6:31 [PATCH v5 00/12] zswap IAA compress batching Kanchana P Sridhar
2024-12-21  6:31 ` [PATCH v5 01/12] crypto: acomp - Add synchronous/asynchronous acomp request chaining Kanchana P Sridhar
2024-12-21  6:31 ` [PATCH v5 02/12] crypto: acomp - Define new interfaces for compress/decompress batching Kanchana P Sridhar
2024-12-28 11:46   ` Herbert Xu
2025-01-06 17:37     ` Sridhar, Kanchana P
2025-01-06 23:24       ` Yosry Ahmed
2025-01-07  1:36         ` Sridhar, Kanchana P
2025-01-07  1:46           ` Yosry Ahmed
2025-01-07  2:06             ` Herbert Xu
2025-01-07  3:10               ` Yosry Ahmed
2025-01-08  1:38                 ` Herbert Xu
2025-01-08  1:43                   ` Yosry Ahmed
2025-02-16  5:17                 ` Herbert Xu
2025-02-20 17:32                   ` Yosry Ahmed
2025-02-22  6:26                     ` Barry Song
2025-02-22  6:34                       ` Herbert Xu
2025-02-22  6:41                         ` Barry Song
2025-02-22  6:52                           ` Herbert Xu
2025-02-22  7:13                             ` Barry Song
2025-02-22  7:22                               ` Herbert Xu
2025-02-22  8:21                                 ` Barry Song
2025-02-24 21:49                               ` Yosry Ahmed
2025-02-27  3:05                                 ` Barry Song
2025-02-22 12:31                       ` Sergey Senozhatsky
2025-02-22 14:27                         ` Sergey Senozhatsky
2025-02-23  0:14                           ` Herbert Xu
2025-02-23  2:09                             ` Sergey Senozhatsky
2025-02-23  2:52                               ` Herbert Xu
2025-02-23  3:12                                 ` Sergey Senozhatsky
2025-02-23  3:38                                   ` Herbert Xu
2025-02-23  4:02                                     ` Sergey Senozhatsky
2025-02-23  6:04                                       ` Herbert Xu
2025-02-22 16:24                         ` Barry Song
2025-02-23  0:24                         ` Herbert Xu
2025-02-23  1:57                           ` Sergey Senozhatsky
2025-01-07  2:04       ` Herbert Xu
2024-12-21  6:31 ` [PATCH v5 03/12] crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable async mode Kanchana P Sridhar
2024-12-21  6:31 ` [PATCH v5 04/12] crypto: iaa - Implement batch_compress(), batch_decompress() API in iaa_crypto Kanchana P Sridhar
2024-12-22  4:07   ` kernel test robot
2024-12-21  6:31 ` [PATCH v5 05/12] crypto: iaa - Make async mode the default Kanchana P Sridhar
2024-12-21  6:31 ` [PATCH v5 06/12] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
2024-12-21  6:31 ` [PATCH v5 07/12] crypto: iaa - Re-organize the iaa_crypto driver code Kanchana P Sridhar
2024-12-21  6:31 ` [PATCH v5 08/12] crypto: iaa - Map IAA devices/wqs to cores based on packages instead of NUMA Kanchana P Sridhar
2024-12-21  6:31 ` [PATCH v5 09/12] crypto: iaa - Distribute compress jobs from all cores to all IAAs on a package Kanchana P Sridhar
2024-12-21  6:31 ` [PATCH v5 10/12] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching Kanchana P Sridhar
2025-01-07  0:58   ` Yosry Ahmed
2025-01-08  3:26     ` Sridhar, Kanchana P
2025-01-08  4:16       ` Yosry Ahmed
2024-12-21  6:31 ` [PATCH v5 11/12] mm: zswap: Restructure & simplify zswap_store() to make it amenable for batching Kanchana P Sridhar
2025-01-07  1:16   ` Yosry Ahmed
2025-01-08  3:57     ` Sridhar, Kanchana P
2025-01-08  4:22       ` Yosry Ahmed
2024-12-21  6:31 ` [PATCH v5 12/12] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios Kanchana P Sridhar
2025-01-07  1:19   ` Yosry Ahmed
2025-01-07  1:44 ` [PATCH v5 00/12] zswap IAA compress batching Yosry Ahmed

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox