* [PATCH v8 00/14] zswap IAA compress batching
From: Kanchana P Sridhar @ 2025-03-03 8:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
surenb, kristen.c.accardi
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
IAA Compression Batching with crypto_acomp Request Chaining:
============================================================
This patch-series introduces the use of the Intel In-Memory Analytics
Accelerator (IAA) for parallel batch compression of pages in large folios,
to improve zswap swapout latency. It does this by first creating a generic
batching framework in crypto_acomp using request chaining, and then
invoking the request chaining API to compress/decompress a batch in the
iaa_crypto driver.
From zswap's perspective, the most significant changes are:
1) A unified zswap_batch_compress() API is added to compress multiple
pages. If the compressor has multiple acomp requests, i.e., internally
supports batching, a request chain is created and crypto_acomp_compress()
is called with the head request in the chain. In the iaa_crypto driver,
the request chain is batch compressed in parallel.
If the compressor can only compress one page at a time, each page is
compressed sequentially.
In both cases, the calls to crypto are exactly the same (see the sketch
that follows this list):
crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
Many thanks to Yosry for this suggestion, because it is an essential
component of unifying common code paths between sequential/batching
compressions.
2) zswap_compress() has been deleted, since we call zswap_batch_compress()
in all cases.
3) A new zswap_store_pages() is added that stores multiple pages of a
folio, within a given range of indices. This is an extension of the earlier
zswap_store_page(), except that it operates on a batch of pages.
4) A new zswap_store_folio() is added that stores all pages in a folio by
calling zswap_store_pages(). If the compressor supports batching,
i.e., has multiple acomp requests, the folio will be compressed in
batches of "acomp_ctx->nr_reqs" using request chaining. If the
compressor has only one acomp request, the folio will be compressed
in batches of ZSWAP_MAX_BATCH_SIZE pages, where each page in the batch
is compressed sequentially.
5) A simplification of the acomp_ctx resources allocation/deletion
vis-a-vis CPU hot[un]plug. v8 of this patch-series proposes that these
resources are not destroyed during CPU offlining; instead, they have a
lifespan that tracks the zswap_pool's: from pool creation to pool deletion.
This is in response to Yosry's comments in v6 with regard to exploring
mutex locking options in zswap_cpu_comp_prepare().
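As an illustration, below is a minimal sketch of the unified call pattern
described in (1) above. This is not the literal patch 14 code: the field
names (acomp_ctx->reqs[], acomp_ctx->wait, acomp_ctx->nr_reqs) are taken
from this cover letter, and the per-page scatterlist setup and per-request
error collection are omitted:

    /*
     * Minimal sketch, not the literal patch 14 code: compress @nr pages
     * whose src/dst scatterlists have already been set up on
     * acomp_ctx->reqs[0..nr-1].
     */
    static int zswap_batch_compress_sketch(struct crypto_acomp_ctx *acomp_ctx,
                                           unsigned int nr)
    {
            unsigned int i;
            int err = 0;

            if (acomp_ctx->nr_reqs > 1) {
                    /*
                     * Batching compressor (e.g. deflate-iaa): chain requests
                     * 1..nr-1 to the head request reqs[0]; the driver then
                     * processes the whole chain, in parallel on IAA.
                     */
                    acomp_reqchain_init(acomp_ctx->reqs[0],
                                        CRYPTO_TFM_REQ_MAY_SLEEP,
                                        crypto_req_done, &acomp_ctx->wait);
                    for (i = 1; i < nr; i++)
                            acomp_request_chain(acomp_ctx->reqs[i],
                                                acomp_ctx->reqs[0]);

                    /* Per-request status can be read with acomp_request_err(). */
                    return crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]),
                                           &acomp_ctx->wait);
            }

            /*
             * Sequential compressor: reqs[0] is reused for each page in the
             * batch (per-page src/dst setup omitted here). The call into
             * crypto is exactly the same as in the batching case.
             */
            for (i = 0; i < nr && !err; i++)
                    err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]),
                                          &acomp_ctx->wait);

            return err;
    }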
Improvements seen with v8's IAA compress batching with request chaining
vs. IAA sequential:
usemem30 with 64K folios:
70% higher throughput
32% lower elapsed time
32% lower sys time
usemem30 with 2M folios:
68% higher throughput
24% lower elapsed time
24% lower sys time
Kernel compilation allmod config with 64K folios:
8.4% lower sys time
There are no regressions seen with zstd with either usemem30 or kernel
compilation allmod config.
In v8, there are only very minor changes wrt v7 for patches 1 through 13.
The only major change is in patch 14.
The patch-series is organized as follows:
1) crypto acomp & iaa_crypto driver enablers for batching: Relevant
patches are tagged with "crypto:" in the subject:
Patch 1) Adds a new acomp request chaining framework and interface based
on Herbert Xu's ahash reference implementation in "[PATCH 2/6]
crypto: hash - Add request chaining API" [1]. acomp algorithms
can use request chaining through these interfaces:
Set up the request chain:
acomp_reqchain_init()
acomp_request_chain()
Process the request chain:
acomp_do_req_chain(): synchronously (sequentially)
acomp_do_async_req_chain(): asynchronously using submit/poll
ops (in parallel)
Query if a request has a chain of requests that need to be
processed with it, as a batch:
acomp_is_reqchain()
Patch 2) Adds acomp_alg/crypto_acomp interfaces for get_batch_size(),
that swap modules can invoke using the new
crypto_acomp_batch_size() API, to get the maximum batch size
supported by a compressor before allocating batching
resources.
Patch 3) New CRYPTO_ACOMP_REQ_POLL acomp_req flag to act as a gate for
async poll mode in iaa_crypto.
Patch 4) iaa-crypto driver implementations for async and sync compress
and decompress batching using request chaining. The driver's
compress() and decompress() interface implementations will
query acomp_is_reqchain() to do batched vs. sequential
compression/decompression. If the iaa_crypto driver is set up
for 'async' sync_mode, these batching implementations deploy
the asynchronous request chaining framework provided via
acomp_do_async_req_chain() in patch 1. 'async' is the
recommended mode for realizing the benefits of IAA
parallelism. If iaa_crypto is set up for 'sync' sync_mode, the
synchronous version of the request chaining API is used, i.e.,
acomp_do_req_chain() - this will process the chain in series.
The "iaa_acomp_fixed_deflate" algorithm opts in to request
chaining with CRYPTO_ALG_REQ_CHAIN, and registers the
get_batch_size() interface, which returns the
IAA_CRYPTO_MAX_BATCH_SIZE constant that iaa_crypto defines
currently as 8U for IAA compression algorithms (iaa_crypto can
change this if needed as we optimize our batching algorithm).
Patch 5) Modifies the default iaa_crypto driver mode to async, now that
iaa_crypto provides a truly async mode that gives
significantly better latency than sync mode for the batching
use case.
Patch 6) Disables verify_compress by default, to make it easier for users
to run IAA and compare it against software compressors.
Patch 7) Reorganizes the iaa_crypto driver code into logically related
sections and avoids forward declarations, in order to facilitate
Patch 8. This patch makes no functional changes.
Patch 8) Makes a major infrastructure change in the iaa_crypto driver,
to map IAA devices/work-queues to cores based on packages
instead of NUMA nodes. This doesn't impact performance on
the Sapphire Rapids system used for performance
testing. However, this change fixes functional problems we
found on Granite Rapids in internal validation, where the
number of NUMA nodes is greater than the number of packages,
which was resulting in over-utilization of some IAA devices
and non-usage of other IAA devices as per the current NUMA
based mapping infrastructure.
This patch also eliminates duplication of device wqs in
per-cpu wq_tables, thereby saving 140MiB on a 384-core
Granite Rapids server with 8 IAAs. Submitting this change now
so that it can go through code reviews before it can be merged.
Patch 9) Builds upon the new infrastructure for mapping IAAs to cores
based on packages, and enables configuring a "global_wq" per
IAA, which can be used as a global resource for compress jobs
for the package. If the user configures 2WQs per IAA device,
the driver will distribute compress jobs from all cores on the
package to the "global_wqs" of all the IAA devices on that
package, in a round-robin manner. This can be used to improve
compression throughput for workloads that see a lot of swapout
activity.
Patch 10) Makes an important change to iaa_crypto driver's descriptor
allocation, from blocking to non-blocking with
retries/timeouts and mitigations in case of timeouts during
compress/decompress ops. This prevents tasks getting blocked
indefinitely, which was observed when testing 30 cores running
workloads, with only 1 IAA enabled on Sapphire Rapids (out of
4). These timeouts, and the associated mitigations, are
typically encountered only in configurations with 1 IAA
device shared by 30+ cores.
Patch 11) Fixes a bug with the "deflate_generic_tfm" global being
accessed without locks in the software decomp fallback code.
2) zswap modifications to enable compress batching in zswap_store()
of large folios (including pmd-mappable folios):
Patch 12) Simplifies acomp_ctx resources to have a lifespan from pool
creation to pool deletion, persisting through CPU hot[un]plugs
after initial allocation.
Patch 13) Defines a zswap-specific ZSWAP_MAX_BATCH_SIZE (currently set
as 8U) to denote the maximum number of acomp_ctx batching
resources. Further, the "struct crypto_acomp_ctx" is modified
to contain a configurable number of acomp_reqs and buffers.
The cpu hotplug onlining code will allocate up to
ZSWAP_MAX_BATCH_SIZE requests/buffers in the per-cpu
acomp_ctx, thereby limiting the memory usage in zswap, and
ensuring that non-batching compressors incur no memory penalty
except for minimal overhead.
Patch 14) Restructures & simplifies zswap_store() to make it amenable
for batching. Moves the loop over the folio's pages to a new
zswap_store_folio(), which in turn calls a new
zswap_store_pages() with a range of indices in the folio to be
stored. zswap_store_pages() calls the new
zswap_batch_compress() to use compression batching with
request chaining to compress multiple pages in one shot (in
parallel in IAA), or to sequentially compress each page in the
batch.
zstd performance is on par with mm-unstable. We see impressive
throughput/performance improvements with IAA batching
vs. no-batching.
With v8 of this patch series, the IAA compress batching feature will be
enabled seamlessly on Intel platforms that have IAA by selecting
'deflate-iaa' as the zswap compressor, and using the iaa_crypto 'async'
sync_mode driver attribute.
[1]: https://lore.kernel.org/linux-crypto/677614fbdc70b31df2e26483c8d2cd1510c8af91.1730021644.git.herbert@gondor.apana.org.au/
System setup for testing:
=========================
Testing of this patch-series was done with mm-unstable as of 2-27-2025,
commit d58172d128ac, without and with this patch-series.
Data was gathered on an Intel Sapphire Rapids (SPR) server, dual-socket
56 cores per socket, 4 IAA devices per socket, 503 GiB RAM and 525G SSD
disk partition swap. Core frequency was fixed at 2500MHz.
Other kernel configuration parameters:
zswap compressor : zstd, deflate-iaa
zswap allocator : zsmalloc
vm.page-cluster : 0
IAA "compression verification" is disabled and IAA is run in the async
mode (the defaults with this series).
I ran experiments with these workloads:
1) usemem 30 processes with these large folios enabled to "always":
- 64k
- 2048k
2) Kernel compilation allmodconfig with 2G max memory, 32 threads, run in
tmpfs with these large folios enabled to "always":
- 64k
Performance testing (usemem30):
===============================
The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
processes were run, each allocating and writing 10G of memory, and sleeping
for 10 sec before exiting:
usemem --init-time -w -O -b 1 -s 10 -n 30 10g
64K folios: usemem30: deflate-iaa:
==================================
-------------------------------------------------------------------------------
mm-unstable-2-27-2025 v8
-------------------------------------------------------------------------------
zswap compressor deflate-iaa deflate-iaa IAA Batching
vs.
Sequential
-------------------------------------------------------------------------------
Total throughput (KB/s) 6,025,001 10,220,686 70%
Avg throughput (KB/s) 200,833 340,689
elapsed time (sec) 100.25 68.48 -32%
sys time (sec) 2,414.12 1,630.42 -32%
-------------------------------------------------------------------------------
memcg_high 912,474 1,007,870
memcg_swap_fail 0 0
zswpout 58,338,296 64,507,786
zswpin 65 69
pswpout 0 0
pswpin 0 0
thp_swpout 0 0
thp_swpout_fallback 0 0
64kB_swpout_fallback 941 125
pgmajfault 2,867 2,943
anon_fault_alloc_64kB 4,924,286 4,923,648
ZSWPOUT-64kB 3,645,190 4,031,585
SWPOUT-64kB 0 0
-------------------------------------------------------------------------------
2M folios: usemem30: deflate-iaa:
=================================
-------------------------------------------------------------------------------
mm-unstable-2-27-2025 v8
-------------------------------------------------------------------------------
zswap compressor deflate-iaa deflate-iaa IAA Batching
vs.
Sequential
-------------------------------------------------------------------------------
Total throughput (KB/s) 6,374,303 10,720,261 68%
Avg throughput (KB/s) 212,476 357,342
elapsed time (sec) 87.04 65.80 -24%
sys time (sec) 2,012.30 1,522.93 -24%
-------------------------------------------------------------------------------
memcg_high 115,418 125,177
memcg_swap_fail 0 0
zswpout 59,471,372 64,509,017
zswpin 10 441
pswpout 0 0
pswpin 0 0
thp_swpout 0 0
thp_swpout_fallback 24 222
pgmajfault 2,784 3,226
anon_fault_alloc_2048kB 153,737 153,733
ZSWPOUT-2048kB 116,131 125,781
SWPOUT-2048kB 0 0
-------------------------------------------------------------------------------
64K folios: usemem30: zstd:
===========================
-------------------------------------------------------------------------------
mm-unstable-2-27-2025 v8 v8
-------------------------------------------------------------------------------
zswap compressor zstd zstd zstd
-------------------------------------------------------------------------------
Total throughput (KB/s) 6,920,374 6,935,013 6,924,096
Avg throughput (KB/s) 230,679 231,167 230,803
elapsed time (sec) 94.62 88.77 89.79
sys time (sec) 2,387.50 2,213.12 2,250.91
-------------------------------------------------------------------------------
memcg_high 764,199 763,887 764,378
memcg_swap_fail 0 0 0
zswpout 48,917,234 48,897,180 48,928,700
zswpin 426 422 430
pswpout 0 0 0
pswpin 0 0 0
thp_swpout 0 0 0
thp_swpout_fallback 0 0 0
64kB_swpout_fallback 2,190 2,178 2,231
pgmajfault 3,231 3,210 3,206
anon_fault_alloc_64kB 4,924,864 4,924,556 4,924,192
ZSWPOUT-64kB 3,055,153 3,053,916 3,055,843
SWPOUT-64kB 0 0 0
-------------------------------------------------------------------------------
2M folios: usemem30: zstd:
==========================
-------------------------------------------------------------------------------
mm-unstable-2-27-2025 v8
-------------------------------------------------------------------------------
zswap compressor zstd zstd
-------------------------------------------------------------------------------
Total throughput (KB/s) 7,655,965 7,685,827
Avg throughput (KB/s) 255,198 256,194
elapsed time (sec) 86.52 81.10
sys time (sec) 2,030.63 1,886.54
-------------------------------------------------------------------------------
memcg_high 93,106 93,054
memcg_swap_fail 0 0
zswpout 48,091,648 48,054,576
zswpin 0 398
pswpout 0 0
pswpin 0 0
thp_swpout 0 0
thp_swpout_fallback 8 203
pgmajfault 2,794 3,242
anon_fault_alloc_2048kB 153,736 153,742
ZSWPOUT-2048kB 93,921 93,678
SWPOUT-2048kB 0 0
-------------------------------------------------------------------------------
Performance testing (Kernel compilation, allmodconfig):
=======================================================
The kernel compilation experiments use "allmodconfig" with 32 threads, run
in tmpfs; each build takes ~12 minutes and has considerable swapout/swapin
activity. The cgroup's memory.max is set to 2G.
64K folios: Kernel compilation/allmodconfig:
============================================
--------------------------------------------------------------------------------
mm-unstable v8 mm-unstable v8
--------------------------------------------------------------------------------
zswap compressor deflate-iaa deflate-iaa zstd zstd
--------------------------------------------------------------------------------
real_sec 775.83 742.52 769.39 762.85
user_sec 15,659.10 15,644.82 15,666.28 15,636.25
sys_sec 4,209.69 3,856.43 5,277.86 5,242.89
--------------------------------------------------------------------------------
Max_Res_Set_Size_KB 1,871,116 1,873,212 1,873,200 1,871,732
--------------------------------------------------------------------------------
memcg_high 0 0 0 0
memcg_swap_fail 0 0 0 0
zswpout 107,305,181 101,061,420 86,621,912 87,004,261
zswpin 32,418,991 29,825,342 25,337,514 25,530,348
pswpout 272 597 94 86
pswpin 274 519 54 84
thp_swpout 0 0 0 0
thp_swpout_fallback 0 0 0 0
64kB_swpout_fallback 494 0 0 0
pgmajfault 34,577,545 31,830,251 26,892,991 27,111,920
ZSWPOUT-64kB 3,498,796 3,276,189 2,737,544 2,766,736
SWPOUT-64kB 17 36 4 5
--------------------------------------------------------------------------------
With the iaa_crypto driver changes for non-blocking descriptor allocations,
no timeouts-with-mitigations were seen in compress/decompress jobs in any
of the above experiments.
Summary:
========
The performance testing data with usemem 30 processes and the kernel
compilation test show 68%-70% throughput gains and 24%-32% sys time
reduction (usemem30), and an 8.4% sys time reduction (kernel compilation),
with zswap_store() of large folios using IAA compress batching as compared
to IAA sequential. There is no performance regression for zstd/usemem30 or
zstd/kernel compilation allmodconfig, and there is even a 7.3% sys time
reduction for zstd/usemem30, most likely due to the performance
optimizations implemented in zswap_batch_compress() in patch 14 of v8, such
as likely/unlikely compiler directives to minimize branch mis-predicts, and
the use of prefetchw on the zswap entry before it is written.
We can expect even more significant performance and throughput
improvements if we use the parallelism offered by IAA to do reclaim
batching of 4K/large folios (really, any-order folios), and use the
high-throughput zswap_store() compression to batch-compress the pages
comprising these folios, not just batching within large folios. This is the
reclaim batching patch 13 in v1, which will be submitted in a separate
patch-series.
Our internal validation of IAA compress/decompress batching in highly
contended Sapphire Rapids server setups, with workloads running on 72 cores
for ~25 minutes under stringent memory limit constraints, has shown up to a
50% reduction in sys time and 21.3% more memory savings with IAA, as
compared to zstd, for the same performance. IAA batching demonstrates more
than 2X the memory savings obtained by zstd for the same performance.
Changes since v7:
=================
1) Rebased to mm-unstable as of 3-3-2025, commit 5f089a9aa987.
2) Changed the acomp_ctx->nr_reqs to be u8 since ZSWAP_MAX_BATCH_SIZE is
defined as 8U, for saving memory in this per-cpu structure.
3) Fixed a typo in code comments in acomp_ctx_get_cpu_lock():
acomp_ctx->initialized to acomp_ctx->__online.
4) Incorporated suggestions from Yosry, Chengming, Nhat and Johannes,
thanks to all!
a) zswap_batch_compress() replaces zswap_compress(). Thanks Yosry
for this suggestion!
b) Process the folio in sub-batches of ZSWAP_MAX_BATCH_SIZE, regardless
of whether or not the compressor supports batching. This gets rid of
the kmalloc(entries), and allows us to allocate an array of
ZSWAP_MAX_BATCH_SIZE entries on the stack. This is implemented in
zswap_store_pages().
c) Use of a common structure and code paths for compressing a folio in
batches, either as a request chain (in parallel in IAA hardware) or
sequentially. No code duplication since zswap_compress() has been
replaced with zswap_batch_compress(), simplifying maintainability.
5) A key difference between compressors that support batching and
those that do not, is that for the latter, the acomp_ctx mutex is
locked/unlocked per ZSWAP_MAX_BATCH_SIZE batch, so that decompressions
to handle page-faults can make progress. This fixes the zstd kernel
compilation regression seen in v7. For compressors that support
batching, e.g., IAA, the mutex is locked/released once for storing
the folio.
6) Used likely/unlikely compiler directives and prefetchw to restore
performance with the common code paths.
Changes since v6:
=================
1) Rebased to mm-unstable as of 2-27-2025, commit d58172d128ac.
2) Deleted crypto_acomp_batch_compress() and
crypto_acomp_batch_decompress() interfaces, as per Herbert's
suggestion. Batching is instead enabled by chaining the requests. For
non-batching compressors, there is no request chaining involved. Both
batching and non-batching compressions are accomplished by zswap by
calling:
crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
3) iaa_crypto implementation of batch compressions/decompressions using
request chaining, as per Herbert's suggestions.
4) Simplification of the acomp_ctx resource allocation/deletion with
respect to CPU hot[un]plug, to address Yosry's suggestions to explore the
mutex options in zswap_cpu_comp_prepare(). Yosry, please let me know if
the per-cpu memory cost of this proposed change is acceptable (IAA:
64.8KB, Software compressors: 8.2KB). On the positive side, I believe
restarting reclaim on a CPU after it has been through an offline-online
transition, will be much faster by not deleting the acomp_ctx resources
when the CPU gets offlined.
5) Use of lockdep assertions rather than comments for internal locking
rules, as per Yosry's suggestion.
6) No specific references to IAA in zswap.c, as suggested by Yosry.
7) Explored various solutions other than the v6 zswap_store_folio()
implementation, to fix the zstd regression seen in v5, to attempt to
unify common code paths, and to allocate smaller arrays for the zswap
entries on the stack. All these options were found to cause usemem30
latency regression with zstd. The v6 version of zswap_store_folio() is
the only implementation that does not cause zstd regression, confirmed
by 10 consecutive runs, each giving quite consistent latency
numbers. Hence, the v6 implementation is carried forward to v7, with
changes for branching for batching vs. sequential compression API
calls.
Changes since v5:
=================
1) Rebased to mm-unstable as of 2-1-2025, commit 7de6fd8ab650.
Several improvements, regression fixes and bug fixes, based on Yosry's
v5 comments (Thanks Yosry!):
2) Fix for zstd performance regression in v5.
3) Performance debug and fix for marginal improvements with IAA batching
vs. sequential.
4) Performance testing data compares IAA with and without batching, instead
of IAA batching against zstd.
5) Commit logs/zswap comments not mentioning crypto_acomp implementation
details.
6) Delete the pr_info_once() when batching resources are allocated in
zswap_cpu_comp_prepare().
7) Use kcalloc_node() for the multiple acomp_ctx buffers/reqs in
zswap_cpu_comp_prepare().
8) Simplify and consolidate error handling cleanup code in
zswap_cpu_comp_prepare().
9) Introduce zswap_compress_folio() in a separate patch.
10) Bug fix in zswap_store_folio() when xa_store() failure can cause all
compressed objects and entries to be freed, and UAF when zswap_store()
tries to free the entries that were already added to the xarray prior
to the failure.
11) Deleting compressed_bytes/bytes. zswap_store_folio() also comprehends
the recent fixes in commit bf5eaaaf7941 ("mm/zswap: fix inconsistency
when zswap_store_page() fails") by Hyeonggon Yoo.
iaa_crypto improvements/fixes/changes:
12) Enables asynchronous mode and makes it the default. With commit
4ebd9a5ca478 ("crypto: iaa - Fix IAA disabling that occurs when
sync_mode is set to 'async'"), async mode was previously just sync. We
now have true async support.
13) Change idxd descriptor allocations from blocking to non-blocking with
timeouts, and mitigations for compress/decompress ops that fail to
obtain a descriptor. This is a fix for tasks blocked errors seen in
configurations where 30+ cores are running workloads under high memory
pressure, and sending comps/decomps to 1 IAA device.
14) Fixes a bug with unprotected access of "deflate_generic_tfm" in
deflate_generic_decompress(), which can cause data corruption and
zswap_decompress() kernel crash.
15) zswap uses crypto_acomp_batch_compress() with async polling instead of
request chaining for slightly better latency. However, the request
chaining framework itself is unchanged, preserved from v5.
Changes since v4:
=================
1) Rebased to mm-unstable as of 12-20-2024, commit 5555a83c82d6.
2) Added acomp request chaining, as suggested by Herbert. Thanks Herbert!
3) Implemented IAA compress batching using request chaining.
4) zswap_store() batching simplifications suggested by Chengming, Yosry and
Nhat, thanks to all!
- New zswap_compress_folio() that is called by zswap_store().
- Move the loop over folio's pages out of zswap_store() and into a
zswap_store_folio() that stores all pages.
- Allocate all zswap entries for the folio upfront.
- Added zswap_batch_compress().
- Branch to call zswap_compress() or zswap_batch_compress() inside
zswap_compress_folio().
- All iterations over pages kept in same function level.
- No helpers other than the newly added zswap_store_folio() and
zswap_compress_folio().
Changes since v3:
=================
1) Rebased to mm-unstable as of 11-18-2024, commit 5a7056135bb6.
2) Major re-write of iaa_crypto driver's mapping of IAA devices to cores,
based on packages instead of NUMA nodes.
3) Added acomp_has_async_batching() API to crypto acomp, that allows
zswap/zram to query if a crypto_acomp has registered batch_compress and
batch_decompress interfaces.
4) Clear the poll bits on the acomp_reqs passed to
iaa_comp_a[de]compress_batch() so that a module like zswap can be
confident about the acomp_reqs[0] not having the poll bit set before
calling the fully synchronous API crypto_acomp_[de]compress().
Herbert, I would appreciate it if you could review changes 2-4, in patches
1-8 of v4. I did not want to introduce too many iaa_crypto changes in
v4, given that patch 7 is already making a major change. I plan to work
on incorporating the request chaining using the ahash interface in v5
(I need to understand the basic crypto ahash better). Thanks Herbert!
5) Incorporated Johannes' suggestion to not have a sysctl to enable
compress batching.
6) Incorporated Yosry's suggestion to allocate batching resources in the
cpu hotplug onlining code, since there is no longer a sysctl to control
batching. Thanks Yosry!
7) Incorporated Johannes' suggestions related to making the overall
sequence of events between zswap_store() and zswap_batch_store() similar
as much as possible for readability and control flow, better naming of
procedures, avoiding forward declarations, not inlining error path
procedures, deleting zswap internal details from zswap.h, etc. Thanks
Johannes, really appreciate the direction!
I have tried to explain the minimal future-proofing in terms of the
zswap_batch_store() signature and the definition of "struct
zswap_batch_store_sub_batch" in the comments for this struct. I hope the
new code explains the control flow a bit better.
Changes since v2:
=================
1) Rebased to mm-unstable as of 11-5-2024, commit 7994b7ea6ac8.
2) Fixed an issue in zswap_create_acomp_ctx() with checking for NULL
returned by kmalloc_node() for acomp_ctx->buffers and for
acomp_ctx->reqs.
3) Fixed a bug in zswap_pool_can_batch() for returning true if
pool->can_batch_comp is found to be equal to BATCH_COMP_ENABLED, and if
the per-cpu acomp_batch_ctx tests true for batching resources having
been allocated on this cpu. Also, changed from per_cpu_ptr() to
raw_cpu_ptr().
4) Incorporated the zswap_store_propagate_errors() compilation warning fix
suggested by Dan Carpenter. Thanks Dan!
5) Replaced the references to SWAP_CRYPTO_SUB_BATCH_SIZE in comments in
zswap.h, with SWAP_CRYPTO_BATCH_SIZE.
Changes since v1:
=================
1) Rebased to mm-unstable as of 11-1-2024, commit 5c4cf96cd702.
2) Incorporated Herbert's suggestions to use an acomp_req flag to indicate
async/poll mode, and to encapsulate the polling functionality in the
iaa_crypto driver. Thanks Herbert!
3) Incorporated Herbert's and Yosry's suggestions to implement the batching
API in iaa_crypto and to make its use seamless from zswap's
perspective. Thanks Herbert and Yosry!
4) Incorporated Yosry's suggestion to make it more convenient for the user
to enable compress batching, while minimizing the memory footprint
cost. Thanks Yosry!
5) Incorporated Yosry's suggestion to de-couple the shrink_folio_list()
reclaim batching patch from this series, since it requires a broader
discussion.
I would greatly appreciate code review comments for the iaa_crypto driver
and mm patches included in this series!
Thanks,
Kanchana
Kanchana P Sridhar (14):
crypto: acomp - Add synchronous/asynchronous acomp request chaining.
crypto: acomp - New interfaces to facilitate batching support in acomp
& drivers.
crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable
async mode.
crypto: iaa - Implement batch compression/decompression with request
chaining.
crypto: iaa - Enable async mode and make it the default.
crypto: iaa - Disable iaa_verify_compress by default.
crypto: iaa - Re-organize the iaa_crypto driver code.
crypto: iaa - Map IAA devices/wqs to cores based on packages instead
of NUMA.
crypto: iaa - Distribute compress jobs from all cores to all IAAs on a
package.
crypto: iaa - Descriptor allocation timeouts with mitigations in
iaa_crypto.
crypto: iaa - Fix for "deflate_generic_tfm" global being accessed
without locks.
mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex
lock usage.
mm: zswap: Allocate pool batching resources if the compressor supports
batching.
mm: zswap: Compress batching with request chaining in zswap_store() of
large folios.
.../driver-api/crypto/iaa/iaa-crypto.rst | 11 +-
crypto/acompress.c | 285 +++
drivers/crypto/intel/iaa/iaa_crypto.h | 30 +-
drivers/crypto/intel/iaa/iaa_crypto_main.c | 1556 ++++++++++++-----
include/crypto/acompress.h | 79 +
include/crypto/algapi.h | 10 +
include/crypto/internal/acompress.h | 14 +
include/linux/crypto.h | 39 +
mm/zswap.c | 722 +++++---
9 files changed, 2043 insertions(+), 703 deletions(-)
base-commit: 5f089a9aa987ccf72df0c6955e168e865f280603
--
2.27.0
* [PATCH v8 01/14] crypto: acomp - Add synchronous/asynchronous acomp request chaining.
From: Kanchana P Sridhar @ 2025-03-03 8:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
surenb, kristen.c.accardi
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch is based on Herbert Xu's request chaining for ahash
("[PATCH 2/6] crypto: hash - Add request chaining API") [1]. The generic
framework for request chaining that's provided in the ahash implementation
has been used as a reference to develop a similar synchronous request
chaining framework for crypto_acomp.
Furthermore, this commit develops an asynchronous request chaining
framework and API that iaa_crypto can use for request chaining with
parallelism, in order to fully benefit from Intel IAA's multiple
compress/decompress engines in hardware. This allows us to gain significant
latency improvements with IAA batching as compared to synchronous request
chaining.
Usage of acomp request chaining API:
====================================
Any crypto_acomp compressor can avail of request chaining as follows:
Step 1: Create request chain:
Request 0 (the first req in the chain):
void acomp_reqchain_init(struct acomp_req *req,
u32 flags, crypto_completion_t compl,
void *data);
Subsequent requests:
void acomp_request_chain(struct acomp_req *req,
struct acomp_req *head);
Step 2: Process the request chain using the specified compress/decompress
"op":
2.a) Synchronous: the chain of requests is processed in series:
int acomp_do_req_chain(struct acomp_req *req,
int (*op)(struct acomp_req *req));
2.b) Asynchronous: the chain of requests is processed in parallel using a
submit-poll paradigm:
int acomp_do_async_req_chain(struct acomp_req *req,
int (*op_submit)(struct acomp_req *req),
int (*op_poll)(struct acomp_req *req));
Request chaining will be used in subsequent patches to implement
compress/decompress batching in the iaa_crypto driver for the two supported
IAA driver sync_modes:
sync_mode = 'sync' will use (2.a),
sync_mode = 'async' will use (2.b).
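As an illustration, here is a minimal sketch of how a driver's compress()
entry point might dispatch a request chain using these interfaces. The
my_drv_* names are hypothetical (the single-request, submit and poll ops
are assumed to be implemented elsewhere in the driver); the actual
iaa_crypto implementation follows in a later patch of this series:

    int my_drv_compress_one(struct acomp_req *req);  /* one req, synchronous */
    int my_drv_submit(struct acomp_req *req);        /* async submit */
    int my_drv_poll(struct acomp_req *req);          /* async completion poll */
    bool my_drv_async_mode;                          /* e.g. driver's sync_mode */

    static int my_drv_compress(struct acomp_req *req)
    {
            /* A single request: no chaining involved. */
            if (!acomp_is_reqchain(req))
                    return my_drv_compress_one(req);

            /*
             * A request chain (a batch): process it asynchronously with
             * submit/poll (in parallel on the hardware), or synchronously
             * in series, depending on the driver's mode.
             */
            if (my_drv_async_mode)
                    return acomp_do_async_req_chain(req, my_drv_submit,
                                                    my_drv_poll);

            return acomp_do_req_chain(req, my_drv_compress_one);
    }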
These files are directly re-used from [1] which is not yet merged:
include/crypto/algapi.h
include/linux/crypto.h
Hence, I am adding Herbert as the co-developer of this acomp request
chaining patch.
[1]: https://lore.kernel.org/linux-crypto/677614fbdc70b31df2e26483c8d2cd1510c8af91.1730021644.git.herbert@gondor.apana.org.au/
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Co-developed-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by:
---
crypto/acompress.c | 284 ++++++++++++++++++++++++++++
include/crypto/acompress.h | 46 +++++
include/crypto/algapi.h | 10 +
include/crypto/internal/acompress.h | 10 +
include/linux/crypto.h | 39 ++++
5 files changed, 389 insertions(+)
diff --git a/crypto/acompress.c b/crypto/acompress.c
index 6fdf0ff9f3c0..cb6444d09dd7 100644
--- a/crypto/acompress.c
+++ b/crypto/acompress.c
@@ -23,6 +23,19 @@ struct crypto_scomp;
static const struct crypto_type crypto_acomp_type;
+struct acomp_save_req_state {
+ struct list_head head;
+ struct acomp_req *req0;
+ struct acomp_req *cur;
+ int (*op)(struct acomp_req *req);
+ crypto_completion_t compl;
+ void *data;
+};
+
+static void acomp_reqchain_done(void *data, int err);
+static int acomp_save_req(struct acomp_req *req, crypto_completion_t cplt);
+static void acomp_restore_req(struct acomp_req *req);
+
static inline struct acomp_alg *__crypto_acomp_alg(struct crypto_alg *alg)
{
return container_of(alg, struct acomp_alg, calg.base);
@@ -123,6 +136,277 @@ struct crypto_acomp *crypto_alloc_acomp_node(const char *alg_name, u32 type,
}
EXPORT_SYMBOL_GPL(crypto_alloc_acomp_node);
+static int acomp_save_req(struct acomp_req *req, crypto_completion_t cplt)
+{
+ struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
+ struct acomp_save_req_state *state;
+ gfp_t gfp;
+ u32 flags;
+
+ if (!acomp_is_async(tfm))
+ return 0;
+
+ flags = acomp_request_flags(req);
+ gfp = (flags & CRYPTO_TFM_REQ_MAY_SLEEP) ? GFP_KERNEL : GFP_ATOMIC;
+ state = kmalloc(sizeof(*state), gfp);
+ if (!state)
+ return -ENOMEM;
+
+ state->compl = req->base.complete;
+ state->data = req->base.data;
+ state->req0 = req;
+
+ req->base.complete = cplt;
+ req->base.data = state;
+
+ return 0;
+}
+
+static void acomp_restore_req(struct acomp_req *req)
+{
+ struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
+ struct acomp_save_req_state *state;
+
+ if (!acomp_is_async(tfm))
+ return;
+
+ state = req->base.data;
+
+ req->base.complete = state->compl;
+ req->base.data = state->data;
+ kfree(state);
+}
+
+static int acomp_reqchain_finish(struct acomp_save_req_state *state,
+ int err, u32 mask)
+{
+ struct acomp_req *req0 = state->req0;
+ struct acomp_req *req = state->cur;
+ struct acomp_req *n;
+
+ req->base.err = err;
+
+ if (req == req0)
+ INIT_LIST_HEAD(&req->base.list);
+ else
+ list_add_tail(&req->base.list, &req0->base.list);
+
+ list_for_each_entry_safe(req, n, &state->head, base.list) {
+ list_del_init(&req->base.list);
+
+ req->base.flags &= mask;
+ req->base.complete = acomp_reqchain_done;
+ req->base.data = state;
+ state->cur = req;
+ err = state->op(req);
+
+ if (err == -EINPROGRESS) {
+ if (!list_empty(&state->head))
+ err = -EBUSY;
+ goto out;
+ }
+
+ if (err == -EBUSY)
+ goto out;
+
+ req->base.err = err;
+ list_add_tail(&req->base.list, &req0->base.list);
+ }
+
+ acomp_restore_req(req0);
+
+out:
+ return err;
+}
+
+static void acomp_reqchain_done(void *data, int err)
+{
+ struct acomp_save_req_state *state = data;
+ crypto_completion_t compl = state->compl;
+
+ data = state->data;
+
+ if (err == -EINPROGRESS) {
+ if (!list_empty(&state->head))
+ return;
+ goto notify;
+ }
+
+ err = acomp_reqchain_finish(state, err, CRYPTO_TFM_REQ_MAY_BACKLOG);
+ if (err == -EBUSY)
+ return;
+
+notify:
+ compl(data, err);
+}
+
+int acomp_do_req_chain(struct acomp_req *req,
+ int (*op)(struct acomp_req *req))
+{
+ struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
+ struct acomp_save_req_state *state;
+ struct acomp_save_req_state state0;
+ int err = 0;
+
+ if (!acomp_request_chained(req) || list_empty(&req->base.list) ||
+ !crypto_acomp_req_chain(tfm))
+ return op(req);
+
+ state = &state0;
+
+ if (acomp_is_async(tfm)) {
+ err = acomp_save_req(req, acomp_reqchain_done);
+ if (err) {
+ struct acomp_req *r2;
+
+ req->base.err = err;
+ list_for_each_entry(r2, &req->base.list, base.list)
+ r2->base.err = err;
+
+ return err;
+ }
+
+ state = req->base.data;
+ }
+
+ state->op = op;
+ state->cur = req;
+ INIT_LIST_HEAD(&state->head);
+ list_splice(&req->base.list, &state->head);
+
+ err = op(req);
+ if (err == -EBUSY || err == -EINPROGRESS)
+ return -EBUSY;
+
+ return acomp_reqchain_finish(state, err, ~0);
+}
+EXPORT_SYMBOL_GPL(acomp_do_req_chain);
+
+static void acomp_async_reqchain_done(struct acomp_req *req0,
+ struct list_head *state,
+ int (*op_poll)(struct acomp_req *req))
+{
+ struct acomp_req *req, *n;
+ bool req0_done = false;
+ int err;
+
+ while (!list_empty(state)) {
+
+ if (!req0_done) {
+ err = op_poll(req0);
+ if (!(err == -EAGAIN || err == -EINPROGRESS || err == -EBUSY)) {
+ req0->base.err = err;
+ req0_done = true;
+ }
+ }
+
+ list_for_each_entry_safe(req, n, state, base.list) {
+ err = op_poll(req);
+
+ if (err == -EAGAIN || err == -EINPROGRESS || err == -EBUSY)
+ continue;
+
+ req->base.err = err;
+ list_del_init(&req->base.list);
+ list_add_tail(&req->base.list, &req0->base.list);
+ }
+ }
+
+ while (!req0_done) {
+ err = op_poll(req0);
+ if (!(err == -EAGAIN || err == -EINPROGRESS || err == -EBUSY)) {
+ req0->base.err = err;
+ break;
+ }
+ }
+}
+
+static int acomp_async_reqchain_finish(struct acomp_req *req0,
+ struct list_head *state,
+ int (*op_submit)(struct acomp_req *req),
+ int (*op_poll)(struct acomp_req *req))
+{
+ struct acomp_req *req, *n;
+ int err = 0;
+
+ INIT_LIST_HEAD(&req0->base.list);
+
+ list_for_each_entry_safe(req, n, state, base.list) {
+ BUG_ON(req == req0);
+
+ err = op_submit(req);
+
+ if (!(err == -EINPROGRESS || err == -EBUSY)) {
+ req->base.err = err;
+ list_del_init(&req->base.list);
+ list_add_tail(&req->base.list, &req0->base.list);
+ }
+ }
+
+ acomp_async_reqchain_done(req0, state, op_poll);
+
+ return req0->base.err;
+}
+
+int acomp_do_async_req_chain(struct acomp_req *req,
+ int (*op_submit)(struct acomp_req *req),
+ int (*op_poll)(struct acomp_req *req))
+{
+ struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
+ struct list_head state;
+ struct acomp_req *r2;
+ int err = 0;
+ void *req0_data = req->base.data;
+
+ if (!acomp_request_chained(req) || list_empty(&req->base.list) ||
+ !acomp_is_async(tfm) || !crypto_acomp_req_chain(tfm)) {
+
+ err = op_submit(req);
+
+ if (err == -EINPROGRESS || err == -EBUSY) {
+ bool req0_done = false;
+
+ while (!req0_done) {
+ err = op_poll(req);
+ if (!(err == -EAGAIN || err == -EINPROGRESS || err == -EBUSY)) {
+ req->base.err = err;
+ break;
+ }
+ }
+ } else {
+ req->base.err = err;
+ }
+
+ req->base.data = req0_data;
+ if (acomp_is_async(tfm))
+ req->base.complete(req->base.data, req->base.err);
+
+ return err;
+ }
+
+ err = op_submit(req);
+ req->base.err = err;
+
+ if (err && !(err == -EINPROGRESS || err == -EBUSY))
+ goto err_prop;
+
+ INIT_LIST_HEAD(&state);
+ list_splice(&req->base.list, &state);
+
+ err = acomp_async_reqchain_finish(req, &state, op_submit, op_poll);
+ req->base.data = req0_data;
+ req->base.complete(req->base.data, req->base.err);
+
+ return err;
+
+err_prop:
+ list_for_each_entry(r2, &req->base.list, base.list)
+ r2->base.err = err;
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(acomp_do_async_req_chain);
+
struct acomp_req *acomp_request_alloc(struct crypto_acomp *acomp)
{
struct crypto_tfm *tfm = crypto_acomp_tfm(acomp);
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index 54937b615239..e6783deba3ac 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -206,6 +206,7 @@ static inline void acomp_request_set_callback(struct acomp_req *req,
req->base.data = data;
req->base.flags &= CRYPTO_ACOMP_ALLOC_OUTPUT;
req->base.flags |= flgs & ~CRYPTO_ACOMP_ALLOC_OUTPUT;
+ req->base.flags &= ~CRYPTO_TFM_REQ_CHAIN;
}
/**
@@ -237,6 +238,51 @@ static inline void acomp_request_set_params(struct acomp_req *req,
req->flags |= CRYPTO_ACOMP_ALLOC_OUTPUT;
}
+static inline u32 acomp_request_flags(struct acomp_req *req)
+{
+ return req->base.flags;
+}
+
+static inline void acomp_reqchain_init(struct acomp_req *req,
+ u32 flags, crypto_completion_t compl,
+ void *data)
+{
+ acomp_request_set_callback(req, flags, compl, data);
+ crypto_reqchain_init(&req->base);
+}
+
+static inline bool acomp_is_reqchain(struct acomp_req *req)
+{
+ return crypto_is_reqchain(&req->base);
+}
+
+static inline void acomp_reqchain_clear(struct acomp_req *req, void *data)
+{
+ struct crypto_wait *wait = (struct crypto_wait *)data;
+ reinit_completion(&wait->completion);
+ crypto_reqchain_clear(&req->base);
+ acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+ crypto_req_done, data);
+}
+
+static inline void acomp_request_chain(struct acomp_req *req,
+ struct acomp_req *head)
+{
+ crypto_request_chain(&req->base, &head->base);
+}
+
+int acomp_do_req_chain(struct acomp_req *req,
+ int (*op)(struct acomp_req *req));
+
+int acomp_do_async_req_chain(struct acomp_req *req,
+ int (*op_submit)(struct acomp_req *req),
+ int (*op_poll)(struct acomp_req *req));
+
+static inline int acomp_request_err(struct acomp_req *req)
+{
+ return req->base.err;
+}
+
/**
* crypto_acomp_compress() -- Invoke asynchronous compress operation
*
diff --git a/include/crypto/algapi.h b/include/crypto/algapi.h
index 156de41ca760..c5df380c7d08 100644
--- a/include/crypto/algapi.h
+++ b/include/crypto/algapi.h
@@ -271,4 +271,14 @@ static inline u32 crypto_tfm_alg_type(struct crypto_tfm *tfm)
return tfm->__crt_alg->cra_flags & CRYPTO_ALG_TYPE_MASK;
}
+static inline bool crypto_request_chained(struct crypto_async_request *req)
+{
+ return req->flags & CRYPTO_TFM_REQ_CHAIN;
+}
+
+static inline bool crypto_tfm_req_chain(struct crypto_tfm *tfm)
+{
+ return tfm->__crt_alg->cra_flags & CRYPTO_ALG_REQ_CHAIN;
+}
+
#endif /* _CRYPTO_ALGAPI_H */
diff --git a/include/crypto/internal/acompress.h b/include/crypto/internal/acompress.h
index 8831edaafc05..53b4ef59b48c 100644
--- a/include/crypto/internal/acompress.h
+++ b/include/crypto/internal/acompress.h
@@ -84,6 +84,16 @@ static inline void __acomp_request_free(struct acomp_req *req)
kfree_sensitive(req);
}
+static inline bool acomp_request_chained(struct acomp_req *req)
+{
+ return crypto_request_chained(&req->base);
+}
+
+static inline bool crypto_acomp_req_chain(struct crypto_acomp *tfm)
+{
+ return crypto_tfm_req_chain(&tfm->base);
+}
+
/**
* crypto_register_acomp() -- Register asynchronous compression algorithm
*
diff --git a/include/linux/crypto.h b/include/linux/crypto.h
index b164da5e129e..f1bc282e1ed6 100644
--- a/include/linux/crypto.h
+++ b/include/linux/crypto.h
@@ -13,6 +13,8 @@
#define _LINUX_CRYPTO_H
#include <linux/completion.h>
+#include <linux/errno.h>
+#include <linux/list.h>
#include <linux/refcount.h>
#include <linux/slab.h>
#include <linux/types.h>
@@ -124,6 +126,9 @@
*/
#define CRYPTO_ALG_FIPS_INTERNAL 0x00020000
+/* Set if the algorithm supports request chains. */
+#define CRYPTO_ALG_REQ_CHAIN 0x00040000
+
/*
* Transform masks and values (for crt_flags).
*/
@@ -133,6 +138,7 @@
#define CRYPTO_TFM_REQ_FORBID_WEAK_KEYS 0x00000100
#define CRYPTO_TFM_REQ_MAY_SLEEP 0x00000200
#define CRYPTO_TFM_REQ_MAY_BACKLOG 0x00000400
+#define CRYPTO_TFM_REQ_CHAIN 0x00000800
/*
* Miscellaneous stuff.
@@ -174,6 +180,7 @@ struct crypto_async_request {
struct crypto_tfm *tfm;
u32 flags;
+ int err;
};
/**
@@ -391,6 +398,9 @@ void crypto_req_done(void *req, int err);
static inline int crypto_wait_req(int err, struct crypto_wait *wait)
{
+ if (!wait)
+ return err;
+
switch (err) {
case -EINPROGRESS:
case -EBUSY:
@@ -540,5 +550,34 @@ int crypto_comp_decompress(struct crypto_comp *tfm,
const u8 *src, unsigned int slen,
u8 *dst, unsigned int *dlen);
+static inline void crypto_reqchain_init(struct crypto_async_request *req)
+{
+ req->err = -EINPROGRESS;
+ req->flags |= CRYPTO_TFM_REQ_CHAIN;
+ INIT_LIST_HEAD(&req->list);
+}
+
+static inline bool crypto_is_reqchain(struct crypto_async_request *req)
+{
+ return req->flags & CRYPTO_TFM_REQ_CHAIN;
+}
+
+static inline void crypto_reqchain_clear(struct crypto_async_request *req)
+{
+ req->flags &= ~CRYPTO_TFM_REQ_CHAIN;
+}
+
+static inline void crypto_request_chain(struct crypto_async_request *req,
+ struct crypto_async_request *head)
+{
+ req->err = -EINPROGRESS;
+ list_add_tail(&req->list, &head->list);
+}
+
+static inline bool crypto_tfm_is_async(struct crypto_tfm *tfm)
+{
+ return tfm->__crt_alg->cra_flags & CRYPTO_ALG_ASYNC;
+}
+
#endif /* _LINUX_CRYPTO_H */
--
2.27.0
* [PATCH v8 02/14] crypto: acomp - New interfaces to facilitate batching support in acomp & drivers.
From: Kanchana P Sridhar @ 2025-03-03 8:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
surenb, kristen.c.accardi
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This commit adds a get_batch_size() interface to:
struct acomp_alg
struct crypto_acomp
A crypto_acomp compression algorithm that supports batching of compressions
and decompressions must register and provide an implementation for this
API, so that higher level modules such as zswap and zram can allocate
resources for submitting multiple compress/decompress jobs that can be
batched. In addition, the compression algorithm must register itself
to use request chaining (cra_flags |= CRYPTO_ALG_REQ_CHAIN).
A new helper function acomp_has_async_batching() can be invoked to query if
a crypto_acomp has registered this API. Further, the newly added
crypto_acomp API "crypto_acomp_batch_size()" is provided for use by higher
level modules like zswap and zram. crypto_acomp_batch_size() returns 1 if
the acomp has not provided an implementation for get_batch_size().
For instance, zswap can call crypto_acomp_batch_size() to get the maximum
batch-size supported by the compressor. Based on this, zswap can use the
minimum of any zswap-specific upper limits for batch-size and the
compressor's max batch-size, to allocate batching resources. Further,
the way that zswap can make use of the compressor's batching capability is
by using request chaining to create a list of requests chained to a head
request.
zswap can call crypto_acomp_compress() or crypto_acomp_decompress() with
the head request in the chain for processing the chain as a batch. The call
into crypto for compress/decompress will thus remain the same from zswap's
perspective for both batching and sequential compressions/decompressions.
An acomp_is_reqchain() API is introduced, that a driver can call to query
if a request received from compress/decompress represents a request chain,
and accordingly, process the request chain using either one of:
acomp_do_req_chain()
acomp_do_async_req_chain()
These capabilities allow the iaa_crypto Intel IAA driver to register and
implement the get_batch_size() acomp_alg interface, that can subsequently
be invoked from the kernel zswap/zram modules to construct a request chain
to compress/decompress pages in parallel in the IAA hardware accelerator to
improve swapout/swapin performance.
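As an illustration, a minimal sketch of how a swap module might size and
allocate its batching requests based on this interface is shown below.
MY_MAX_BATCH_SIZE and the function name are hypothetical; the actual zswap
changes appear later in this series:

    #define MY_MAX_BATCH_SIZE 8U  /* hypothetical module-specific upper limit */

    static int my_alloc_batching_reqs(struct crypto_acomp *tfm,
                                      struct acomp_req **reqs,
                                      unsigned int *nr_reqs)
    {
            /* crypto_acomp_batch_size() returns 1 for non-batching compressors. */
            unsigned int i, nr = min(crypto_acomp_batch_size(tfm), MY_MAX_BATCH_SIZE);

            for (i = 0; i < nr; i++) {
                    reqs[i] = acomp_request_alloc(tfm);
                    if (!reqs[i])
                            goto err;
            }

            *nr_reqs = nr;
            return 0;

    err:
            while (i--)
                    acomp_request_free(reqs[i]);
            return -ENOMEM;
    }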
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
crypto/acompress.c | 1 +
include/crypto/acompress.h | 28 ++++++++++++++++++++++++++++
include/crypto/internal/acompress.h | 4 ++++
3 files changed, 33 insertions(+)
diff --git a/crypto/acompress.c b/crypto/acompress.c
index cb6444d09dd7..b2a6c06d7262 100644
--- a/crypto/acompress.c
+++ b/crypto/acompress.c
@@ -84,6 +84,7 @@ static int crypto_acomp_init_tfm(struct crypto_tfm *tfm)
acomp->compress = alg->compress;
acomp->decompress = alg->decompress;
+ acomp->get_batch_size = alg->get_batch_size;
acomp->dst_free = alg->dst_free;
acomp->reqsize = alg->reqsize;
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index e6783deba3ac..147f184b6bea 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -43,6 +43,9 @@ struct acomp_req {
*
* @compress: Function performs a compress operation
* @decompress: Function performs a de-compress operation
+ * @get_batch_size: Maximum batch-size for batching compress/decompress
+ * operations. If registered, the acomp must provide
+ * a batching implementation using request chaining.
* @dst_free: Frees destination buffer if allocated inside the
* algorithm
* @reqsize: Context size for (de)compression requests
@@ -51,6 +54,7 @@ struct acomp_req {
struct crypto_acomp {
int (*compress)(struct acomp_req *req);
int (*decompress)(struct acomp_req *req);
+ unsigned int (*get_batch_size)(void);
void (*dst_free)(struct scatterlist *dst);
unsigned int reqsize;
struct crypto_tfm base;
@@ -142,6 +146,13 @@ static inline bool acomp_is_async(struct crypto_acomp *tfm)
CRYPTO_ALG_ASYNC;
}
+static inline bool acomp_has_async_batching(struct crypto_acomp *tfm)
+{
+ return (acomp_is_async(tfm) &&
+ (crypto_comp_alg_common(tfm)->base.cra_flags & CRYPTO_ALG_TYPE_ACOMPRESS) &&
+ tfm->get_batch_size);
+}
+
static inline struct crypto_acomp *crypto_acomp_reqtfm(struct acomp_req *req)
{
return __crypto_acomp_tfm(req->base.tfm);
@@ -311,4 +322,21 @@ static inline int crypto_acomp_decompress(struct acomp_req *req)
return crypto_acomp_reqtfm(req)->decompress(req);
}
+/**
+ * crypto_acomp_batch_size() -- Get the algorithm's batch size
+ *
+ * Function returns the algorithm's batch size for batching operations
+ *
+ * @tfm: ACOMPRESS tfm handle allocated with crypto_alloc_acomp()
+ *
+ * Return: crypto_acomp's batch size.
+ */
+static inline unsigned int crypto_acomp_batch_size(struct crypto_acomp *tfm)
+{
+ if (acomp_has_async_batching(tfm))
+ return tfm->get_batch_size();
+
+ return 1;
+}
+
#endif
diff --git a/include/crypto/internal/acompress.h b/include/crypto/internal/acompress.h
index 53b4ef59b48c..24b63db56dfb 100644
--- a/include/crypto/internal/acompress.h
+++ b/include/crypto/internal/acompress.h
@@ -17,6 +17,9 @@
*
* @compress: Function performs a compress operation
* @decompress: Function performs a de-compress operation
+ * @get_batch_size: Maximum batch-size for batching compress/decompress
+ * operations. If registered, the acomp must provide
+ * a batching implementation using request chaining.
* @dst_free: Frees destination buffer if allocated inside the algorithm
* @init: Initialize the cryptographic transformation object.
* This function is used to initialize the cryptographic
@@ -37,6 +40,7 @@
struct acomp_alg {
int (*compress)(struct acomp_req *req);
int (*decompress)(struct acomp_req *req);
+ unsigned int (*get_batch_size)(void);
void (*dst_free)(struct scatterlist *dst);
int (*init)(struct crypto_acomp *tfm);
void (*exit)(struct crypto_acomp *tfm);
--
2.27.0
* [PATCH v8 03/14] crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable async mode.
From: Kanchana P Sridhar @ 2025-03-03 8:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
surenb, kristen.c.accardi
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
If the iaa_crypto driver has async_mode set to true, and use_irq set to
false, it can still be forced to use synchronous mode by turning off the
CRYPTO_ACOMP_REQ_POLL flag in req->flags.
In other words, all three of the following need to be true for a request
to be processed in fully async poll mode:
1) async_mode should be "true"
2) use_irq should be "false"
3) req->flags & CRYPTO_ACOMP_REQ_POLL should be "true"
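For example, a caller could opt a request in or out of the fully async
submit/poll path as sketched below; async_mode and use_irq remain
driver-wide attributes, while this flag is per-request. The helper name is
hypothetical and only illustrative:

    static void my_set_poll_mode(struct acomp_req *req, bool poll)
    {
            if (poll)
                    req->flags |= CRYPTO_ACOMP_REQ_POLL;   /* allow submit + poll */
            else
                    req->flags &= ~CRYPTO_ACOMP_REQ_POLL;  /* force synchronous path */
    }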
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
drivers/crypto/intel/iaa/iaa_crypto_main.c | 11 ++++++++++-
include/crypto/acompress.h | 5 +++++
2 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index c3776b0de51d..d7983ab3c34a 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1520,6 +1520,10 @@ static int iaa_comp_acompress(struct acomp_req *req)
return -EINVAL;
}
+ /* If the caller has requested no polling, disable async. */
+ if (!(req->flags & CRYPTO_ACOMP_REQ_POLL))
+ disable_async = true;
+
cpu = get_cpu();
wq = wq_table_next_wq(cpu);
put_cpu();
@@ -1712,6 +1716,7 @@ static int iaa_comp_adecompress(struct acomp_req *req)
{
struct crypto_tfm *tfm = req->base.tfm;
dma_addr_t src_addr, dst_addr;
+ bool disable_async = false;
int nr_sgs, cpu, ret = 0;
struct iaa_wq *iaa_wq;
struct device *dev;
@@ -1727,6 +1732,10 @@ static int iaa_comp_adecompress(struct acomp_req *req)
return -EINVAL;
}
+ /* If the caller has requested no polling, disable async. */
+ if (!(req->flags & CRYPTO_ACOMP_REQ_POLL))
+ disable_async = true;
+
if (!req->dst)
return iaa_comp_adecompress_alloc_dest(req);
@@ -1775,7 +1784,7 @@ static int iaa_comp_adecompress(struct acomp_req *req)
req->dst, req->dlen, sg_dma_len(req->dst));
ret = iaa_decompress(tfm, req, wq, src_addr, req->slen,
- dst_addr, &req->dlen, false);
+ dst_addr, &req->dlen, disable_async);
if (ret == -EINPROGRESS)
return ret;
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index 147f184b6bea..afadf84f236d 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -14,6 +14,11 @@
#include <linux/crypto.h>
#define CRYPTO_ACOMP_ALLOC_OUTPUT 0x00000001
+/*
+ * If set, the driver must have a way to submit the req, then
+ * poll its completion status for success/error.
+ */
+#define CRYPTO_ACOMP_REQ_POLL 0x00000002
#define CRYPTO_ACOMP_DST_MAX 131072
/**
--
2.27.0
* [PATCH v8 04/14] crypto: iaa - Implement batch compression/decompression with request chaining.
From: Kanchana P Sridhar @ 2025-03-03 8:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
surenb, kristen.c.accardi
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch provides the iaa_crypto driver implementation for the newly added
crypto_acomp "get_batch_size()" interface which will be called when swap
modules invoke crypto_acomp_batch_size() to query for the maximum batch size.
This will return an iaa_crypto-specific constant, IAA_CRYPTO_MAX_BATCH_SIZE
(set to 8U currently).
This allows swap modules such as zswap/zram to allocate the required
batching resources and then invoke fully asynchronous, parallel batch
compression/decompression of pages on systems with Intel IAA, by setting
up a request chain and calling crypto_acomp_compress() or
crypto_acomp_decompress() with the head request in the chain.
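A rough caller-side sketch of that flow is shown below. This is purely
illustrative and not zswap's actual code: acomp_request_chain() is assumed
to be the chaining helper added in patch 1 of this series, and reqs[],
nr_pages, tfm and the error handling are hypothetical caller-side names
(callback/src/dst setup is omitted for brevity):

  /* Hedged illustration only -- not the actual zswap implementation. */
  DECLARE_CRYPTO_WAIT(wait);
  unsigned int i, nr = min(nr_pages, crypto_acomp_batch_size(tfm));
  int err;

  /* Chain the remaining requests behind the head request, reqs[0]. */
  for (i = 1; i < nr; i++)
          acomp_request_chain(reqs[i], reqs[0]);

  /* One call submits the whole chain; IAA batch-compresses it. */
  err = crypto_wait_req(crypto_acomp_compress(reqs[0]), &wait);

  /* Per-request status must be checked individually. */
  for (i = 0; i < nr; i++)
          if (acomp_request_err(reqs[i]))
                  err = -EINVAL;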
This enables zswap compress batching code to be developed in a manner
similar to the current single-page synchronous calls to
crypto_acomp_compress() and crypto_acomp_decompress(), thereby
facilitating an encapsulated and modular hand-off between the kernel
zswap/zram code and the crypto_acomp layer.
This patch also provides implementations of IAA batching with request
chaining for both iaa_crypto sync modes: asynchronous/no-irq and fully
synchronous.
Since iaa_crypto supports the use of acomp request chaining, this patch
also adds CRYPTO_ALG_REQ_CHAIN to the iaa_acomp_fixed_deflate algorithm's
cra_flags.
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
drivers/crypto/intel/iaa/iaa_crypto.h | 9 +
drivers/crypto/intel/iaa/iaa_crypto_main.c | 186 ++++++++++++++++++++-
2 files changed, 192 insertions(+), 3 deletions(-)
diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index 56985e395263..45d94a646636 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -39,6 +39,15 @@
IAA_DECOMP_CHECK_FOR_EOB | \
IAA_DECOMP_STOP_ON_EOB)
+/*
+ * The maximum compress/decompress batch size for IAA's implementation of
+ * batched compressions/decompressions.
+ * The IAA compression algorithms should provide the crypto_acomp
+ * get_batch_size() interface through a function that returns this
+ * constant.
+ */
+#define IAA_CRYPTO_MAX_BATCH_SIZE 8U
+
/* Representation of IAA workqueue */
struct iaa_wq {
struct list_head list;
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index d7983ab3c34a..a9800b8f3575 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1807,6 +1807,185 @@ static void compression_ctx_init(struct iaa_compression_ctx *ctx)
ctx->use_irq = use_irq;
}
+static int iaa_comp_poll(struct acomp_req *req)
+{
+ struct idxd_desc *idxd_desc;
+ struct idxd_device *idxd;
+ struct iaa_wq *iaa_wq;
+ struct pci_dev *pdev;
+ struct device *dev;
+ struct idxd_wq *wq;
+ bool compress_op;
+ int ret;
+
+ idxd_desc = req->base.data;
+ if (!idxd_desc)
+ return -EAGAIN;
+
+ compress_op = (idxd_desc->iax_hw->opcode == IAX_OPCODE_COMPRESS);
+ wq = idxd_desc->wq;
+ iaa_wq = idxd_wq_get_private(wq);
+ idxd = iaa_wq->iaa_device->idxd;
+ pdev = idxd->pdev;
+ dev = &pdev->dev;
+
+ ret = check_completion(dev, idxd_desc->iax_completion, true, true);
+ if (ret == -EAGAIN)
+ return ret;
+ if (ret)
+ goto out;
+
+ req->dlen = idxd_desc->iax_completion->output_size;
+
+ /* Update stats */
+ if (compress_op) {
+ update_total_comp_bytes_out(req->dlen);
+ update_wq_comp_bytes(wq, req->dlen);
+ } else {
+ update_total_decomp_bytes_in(req->slen);
+ update_wq_decomp_bytes(wq, req->slen);
+ }
+
+ if (iaa_verify_compress && (idxd_desc->iax_hw->opcode == IAX_OPCODE_COMPRESS)) {
+ struct crypto_tfm *tfm = req->base.tfm;
+ dma_addr_t src_addr, dst_addr;
+ u32 compression_crc;
+
+ compression_crc = idxd_desc->iax_completion->crc;
+
+ dma_sync_sg_for_device(dev, req->dst, 1, DMA_FROM_DEVICE);
+ dma_sync_sg_for_device(dev, req->src, 1, DMA_TO_DEVICE);
+
+ src_addr = sg_dma_address(req->src);
+ dst_addr = sg_dma_address(req->dst);
+
+ ret = iaa_compress_verify(tfm, req, wq, src_addr, req->slen,
+ dst_addr, &req->dlen, compression_crc);
+ }
+out:
+ /* caller doesn't call crypto_wait_req, so no acomp_request_complete() */
+
+ dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
+ dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
+
+ idxd_free_desc(idxd_desc->wq, idxd_desc);
+
+ dev_dbg(dev, "%s: returning ret=%d\n", __func__, ret);
+
+ return ret;
+}
+
+static unsigned int iaa_comp_get_batch_size(void)
+{
+ return IAA_CRYPTO_MAX_BATCH_SIZE;
+}
+
+static void iaa_set_reqchain_poll(
+ struct acomp_req *req0,
+ bool set_flag)
+{
+ struct acomp_req *req;
+
+ set_flag ? (req0->flags |= CRYPTO_ACOMP_REQ_POLL) :
+ (req0->flags &= ~CRYPTO_ACOMP_REQ_POLL);
+
+ list_for_each_entry(req, &req0->base.list, base.list)
+ set_flag ? (req->flags |= CRYPTO_ACOMP_REQ_POLL) :
+ (req->flags &= ~CRYPTO_ACOMP_REQ_POLL);
+}
+
+/**
+ * This API provides IAA compress batching functionality for use by swap
+ * modules. Batching is implemented using request chaining.
+ *
+ * @req: The head asynchronous compress request in the chain.
+ *
+ * Returns the compression error status (0 or -errno) of the last
+ * request that finishes. Caller should call acomp_request_err()
+ * for each request in the chain, to get its error status.
+ */
+static int iaa_comp_acompress_batch(struct acomp_req *req)
+{
+ bool async = (async_mode && !use_irq);
+ int err = 0;
+
+ if (likely(async))
+ iaa_set_reqchain_poll(req, true);
+ else
+ iaa_set_reqchain_poll(req, false);
+
+
+ if (likely(async))
+ /* Process the request chain in parallel. */
+ err = acomp_do_async_req_chain(req, iaa_comp_acompress, iaa_comp_poll);
+ else
+ /* Process the request chain in series. */
+ err = acomp_do_req_chain(req, iaa_comp_acompress);
+
+ /*
+ * For the same request chain to be usable by
+ * iaa_comp_acompress()/iaa_comp_adecompress() in synchronous mode,
+ * clear the CRYPTO_ACOMP_REQ_POLL bit on all acomp_reqs.
+ */
+ iaa_set_reqchain_poll(req, false);
+
+ return err;
+}
+
+/**
+ * This API provides IAA decompress batching functionality for use by swap
+ * modules. Batching is implemented using request chaining.
+ *
+ * @req: The head asynchronous decompress request in the chain.
+ *
+ * Returns the decompression error status (0 or -errno) of the last
+ * request that finishes. Caller should call acomp_request_err()
+ * for each request in the chain, to get its error status.
+ */
+static int iaa_comp_adecompress_batch(struct acomp_req *req)
+{
+ bool async = (async_mode && !use_irq);
+ int err = 0;
+
+ if (likely(async))
+ iaa_set_reqchain_poll(req, true);
+ else
+ iaa_set_reqchain_poll(req, false);
+
+
+ if (likely(async))
+ /* Process the request chain in parallel. */
+ err = acomp_do_async_req_chain(req, iaa_comp_adecompress, iaa_comp_poll);
+ else
+ /* Process the request chain in series. */
+ err = acomp_do_req_chain(req, iaa_comp_adecompress);
+
+ /*
+ * For the same request chain to be usable by
+ * iaa_comp_acompress()/iaa_comp_adecompress() in synchronous mode,
+ * clear the CRYPTO_ACOMP_REQ_POLL bit on all acomp_reqs.
+ */
+ iaa_set_reqchain_poll(req, false);
+
+ return err;
+}
+
+static int iaa_compress_main(struct acomp_req *req)
+{
+ if (acomp_is_reqchain(req))
+ return iaa_comp_acompress_batch(req);
+
+ return iaa_comp_acompress(req);
+}
+
+static int iaa_decompress_main(struct acomp_req *req)
+{
+ if (acomp_is_reqchain(req))
+ return iaa_comp_adecompress_batch(req);
+
+ return iaa_comp_adecompress(req);
+}
+
static int iaa_comp_init_fixed(struct crypto_acomp *acomp_tfm)
{
struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
@@ -1829,13 +2008,14 @@ static void dst_free(struct scatterlist *sgl)
static struct acomp_alg iaa_acomp_fixed_deflate = {
.init = iaa_comp_init_fixed,
- .compress = iaa_comp_acompress,
- .decompress = iaa_comp_adecompress,
+ .compress = iaa_compress_main,
+ .decompress = iaa_decompress_main,
.dst_free = dst_free,
+ .get_batch_size = iaa_comp_get_batch_size,
.base = {
.cra_name = "deflate",
.cra_driver_name = "deflate-iaa",
- .cra_flags = CRYPTO_ALG_ASYNC,
+ .cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_REQ_CHAIN,
.cra_ctxsize = sizeof(struct iaa_compression_ctx),
.cra_module = THIS_MODULE,
.cra_priority = IAA_ALG_PRIORITY,
--
2.27.0
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH v8 05/14] crypto: iaa - Enable async mode and make it the default.
2025-03-03 8:47 [PATCH v8 00/14] zswap IAA compress batching Kanchana P Sridhar
` (3 preceding siblings ...)
2025-03-03 8:47 ` [PATCH v8 04/14] crypto: iaa - Implement batch compression/decompression with request chaining Kanchana P Sridhar
@ 2025-03-03 8:47 ` Kanchana P Sridhar
2025-03-03 8:47 ` [PATCH v8 06/14] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
` (8 subsequent siblings)
13 siblings, 0 replies; 44+ messages in thread
From: Kanchana P Sridhar @ 2025-03-03 8:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
surenb, kristen.c.accardi
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch enables the 'async' sync_mode in the driver and makes it the
default (previously, 'sync' was the default). This makes it easier for IAA
hardware acceleration in the iaa_crypto driver to be loaded by default in
the most efficient/recommended 'async' mode for parallel
compressions/decompressions, namely, asynchronous submission of
descriptors followed by polling for job completions, with or without
request chaining.
This way, anyone who wants to use IAA for zswap/zram can do so after
building the kernel, without having to go through these steps to enable
async mode:
1) disable all the IAA device/wq bindings that happen at boot time
2) rmmod iaa_crypto
3) modprobe iaa_crypto
4) echo async > /sys/bus/dsa/drivers/crypto/sync_mode
5) re-run initialization of the IAA devices and wqs
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
Documentation/driver-api/crypto/iaa/iaa-crypto.rst | 11 ++---------
drivers/crypto/intel/iaa/iaa_crypto_main.c | 4 ++--
2 files changed, 4 insertions(+), 11 deletions(-)
diff --git a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
index 8e50b900d51c..782da5230fcd 100644
--- a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
+++ b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
@@ -272,7 +272,7 @@ The available attributes are:
echo async_irq > /sys/bus/dsa/drivers/crypto/sync_mode
Async mode without interrupts (caller must poll) can be enabled by
- writing 'async' to it (please see Caveat)::
+ writing 'async' to it::
echo async > /sys/bus/dsa/drivers/crypto/sync_mode
@@ -281,14 +281,7 @@ The available attributes are:
echo sync > /sys/bus/dsa/drivers/crypto/sync_mode
- The default mode is 'sync'.
-
- Caveat: since the only mechanism that iaa_crypto currently implements
- for async polling without interrupts is via the 'sync' mode as
- described earlier, writing 'async' to
- '/sys/bus/dsa/drivers/crypto/sync_mode' will internally enable the
- 'sync' mode. This is to ensure correct iaa_crypto behavior until true
- async polling without interrupts is enabled in iaa_crypto.
+ The default mode is 'async'.
.. _iaa_default_config:
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index a9800b8f3575..4dac4852c113 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -153,7 +153,7 @@ static DRIVER_ATTR_RW(verify_compress);
*/
/* Use async mode */
-static bool async_mode;
+static bool async_mode = true;
/* Use interrupts */
static bool use_irq;
@@ -173,7 +173,7 @@ static int set_iaa_sync_mode(const char *name)
async_mode = false;
use_irq = false;
} else if (sysfs_streq(name, "async")) {
- async_mode = false;
+ async_mode = true;
use_irq = false;
} else if (sysfs_streq(name, "async_irq")) {
async_mode = true;
--
2.27.0
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH v8 06/14] crypto: iaa - Disable iaa_verify_compress by default.
2025-03-03 8:47 [PATCH v8 00/14] zswap IAA compress batching Kanchana P Sridhar
` (4 preceding siblings ...)
2025-03-03 8:47 ` [PATCH v8 05/14] crypto: iaa - Enable async mode and make it the default Kanchana P Sridhar
@ 2025-03-03 8:47 ` Kanchana P Sridhar
2025-03-03 8:47 ` [PATCH v8 07/14] crypto: iaa - Re-organize the iaa_crypto driver code Kanchana P Sridhar
` (7 subsequent siblings)
13 siblings, 0 replies; 44+ messages in thread
From: Kanchana P Sridhar @ 2025-03-03 8:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
surenb, kristen.c.accardi
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch disables "iaa_verify_compress" by default, making it easier for
IAA hardware acceleration in the iaa_crypto driver to be loaded in a
configuration comparable to software compressors (which do not run
compress verification), thereby facilitating performance comparisons.
Earlier, iaa_crypto compress verification was enabled by default.
With this patch, if users want to enable compress verification, they can do
so with these steps:
1) disable all the IAA device/wq bindings that happen at boot time
2) rmmod iaa_crypto
3) modprobe iaa_crypto
4) echo 1 > /sys/bus/dsa/drivers/crypto/verify_compress
5) re-run initialization of the IAA devices and wqs
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
drivers/crypto/intel/iaa/iaa_crypto_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 4dac4852c113..5038fd7ced02 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -94,7 +94,7 @@ static bool iaa_crypto_enabled;
static bool iaa_crypto_registered;
/* Verify results of IAA compress or not */
-static bool iaa_verify_compress = true;
+static bool iaa_verify_compress = false;
static ssize_t verify_compress_show(struct device_driver *driver, char *buf)
{
--
2.27.0
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH v8 07/14] crypto: iaa - Re-organize the iaa_crypto driver code.
2025-03-03 8:47 [PATCH v8 00/14] zswap IAA compress batching Kanchana P Sridhar
` (5 preceding siblings ...)
2025-03-03 8:47 ` [PATCH v8 06/14] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
@ 2025-03-03 8:47 ` Kanchana P Sridhar
2025-03-03 8:47 ` [PATCH v8 08/14] crypto: iaa - Map IAA devices/wqs to cores based on packages instead of NUMA Kanchana P Sridhar
` (6 subsequent siblings)
13 siblings, 0 replies; 44+ messages in thread
From: Kanchana P Sridhar @ 2025-03-03 8:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
surenb, kristen.c.accardi
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch merely reorganizes the code in iaa_crypto_main.c so that the
functions are consolidated into logically related sub-sections of code.
This is expected to make the code more maintainable and to make it easier
to replace functional layers and/or add new features.
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
drivers/crypto/intel/iaa/iaa_crypto_main.c | 540 +++++++++++----------
1 file changed, 275 insertions(+), 265 deletions(-)
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 5038fd7ced02..abaee160e5ec 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -24,6 +24,9 @@
#define IAA_ALG_PRIORITY 300
+/**************************************
+ * Driver internal global variables.
+ **************************************/
/* number of iaa instances probed */
static unsigned int nr_iaa;
static unsigned int nr_cpus;
@@ -36,55 +39,46 @@ static unsigned int cpus_per_iaa;
static struct crypto_comp *deflate_generic_tfm;
/* Per-cpu lookup table for balanced wqs */
-static struct wq_table_entry __percpu *wq_table;
+static struct wq_table_entry __percpu *wq_table = NULL;
-static struct idxd_wq *wq_table_next_wq(int cpu)
-{
- struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
- if (++entry->cur_wq >= entry->n_wqs)
- entry->cur_wq = 0;
-
- if (!entry->wqs[entry->cur_wq])
- return NULL;
-
- pr_debug("%s: returning wq at idx %d (iaa wq %d.%d) from cpu %d\n", __func__,
- entry->cur_wq, entry->wqs[entry->cur_wq]->idxd->id,
- entry->wqs[entry->cur_wq]->id, cpu);
-
- return entry->wqs[entry->cur_wq];
-}
-
-static void wq_table_add(int cpu, struct idxd_wq *wq)
-{
- struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
- if (WARN_ON(entry->n_wqs == entry->max_wqs))
- return;
-
- entry->wqs[entry->n_wqs++] = wq;
-
- pr_debug("%s: added iaa wq %d.%d to idx %d of cpu %d\n", __func__,
- entry->wqs[entry->n_wqs - 1]->idxd->id,
- entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
-}
-
-static void wq_table_free_entry(int cpu)
-{
- struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+/* Verify results of IAA compress or not */
+static bool iaa_verify_compress = false;
- kfree(entry->wqs);
- memset(entry, 0, sizeof(*entry));
-}
+/*
+ * The iaa crypto driver supports three 'sync' methods determining how
+ * compressions and decompressions are performed:
+ *
+ * - sync: the compression or decompression completes before
+ * returning. This is the mode used by the async crypto
+ * interface when the sync mode is set to 'sync' and by
+ * the sync crypto interface regardless of setting.
+ *
+ * - async: the compression or decompression is submitted and returns
+ * immediately. Completion interrupts are not used so
+ * the caller is responsible for polling the descriptor
+ * for completion. This mode is applicable to only the
+ * async crypto interface and is ignored for anything
+ * else.
+ *
+ * - async_irq: the compression or decompression is submitted and
+ * returns immediately. Completion interrupts are
+ * enabled so the caller can wait for the completion and
+ * yield to other threads. When the compression or
+ * decompression completes, the completion is signaled
+ * and the caller awakened. This mode is applicable to
+ * only the async crypto interface and is ignored for
+ * anything else.
+ *
+ * These modes can be set using the iaa_crypto sync_mode driver
+ * attribute.
+ */
-static void wq_table_clear_entry(int cpu)
-{
- struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+/* Use async mode */
+static bool async_mode = true;
+/* Use interrupts */
+static bool use_irq;
- entry->n_wqs = 0;
- entry->cur_wq = 0;
- memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
-}
+static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
LIST_HEAD(iaa_devices);
DEFINE_MUTEX(iaa_devices_lock);
@@ -93,9 +87,9 @@ DEFINE_MUTEX(iaa_devices_lock);
static bool iaa_crypto_enabled;
static bool iaa_crypto_registered;
-/* Verify results of IAA compress or not */
-static bool iaa_verify_compress = false;
-
+/**************************************************
+ * Driver attributes along with get/set functions.
+ **************************************************/
static ssize_t verify_compress_show(struct device_driver *driver, char *buf)
{
return sprintf(buf, "%d\n", iaa_verify_compress);
@@ -123,40 +117,6 @@ static ssize_t verify_compress_store(struct device_driver *driver,
}
static DRIVER_ATTR_RW(verify_compress);
-/*
- * The iaa crypto driver supports three 'sync' methods determining how
- * compressions and decompressions are performed:
- *
- * - sync: the compression or decompression completes before
- * returning. This is the mode used by the async crypto
- * interface when the sync mode is set to 'sync' and by
- * the sync crypto interface regardless of setting.
- *
- * - async: the compression or decompression is submitted and returns
- * immediately. Completion interrupts are not used so
- * the caller is responsible for polling the descriptor
- * for completion. This mode is applicable to only the
- * async crypto interface and is ignored for anything
- * else.
- *
- * - async_irq: the compression or decompression is submitted and
- * returns immediately. Completion interrupts are
- * enabled so the caller can wait for the completion and
- * yield to other threads. When the compression or
- * decompression completes, the completion is signaled
- * and the caller awakened. This mode is applicable to
- * only the async crypto interface and is ignored for
- * anything else.
- *
- * These modes can be set using the iaa_crypto sync_mode driver
- * attribute.
- */
-
-/* Use async mode */
-static bool async_mode = true;
-/* Use interrupts */
-static bool use_irq;
-
/**
* set_iaa_sync_mode - Set IAA sync mode
* @name: The name of the sync mode
@@ -219,8 +179,9 @@ static ssize_t sync_mode_store(struct device_driver *driver,
}
static DRIVER_ATTR_RW(sync_mode);
-static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
-
+/****************************
+ * Driver compression modes.
+ ****************************/
static int find_empty_iaa_compression_mode(void)
{
int i = -EINVAL;
@@ -411,11 +372,6 @@ static void free_device_compression_mode(struct iaa_device *iaa_device,
IDXD_OP_FLAG_WR_SRC2_AECS_COMP | \
IDXD_OP_FLAG_AECS_RW_TGLS)
-static int check_completion(struct device *dev,
- struct iax_completion_record *comp,
- bool compress,
- bool only_once);
-
static int init_device_compression_mode(struct iaa_device *iaa_device,
struct iaa_compression_mode *mode,
int idx, struct idxd_wq *wq)
@@ -502,6 +458,10 @@ static void remove_device_compression_modes(struct iaa_device *iaa_device)
}
}
+/***********************************************************
+ * Functions for use in crypto probe and remove interfaces:
+ * allocate/init/query/deallocate devices/wqs.
+ ***********************************************************/
static struct iaa_device *iaa_device_alloc(void)
{
struct iaa_device *iaa_device;
@@ -614,16 +574,6 @@ static void del_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
}
}
-static void clear_wq_table(void)
-{
- int cpu;
-
- for (cpu = 0; cpu < nr_cpus; cpu++)
- wq_table_clear_entry(cpu);
-
- pr_debug("cleared wq table\n");
-}
-
static void free_iaa_device(struct iaa_device *iaa_device)
{
if (!iaa_device)
@@ -704,43 +654,6 @@ static int iaa_wq_put(struct idxd_wq *wq)
return ret;
}
-static void free_wq_table(void)
-{
- int cpu;
-
- for (cpu = 0; cpu < nr_cpus; cpu++)
- wq_table_free_entry(cpu);
-
- free_percpu(wq_table);
-
- pr_debug("freed wq table\n");
-}
-
-static int alloc_wq_table(int max_wqs)
-{
- struct wq_table_entry *entry;
- int cpu;
-
- wq_table = alloc_percpu(struct wq_table_entry);
- if (!wq_table)
- return -ENOMEM;
-
- for (cpu = 0; cpu < nr_cpus; cpu++) {
- entry = per_cpu_ptr(wq_table, cpu);
- entry->wqs = kcalloc(max_wqs, sizeof(struct wq *), GFP_KERNEL);
- if (!entry->wqs) {
- free_wq_table();
- return -ENOMEM;
- }
-
- entry->max_wqs = max_wqs;
- }
-
- pr_debug("initialized wq table\n");
-
- return 0;
-}
-
static int save_iaa_wq(struct idxd_wq *wq)
{
struct iaa_device *iaa_device, *found = NULL;
@@ -829,6 +742,87 @@ static void remove_iaa_wq(struct idxd_wq *wq)
cpus_per_iaa = 1;
}
+/***************************************************************
+ * Mapping IAA devices and wqs to cores with per-cpu wq_tables.
+ ***************************************************************/
+static void wq_table_free_entry(int cpu)
+{
+ struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+ kfree(entry->wqs);
+ memset(entry, 0, sizeof(*entry));
+}
+
+static void wq_table_clear_entry(int cpu)
+{
+ struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+ entry->n_wqs = 0;
+ entry->cur_wq = 0;
+ memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
+}
+
+static void clear_wq_table(void)
+{
+ int cpu;
+
+ for (cpu = 0; cpu < nr_cpus; cpu++)
+ wq_table_clear_entry(cpu);
+
+ pr_debug("cleared wq table\n");
+}
+
+static void free_wq_table(void)
+{
+ int cpu;
+
+ for (cpu = 0; cpu < nr_cpus; cpu++)
+ wq_table_free_entry(cpu);
+
+ free_percpu(wq_table);
+
+ pr_debug("freed wq table\n");
+}
+
+static int alloc_wq_table(int max_wqs)
+{
+ struct wq_table_entry *entry;
+ int cpu;
+
+ wq_table = alloc_percpu(struct wq_table_entry);
+ if (!wq_table)
+ return -ENOMEM;
+
+ for (cpu = 0; cpu < nr_cpus; cpu++) {
+ entry = per_cpu_ptr(wq_table, cpu);
+ entry->wqs = kcalloc(max_wqs, sizeof(struct wq *), GFP_KERNEL);
+ if (!entry->wqs) {
+ free_wq_table();
+ return -ENOMEM;
+ }
+
+ entry->max_wqs = max_wqs;
+ }
+
+ pr_debug("initialized wq table\n");
+
+ return 0;
+}
+
+static void wq_table_add(int cpu, struct idxd_wq *wq)
+{
+ struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+ if (WARN_ON(entry->n_wqs == entry->max_wqs))
+ return;
+
+ entry->wqs[entry->n_wqs++] = wq;
+
+ pr_debug("%s: added iaa wq %d.%d to idx %d of cpu %d\n", __func__,
+ entry->wqs[entry->n_wqs - 1]->idxd->id,
+ entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
+}
+
static int wq_table_add_wqs(int iaa, int cpu)
{
struct iaa_device *iaa_device, *found_device = NULL;
@@ -939,6 +933,29 @@ static void rebalance_wq_table(void)
}
}
+/***************************************************************
+ * Assign work-queues for driver ops using per-cpu wq_tables.
+ ***************************************************************/
+static struct idxd_wq *wq_table_next_wq(int cpu)
+{
+ struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+ if (++entry->cur_wq >= entry->n_wqs)
+ entry->cur_wq = 0;
+
+ if (!entry->wqs[entry->cur_wq])
+ return NULL;
+
+ pr_debug("%s: returning wq at idx %d (iaa wq %d.%d) from cpu %d\n", __func__,
+ entry->cur_wq, entry->wqs[entry->cur_wq]->idxd->id,
+ entry->wqs[entry->cur_wq]->id, cpu);
+
+ return entry->wqs[entry->cur_wq];
+}
+
+/*************************************************
+ * Core iaa_crypto compress/decompress functions.
+ *************************************************/
static inline int check_completion(struct device *dev,
struct iax_completion_record *comp,
bool compress,
@@ -1020,13 +1037,130 @@ static int deflate_generic_decompress(struct acomp_req *req)
static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
struct acomp_req *req,
- dma_addr_t *src_addr, dma_addr_t *dst_addr);
+ dma_addr_t *src_addr, dma_addr_t *dst_addr)
+{
+ int ret = 0;
+ int nr_sgs;
+
+ dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
+ dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
+
+ nr_sgs = dma_map_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
+ if (nr_sgs <= 0 || nr_sgs > 1) {
+ dev_dbg(dev, "verify: couldn't map src sg for iaa device %d,"
+ " wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
+ iaa_wq->wq->id, ret);
+ ret = -EIO;
+ goto out;
+ }
+ *src_addr = sg_dma_address(req->src);
+ dev_dbg(dev, "verify: dma_map_sg, src_addr %llx, nr_sgs %d, req->src %p,"
+ " req->slen %d, sg_dma_len(sg) %d\n", *src_addr, nr_sgs,
+ req->src, req->slen, sg_dma_len(req->src));
+
+ nr_sgs = dma_map_sg(dev, req->dst, sg_nents(req->dst), DMA_TO_DEVICE);
+ if (nr_sgs <= 0 || nr_sgs > 1) {
+ dev_dbg(dev, "verify: couldn't map dst sg for iaa device %d,"
+ " wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
+ iaa_wq->wq->id, ret);
+ ret = -EIO;
+ dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
+ goto out;
+ }
+ *dst_addr = sg_dma_address(req->dst);
+ dev_dbg(dev, "verify: dma_map_sg, dst_addr %llx, nr_sgs %d, req->dst %p,"
+ " req->dlen %d, sg_dma_len(sg) %d\n", *dst_addr, nr_sgs,
+ req->dst, req->dlen, sg_dma_len(req->dst));
+out:
+ return ret;
+}
static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
struct idxd_wq *wq,
dma_addr_t src_addr, unsigned int slen,
dma_addr_t dst_addr, unsigned int *dlen,
- u32 compression_crc);
+ u32 compression_crc)
+{
+ struct iaa_device_compression_mode *active_compression_mode;
+ struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
+ struct iaa_device *iaa_device;
+ struct idxd_desc *idxd_desc;
+ struct iax_hw_desc *desc;
+ struct idxd_device *idxd;
+ struct iaa_wq *iaa_wq;
+ struct pci_dev *pdev;
+ struct device *dev;
+ int ret = 0;
+
+ iaa_wq = idxd_wq_get_private(wq);
+ iaa_device = iaa_wq->iaa_device;
+ idxd = iaa_device->idxd;
+ pdev = idxd->pdev;
+ dev = &pdev->dev;
+
+ active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
+
+ idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
+ if (IS_ERR(idxd_desc)) {
+ dev_dbg(dev, "idxd descriptor allocation failed\n");
+ dev_dbg(dev, "iaa compress failed: ret=%ld\n",
+ PTR_ERR(idxd_desc));
+ return PTR_ERR(idxd_desc);
+ }
+ desc = idxd_desc->iax_hw;
+
+ /* Verify (optional) - decompress and check crc, suppress dest write */
+
+ desc->flags = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CC;
+ desc->opcode = IAX_OPCODE_DECOMPRESS;
+ desc->decompr_flags = IAA_DECOMP_FLAGS | IAA_DECOMP_SUPPRESS_OUTPUT;
+ desc->priv = 0;
+
+ desc->src1_addr = (u64)dst_addr;
+ desc->src1_size = *dlen;
+ desc->dst_addr = (u64)src_addr;
+ desc->max_dst_size = slen;
+ desc->completion_addr = idxd_desc->compl_dma;
+
+ dev_dbg(dev, "(verify) compression mode %s,"
+ " desc->src1_addr %llx, desc->src1_size %d,"
+ " desc->dst_addr %llx, desc->max_dst_size %d,"
+ " desc->src2_addr %llx, desc->src2_size %d\n",
+ active_compression_mode->name,
+ desc->src1_addr, desc->src1_size, desc->dst_addr,
+ desc->max_dst_size, desc->src2_addr, desc->src2_size);
+
+ ret = idxd_submit_desc(wq, idxd_desc);
+ if (ret) {
+ dev_dbg(dev, "submit_desc (verify) failed ret=%d\n", ret);
+ goto err;
+ }
+
+ ret = check_completion(dev, idxd_desc->iax_completion, false, false);
+ if (ret) {
+ dev_dbg(dev, "(verify) check_completion failed ret=%d\n", ret);
+ goto err;
+ }
+
+ if (compression_crc != idxd_desc->iax_completion->crc) {
+ ret = -EINVAL;
+ dev_dbg(dev, "(verify) iaa comp/decomp crc mismatch:"
+ " comp=0x%x, decomp=0x%x\n", compression_crc,
+ idxd_desc->iax_completion->crc);
+ print_hex_dump(KERN_INFO, "cmp-rec: ", DUMP_PREFIX_OFFSET,
+ 8, 1, idxd_desc->iax_completion, 64, 0);
+ goto err;
+ }
+
+ idxd_free_desc(wq, idxd_desc);
+out:
+ return ret;
+err:
+ idxd_free_desc(wq, idxd_desc);
+ dev_dbg(dev, "iaa compress failed: ret=%d\n", ret);
+
+ goto out;
+}
static void iaa_desc_complete(struct idxd_desc *idxd_desc,
enum idxd_complete_type comp_type,
@@ -1245,133 +1379,6 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
goto out;
}
-static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
- struct acomp_req *req,
- dma_addr_t *src_addr, dma_addr_t *dst_addr)
-{
- int ret = 0;
- int nr_sgs;
-
- dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
- dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
-
- nr_sgs = dma_map_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
- if (nr_sgs <= 0 || nr_sgs > 1) {
- dev_dbg(dev, "verify: couldn't map src sg for iaa device %d,"
- " wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
- iaa_wq->wq->id, ret);
- ret = -EIO;
- goto out;
- }
- *src_addr = sg_dma_address(req->src);
- dev_dbg(dev, "verify: dma_map_sg, src_addr %llx, nr_sgs %d, req->src %p,"
- " req->slen %d, sg_dma_len(sg) %d\n", *src_addr, nr_sgs,
- req->src, req->slen, sg_dma_len(req->src));
-
- nr_sgs = dma_map_sg(dev, req->dst, sg_nents(req->dst), DMA_TO_DEVICE);
- if (nr_sgs <= 0 || nr_sgs > 1) {
- dev_dbg(dev, "verify: couldn't map dst sg for iaa device %d,"
- " wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
- iaa_wq->wq->id, ret);
- ret = -EIO;
- dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
- goto out;
- }
- *dst_addr = sg_dma_address(req->dst);
- dev_dbg(dev, "verify: dma_map_sg, dst_addr %llx, nr_sgs %d, req->dst %p,"
- " req->dlen %d, sg_dma_len(sg) %d\n", *dst_addr, nr_sgs,
- req->dst, req->dlen, sg_dma_len(req->dst));
-out:
- return ret;
-}
-
-static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
- struct idxd_wq *wq,
- dma_addr_t src_addr, unsigned int slen,
- dma_addr_t dst_addr, unsigned int *dlen,
- u32 compression_crc)
-{
- struct iaa_device_compression_mode *active_compression_mode;
- struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
- struct iaa_device *iaa_device;
- struct idxd_desc *idxd_desc;
- struct iax_hw_desc *desc;
- struct idxd_device *idxd;
- struct iaa_wq *iaa_wq;
- struct pci_dev *pdev;
- struct device *dev;
- int ret = 0;
-
- iaa_wq = idxd_wq_get_private(wq);
- iaa_device = iaa_wq->iaa_device;
- idxd = iaa_device->idxd;
- pdev = idxd->pdev;
- dev = &pdev->dev;
-
- active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
-
- idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
- if (IS_ERR(idxd_desc)) {
- dev_dbg(dev, "idxd descriptor allocation failed\n");
- dev_dbg(dev, "iaa compress failed: ret=%ld\n",
- PTR_ERR(idxd_desc));
- return PTR_ERR(idxd_desc);
- }
- desc = idxd_desc->iax_hw;
-
- /* Verify (optional) - decompress and check crc, suppress dest write */
-
- desc->flags = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CC;
- desc->opcode = IAX_OPCODE_DECOMPRESS;
- desc->decompr_flags = IAA_DECOMP_FLAGS | IAA_DECOMP_SUPPRESS_OUTPUT;
- desc->priv = 0;
-
- desc->src1_addr = (u64)dst_addr;
- desc->src1_size = *dlen;
- desc->dst_addr = (u64)src_addr;
- desc->max_dst_size = slen;
- desc->completion_addr = idxd_desc->compl_dma;
-
- dev_dbg(dev, "(verify) compression mode %s,"
- " desc->src1_addr %llx, desc->src1_size %d,"
- " desc->dst_addr %llx, desc->max_dst_size %d,"
- " desc->src2_addr %llx, desc->src2_size %d\n",
- active_compression_mode->name,
- desc->src1_addr, desc->src1_size, desc->dst_addr,
- desc->max_dst_size, desc->src2_addr, desc->src2_size);
-
- ret = idxd_submit_desc(wq, idxd_desc);
- if (ret) {
- dev_dbg(dev, "submit_desc (verify) failed ret=%d\n", ret);
- goto err;
- }
-
- ret = check_completion(dev, idxd_desc->iax_completion, false, false);
- if (ret) {
- dev_dbg(dev, "(verify) check_completion failed ret=%d\n", ret);
- goto err;
- }
-
- if (compression_crc != idxd_desc->iax_completion->crc) {
- ret = -EINVAL;
- dev_dbg(dev, "(verify) iaa comp/decomp crc mismatch:"
- " comp=0x%x, decomp=0x%x\n", compression_crc,
- idxd_desc->iax_completion->crc);
- print_hex_dump(KERN_INFO, "cmp-rec: ", DUMP_PREFIX_OFFSET,
- 8, 1, idxd_desc->iax_completion, 64, 0);
- goto err;
- }
-
- idxd_free_desc(wq, idxd_desc);
-out:
- return ret;
-err:
- idxd_free_desc(wq, idxd_desc);
- dev_dbg(dev, "iaa compress failed: ret=%d\n", ret);
-
- goto out;
-}
-
static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
struct idxd_wq *wq,
dma_addr_t src_addr, unsigned int slen,
@@ -1986,6 +1993,9 @@ static int iaa_decompress_main(struct acomp_req *req)
return iaa_comp_adecompress(req);
}
+/*********************************************
+ * Interfaces to crypto_alg and crypto_acomp.
+ *********************************************/
static int iaa_comp_init_fixed(struct crypto_acomp *acomp_tfm)
{
struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
--
2.27.0
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH v8 08/14] crypto: iaa - Map IAA devices/wqs to cores based on packages instead of NUMA.
2025-03-03 8:47 [PATCH v8 00/14] zswap IAA compress batching Kanchana P Sridhar
` (6 preceding siblings ...)
2025-03-03 8:47 ` [PATCH v8 07/14] crypto: iaa - Re-organize the iaa_crypto driver code Kanchana P Sridhar
@ 2025-03-03 8:47 ` Kanchana P Sridhar
2025-03-03 8:47 ` [PATCH v8 09/14] crypto: iaa - Distribute compress jobs from all cores to all IAAs on a package Kanchana P Sridhar
` (5 subsequent siblings)
13 siblings, 0 replies; 44+ messages in thread
From: Kanchana P Sridhar @ 2025-03-03 8:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
surenb, kristen.c.accardi
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch modifies the algorithm for mapping available IAA devices and
wqs to cores, as they are being discovered, to be based on packages instead
of NUMA nodes. This leads to a more realistic mapping of IAA devices as
compression/decompression resources for a package, rather than for a NUMA
node. It also resolves problems observed during internal validation on
Intel platforms with many more NUMA nodes than packages: in such cases, the
earlier NUMA-based allocation caused some IAAs to be over-subscribed and
others to not be utilized at all.
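For an illustrative sense of the package-based sizing this patch
introduces (using the formulas in the diff below, and assuming, for
example, 2 packages, 56 cores per package and 8 IAA devices):

  cpus_per_iaa       = (nr_packages * nr_cpus_per_package) / nr_iaa
                     = (2 * 56) / 8 = 14
  nr_iaa_per_package = nr_iaa / nr_packages = 8 / 2 = 4

i.e., each IAA device serves 14 cores, and each package has 4 IAA devices
available as its local compression/decompression resources.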
As a result of this change from NUMA nodes to packages, some of the core
functions used by the iaa_crypto driver's "probe" and "remove" APIs have
been re-written. The new infrastructure maintains a static/global mapping
of "local wqs" per IAA device, in the "struct iaa_device" itself. The
earlier implementation allocated this data per-cpu, even though it never
changes once the IAA devices/wqs have been initialized.
1) Resolves "task blocked for more than x seconds" errors observed during
internal validation on Intel systems with the earlier NUMA node based
mappings, which was root-caused to the non-optimal IAA-to-core mappings
described earlier.
2) Results in a NUM_THREADS factor reduction in memory footprint cost of
initializing IAA devices/wqs, due to eliminating the per-cpu copies of
each IAA device's wqs. On a 384 cores Intel Granite Rapids server with
8 IAA devices, this saves 140MiB.
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
drivers/crypto/intel/iaa/iaa_crypto.h | 17 +-
drivers/crypto/intel/iaa/iaa_crypto_main.c | 276 ++++++++++++---------
2 files changed, 171 insertions(+), 122 deletions(-)
diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index 45d94a646636..72ffdf55f7b3 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -55,6 +55,7 @@ struct iaa_wq {
struct idxd_wq *wq;
int ref;
bool remove;
+ bool mapped;
struct iaa_device *iaa_device;
@@ -72,6 +73,13 @@ struct iaa_device_compression_mode {
dma_addr_t aecs_comp_table_dma_addr;
};
+struct wq_table_entry {
+ struct idxd_wq **wqs;
+ int max_wqs;
+ int n_wqs;
+ int cur_wq;
+};
+
/* Representation of IAA device with wqs, populated by probe */
struct iaa_device {
struct list_head list;
@@ -82,19 +90,14 @@ struct iaa_device {
int n_wq;
struct list_head wqs;
+ struct wq_table_entry *iaa_local_wqs;
+
atomic64_t comp_calls;
atomic64_t comp_bytes;
atomic64_t decomp_calls;
atomic64_t decomp_bytes;
};
-struct wq_table_entry {
- struct idxd_wq **wqs;
- int max_wqs;
- int n_wqs;
- int cur_wq;
-};
-
#define IAA_AECS_ALIGN 32
/*
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index abaee160e5ec..40751d7c83c0 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -30,8 +30,9 @@
/* number of iaa instances probed */
static unsigned int nr_iaa;
static unsigned int nr_cpus;
-static unsigned int nr_nodes;
-static unsigned int nr_cpus_per_node;
+static unsigned int nr_packages;
+static unsigned int nr_cpus_per_package;
+static unsigned int nr_iaa_per_package;
/* Number of physical cpus sharing each iaa instance */
static unsigned int cpus_per_iaa;
@@ -462,17 +463,46 @@ static void remove_device_compression_modes(struct iaa_device *iaa_device)
* Functions for use in crypto probe and remove interfaces:
* allocate/init/query/deallocate devices/wqs.
***********************************************************/
-static struct iaa_device *iaa_device_alloc(void)
+static struct iaa_device *iaa_device_alloc(struct idxd_device *idxd)
{
+ struct wq_table_entry *local;
struct iaa_device *iaa_device;
iaa_device = kzalloc(sizeof(*iaa_device), GFP_KERNEL);
if (!iaa_device)
- return NULL;
+ goto err;
+
+ iaa_device->idxd = idxd;
+
+ /* IAA device's local wqs. */
+ iaa_device->iaa_local_wqs = kzalloc(sizeof(struct wq_table_entry), GFP_KERNEL);
+ if (!iaa_device->iaa_local_wqs)
+ goto err;
+
+ local = iaa_device->iaa_local_wqs;
+
+ local->wqs = kzalloc(iaa_device->idxd->max_wqs * sizeof(struct wq *), GFP_KERNEL);
+ if (!local->wqs)
+ goto err;
+
+ local->max_wqs = iaa_device->idxd->max_wqs;
+ local->n_wqs = 0;
INIT_LIST_HEAD(&iaa_device->wqs);
return iaa_device;
+
+err:
+ if (iaa_device) {
+ if (iaa_device->iaa_local_wqs) {
+ if (iaa_device->iaa_local_wqs->wqs)
+ kfree(iaa_device->iaa_local_wqs->wqs);
+ kfree(iaa_device->iaa_local_wqs);
+ }
+ kfree(iaa_device);
+ }
+
+ return NULL;
}
static bool iaa_has_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
@@ -491,12 +521,10 @@ static struct iaa_device *add_iaa_device(struct idxd_device *idxd)
{
struct iaa_device *iaa_device;
- iaa_device = iaa_device_alloc();
+ iaa_device = iaa_device_alloc(idxd);
if (!iaa_device)
return NULL;
- iaa_device->idxd = idxd;
-
list_add_tail(&iaa_device->list, &iaa_devices);
nr_iaa++;
@@ -537,6 +565,7 @@ static int add_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq,
iaa_wq->wq = wq;
iaa_wq->iaa_device = iaa_device;
idxd_wq_set_private(wq, iaa_wq);
+ iaa_wq->mapped = false;
list_add_tail(&iaa_wq->list, &iaa_device->wqs);
@@ -580,6 +609,13 @@ static void free_iaa_device(struct iaa_device *iaa_device)
return;
remove_device_compression_modes(iaa_device);
+
+ if (iaa_device->iaa_local_wqs) {
+ if (iaa_device->iaa_local_wqs->wqs)
+ kfree(iaa_device->iaa_local_wqs->wqs);
+ kfree(iaa_device->iaa_local_wqs);
+ }
+
kfree(iaa_device);
}
@@ -716,9 +752,14 @@ static int save_iaa_wq(struct idxd_wq *wq)
if (WARN_ON(nr_iaa == 0))
return -EINVAL;
- cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
+ cpus_per_iaa = (nr_packages * nr_cpus_per_package) / nr_iaa;
if (!cpus_per_iaa)
cpus_per_iaa = 1;
+
+ nr_iaa_per_package = nr_iaa / nr_packages;
+ if (!nr_iaa_per_package)
+ nr_iaa_per_package = 1;
+
out:
return 0;
}
@@ -735,53 +776,45 @@ static void remove_iaa_wq(struct idxd_wq *wq)
}
if (nr_iaa) {
- cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
+ cpus_per_iaa = (nr_packages * nr_cpus_per_package) / nr_iaa;
if (!cpus_per_iaa)
cpus_per_iaa = 1;
- } else
+
+ nr_iaa_per_package = nr_iaa / nr_packages;
+ if (!nr_iaa_per_package)
+ nr_iaa_per_package = 1;
+ } else {
cpus_per_iaa = 1;
+ nr_iaa_per_package = 1;
+ }
}
/***************************************************************
* Mapping IAA devices and wqs to cores with per-cpu wq_tables.
***************************************************************/
-static void wq_table_free_entry(int cpu)
-{
- struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
- kfree(entry->wqs);
- memset(entry, 0, sizeof(*entry));
-}
-
-static void wq_table_clear_entry(int cpu)
-{
- struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
- entry->n_wqs = 0;
- entry->cur_wq = 0;
- memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
-}
-
-static void clear_wq_table(void)
+/*
+ * Given a cpu, find the closest IAA instance. The idea is to try to
+ * choose the most appropriate IAA instance for a caller and spread
+ * available workqueues around to clients.
+ */
+static inline int cpu_to_iaa(int cpu)
{
- int cpu;
-
- for (cpu = 0; cpu < nr_cpus; cpu++)
- wq_table_clear_entry(cpu);
+ int package_id, base_iaa, iaa = 0;
- pr_debug("cleared wq table\n");
-}
+ if (!nr_packages || !nr_iaa_per_package)
+ return 0;
-static void free_wq_table(void)
-{
- int cpu;
+ package_id = topology_logical_package_id(cpu);
+ base_iaa = package_id * nr_iaa_per_package;
+ iaa = base_iaa + ((cpu % nr_cpus_per_package) / cpus_per_iaa);
- for (cpu = 0; cpu < nr_cpus; cpu++)
- wq_table_free_entry(cpu);
+ pr_debug("cpu = %d, package_id = %d, base_iaa = %d, iaa = %d",
+ cpu, package_id, base_iaa, iaa);
- free_percpu(wq_table);
+ if (iaa >= 0 && iaa < nr_iaa)
+ return iaa;
- pr_debug("freed wq table\n");
+ return (nr_iaa - 1);
}
static int alloc_wq_table(int max_wqs)
@@ -795,13 +828,11 @@ static int alloc_wq_table(int max_wqs)
for (cpu = 0; cpu < nr_cpus; cpu++) {
entry = per_cpu_ptr(wq_table, cpu);
- entry->wqs = kcalloc(max_wqs, sizeof(struct wq *), GFP_KERNEL);
- if (!entry->wqs) {
- free_wq_table();
- return -ENOMEM;
- }
+ entry->wqs = NULL;
entry->max_wqs = max_wqs;
+ entry->n_wqs = 0;
+ entry->cur_wq = 0;
}
pr_debug("initialized wq table\n");
@@ -809,33 +840,27 @@ static int alloc_wq_table(int max_wqs)
return 0;
}
-static void wq_table_add(int cpu, struct idxd_wq *wq)
+static void wq_table_add(int cpu, struct wq_table_entry *iaa_local_wqs)
{
struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
- if (WARN_ON(entry->n_wqs == entry->max_wqs))
- return;
-
- entry->wqs[entry->n_wqs++] = wq;
+ entry->wqs = iaa_local_wqs->wqs;
+ entry->max_wqs = iaa_local_wqs->max_wqs;
+ entry->n_wqs = iaa_local_wqs->n_wqs;
+ entry->cur_wq = 0;
- pr_debug("%s: added iaa wq %d.%d to idx %d of cpu %d\n", __func__,
+ pr_debug("%s: cpu %d: added %d iaa local wqs up to wq %d.%d\n", __func__,
+ cpu, entry->n_wqs,
entry->wqs[entry->n_wqs - 1]->idxd->id,
- entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
+ entry->wqs[entry->n_wqs - 1]->id);
}
static int wq_table_add_wqs(int iaa, int cpu)
{
struct iaa_device *iaa_device, *found_device = NULL;
- int ret = 0, cur_iaa = 0, n_wqs_added = 0;
- struct idxd_device *idxd;
- struct iaa_wq *iaa_wq;
- struct pci_dev *pdev;
- struct device *dev;
+ int ret = 0, cur_iaa = 0;
list_for_each_entry(iaa_device, &iaa_devices, list) {
- idxd = iaa_device->idxd;
- pdev = idxd->pdev;
- dev = &pdev->dev;
if (cur_iaa != iaa) {
cur_iaa++;
@@ -843,7 +868,8 @@ static int wq_table_add_wqs(int iaa, int cpu)
}
found_device = iaa_device;
- dev_dbg(dev, "getting wq from iaa_device %d, cur_iaa %d\n",
+ dev_dbg(&found_device->idxd->pdev->dev,
+ "getting wq from iaa_device %d, cur_iaa %d\n",
found_device->idxd->id, cur_iaa);
break;
}
@@ -858,29 +884,58 @@ static int wq_table_add_wqs(int iaa, int cpu)
}
cur_iaa = 0;
- idxd = found_device->idxd;
- pdev = idxd->pdev;
- dev = &pdev->dev;
- dev_dbg(dev, "getting wq from only iaa_device %d, cur_iaa %d\n",
+ dev_dbg(&found_device->idxd->pdev->dev,
+ "getting wq from only iaa_device %d, cur_iaa %d\n",
found_device->idxd->id, cur_iaa);
}
- list_for_each_entry(iaa_wq, &found_device->wqs, list) {
- wq_table_add(cpu, iaa_wq->wq);
- pr_debug("rebalance: added wq for cpu=%d: iaa wq %d.%d\n",
- cpu, iaa_wq->wq->idxd->id, iaa_wq->wq->id);
- n_wqs_added++;
+ wq_table_add(cpu, found_device->iaa_local_wqs);
+
+out:
+ return ret;
+}
+
+static int map_iaa_device_wqs(struct iaa_device *iaa_device)
+{
+ struct wq_table_entry *local;
+ int ret = 0, n_wqs_added = 0;
+ struct iaa_wq *iaa_wq;
+
+ local = iaa_device->iaa_local_wqs;
+
+ list_for_each_entry(iaa_wq, &iaa_device->wqs, list) {
+ if (iaa_wq->mapped && ++n_wqs_added)
+ continue;
+
+ pr_debug("iaa_device %px: processing wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+
+ if (WARN_ON(local->n_wqs == local->max_wqs))
+ break;
+
+ local->wqs[local->n_wqs++] = iaa_wq->wq;
+ pr_debug("iaa_device %px: added local wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+
+ iaa_wq->mapped = true;
+ ++n_wqs_added;
}
- if (!n_wqs_added) {
- pr_debug("couldn't find any iaa wqs!\n");
+ if (!n_wqs_added && !iaa_device->n_wq) {
+ pr_debug("iaa_device %d: couldn't find any iaa wqs!\n", iaa_device->idxd->id);
ret = -EINVAL;
- goto out;
}
-out:
+
return ret;
}
+static void map_iaa_devices(void)
+{
+ struct iaa_device *iaa_device;
+
+ list_for_each_entry(iaa_device, &iaa_devices, list) {
+ BUG_ON(map_iaa_device_wqs(iaa_device));
+ }
+}
+
/*
* Rebalance the wq table so that given a cpu, it's easy to find the
* closest IAA instance. The idea is to try to choose the most
@@ -889,48 +944,42 @@ static int wq_table_add_wqs(int iaa, int cpu)
*/
static void rebalance_wq_table(void)
{
- const struct cpumask *node_cpus;
- int node, cpu, iaa = -1;
+ int cpu, iaa;
if (nr_iaa == 0)
return;
- pr_debug("rebalance: nr_nodes=%d, nr_cpus %d, nr_iaa %d, cpus_per_iaa %d\n",
- nr_nodes, nr_cpus, nr_iaa, cpus_per_iaa);
+ map_iaa_devices();
- clear_wq_table();
+ pr_debug("rebalance: nr_packages=%d, nr_cpus %d, nr_iaa %d, cpus_per_iaa %d\n",
+ nr_packages, nr_cpus, nr_iaa, cpus_per_iaa);
- if (nr_iaa == 1) {
- for (cpu = 0; cpu < nr_cpus; cpu++) {
- if (WARN_ON(wq_table_add_wqs(0, cpu))) {
- pr_debug("could not add any wqs for iaa 0 to cpu %d!\n", cpu);
- return;
- }
+ for (cpu = 0; cpu < nr_cpus; cpu++) {
+ iaa = cpu_to_iaa(cpu);
+ pr_debug("rebalance: cpu=%d iaa=%d\n", cpu, iaa);
+
+ if (WARN_ON(iaa == -1)) {
+ pr_debug("rebalance (cpu_to_iaa(%d)) failed!\n", cpu);
+ return;
}
- return;
+ if (WARN_ON(wq_table_add_wqs(iaa, cpu))) {
+ pr_debug("could not add any wqs for iaa %d to cpu %d!\n", iaa, cpu);
+ return;
+ }
}
- for_each_node_with_cpus(node) {
- node_cpus = cpumask_of_node(node);
-
- for (cpu = 0; cpu < cpumask_weight(node_cpus); cpu++) {
- int node_cpu = cpumask_nth(cpu, node_cpus);
-
- if (WARN_ON(node_cpu >= nr_cpu_ids)) {
- pr_debug("node_cpu %d doesn't exist!\n", node_cpu);
- return;
- }
-
- if ((cpu % cpus_per_iaa) == 0)
- iaa++;
+ pr_debug("Finished rebalance local wqs.");
+}
- if (WARN_ON(wq_table_add_wqs(iaa, node_cpu))) {
- pr_debug("could not add any wqs for iaa %d to cpu %d!\n", iaa, cpu);
- return;
- }
- }
+static void free_wq_tables(void)
+{
+ if (wq_table) {
+ free_percpu(wq_table);
+ wq_table = NULL;
}
+
+ pr_debug("freed local wq table\n");
}
/***************************************************************
@@ -2134,7 +2183,7 @@ static int iaa_crypto_probe(struct idxd_dev *idxd_dev)
free_iaa_wq(idxd_wq_get_private(wq));
err_save:
if (first_wq)
- free_wq_table();
+ free_wq_tables();
err_alloc:
mutex_unlock(&iaa_devices_lock);
idxd_drv_disable_wq(wq);
@@ -2184,7 +2233,9 @@ static void iaa_crypto_remove(struct idxd_dev *idxd_dev)
if (nr_iaa == 0) {
iaa_crypto_enabled = false;
- free_wq_table();
+ free_wq_tables();
+ BUG_ON(!list_empty(&iaa_devices));
+ INIT_LIST_HEAD(&iaa_devices);
module_put(THIS_MODULE);
pr_info("iaa_crypto now DISABLED\n");
@@ -2210,16 +2261,11 @@ static struct idxd_device_driver iaa_crypto_driver = {
static int __init iaa_crypto_init_module(void)
{
int ret = 0;
- int node;
+ INIT_LIST_HEAD(&iaa_devices);
nr_cpus = num_possible_cpus();
- for_each_node_with_cpus(node)
- nr_nodes++;
- if (!nr_nodes) {
- pr_err("IAA couldn't find any nodes with cpus\n");
- return -ENODEV;
- }
- nr_cpus_per_node = nr_cpus / nr_nodes;
+ nr_cpus_per_package = topology_num_cores_per_package();
+ nr_packages = topology_max_packages();
if (crypto_has_comp("deflate-generic", 0, 0))
deflate_generic_tfm = crypto_alloc_comp("deflate-generic", 0, 0);
--
2.27.0
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH v8 09/14] crypto: iaa - Distribute compress jobs from all cores to all IAAs on a package.
2025-03-03 8:47 [PATCH v8 00/14] zswap IAA compress batching Kanchana P Sridhar
` (7 preceding siblings ...)
2025-03-03 8:47 ` [PATCH v8 08/14] crypto: iaa - Map IAA devices/wqs to cores based on packages instead of NUMA Kanchana P Sridhar
@ 2025-03-03 8:47 ` Kanchana P Sridhar
2025-03-03 8:47 ` [PATCH v8 10/14] crypto: iaa - Descriptor allocation timeouts with mitigations in iaa_crypto Kanchana P Sridhar
` (4 subsequent siblings)
13 siblings, 0 replies; 44+ messages in thread
From: Kanchana P Sridhar @ 2025-03-03 8:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
surenb, kristen.c.accardi
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This change enables processes running on any logical core on a package to
use all the IAA devices enabled on that package for compress jobs. In
other words, compressions originating from any process in a package will be
distributed in a round-robin manner to the available IAA devices on the
same package.
This is not the default behavior, and is recommended only for highly
contended scenarios, when there is significant swapout/swapin activity. The
commit log describes how to enable this feature through driver parameters,
but the key thing to note is that this requires configuring 2 work-queues
per IAA device (each with 64 entries), with 1 WQ used solely for decompress
jobs, and the other WQ used solely for compress jobs. Hence the above
recommendation.
The main premise behind this change is to make sure that no compress
engines on any IAA device are un-utilized/under-utilized/over-utilized.
In other words, the compress engines on all IAA devices are considered a
global resource for that package, thus maximizing compression throughput.
This allows the use of all IAA devices present in a given package for
(batched) compressions originating from zswap/zram, from all cores
on this package.
A new per-cpu "global_wq_table" implements this in the iaa_crypto driver.
We can think of the global WQ per IAA as a WQ to which all cores on
that package can submit compress jobs.
To avail of this feature, the user must configure 2 WQs per IAA in order to
enable distribution of compress jobs to multiple IAA devices.
Each IAA will have 2 WQs:
wq.0 (local WQ):
Used for decompress jobs from cores mapped by the cpu_to_iaa() "even
balancing of logical cores to IAA devices" algorithm.
wq.1 (global WQ):
Used for compress jobs from *all* logical cores on that package.
The iaa_crypto driver will place all global WQs from all same-package IAA
devices in the per-cpu global_wq_table of each cpu on that package. When
the driver receives a compress job, it will look up the "next" global WQ
in the cpu's global_wq_table and submit the descriptor to it.
The starting wq in the global_wq_table for each cpu is the global wq
associated with the IAA nearest to it, so that we stagger the starting
global wq for each process. This results in very uniform usage of all IAAs
for compress jobs.
Two new driver module parameters are added for this feature:
g_wqs_per_iaa (default 0):
/sys/bus/dsa/drivers/crypto/g_wqs_per_iaa
This represents the number of global WQs that can be configured per IAA
device. The recommended setting is 1 to enable the use of this feature
once the user configures 2 WQs per IAA using higher level scripts as
described in Documentation/driver-api/crypto/iaa/iaa-crypto.rst.
g_consec_descs_per_gwq (default 1):
/sys/bus/dsa/drivers/crypto/g_consec_descs_per_gwq
This represents the number of consecutive compress jobs that will be
submitted to the same global WQ (i.e. to the same IAA device) from a given
core, before moving to the next global WQ. The default is 1, which is also
the recommended setting to avail of this feature.
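The wq selection can be pictured with the following sketch. This is only
an illustration of the mechanism described above, not the patch's actual
function (the helper name is hypothetical); it uses the new per-cpu
global_wq_table and num_consec_descs_per_wq driver structures and the
g_consec_descs_per_gwq parameter described above:

  /* Sketch only: pick the next global wq for compress jobs on this cpu. */
  static struct idxd_wq *next_global_wq(int cpu)
  {
          struct wq_table_entry *entry = per_cpu_ptr(global_wq_table, cpu);
          int *consec = per_cpu_ptr(num_consec_descs_per_wq, cpu);

          /* After g_consec_descs_per_gwq descriptors, move to the next wq. */
          if (++(*consec) > g_consec_descs_per_gwq) {
                  if (++entry->cur_wq >= entry->n_wqs)
                          entry->cur_wq = 0;
                  *consec = 1;
          }

          return entry->wqs[entry->cur_wq];
  }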
The decompress jobs from any core will be sent to the "local" IAA, namely
the one that the driver assigns with the cpu_to_iaa() mapping algorithm
that evenly balances the assignment of logical cores to IAA devices on a
package.
On a 2-package Sapphire Rapids server where each package has 56 cores and
4 IAA devices, this is how the compress/decompress jobs will be mapped
when the user configures 2 WQs per IAA device (which implies wq.1 will
be added to the global WQ table for each logical core on that package):
package(s): 2
package0 CPU(s): 0-55,112-167
package1 CPU(s): 56-111,168-223
Compress jobs:
--------------
package 0:
iaa_crypto will send compress jobs from all cpus (0-55,112-167) to all IAA
devices on the package (iax1/iax3/iax5/iax7) in a round-robin manner:
iaa:  iax1              iax3            iax5            iax7
package 1:
iaa_crypto will send compress jobs from all cpus (56-111,168-223) to all
IAA devices on the package (iax9/iax11/iax13/iax15) in a round-robin manner:
iaa:  iax9              iax11           iax13           iax15
Decompress jobs:
----------------
package 0:
cpu:  0-13,112-125      14-27,126-139   28-41,140-153   42-55,154-167
iaa:  iax1              iax3            iax5            iax7
package 1:
cpu:  56-69,168-181     70-83,182-195   84-97,196-209   98-111,210-223
iaa:  iax9              iax11           iax13           iax15
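As a quick illustrative check of the decompress mapping above, using the
cpu_to_iaa() arithmetic from the previous patch (assuming
nr_cpus_per_package = 56 as reported by topology_num_cores_per_package()
on this system, nr_packages = 2 and nr_iaa = 8, hence cpus_per_iaa = 14
and nr_iaa_per_package = 4):

  cpu 20:  package 0, base_iaa 0, iaa = 0 + (20 % 56) / 14  = 1 -> iax3
  cpu 150: package 0, base_iaa 0, iaa = 0 + (150 % 56) / 14 = 2 -> iax5
  cpu 70:  package 1, base_iaa 4, iaa = 4 + (70 % 56) / 14  = 5 -> iax11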
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
drivers/crypto/intel/iaa/iaa_crypto.h | 1 +
drivers/crypto/intel/iaa/iaa_crypto_main.c | 385 ++++++++++++++++++++-
2 files changed, 378 insertions(+), 8 deletions(-)
diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index 72ffdf55f7b3..5f38f530c33d 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -91,6 +91,7 @@ struct iaa_device {
struct list_head wqs;
struct wq_table_entry *iaa_local_wqs;
+ struct wq_table_entry *iaa_global_wqs;
atomic64_t comp_calls;
atomic64_t comp_bytes;
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 40751d7c83c0..cb96897e7fed 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -42,6 +42,18 @@ static struct crypto_comp *deflate_generic_tfm;
/* Per-cpu lookup table for balanced wqs */
static struct wq_table_entry __percpu *wq_table = NULL;
+static struct wq_table_entry **pkg_global_wq_tables = NULL;
+
+/* Per-cpu lookup table for global wqs shared by all cpus. */
+static struct wq_table_entry __percpu *global_wq_table = NULL;
+
+/*
+ * Per-cpu counter of consecutive descriptors allocated to
+ * the same wq in the global_wq_table, so that we know
+ * when to switch to the next wq in the global_wq_table.
+ */
+static int __percpu *num_consec_descs_per_wq = NULL;
+
/* Verify results of IAA compress or not */
static bool iaa_verify_compress = false;
@@ -79,6 +91,16 @@ static bool async_mode = true;
/* Use interrupts */
static bool use_irq;
+/* Number of global wqs per iaa */
+static int g_wqs_per_iaa = 0;
+
+/*
+ * Number of consecutive descriptors to allocate from a
+ * given global wq before switching to the next wq in
+ * the global_wq_table.
+ */
+static int g_consec_descs_per_gwq = 1;
+
static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
LIST_HEAD(iaa_devices);
@@ -180,6 +202,60 @@ static ssize_t sync_mode_store(struct device_driver *driver,
}
static DRIVER_ATTR_RW(sync_mode);
+static ssize_t g_wqs_per_iaa_show(struct device_driver *driver, char *buf)
+{
+ return sprintf(buf, "%d\n", g_wqs_per_iaa);
+}
+
+static ssize_t g_wqs_per_iaa_store(struct device_driver *driver,
+ const char *buf, size_t count)
+{
+ int ret = -EBUSY;
+
+ mutex_lock(&iaa_devices_lock);
+
+ if (iaa_crypto_enabled)
+ goto out;
+
+ ret = kstrtoint(buf, 10, &g_wqs_per_iaa);
+ if (ret)
+ goto out;
+
+ ret = count;
+out:
+ mutex_unlock(&iaa_devices_lock);
+
+ return ret;
+}
+static DRIVER_ATTR_RW(g_wqs_per_iaa);
+
+static ssize_t g_consec_descs_per_gwq_show(struct device_driver *driver, char *buf)
+{
+ return sprintf(buf, "%d\n", g_consec_descs_per_gwq);
+}
+
+static ssize_t g_consec_descs_per_gwq_store(struct device_driver *driver,
+ const char *buf, size_t count)
+{
+ int ret = -EBUSY;
+
+ mutex_lock(&iaa_devices_lock);
+
+ if (iaa_crypto_enabled)
+ goto out;
+
+ ret = kstrtoint(buf, 10, &g_consec_descs_per_gwq);
+ if (ret)
+ goto out;
+
+ ret = count;
+out:
+ mutex_unlock(&iaa_devices_lock);
+
+ return ret;
+}
+static DRIVER_ATTR_RW(g_consec_descs_per_gwq);
+
/****************************
* Driver compression modes.
****************************/
@@ -465,7 +541,7 @@ static void remove_device_compression_modes(struct iaa_device *iaa_device)
***********************************************************/
static struct iaa_device *iaa_device_alloc(struct idxd_device *idxd)
{
- struct wq_table_entry *local;
+ struct wq_table_entry *local, *global;
struct iaa_device *iaa_device;
iaa_device = kzalloc(sizeof(*iaa_device), GFP_KERNEL);
@@ -488,6 +564,20 @@ static struct iaa_device *iaa_device_alloc(struct idxd_device *idxd)
local->max_wqs = iaa_device->idxd->max_wqs;
local->n_wqs = 0;
+ /* IAA device's global wqs. */
+ iaa_device->iaa_global_wqs = kzalloc(sizeof(struct wq_table_entry), GFP_KERNEL);
+ if (!iaa_device->iaa_global_wqs)
+ goto err;
+
+ global = iaa_device->iaa_global_wqs;
+
+ global->wqs = kzalloc(iaa_device->idxd->max_wqs * sizeof(struct wq *), GFP_KERNEL);
+ if (!global->wqs)
+ goto err;
+
+ global->max_wqs = iaa_device->idxd->max_wqs;
+ global->n_wqs = 0;
+
INIT_LIST_HEAD(&iaa_device->wqs);
return iaa_device;
@@ -499,6 +589,8 @@ static struct iaa_device *iaa_device_alloc(struct idxd_device *idxd)
kfree(iaa_device->iaa_local_wqs->wqs);
kfree(iaa_device->iaa_local_wqs);
}
+ if (iaa_device->iaa_global_wqs)
+ kfree(iaa_device->iaa_global_wqs);
kfree(iaa_device);
}
@@ -616,6 +708,12 @@ static void free_iaa_device(struct iaa_device *iaa_device)
kfree(iaa_device->iaa_local_wqs);
}
+ if (iaa_device->iaa_global_wqs) {
+ if (iaa_device->iaa_global_wqs->wqs)
+ kfree(iaa_device->iaa_global_wqs->wqs);
+ kfree(iaa_device->iaa_global_wqs);
+ }
+
kfree(iaa_device);
}
@@ -817,6 +915,58 @@ static inline int cpu_to_iaa(int cpu)
return (nr_iaa - 1);
}
+static void free_global_wq_table(void)
+{
+ if (global_wq_table) {
+ free_percpu(global_wq_table);
+ global_wq_table = NULL;
+ }
+
+ if (num_consec_descs_per_wq) {
+ free_percpu(num_consec_descs_per_wq);
+ num_consec_descs_per_wq = NULL;
+ }
+
+ pr_debug("freed global wq table\n");
+}
+
+static int pkg_global_wq_tables_alloc(void)
+{
+ int i, j;
+
+ pkg_global_wq_tables = kzalloc(nr_packages * sizeof(*pkg_global_wq_tables), GFP_KERNEL);
+ if (!pkg_global_wq_tables)
+ return -ENOMEM;
+
+ for (i = 0; i < nr_packages; ++i) {
+ pkg_global_wq_tables[i] = kzalloc(sizeof(struct wq_table_entry), GFP_KERNEL);
+
+ if (!pkg_global_wq_tables[i]) {
+ for (j = 0; j < i; ++j)
+ kfree(pkg_global_wq_tables[j]);
+ kfree(pkg_global_wq_tables);
+ pkg_global_wq_tables = NULL;
+ return -ENOMEM;
+ }
+ pkg_global_wq_tables[i]->wqs = NULL;
+ }
+
+ return 0;
+}
+
+static void pkg_global_wq_tables_dealloc(void)
+{
+ int i;
+
+ for (i = 0; i < nr_packages; ++i) {
+ if (pkg_global_wq_tables[i]->wqs)
+ kfree(pkg_global_wq_tables[i]->wqs);
+ kfree(pkg_global_wq_tables[i]);
+ }
+ kfree(pkg_global_wq_tables);
+ pkg_global_wq_tables = NULL;
+}
+
static int alloc_wq_table(int max_wqs)
{
struct wq_table_entry *entry;
@@ -835,6 +985,35 @@ static int alloc_wq_table(int max_wqs)
entry->cur_wq = 0;
}
+ global_wq_table = alloc_percpu(struct wq_table_entry);
+ if (!global_wq_table)
+ return 0;
+
+ for (cpu = 0; cpu < nr_cpus; cpu++) {
+ entry = per_cpu_ptr(global_wq_table, cpu);
+
+ entry->wqs = NULL;
+ entry->max_wqs = max_wqs;
+ entry->n_wqs = 0;
+ entry->cur_wq = 0;
+ }
+
+ num_consec_descs_per_wq = alloc_percpu(int);
+ if (!num_consec_descs_per_wq) {
+ free_global_wq_table();
+ return 0;
+ }
+
+ for (cpu = 0; cpu < nr_cpus; cpu++) {
+ int *num_consec_descs = per_cpu_ptr(num_consec_descs_per_wq, cpu);
+ *num_consec_descs = 0;
+ }
+
+ if (pkg_global_wq_tables_alloc()) {
+ free_global_wq_table();
+ return 0;
+ }
+
pr_debug("initialized wq table\n");
return 0;
@@ -895,13 +1074,120 @@ static int wq_table_add_wqs(int iaa, int cpu)
return ret;
}
+static void pkg_global_wq_tables_reinit(void)
+{
+ int i, cur_iaa = 0, pkg = 0, nr_pkg_wqs = 0;
+ struct iaa_device *iaa_device;
+ struct wq_table_entry *global;
+
+ if (!pkg_global_wq_tables)
+ return;
+
+ /* Reallocate per-package wqs. */
+ list_for_each_entry(iaa_device, &iaa_devices, list) {
+ global = iaa_device->iaa_global_wqs;
+ nr_pkg_wqs += global->n_wqs;
+
+ if (++cur_iaa == nr_iaa_per_package) {
+ nr_pkg_wqs = nr_pkg_wqs ? max_t(int, iaa_device->idxd->max_wqs, nr_pkg_wqs) : 0;
+
+ if (pkg_global_wq_tables[pkg]->wqs) {
+ kfree(pkg_global_wq_tables[pkg]->wqs);
+ pkg_global_wq_tables[pkg]->wqs = NULL;
+ }
+
+ if (nr_pkg_wqs)
+ pkg_global_wq_tables[pkg]->wqs = kzalloc(nr_pkg_wqs *
+ sizeof(struct wq *),
+ GFP_KERNEL);
+
+ pkg_global_wq_tables[pkg]->n_wqs = 0;
+ pkg_global_wq_tables[pkg]->cur_wq = 0;
+ pkg_global_wq_tables[pkg]->max_wqs = nr_pkg_wqs;
+
+ if (++pkg == nr_packages)
+ break;
+ cur_iaa = 0;
+ nr_pkg_wqs = 0;
+ }
+ }
+
+ pkg = 0;
+ cur_iaa = 0;
+
+ /* Re-initialize per-package wqs. */
+ list_for_each_entry(iaa_device, &iaa_devices, list) {
+ global = iaa_device->iaa_global_wqs;
+
+ if (pkg_global_wq_tables[pkg]->wqs)
+ for (i = 0; i < global->n_wqs; ++i)
+ pkg_global_wq_tables[pkg]->wqs[pkg_global_wq_tables[pkg]->n_wqs++] = global->wqs[i];
+
+ pr_debug("pkg_global_wq_tables[%d] has %d wqs", pkg, pkg_global_wq_tables[pkg]->n_wqs);
+
+ if (++cur_iaa == nr_iaa_per_package) {
+ if (++pkg == nr_packages)
+ break;
+ cur_iaa = 0;
+ }
+ }
+}
+
+static void global_wq_table_add(int cpu, struct wq_table_entry *pkg_global_wq_table)
+{
+ struct wq_table_entry *entry = per_cpu_ptr(global_wq_table, cpu);
+
+ /* This could be NULL. */
+ entry->wqs = pkg_global_wq_table->wqs;
+ entry->max_wqs = pkg_global_wq_table->max_wqs;
+ entry->n_wqs = pkg_global_wq_table->n_wqs;
+ entry->cur_wq = 0;
+
+ if (entry->wqs)
+ pr_debug("%s: cpu %d: added %d iaa global wqs up to wq %d.%d\n", __func__,
+ cpu, entry->n_wqs,
+ entry->wqs[entry->n_wqs - 1]->idxd->id,
+ entry->wqs[entry->n_wqs - 1]->id);
+}
+
+static void global_wq_table_set_start_wq(int cpu)
+{
+ struct wq_table_entry *entry = per_cpu_ptr(global_wq_table, cpu);
+ int start_wq = g_wqs_per_iaa * (cpu_to_iaa(cpu) % nr_iaa_per_package);
+
+ if ((start_wq >= 0) && (start_wq < entry->n_wqs))
+ entry->cur_wq = start_wq;
+}
+
+static void global_wq_table_add_wqs(void)
+{
+ int cpu;
+
+ if (!pkg_global_wq_tables)
+ return;
+
+ for (cpu = 0; cpu < nr_cpus; cpu += nr_cpus_per_package) {
+ /* cpu's on the same package get the same global_wq_table. */
+ int package_id = topology_logical_package_id(cpu);
+ int pkg_cpu;
+
+ for (pkg_cpu = cpu; pkg_cpu < cpu + nr_cpus_per_package; ++pkg_cpu) {
+ if (pkg_global_wq_tables[package_id]->n_wqs > 0) {
+ global_wq_table_add(pkg_cpu, pkg_global_wq_tables[package_id]);
+ global_wq_table_set_start_wq(pkg_cpu);
+ }
+ }
+ }
+}
+
static int map_iaa_device_wqs(struct iaa_device *iaa_device)
{
- struct wq_table_entry *local;
+ struct wq_table_entry *local, *global;
int ret = 0, n_wqs_added = 0;
struct iaa_wq *iaa_wq;
local = iaa_device->iaa_local_wqs;
+ global = iaa_device->iaa_global_wqs;
list_for_each_entry(iaa_wq, &iaa_device->wqs, list) {
if (iaa_wq->mapped && ++n_wqs_added)
@@ -909,11 +1195,18 @@ static int map_iaa_device_wqs(struct iaa_device *iaa_device)
pr_debug("iaa_device %px: processing wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
- if (WARN_ON(local->n_wqs == local->max_wqs))
- break;
+ if ((!n_wqs_added || ((n_wqs_added + g_wqs_per_iaa) < iaa_device->n_wq)) &&
+ (local->n_wqs < local->max_wqs)) {
+
+ local->wqs[local->n_wqs++] = iaa_wq->wq;
+ pr_debug("iaa_device %px: added local wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+ } else {
+ if (WARN_ON(global->n_wqs == global->max_wqs))
+ break;
- local->wqs[local->n_wqs++] = iaa_wq->wq;
- pr_debug("iaa_device %px: added local wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+ global->wqs[global->n_wqs++] = iaa_wq->wq;
+ pr_debug("iaa_device %px: added global wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+ }
iaa_wq->mapped = true;
++n_wqs_added;
@@ -969,6 +1262,10 @@ static void rebalance_wq_table(void)
}
}
+ if (iaa_crypto_enabled && pkg_global_wq_tables) {
+ pkg_global_wq_tables_reinit();
+ global_wq_table_add_wqs();
+ }
pr_debug("Finished rebalance local wqs.");
}
@@ -979,7 +1276,17 @@ static void free_wq_tables(void)
wq_table = NULL;
}
- pr_debug("freed local wq table\n");
+ if (global_wq_table) {
+ free_percpu(global_wq_table);
+ global_wq_table = NULL;
+ }
+
+ if (num_consec_descs_per_wq) {
+ free_percpu(num_consec_descs_per_wq);
+ num_consec_descs_per_wq = NULL;
+ }
+
+ pr_debug("freed wq tables\n");
}
/***************************************************************
@@ -1002,6 +1309,35 @@ static struct idxd_wq *wq_table_next_wq(int cpu)
return entry->wqs[entry->cur_wq];
}
+/*
+ * Caller should make sure to call only if the
+ * per_cpu_ptr "global_wq_table" is non-NULL
+ * and has at least one wq configured.
+ */
+static struct idxd_wq *global_wq_table_next_wq(int cpu)
+{
+ struct wq_table_entry *entry = per_cpu_ptr(global_wq_table, cpu);
+ int *num_consec_descs = per_cpu_ptr(num_consec_descs_per_wq, cpu);
+
+ /*
+ * Fall-back to local IAA's wq if there were no global wqs configured
+ * for any IAA device, or if there were problems in setting up global
+ * wqs for this cpu's package.
+ */
+ if (!entry->wqs)
+ return wq_table_next_wq(cpu);
+
+ if ((*num_consec_descs) == g_consec_descs_per_gwq) {
+ if (++entry->cur_wq >= entry->n_wqs)
+ entry->cur_wq = 0;
+ *num_consec_descs = 0;
+ }
+
+ ++(*num_consec_descs);
+
+ return entry->wqs[entry->cur_wq];
+}
+
/*************************************************
* Core iaa_crypto compress/decompress functions.
*************************************************/
@@ -1563,6 +1899,7 @@ static int iaa_comp_acompress(struct acomp_req *req)
struct idxd_wq *wq;
struct device *dev;
int order = -1;
+ struct wq_table_entry *entry;
compression_ctx = crypto_tfm_ctx(tfm);
@@ -1581,8 +1918,15 @@ static int iaa_comp_acompress(struct acomp_req *req)
disable_async = true;
cpu = get_cpu();
- wq = wq_table_next_wq(cpu);
+ entry = per_cpu_ptr(global_wq_table, cpu);
+
+ if (!entry || !entry->wqs || entry->n_wqs == 0) {
+ wq = wq_table_next_wq(cpu);
+ } else {
+ wq = global_wq_table_next_wq(cpu);
+ }
put_cpu();
+
if (!wq) {
pr_debug("no wq configured for cpu=%d\n", cpu);
return -ENODEV;
@@ -2233,6 +2577,7 @@ static void iaa_crypto_remove(struct idxd_dev *idxd_dev)
if (nr_iaa == 0) {
iaa_crypto_enabled = false;
+ pkg_global_wq_tables_dealloc();
free_wq_tables();
BUG_ON(!list_empty(&iaa_devices));
INIT_LIST_HEAD(&iaa_devices);
@@ -2302,6 +2647,20 @@ static int __init iaa_crypto_init_module(void)
goto err_sync_attr_create;
}
+ ret = driver_create_file(&iaa_crypto_driver.drv,
+ &driver_attr_g_wqs_per_iaa);
+ if (ret) {
+ pr_debug("IAA g_wqs_per_iaa attr creation failed\n");
+ goto err_g_wqs_per_iaa_attr_create;
+ }
+
+ ret = driver_create_file(&iaa_crypto_driver.drv,
+ &driver_attr_g_consec_descs_per_gwq);
+ if (ret) {
+ pr_debug("IAA g_consec_descs_per_gwq attr creation failed\n");
+ goto err_g_consec_descs_per_gwq_attr_create;
+ }
+
if (iaa_crypto_debugfs_init())
pr_warn("debugfs init failed, stats not available\n");
@@ -2309,6 +2668,12 @@ static int __init iaa_crypto_init_module(void)
out:
return ret;
+err_g_consec_descs_per_gwq_attr_create:
+ driver_remove_file(&iaa_crypto_driver.drv,
+ &driver_attr_g_wqs_per_iaa);
+err_g_wqs_per_iaa_attr_create:
+ driver_remove_file(&iaa_crypto_driver.drv,
+ &driver_attr_sync_mode);
err_sync_attr_create:
driver_remove_file(&iaa_crypto_driver.drv,
&driver_attr_verify_compress);
@@ -2332,6 +2697,10 @@ static void __exit iaa_crypto_cleanup_module(void)
&driver_attr_sync_mode);
driver_remove_file(&iaa_crypto_driver.drv,
&driver_attr_verify_compress);
+ driver_remove_file(&iaa_crypto_driver.drv,
+ &driver_attr_g_wqs_per_iaa);
+ driver_remove_file(&iaa_crypto_driver.drv,
+ &driver_attr_g_consec_descs_per_gwq);
idxd_driver_unregister(&iaa_crypto_driver);
iaa_aecs_cleanup_fixed();
crypto_free_comp(deflate_generic_tfm);
--
2.27.0
* [PATCH v8 10/14] crypto: iaa - Descriptor allocation timeouts with mitigations in iaa_crypto.
2025-03-03 8:47 [PATCH v8 00/14] zswap IAA compress batching Kanchana P Sridhar
` (8 preceding siblings ...)
2025-03-03 8:47 ` [PATCH v8 09/14] crypto: iaa - Distribute compress jobs from all cores to all IAAs on a package Kanchana P Sridhar
@ 2025-03-03 8:47 ` Kanchana P Sridhar
2025-03-03 8:47 ` [PATCH v8 11/14] crypto: iaa - Fix for "deflate_generic_tfm" global being accessed without locks Kanchana P Sridhar
` (3 subsequent siblings)
13 siblings, 0 replies; 44+ messages in thread
From: Kanchana P Sridhar @ 2025-03-03 8:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
surenb, kristen.c.accardi
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch modifies the descriptor allocation from blocking to non-blocking
with bounded retries or "timeouts".
This is necessary to prevent task blocked errors in high contention
scenarios, for instance, when the platform has only 1 IAA device
enabled. With 1 IAA device enabled per package on a dual-package SPR with
56 cores/package, there are 112 logical cores mapped to this single IAA
device. In this scenario, the task blocked errors can occur because
idxd_alloc_desc() is called with IDXD_OP_BLOCK. Any process that is able to
obtain IAA_CRYPTO_MAX_BATCH_SIZE (8U) descriptors will cause contention in
descriptor allocation for all other processes. Under IDXD_OP_BLOCK,
this can cause compress/decompress jobs to stall in stress test scenarios
(e.g. zswap_store() of 2M folios).
In order to make the iaa_crypto driver more fail-safe, this commit
implements the following:
1) Change compress/decompress descriptor allocations to be non-blocking
with retries ("timeouts").
2) Return compress error to zswap if descriptor allocation with timeouts
fails during compress ops. zswap_store() will return an error and the
folio gets stored in the backing swap device.
3) Fallback to software decompress if descriptor allocation with timeouts
fails during decompress ops.
4) Bug fixes for freeing the descriptor consistently in all error cases.
With these fixes, there are no task blocked errors seen under stress
testing conditions, and no performance degradation observed.
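For reference, the bounded retry described in (1) follows this pattern. This
is a condensed sketch using a hypothetical helper; the actual patch
open-codes the loop in iaa_compress(), iaa_compress_verify() and
iaa_decompress(), bounded by IAA_ALLOC_DESC_COMP_TIMEOUT (compress) or
IAA_ALLOC_DESC_DECOMP_TIMEOUT (verify/decompress):

/* Hypothetical helper, shown only to illustrate the retry pattern. */
static struct idxd_desc *iaa_alloc_desc_bounded(struct idxd_wq *wq,
						int max_retries)
{
	struct idxd_desc *desc = ERR_PTR(-EAGAIN);
	int retries = 0;

	/* Non-blocking allocation, retried at most max_retries times. */
	while ((desc == ERR_PTR(-EAGAIN)) && (retries++ < max_retries)) {
		desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
		cpu_relax();
	}

	return desc;
}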
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
drivers/crypto/intel/iaa/iaa_crypto.h | 3 +
drivers/crypto/intel/iaa/iaa_crypto_main.c | 74 ++++++++++++----------
2 files changed, 45 insertions(+), 32 deletions(-)
diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index 5f38f530c33d..de14e5e2a017 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -21,6 +21,9 @@
#define IAA_COMPLETION_TIMEOUT 1000000
+#define IAA_ALLOC_DESC_COMP_TIMEOUT 1000
+#define IAA_ALLOC_DESC_DECOMP_TIMEOUT 500
+
#define IAA_ANALYTICS_ERROR 0x0a
#define IAA_ERROR_DECOMP_BUF_OVERFLOW 0x0b
#define IAA_ERROR_COMP_BUF_OVERFLOW 0x19
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index cb96897e7fed..7503fafca279 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1406,6 +1406,7 @@ static int deflate_generic_decompress(struct acomp_req *req)
void *src, *dst;
int ret;
+ req->dlen = PAGE_SIZE;
src = kmap_local_page(sg_page(req->src)) + req->src->offset;
dst = kmap_local_page(sg_page(req->dst)) + req->dst->offset;
@@ -1469,7 +1470,8 @@ static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
struct iaa_device_compression_mode *active_compression_mode;
struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
struct iaa_device *iaa_device;
- struct idxd_desc *idxd_desc;
+ struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
+ int alloc_desc_retries = 0;
struct iax_hw_desc *desc;
struct idxd_device *idxd;
struct iaa_wq *iaa_wq;
@@ -1485,7 +1487,11 @@ static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
- idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
+ while ((idxd_desc == ERR_PTR(-EAGAIN)) && (alloc_desc_retries++ < IAA_ALLOC_DESC_DECOMP_TIMEOUT)) {
+ idxd_desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
+ cpu_relax();
+ }
+
if (IS_ERR(idxd_desc)) {
dev_dbg(dev, "idxd descriptor allocation failed\n");
dev_dbg(dev, "iaa compress failed: ret=%ld\n",
@@ -1661,7 +1667,8 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
struct iaa_device_compression_mode *active_compression_mode;
struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
struct iaa_device *iaa_device;
- struct idxd_desc *idxd_desc;
+ struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
+ int alloc_desc_retries = 0;
struct iax_hw_desc *desc;
struct idxd_device *idxd;
struct iaa_wq *iaa_wq;
@@ -1677,7 +1684,11 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
- idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
+ while ((idxd_desc == ERR_PTR(-EAGAIN)) && (alloc_desc_retries++ < IAA_ALLOC_DESC_COMP_TIMEOUT)) {
+ idxd_desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
+ cpu_relax();
+ }
+
if (IS_ERR(idxd_desc)) {
dev_dbg(dev, "idxd descriptor allocation failed\n");
dev_dbg(dev, "iaa compress failed: ret=%ld\n", PTR_ERR(idxd_desc));
@@ -1753,15 +1764,10 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
*compression_crc = idxd_desc->iax_completion->crc;
- if (!ctx->async_mode || disable_async)
- idxd_free_desc(wq, idxd_desc);
-out:
- return ret;
err:
idxd_free_desc(wq, idxd_desc);
- dev_dbg(dev, "iaa compress failed: ret=%d\n", ret);
-
- goto out;
+out:
+ return ret;
}
static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
@@ -1773,7 +1779,8 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
struct iaa_device_compression_mode *active_compression_mode;
struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
struct iaa_device *iaa_device;
- struct idxd_desc *idxd_desc;
+ struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
+ int alloc_desc_retries = 0;
struct iax_hw_desc *desc;
struct idxd_device *idxd;
struct iaa_wq *iaa_wq;
@@ -1789,12 +1796,18 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
- idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
+ while ((idxd_desc == ERR_PTR(-EAGAIN)) && (alloc_desc_retries++ < IAA_ALLOC_DESC_DECOMP_TIMEOUT)) {
+ idxd_desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
+ cpu_relax();
+ }
+
if (IS_ERR(idxd_desc)) {
dev_dbg(dev, "idxd descriptor allocation failed\n");
dev_dbg(dev, "iaa decompress failed: ret=%ld\n",
PTR_ERR(idxd_desc));
- return PTR_ERR(idxd_desc);
+ ret = PTR_ERR(idxd_desc);
+ idxd_desc = NULL;
+ goto fallback_software_decomp;
}
desc = idxd_desc->iax_hw;
@@ -1837,7 +1850,7 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
ret = idxd_submit_desc(wq, idxd_desc);
if (ret) {
dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
- goto err;
+ goto fallback_software_decomp;
}
/* Update stats */
@@ -1851,19 +1864,20 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
}
ret = check_completion(dev, idxd_desc->iax_completion, false, false);
+
+fallback_software_decomp:
if (ret) {
- dev_dbg(dev, "%s: check_completion failed ret=%d\n", __func__, ret);
- if (idxd_desc->iax_completion->status == IAA_ANALYTICS_ERROR) {
+ dev_dbg(dev, "%s: desc allocation/submission/check_completion failed ret=%d\n", __func__, ret);
+ if (idxd_desc && idxd_desc->iax_completion->status == IAA_ANALYTICS_ERROR) {
pr_warn("%s: falling back to deflate-generic decompress, "
"analytics error code %x\n", __func__,
idxd_desc->iax_completion->error_code);
- ret = deflate_generic_decompress(req);
- if (ret) {
- dev_dbg(dev, "%s: deflate-generic failed ret=%d\n",
- __func__, ret);
- goto err;
- }
- } else {
+ }
+
+ ret = deflate_generic_decompress(req);
+
+ if (ret) {
+ pr_err("%s: iaa decompress failed: fallback to deflate-generic software decompress error ret=%d\n", __func__, ret);
goto err;
}
} else {
@@ -1872,19 +1886,15 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
*dlen = req->dlen;
- if (!ctx->async_mode || disable_async)
- idxd_free_desc(wq, idxd_desc);
-
/* Update stats */
update_total_decomp_bytes_in(slen);
update_wq_decomp_bytes(wq, slen);
+
+err:
+ if (idxd_desc)
+ idxd_free_desc(wq, idxd_desc);
out:
return ret;
-err:
- idxd_free_desc(wq, idxd_desc);
- dev_dbg(dev, "iaa decompress failed: ret=%d\n", ret);
-
- goto out;
}
static int iaa_comp_acompress(struct acomp_req *req)
--
2.27.0
* [PATCH v8 11/14] crypto: iaa - Fix for "deflate_generic_tfm" global being accessed without locks.
2025-03-03 8:47 [PATCH v8 00/14] zswap IAA compress batching Kanchana P Sridhar
` (9 preceding siblings ...)
2025-03-03 8:47 ` [PATCH v8 10/14] crypto: iaa - Descriptor allocation timeouts with mitigations in iaa_crypto Kanchana P Sridhar
@ 2025-03-03 8:47 ` Kanchana P Sridhar
2025-03-03 8:47 ` [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage Kanchana P Sridhar
` (2 subsequent siblings)
13 siblings, 0 replies; 44+ messages in thread
From: Kanchana P Sridhar @ 2025-03-03 8:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
surenb, kristen.c.accardi
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
The mainline implementation of "deflate_generic_decompress" has a bug in
the usage of this global variable:
static struct crypto_comp *deflate_generic_tfm;
The "deflate_generic_tfm" is allocated at module init time, and freed
during module cleanup. Any call to software decompress (for instance, when
descriptor allocation or job submission fails) can trigger this bug in the
deflate_generic_decompress() procedure. The problem is the
unprotected access of "deflate_generic_tfm" in this procedure. While
stress testing workloads under high memory pressure, with 1 IAA device
and "deflate-iaa" as the compressor, the descriptor allocation times out
and the software fallback route is taken. With multiple processes calling:
ret = crypto_comp_decompress(deflate_generic_tfm,
src, req->slen, dst, &req->dlen);
we end up with data corruption that results in req->dlen being larger than
PAGE_SIZE, upon which zswap_decompress() triggers a kernel BUG. This bug is
highly likely to manifest under high contention and memory pressure. It has
been resolved by adding a mutex that is locked before accessing
"deflate_generic_tfm" and unlocked after the crypto_comp call is done.
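Reconstructed from the hunks below (the lines elided between the two hunks
are the crypto_comp_decompress() call quoted above; minor details may
differ), the patched software-fallback path is effectively:

DEFINE_MUTEX(deflate_generic_tfm_lock);

static int deflate_generic_decompress(struct acomp_req *req)
{
	void *src, *dst;
	int ret;

	req->dlen = PAGE_SIZE;

	/* Serialize all users of the shared software decompress tfm. */
	mutex_lock(&deflate_generic_tfm_lock);

	src = kmap_local_page(sg_page(req->src)) + req->src->offset;
	dst = kmap_local_page(sg_page(req->dst)) + req->dst->offset;

	ret = crypto_comp_decompress(deflate_generic_tfm,
				     src, req->slen, dst, &req->dlen);

	kunmap_local(src);
	kunmap_local(dst);

	mutex_unlock(&deflate_generic_tfm_lock);

	update_total_sw_decomp_calls();

	return ret;
}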
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
drivers/crypto/intel/iaa/iaa_crypto_main.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 7503fafca279..2a994f307679 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -105,6 +105,7 @@ static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
LIST_HEAD(iaa_devices);
DEFINE_MUTEX(iaa_devices_lock);
+DEFINE_MUTEX(deflate_generic_tfm_lock);
/* If enabled, IAA hw crypto algos are registered, unavailable otherwise */
static bool iaa_crypto_enabled;
@@ -1407,6 +1408,9 @@ static int deflate_generic_decompress(struct acomp_req *req)
int ret;
req->dlen = PAGE_SIZE;
+
+ mutex_lock(&deflate_generic_tfm_lock);
+
src = kmap_local_page(sg_page(req->src)) + req->src->offset;
dst = kmap_local_page(sg_page(req->dst)) + req->dst->offset;
@@ -1416,6 +1420,8 @@ static int deflate_generic_decompress(struct acomp_req *req)
kunmap_local(src);
kunmap_local(dst);
+ mutex_unlock(&deflate_generic_tfm_lock);
+
update_total_sw_decomp_calls();
return ret;
--
2.27.0
* [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage.
2025-03-03 8:47 [PATCH v8 00/14] zswap IAA compress batching Kanchana P Sridhar
` (10 preceding siblings ...)
2025-03-03 8:47 ` [PATCH v8 11/14] crypto: iaa - Fix for "deflate_generic_tfm" global being accessed without locks Kanchana P Sridhar
@ 2025-03-03 8:47 ` Kanchana P Sridhar
2025-03-06 19:35 ` Yosry Ahmed
2025-03-06 19:42 ` [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage Yosry Ahmed
2025-03-03 8:47 ` [PATCH v8 13/14] mm: zswap: Allocate pool batching resources if the compressor supports batching Kanchana P Sridhar
2025-03-03 8:47 ` [PATCH v8 14/14] mm: zswap: Compress batching with request chaining in zswap_store() of large folios Kanchana P Sridhar
13 siblings, 2 replies; 44+ messages in thread
From: Kanchana P Sridhar @ 2025-03-03 8:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
surenb, kristen.c.accardi
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch modifies the acomp_ctx resources' lifetime to be from pool
creation to deletion. A "bool __online" and "u8 nr_reqs" are added to
"struct crypto_acomp_ctx" which simplify a few things:
1) zswap_pool_create() will initialize all members of each percpu acomp_ctx
to 0 or NULL and only then initialize the mutex.
2) CPU hotplug will set nr_reqs to 1, allocate resources and set __online
to true, without locking the mutex.
3) CPU hotunplug will lock the mutex before setting __online to false. It
will not delete any resources.
4) acomp_ctx_get_cpu_lock() will lock the mutex, then check if __online
is true, and if so, return the mutex for use in zswap compress and
decompress ops.
5) CPU onlining after offlining will simply check if either __online or
nr_reqs is non-zero, and return 0 if so, without re-allocating the
resources.
6) zswap_pool_destroy() will call a newly added zswap_cpu_comp_dealloc() to
delete the acomp_ctx resources.
7) Common resource deletion code in case of zswap_cpu_comp_prepare()
errors, and for use in zswap_cpu_comp_dealloc(), is factored into a new
acomp_ctx_dealloc().
The CPU hot[un]plug callback functions are moved to "pool functions"
accordingly.
The per-cpu memory cost of not deleting the acomp_ctx resources upon CPU
offlining, and only deleting them when the pool is destroyed, is as follows:
IAA with batching: 64.8 KB
Software compressors: 8.2 KB
I would appreciate code review comments on whether this memory cost is
acceptable, given the latency improvement it provides via a faster reclaim
restart after a CPU hotunplug-hotplug sequence: all that the hotplug code
needs to do is check whether acomp_ctx->nr_reqs is non-zero and, if so, set
__online to true and return, after which reclaim can proceed.
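With this scheme, the reclaim-side lock acquisition reduces to the following
(condensed from the diff below, with the long explanatory comment
shortened):

static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
{
	struct crypto_acomp_ctx *acomp_ctx;

	for (;;) {
		acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);

		mutex_lock(&acomp_ctx->mutex);
		/*
		 * __online is false only in the window between CPU offlining
		 * and re-onlining; the resources themselves stay allocated
		 * until the pool is destroyed, so retrying is safe.
		 */
		if (likely(acomp_ctx->__online))
			return acomp_ctx;

		mutex_unlock(&acomp_ctx->mutex);
	}
}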
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
mm/zswap.c | 273 +++++++++++++++++++++++++++++++++++------------------
1 file changed, 182 insertions(+), 91 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index 10f2a16e7586..cff96df1df8b 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -144,10 +144,12 @@ bool zswap_never_enabled(void)
struct crypto_acomp_ctx {
struct crypto_acomp *acomp;
struct acomp_req *req;
- struct crypto_wait wait;
u8 *buffer;
+ u8 nr_reqs;
+ struct crypto_wait wait;
struct mutex mutex;
bool is_sleepable;
+ bool __online;
};
/*
@@ -246,6 +248,122 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
**********************************/
static void __zswap_pool_empty(struct percpu_ref *ref);
+static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
+{
+ if (!IS_ERR_OR_NULL(acomp_ctx) && acomp_ctx->nr_reqs) {
+
+ if (!IS_ERR_OR_NULL(acomp_ctx->req))
+ acomp_request_free(acomp_ctx->req);
+ acomp_ctx->req = NULL;
+
+ kfree(acomp_ctx->buffer);
+ acomp_ctx->buffer = NULL;
+
+ if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
+ crypto_free_acomp(acomp_ctx->acomp);
+
+ acomp_ctx->nr_reqs = 0;
+ }
+}
+
+static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
+{
+ struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
+ struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
+ int ret = -ENOMEM;
+
+ /*
+ * Just to be even more fail-safe against changes in assumptions and/or
+ * implementation of the CPU hotplug code.
+ */
+ if (acomp_ctx->__online)
+ return 0;
+
+ if (acomp_ctx->nr_reqs) {
+ acomp_ctx->__online = true;
+ return 0;
+ }
+
+ acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
+ if (IS_ERR(acomp_ctx->acomp)) {
+ pr_err("could not alloc crypto acomp %s : %ld\n",
+ pool->tfm_name, PTR_ERR(acomp_ctx->acomp));
+ ret = PTR_ERR(acomp_ctx->acomp);
+ goto fail;
+ }
+
+ acomp_ctx->nr_reqs = 1;
+
+ acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
+ if (!acomp_ctx->req) {
+ pr_err("could not alloc crypto acomp_request %s\n",
+ pool->tfm_name);
+ ret = -ENOMEM;
+ goto fail;
+ }
+
+ acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
+ if (!acomp_ctx->buffer) {
+ ret = -ENOMEM;
+ goto fail;
+ }
+
+ crypto_init_wait(&acomp_ctx->wait);
+
+ /*
+ * if the backend of acomp is async zip, crypto_req_done() will wakeup
+ * crypto_wait_req(); if the backend of acomp is scomp, the callback
+ * won't be called, crypto_wait_req() will return without blocking.
+ */
+ acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+ crypto_req_done, &acomp_ctx->wait);
+
+ acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp);
+
+ acomp_ctx->__online = true;
+
+ return 0;
+
+fail:
+ acomp_ctx_dealloc(acomp_ctx);
+
+ return ret;
+}
+
+static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
+{
+ struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
+ struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
+
+ mutex_lock(&acomp_ctx->mutex);
+ acomp_ctx->__online = false;
+ mutex_unlock(&acomp_ctx->mutex);
+
+ return 0;
+}
+
+static void zswap_cpu_comp_dealloc(unsigned int cpu, struct hlist_node *node)
+{
+ struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
+ struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
+
+ /*
+ * The lifetime of acomp_ctx resources is from pool creation to
+ * pool deletion.
+ *
+ * Reclaims should not be happening because, we get to this routine only
+ * in two scenarios:
+ *
+ * 1) pool creation failures before/during the pool ref initialization.
+ * 2) we are in the process of releasing the pool, it is off the
+ * zswap_pools list and has no references.
+ *
+ * Hence, there is no need for locks.
+ */
+ acomp_ctx->__online = false;
+ acomp_ctx_dealloc(acomp_ctx);
+}
+
static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
{
struct zswap_pool *pool;
@@ -285,13 +403,21 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
goto error;
}
- for_each_possible_cpu(cpu)
- mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
+ for_each_possible_cpu(cpu) {
+ struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
+
+ acomp_ctx->acomp = NULL;
+ acomp_ctx->req = NULL;
+ acomp_ctx->buffer = NULL;
+ acomp_ctx->__online = false;
+ acomp_ctx->nr_reqs = 0;
+ mutex_init(&acomp_ctx->mutex);
+ }
ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
&pool->node);
if (ret)
- goto error;
+ goto ref_fail;
/* being the current pool takes 1 ref; this func expects the
* caller to always add the new pool as the current pool
@@ -307,6 +433,9 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
return pool;
ref_fail:
+ for_each_possible_cpu(cpu)
+ zswap_cpu_comp_dealloc(cpu, &pool->node);
+
cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
error:
if (pool->acomp_ctx)
@@ -361,8 +490,13 @@ static struct zswap_pool *__zswap_pool_create_fallback(void)
static void zswap_pool_destroy(struct zswap_pool *pool)
{
+ int cpu;
+
zswap_pool_debug("destroying", pool);
+ for_each_possible_cpu(cpu)
+ zswap_cpu_comp_dealloc(cpu, &pool->node);
+
cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
free_percpu(pool->acomp_ctx);
@@ -816,85 +950,6 @@ static void zswap_entry_free(struct zswap_entry *entry)
/*********************************
* compressed storage functions
**********************************/
-static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
-{
- struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
- struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
- struct crypto_acomp *acomp = NULL;
- struct acomp_req *req = NULL;
- u8 *buffer = NULL;
- int ret;
-
- buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
- if (!buffer) {
- ret = -ENOMEM;
- goto fail;
- }
-
- acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
- if (IS_ERR(acomp)) {
- pr_err("could not alloc crypto acomp %s : %ld\n",
- pool->tfm_name, PTR_ERR(acomp));
- ret = PTR_ERR(acomp);
- goto fail;
- }
-
- req = acomp_request_alloc(acomp);
- if (!req) {
- pr_err("could not alloc crypto acomp_request %s\n",
- pool->tfm_name);
- ret = -ENOMEM;
- goto fail;
- }
-
- /*
- * Only hold the mutex after completing allocations, otherwise we may
- * recurse into zswap through reclaim and attempt to hold the mutex
- * again resulting in a deadlock.
- */
- mutex_lock(&acomp_ctx->mutex);
- crypto_init_wait(&acomp_ctx->wait);
-
- /*
- * if the backend of acomp is async zip, crypto_req_done() will wakeup
- * crypto_wait_req(); if the backend of acomp is scomp, the callback
- * won't be called, crypto_wait_req() will return without blocking.
- */
- acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
- crypto_req_done, &acomp_ctx->wait);
-
- acomp_ctx->buffer = buffer;
- acomp_ctx->acomp = acomp;
- acomp_ctx->is_sleepable = acomp_is_async(acomp);
- acomp_ctx->req = req;
- mutex_unlock(&acomp_ctx->mutex);
- return 0;
-
-fail:
- if (acomp)
- crypto_free_acomp(acomp);
- kfree(buffer);
- return ret;
-}
-
-static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
-{
- struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
- struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
-
- mutex_lock(&acomp_ctx->mutex);
- if (!IS_ERR_OR_NULL(acomp_ctx)) {
- if (!IS_ERR_OR_NULL(acomp_ctx->req))
- acomp_request_free(acomp_ctx->req);
- acomp_ctx->req = NULL;
- if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
- crypto_free_acomp(acomp_ctx->acomp);
- kfree(acomp_ctx->buffer);
- }
- mutex_unlock(&acomp_ctx->mutex);
-
- return 0;
-}
static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
{
@@ -902,16 +957,52 @@ static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
for (;;) {
acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
- mutex_lock(&acomp_ctx->mutex);
- if (likely(acomp_ctx->req))
- return acomp_ctx;
/*
- * It is possible that we were migrated to a different CPU after
- * getting the per-CPU ctx but before the mutex was acquired. If
- * the old CPU got offlined, zswap_cpu_comp_dead() could have
- * already freed ctx->req (among other things) and set it to
- * NULL. Just try again on the new CPU that we ended up on.
+ * If the CPU onlining code successfully allocates acomp_ctx resources,
+ * it sets acomp_ctx->__online to true. Until this happens, we have
+ * two options:
+ *
+ * 1. Return NULL and fail all stores on this CPU.
+ * 2. Retry, until onlining has finished allocating resources.
+ *
+ * In theory, option 1 could be more appropriate, because it
+ * allows the calling procedure to decide how it wants to handle
+ * reclaim racing with CPU hotplug. For instance, it might be Ok
+ * for compress to return an error for the backing swap device
+ * to store the folio. Decompress could wait until we get a
+ * valid and locked mutex after onlining has completed. For now,
+ * we go with option 2 because adding a do-while in
+ * zswap_decompress() adds latency for software compressors.
+ *
+ * Once initialized, the resources will be de-allocated only
+ * when the pool is destroyed. The acomp_ctx will hold on to the
+ * resources through CPU offlining/onlining at any time until
+ * the pool is destroyed.
+ *
+ * This prevents races/deadlocks between reclaim and CPU acomp_ctx
+ * resource allocation that are a dependency for reclaim.
+ * It further simplifies the interaction with CPU onlining and
+ * offlining:
+ *
+ * - CPU onlining does not take the mutex. It only allocates
+ * resources and sets __online to true.
+ * - CPU offlining acquires the mutex before setting
+ * __online to false. If reclaim has acquired the mutex,
+ * offlining will have to wait for reclaim to complete before
+ * hotunplug can proceed. Further, hotplug merely sets
+ * __online to false. It does not delete the acomp_ctx
+ * resources.
+ *
+ * Option 1 is better than potentially not exiting the earlier
+ * for (;;) loop because the system is running low on memory
+ * and/or CPUs are getting offlined for whatever reason. At
+ * least failing this store will prevent data loss by failing
+ * zswap_store(), and saving the data in the backing swap device.
*/
+ mutex_lock(&acomp_ctx->mutex);
+ if (likely(acomp_ctx->__online))
+ return acomp_ctx;
+
mutex_unlock(&acomp_ctx->mutex);
}
}
--
2.27.0
* [PATCH v8 13/14] mm: zswap: Allocate pool batching resources if the compressor supports batching.
2025-03-03 8:47 [PATCH v8 00/14] zswap IAA compress batching Kanchana P Sridhar
` (11 preceding siblings ...)
2025-03-03 8:47 ` [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage Kanchana P Sridhar
@ 2025-03-03 8:47 ` Kanchana P Sridhar
2025-03-06 20:00 ` Yosry Ahmed
2025-03-03 8:47 ` [PATCH v8 14/14] mm: zswap: Compress batching with request chaining in zswap_store() of large folios Kanchana P Sridhar
13 siblings, 1 reply; 44+ messages in thread
From: Kanchana P Sridhar @ 2025-03-03 8:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
surenb, kristen.c.accardi
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch adds support for the per-CPU acomp_ctx to track multiple
compression/decompression requests and multiple compression destination
buffers. The zswap_cpu_comp_prepare() CPU onlining code queries the
maximum batch-size the compressor supports. If the compressor supports
batching, it allocates the necessary batching resources.
However, zswap does not use more than one request yet. Follow-up patches
will actually utilize the multiple acomp_ctx requests/buffers for batch
compression/decompression of multiple pages.
The newly added ZSWAP_MAX_BATCH_SIZE limits the amount of extra memory used
for batching. There is a small extra memory overhead of allocating the
"reqs" and "buffers" arrays for compressors that do not support batching.
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
mm/zswap.c | 99 +++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 69 insertions(+), 30 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index cff96df1df8b..fae59d6d5147 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -78,6 +78,16 @@ static bool zswap_pool_reached_full;
#define ZSWAP_PARAM_UNSET ""
+/*
+ * For compression batching of large folios:
+ * Maximum number of acomp compress requests that will be processed
+ * in a batch, iff the zswap compressor supports batching.
+ * This limit exists because we preallocate enough requests and buffers
+ * in the per-cpu acomp_ctx accordingly. Hence, a higher limit means higher
+ * memory usage.
+ */
+#define ZSWAP_MAX_BATCH_SIZE 8U
+
static int zswap_setup(void);
/* Enable/disable zswap */
@@ -143,8 +153,8 @@ bool zswap_never_enabled(void)
struct crypto_acomp_ctx {
struct crypto_acomp *acomp;
- struct acomp_req *req;
- u8 *buffer;
+ struct acomp_req **reqs;
+ u8 **buffers;
u8 nr_reqs;
struct crypto_wait wait;
struct mutex mutex;
@@ -251,13 +261,22 @@ static void __zswap_pool_empty(struct percpu_ref *ref);
static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
{
if (!IS_ERR_OR_NULL(acomp_ctx) && acomp_ctx->nr_reqs) {
+ u8 i;
+
+ if (acomp_ctx->reqs) {
+ for (i = 0; i < acomp_ctx->nr_reqs; ++i)
+ if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
+ acomp_request_free(acomp_ctx->reqs[i]);
+ kfree(acomp_ctx->reqs);
+ acomp_ctx->reqs = NULL;
+ }
- if (!IS_ERR_OR_NULL(acomp_ctx->req))
- acomp_request_free(acomp_ctx->req);
- acomp_ctx->req = NULL;
-
- kfree(acomp_ctx->buffer);
- acomp_ctx->buffer = NULL;
+ if (acomp_ctx->buffers) {
+ for (i = 0; i < acomp_ctx->nr_reqs; ++i)
+ kfree(acomp_ctx->buffers[i]);
+ kfree(acomp_ctx->buffers);
+ acomp_ctx->buffers = NULL;
+ }
if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
crypto_free_acomp(acomp_ctx->acomp);
@@ -271,6 +290,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
int ret = -ENOMEM;
+ u8 i;
/*
* Just to be even more fail-safe against changes in assumptions and/or
@@ -292,22 +312,41 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
goto fail;
}
- acomp_ctx->nr_reqs = 1;
+ acomp_ctx->nr_reqs = min(ZSWAP_MAX_BATCH_SIZE,
+ crypto_acomp_batch_size(acomp_ctx->acomp));
- acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
- if (!acomp_ctx->req) {
- pr_err("could not alloc crypto acomp_request %s\n",
- pool->tfm_name);
- ret = -ENOMEM;
+ acomp_ctx->reqs = kcalloc_node(acomp_ctx->nr_reqs, sizeof(struct acomp_req *),
+ GFP_KERNEL, cpu_to_node(cpu));
+ if (!acomp_ctx->reqs)
goto fail;
+
+ for (i = 0; i < acomp_ctx->nr_reqs; ++i) {
+ acomp_ctx->reqs[i] = acomp_request_alloc(acomp_ctx->acomp);
+ if (!acomp_ctx->reqs[i]) {
+ pr_err("could not alloc crypto acomp_request reqs[%d] %s\n",
+ i, pool->tfm_name);
+ goto fail;
+ }
}
- acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
- if (!acomp_ctx->buffer) {
- ret = -ENOMEM;
+ acomp_ctx->buffers = kcalloc_node(acomp_ctx->nr_reqs, sizeof(u8 *),
+ GFP_KERNEL, cpu_to_node(cpu));
+ if (!acomp_ctx->buffers)
goto fail;
+
+ for (i = 0; i < acomp_ctx->nr_reqs; ++i) {
+ acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
+ cpu_to_node(cpu));
+ if (!acomp_ctx->buffers[i])
+ goto fail;
}
+ /*
+ * The crypto_wait is used only in fully synchronous, i.e., with scomp
+ * or non-poll mode of acomp, hence there is only one "wait" per
+ * acomp_ctx, with callback set to reqs[0], under the assumption that
+ * there is at least 1 request per acomp_ctx.
+ */
crypto_init_wait(&acomp_ctx->wait);
/*
@@ -315,7 +354,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
* crypto_wait_req(); if the backend of acomp is scomp, the callback
* won't be called, crypto_wait_req() will return without blocking.
*/
- acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+ acomp_request_set_callback(acomp_ctx->reqs[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
crypto_req_done, &acomp_ctx->wait);
acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp);
@@ -407,8 +446,8 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
acomp_ctx->acomp = NULL;
- acomp_ctx->req = NULL;
- acomp_ctx->buffer = NULL;
+ acomp_ctx->reqs = NULL;
+ acomp_ctx->buffers = NULL;
acomp_ctx->__online = false;
acomp_ctx->nr_reqs = 0;
mutex_init(&acomp_ctx->mutex);
@@ -1026,7 +1065,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
u8 *dst;
acomp_ctx = acomp_ctx_get_cpu_lock(pool);
- dst = acomp_ctx->buffer;
+ dst = acomp_ctx->buffers[0];
sg_init_table(&input, 1);
sg_set_page(&input, page, PAGE_SIZE, 0);
@@ -1036,7 +1075,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
* giving the dst buffer with enough length to avoid buffer overflow.
*/
sg_init_one(&output, dst, PAGE_SIZE * 2);
- acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
+ acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, PAGE_SIZE, dlen);
/*
* it maybe looks a little bit silly that we send an asynchronous request,
@@ -1050,8 +1089,8 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
* but in different threads running on different cpu, we have different
* acomp instance, so multiple threads can do (de)compression in parallel.
*/
- comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
- dlen = acomp_ctx->req->dlen;
+ comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
+ dlen = acomp_ctx->reqs[0]->dlen;
if (comp_ret)
goto unlock;
@@ -1102,19 +1141,19 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
*/
if ((acomp_ctx->is_sleepable && !zpool_can_sleep_mapped(zpool)) ||
!virt_addr_valid(src)) {
- memcpy(acomp_ctx->buffer, src, entry->length);
- src = acomp_ctx->buffer;
+ memcpy(acomp_ctx->buffers[0], src, entry->length);
+ src = acomp_ctx->buffers[0];
zpool_unmap_handle(zpool, entry->handle);
}
sg_init_one(&input, src, entry->length);
sg_init_table(&output, 1);
sg_set_folio(&output, folio, PAGE_SIZE, 0);
- acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
- BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
- BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
+ acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, entry->length, PAGE_SIZE);
+ BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->reqs[0]), &acomp_ctx->wait));
+ BUG_ON(acomp_ctx->reqs[0]->dlen != PAGE_SIZE);
- if (src != acomp_ctx->buffer)
+ if (src != acomp_ctx->buffers[0])
zpool_unmap_handle(zpool, entry->handle);
acomp_ctx_put_unlock(acomp_ctx);
}
--
2.27.0
* [PATCH v8 14/14] mm: zswap: Compress batching with request chaining in zswap_store() of large folios.
2025-03-03 8:47 [PATCH v8 00/14] zswap IAA compress batching Kanchana P Sridhar
` (12 preceding siblings ...)
2025-03-03 8:47 ` [PATCH v8 13/14] mm: zswap: Allocate pool batching resources if the compressor supports batching Kanchana P Sridhar
@ 2025-03-03 8:47 ` Kanchana P Sridhar
2025-03-03 11:07 ` kernel test robot
` (2 more replies)
13 siblings, 3 replies; 44+ messages in thread
From: Kanchana P Sridhar @ 2025-03-03 8:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
surenb, kristen.c.accardi
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch introduces three new procedures:
zswap_store_folio()
zswap_store_pages()
zswap_batch_compress()
zswap_store_folio() stores a folio in chunks of "batch_size" pages. If the
compressor supports batching, i.e., the acomp_ctx has multiple requests,
the batch_size will be acomp_ctx->nr_reqs. If the compressor does not
support batching, the batch_size will be ZSWAP_MAX_BATCH_SIZE. Whether the
compressor has multiple acomp_ctx requests is passed as a bool "batching"
parameter to zswap_store_pages() and zswap_batch_compress().
This refactoring moves the loop over the folio's pages from zswap_store()
to zswap_store_folio(), and also enables batching.
zswap_store_pages() implements, for a batch of multiple pages in a folio,
all the work done earlier by zswap_store_page() for a single page.
zswap_store_pages() starts by allocating all zswap entries
required to store the batch. Next, it calls zswap_batch_compress() to
compress the batch. Finally, it adds the batch's zswap entries to the
xarray and LRU, charges zswap memory and increments zswap stats.
The error handling and cleanup required for all failure scenarios that can
occur while storing a batch in zswap are consolidated to a single
"store_pages_failed" label in zswap_store_pages().
And finally, this patch introduces zswap_batch_compress(), which does the
following:
- If the compressor supports batching, sets up a request chain for
compressing the batch in one shot. If Intel IAA is the zswap
compressor, the request chain will be compressed in parallel in
hardware. If all requests in the chain are compressed without errors,
the compressed buffers are then stored in zpool.
- If the compressor does not support batching, each page in the batch is
compressed and stored sequentially. zswap_batch_compress() replaces
zswap_compress() in both cases, eliminating code duplication and keeping
the code maintainable as batching is introduced.
The call to the crypto layer is exactly the same in both cases: when batch
compressing a request chain or when sequentially compressing each page in
the batch.
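As a rough sketch of the flow described above (the acomp_ctx mutex is
assumed to be held by the caller, as required by zswap_store_pages() and
zswap_batch_compress(); the exact signature, locking placement and error
unwinding may differ from this sketch):

/*
 * Simplified sketch only: shows how batch_size is chosen and how the
 * folio is walked in chunks; details may differ from the actual patch.
 */
static bool zswap_store_folio(struct folio *folio, struct obj_cgroup *objcg,
			      struct zswap_pool *pool,
			      struct crypto_acomp_ctx *acomp_ctx)
{
	/* Compressors with multiple reqs batch-compress via request chaining. */
	bool batching = acomp_ctx->nr_reqs > 1;
	unsigned int batch_size = batching ? acomp_ctx->nr_reqs :
					     ZSWAP_MAX_BATCH_SIZE;
	long nr_pages = folio_nr_pages(folio), si, ei;

	for (si = 0; si < nr_pages; si += batch_size) {
		ei = min_t(long, si + batch_size, nr_pages) - 1;

		if (!zswap_store_pages(folio, si, ei, objcg, pool,
				       acomp_ctx, batching))
			return false;
	}

	return true;
}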
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
mm/zswap.c | 396 ++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 270 insertions(+), 126 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index fae59d6d5147..135d5792ce50 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1051,74 +1051,141 @@ static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx)
mutex_unlock(&acomp_ctx->mutex);
}
-static bool zswap_compress(struct page *page, struct zswap_entry *entry,
- struct zswap_pool *pool)
+/*
+ * Unified code paths for compressors that do and do not support
+ * batching. This procedure will compress multiple @nr_pages in @folio,
+ * starting from @index.
+ * If @batching is set to true, it will create a request chain for
+ * compression batching. It is assumed that the caller has verified
+ * that the acomp_ctx->nr_reqs is at least @nr_pages.
+ * If @batching is set to false, it will process each page sequentially.
+ * In both cases, if all compressions were successful, it will proceed
+ * to store the compressed buffers in zpool.
+ */
+static bool zswap_batch_compress(struct folio *folio,
+ long index,
+ unsigned int nr_pages,
+ struct zswap_entry *entries[],
+ struct zswap_pool *pool,
+ struct crypto_acomp_ctx *acomp_ctx,
+ bool batching)
{
- struct crypto_acomp_ctx *acomp_ctx;
- struct scatterlist input, output;
- int comp_ret = 0, alloc_ret = 0;
- unsigned int dlen = PAGE_SIZE;
- unsigned long handle;
- struct zpool *zpool;
- char *buf;
+ struct scatterlist inputs[ZSWAP_MAX_BATCH_SIZE];
+ struct scatterlist outputs[ZSWAP_MAX_BATCH_SIZE];
+ struct zpool *zpool = pool->zpool;
+ int acomp_idx = 0, nr_to_store = 1;
+ unsigned int i, j;
+ int err = 0;
gfp_t gfp;
- u8 *dst;
- acomp_ctx = acomp_ctx_get_cpu_lock(pool);
- dst = acomp_ctx->buffers[0];
- sg_init_table(&input, 1);
- sg_set_page(&input, page, PAGE_SIZE, 0);
+ lockdep_assert_held(&acomp_ctx->mutex);
- /*
- * We need PAGE_SIZE * 2 here since there maybe over-compression case,
- * and hardware-accelerators may won't check the dst buffer size, so
- * giving the dst buffer with enough length to avoid buffer overflow.
- */
- sg_init_one(&output, dst, PAGE_SIZE * 2);
- acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, PAGE_SIZE, dlen);
-
- /*
- * it maybe looks a little bit silly that we send an asynchronous request,
- * then wait for its completion synchronously. This makes the process look
- * synchronous in fact.
- * Theoretically, acomp supports users send multiple acomp requests in one
- * acomp instance, then get those requests done simultaneously. but in this
- * case, zswap actually does store and load page by page, there is no
- * existing method to send the second page before the first page is done
- * in one thread doing zwap.
- * but in different threads running on different cpu, we have different
- * acomp instance, so multiple threads can do (de)compression in parallel.
- */
- comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
- dlen = acomp_ctx->reqs[0]->dlen;
- if (comp_ret)
- goto unlock;
-
- zpool = pool->zpool;
gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
if (zpool_malloc_support_movable(zpool))
gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
- alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle);
- if (alloc_ret)
- goto unlock;
- buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
- memcpy(buf, dst, dlen);
- zpool_unmap_handle(zpool, handle);
+ for (i = 0; i < nr_pages; ++i) {
+ struct page *page = folio_page(folio, index + i);
- entry->handle = handle;
- entry->length = dlen;
+ sg_init_table(&inputs[acomp_idx], 1);
+ sg_set_page(&inputs[acomp_idx], page, PAGE_SIZE, 0);
-unlock:
- if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
- zswap_reject_compress_poor++;
- else if (comp_ret)
- zswap_reject_compress_fail++;
- else if (alloc_ret)
- zswap_reject_alloc_fail++;
+ /*
+ * Each dst buffer should be of size (PAGE_SIZE * 2).
+ * Reflect same in sg_list.
+ */
+ sg_init_one(&outputs[acomp_idx], acomp_ctx->buffers[acomp_idx], PAGE_SIZE * 2);
+ acomp_request_set_params(acomp_ctx->reqs[acomp_idx], &inputs[acomp_idx],
+ &outputs[acomp_idx], PAGE_SIZE, PAGE_SIZE);
+
+ if (batching) {
+ /* Add the acomp request to the chain. */
+ if (likely(i))
+ acomp_request_chain(acomp_ctx->reqs[acomp_idx], acomp_ctx->reqs[0]);
+ else
+ acomp_reqchain_init(acomp_ctx->reqs[0], 0, crypto_req_done,
+ &acomp_ctx->wait);
+
+ if (i == (nr_pages - 1)) {
+ /* Process the request chain. */
+ err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
+
+ /*
+ * Get the individual compress errors from request chaining.
+ */
+ for (j = 0; j < nr_pages; ++j) {
+ if (unlikely(acomp_request_err(acomp_ctx->reqs[j]))) {
+ err = -EINVAL;
+ if (acomp_request_err(acomp_ctx->reqs[j]) == -ENOSPC)
+ zswap_reject_compress_poor++;
+ else
+ zswap_reject_compress_fail++;
+ }
+ }
+ /*
+ * Request chaining cleanup:
+ *
+ * - Clear the CRYPTO_TFM_REQ_CHAIN bit on acomp_ctx->reqs[0].
+ * - Reset the acomp_ctx->wait to notify acomp_ctx->reqs[0].
+ */
+ acomp_reqchain_clear(acomp_ctx->reqs[0], &acomp_ctx->wait);
+ if (unlikely(err))
+ return false;
+ j = 0;
+ nr_to_store = nr_pages;
+ goto store_zpool;
+ }
+
+ ++acomp_idx;
+ continue;
+ } else {
+ err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
+
+ if (unlikely(err)) {
+ if (err == -ENOSPC)
+ zswap_reject_compress_poor++;
+ else
+ zswap_reject_compress_fail++;
+ return false;
+ }
+ j = i;
+ nr_to_store = 1;
+ }
- acomp_ctx_put_unlock(acomp_ctx);
- return comp_ret == 0 && alloc_ret == 0;
+store_zpool:
+ /*
+ * All batch pages were successfully compressed.
+ * Store the pages in zpool.
+ */
+ acomp_idx = -1;
+ while (nr_to_store--) {
+ unsigned long handle;
+ char *buf;
+
+ ++acomp_idx;
+ prefetchw(entries[j]);
+ err = zpool_malloc(zpool, acomp_ctx->reqs[acomp_idx]->dlen, gfp, &handle);
+
+ if (unlikely(err)) {
+ if (err == -ENOSPC)
+ zswap_reject_compress_poor++;
+ else
+ zswap_reject_alloc_fail++;
+
+ return false;
+ }
+
+ buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
+ memcpy(buf, acomp_ctx->buffers[acomp_idx], acomp_ctx->reqs[acomp_idx]->dlen);
+ zpool_unmap_handle(zpool, handle);
+
+ entries[j]->handle = handle;
+ entries[j]->length = acomp_ctx->reqs[acomp_idx]->dlen;
+ ++j;
+ }
+ }
+
+ return true;
}
static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
@@ -1581,84 +1648,165 @@ static void shrink_worker(struct work_struct *w)
* main API
**********************************/
-static bool zswap_store_page(struct page *page,
- struct obj_cgroup *objcg,
- struct zswap_pool *pool)
+/*
+ * Store multiple pages in @folio, starting from the page at index @si up to
+ * and including the page at index @ei.
+ *
+ * The error handling from all failure points is consolidated to the
+ * "store_pages_failed" label, based on the initialization of the zswap entries'
+ * handles to ERR_PTR(-EINVAL) at allocation time, and the fact that the
+ * entry's handle is subsequently modified only upon a successful zpool_malloc()
+ * after the page is compressed.
+ */
+static bool zswap_store_pages(struct folio *folio,
+ long si,
+ long ei,
+ struct obj_cgroup *objcg,
+ struct zswap_pool *pool,
+ struct crypto_acomp_ctx *acomp_ctx,
+ bool batching)
{
- swp_entry_t page_swpentry = page_swap_entry(page);
- struct zswap_entry *entry, *old;
+ struct zswap_entry *entries[ZSWAP_MAX_BATCH_SIZE];
+ int node_id = folio_nid(folio);
+ u8 i, from_i = 0, nr_pages = ei - si + 1;
- /* allocate entry */
- entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
- if (!entry) {
- zswap_reject_kmemcache_fail++;
- return false;
+ for (i = 0; i < nr_pages; ++i) {
+ entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
+
+ if (unlikely(!entries[i])) {
+ zswap_reject_kmemcache_fail++;
+ nr_pages = i;
+ goto store_pages_failed;
+ }
+
+ entries[i]->handle = (unsigned long)ERR_PTR(-EINVAL);
}
- if (!zswap_compress(page, entry, pool))
- goto compress_failed;
+ if (!zswap_batch_compress(folio, si, nr_pages, entries, pool, acomp_ctx, batching))
+ goto store_pages_failed;
- old = xa_store(swap_zswap_tree(page_swpentry),
- swp_offset(page_swpentry),
- entry, GFP_KERNEL);
- if (xa_is_err(old)) {
- int err = xa_err(old);
+ for (i = 0; i < nr_pages; ++i) {
+ swp_entry_t page_swpentry = page_swap_entry(folio_page(folio, si + i));
+ struct zswap_entry *old, *entry = entries[i];
- WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
- zswap_reject_alloc_fail++;
- goto store_failed;
- }
+ old = xa_store(swap_zswap_tree(page_swpentry),
+ swp_offset(page_swpentry),
+ entry, GFP_KERNEL);
+ if (unlikely(xa_is_err(old))) {
+ int err = xa_err(old);
- /*
- * We may have had an existing entry that became stale when
- * the folio was redirtied and now the new version is being
- * swapped out. Get rid of the old.
- */
- if (old)
- zswap_entry_free(old);
+ WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
+ zswap_reject_alloc_fail++;
+ from_i = i;
+ goto store_pages_failed;
+ }
- /*
- * The entry is successfully compressed and stored in the tree, there is
- * no further possibility of failure. Grab refs to the pool and objcg,
- * charge zswap memory, and increment zswap_stored_pages.
- * The opposite actions will be performed by zswap_entry_free()
- * when the entry is removed from the tree.
- */
- zswap_pool_get(pool);
- if (objcg) {
- obj_cgroup_get(objcg);
- obj_cgroup_charge_zswap(objcg, entry->length);
- }
- atomic_long_inc(&zswap_stored_pages);
+ /*
+ * We may have had an existing entry that became stale when
+ * the folio was redirtied and now the new version is being
+ * swapped out. Get rid of the old.
+ */
+ if (unlikely(old))
+ zswap_entry_free(old);
- /*
- * We finish initializing the entry while it's already in xarray.
- * This is safe because:
- *
- * 1. Concurrent stores and invalidations are excluded by folio lock.
- *
- * 2. Writeback is excluded by the entry not being on the LRU yet.
- * The publishing order matters to prevent writeback from seeing
- * an incoherent entry.
- */
- entry->pool = pool;
- entry->swpentry = page_swpentry;
- entry->objcg = objcg;
- entry->referenced = true;
- if (entry->length) {
- INIT_LIST_HEAD(&entry->lru);
- zswap_lru_add(&zswap_list_lru, entry);
+ /*
+ * The entry is successfully compressed and stored in the tree, there is
+ * no further possibility of failure. Grab refs to the pool and objcg,
+ * charge zswap memory, and increment zswap_stored_pages.
+ * The opposite actions will be performed by zswap_entry_free()
+ * when the entry is removed from the tree.
+ */
+ zswap_pool_get(pool);
+ if (objcg) {
+ obj_cgroup_get(objcg);
+ obj_cgroup_charge_zswap(objcg, entry->length);
+ }
+ atomic_long_inc(&zswap_stored_pages);
+
+ /*
+ * We finish initializing the entry while it's already in xarray.
+ * This is safe because:
+ *
+ * 1. Concurrent stores and invalidations are excluded by folio lock.
+ *
+ * 2. Writeback is excluded by the entry not being on the LRU yet.
+ * The publishing order matters to prevent writeback from seeing
+ * an incoherent entry.
+ */
+ entry->pool = pool;
+ entry->swpentry = page_swpentry;
+ entry->objcg = objcg;
+ entry->referenced = true;
+ if (likely(entry->length)) {
+ INIT_LIST_HEAD(&entry->lru);
+ zswap_lru_add(&zswap_list_lru, entry);
+ }
}
return true;
-store_failed:
- zpool_free(pool->zpool, entry->handle);
-compress_failed:
- zswap_entry_cache_free(entry);
+store_pages_failed:
+ for (i = from_i; i < nr_pages; ++i) {
+ if (!IS_ERR_VALUE(entries[i]->handle))
+ zpool_free(pool->zpool, entries[i]->handle);
+
+ zswap_entry_cache_free(entries[i]);
+ }
+
return false;
}
+/*
+ * Store all pages in a folio by calling zswap_batch_compress().
+ * If the compressor supports batching, i.e., has multiple acomp requests,
+ * the folio will be compressed in batches of "acomp_ctx->nr_reqs" using
+ * request chaining.
+ * If the compressor has only one acomp request, the folio will be compressed
+ * in batches of ZSWAP_MAX_BATCH_SIZE pages, where each page in the batch is
+ * compressed sequentially.
+ */
+static bool zswap_store_folio(struct folio *folio,
+ struct obj_cgroup *objcg,
+ struct zswap_pool *pool)
+{
+ long nr_pages = folio_nr_pages(folio);
+ struct crypto_acomp_ctx *acomp_ctx;
+ unsigned int batch_size;
+ bool ret = true, batching;
+ long si, ei;
+
+ acomp_ctx = acomp_ctx_get_cpu_lock(pool);
+
+ batching = ((acomp_ctx->nr_reqs > 1) && (nr_pages > 1));
+
+ batch_size = batching ? acomp_ctx->nr_reqs : ZSWAP_MAX_BATCH_SIZE;
+
+ if (!batching)
+ acomp_ctx_put_unlock(acomp_ctx);
+
+ /* Store the folio in batches of "batch_size" pages. */
+ for (si = 0, ei = min(si + batch_size - 1, nr_pages - 1);
+ ((si < nr_pages) && (ei < nr_pages));
+ si = ei + 1, ei = min(si + batch_size - 1, nr_pages - 1)) {
+
+ if (!batching)
+ acomp_ctx = acomp_ctx_get_cpu_lock(pool);
+
+ if (!zswap_store_pages(folio, si, ei, objcg, pool, acomp_ctx, batching)) {
+ ret = false;
+ break;
+ }
+
+ if (!batching)
+ acomp_ctx_put_unlock(acomp_ctx);
+ }
+
+ if (batching || !ret)
+ acomp_ctx_put_unlock(acomp_ctx);
+
+ return ret;
+}
+
bool zswap_store(struct folio *folio)
{
long nr_pages = folio_nr_pages(folio);
@@ -1667,7 +1815,6 @@ bool zswap_store(struct folio *folio)
struct mem_cgroup *memcg = NULL;
struct zswap_pool *pool;
bool ret = false;
- long index;
VM_WARN_ON_ONCE(!folio_test_locked(folio));
VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
@@ -1701,12 +1848,8 @@ bool zswap_store(struct folio *folio)
mem_cgroup_put(memcg);
}
- for (index = 0; index < nr_pages; ++index) {
- struct page *page = folio_page(folio, index);
-
- if (!zswap_store_page(page, objcg, pool))
- goto put_pool;
- }
+ if (!zswap_store_folio(folio, objcg, pool))
+ goto put_pool;
if (objcg)
count_objcg_events(objcg, ZSWPOUT, nr_pages);
@@ -1733,6 +1876,7 @@ bool zswap_store(struct folio *folio)
pgoff_t offset = swp_offset(swp);
struct zswap_entry *entry;
struct xarray *tree;
+ long index;
for (index = 0; index < nr_pages; ++index) {
tree = swap_zswap_tree(swp_entry(type, offset + index));
--
2.27.0
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v8 14/14] mm: zswap: Compress batching with request chaining in zswap_store() of large folios.
2025-03-03 8:47 ` [PATCH v8 14/14] mm: zswap: Compress batching with request chaining in zswap_store() of large folios Kanchana P Sridhar
@ 2025-03-03 11:07 ` kernel test robot
2025-03-03 18:21 ` Nhat Pham
2025-03-03 11:07 ` kernel test robot
2025-03-06 21:17 ` Yosry Ahmed
2 siblings, 1 reply; 44+ messages in thread
From: kernel test robot @ 2025-03-03 11:07 UTC (permalink / raw)
To: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, yosry.ahmed,
nphamcs, chengming.zhou, usamaarif642, ryan.roberts, 21cnbao,
ying.huang, akpm, linux-crypto, herbert, davem, clabbe, ardb,
ebiggers, surenb, kristen.c.accardi
Cc: llvm, oe-kbuild-all, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
Hi Kanchana,
kernel test robot noticed the following build errors:
[auto build test ERROR on 5f089a9aa987ccf72df0c6955e168e865f280603]
url: https://github.com/intel-lab-lkp/linux/commits/Kanchana-P-Sridhar/crypto-acomp-Add-synchronous-asynchronous-acomp-request-chaining/20250303-164927
base: 5f089a9aa987ccf72df0c6955e168e865f280603
patch link: https://lore.kernel.org/r/20250303084724.6490-15-kanchana.p.sridhar%40intel.com
patch subject: [PATCH v8 14/14] mm: zswap: Compress batching with request chaining in zswap_store() of large folios.
config: s390-randconfig-001-20250303 (https://download.01.org/0day-ci/archive/20250303/202503031847.j1iReOtf-lkp@intel.com/config)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250303/202503031847.j1iReOtf-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202503031847.j1iReOtf-lkp@intel.com/
All errors (new ones prefixed by >>):
>> mm/zswap.c:1166:4: error: call to undeclared function 'prefetchw'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
1166 | prefetchw(entries[j]);
| ^
1 error generated.
vim +/prefetchw +1166 mm/zswap.c
1053
1054 /*
1055 * Unified code paths for compressors that do and do not support
1056 * batching. This procedure will compress multiple @nr_pages in @folio,
1057 * starting from @index.
1058 * If @batching is set to true, it will create a request chain for
1059 * compression batching. It is assumed that the caller has verified
1060 * that the acomp_ctx->nr_reqs is at least @nr_pages.
1061 * If @batching is set to false, it will process each page sequentially.
1062 * In both cases, if all compressions were successful, it will proceed
1063 * to store the compressed buffers in zpool.
1064 */
1065 static bool zswap_batch_compress(struct folio *folio,
1066 long index,
1067 unsigned int nr_pages,
1068 struct zswap_entry *entries[],
1069 struct zswap_pool *pool,
1070 struct crypto_acomp_ctx *acomp_ctx,
1071 bool batching)
1072 {
1073 struct scatterlist inputs[ZSWAP_MAX_BATCH_SIZE];
1074 struct scatterlist outputs[ZSWAP_MAX_BATCH_SIZE];
1075 struct zpool *zpool = pool->zpool;
1076 int acomp_idx = 0, nr_to_store = 1;
1077 unsigned int i, j;
1078 int err = 0;
1079 gfp_t gfp;
1080
1081 lockdep_assert_held(&acomp_ctx->mutex);
1082
1083 gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
1084 if (zpool_malloc_support_movable(zpool))
1085 gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
1086
1087 for (i = 0; i < nr_pages; ++i) {
1088 struct page *page = folio_page(folio, index + i);
1089
1090 sg_init_table(&inputs[acomp_idx], 1);
1091 sg_set_page(&inputs[acomp_idx], page, PAGE_SIZE, 0);
1092
1093 /*
1094 * Each dst buffer should be of size (PAGE_SIZE * 2).
1095 * Reflect same in sg_list.
1096 */
1097 sg_init_one(&outputs[acomp_idx], acomp_ctx->buffers[acomp_idx], PAGE_SIZE * 2);
1098 acomp_request_set_params(acomp_ctx->reqs[acomp_idx], &inputs[acomp_idx],
1099 &outputs[acomp_idx], PAGE_SIZE, PAGE_SIZE);
1100
1101 if (batching) {
1102 /* Add the acomp request to the chain. */
1103 if (likely(i))
1104 acomp_request_chain(acomp_ctx->reqs[acomp_idx], acomp_ctx->reqs[0]);
1105 else
1106 acomp_reqchain_init(acomp_ctx->reqs[0], 0, crypto_req_done,
1107 &acomp_ctx->wait);
1108
1109 if (i == (nr_pages - 1)) {
1110 /* Process the request chain. */
1111 err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
1112
1113 /*
1114 * Get the individual compress errors from request chaining.
1115 */
1116 for (j = 0; j < nr_pages; ++j) {
1117 if (unlikely(acomp_request_err(acomp_ctx->reqs[j]))) {
1118 err = -EINVAL;
1119 if (acomp_request_err(acomp_ctx->reqs[j]) == -ENOSPC)
1120 zswap_reject_compress_poor++;
1121 else
1122 zswap_reject_compress_fail++;
1123 }
1124 }
1125 /*
1126 * Request chaining cleanup:
1127 *
1128 * - Clear the CRYPTO_TFM_REQ_CHAIN bit on acomp_ctx->reqs[0].
1129 * - Reset the acomp_ctx->wait to notify acomp_ctx->reqs[0].
1130 */
1131 acomp_reqchain_clear(acomp_ctx->reqs[0], &acomp_ctx->wait);
1132 if (unlikely(err))
1133 return false;
1134 j = 0;
1135 nr_to_store = nr_pages;
1136 goto store_zpool;
1137 }
1138
1139 ++acomp_idx;
1140 continue;
1141 } else {
1142 err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
1143
1144 if (unlikely(err)) {
1145 if (err == -ENOSPC)
1146 zswap_reject_compress_poor++;
1147 else
1148 zswap_reject_compress_fail++;
1149 return false;
1150 }
1151 j = i;
1152 nr_to_store = 1;
1153 }
1154
1155 store_zpool:
1156 /*
1157 * All batch pages were successfully compressed.
1158 * Store the pages in zpool.
1159 */
1160 acomp_idx = -1;
1161 while (nr_to_store--) {
1162 unsigned long handle;
1163 char *buf;
1164
1165 ++acomp_idx;
> 1166 prefetchw(entries[j]);
1167 err = zpool_malloc(zpool, acomp_ctx->reqs[acomp_idx]->dlen, gfp, &handle);
1168
1169 if (unlikely(err)) {
1170 if (err == -ENOSPC)
1171 zswap_reject_compress_poor++;
1172 else
1173 zswap_reject_alloc_fail++;
1174
1175 return false;
1176 }
1177
1178 buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
1179 memcpy(buf, acomp_ctx->buffers[acomp_idx], acomp_ctx->reqs[acomp_idx]->dlen);
1180 zpool_unmap_handle(zpool, handle);
1181
1182 entries[j]->handle = handle;
1183 entries[j]->length = acomp_ctx->reqs[acomp_idx]->dlen;
1184 ++j;
1185 }
1186 }
1187
1188 return true;
1189 }
1190
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v8 14/14] mm: zswap: Compress batching with request chaining in zswap_store() of large folios.
2025-03-03 8:47 ` [PATCH v8 14/14] mm: zswap: Compress batching with request chaining in zswap_store() of large folios Kanchana P Sridhar
2025-03-03 11:07 ` kernel test robot
@ 2025-03-03 11:07 ` kernel test robot
2025-03-06 21:17 ` Yosry Ahmed
2 siblings, 0 replies; 44+ messages in thread
From: kernel test robot @ 2025-03-03 11:07 UTC (permalink / raw)
To: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, yosry.ahmed,
nphamcs, chengming.zhou, usamaarif642, ryan.roberts, 21cnbao,
ying.huang, akpm, linux-crypto, herbert, davem, clabbe, ardb,
ebiggers, surenb, kristen.c.accardi
Cc: oe-kbuild-all, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
Hi Kanchana,
kernel test robot noticed the following build errors:
[auto build test ERROR on 5f089a9aa987ccf72df0c6955e168e865f280603]
url: https://github.com/intel-lab-lkp/linux/commits/Kanchana-P-Sridhar/crypto-acomp-Add-synchronous-asynchronous-acomp-request-chaining/20250303-164927
base: 5f089a9aa987ccf72df0c6955e168e865f280603
patch link: https://lore.kernel.org/r/20250303084724.6490-15-kanchana.p.sridhar%40intel.com
patch subject: [PATCH v8 14/14] mm: zswap: Compress batching with request chaining in zswap_store() of large folios.
config: csky-randconfig-002-20250303 (https://download.01.org/0day-ci/archive/20250303/202503031843.GyX5ZVrC-lkp@intel.com/config)
compiler: csky-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250303/202503031843.GyX5ZVrC-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202503031843.GyX5ZVrC-lkp@intel.com/
All errors (new ones prefixed by >>):
mm/zswap.c: In function 'zswap_batch_compress':
>> mm/zswap.c:1166:25: error: implicit declaration of function 'prefetchw' [-Wimplicit-function-declaration]
1166 | prefetchw(entries[j]);
| ^~~~~~~~~
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v8 14/14] mm: zswap: Compress batching with request chaining in zswap_store() of large folios.
2025-03-03 11:07 ` kernel test robot
@ 2025-03-03 18:21 ` Nhat Pham
2025-03-03 21:34 ` Sridhar, Kanchana P
0 siblings, 1 reply; 44+ messages in thread
From: Nhat Pham @ 2025-03-03 18:21 UTC (permalink / raw)
To: kernel test robot
Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, yosry.ahmed,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
surenb, kristen.c.accardi, llvm, oe-kbuild-all, wajdi.k.feghali,
vinodh.gopal
On Mon, Mar 3, 2025 at 3:07 AM kernel test robot <lkp@intel.com> wrote:
>
> Hi Kanchana,
>
> kernel test robot noticed the following build errors:
>
> > 1166 prefetchw(entries[j]);
> --
Why are we doing this anyway? Does it have a notable performance
difference? At the very least, leave a comment explaining why we're
prefetching this (although the build error suggests that we have to
remove it anyway).
^ permalink raw reply [flat|nested] 44+ messages in thread
* RE: [PATCH v8 14/14] mm: zswap: Compress batching with request chaining in zswap_store() of large folios.
2025-03-03 18:21 ` Nhat Pham
@ 2025-03-03 21:34 ` Sridhar, Kanchana P
2025-03-06 21:20 ` Yosry Ahmed
0 siblings, 1 reply; 44+ messages in thread
From: Sridhar, Kanchana P @ 2025-03-03 21:34 UTC (permalink / raw)
To: Nhat Pham, lkp
Cc: linux-kernel, linux-mm, hannes, yosry.ahmed, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
Accardi, Kristen C, llvm, oe-kbuild-all, Feghali, Wajdi K, Gopal,
Vinodh, Sridhar, Kanchana P
> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Monday, March 3, 2025 10:22 AM
> To: lkp <lkp@intel.com>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> yosry.ahmed@linux.dev; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; llvm@lists.linux.dev; oe-kbuild-
> all@lists.linux.dev; Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal,
> Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v8 14/14] mm: zswap: Compress batching with request
> chaining in zswap_store() of large folios.
>
> On Mon, Mar 3, 2025 at 3:07 AM kernel test robot <lkp@intel.com> wrote:
> >
> > Hi Kanchana,
> >
> > kernel test robot noticed the following build errors:
> >
> > > 1166 prefetchw(entries[j]);
> > --
>
> Why are we doing this anyway? Does it have a notable performance
> difference? At the very least, leave a comment explaining why we're
> prefetching this (although the build error suggests that we have to
> remove it anyway).
Hi Nhat,
Yes, it does. The use of prefetchw reduces sys time by ~1.5% because
it minimizes cache-miss latency by moving the zswap entry to the cache
before it is written to.
This is data from the kernel compilation test, comparing v8 without prefetchw and v8 as-is:
--------------------------------------------------------------------------------
Kernel compile            v8 without            v8    v8 without            v8
allmodconfig               prefetchw                   prefetchw
2M folios
--------------------------------------------------------------------------------
zswap compressor         deflate-iaa   deflate-iaa          zstd          zstd
--------------------------------------------------------------------------------
real_sec                      732.89        735.63        768.53        758.21
user_sec                   15,708.37     15,699.84     15,702.64     15,678.73
sys_sec                     4,632.58      4,563.70      5,735.06      5,635.69
--------------------------------------------------------------------------------
Max_Res_Set_Size_KB        1,874,672     1,867,516     1,874,684     1,872,888
--------------------------------------------------------------------------------
memcg_high                         0             0             0             0
memcg_swap_fail                    0             0             0             0
zswpout                  114,742,930   112,836,725    92,904,961    89,596,085
zswpin                    41,184,897    39,983,793    31,018,149    29,163,932
pswpout                          625         1,069           558         1,059
pswpin                           599         1,056           540         1,051
thp_swpout                         1             2             1             2
thp_swpout_fallback           10,967        10,195         6,918         6,141
pgmajfault                42,588,331    41,349,069    31,931,882    30,006,422
ZSWPOUT-2048kB                 7,661         8,710         6,799         7,480
SWPOUT-2048kB                      1             2             1             2
--------------------------------------------------------------------------------
Sure, I will add a comment, and also add "#include <linux/prefetch.h>" to zswap.c,
which will resolve the build error. This is similar to how these files handle prefetchw:
mm/vmscan.c, kernel/locking/qspinlock.c, include/asm-generic/xor.h, etc.
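A rough sketch of what I have in mind (illustrative only; not the final patch):

	/* mm/zswap.c: alongside the other includes */
	#include <linux/prefetch.h>

	/* in the store_zpool loop of zswap_batch_compress(), before zpool_malloc(): */

	/*
	 * entries[j]->handle and entries[j]->length are written as soon as
	 * zpool_malloc() succeeds; prefetch the entry for write so the
	 * cache-miss latency overlaps with the zpool allocation.
	 */
	prefetchw(entries[j]);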
Thanks,
Kanchana
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage.
2025-03-03 8:47 ` [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage Kanchana P Sridhar
@ 2025-03-06 19:35 ` Yosry Ahmed
2025-03-06 19:46 ` Yosry Ahmed
2025-03-07 0:01 ` Sridhar, Kanchana P
2025-03-06 19:42 ` [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage Yosry Ahmed
1 sibling, 2 replies; 44+ messages in thread
From: Yosry Ahmed @ 2025-03-06 19:35 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
kristen.c.accardi, wajdi.k.feghali, vinodh.gopal
On Mon, Mar 03, 2025 at 12:47:22AM -0800, Kanchana P Sridhar wrote:
> This patch modifies the acomp_ctx resources' lifetime to be from pool
> creation to deletion. A "bool __online" and "u8 nr_reqs" are added to
> "struct crypto_acomp_ctx" which simplify a few things:
>
> 1) zswap_pool_create() will initialize all members of each percpu acomp_ctx
> to 0 or NULL and only then initialize the mutex.
> 2) CPU hotplug will set nr_reqs to 1, allocate resources and set __online
> to true, without locking the mutex.
> 3) CPU hotunplug will lock the mutex before setting __online to false. It
> will not delete any resources.
> 4) acomp_ctx_get_cpu_lock() will lock the mutex, then check if __online
> is true, and if so, return the mutex for use in zswap compress and
> decompress ops.
> 5) CPU onlining after offlining will simply check if either __online or
> nr_reqs are non-0, and return 0 if so, without re-allocating the
> resources.
> 6) zswap_pool_destroy() will call a newly added zswap_cpu_comp_dealloc() to
> delete the acomp_ctx resources.
> 7) Common resource deletion code in case of zswap_cpu_comp_prepare()
> errors, and for use in zswap_cpu_comp_dealloc(), is factored into a new
> acomp_ctx_dealloc().
>
> The CPU hot[un]plug callback functions are moved to "pool functions"
> accordingly.
>
> The per-cpu memory cost of not deleting the acomp_ctx resources upon CPU
> offlining, and only deleting them when the pool is destroyed, is as follows:
>
> IAA with batching: 64.8 KB
> Software compressors: 8.2 KB
>
> I would appreciate code review comments on whether this memory cost is
> acceptable, for the latency improvement that it provides due to a faster
> reclaim restart after a CPU hotunplug-hotplug sequence - all that the
> hotplug code needs to do is to check if acomp_ctx->nr_reqs is non-0, and
> if so, set __online to true and return, and reclaim can proceed.
I like the idea of allocating the resources on CPU hotplug but
leaving them allocated until the pool is torn down. It avoids allocating
unnecessary memory if some CPUs are never onlined, and it simplifies
things because we don't have to synchronize against the resources being
freed in CPU offline.
The only case that would suffer from this AFAICT is if someone onlines
many CPUs, uses them once, and then offlines them and never uses them again.
I am not familiar with CPU hotplug use cases so I can't tell if that's
something people do, but I am inclined to agree with this
simplification.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
> mm/zswap.c | 273 +++++++++++++++++++++++++++++++++++------------------
> 1 file changed, 182 insertions(+), 91 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 10f2a16e7586..cff96df1df8b 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -144,10 +144,12 @@ bool zswap_never_enabled(void)
> struct crypto_acomp_ctx {
> struct crypto_acomp *acomp;
> struct acomp_req *req;
> - struct crypto_wait wait;
Is there a reason for moving this? If not please avoid unrelated changes.
> u8 *buffer;
> + u8 nr_reqs;
> + struct crypto_wait wait;
> struct mutex mutex;
> bool is_sleepable;
> + bool __online;
I don't believe we need this.
If we are not freeing resources during CPU offlining, then we do not
need a CPU offline callback and acomp_ctx->__online serves no purpose.
The whole point of synchronizing between offlining and
compress/decompress operations is to avoid UAF. If offlining does not
free resources, then we can hold the mutex directly in the
compress/decompress path and drop the hotunplug callback completely.
I also believe nr_reqs can be dropped from this patch, as it seems like
it's only used to know when to set __online.
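With that, the locking helper could collapse to something like the following
(untested sketch, keeping the v8 naming):

	static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
	{
		struct crypto_acomp_ctx *acomp_ctx;

		/*
		 * Resources live from pool creation to pool destruction and the
		 * caller holds a pool reference, so they cannot go away under us.
		 * Migration after raw_cpu_ptr() is harmless because each per-CPU
		 * context is protected by its own mutex.
		 */
		acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
		mutex_lock(&acomp_ctx->mutex);
		return acomp_ctx;
	}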
> };
>
> /*
> @@ -246,6 +248,122 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
> **********************************/
> static void __zswap_pool_empty(struct percpu_ref *ref);
>
> +static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
> +{
> + if (!IS_ERR_OR_NULL(acomp_ctx) && acomp_ctx->nr_reqs) {
> +
> + if (!IS_ERR_OR_NULL(acomp_ctx->req))
> + acomp_request_free(acomp_ctx->req);
> + acomp_ctx->req = NULL;
> +
> + kfree(acomp_ctx->buffer);
> + acomp_ctx->buffer = NULL;
> +
> + if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> + crypto_free_acomp(acomp_ctx->acomp);
> +
> + acomp_ctx->nr_reqs = 0;
> + }
> +}
Please split the pure refactoring into a separate patch to make it
easier to review.
> +
> +static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
Why is the function moved while being changed? It's really hard to see
the diff this way. If the function needs to be moved please do that
separately as well.
I also see some ordering changes inside the function (e.g. we now
allocate the request before the buffer). Not sure if these are
intentional. If not, please keep the diff to the required changes only.
> +{
> + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> + int ret = -ENOMEM;
> +
> + /*
> + * Just to be even more fail-safe against changes in assumptions and/or
> + * implementation of the CPU hotplug code.
> + */
> + if (acomp_ctx->__online)
> + return 0;
> +
> + if (acomp_ctx->nr_reqs) {
> + acomp_ctx->__online = true;
> + return 0;
> + }
> +
> + acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
> + if (IS_ERR(acomp_ctx->acomp)) {
> + pr_err("could not alloc crypto acomp %s : %ld\n",
> + pool->tfm_name, PTR_ERR(acomp_ctx->acomp));
> + ret = PTR_ERR(acomp_ctx->acomp);
> + goto fail;
> + }
> +
> + acomp_ctx->nr_reqs = 1;
> +
> + acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
> + if (!acomp_ctx->req) {
> + pr_err("could not alloc crypto acomp_request %s\n",
> + pool->tfm_name);
> + ret = -ENOMEM;
> + goto fail;
> + }
> +
> + acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
> + if (!acomp_ctx->buffer) {
> + ret = -ENOMEM;
> + goto fail;
> + }
> +
> + crypto_init_wait(&acomp_ctx->wait);
> +
> + /*
> + * if the backend of acomp is async zip, crypto_req_done() will wakeup
> + * crypto_wait_req(); if the backend of acomp is scomp, the callback
> + * won't be called, crypto_wait_req() will return without blocking.
> + */
> + acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG,
> + crypto_req_done, &acomp_ctx->wait);
> +
> + acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp);
> +
> + acomp_ctx->__online = true;
> +
> + return 0;
> +
> +fail:
> + acomp_ctx_dealloc(acomp_ctx);
> +
> + return ret;
> +}
> +
> +static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
> +{
> + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> +
> + mutex_lock(&acomp_ctx->mutex);
> + acomp_ctx->__online = false;
> + mutex_unlock(&acomp_ctx->mutex);
> +
> + return 0;
> +}
> +
> +static void zswap_cpu_comp_dealloc(unsigned int cpu, struct hlist_node *node)
> +{
> + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> +
> + /*
> + * The lifetime of acomp_ctx resources is from pool creation to
> + * pool deletion.
> + *
> + * Reclaims should not be happening because, we get to this routine only
> + * in two scenarios:
> + *
> + * 1) pool creation failures before/during the pool ref initialization.
> + * 2) we are in the process of releasing the pool, it is off the
> + * zswap_pools list and has no references.
> + *
> + * Hence, there is no need for locks.
> + */
> + acomp_ctx->__online = false;
> + acomp_ctx_dealloc(acomp_ctx);
Since __online can be dropped, we can probably drop
zswap_cpu_comp_dealloc() and call acomp_ctx_dealloc() directly?
> +}
> +
> static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
> {
> struct zswap_pool *pool;
> @@ -285,13 +403,21 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
> goto error;
> }
>
> - for_each_possible_cpu(cpu)
> - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
> + for_each_possible_cpu(cpu) {
> + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> +
> + acomp_ctx->acomp = NULL;
> + acomp_ctx->req = NULL;
> + acomp_ctx->buffer = NULL;
> + acomp_ctx->__online = false;
> + acomp_ctx->nr_reqs = 0;
Why is this needed? Wouldn't zswap_cpu_comp_prepare() initialize them
right away?
If it is in fact needed we should probably just use __GFP_ZERO.
> + mutex_init(&acomp_ctx->mutex);
> + }
>
> ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> &pool->node);
> if (ret)
> - goto error;
> + goto ref_fail;
>
> /* being the current pool takes 1 ref; this func expects the
> * caller to always add the new pool as the current pool
> @@ -307,6 +433,9 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
> return pool;
>
> ref_fail:
> + for_each_possible_cpu(cpu)
> + zswap_cpu_comp_dealloc(cpu, &pool->node);
> +
> cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
I am wondering if we can guard these by hlist_empty(&pool->node) instead
of having separate labels. If we do that we can probably make all the
cleanup calls conditional and merge this cleanup code with
zswap_pool_destroy().
Although I am not too sure about whether or not we should rely on
hlist_empty() for this. I am just thinking out loud, no need to do
anything here. If you decide to pursue this though, please make it a
separate refactoring patch.
> error:
> if (pool->acomp_ctx)
> @@ -361,8 +490,13 @@ static struct zswap_pool *__zswap_pool_create_fallback(void)
>
> static void zswap_pool_destroy(struct zswap_pool *pool)
> {
> + int cpu;
> +
> zswap_pool_debug("destroying", pool);
>
> + for_each_possible_cpu(cpu)
> + zswap_cpu_comp_dealloc(cpu, &pool->node);
> +
> cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
> free_percpu(pool->acomp_ctx);
>
> @@ -816,85 +950,6 @@ static void zswap_entry_free(struct zswap_entry *entry)
> /*********************************
> * compressed storage functions
> **********************************/
> -static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> -{
> - struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> - struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> - struct crypto_acomp *acomp = NULL;
> - struct acomp_req *req = NULL;
> - u8 *buffer = NULL;
> - int ret;
> -
> - buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
> - if (!buffer) {
> - ret = -ENOMEM;
> - goto fail;
> - }
> -
> - acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
> - if (IS_ERR(acomp)) {
> - pr_err("could not alloc crypto acomp %s : %ld\n",
> - pool->tfm_name, PTR_ERR(acomp));
> - ret = PTR_ERR(acomp);
> - goto fail;
> - }
> -
> - req = acomp_request_alloc(acomp);
> - if (!req) {
> - pr_err("could not alloc crypto acomp_request %s\n",
> - pool->tfm_name);
> - ret = -ENOMEM;
> - goto fail;
> - }
> -
> - /*
> - * Only hold the mutex after completing allocations, otherwise we may
> - * recurse into zswap through reclaim and attempt to hold the mutex
> - * again resulting in a deadlock.
> - */
> - mutex_lock(&acomp_ctx->mutex);
> - crypto_init_wait(&acomp_ctx->wait);
> -
> - /*
> - * if the backend of acomp is async zip, crypto_req_done() will wakeup
> - * crypto_wait_req(); if the backend of acomp is scomp, the callback
> - * won't be called, crypto_wait_req() will return without blocking.
> - */
> - acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
> - crypto_req_done, &acomp_ctx->wait);
> -
> - acomp_ctx->buffer = buffer;
> - acomp_ctx->acomp = acomp;
> - acomp_ctx->is_sleepable = acomp_is_async(acomp);
> - acomp_ctx->req = req;
> - mutex_unlock(&acomp_ctx->mutex);
> - return 0;
> -
> -fail:
> - if (acomp)
> - crypto_free_acomp(acomp);
> - kfree(buffer);
> - return ret;
> -}
> -
> -static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
> -{
> - struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> - struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> -
> - mutex_lock(&acomp_ctx->mutex);
> - if (!IS_ERR_OR_NULL(acomp_ctx)) {
> - if (!IS_ERR_OR_NULL(acomp_ctx->req))
> - acomp_request_free(acomp_ctx->req);
> - acomp_ctx->req = NULL;
> - if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> - crypto_free_acomp(acomp_ctx->acomp);
> - kfree(acomp_ctx->buffer);
> - }
> - mutex_unlock(&acomp_ctx->mutex);
> -
> - return 0;
> -}
>
> static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
> {
> @@ -902,16 +957,52 @@ static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
>
> for (;;) {
> acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> - mutex_lock(&acomp_ctx->mutex);
> - if (likely(acomp_ctx->req))
> - return acomp_ctx;
> /*
> - * It is possible that we were migrated to a different CPU after
> - * getting the per-CPU ctx but before the mutex was acquired. If
> - * the old CPU got offlined, zswap_cpu_comp_dead() could have
> - * already freed ctx->req (among other things) and set it to
> - * NULL. Just try again on the new CPU that we ended up on.
> + * If the CPU onlining code successfully allocates acomp_ctx resources,
> + * it sets acomp_ctx->__online to true. Until this happens, we have
> + * two options:
> + *
> + * 1. Return NULL and fail all stores on this CPU.
> + * 2. Retry, until onlining has finished allocating resources.
> + *
> + * In theory, option 1 could be more appropriate, because it
> + * allows the calling procedure to decide how it wants to handle
> + * reclaim racing with CPU hotplug. For instance, it might be Ok
> + * for compress to return an error for the backing swap device
> + * to store the folio. Decompress could wait until we get a
> + * valid and locked mutex after onlining has completed. For now,
> + * we go with option 2 because adding a do-while in
> + * zswap_decompress() adds latency for software compressors.
> + *
> + * Once initialized, the resources will be de-allocated only
> + * when the pool is destroyed. The acomp_ctx will hold on to the
> + * resources through CPU offlining/onlining at any time until
> + * the pool is destroyed.
> + *
> + * This prevents races/deadlocks between reclaim and CPU acomp_ctx
> + * resource allocation that are a dependency for reclaim.
> + * It further simplifies the interaction with CPU onlining and
> + * offlining:
> + *
> + * - CPU onlining does not take the mutex. It only allocates
> + * resources and sets __online to true.
> + * - CPU offlining acquires the mutex before setting
> + * __online to false. If reclaim has acquired the mutex,
> + * offlining will have to wait for reclaim to complete before
> + * hotunplug can proceed. Further, hotplug merely sets
> + * __online to false. It does not delete the acomp_ctx
> + * resources.
> + *
> + * Option 1 is better than potentially not exiting the earlier
> + * for (;;) loop because the system is running low on memory
> + * and/or CPUs are getting offlined for whatever reason. At
> + * least failing this store will prevent data loss by failing
> + * zswap_store(), and saving the data in the backing swap device.
> */
I believe this can be dropped. I don't think we can have any store/load
operations on a CPU before it's fully onlined, and we should always have
a reference on the pool here, so the resources cannot go away.
So unless I missed something we can drop this completely now and just
hold the mutex directly in the load/store paths.
> + mutex_lock(&acomp_ctx->mutex);
> + if (likely(acomp_ctx->__online))
> + return acomp_ctx;
> +
> mutex_unlock(&acomp_ctx->mutex);
> }
> }
> --
> 2.27.0
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage.
2025-03-03 8:47 ` [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage Kanchana P Sridhar
2025-03-06 19:35 ` Yosry Ahmed
@ 2025-03-06 19:42 ` Yosry Ahmed
1 sibling, 0 replies; 44+ messages in thread
From: Yosry Ahmed @ 2025-03-06 19:42 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
kristen.c.accardi, wajdi.k.feghali, vinodh.gopal
On Mon, Mar 03, 2025 at 12:47:22AM -0800, Kanchana P Sridhar wrote:
> This patch modifies the acomp_ctx resources' lifetime to be from pool
> creation to deletion. A "bool __online" and "u8 nr_reqs" are added to
> "struct crypto_acomp_ctx" which simplify a few things:
>
> 1) zswap_pool_create() will initialize all members of each percpu acomp_ctx
> to 0 or NULL and only then initialize the mutex.
> 2) CPU hotplug will set nr_reqs to 1, allocate resources and set __online
> to true, without locking the mutex.
> 3) CPU hotunplug will lock the mutex before setting __online to false. It
> will not delete any resources.
> 4) acomp_ctx_get_cpu_lock() will lock the mutex, then check if __online
> is true, and if so, return the mutex for use in zswap compress and
> decompress ops.
> 5) CPU onlining after offlining will simply check if either __online or
> nr_reqs are non-0, and return 0 if so, without re-allocating the
> resources.
> 6) zswap_pool_destroy() will call a newly added zswap_cpu_comp_dealloc() to
> delete the acomp_ctx resources.
> 7) Common resource deletion code in case of zswap_cpu_comp_prepare()
> errors, and for use in zswap_cpu_comp_dealloc(), is factored into a new
> acomp_ctx_dealloc().
>
> The CPU hot[un]plug callback functions are moved to "pool functions"
> accordingly.
>
> The per-cpu memory cost of not deleting the acomp_ctx resources upon CPU
> offlining, and only deleting them when the pool is destroyed, is as follows:
>
> IAA with batching: 64.8 KB
> Software compressors: 8.2 KB
I am assuming this is specifically on x86_64, so let's call that out.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage.
2025-03-06 19:35 ` Yosry Ahmed
@ 2025-03-06 19:46 ` Yosry Ahmed
2025-03-07 0:01 ` Sridhar, Kanchana P
1 sibling, 0 replies; 44+ messages in thread
From: Yosry Ahmed @ 2025-03-06 19:46 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
kristen.c.accardi, wajdi.k.feghali, vinodh.gopal
On Thu, Mar 06, 2025 at 07:35:36PM +0000, Yosry Ahmed wrote:
> On Mon, Mar 03, 2025 at 12:47:22AM -0800, Kanchana P Sridhar wrote:
> > This patch modifies the acomp_ctx resources' lifetime to be from pool
> > creation to deletion. A "bool __online" and "u8 nr_reqs" are added to
> > "struct crypto_acomp_ctx" which simplify a few things:
> >
> > 1) zswap_pool_create() will initialize all members of each percpu acomp_ctx
> > to 0 or NULL and only then initialize the mutex.
> > 2) CPU hotplug will set nr_reqs to 1, allocate resources and set __online
> > to true, without locking the mutex.
> > 3) CPU hotunplug will lock the mutex before setting __online to false. It
> > will not delete any resources.
> > 4) acomp_ctx_get_cpu_lock() will lock the mutex, then check if __online
> > is true, and if so, return the mutex for use in zswap compress and
> > decompress ops.
> > 5) CPU onlining after offlining will simply check if either __online or
> > nr_reqs are non-0, and return 0 if so, without re-allocating the
> > resources.
> > 6) zswap_pool_destroy() will call a newly added zswap_cpu_comp_dealloc() to
> > delete the acomp_ctx resources.
> > 7) Common resource deletion code in case of zswap_cpu_comp_prepare()
> > errors, and for use in zswap_cpu_comp_dealloc(), is factored into a new
> > acomp_ctx_dealloc().
> >
> > The CPU hot[un]plug callback functions are moved to "pool functions"
> > accordingly.
> >
> > The per-cpu memory cost of not deleting the acomp_ctx resources upon CPU
> > offlining, and only deleting them when the pool is destroyed, is as follows:
> >
> > IAA with batching: 64.8 KB
> > Software compressors: 8.2 KB
> >
> > I would appreciate code review comments on whether this memory cost is
> > acceptable, for the latency improvement that it provides due to a faster
> > reclaim restart after a CPU hotunplug-hotplug sequence - all that the
> > hotplug code needs to do is to check if acomp_ctx->nr_reqs is non-0, and
> > if so, set __online to true and return, and reclaim can proceed.
>
> I like the idea of allocating the resources on memory hotplug but
> leaving them allocated until the pool is torn down. It avoids allocating
> unnecessary memory if some CPUs are never onlined, but it simplifies
> things because we don't have to synchronize against the resources being
> freed in CPU offline.
>
> The only case that would suffer from this AFAICT is if someone onlines
> many CPUs, uses them once, and then offline them and not use them again.
> I am not familiar with CPU hotplug use cases so I can't tell if that's
> something people do, but I am inclined to agree with this
> simplification.
>
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> > mm/zswap.c | 273 +++++++++++++++++++++++++++++++++++------------------
> > 1 file changed, 182 insertions(+), 91 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 10f2a16e7586..cff96df1df8b 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -144,10 +144,12 @@ bool zswap_never_enabled(void)
> > struct crypto_acomp_ctx {
> > struct crypto_acomp *acomp;
> > struct acomp_req *req;
> > - struct crypto_wait wait;
>
> Is there a reason for moving this? If not please avoid unrelated changes.
>
> > u8 *buffer;
> > + u8 nr_reqs;
> > + struct crypto_wait wait;
> > struct mutex mutex;
> > bool is_sleepable;
> > + bool __online;
>
> I don't believe we need this.
>
> If we are not freeing resources during CPU offlining, then we do not
> need a CPU offline callback and acomp_ctx->__online serves no purpose.
>
> The whole point of synchronizing between offlining and
> compress/decompress operations is to avoid UAF. If offlining does not
> free resources, then we can hold the mutex directly in the
> compress/decompress path and drop the hotunplug callback completely.
>
> I also believe nr_reqs can be dropped from this patch, as it seems like
> it's only used know when to set __online.
>
> > };
> >
> > /*
> > @@ -246,6 +248,122 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
> > **********************************/
> > static void __zswap_pool_empty(struct percpu_ref *ref);
> >
> > +static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
> > +{
> > + if (!IS_ERR_OR_NULL(acomp_ctx) && acomp_ctx->nr_reqs) {
Also, we can just return early here to save an indentation level:
if (IS_ERR_OR_NULL(acomp_ctx) || !acomp_ctx->nr_reqs)
return;
> > +
> > + if (!IS_ERR_OR_NULL(acomp_ctx->req))
> > + acomp_request_free(acomp_ctx->req);
> > + acomp_ctx->req = NULL;
> > +
> > + kfree(acomp_ctx->buffer);
> > + acomp_ctx->buffer = NULL;
> > +
> > + if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> > + crypto_free_acomp(acomp_ctx->acomp);
> > +
> > + acomp_ctx->nr_reqs = 0;
> > + }
> > +}
>
> Please split the pure refactoring into a separate patch to make it
> easier to review.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v8 13/14] mm: zswap: Allocate pool batching resources if the compressor supports batching.
2025-03-03 8:47 ` [PATCH v8 13/14] mm: zswap: Allocate pool batching resources if the compressor supports batching Kanchana P Sridhar
@ 2025-03-06 20:00 ` Yosry Ahmed
2025-04-30 21:15 ` Sridhar, Kanchana P
0 siblings, 1 reply; 44+ messages in thread
From: Yosry Ahmed @ 2025-03-06 20:00 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
kristen.c.accardi, wajdi.k.feghali, vinodh.gopal
On Mon, Mar 03, 2025 at 12:47:23AM -0800, Kanchana P Sridhar wrote:
> This patch adds support for the per-CPU acomp_ctx to track multiple
> compression/decompression requests and multiple compression destination
> buffers. The zswap_cpu_comp_prepare() CPU onlining code will get the
> maximum batch-size the compressor supports. If so, it will allocate the
> necessary batching resources.
>
> However, zswap does not use more than one request yet. Follow-up patches
> will actually utilize the multiple acomp_ctx requests/buffers for batch
> compression/decompression of multiple pages.
>
> The newly added ZSWAP_MAX_BATCH_SIZE limits the amount of extra memory used
> for batching. There is a small extra memory overhead of allocating the
> "reqs" and "buffers" arrays for compressors that do not support batching.
That's two pointers per-CPU (i.e. 16 bytes on x86_64), right? Please
call that out in the commit log.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
> mm/zswap.c | 99 +++++++++++++++++++++++++++++++++++++-----------------
> 1 file changed, 69 insertions(+), 30 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index cff96df1df8b..fae59d6d5147 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -78,6 +78,16 @@ static bool zswap_pool_reached_full;
>
> #define ZSWAP_PARAM_UNSET ""
>
> +/*
> + * For compression batching of large folios:
> + * Maximum number of acomp compress requests that will be processed
> + * in a batch, iff the zswap compressor supports batching.
> + * This limit exists because we preallocate enough requests and buffers
> + * in the per-cpu acomp_ctx accordingly. Hence, a higher limit means higher
> + * memory usage.
> + */
That's too verbose. Let's do something like:
/* Limit the batch size to limit per-CPU memory usage for reqs and buffers */
#define ZSWAP_MAX_BATCH_SIZE 8U
> +#define ZSWAP_MAX_BATCH_SIZE 8U
> +
> static int zswap_setup(void);
>
> /* Enable/disable zswap */
> @@ -143,8 +153,8 @@ bool zswap_never_enabled(void)
>
> struct crypto_acomp_ctx {
> struct crypto_acomp *acomp;
> - struct acomp_req *req;
> - u8 *buffer;
> + struct acomp_req **reqs;
> + u8 **buffers;
> u8 nr_reqs;
> struct crypto_wait wait;
> struct mutex mutex;
> @@ -251,13 +261,22 @@ static void __zswap_pool_empty(struct percpu_ref *ref);
> static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
> {
> if (!IS_ERR_OR_NULL(acomp_ctx) && acomp_ctx->nr_reqs) {
> + u8 i;
> +
> + if (acomp_ctx->reqs) {
> + for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> + if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
Hmm I just realized we check IS_ERR_OR_NULL() here for the requests, but
only a NULL check in zswap_cpu_comp_prepare(). We also check
IS_ERR_OR_NULL here for acomp, but only IS_ERR() in
zswap_cpu_comp_prepare().
This doesn't make sense. Would you be able to include a patch before
this one to make these consistent? I can also send a follow up patch.
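For illustration, the teardown side could mirror the allocation-side checks,
e.g. something along these lines (sketch only; acomp_request_alloc() returns
NULL on failure, while crypto_alloc_acomp_node() returns an ERR_PTR):

	if (acomp_ctx->reqs) {
		for (i = 0; i < acomp_ctx->nr_reqs; ++i) {
			/* requests are either valid or NULL, never ERR_PTRs */
			if (acomp_ctx->reqs[i])
				acomp_request_free(acomp_ctx->reqs[i]);
		}
		kfree(acomp_ctx->reqs);
		acomp_ctx->reqs = NULL;
	}

	/* acomp may be NULL or an ERR_PTR if allocation never happened or failed */
	if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
		crypto_free_acomp(acomp_ctx->acomp);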
> + acomp_request_free(acomp_ctx->reqs[i]);
Please add braces for the for loop here for readability, since the body
has more than one line, even if it's technically not required.
> + kfree(acomp_ctx->reqs);
> + acomp_ctx->reqs = NULL;
> + }
>
> - if (!IS_ERR_OR_NULL(acomp_ctx->req))
> - acomp_request_free(acomp_ctx->req);
> - acomp_ctx->req = NULL;
> -
> - kfree(acomp_ctx->buffer);
> - acomp_ctx->buffer = NULL;
> + if (acomp_ctx->buffers) {
> + for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> + kfree(acomp_ctx->buffers[i]);
> + kfree(acomp_ctx->buffers);
> + acomp_ctx->buffers = NULL;
> + }
>
> if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> crypto_free_acomp(acomp_ctx->acomp);
> @@ -271,6 +290,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> int ret = -ENOMEM;
> + u8 i;
>
> /*
> * Just to be even more fail-safe against changes in assumptions and/or
> @@ -292,22 +312,41 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> goto fail;
> }
>
> - acomp_ctx->nr_reqs = 1;
> + acomp_ctx->nr_reqs = min(ZSWAP_MAX_BATCH_SIZE,
> + crypto_acomp_batch_size(acomp_ctx->acomp));
>
> - acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
> - if (!acomp_ctx->req) {
> - pr_err("could not alloc crypto acomp_request %s\n",
> - pool->tfm_name);
> - ret = -ENOMEM;
> + acomp_ctx->reqs = kcalloc_node(acomp_ctx->nr_reqs, sizeof(struct acomp_req *),
> + GFP_KERNEL, cpu_to_node(cpu));
> + if (!acomp_ctx->reqs)
> goto fail;
> +
> + for (i = 0; i < acomp_ctx->nr_reqs; ++i) {
> + acomp_ctx->reqs[i] = acomp_request_alloc(acomp_ctx->acomp);
> + if (!acomp_ctx->reqs[i]) {
> + pr_err("could not alloc crypto acomp_request reqs[%d] %s\n",
> + i, pool->tfm_name);
> + goto fail;
> + }
> }
>
> - acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
> - if (!acomp_ctx->buffer) {
> - ret = -ENOMEM;
> + acomp_ctx->buffers = kcalloc_node(acomp_ctx->nr_reqs, sizeof(u8 *),
> + GFP_KERNEL, cpu_to_node(cpu));
> + if (!acomp_ctx->buffers)
> goto fail;
> +
> + for (i = 0; i < acomp_ctx->nr_reqs; ++i) {
> + acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
> + cpu_to_node(cpu));
> + if (!acomp_ctx->buffers[i])
> + goto fail;
> }
>
> + /*
> + * The crypto_wait is used only in fully synchronous, i.e., with scomp
> + * or non-poll mode of acomp, hence there is only one "wait" per
> + * acomp_ctx, with callback set to reqs[0], under the assumption that
> + * there is at least 1 request per acomp_ctx.
> + */
I am not sure I understand. Does this say that we assume that scomp or
non-poll acomp will never use batching so having a single "wait" is
fine?
If so, this needs to be enforced at runtime or at least have a warning,
and not just mentioned in a comment, in case batching support is ever
added for these. Please clarify.
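One possible shape for such a runtime check, placed right after nr_reqs is computed (purely a hypothetical sketch, not something the patch contains):

	acomp_ctx->nr_reqs = min(ZSWAP_MAX_BATCH_SIZE,
				 crypto_acomp_batch_size(acomp_ctx->acomp));

	/*
	 * Hypothetical guard: warn if a compressor that relies on the
	 * synchronous crypto_wait path ever reports a batch size > 1,
	 * since only one "wait" is set up, on reqs[0].
	 */
	WARN_ONCE(!acomp_is_async(acomp_ctx->acomp) && acomp_ctx->nr_reqs > 1,
		  "%s: sync acomp advertises batching but only one wait is set up\n",
		  pool->tfm_name);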
We should also probably merge the comments above crypto_init_wait() and
acomp_request_set_callback() now.
> crypto_init_wait(&acomp_ctx->wait);
>
> /*
* Re: [PATCH v8 14/14] mm: zswap: Compress batching with request chaining in zswap_store() of large folios.
2025-03-03 8:47 ` [PATCH v8 14/14] mm: zswap: Compress batching with request chaining in zswap_store() of large folios Kanchana P Sridhar
2025-03-03 11:07 ` kernel test robot
2025-03-03 11:07 ` kernel test robot
@ 2025-03-06 21:17 ` Yosry Ahmed
2025-04-30 21:05 ` Sridhar, Kanchana P
2 siblings, 1 reply; 44+ messages in thread
From: Yosry Ahmed @ 2025-03-06 21:17 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
kristen.c.accardi, wajdi.k.feghali, vinodh.gopal
On Mon, Mar 03, 2025 at 12:47:24AM -0800, Kanchana P Sridhar wrote:
> This patch introduces three new procedures:
>
> zswap_store_folio()
> zswap_store_pages()
> zswap_batch_compress()
>
> zswap_store_folio() stores a folio in chunks of "batch_size" pages. If the
> compressor supports batching and has multiple acomp_ctx->nr_reqs, the
> batch_size will be the acomp_ctx->nr_reqs. If the compressor does not
> support batching, the batch_size will be ZSWAP_MAX_BATCH_SIZE. The
> compressor having multiple acomp_ctx->nr_reqs is passed as a bool
> "batching" parameter to zswap_store_pages() and zswap_batch_compress().
> This refactorization allows us to move the loop over the folio's pages from
> zswap_store() to zswap_store_folio(), and also enables batching.
>
> zswap_store_pages() implements all the computes done earlier in
> zswap_store_page() for a single-page, for multiple pages in a folio, namely
> a "batch". zswap_store_pages() starts by allocating all zswap entries
> required to store the batch. Next, it calls zswap_batch_compress() to
> compress the batch. Finally, it adds the batch's zswap entries to the
> xarray and LRU, charges zswap memory and increments zswap stats.
>
> The error handling and cleanup required for all failure scenarios that can
> occur while storing a batch in zswap are consolidated to a single
> "store_pages_failed" label in zswap_store_pages().
>
> And finally, this patch introduces zswap_batch_compress(), which does the
> following:
> - If the compressor supports batching, sets up a request chain for
> compressing the batch in one shot. If Intel IAA is the zswap
> compressor, the request chain will be compressed in parallel in
> hardware. If all requests in the chain are compressed without errors,
> the compressed buffers are then stored in zpool.
> - If the compressor does not support batching, each page in the batch is
> compressed and stored sequentially. zswap_batch_compress() replaces
> zswap_compress(), thereby eliminating code duplication and facilitating
> maintainability of the code with the introduction of batching.
>
> The call to the crypto layer is exactly the same in both cases: when batch
> compressing a request chain or when sequentially compressing each page in
> the batch.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
High-level comment: It may be better to break this down into two
patches, one reworking the code to process folios in batches, and one to
add support for compression batching and request chaining. This may make
it more digestible.
> mm/zswap.c | 396 ++++++++++++++++++++++++++++++++++++-----------------
> 1 file changed, 270 insertions(+), 126 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index fae59d6d5147..135d5792ce50 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1051,74 +1051,141 @@ static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx)
> mutex_unlock(&acomp_ctx->mutex);
> }
>
> -static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> - struct zswap_pool *pool)
> +/*
> + * Unified code paths for compressors that do and do not support
> + * batching. This procedure will compress multiple @nr_pages in @folio,
> + * starting from @index.
> + * If @batching is set to true, it will create a request chain for
> + * compression batching. It is assumed that the caller has verified
> + * that the acomp_ctx->nr_reqs is at least @nr_pages.
> + * If @batching is set to false, it will process each page sequentially.
> + * In both cases, if all compressions were successful, it will proceed
> + * to store the compressed buffers in zpool.
> + */
> +static bool zswap_batch_compress(struct folio *folio,
> + long index,
> + unsigned int nr_pages,
> + struct zswap_entry *entries[],
> + struct zswap_pool *pool,
> + struct crypto_acomp_ctx *acomp_ctx,
> + bool batching)
If we have a single compress function let's keep it called
zswap_compress() please.
> {
> - struct crypto_acomp_ctx *acomp_ctx;
> - struct scatterlist input, output;
> - int comp_ret = 0, alloc_ret = 0;
> - unsigned int dlen = PAGE_SIZE;
> - unsigned long handle;
> - struct zpool *zpool;
> - char *buf;
> + struct scatterlist inputs[ZSWAP_MAX_BATCH_SIZE];
> + struct scatterlist outputs[ZSWAP_MAX_BATCH_SIZE];
> + struct zpool *zpool = pool->zpool;
> + int acomp_idx = 0, nr_to_store = 1;
> + unsigned int i, j;
> + int err = 0;
> gfp_t gfp;
> - u8 *dst;
>
> - acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> - dst = acomp_ctx->buffers[0];
> - sg_init_table(&input, 1);
> - sg_set_page(&input, page, PAGE_SIZE, 0);
> + lockdep_assert_held(&acomp_ctx->mutex);
>
> - /*
> - * We need PAGE_SIZE * 2 here since there maybe over-compression case,
> - * and hardware-accelerators may won't check the dst buffer size, so
> - * giving the dst buffer with enough length to avoid buffer overflow.
> - */
> - sg_init_one(&output, dst, PAGE_SIZE * 2);
> - acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, PAGE_SIZE, dlen);
> -
> - /*
> - * it maybe looks a little bit silly that we send an asynchronous request,
> - * then wait for its completion synchronously. This makes the process look
> - * synchronous in fact.
> - * Theoretically, acomp supports users send multiple acomp requests in one
> - * acomp instance, then get those requests done simultaneously. but in this
> - * case, zswap actually does store and load page by page, there is no
> - * existing method to send the second page before the first page is done
> - * in one thread doing zwap.
> - * but in different threads running on different cpu, we have different
> - * acomp instance, so multiple threads can do (de)compression in parallel.
> - */
> - comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
> - dlen = acomp_ctx->reqs[0]->dlen;
> - if (comp_ret)
> - goto unlock;
> -
> - zpool = pool->zpool;
> gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
> if (zpool_malloc_support_movable(zpool))
> gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
> - alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle);
> - if (alloc_ret)
> - goto unlock;
>
> - buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
> - memcpy(buf, dst, dlen);
> - zpool_unmap_handle(zpool, handle);
> + for (i = 0; i < nr_pages; ++i) {
> + struct page *page = folio_page(folio, index + i);
>
> - entry->handle = handle;
> - entry->length = dlen;
> + sg_init_table(&inputs[acomp_idx], 1);
> + sg_set_page(&inputs[acomp_idx], page, PAGE_SIZE, 0);
>
> -unlock:
> - if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
> - zswap_reject_compress_poor++;
> - else if (comp_ret)
> - zswap_reject_compress_fail++;
> - else if (alloc_ret)
> - zswap_reject_alloc_fail++;
> + /*
> + * Each dst buffer should be of size (PAGE_SIZE * 2).
> + * Reflect same in sg_list.
The existing comment about overcompression was useful, please move it
as-is.
> + */
> + sg_init_one(&outputs[acomp_idx], acomp_ctx->buffers[acomp_idx], PAGE_SIZE * 2);
> + acomp_request_set_params(acomp_ctx->reqs[acomp_idx], &inputs[acomp_idx],
> + &outputs[acomp_idx], PAGE_SIZE, PAGE_SIZE);
> +
> + if (batching) {
> + /* Add the acomp request to the chain. */
> + if (likely(i))
> + acomp_request_chain(acomp_ctx->reqs[acomp_idx], acomp_ctx->reqs[0]);
> + else
> + acomp_reqchain_init(acomp_ctx->reqs[0], 0, crypto_req_done,
> + &acomp_ctx->wait);
> +
> + if (i == (nr_pages - 1)) {
> + /* Process the request chain. */
> + err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
> +
> + /*
> + * Get the individual compress errors from request chaining.
> + */
> + for (j = 0; j < nr_pages; ++j) {
> + if (unlikely(acomp_request_err(acomp_ctx->reqs[j]))) {
> + err = -EINVAL;
> + if (acomp_request_err(acomp_ctx->reqs[j]) == -ENOSPC)
> + zswap_reject_compress_poor++;
> + else
> + zswap_reject_compress_fail++;
> + }
> + }
> + /*
> + * Request chaining cleanup:
> + *
> + * - Clear the CRYPTO_TFM_REQ_CHAIN bit on acomp_ctx->reqs[0].
> + * - Reset the acomp_ctx->wait to notify acomp_ctx->reqs[0].
> + */
> + acomp_reqchain_clear(acomp_ctx->reqs[0], &acomp_ctx->wait);
> + if (unlikely(err))
> + return false;
> + j = 0;
> + nr_to_store = nr_pages;
> + goto store_zpool;
> + }
> +
> + ++acomp_idx;
> + continue;
The code structure here is really hard to read:
(1) Why do we need 'batching' to begin with? Can't we use the same
request chaining APIs even for the '!batching' case? Are there any
regressions for doing so?
(2) Instead of the control flow having 'continue', 'goto', and nested
loops, can we just have separate loops?
IOW can we do something like:
for (..) {
	if (i == 0)
		acomp_reqchain_init(..);
	else
		acomp_request_chain(..);
}

crypto_wait_req(..);
/* collect errors */
acomp_reqchain_clear(..);

for (..) {
	/* store */
}
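Concretely, reusing the chaining calls from the hunk above, the flat structure might look roughly like this (just a sketch; whether the '!batching' case can share it is the open question in (1)):

	for (i = 0; i < nr_pages; ++i) {
		/* sg_init/sg_set_page/acomp_request_set_params as in the hunk */
		if (i)
			acomp_request_chain(acomp_ctx->reqs[i], acomp_ctx->reqs[0]);
		else
			acomp_reqchain_init(acomp_ctx->reqs[0], 0, crypto_req_done,
					    &acomp_ctx->wait);
	}

	err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]),
			      &acomp_ctx->wait);

	for (i = 0; i < nr_pages; ++i) {
		if (acomp_request_err(acomp_ctx->reqs[i]))
			err = -EINVAL;	/* bump the reject counters here */
	}

	acomp_reqchain_clear(acomp_ctx->reqs[0], &acomp_ctx->wait);
	if (err)
		return false;

	for (i = 0; i < nr_pages; ++i) {
		/* zpool_malloc() + memcpy() from acomp_ctx->buffers[i] */
	}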
> + } else {
> + err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
> +
> + if (unlikely(err)) {
> + if (err == -ENOSPC)
> + zswap_reject_compress_poor++;
> + else
> + zswap_reject_compress_fail++;
> + return false;
> + }
> + j = i;
> + nr_to_store = 1;
> + }
>
> - acomp_ctx_put_unlock(acomp_ctx);
> - return comp_ret == 0 && alloc_ret == 0;
> +store_zpool:
> + /*
> + * All batch pages were successfully compressed.
> + * Store the pages in zpool.
> + */
> + acomp_idx = -1;
> + while (nr_to_store--) {
> + unsigned long handle;
> + char *buf;
> +
> + ++acomp_idx;
> + prefetchw(entries[j]);
> + err = zpool_malloc(zpool, acomp_ctx->reqs[acomp_idx]->dlen, gfp, &handle);
> +
> + if (unlikely(err)) {
> + if (err == -ENOSPC)
> + zswap_reject_compress_poor++;
> + else
> + zswap_reject_alloc_fail++;
> +
> + return false;
> + }
> +
> + buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
> + memcpy(buf, acomp_ctx->buffers[acomp_idx], acomp_ctx->reqs[acomp_idx]->dlen);
> + zpool_unmap_handle(zpool, handle);
> +
> + entries[j]->handle = handle;
> + entries[j]->length = acomp_ctx->reqs[acomp_idx]->dlen;
> + ++j;
> + }
> + }
> +
> + return true;
> }
>
> static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> @@ -1581,84 +1648,165 @@ static void shrink_worker(struct work_struct *w)
> * main API
> **********************************/
>
> -static bool zswap_store_page(struct page *page,
> - struct obj_cgroup *objcg,
> - struct zswap_pool *pool)
> +/*
> + * Store multiple pages in @folio, starting from the page at index @si up to
> + * and including the page at index @ei.
'start' and 'end'?
> + *
> + * The error handling from all failure points is consolidated to the
> + * "store_pages_failed" label, based on the initialization of the zswap entries'
> + * handles to ERR_PTR(-EINVAL) at allocation time, and the fact that the
> + * entry's handle is subsequently modified only upon a successful zpool_malloc()
> + * after the page is compressed.
This comment is not useful here. Instead comment on specific parts of
the function where the intention of the code is not clear.
> + */
> +static bool zswap_store_pages(struct folio *folio,
> + long si,
> + long ei,
> + struct obj_cgroup *objcg,
> + struct zswap_pool *pool,
> + struct crypto_acomp_ctx *acomp_ctx,
> + bool batching)
> {
> - swp_entry_t page_swpentry = page_swap_entry(page);
> - struct zswap_entry *entry, *old;
> + struct zswap_entry *entries[ZSWAP_MAX_BATCH_SIZE];
> + int node_id = folio_nid(folio);
> + u8 i, from_i = 0, nr_pages = ei - si + 1;
>
> - /* allocate entry */
> - entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
> - if (!entry) {
> - zswap_reject_kmemcache_fail++;
> - return false;
> + for (i = 0; i < nr_pages; ++i) {
> + entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
> +
> + if (unlikely(!entries[i])) {
> + zswap_reject_kmemcache_fail++;
> + nr_pages = i;
> + goto store_pages_failed;
> + }
> +
> + entries[i]->handle = (unsigned long)ERR_PTR(-EINVAL);
> }
>
> - if (!zswap_compress(page, entry, pool))
> - goto compress_failed;
> + if (!zswap_batch_compress(folio, si, nr_pages, entries, pool, acomp_ctx, batching))
> + goto store_pages_failed;
>
> - old = xa_store(swap_zswap_tree(page_swpentry),
> - swp_offset(page_swpentry),
> - entry, GFP_KERNEL);
> - if (xa_is_err(old)) {
> - int err = xa_err(old);
> + for (i = 0; i < nr_pages; ++i) {
> + swp_entry_t page_swpentry = page_swap_entry(folio_page(folio, si + i));
> + struct zswap_entry *old, *entry = entries[i];
>
> - WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> - zswap_reject_alloc_fail++;
> - goto store_failed;
> - }
> + old = xa_store(swap_zswap_tree(page_swpentry),
> + swp_offset(page_swpentry),
> + entry, GFP_KERNEL);
> + if (unlikely(xa_is_err(old))) {
> + int err = xa_err(old);
>
> - /*
> - * We may have had an existing entry that became stale when
> - * the folio was redirtied and now the new version is being
> - * swapped out. Get rid of the old.
> - */
> - if (old)
> - zswap_entry_free(old);
> + WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> + zswap_reject_alloc_fail++;
> + from_i = i;
'error_idx' or 'store_fail_idx'?
> + goto store_pages_failed;
> + }
>
> - /*
> - * The entry is successfully compressed and stored in the tree, there is
> - * no further possibility of failure. Grab refs to the pool and objcg,
> - * charge zswap memory, and increment zswap_stored_pages.
> - * The opposite actions will be performed by zswap_entry_free()
> - * when the entry is removed from the tree.
> - */
> - zswap_pool_get(pool);
> - if (objcg) {
> - obj_cgroup_get(objcg);
> - obj_cgroup_charge_zswap(objcg, entry->length);
> - }
> - atomic_long_inc(&zswap_stored_pages);
> + /*
> + * We may have had an existing entry that became stale when
> + * the folio was redirtied and now the new version is being
> + * swapped out. Get rid of the old.
> + */
> + if (unlikely(old))
> + zswap_entry_free(old);
>
> - /*
> - * We finish initializing the entry while it's already in xarray.
> - * This is safe because:
> - *
> - * 1. Concurrent stores and invalidations are excluded by folio lock.
> - *
> - * 2. Writeback is excluded by the entry not being on the LRU yet.
> - * The publishing order matters to prevent writeback from seeing
> - * an incoherent entry.
> - */
> - entry->pool = pool;
> - entry->swpentry = page_swpentry;
> - entry->objcg = objcg;
> - entry->referenced = true;
> - if (entry->length) {
> - INIT_LIST_HEAD(&entry->lru);
> - zswap_lru_add(&zswap_list_lru, entry);
> + /*
> + * The entry is successfully compressed and stored in the tree, there is
> + * no further possibility of failure. Grab refs to the pool and objcg,
> + * charge zswap memory, and increment zswap_stored_pages.
> + * The opposite actions will be performed by zswap_entry_free()
> + * when the entry is removed from the tree.
> + */
> + zswap_pool_get(pool);
> + if (objcg) {
> + obj_cgroup_get(objcg);
> + obj_cgroup_charge_zswap(objcg, entry->length);
> + }
> + atomic_long_inc(&zswap_stored_pages);
> +
> + /*
> + * We finish initializing the entry while it's already in xarray.
> + * This is safe because:
> + *
> + * 1. Concurrent stores and invalidations are excluded by folio lock.
> + *
> + * 2. Writeback is excluded by the entry not being on the LRU yet.
> + * The publishing order matters to prevent writeback from seeing
> + * an incoherent entry.
> + */
> + entry->pool = pool;
> + entry->swpentry = page_swpentry;
> + entry->objcg = objcg;
> + entry->referenced = true;
> + if (likely(entry->length)) {
> + INIT_LIST_HEAD(&entry->lru);
> + zswap_lru_add(&zswap_list_lru, entry);
> + }
> }
>
> return true;
>
> -store_failed:
> - zpool_free(pool->zpool, entry->handle);
> -compress_failed:
> - zswap_entry_cache_free(entry);
> +store_pages_failed:
> + for (i = from_i; i < nr_pages; ++i) {
> + if (!IS_ERR_VALUE(entries[i]->handle))
> + zpool_free(pool->zpool, entries[i]->handle);
> +
> + zswap_entry_cache_free(entries[i]);
> + }
> +
> return false;
> }
>
> +/*
> + * Store all pages in a folio by calling zswap_batch_compress().
> + * If the compressor supports batching, i.e., has multiple acomp requests,
> + * the folio will be compressed in batches of "acomp_ctx->nr_reqs" using
> + * request chaining.
> + * If the compressor has only one acomp request, the folio will be compressed
> + * in batches of ZSWAP_MAX_BATCH_SIZE pages, where each page in the batch is
> + * compressed sequentially.
> + */
> +static bool zswap_store_folio(struct folio *folio,
> + struct obj_cgroup *objcg,
> + struct zswap_pool *pool)
> +{
> + long nr_pages = folio_nr_pages(folio);
> + struct crypto_acomp_ctx *acomp_ctx;
> + unsigned int batch_size;
> + bool ret = true, batching;
> + long si, ei;
> +
> + acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> +
> + batching = ((acomp_ctx->nr_reqs > 1) && (nr_pages > 1));
> +
> + batch_size = batching ? acomp_ctx->nr_reqs : ZSWAP_MAX_BATCH_SIZE;
acomp_ctx->nr_reqs is a function of the compressor. Can we store it in
the pool to avoid having to hold the mutex here just for this?
This would also save a tiny bit of memory as right now we have it
replicated for all CPUs.
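For instance, something like this, with a hypothetical pool field computed once when the per-CPU contexts are set up (a sketch, not part of the patch):

	struct zswap_pool {
		/* ... existing fields ... */
		u8 batch_size;	/* min(ZSWAP_MAX_BATCH_SIZE, compressor batch size) */
	};

	/* e.g. once, in zswap_cpu_comp_prepare() or at pool creation: */
	pool->batch_size = min(ZSWAP_MAX_BATCH_SIZE,
			       crypto_acomp_batch_size(acomp_ctx->acomp));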
> +
> + if (!batching)
> + acomp_ctx_put_unlock(acomp_ctx);
So in the batching case we keep the mutex held throughout, otherwise we
hold it for every batch. Why?
IIRC you mentioned that only holding it for every batch removes the zstd
regressions, perhaps because holding the mutex for too long is a
problem. But why not do the same for the batching case too?
Also, the current behavior is that we only hold the mutex during
compression. Now we changed this to do it per-folio or per-batch. Does
the current behavior cause regressions? Ideally we'd like to keep the
mutex held as little as possible.
This is worth explaining in detail.
> +
> + /* Store the folio in batches of "batch_size" pages. */
> + for (si = 0, ei = min(si + batch_size - 1, nr_pages - 1);
> + ((si < nr_pages) && (ei < nr_pages));
> + si = ei + 1, ei = min(si + batch_size - 1, nr_pages - 1)) {
This is too complicated. Can we do:
for (start = 0; start < nr_pages; start += batch_size) {
	end = min(start + batch_size - 1, nr_pages - 1);
	...
}
Also, we subtract 1 from ei here and add 1 in zswap_store_pages(), so
perhaps it's better to make it exclusive:
for (start = 0; start < nr_pages; start += batch_size) {
	end = min(start + batch_size, nr_pages);
	...
}
> +
> + if (!batching)
> + acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> +
> + if (!zswap_store_pages(folio, si, ei, objcg, pool, acomp_ctx, batching)) {
> + ret = false;
> + break;
> + }
> +
> + if (!batching)
> + acomp_ctx_put_unlock(acomp_ctx);
> + }
> +
> + if (batching || !ret)
> + acomp_ctx_put_unlock(acomp_ctx);
> +
> + return ret;
> +}
> +
> bool zswap_store(struct folio *folio)
> {
> long nr_pages = folio_nr_pages(folio);
> @@ -1667,7 +1815,6 @@ bool zswap_store(struct folio *folio)
> struct mem_cgroup *memcg = NULL;
> struct zswap_pool *pool;
> bool ret = false;
> - long index;
These seems like an unrelated change.
>
> VM_WARN_ON_ONCE(!folio_test_locked(folio));
> VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> @@ -1701,12 +1848,8 @@ bool zswap_store(struct folio *folio)
> mem_cgroup_put(memcg);
> }
>
> - for (index = 0; index < nr_pages; ++index) {
> - struct page *page = folio_page(folio, index);
> -
> - if (!zswap_store_page(page, objcg, pool))
> - goto put_pool;
> - }
> + if (!zswap_store_folio(folio, objcg, pool))
> + goto put_pool;
If we move the locking out of zswap_store_folio() and simplify the loop
there, I believe it will be simple enough for us to inline it here.
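As a rough sketch, assuming the hypothetical pool-level batch size mentioned earlier, an exclusive 'end' index, and the acomp_ctx locking moved into zswap_store_pages():

	for (start = 0; start < nr_pages; start += pool->batch_size) {
		end = min(start + pool->batch_size, nr_pages);
		if (!zswap_store_pages(folio, start, end, objcg, pool))
			goto put_pool;
	}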
>
> if (objcg)
> count_objcg_events(objcg, ZSWPOUT, nr_pages);
> @@ -1733,6 +1876,7 @@ bool zswap_store(struct folio *folio)
> pgoff_t offset = swp_offset(swp);
> struct zswap_entry *entry;
> struct xarray *tree;
> + long index;
>
> for (index = 0; index < nr_pages; ++index) {
> tree = swap_zswap_tree(swp_entry(type, offset + index));
> --
> 2.27.0
>
* Re: [PATCH v8 14/14] mm: zswap: Compress batching with request chaining in zswap_store() of large folios.
2025-03-03 21:34 ` Sridhar, Kanchana P
@ 2025-03-06 21:20 ` Yosry Ahmed
2025-04-30 21:07 ` Sridhar, Kanchana P
0 siblings, 1 reply; 44+ messages in thread
From: Yosry Ahmed @ 2025-03-06 21:20 UTC (permalink / raw)
To: Sridhar, Kanchana P
Cc: Nhat Pham, lkp, linux-kernel, linux-mm, hannes, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
Accardi, Kristen C, llvm, oe-kbuild-all, Feghali, Wajdi K, Gopal,
Vinodh
On Mon, Mar 03, 2025 at 09:34:04PM +0000, Sridhar, Kanchana P wrote:
>
> > -----Original Message-----
> > From: Nhat Pham <nphamcs@gmail.com>
> > Sent: Monday, March 3, 2025 10:22 AM
> > Subject: Re: [PATCH v8 14/14] mm: zswap: Compress batching with request
> > chaining in zswap_store() of large folios.
> >
> > On Mon, Mar 3, 2025 at 3:07 AM kernel test robot <lkp@intel.com> wrote:
> > >
> > > Hi Kanchana,
> > >
> > > kernel test robot noticed the following build errors:
> > >
> > > > 1166 prefetchw(entries[j]);
> > > --
> >
> > Why are we doing this anyway? Does it have a notable performance
> > difference? At the very least, leave a comment explaining why we're
> > prefetching this (although the build error suggests that we have to
> > remove it anyway).
>
> Hi Nhat,
>
> Yes, it does. The use of prefetchw reduces sys time by ~1.5% because
> it minimizes cache-miss latency by moving the zswap entry to the cache
> before it is written to.
>
> This is data with kernel compilation test, v8 without prefetchw and v8 as-is:
>
> --------------------------------------------------------------------------------
> Kernel compile           v8 without       v8             v8 without      v8
> allmodconfig             prefetchw                       prefetchw
> 2M folios
> --------------------------------------------------------------------------------
> zswap compressor         deflate-iaa      deflate-iaa    zstd            zstd
> --------------------------------------------------------------------------------
> real_sec                      732.89           735.63         768.53          758.21
> user_sec                   15,708.37        15,699.84      15,702.64       15,678.73
> sys_sec                     4,632.58         4,563.70       5,735.06        5,635.69
> --------------------------------------------------------------------------------
> Max_Res_Set_Size_KB        1,874,672        1,867,516      1,874,684       1,872,888
> --------------------------------------------------------------------------------
> memcg_high                         0                0              0               0
> memcg_swap_fail                    0                0              0               0
> zswpout                  114,742,930      112,836,725     92,904,961      89,596,085
> zswpin                    41,184,897       39,983,793     31,018,149      29,163,932
> pswpout                          625            1,069            558           1,059
> pswpin                           599            1,056            540           1,051
> thp_swpout                         1                2              1               2
> thp_swpout_fallback           10,967           10,195          6,918           6,141
> pgmajfault                42,588,331       41,349,069     31,931,882      30,006,422
> ZSWPOUT-2048kB                 7,661            8,710          6,799           7,480
> SWPOUT-2048kB                      1                2              1               2
> --------------------------------------------------------------------------------
>
>
> Sure, I will add a comment, and also "#include <linux/prefetch.h>" in zswap.c
> that will resolve the build error. This is similar to how these files handle prefetchw:
> mm/vmscan.c, kernel/locking/qspinlock.c, include/asm-generic/xor.h, etc.
Please also explicitly mention that the prefetch and likely/unlikely
annotations prevent regressions with software compressors like zstd, and
generally improve the performance with the batching code by ~1.5%.
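For reference, a sketch of what the annotated call being discussed could look like:

#include <linux/prefetch.h>
...
		/*
		 * Prefetch the zswap entry for write before it is
		 * initialized below: this avoids a cache miss on the
		 * first store to the entry and was measured to cut sys
		 * time by ~1.5% with both zstd and deflate-iaa.
		 */
		prefetchw(entries[j]);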
>
> Thanks,
> Kanchana
>
* RE: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage.
2025-03-06 19:35 ` Yosry Ahmed
2025-03-06 19:46 ` Yosry Ahmed
@ 2025-03-07 0:01 ` Sridhar, Kanchana P
2025-03-07 19:30 ` Yosry Ahmed
1 sibling, 1 reply; 44+ messages in thread
From: Sridhar, Kanchana P @ 2025-03-07 0:01 UTC (permalink / raw)
To: Yosry Ahmed
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
Kanchana P
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Thursday, March 6, 2025 11:36 AM
> Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource
> allocation/deletion and mutex lock usage.
>
> On Mon, Mar 03, 2025 at 12:47:22AM -0800, Kanchana P Sridhar wrote:
> > This patch modifies the acomp_ctx resources' lifetime to be from pool
> > creation to deletion. A "bool __online" and "u8 nr_reqs" are added to
> > "struct crypto_acomp_ctx" which simplify a few things:
> >
> > 1) zswap_pool_create() will initialize all members of each percpu
> acomp_ctx
> > to 0 or NULL and only then initialize the mutex.
> > 2) CPU hotplug will set nr_reqs to 1, allocate resources and set __online
> > to true, without locking the mutex.
> > 3) CPU hotunplug will lock the mutex before setting __online to false. It
> > will not delete any resources.
> > 4) acomp_ctx_get_cpu_lock() will lock the mutex, then check if __online
> > is true, and if so, return the mutex for use in zswap compress and
> > decompress ops.
> > 5) CPU onlining after offlining will simply check if either __online or
> > nr_reqs are non-0, and return 0 if so, without re-allocating the
> > resources.
> > 6) zswap_pool_destroy() will call a newly added zswap_cpu_comp_dealloc()
> to
> > delete the acomp_ctx resources.
> > 7) Common resource deletion code in case of zswap_cpu_comp_prepare()
> > errors, and for use in zswap_cpu_comp_dealloc(), is factored into a new
> > acomp_ctx_dealloc().
> >
> > The CPU hot[un]plug callback functions are moved to "pool functions"
> > accordingly.
> >
> > The per-cpu memory cost of not deleting the acomp_ctx resources upon
> CPU
> > offlining, and only deleting them when the pool is destroyed, is as follows:
> >
> > IAA with batching: 64.8 KB
> > Software compressors: 8.2 KB
> >
> > I would appreciate code review comments on whether this memory cost is
> > acceptable, for the latency improvement that it provides due to a faster
> > reclaim restart after a CPU hotunplug-hotplug sequence - all that the
> > hotplug code needs to do is to check if acomp_ctx->nr_reqs is non-0, and
> > if so, set __online to true and return, and reclaim can proceed.
>
> I like the idea of allocating the resources on CPU hotplug but
> leaving them allocated until the pool is torn down. It avoids allocating
> unnecessary memory if some CPUs are never onlined, but it simplifies
> things because we don't have to synchronize against the resources being
> freed in CPU offline.
>
> The only case that would suffer from this AFAICT is if someone onlines
> many CPUs, uses them once, and then offline them and not use them again.
> I am not familiar with CPU hotplug use cases so I can't tell if that's
> something people do, but I am inclined to agree with this
> simplification.
Thanks Yosry, for your code review comments! Good to know that this
simplification is acceptable.
>
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> > mm/zswap.c | 273 +++++++++++++++++++++++++++++++++++------------------
> > 1 file changed, 182 insertions(+), 91 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 10f2a16e7586..cff96df1df8b 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -144,10 +144,12 @@ bool zswap_never_enabled(void)
> > struct crypto_acomp_ctx {
> > struct crypto_acomp *acomp;
> > struct acomp_req *req;
> > - struct crypto_wait wait;
>
> Is there a reason for moving this? If not please avoid unrelated changes.
The reason is that req/buffer (and reqs/buffers with batching) then go together
logically, which I found easier to understand. I can restore the
original order, if that's preferable.
>
> > u8 *buffer;
> > + u8 nr_reqs;
> > + struct crypto_wait wait;
> > struct mutex mutex;
> > bool is_sleepable;
> > + bool __online;
>
> I don't believe we need this.
>
> If we are not freeing resources during CPU offlining, then we do not
> need a CPU offline callback and acomp_ctx->__online serves no purpose.
>
> The whole point of synchronizing between offlining and
> compress/decompress operations is to avoid UAF. If offlining does not
> free resources, then we can hold the mutex directly in the
> compress/decompress path and drop the hotunplug callback completely.
>
> I also believe nr_reqs can be dropped from this patch, as it seems like
> it's only used know when to set __online.
All great points! In fact, that was the original solution I had implemented
(not having an offline callback). But then, I spent some time understanding
the v6.13 hotfix for synchronizing freeing of resources, and this comment
in zswap_cpu_comp_prepare():
/*
* Only hold the mutex after completing allocations, otherwise we may
* recurse into zswap through reclaim and attempt to hold the mutex
* again resulting in a deadlock.
*/
Hence, I figured the constraint of "recurse into zswap through reclaim" was
something the simplification had to account for (even though I had a tough
time imagining how this could happen).
Hence, I added the "bool __online" because zswap_cpu_comp_prepare()
does not acquire the mutex lock while allocating resources. We have already
initialized the mutex, so in theory, it is possible for compress/decompress
to acquire the mutex lock. The __online acts as a way to indicate whether
compress/decompress can proceed reliably to use the resources.
The "nr_reqs" was needed as a way to distinguish between initial and
subsequent calls into zswap_cpu_comp_prepare(), for e.g., on a CPU that
goes through an online-offline-online sequence. In the initial onlining,
we need to allocate resources because nr_reqs=0. If resources are to
be allocated, we set acomp_ctx->nr_reqs and proceed to allocate
reqs/buffers/etc. In the subsequent onlining, we can quickly inspect
nr_reqs as being greater than 0 and return, thus avoiding any latency
delays before reclaim/page-faults can be handled on that CPU.
Please let me know if this rationale seems reasonable for why
__online and nr_reqs were introduced.
>
> > };
> >
> > /*
> > @@ -246,6 +248,122 @@ static inline struct xarray
> *swap_zswap_tree(swp_entry_t swp)
> > **********************************/
> > static void __zswap_pool_empty(struct percpu_ref *ref);
> >
> > +static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
> > +{
> > + if (!IS_ERR_OR_NULL(acomp_ctx) && acomp_ctx->nr_reqs) {
> > +
> > + if (!IS_ERR_OR_NULL(acomp_ctx->req))
> > + acomp_request_free(acomp_ctx->req);
> > + acomp_ctx->req = NULL;
> > +
> > + kfree(acomp_ctx->buffer);
> > + acomp_ctx->buffer = NULL;
> > +
> > + if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> > + crypto_free_acomp(acomp_ctx->acomp);
> > +
> > + acomp_ctx->nr_reqs = 0;
> > + }
> > +}
>
> Please split the pure refactoring into a separate patch to make it
> easier to review.
Sure, will do.
>
> > +
> > +static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node
> *node)
>
> Why is the function moved while being changed? It's really hard to see
> the diff this way. If the function needs to be moved please do that
> separately as well.
Sure, will do.
>
> I also see some ordering changes inside the function (e.g. we now
> allocate the request before the buffer). Not sure if these are
> intentional. If not, please keep the diff to the required changes only.
The reason for this was that I am trying to organize the allocations based
on dependencies: unless requests are allocated, there is no point in
allocating buffers. Please let me know if this is OK.
>
> > +{
> > + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> node);
> > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> >acomp_ctx, cpu);
> > + int ret = -ENOMEM;
> > +
> > + /*
> > + * Just to be even more fail-safe against changes in assumptions
> and/or
> > + * implementation of the CPU hotplug code.
> > + */
> > + if (acomp_ctx->__online)
> > + return 0;
> > +
> > + if (acomp_ctx->nr_reqs) {
> > + acomp_ctx->__online = true;
> > + return 0;
> > + }
> > +
> > + acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0,
> 0, cpu_to_node(cpu));
> > + if (IS_ERR(acomp_ctx->acomp)) {
> > + pr_err("could not alloc crypto acomp %s : %ld\n",
> > + pool->tfm_name, PTR_ERR(acomp_ctx->acomp));
> > + ret = PTR_ERR(acomp_ctx->acomp);
> > + goto fail;
> > + }
> > +
> > + acomp_ctx->nr_reqs = 1;
> > +
> > + acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
> > + if (!acomp_ctx->req) {
> > + pr_err("could not alloc crypto acomp_request %s\n",
> > + pool->tfm_name);
> > + ret = -ENOMEM;
> > + goto fail;
> > + }
> > +
> > + acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
> cpu_to_node(cpu));
> > + if (!acomp_ctx->buffer) {
> > + ret = -ENOMEM;
> > + goto fail;
> > + }
> > +
> > + crypto_init_wait(&acomp_ctx->wait);
> > +
> > + /*
> > + * if the backend of acomp is async zip, crypto_req_done() will
> wakeup
> > + * crypto_wait_req(); if the backend of acomp is scomp, the callback
> > + * won't be called, crypto_wait_req() will return without blocking.
> > + */
> > + acomp_request_set_callback(acomp_ctx->req,
> CRYPTO_TFM_REQ_MAY_BACKLOG,
> > + crypto_req_done, &acomp_ctx->wait);
> > +
> > + acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp);
> > +
> > + acomp_ctx->__online = true;
> > +
> > + return 0;
> > +
> > +fail:
> > + acomp_ctx_dealloc(acomp_ctx);
> > +
> > + return ret;
> > +}
> > +
> > +static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node
> *node)
> > +{
> > + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> node);
> > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> >acomp_ctx, cpu);
> > +
> > + mutex_lock(&acomp_ctx->mutex);
> > + acomp_ctx->__online = false;
> > + mutex_unlock(&acomp_ctx->mutex);
> > +
> > + return 0;
> > +}
> > +
> > +static void zswap_cpu_comp_dealloc(unsigned int cpu, struct hlist_node
> *node)
> > +{
> > + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> node);
> > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> >acomp_ctx, cpu);
> > +
> > + /*
> > + * The lifetime of acomp_ctx resources is from pool creation to
> > + * pool deletion.
> > + *
> > + * Reclaims should not be happening because, we get to this routine
> only
> > + * in two scenarios:
> > + *
> > + * 1) pool creation failures before/during the pool ref initialization.
> > + * 2) we are in the process of releasing the pool, it is off the
> > + * zswap_pools list and has no references.
> > + *
> > + * Hence, there is no need for locks.
> > + */
> > + acomp_ctx->__online = false;
> > + acomp_ctx_dealloc(acomp_ctx);
>
> Since __online can be dropped, we can probably drop
> zswap_cpu_comp_dealloc() and call acomp_ctx_dealloc() directly?
I suppose there is value in having a way in zswap to know for sure that
resource allocation has completed and it is safe for compress/decompress
to proceed, especially because the mutex has been initialized before we
get to resource allocation. Would you agree?
>
> > +}
> > +
> > static struct zswap_pool *zswap_pool_create(char *type, char
> *compressor)
> > {
> > struct zswap_pool *pool;
> > @@ -285,13 +403,21 @@ static struct zswap_pool
> *zswap_pool_create(char *type, char *compressor)
> > goto error;
> > }
> >
> > - for_each_possible_cpu(cpu)
> > - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
> > + for_each_possible_cpu(cpu) {
> > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> >acomp_ctx, cpu);
> > +
> > + acomp_ctx->acomp = NULL;
> > + acomp_ctx->req = NULL;
> > + acomp_ctx->buffer = NULL;
> > + acomp_ctx->__online = false;
> > + acomp_ctx->nr_reqs = 0;
>
> Why is this needed? Wouldn't zswap_cpu_comp_prepare() initialize them
> right away?
Yes, I figured this is needed for two reasons:
1) For the error handling in zswap_cpu_comp_prepare() and calls into
zswap_cpu_comp_dealloc() to be handled by the common procedure
"acomp_ctx_dealloc()" unambiguously.
2) The second scenario I thought of that would need this is when the
zswap compressor is switched again immediately after being set,
while some cores have executed the onlining code and
some haven't. Because there are no pool refs held,
zswap_cpu_comp_dealloc() would be called per-CPU. Hence, I figured
it would help to initialize these acomp_ctx members before the
hand-off to "cpuhp_state_add_instance()" in zswap_pool_create().
Please let me know if these are valid considerations.
>
> If it is in fact needed we should probably just use __GFP_ZERO.
Sure. Are you suggesting I use "alloc_percpu_gfp()" instead of "alloc_percpu()"
for the acomp_ctx?
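(Presumably something along these lines, if zero-initialization is in fact needed; just a sketch:)

	pool->acomp_ctx = alloc_percpu_gfp(struct crypto_acomp_ctx,
					   GFP_KERNEL | __GFP_ZERO);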
>
> > + mutex_init(&acomp_ctx->mutex);
> > + }
> >
> > ret =
> cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> > &pool->node);
> > if (ret)
> > - goto error;
> > + goto ref_fail;
> >
> > /* being the current pool takes 1 ref; this func expects the
> > * caller to always add the new pool as the current pool
> > @@ -307,6 +433,9 @@ static struct zswap_pool *zswap_pool_create(char
> *type, char *compressor)
> > return pool;
> >
> > ref_fail:
> > + for_each_possible_cpu(cpu)
> > + zswap_cpu_comp_dealloc(cpu, &pool->node);
> > +
> > cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> &pool->node);
>
> I am wondering if we can guard these by hlist_empty(&pool->node) instead
> of having separate labels. If we do that we can probably make all the
> cleanup calls conditional and merge this cleanup code with
> zswap_pool_destroy().
>
> Although I am not too sure about whether or not we should rely on
> hlist_empty() for this. I am just thinking out loud, no need to do
> anything here. If you decide to pursue this tho please make it a
> separate refactoring patch.
Sure, makes sense.
>
> > error:
> > if (pool->acomp_ctx)
> > @@ -361,8 +490,13 @@ static struct zswap_pool
> *__zswap_pool_create_fallback(void)
> >
> > static void zswap_pool_destroy(struct zswap_pool *pool)
> > {
> > + int cpu;
> > +
> > zswap_pool_debug("destroying", pool);
> >
> > + for_each_possible_cpu(cpu)
> > + zswap_cpu_comp_dealloc(cpu, &pool->node);
> > +
> > cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> &pool->node);
> > free_percpu(pool->acomp_ctx);
> >
> > @@ -816,85 +950,6 @@ static void zswap_entry_free(struct zswap_entry
> *entry)
> > /*********************************
> > * compressed storage functions
> > **********************************/
> > -static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node
> *node)
> > -{
> > - struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> node);
> > - struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> >acomp_ctx, cpu);
> > - struct crypto_acomp *acomp = NULL;
> > - struct acomp_req *req = NULL;
> > - u8 *buffer = NULL;
> > - int ret;
> > -
> > - buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
> cpu_to_node(cpu));
> > - if (!buffer) {
> > - ret = -ENOMEM;
> > - goto fail;
> > - }
> > -
> > - acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0,
> cpu_to_node(cpu));
> > - if (IS_ERR(acomp)) {
> > - pr_err("could not alloc crypto acomp %s : %ld\n",
> > - pool->tfm_name, PTR_ERR(acomp));
> > - ret = PTR_ERR(acomp);
> > - goto fail;
> > - }
> > -
> > - req = acomp_request_alloc(acomp);
> > - if (!req) {
> > - pr_err("could not alloc crypto acomp_request %s\n",
> > - pool->tfm_name);
> > - ret = -ENOMEM;
> > - goto fail;
> > - }
> > -
> > - /*
> > - * Only hold the mutex after completing allocations, otherwise we
> may
> > - * recurse into zswap through reclaim and attempt to hold the mutex
> > - * again resulting in a deadlock.
> > - */
> > - mutex_lock(&acomp_ctx->mutex);
> > - crypto_init_wait(&acomp_ctx->wait);
> > -
> > - /*
> > - * if the backend of acomp is async zip, crypto_req_done() will
> wakeup
> > - * crypto_wait_req(); if the backend of acomp is scomp, the callback
> > - * won't be called, crypto_wait_req() will return without blocking.
> > - */
> > - acomp_request_set_callback(req,
> CRYPTO_TFM_REQ_MAY_BACKLOG,
> > - crypto_req_done, &acomp_ctx->wait);
> > -
> > - acomp_ctx->buffer = buffer;
> > - acomp_ctx->acomp = acomp;
> > - acomp_ctx->is_sleepable = acomp_is_async(acomp);
> > - acomp_ctx->req = req;
> > - mutex_unlock(&acomp_ctx->mutex);
> > - return 0;
> > -
> > -fail:
> > - if (acomp)
> > - crypto_free_acomp(acomp);
> > - kfree(buffer);
> > - return ret;
> > -}
> > -
> > -static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node
> *node)
> > -{
> > - struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> node);
> > - struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> >acomp_ctx, cpu);
> > -
> > - mutex_lock(&acomp_ctx->mutex);
> > - if (!IS_ERR_OR_NULL(acomp_ctx)) {
> > - if (!IS_ERR_OR_NULL(acomp_ctx->req))
> > - acomp_request_free(acomp_ctx->req);
> > - acomp_ctx->req = NULL;
> > - if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> > - crypto_free_acomp(acomp_ctx->acomp);
> > - kfree(acomp_ctx->buffer);
> > - }
> > - mutex_unlock(&acomp_ctx->mutex);
> > -
> > - return 0;
> > -}
> >
> > static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct
> zswap_pool *pool)
> > {
> > @@ -902,16 +957,52 @@ static struct crypto_acomp_ctx
> *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
> >
> > for (;;) {
> > acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> > - mutex_lock(&acomp_ctx->mutex);
> > - if (likely(acomp_ctx->req))
> > - return acomp_ctx;
> > /*
> > - * It is possible that we were migrated to a different CPU
> after
> > - * getting the per-CPU ctx but before the mutex was
> acquired. If
> > - * the old CPU got offlined, zswap_cpu_comp_dead() could
> have
> > - * already freed ctx->req (among other things) and set it to
> > - * NULL. Just try again on the new CPU that we ended up on.
> > + * If the CPU onlining code successfully allocates acomp_ctx
> resources,
> > + * it sets acomp_ctx->__online to true. Until this happens, we
> have
> > + * two options:
> > + *
> > + * 1. Return NULL and fail all stores on this CPU.
> > + * 2. Retry, until onlining has finished allocating resources.
> > + *
> > + * In theory, option 1 could be more appropriate, because it
> > + * allows the calling procedure to decide how it wants to
> handle
> > + * reclaim racing with CPU hotplug. For instance, it might be
> Ok
> > + * for compress to return an error for the backing swap device
> > + * to store the folio. Decompress could wait until we get a
> > + * valid and locked mutex after onlining has completed. For
> now,
> > + * we go with option 2 because adding a do-while in
> > + * zswap_decompress() adds latency for software
> compressors.
> > + *
> > + * Once initialized, the resources will be de-allocated only
> > + * when the pool is destroyed. The acomp_ctx will hold on to
> the
> > + * resources through CPU offlining/onlining at any time until
> > + * the pool is destroyed.
> > + *
> > + * This prevents races/deadlocks between reclaim and CPU
> acomp_ctx
> > + * resource allocation that are a dependency for reclaim.
> > + * It further simplifies the interaction with CPU onlining and
> > + * offlining:
> > + *
> > + * - CPU onlining does not take the mutex. It only allocates
> > + * resources and sets __online to true.
> > + * - CPU offlining acquires the mutex before setting
> > + * __online to false. If reclaim has acquired the mutex,
> > + * offlining will have to wait for reclaim to complete before
> > + * hotunplug can proceed. Further, hotplug merely sets
> > + * __online to false. It does not delete the acomp_ctx
> > + * resources.
> > + *
> > + * Option 1 is better than potentially not exiting the earlier
> > + * for (;;) loop because the system is running low on memory
> > + * and/or CPUs are getting offlined for whatever reason. At
> > + * least failing this store will prevent data loss by failing
> > + * zswap_store(), and saving the data in the backing swap
> device.
> > */
>
> I believe we can dropped. I don't think we can have any store/load
> operations on a CPU before it's fully onlined, and we should always have
> a reference on the pool here, so the resources cannot go away.
>
> So unless I missed something we can drop this completely now and just
> hold the mutex directly in the load/store paths.
Based on the above explanations, please let me know if it is a good idea
to keep the __online, or if you think further simplification is possible.
Thanks,
Kanchana
>
> > + mutex_lock(&acomp_ctx->mutex);
> > + if (likely(acomp_ctx->__online))
> > + return acomp_ctx;
> > +
> > mutex_unlock(&acomp_ctx->mutex);
> > }
> > }
> > --
> > 2.27.0
> >
* Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage.
2025-03-07 0:01 ` Sridhar, Kanchana P
@ 2025-03-07 19:30 ` Yosry Ahmed
2025-03-08 2:47 ` Sridhar, Kanchana P
0 siblings, 1 reply; 44+ messages in thread
From: Yosry Ahmed @ 2025-03-07 19:30 UTC (permalink / raw)
To: Sridhar, Kanchana P
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh
On Fri, Mar 07, 2025 at 12:01:14AM +0000, Sridhar, Kanchana P wrote:
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > Sent: Thursday, March 6, 2025 11:36 AM
> > Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource
> > allocation/deletion and mutex lock usage.
> >
> > On Mon, Mar 03, 2025 at 12:47:22AM -0800, Kanchana P Sridhar wrote:
> > > This patch modifies the acomp_ctx resources' lifetime to be from pool
> > > creation to deletion. A "bool __online" and "u8 nr_reqs" are added to
> > > "struct crypto_acomp_ctx" which simplify a few things:
> > >
> > > 1) zswap_pool_create() will initialize all members of each percpu
> > acomp_ctx
> > > to 0 or NULL and only then initialize the mutex.
> > > 2) CPU hotplug will set nr_reqs to 1, allocate resources and set __online
> > > to true, without locking the mutex.
> > > 3) CPU hotunplug will lock the mutex before setting __online to false. It
> > > will not delete any resources.
> > > 4) acomp_ctx_get_cpu_lock() will lock the mutex, then check if __online
> > > is true, and if so, return the mutex for use in zswap compress and
> > > decompress ops.
> > > 5) CPU onlining after offlining will simply check if either __online or
> > > nr_reqs are non-0, and return 0 if so, without re-allocating the
> > > resources.
> > > 6) zswap_pool_destroy() will call a newly added zswap_cpu_comp_dealloc()
> > to
> > > delete the acomp_ctx resources.
> > > 7) Common resource deletion code in case of zswap_cpu_comp_prepare()
> > > errors, and for use in zswap_cpu_comp_dealloc(), is factored into a new
> > > acomp_ctx_dealloc().
> > >
> > > The CPU hot[un]plug callback functions are moved to "pool functions"
> > > accordingly.
> > >
> > > The per-cpu memory cost of not deleting the acomp_ctx resources upon
> > CPU
> > > offlining, and only deleting them when the pool is destroyed, is as follows:
> > >
> > > IAA with batching: 64.8 KB
> > > Software compressors: 8.2 KB
> > >
> > > I would appreciate code review comments on whether this memory cost is
> > > acceptable, for the latency improvement that it provides due to a faster
> > > reclaim restart after a CPU hotunplug-hotplug sequence - all that the
> > > hotplug code needs to do is to check if acomp_ctx->nr_reqs is non-0, and
> > > if so, set __online to true and return, and reclaim can proceed.
> >
> > I like the idea of allocating the resources on CPU hotplug but
> > leaving them allocated until the pool is torn down. It avoids allocating
> > unnecessary memory if some CPUs are never onlined, but it simplifies
> > things because we don't have to synchronize against the resources being
> > freed in CPU offline.
> >
> > The only case that would suffer from this AFAICT is if someone onlines
> > many CPUs, uses them once, and then offline them and not use them again.
> > I am not familiar with CPU hotplug use cases so I can't tell if that's
> > something people do, but I am inclined to agree with this
> > simplification.
>
> Thanks Yosry, for your code review comments! Good to know that this
> simplification is acceptable.
>
> >
> > >
> > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > > ---
> > > mm/zswap.c | 273 +++++++++++++++++++++++++++++++++++--------------
> > ----
> > > 1 file changed, 182 insertions(+), 91 deletions(-)
> > >
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index 10f2a16e7586..cff96df1df8b 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -144,10 +144,12 @@ bool zswap_never_enabled(void)
> > > struct crypto_acomp_ctx {
> > > struct crypto_acomp *acomp;
> > > struct acomp_req *req;
> > > - struct crypto_wait wait;
> >
> > Is there a reason for moving this? If not please avoid unrelated changes.
>
> The reason is so that req/buffer, and reqs/buffers with batching, go together
> logically, hence I found this easier to understand. I can restore this to the
> original order, if that's preferable.
I see. In that case, this fits better in the patch that actually adds
support for having multiple requests and buffers, and please call it out
explicitly in the commit message.
>
> >
> > > u8 *buffer;
> > > + u8 nr_reqs;
> > > + struct crypto_wait wait;
> > > struct mutex mutex;
> > > bool is_sleepable;
> > > + bool __online;
> >
> > I don't believe we need this.
> >
> > If we are not freeing resources during CPU offlining, then we do not
> > need a CPU offline callback and acomp_ctx->__online serves no purpose.
> >
> > The whole point of synchronizing between offlining and
> > compress/decompress operations is to avoid UAF. If offlining does not
> > free resources, then we can hold the mutex directly in the
> > compress/decompress path and drop the hotunplug callback completely.
> >
> > I also believe nr_reqs can be dropped from this patch, as it seems like
> > it's only used know when to set __online.
>
> All great points! In fact, that was the original solution I had implemented
> (not having an offline callback). But then, I spent some time understanding
> the v6.13 hotfix for synchronizing freeing of resources, and this comment
> in zswap_cpu_comp_prepare():
>
> /*
> * Only hold the mutex after completing allocations, otherwise we may
> * recurse into zswap through reclaim and attempt to hold the mutex
> * again resulting in a deadlock.
> */
>
> Hence, I figured the constraint of "recurse into zswap through reclaim" was
> something to comprehend in the simplification (even though I had a tough
> time imagining how this could happen).
The constraint here is about zswap_cpu_comp_prepare() holding the mutex,
making an allocation which internally triggers reclaim, then recursing
into zswap and trying to hold the same mutex again causing a deadlock.
If zswap_cpu_comp_prepare() does not need to hold the mutex to begin
with, the constraint naturally goes away.
>
> Hence, I added the "bool __online" because zswap_cpu_comp_prepare()
> does not acquire the mutex lock while allocating resources. We have already
> initialized the mutex, so in theory, it is possible for compress/decompress
> to acquire the mutex lock. The __online acts as a way to indicate whether
> compress/decompress can proceed reliably to use the resources.
For compress/decompress to acquire the mutex they need to run on that
CPU, and I don't think that's possible before onlining completes, so
zswap_cpu_comp_prepare() must have already completed before
compress/decompress can use that CPU IIUC.
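If that holds, the lock helper could presumably collapse to something like this (a sketch, assuming the acomp_ctx resources are only freed at pool destruction):

static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
{
	struct crypto_acomp_ctx *acomp_ctx;

	/*
	 * Resources live for the pool's lifetime now, so even if we are
	 * migrated off this CPU after raw_cpu_ptr(), the ctx we lock
	 * remains valid; the mutex alone serializes its use.
	 */
	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
	mutex_lock(&acomp_ctx->mutex);
	return acomp_ctx;
}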
>
> The "nr_reqs" was needed as a way to distinguish between initial and
> subsequent calls into zswap_cpu_comp_prepare(), for e.g., on a CPU that
> goes through an online-offline-online sequence. In the initial onlining,
> we need to allocate resources because nr_reqs=0. If resources are to
> be allocated, we set acomp_ctx->nr_reqs and proceed to allocate
> reqs/buffers/etc. In the subsequent onlining, we can quickly inspect
> nr_reqs as being greater than 0 and return, thus avoiding any latency
> delays before reclaim/page-faults can be handled on that CPU.
>
> Please let me know if this rationale seems reasonable for why
> __online and nr_reqs were introduced.
Based on what I said, I still don't believe they are needed, but please
correct me if I am wrong.
[..]
> > I also see some ordering changes inside the function (e.g. we now
> > allocate the request before the buffer). Not sure if these are
> > intentional. If not, please keep the diff to the required changes only.
>
> The reason for this was, I am trying to organize the allocations based
> on dependencies. Unless requests are allocated, there is no point in
> allocating buffers. Please let me know if this is Ok.
Please separate refactoring changes in general from functional changes,
because mixing them makes code review harder.
In this specific instance, I think moving the code is probably not worth
it, as there's also no point in allocating requests if we cannot
allocate buffers. In fact, since the buffers are larger, in theory their
allocation is more likely to fail, so it makes sense to do it first.
Anyway, please propose such refactoring changes separately and they can
be discussed as such.
[..]
> > > +static void zswap_cpu_comp_dealloc(unsigned int cpu, struct hlist_node
> > *node)
> > > +{
> > > + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> > node);
> > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> > >acomp_ctx, cpu);
> > > +
> > > + /*
> > > + * The lifetime of acomp_ctx resources is from pool creation to
> > > + * pool deletion.
> > > + *
> > > + * Reclaims should not be happening because, we get to this routine
> > only
> > > + * in two scenarios:
> > > + *
> > > + * 1) pool creation failures before/during the pool ref initialization.
> > > + * 2) we are in the process of releasing the pool, it is off the
> > > + * zswap_pools list and has no references.
> > > + *
> > > + * Hence, there is no need for locks.
> > > + */
> > > + acomp_ctx->__online = false;
> > > + acomp_ctx_dealloc(acomp_ctx);
> >
> > Since __online can be dropped, we can probably drop
> > zswap_cpu_comp_dealloc() and call acomp_ctx_dealloc() directly?
>
> I suppose there is value in having a way in zswap to know for sure, that
> resource allocation has completed, and it is safe for compress/decompress
> to proceed. Especially because the mutex has been initialized before we
> get to resource allocation. Would you agree?
As I mentioned above, I believe compress/decompress cannot run on a CPU
before the onlining completes. Please correct me if I am wrong.
>
> >
> > > +}
> > > +
> > > static struct zswap_pool *zswap_pool_create(char *type, char
> > *compressor)
> > > {
> > > struct zswap_pool *pool;
> > > @@ -285,13 +403,21 @@ static struct zswap_pool
> > *zswap_pool_create(char *type, char *compressor)
> > > goto error;
> > > }
> > >
> > > - for_each_possible_cpu(cpu)
> > > - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
> > > + for_each_possible_cpu(cpu) {
> > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> > >acomp_ctx, cpu);
> > > +
> > > + acomp_ctx->acomp = NULL;
> > > + acomp_ctx->req = NULL;
> > > + acomp_ctx->buffer = NULL;
> > > + acomp_ctx->__online = false;
> > > + acomp_ctx->nr_reqs = 0;
> >
> > Why is this needed? Wouldn't zswap_cpu_comp_prepare() initialize them
> > right away?
>
> Yes, I figured this is needed for two reasons:
>
> 1) For the error handling in zswap_cpu_comp_prepare() and calls into
> zswap_cpu_comp_dealloc() to be handled by the common procedure
> "acomp_ctx_dealloc()" unambiguously.
This makes sense. When you move the refactoring to create
acomp_ctx_dealloc() to a separate patch, please include this change in
it and call it out explicitly in the commit message.
> 2) The second scenario I thought of that would need this, is let's say
> the zswap compressor is switched immediately after setting the
> compressor. Some cores have executed the onlining code and
> some haven't. Because there are no pool refs held,
> zswap_cpu_comp_dealloc() would be called per-CPU. Hence, I figured
> it would help to initialize these acomp_ctx members before the
> hand-off to "cpuhp_state_add_instance()" in zswap_pool_create().
I believe cpuhp_state_add_instance() calls the onlining function
synchronously on all present CPUs, so I don't think it's possible to end
up in a state where the pool is being destroyed and some CPU executed
zswap_cpu_comp_prepare() while others haven't.
That being said, this made me think of a different problem. If pool
destruction races with CPU onlining, there could be a race between
zswap_cpu_comp_prepare() allocating resources and
zswap_cpu_comp_dealloc() (or acomp_ctx_dealloc()) freeing them.
I believe we must always call cpuhp_state_remove_instance() *before*
freeing the resources to prevent this race from happening. This needs to
be documented with a comment.
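For example, roughly (a sketch assuming the acomp_ctx_dealloc() helper
introduced in this patch; not final code):

	/* In zswap_pool_destroy(): unregister the hotplug instance first so
	 * zswap_cpu_comp_prepare() can no longer run for this pool, then
	 * free the per-CPU resources. */
	cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
	for_each_possible_cpu(cpu)
		acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
	free_percpu(pool->acomp_ctx);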
Let me know if I missed something.
>
> Please let me know if these are valid considerations.
>
> >
> > If it is in fact needed we should probably just use __GFP_ZERO.
>
> Sure. Are you suggesting I use "alloc_percpu_gfp()" instead of "alloc_percpu()"
> for the acomp_ctx?
Yeah if we need to initialize all/most fields to 0 let's use
alloc_percpu_gfp() and pass GFP_KERNEL | __GFP_ZERO.
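For example (a minimal sketch against the existing pool->acomp_ctx
allocation in zswap_pool_create()):

	pool->acomp_ctx = alloc_percpu_gfp(struct crypto_acomp_ctx,
					   GFP_KERNEL | __GFP_ZERO);
	if (!pool->acomp_ctx)
		goto error;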
[..]
> > > @@ -902,16 +957,52 @@ static struct crypto_acomp_ctx
> > *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
> > >
> > > for (;;) {
> > > acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> > > - mutex_lock(&acomp_ctx->mutex);
> > > - if (likely(acomp_ctx->req))
> > > - return acomp_ctx;
> > > /*
> > > - * It is possible that we were migrated to a different CPU
> > after
> > > - * getting the per-CPU ctx but before the mutex was
> > acquired. If
> > > - * the old CPU got offlined, zswap_cpu_comp_dead() could
> > have
> > > - * already freed ctx->req (among other things) and set it to
> > > - * NULL. Just try again on the new CPU that we ended up on.
> > > + * If the CPU onlining code successfully allocates acomp_ctx
> > resources,
> > > + * it sets acomp_ctx->__online to true. Until this happens, we
> > have
> > > + * two options:
> > > + *
> > > + * 1. Return NULL and fail all stores on this CPU.
> > > + * 2. Retry, until onlining has finished allocating resources.
> > > + *
> > > + * In theory, option 1 could be more appropriate, because it
> > > + * allows the calling procedure to decide how it wants to
> > handle
> > > + * reclaim racing with CPU hotplug. For instance, it might be
> > Ok
> > > + * for compress to return an error for the backing swap device
> > > + * to store the folio. Decompress could wait until we get a
> > > + * valid and locked mutex after onlining has completed. For
> > now,
> > > + * we go with option 2 because adding a do-while in
> > > + * zswap_decompress() adds latency for software
> > compressors.
> > > + *
> > > + * Once initialized, the resources will be de-allocated only
> > > + * when the pool is destroyed. The acomp_ctx will hold on to
> > the
> > > + * resources through CPU offlining/onlining at any time until
> > > + * the pool is destroyed.
> > > + *
> > > + * This prevents races/deadlocks between reclaim and CPU
> > acomp_ctx
> > > + * resource allocation that are a dependency for reclaim.
> > > + * It further simplifies the interaction with CPU onlining and
> > > + * offlining:
> > > + *
> > > + * - CPU onlining does not take the mutex. It only allocates
> > > + * resources and sets __online to true.
> > > + * - CPU offlining acquires the mutex before setting
> > > + * __online to false. If reclaim has acquired the mutex,
> > > + * offlining will have to wait for reclaim to complete before
> > > + * hotunplug can proceed. Further, hotplug merely sets
> > > + * __online to false. It does not delete the acomp_ctx
> > > + * resources.
> > > + *
> > > + * Option 1 is better than potentially not exiting the earlier
> > > + * for (;;) loop because the system is running low on memory
> > > + * and/or CPUs are getting offlined for whatever reason. At
> > > + * least failing this store will prevent data loss by failing
> > > + * zswap_store(), and saving the data in the backing swap
> > device.
> > > */
> >
> > I believe this can be dropped. I don't think we can have any store/load
> > operations on a CPU before it's fully onlined, and we should always have
> > a reference on the pool here, so the resources cannot go away.
> >
> > So unless I missed something we can drop this completely now and just
> > hold the mutex directly in the load/store paths.
>
> Based on the above explanations, please let me know if it is a good idea
> to keep the __online, or if you think further simplification is possible.
I still think it's not needed. Let me know if I missed anything.
^ permalink raw reply [flat|nested] 44+ messages in thread
* RE: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage.
2025-03-07 19:30 ` Yosry Ahmed
@ 2025-03-08 2:47 ` Sridhar, Kanchana P
2025-03-10 17:31 ` Yosry Ahmed
0 siblings, 1 reply; 44+ messages in thread
From: Sridhar, Kanchana P @ 2025-03-08 2:47 UTC (permalink / raw)
To: Yosry Ahmed
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
Kanchana P
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Friday, March 7, 2025 11:30 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource
> allocation/deletion and mutex lock usage.
>
> On Fri, Mar 07, 2025 at 12:01:14AM +0000, Sridhar, Kanchana P wrote:
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > Sent: Thursday, March 6, 2025 11:36 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > hannes@cmpxchg.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev;
> > > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux-
> > > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > > <kristen.c.accardi@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>;
> > > Gopal, Vinodh <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource
> > > allocation/deletion and mutex lock usage.
> > >
> > > On Mon, Mar 03, 2025 at 12:47:22AM -0800, Kanchana P Sridhar wrote:
> > > > This patch modifies the acomp_ctx resources' lifetime to be from pool
> > > > creation to deletion. A "bool __online" and "u8 nr_reqs" are added to
> > > > "struct crypto_acomp_ctx" which simplify a few things:
> > > >
> > > > 1) zswap_pool_create() will initialize all members of each percpu
> > > acomp_ctx
> > > > to 0 or NULL and only then initialize the mutex.
> > > > 2) CPU hotplug will set nr_reqs to 1, allocate resources and set __online
> > > > to true, without locking the mutex.
> > > > 3) CPU hotunplug will lock the mutex before setting __online to false. It
> > > > will not delete any resources.
> > > > 4) acomp_ctx_get_cpu_lock() will lock the mutex, then check if __online
> > > > is true, and if so, return the mutex for use in zswap compress and
> > > > decompress ops.
> > > > 5) CPU onlining after offlining will simply check if either __online or
> > > > nr_reqs are non-0, and return 0 if so, without re-allocating the
> > > > resources.
> > > > 6) zswap_pool_destroy() will call a newly added
> zswap_cpu_comp_dealloc()
> > > to
> > > > delete the acomp_ctx resources.
> > > > 7) Common resource deletion code in case of
> zswap_cpu_comp_prepare()
> > > > errors, and for use in zswap_cpu_comp_dealloc(), is factored into a
> new
> > > > acomp_ctx_dealloc().
> > > >
> > > > The CPU hot[un]plug callback functions are moved to "pool functions"
> > > > accordingly.
> > > >
> > > > The per-cpu memory cost of not deleting the acomp_ctx resources upon
> > > CPU
> > > > offlining, and only deleting them when the pool is destroyed, is as
> follows:
> > > >
> > > > IAA with batching: 64.8 KB
> > > > Software compressors: 8.2 KB
> > > >
> > > > I would appreciate code review comments on whether this memory cost
> is
> > > > acceptable, for the latency improvement that it provides due to a faster
> > > > reclaim restart after a CPU hotunplug-hotplug sequence - all that the
> > > > hotplug code needs to do is to check if acomp_ctx->nr_reqs is non-0,
> and
> > > > if so, set __online to true and return, and reclaim can proceed.
> > >
> > > I like the idea of allocating the resources on memory hotplug but
> > > leaving them allocated until the pool is torn down. It avoids allocating
> > > unnecessary memory if some CPUs are never onlined, but it simplifies
> > > things because we don't have to synchronize against the resources being
> > > freed in CPU offline.
> > >
> > > The only case that would suffer from this AFAICT is if someone onlines
> > > many CPUs, uses them once, and then offline them and not use them
> again.
> > > I am not familiar with CPU hotplug use cases so I can't tell if that's
> > > something people do, but I am inclined to agree with this
> > > simplification.
> >
> > Thanks Yosry, for your code review comments! Good to know that this
> > simplification is acceptable.
> >
> > >
> > > >
> > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > > > ---
> > > > mm/zswap.c | 273 +++++++++++++++++++++++++++++++++++----------
> ----
> > > ----
> > > > 1 file changed, 182 insertions(+), 91 deletions(-)
> > > >
> > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > index 10f2a16e7586..cff96df1df8b 100644
> > > > --- a/mm/zswap.c
> > > > +++ b/mm/zswap.c
> > > > @@ -144,10 +144,12 @@ bool zswap_never_enabled(void)
> > > > struct crypto_acomp_ctx {
> > > > struct crypto_acomp *acomp;
> > > > struct acomp_req *req;
> > > > - struct crypto_wait wait;
> > >
> > > Is there a reason for moving this? If not please avoid unrelated changes.
> >
> > The reason is so that req/buffer, and reqs/buffers with batching, go together
> > logically, hence I found this easier to understand. I can restore this to the
> > original order, if that's preferable.
>
> I see. In that case, this fits better in the patch that actually adds
> support for having multiple requests and buffers, and please call it out
> explicitly in the commit message.
Thanks Yosry, for the follow up comments. Sure, this makes sense.
>
> >
> > >
> > > > u8 *buffer;
> > > > + u8 nr_reqs;
> > > > + struct crypto_wait wait;
> > > > struct mutex mutex;
> > > > bool is_sleepable;
> > > > + bool __online;
> > >
> > > I don't believe we need this.
> > >
> > > If we are not freeing resources during CPU offlining, then we do not
> > > need a CPU offline callback and acomp_ctx->__online serves no purpose.
> > >
> > > The whole point of synchronizing between offlining and
> > > compress/decompress operations is to avoid UAF. If offlining does not
> > > free resources, then we can hold the mutex directly in the
> > > compress/decompress path and drop the hotunplug callback completely.
> > >
> > > I also believe nr_reqs can be dropped from this patch, as it seems like
> > > it's only used to know when to set __online.
> >
> > All great points! In fact, that was the original solution I had implemented
> > (not having an offline callback). But then, I spent some time understanding
> > the v6.13 hotfix for synchronizing freeing of resources, and this comment
> > in zswap_cpu_comp_prepare():
> >
> > /*
> > * Only hold the mutex after completing allocations, otherwise we
> may
> > * recurse into zswap through reclaim and attempt to hold the mutex
> > * again resulting in a deadlock.
> > */
> >
> > Hence, I figured the constraint of "recurse into zswap through reclaim" was
> > something to comprehend in the simplification (even though I had a tough
> > time imagining how this could happen).
>
> The constraint here is about zswap_cpu_comp_prepare() holding the mutex,
> making an allocation which internally triggers reclaim, then recursing
> into zswap and trying to hold the same mutex again causing a deadlock.
>
> If zswap_cpu_comp_prepare() does not need to hold the mutex to begin
> with, the constraint naturally goes away.
Actually, if it is possible for the allocations in zswap_cpu_comp_prepare()
to trigger reclaim, then I believe we need some way for reclaim to know
whether the acomp_ctx resources are available. Hence, there seems to be a
potential deadlock regardless of the mutex.
I verified that all the zswap_cpu_comp_prepare() allocations are done with
GFP_KERNEL, which implicitly allows direct reclaim. So this appears to be a
deadlock risk between zswap_compress() and zswap_cpu_comp_prepare()
in general, i.e., aside from this patchset.
I can think of the following options to resolve this, and would welcome
other suggestions:
1) Less intrusive: acomp_ctx_get_cpu_lock() gets the mutex, checks whether
acomp_ctx->__online is true, and if so, returns the acomp_ctx with the
mutex held. If acomp_ctx->__online is false, it returns NULL. In other
words, we don't have the for loop. (A rough sketch follows option 2 below.)
- This will cause recursions into direct reclaim from zswap_cpu_comp_prepare()
to fail, and hence cpuhotplug to fail. However, there is no deadlock.
- zswap_compress() will need to detect a NULL return from
acomp_ctx_get_cpu_lock(), and return an error.
- zswap_decompress() will need a BUG_ON(!acomp_ctx) after calling
acomp_ctx_get_cpu_lock().
- We won't be migrated to a different CPU because we hold the mutex, hence
zswap_cpu_comp_dead() will wait on the mutex.
2) More intrusive: We would need to use a gfp_t that prevents direct reclaim
and kswapd, i.e., something similar to GFP_TRANSHUGE_LIGHT in gfp_types.h,
but for non-THP allocations. If we decide to adopt this approach, we would
need changes in include/crypto/acompress.h, crypto/api.c, and crypto/acompress.c
to allow crypto_create_tfm_node() to call crypto_alloc_tfmmem() with this
new gfp_t, in lieu of GFP_KERNEL.
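A rough sketch of option 1 (using the __online flag from this patch;
illustrative only, not final code):

	static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
	{
		struct crypto_acomp_ctx *acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);

		mutex_lock(&acomp_ctx->mutex);

		/* Onlining has not finished allocating resources: let the
		 * caller fail the store / handle the error, no retry loop. */
		if (!acomp_ctx->__online) {
			mutex_unlock(&acomp_ctx->mutex);
			return NULL;
		}

		return acomp_ctx;
	}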
>
> >
> > Hence, I added the "bool __online" because zswap_cpu_comp_prepare()
> > does not acquire the mutex lock while allocating resources. We have
> already
> > initialized the mutex, so in theory, it is possible for compress/decompress
> > to acquire the mutex lock. The __online acts as a way to indicate whether
> > compress/decompress can proceed reliably to use the resources.
>
> For compress/decompress to acquire the mutex they need to run on that
> CPU, and I don't think that's possible before onlining completes, so
> zswap_cpu_comp_prepare() must have already completed before
> compress/decompress can use that CPU IIUC.
If we can make this assumption, that would be great! However, I am not
totally sure because of the GFP_KERNEL allocations in
zswap_cpu_comp_prepare().
>
> >
> > The "nr_reqs" was needed as a way to distinguish between initial and
> > subsequent calls into zswap_cpu_comp_prepare(), e.g., on a CPU that
> > goes through an online-offline-online sequence. In the initial onlining,
> > we need to allocate resources because nr_reqs=0. If resources are to
> > be allocated, we set acomp_ctx->nr_reqs and proceed to allocate
> > reqs/buffers/etc. In the subsequent onlining, we can quickly inspect
> > nr_reqs as being greater than 0 and return, thus avoiding any latency
> > delays before reclaim/page-faults can be handled on that CPU.
> >
> > Please let me know if this rationale seems reasonable for why
> > __online and nr_reqs were introduced.
>
> Based on what I said, I still don't believe they are needed, but please
> correct me if I am wrong.
Same comments as above.
>
> [..]
> > > I also see some ordering changes inside the function (e.g. we now
> > > allocate the request before the buffer). Not sure if these are
> > > intentional. If not, please keep the diff to the required changes only.
> >
> > The reason for this was, I am trying to organize the allocations based
> > on dependencies. Unless requests are allocated, there is no point in
> > allocating buffers. Please let me know if this is Ok.
>
> Please separate refactoring changes from functional changes in general,
> because mixing them makes code review harder.
Sure, I will do so.
>
> In this specific instance, I think moving the code is probably not worth
> it, as there's also no point in allocating requests if we cannot
> allocate buffers. In fact, since the buffers are larger, in theory their
> allocation is more likely to fail, so it makes sense to do it first.
Understood, makes better sense than allocating the requests first.
>
> Anyway, please propose such refactoring changes separately and they can
> be discussed as such.
Ok.
>
> [..]
> > > > +static void zswap_cpu_comp_dealloc(unsigned int cpu, struct
> hlist_node
> > > *node)
> > > > +{
> > > > + struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> > > node);
> > > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> > > >acomp_ctx, cpu);
> > > > +
> > > > + /*
> > > > + * The lifetime of acomp_ctx resources is from pool creation to
> > > > + * pool deletion.
> > > > + *
> > > > + * Reclaims should not be happening because, we get to this routine
> > > only
> > > > + * in two scenarios:
> > > > + *
> > > > + * 1) pool creation failures before/during the pool ref initialization.
> > > > + * 2) we are in the process of releasing the pool, it is off the
> > > > + * zswap_pools list and has no references.
> > > > + *
> > > > + * Hence, there is no need for locks.
> > > > + */
> > > > + acomp_ctx->__online = false;
> > > > + acomp_ctx_dealloc(acomp_ctx);
> > >
> > > Since __online can be dropped, we can probably drop
> > > zswap_cpu_comp_dealloc() and call acomp_ctx_dealloc() directly?
> >
> > I suppose there is value in having a way in zswap to know for sure, that
> > resource allocation has completed, and it is safe for compress/decompress
> > to proceed. Especially because the mutex has been initialized before we
> > get to resource allocation. Would you agree?
>
> As I mentioned above, I believe compress/decompress cannot run on a CPU
> before the onlining completes. Please correct me if I am wrong.
>
> >
> > >
> > > > +}
> > > > +
> > > > static struct zswap_pool *zswap_pool_create(char *type, char
> > > *compressor)
> > > > {
> > > > struct zswap_pool *pool;
> > > > @@ -285,13 +403,21 @@ static struct zswap_pool
> > > *zswap_pool_create(char *type, char *compressor)
> > > > goto error;
> > > > }
> > > >
> > > > - for_each_possible_cpu(cpu)
> > > > - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
> > > > + for_each_possible_cpu(cpu) {
> > > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> > > >acomp_ctx, cpu);
> > > > +
> > > > + acomp_ctx->acomp = NULL;
> > > > + acomp_ctx->req = NULL;
> > > > + acomp_ctx->buffer = NULL;
> > > > + acomp_ctx->__online = false;
> > > > + acomp_ctx->nr_reqs = 0;
> > >
> > > Why is this needed? Wouldn't zswap_cpu_comp_prepare() initialize them
> > > right away?
> >
> > Yes, I figured this is needed for two reasons:
> >
> > 1) For the error handling in zswap_cpu_comp_prepare() and calls into
> > zswap_cpu_comp_dealloc() to be handled by the common procedure
> > "acomp_ctx_dealloc()" unambiguously.
>
> This makes sense. When you move the refactoring to create
> acomp_ctx_dealloc() to a separate patch, please include this change in
> it and call it out explicitly in the commit message.
Sure.
>
> > 2) The second scenario I thought of that would need this, is let's say
> > the zswap compressor is switched immediately after setting the
> > compressor. Some cores have executed the onlining code and
> > some haven't. Because there are no pool refs held,
> > zswap_cpu_comp_dealloc() would be called per-CPU. Hence, I figured
> > it would help to initialize these acomp_ctx members before the
> > hand-off to "cpuhp_state_add_instance()" in zswap_pool_create().
>
> I believe cpuhp_state_add_instance() calls the onlining function
> synchronously on all present CPUs, so I don't think it's possible to end
> up in a state where the pool is being destroyed and some CPU executed
> zswap_cpu_comp_prepare() while others haven't.
I looked at the cpuhotplug code some more. The startup callback is
invoked sequentially for_each_present_cpu(). If an error occurs for any
one of them, it calls the teardown callback only on the earlier cores that
have already finished running the startup callback. However,
zswap_cpu_comp_dealloc() will be called for all cores, even the ones
for which the startup callback was not run. Hence, I believe the
zero initialization is useful, albeit done via alloc_percpu_gfp() with
__GFP_ZERO when allocating the acomp_ctx.
>
> That being said, this made me think of a different problem. If pool
> destruction races with CPU onlining, there could be a race between
> zswap_cpu_comp_prepare() allocating resources and
> zswap_cpu_comp_dealloc() (or acomp_ctx_dealloc()) freeing them.
>
> I believe we must always call cpuhp_state_remove_instance() *before*
> freeing the resources to prevent this race from happening. This needs to
> be documented with a comment.
Yes, this race condition is possible, thanks for catching this! The problem with
calling cpuhp_state_remove_instance() before freeing the resources is that
cpuhp_state_add_instance() and cpuhp_state_remove_instance() both
acquire cpuhp_state_mutex at the beginning, and hence are serialized.
For the reasons motivating why acomp_ctx->__online is set to false in
zswap_cpu_comp_dead(), I cannot call cpuhp_state_remove_instance()
before calling acomp_ctx_dealloc(), because the latter could wait for
acomp_ctx->__online to be true before deleting the resources. I will
think about this some more.
Another possibility is to not rely on cpuhotplug in zswap, and instead
manage the per-cpu acomp_ctx resource allocation entirely in
zswap_pool_create(), and deletion entirely in zswap_pool_destroy(),
along with the necessary error handling. Let me think about this some
more as well.
>
> Let me know if I missed something.
>
> >
> > Please let me know if these are valid considerations.
> >
> > >
> > > If it is in fact needed we should probably just use __GFP_ZERO.
> >
> > Sure. Are you suggesting I use "alloc_percpu_gfp()" instead of
> "alloc_percpu()"
> > for the acomp_ctx?
>
> Yeah if we need to initialize all/most fields to 0 let's use
> alloc_percpu_gfp() and pass GFP_KERNEL | __GFP_ZERO.
Sounds good.
>
> [..]
> > > > @@ -902,16 +957,52 @@ static struct crypto_acomp_ctx
> > > *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
> > > >
> > > > for (;;) {
> > > > acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> > > > - mutex_lock(&acomp_ctx->mutex);
> > > > - if (likely(acomp_ctx->req))
> > > > - return acomp_ctx;
> > > > /*
> > > > - * It is possible that we were migrated to a different CPU
> > > after
> > > > - * getting the per-CPU ctx but before the mutex was
> > > acquired. If
> > > > - * the old CPU got offlined, zswap_cpu_comp_dead() could
> > > have
> > > > - * already freed ctx->req (among other things) and set it to
> > > > - * NULL. Just try again on the new CPU that we ended up on.
> > > > + * If the CPU onlining code successfully allocates acomp_ctx
> > > resources,
> > > > + * it sets acomp_ctx->__online to true. Until this happens, we
> > > have
> > > > + * two options:
> > > > + *
> > > > + * 1. Return NULL and fail all stores on this CPU.
> > > > + * 2. Retry, until onlining has finished allocating resources.
> > > > + *
> > > > + * In theory, option 1 could be more appropriate, because it
> > > > + * allows the calling procedure to decide how it wants to
> > > handle
> > > > + * reclaim racing with CPU hotplug. For instance, it might be
> > > Ok
> > > > + * for compress to return an error for the backing swap device
> > > > + * to store the folio. Decompress could wait until we get a
> > > > + * valid and locked mutex after onlining has completed. For
> > > now,
> > > > + * we go with option 2 because adding a do-while in
> > > > + * zswap_decompress() adds latency for software
> > > compressors.
> > > > + *
> > > > + * Once initialized, the resources will be de-allocated only
> > > > + * when the pool is destroyed. The acomp_ctx will hold on to
> > > the
> > > > + * resources through CPU offlining/onlining at any time until
> > > > + * the pool is destroyed.
> > > > + *
> > > > + * This prevents races/deadlocks between reclaim and CPU
> > > acomp_ctx
> > > > + * resource allocation that are a dependency for reclaim.
> > > > + * It further simplifies the interaction with CPU onlining and
> > > > + * offlining:
> > > > + *
> > > > + * - CPU onlining does not take the mutex. It only allocates
> > > > + * resources and sets __online to true.
> > > > + * - CPU offlining acquires the mutex before setting
> > > > + * __online to false. If reclaim has acquired the mutex,
> > > > + * offlining will have to wait for reclaim to complete before
> > > > + * hotunplug can proceed. Further, hotplug merely sets
> > > > + * __online to false. It does not delete the acomp_ctx
> > > > + * resources.
> > > > + *
> > > > + * Option 1 is better than potentially not exiting the earlier
> > > > + * for (;;) loop because the system is running low on memory
> > > > + * and/or CPUs are getting offlined for whatever reason. At
> > > > + * least failing this store will prevent data loss by failing
> > > > + * zswap_store(), and saving the data in the backing swap
> > > device.
> > > > */
> > >
> > > > I believe this can be dropped. I don't think we can have any store/load
> > > operations on a CPU before it's fully onlined, and we should always have
> > > a reference on the pool here, so the resources cannot go away.
> > >
> > > So unless I missed something we can drop this completely now and just
> > > hold the mutex directly in the load/store paths.
> >
> > Based on the above explanations, please let me know if it is a good idea
> > to keep the __online, or if you think further simplification is possible.
>
> I still think it's not needed. Let me know if I missed anything.
Let me think some more about whether it is feasible to not have cpuhotplug
manage the acomp_ctx resource allocation, and instead have this be done
through the pool creation/deletion routines.
Thanks,
Kanchana
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage.
2025-03-08 2:47 ` Sridhar, Kanchana P
@ 2025-03-10 17:31 ` Yosry Ahmed
2025-03-17 21:15 ` Sridhar, Kanchana P
0 siblings, 1 reply; 44+ messages in thread
From: Yosry Ahmed @ 2025-03-10 17:31 UTC (permalink / raw)
To: Sridhar, Kanchana P
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh
On Sat, Mar 08, 2025 at 02:47:15AM +0000, Sridhar, Kanchana P wrote:
>
[..]
> > > > > u8 *buffer;
> > > > > + u8 nr_reqs;
> > > > > + struct crypto_wait wait;
> > > > > struct mutex mutex;
> > > > > bool is_sleepable;
> > > > > + bool __online;
> > > >
> > > > I don't believe we need this.
> > > >
> > > > If we are not freeing resources during CPU offlining, then we do not
> > > > need a CPU offline callback and acomp_ctx->__online serves no purpose.
> > > >
> > > > The whole point of synchronizing between offlining and
> > > > compress/decompress operations is to avoid UAF. If offlining does not
> > > > free resources, then we can hold the mutex directly in the
> > > > compress/decompress path and drop the hotunplug callback completely.
> > > >
> > > > I also believe nr_reqs can be dropped from this patch, as it seems like
> > > > it's only used to know when to set __online.
> > >
> > > All great points! In fact, that was the original solution I had implemented
> > > (not having an offline callback). But then, I spent some time understanding
> > > the v6.13 hotfix for synchronizing freeing of resources, and this comment
> > > in zswap_cpu_comp_prepare():
> > >
> > > /*
> > > * Only hold the mutex after completing allocations, otherwise we
> > may
> > > * recurse into zswap through reclaim and attempt to hold the mutex
> > > * again resulting in a deadlock.
> > > */
> > >
> > > Hence, I figured the constraint of "recurse into zswap through reclaim" was
> > > something to comprehend in the simplification (even though I had a tough
> > > time imagining how this could happen).
> >
> > The constraint here is about zswap_cpu_comp_prepare() holding the mutex,
> > making an allocation which internally triggers reclaim, then recursing
> > into zswap and trying to hold the same mutex again causing a deadlock.
> >
> > If zswap_cpu_comp_prepare() does not need to hold the mutex to begin
> > with, the constraint naturally goes away.
>
> Actually, if it is possible for the allocations in zswap_cpu_comp_prepare()
> to trigger reclaim, then I believe we need some way for reclaim to know if
> the acomp_ctx resources are available. Hence, this seems like a potential
> for deadlock regardless of the mutex.
I took a closer look and I believe my hotfix was actually unnecessary. I
sent it out in response to a syzbot report, but on closer inspection it
seems it was not an actual problem. Sorry if my patch confused you.
Looking at enum cpuhp_state in include/linux/cpuhotplug.h, it seems like
CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section. The comment above
says:
* PREPARE: The callbacks are invoked on a control CPU before the
* hotplugged CPU is started up or after the hotplugged CPU has died.
So even if we go into reclaim during zswap_cpu_comp_prepare(), it will
never be on the CPU that we are allocating resources for.
The other case where zswap_cpu_comp_prepare() could race with
compression/decompression is when a pool is being created. In this case,
reclaim from zswap_cpu_comp_prepare() can recurse into zswap on the same
CPU AFAICT. However, because the pool is still under creation, it will
not be used (i.e. zswap_pool_current_get() won't find it).
So I think we don't need to worry about zswap_cpu_comp_prepare() racing
with compression or decompression for the same pool and CPU.
>
> I verified that all the zswap_cpu_comp_prepare() allocations are done with
> GFP_KERNEL, which implicitly allows direct reclaim. So this appears to be a
> risk for deadlock between zswap_compress() and zswap_cpu_comp_prepare()
> in general, i.e., aside from this patchset.
>
> I can think of the following options to resolve this, and would welcome
> other suggestions:
>
> 1) Less intrusive: acomp_ctx_get_cpu_lock() should get the mutex, check
> if acomp_ctx->__online is true, and if so, return the mutex. If
> acomp_ctx->__online is false, then it returns NULL. In other words, we
> don't have the for loop.
> - This will cause recursions into direct reclaim from zswap_cpu_comp_prepare()
> to fail, cpuhotplug to fail. However, there is no deadlock.
> - zswap_compress() will need to detect NULL returned by
> acomp_ctx_get_cpu_lock(), and return an error.
> - zswap_decompress() will need a BUG_ON(!acomp_ctx) after calling
> acomp_ctx_get_cpu_lock().
> - We won't be migrated to a different CPU because we hold the mutex, hence
> zswap_cpu_comp_dead() will wait on the mutex.
>
> 2) More intrusive: We would need to use a gfp_t that prevents direct reclaim
> and kswapd, i.e., something similar to GFP_TRANSHUGE_LIGHT in gfp_types.h,
> but for non-THP allocations. If we decide to adopt this approach, we would
> need changes in include/crypto/acompress.h, crypto/api.c, and crypto/acompress.c
> to allow crypto_create_tfm_node() to call crypto_alloc_tfmmem() with this
> new gfp_t, in lieu of GFP_KERNEL.
>
> >
> > >
> > > Hence, I added the "bool __online" because zswap_cpu_comp_prepare()
> > > does not acquire the mutex lock while allocating resources. We have
> > already
> > > initialized the mutex, so in theory, it is possible for compress/decompress
> > > to acquire the mutex lock. The __online acts as a way to indicate whether
> > > compress/decompress can proceed reliably to use the resources.
> >
> > For compress/decompress to acquire the mutex they need to run on that
> > CPU, and I don't think that's possible before onlining completes, so
> > zswap_cpu_comp_prepare() must have already completed before
> > compress/decompress can use that CPU IIUC.
>
> If we can make this assumption, that would be great! However, I am not
> totally sure because of the GFP_KERNEL allocations in
> zswap_cpu_comp_prepare().
As I mentioned above, when zswap_cpu_comp_prepare() is run we are in one
of two situations:
- The pool is under creation, so we cannot race with stores/loads from
that same pool.
- The CPU is being onlined, in which case zswap_cpu_comp_prepare() is
called from a control CPU before tasks start running on the CPU being
onlined.
Please correct me if I am wrong.
[..]
> > > > > @@ -285,13 +403,21 @@ static struct zswap_pool
> > > > *zswap_pool_create(char *type, char *compressor)
> > > > > goto error;
> > > > > }
> > > > >
> > > > > - for_each_possible_cpu(cpu)
> > > > > - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
> > > > > + for_each_possible_cpu(cpu) {
> > > > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> > > > >acomp_ctx, cpu);
> > > > > +
> > > > > + acomp_ctx->acomp = NULL;
> > > > > + acomp_ctx->req = NULL;
> > > > > + acomp_ctx->buffer = NULL;
> > > > > + acomp_ctx->__online = false;
> > > > > + acomp_ctx->nr_reqs = 0;
> > > >
> > > > Why is this needed? Wouldn't zswap_cpu_comp_prepare() initialize them
> > > > right away?
> > >
> > > Yes, I figured this is needed for two reasons:
> > >
> > > 1) For the error handling in zswap_cpu_comp_prepare() and calls into
> > > zswap_cpu_comp_dealloc() to be handled by the common procedure
> > > "acomp_ctx_dealloc()" unambiguously.
> >
> > This makes sense. When you move the refactoring to create
> > acomp_ctx_dealloc() to a separate patch, please include this change in
> > it and call it out explicitly in the commit message.
>
> Sure.
>
> >
> > > 2) The second scenario I thought of that would need this, is let's say
> > > the zswap compressor is switched immediately after setting the
> > > compressor. Some cores have executed the onlining code and
> > > some haven't. Because there are no pool refs held,
> > > zswap_cpu_comp_dealloc() would be called per-CPU. Hence, I figured
> > > it would help to initialize these acomp_ctx members before the
> > > hand-off to "cpuhp_state_add_instance()" in zswap_pool_create().
> >
> > I believe cpuhp_state_add_instance() calls the onlining function
> > synchronously on all present CPUs, so I don't think it's possible to end
> > up in a state where the pool is being destroyed and some CPU executed
> > zswap_cpu_comp_prepare() while others haven't.
>
> I looked at the cpuhotplug code some more. The startup callback is
> invoked sequentially for_each_present_cpu(). If an error occurs for any
> one of them, it calls the teardown callback only on the earlier cores that
> have already finished running the startup callback. However,
> zswap_cpu_comp_dealloc() will be called for all cores, even the ones
> for which the startup callback was not run. Hence, I believe the
> zero initialization is useful, albeit using alloc_percpu_gfp(__GFP_ZERO)
> to allocate the acomp_ctx.
Yeah this is point (1) above IIUC, and I agree about zero initialization
for that.
>
> >
> > That being said, this made me think of a different problem. If pool
> > destruction races with CPU onlining, there could be a race between
> > zswap_cpu_comp_prepare() allocating resources and
> > zswap_cpu_comp_dealloc() (or acomp_ctx_dealloc()) freeing them.
> >
> > I believe we must always call cpuhp_state_remove_instance() *before*
> > freeing the resources to prevent this race from happening. This needs to
> > be documented with a comment.
>
> Yes, this race condition is possible, thanks for catching this! The problem with
> calling cpuhp_state_remove_instance() before freeing the resources is that
> cpuhp_state_add_instance() and cpuhp_state_remove_instance() both
> acquire a "mutex_lock(&cpuhp_state_mutex);" at the beginning; and hence
> are serialized.
>
> For the reasons motivating why acomp_ctx->__online is set to false in
> zswap_cpu_comp_dead(), I cannot call cpuhp_state_remove_instance()
> before calling acomp_ctx_dealloc() because the latter could wait until
> acomp_ctx->__online to be true before deleting the resources. I will
> think about this some more.
I believe this problem goes away with acomp_ctx->__online going away,
right?
>
> Another possibility is to not rely on cpuhotplug in zswap, and instead
> manage the per-cpu acomp_ctx resource allocation entirely in
> zswap_pool_create(), and deletion entirely in zswap_pool_destroy(),
> along with the necessary error handling. Let me think about this some
> more as well.
>
> >
> > Let me know if I missed something.
> >
> > >
> > > Please let me know if these are valid considerations.
> > >
> > > >
> > > > If it is in fact needed we should probably just use __GFP_ZERO.
> > >
> > > Sure. Are you suggesting I use "alloc_percpu_gfp()" instead of
> > "alloc_percpu()"
> > > for the acomp_ctx?
> >
> > Yeah if we need to initialize all/most fields to 0 let's use
> > alloc_percpu_gfp() and pass GFP_KERNEL | __GFP_ZERO.
>
> Sounds good.
>
> >
> > [..]
> > > > > @@ -902,16 +957,52 @@ static struct crypto_acomp_ctx
> > > > *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
> > > > >
> > > > > for (;;) {
> > > > > acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> > > > > - mutex_lock(&acomp_ctx->mutex);
> > > > > - if (likely(acomp_ctx->req))
> > > > > - return acomp_ctx;
> > > > > /*
> > > > > - * It is possible that we were migrated to a different CPU
> > > > after
> > > > > - * getting the per-CPU ctx but before the mutex was
> > > > acquired. If
> > > > > - * the old CPU got offlined, zswap_cpu_comp_dead() could
> > > > have
> > > > > - * already freed ctx->req (among other things) and set it to
> > > > > - * NULL. Just try again on the new CPU that we ended up on.
> > > > > + * If the CPU onlining code successfully allocates acomp_ctx
> > > > resources,
> > > > > + * it sets acomp_ctx->__online to true. Until this happens, we
> > > > have
> > > > > + * two options:
> > > > > + *
> > > > > + * 1. Return NULL and fail all stores on this CPU.
> > > > > + * 2. Retry, until onlining has finished allocating resources.
> > > > > + *
> > > > > + * In theory, option 1 could be more appropriate, because it
> > > > > + * allows the calling procedure to decide how it wants to
> > > > handle
> > > > > + * reclaim racing with CPU hotplug. For instance, it might be
> > > > Ok
> > > > > + * for compress to return an error for the backing swap device
> > > > > + * to store the folio. Decompress could wait until we get a
> > > > > + * valid and locked mutex after onlining has completed. For
> > > > now,
> > > > > + * we go with option 2 because adding a do-while in
> > > > > + * zswap_decompress() adds latency for software
> > > > compressors.
> > > > > + *
> > > > > + * Once initialized, the resources will be de-allocated only
> > > > > + * when the pool is destroyed. The acomp_ctx will hold on to
> > > > the
> > > > > + * resources through CPU offlining/onlining at any time until
> > > > > + * the pool is destroyed.
> > > > > + *
> > > > > + * This prevents races/deadlocks between reclaim and CPU
> > > > acomp_ctx
> > > > > + * resource allocation that are a dependency for reclaim.
> > > > > + * It further simplifies the interaction with CPU onlining and
> > > > > + * offlining:
> > > > > + *
> > > > > + * - CPU onlining does not take the mutex. It only allocates
> > > > > + * resources and sets __online to true.
> > > > > + * - CPU offlining acquires the mutex before setting
> > > > > + * __online to false. If reclaim has acquired the mutex,
> > > > > + * offlining will have to wait for reclaim to complete before
> > > > > + * hotunplug can proceed. Further, hotplug merely sets
> > > > > + * __online to false. It does not delete the acomp_ctx
> > > > > + * resources.
> > > > > + *
> > > > > + * Option 1 is better than potentially not exiting the earlier
> > > > > + * for (;;) loop because the system is running low on memory
> > > > > + * and/or CPUs are getting offlined for whatever reason. At
> > > > > + * least failing this store will prevent data loss by failing
> > > > > + * zswap_store(), and saving the data in the backing swap
> > > > device.
> > > > > */
> > > >
> > > > > I believe this can be dropped. I don't think we can have any store/load
> > > > operations on a CPU before it's fully onlined, and we should always have
> > > > a reference on the pool here, so the resources cannot go away.
> > > >
> > > > So unless I missed something we can drop this completely now and just
> > > > hold the mutex directly in the load/store paths.
> > >
> > > Based on the above explanations, please let me know if it is a good idea
> > > to keep the __online, or if you think further simplification is possible.
> >
> > I still think it's not needed. Let me know if I missed anything.
>
> Let me think some more about whether it is feasible to not have cpuhotplug
> manage the acomp_ctx resource allocation, and instead have this be done
> through the pool creation/deletion routines.
I don't think this is necessary, see my comments above.
^ permalink raw reply [flat|nested] 44+ messages in thread
* RE: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage.
2025-03-10 17:31 ` Yosry Ahmed
@ 2025-03-17 21:15 ` Sridhar, Kanchana P
2025-03-17 23:25 ` Herbert Xu
2025-03-18 14:23 ` Yosry Ahmed
0 siblings, 2 replies; 44+ messages in thread
From: Sridhar, Kanchana P @ 2025-03-17 21:15 UTC (permalink / raw)
To: Yosry Ahmed
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
Kanchana P
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Monday, March 10, 2025 10:31 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource
> allocation/deletion and mutex lock usage.
>
> On Sat, Mar 08, 2025 at 02:47:15AM +0000, Sridhar, Kanchana P wrote:
> >
> [..]
> > > > > > u8 *buffer;
> > > > > > + u8 nr_reqs;
> > > > > > + struct crypto_wait wait;
> > > > > > struct mutex mutex;
> > > > > > bool is_sleepable;
> > > > > > + bool __online;
> > > > >
> > > > > I don't believe we need this.
> > > > >
> > > > > If we are not freeing resources during CPU offlining, then we do not
> > > > > need a CPU offline callback and acomp_ctx->__online serves no
> purpose.
> > > > >
> > > > > The whole point of synchronizing between offlining and
> > > > > compress/decompress operations is to avoid UAF. If offlining does not
> > > > > free resources, then we can hold the mutex directly in the
> > > > > compress/decompress path and drop the hotunplug callback
> completely.
> > > > >
> > > > > I also believe nr_reqs can be dropped from this patch, as it seems like
> > > > > it's only used to know when to set __online.
> > > >
> > > > All great points! In fact, that was the original solution I had implemented
> > > > (not having an offline callback). But then, I spent some time
> understanding
> > > > the v6.13 hotfix for synchronizing freeing of resources, and this comment
> > > > in zswap_cpu_comp_prepare():
> > > >
> > > > /*
> > > > * Only hold the mutex after completing allocations, otherwise we
> > > may
> > > > * recurse into zswap through reclaim and attempt to hold the mutex
> > > > * again resulting in a deadlock.
> > > > */
> > > >
> > > > Hence, I figured the constraint of "recurse into zswap through reclaim"
> was
> > > > something to comprehend in the simplification (even though I had a
> tough
> > > > time imagining how this could happen).
> > >
> > > The constraint here is about zswap_cpu_comp_prepare() holding the
> mutex,
> > > making an allocation which internally triggers reclaim, then recursing
> > > into zswap and trying to hold the same mutex again causing a deadlock.
> > >
> > > If zswap_cpu_comp_prepare() does not need to hold the mutex to begin
> > > with, the constraint naturally goes away.
> >
> > Actually, if it is possible for the allocations in zswap_cpu_comp_prepare()
> > to trigger reclaim, then I believe we need some way for reclaim to know if
> > the acomp_ctx resources are available. Hence, this seems like a potential
> > for deadlock regardless of the mutex.
>
> I took a closer look and I believe my hotfix was actually unnecessary. I
> sent it out in response to a syzbot report, but upon closer look it
> seems like it was not an actual problem. Sorry if my patch confused you.
>
> Looking at enum cpuhp_state in include/linux/cpuhotplug.h, it seems like
> CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section. The comment
> above
> says:
>
> * PREPARE: The callbacks are invoked on a control CPU before the
> * hotplugged CPU is started up or after the hotplugged CPU has died.
>
> So even if we go into reclaim during zswap_cpu_comp_prepare(), it will
> never be on the CPU that we are allocating resources for.
>
> The other case where zswap_cpu_comp_prepare() could race with
> compression/decompression is when a pool is being created. In this case,
> reclaim from zswap_cpu_comp_prepare() can recurse into zswap on the
> same
> CPU AFAICT. However, because the pool is still under creation, it will
> not be used (i.e. zswap_pool_current_get() won't find it).
>
> So I think we don't need to worry about zswap_cpu_comp_prepare() racing
> with compression or decompression for the same pool and CPU.
Thanks Yosry, for this observation! You are right, when considered purely
from a CPU hotplug perspective, zswap_cpu_comp_prepare() and
zswap_cpu_comp_dead() in fact run on a control CPU, because the state is
registered in the PREPARE section of "enum cpuhp_state" in cpuhotplug.h.
The problem, however, is that in the current architecture, CPU onlining/
zswap_pool creation and CPU offlining/zswap_pool deletion have the
same semantics as far as these resources are concerned. Hence, although
zswap_cpu_comp_prepare() is run on a control CPU, the CPU for which
the "hotplug" code is called is in fact online. It is possible for the memory
allocation calls in zswap_cpu_comp_prepare() to trigger reclaim that
recurses into zswap_compress(), which now needs to be handled by the
current pool, since the new pool has not yet been added to zswap_pools,
as you pointed out.
The ref on the current pool has not yet been dropped. Could there be
a potential deadlock at pool transition time: the new pool is blocked
allocating acomp_ctx resources, which triggers reclaim, which the old
pool then needs to handle?
I see other places in the kernel that use CPU hotplug for resource allocation,
outside the context of CPU onlining. IIUC, it is difficult to guarantee that
the startup/teardown callbacks only ever modify acomp_ctx resources for a
CPU that is not functional.
Now that I think about it, the only real constraint is that the acomp_ctx
resources are guaranteed to exist for a functional CPU which can run zswap
compress/decompress.
I think we can simplify this as follows (a rough sketch follows the list),
and would welcome suggestions to improve the proposed solution:
1) We dis-associate the acomp_ctx from the pool, and instead, have this
be a global percpu zswap resource that gets allocated once in zswap_setup(),
just like the zswap_entry_cache.
2) The acomp_ctx resources will get allocated during zswap_setup(), using
cpuhp_setup_state_multi() to register zswap_cpu_comp_prepare() as the
startup callback, with no teardown callback.
3) We call cpuhp_state_add_instance() for_each_possible_cpu(cpu) in
zswap_setup().
4) The acomp_ctx resources persist through subsequent "real CPU offline/online
state transitions".
5) zswap_[de]compress() can go ahead and lock the mutex, and use the
reqs/buffers without worrying about whether these resources are
initialized or still exist/are being deleted.
6) "struct zswap_pool" is now de-coupled from this global percpu zswap
acomp_ctx.
7) To address the issue of how many reqs/buffers to allocate, there could
potentially be a memory cost for non-batching compressors if we want
to always allocate ZSWAP_MAX_BATCH_SIZE acomp_reqs and buffers.
This would allow the acomp_ctx to seamlessly handle batching
compressors, non-batching compressors, and transitions between the
two compressor types in a general manner that relies only on
ZSWAP_MAX_BATCH_SIZE, which we define anyway.
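A rough sketch of points 1) through 3) (zswap_acomp_ctx and
zswap_acomp_node are hypothetical names; error handling and how many
reqs/buffers to allocate per CPU are left open):

	/* Global per-CPU acomp_ctx, decoupled from struct zswap_pool. */
	static struct crypto_acomp_ctx __percpu *zswap_acomp_ctx;
	static struct hlist_node zswap_acomp_node;

	/* In zswap_setup(): allocate the per-CPU contexts zero-initialized,
	 * and register only a startup callback, so the resources persist
	 * across CPU offline/online until zswap itself goes away. */
	zswap_acomp_ctx = alloc_percpu_gfp(struct crypto_acomp_ctx,
					   GFP_KERNEL | __GFP_ZERO);
	if (!zswap_acomp_ctx)
		goto fail;

	ret = cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE,
				      "mm/zswap_pool:prepare",
				      zswap_cpu_comp_prepare, NULL);
	if (!ret)
		ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
					       &zswap_acomp_node);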
I believe we can maximize the chances of success for the allocation of
the acomp_ctx resources if this is done in zswap_setup(), but please
correct me if I am wrong.
The added memory cost for platforms without IAA would be
~57 KB per cpu, on x86. Would this be acceptable?
If not, I don't believe this simplification would be worth it: allocating
one req/buffer and then dynamically adding more resources when a newly
selected compressor requires them would run into the same race conditions
and added checks as in acomp_ctx_get_cpu_lock(), which I believe are
necessary because CPU onlining/zswap_pool creation and CPU
offlining/zswap_pool deletion have the same semantics for these resources.
The only other fallback solution, in lieu of the proposed simplification, that
I can think of is to keep the lifespan of these resources from pool creation
to deletion, using the CPU hotplug code. However, it is not totally clear
to me whether there is a potential for deadlock during pool transitions, as
noted above.
Please do let me know your suggestions on how to proceed. If the
simplification and added memory cost are acceptable, I can implement this as
part of my v9 changes.
>
> >
> > I verified that all the zswap_cpu_comp_prepare() allocations are done with
> > GFP_KERNEL, which implicitly allows direct reclaim. So this appears to be a
> > risk for deadlock between zswap_compress() and
> zswap_cpu_comp_prepare()
> > in general, i.e., aside from this patchset.
> >
> > I can think of the following options to resolve this, and would welcome
> > other suggestions:
> >
> > 1) Less intrusive: acomp_ctx_get_cpu_lock() should get the mutex, check
> > if acomp_ctx->__online is true, and if so, return the mutex. If
> > acomp_ctx->__online is false, then it returns NULL. In other words, we
> > don't have the for loop.
> > - This will cause recursions into direct reclaim from
> zswap_cpu_comp_prepare()
> > to fail, cpuhotplug to fail. However, there is no deadlock.
> > - zswap_compress() will need to detect NULL returned by
> > acomp_ctx_get_cpu_lock(), and return an error.
> > - zswap_decompress() will need a BUG_ON(!acomp_ctx) after calling
> > acomp_ctx_get_cpu_lock().
> > - We won't be migrated to a different CPU because we hold the mutex,
> hence
> > zswap_cpu_comp_dead() will wait on the mutex.
> >
> > 2) More intrusive: We would need to use a gfp_t that prevents direct
> reclaim
> > and kswapd, i.e., something similar to GFP_TRANSHUGE_LIGHT in
> gfp_types.h,
> > but for non-THP allocations. If we decide to adopt this approach, we
> would
> > need changes in include/crypto/acompress.h, crypto/api.c, and
> crypto/acompress.c
> > to allow crypto_create_tfm_node() to call crypto_alloc_tfmmem() with
> this
> > new gfp_t, in lieu of GFP_KERNEL.
> >
> > >
> > > >
> > > > Hence, I added the "bool __online" because
> zswap_cpu_comp_prepare()
> > > > does not acquire the mutex lock while allocating resources. We have
> > > already
> > > > initialized the mutex, so in theory, it is possible for
> compress/decompress
> > > > to acquire the mutex lock. The __online acts as a way to indicate
> whether
> > > > compress/decompress can proceed reliably to use the resources.
> > >
> > > For compress/decompress to acquire the mutex they need to run on that
> > > CPU, and I don't think that's possible before onlining completes, so
> > > zswap_cpu_comp_prepare() must have already completed before
> > > compress/decompress can use that CPU IIUC.
> >
> > If we can make this assumption, that would be great! However, I am not
> > totally sure because of the GFP_KERNEL allocations in
> > zswap_cpu_comp_prepare().
>
> As I mentioned above, when zswap_cpu_comp_prepare() is run we are in one
> of two situations:
> - The pool is under creation, so we cannot race with stores/loads from
> that same pool.
> - The CPU is being onlined, in which case zswap_cpu_comp_prepare() is
> called from a control CPU before tasks start running on the CPU being
> onlined.
>
> Please correct me if I am wrong.
I agree and think we can simplify this even further.
>
> [..]
> > > > > > @@ -285,13 +403,21 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
> > > > > > goto error;
> > > > > > }
> > > > > >
> > > > > > - for_each_possible_cpu(cpu)
> > > > > > - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
> > > > > > + for_each_possible_cpu(cpu) {
> > > > > > + struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> > > > > > +
> > > > > > + acomp_ctx->acomp = NULL;
> > > > > > + acomp_ctx->req = NULL;
> > > > > > + acomp_ctx->buffer = NULL;
> > > > > > + acomp_ctx->__online = false;
> > > > > > + acomp_ctx->nr_reqs = 0;
> > > > >
> > > > > Why is this needed? Wouldn't zswap_cpu_comp_prepare() initialize
> them
> > > > > right away?
> > > >
> > > > Yes, I figured this is needed for two reasons:
> > > >
> > > > 1) For the error handling in zswap_cpu_comp_prepare() and calls into
> > > > zswap_cpu_comp_dealloc() to be handled by the common procedure
> > > > "acomp_ctx_dealloc()" unambiguously.
> > >
> > > This makes sense. When you move the refactoring to create
> > > acomp_ctx_dealloc() to a separate patch, please include this change in
> > > it and call it out explicitly in the commit message.
> >
> > Sure.
> >
> > >
> > > > 2) The second scenario I thought of that would need this, is let's say
> > > > the zswap compressor is switched immediately after setting the
> > > > compressor. Some cores have executed the onlining code and
> > > > some haven't. Because there are no pool refs held,
> > > > zswap_cpu_comp_dealloc() would be called per-CPU. Hence, I
> figured
> > > > it would help to initialize these acomp_ctx members before the
> > > > hand-off to "cpuhp_state_add_instance()" in zswap_pool_create().
> > >
> > > I believe cpuhp_state_add_instance() calls the onlining function
> > > synchronously on all present CPUs, so I don't think it's possible to end
> > > up in a state where the pool is being destroyed and some CPU executed
> > > zswap_cpu_comp_prepare() while others haven't.
> >
> > I looked at the cpuhotplug code some more. The startup callback is
> > invoked sequentially for_each_present_cpu(). If an error occurs for any
> > one of them, it calls the teardown callback only on the earlier cores that
> > have already finished running the startup callback. However,
> > zswap_cpu_comp_dealloc() will be called for all cores, even the ones
> > for which the startup callback was not run. Hence, I believe the
> > zero initialization is useful, albeit by using alloc_percpu_gfp() with
> > __GFP_ZERO to allocate the acomp_ctx.
>
> Yeah this is point (1) above IIUC, and I agree about zero initialization
> for that.
Thanks. I will re-examine whether zero initialization is required as part of the
simplification, if we think the latter makes sense.
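For reference, a minimal sketch of the zero-initialized per-CPU allocation being
discussed (illustrative only; it mirrors the existing allocation in
zswap_pool_create(), with __GFP_ZERO added):

	/*
	 * Sketch: allocate the per-CPU acomp_ctx zeroed, so error paths and a
	 * common acomp_ctx_dealloc() can rely on NULL/0 fields unambiguously.
	 */
	pool->acomp_ctx = alloc_percpu_gfp(*pool->acomp_ctx,
					   GFP_KERNEL | __GFP_ZERO);
	if (!pool->acomp_ctx)
		goto error;

	for_each_possible_cpu(cpu)
		mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);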
>
> >
> > >
> > > That being said, this made me think of a different problem. If pool
> > > destruction races with CPU onlining, there could be a race between
> > > zswap_cpu_comp_prepare() allocating resources and
> > > zswap_cpu_comp_dealloc() (or acomp_ctx_dealloc()) freeing them.
> > >
> > > I believe we must always call cpuhp_state_remove_instance() *before*
> > > freeing the resources to prevent this race from happening. This needs to
> > > be documented with a comment.
> >
> > Yes, this race condition is possible, thanks for catching this! The problem
> with
> > calling cpuhp_state_remove_instance() before freeing the resources is that
> > cpuhp_state_add_instance() and cpuhp_state_remove_instance() both
> > acquire a "mutex_lock(&cpuhp_state_mutex);" at the beginning; and hence
> > are serialized.
> >
> > For the reasons motivating why acomp_ctx->__online is set to false in
> > zswap_cpu_comp_dead(), I cannot call cpuhp_state_remove_instance()
> > before calling acomp_ctx_dealloc() because the latter could wait until
> > acomp_ctx->__online to be true before deleting the resources. I will
> > think about this some more.
>
> I believe this problem goes away with acomp_ctx->__online going away,
> right?
Yes, I don't think we need acomp_ctx->__online.
>
> >
> > Another possibility is to not rely on cpuhotplug in zswap, and instead
> > manage the per-cpu acomp_ctx resource allocation entirely in
> > zswap_pool_create(), and deletion entirely in zswap_pool_destroy(),
> > along with the necessary error handling. Let me think about this some
> > more as well.
> >
> > >
> > > Let me know if I missed something.
> > >
> > > >
> > > > Please let me know if these are valid considerations.
> > > >
> > > > >
> > > > > If it is in fact needed we should probably just use __GFP_ZERO.
> > > >
> > > > Sure. Are you suggesting I use "alloc_percpu_gfp()" instead of
> > > "alloc_percpu()"
> > > > for the acomp_ctx?
> > >
> > > Yeah if we need to initialize all/most fields to 0 let's use
> > > alloc_percpu_gfp() and pass GFP_KERNEL | __GFP_ZERO.
> >
> > Sounds good.
> >
> > >
> > > [..]
> > > > > > @@ -902,16 +957,52 @@ static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
> > > > > >
> > > > > > for (;;) {
> > > > > > acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> > > > > > - mutex_lock(&acomp_ctx->mutex);
> > > > > > - if (likely(acomp_ctx->req))
> > > > > > - return acomp_ctx;
> > > > > > /*
> > > > > > - * It is possible that we were migrated to a different CPU after
> > > > > > - * getting the per-CPU ctx but before the mutex was acquired. If
> > > > > > - * the old CPU got offlined, zswap_cpu_comp_dead() could have
> > > > > > - * already freed ctx->req (among other things) and set it to
> > > > > > - * NULL. Just try again on the new CPU that we ended up on.
> > > > > > + * If the CPU onlining code successfully allocates acomp_ctx resources,
> > > > > > + * it sets acomp_ctx->__online to true. Until this happens, we have
> > > > > > + * two options:
> > > > > > + *
> > > > > > + * 1. Return NULL and fail all stores on this CPU.
> > > > > > + * 2. Retry, until onlining has finished allocating resources.
> > > > > > + *
> > > > > > + * In theory, option 1 could be more appropriate, because it
> > > > > > + * allows the calling procedure to decide how it wants to handle
> > > > > > + * reclaim racing with CPU hotplug. For instance, it might be Ok
> > > > > > + * for compress to return an error for the backing swap device
> > > > > > + * to store the folio. Decompress could wait until we get a
> > > > > > + * valid and locked mutex after onlining has completed. For now,
> > > > > > + * we go with option 2 because adding a do-while in
> > > > > > + * zswap_decompress() adds latency for software compressors.
> > > > > > + *
> > > > > > + * Once initialized, the resources will be de-allocated only
> > > > > > + * when the pool is destroyed. The acomp_ctx will hold on to the
> > > > > > + * resources through CPU offlining/onlining at any time until
> > > > > > + * the pool is destroyed.
> > > > > > + *
> > > > > > + * This prevents races/deadlocks between reclaim and CPU acomp_ctx
> > > > > > + * resource allocation that are a dependency for reclaim.
> > > > > > + * It further simplifies the interaction with CPU onlining and
> > > > > > + * offlining:
> > > > > > + *
> > > > > > + * - CPU onlining does not take the mutex. It only allocates
> > > > > > + * resources and sets __online to true.
> > > > > > + * - CPU offlining acquires the mutex before setting
> > > > > > + * __online to false. If reclaim has acquired the mutex,
> > > > > > + * offlining will have to wait for reclaim to complete before
> > > > > > + * hotunplug can proceed. Further, hotplug merely sets
> > > > > > + * __online to false. It does not delete the acomp_ctx
> > > > > > + * resources.
> > > > > > + *
> > > > > > + * Option 1 is better than potentially not exiting the earlier
> > > > > > + * for (;;) loop because the system is running low on memory
> > > > > > + * and/or CPUs are getting offlined for whatever reason. At
> > > > > > + * least failing this store will prevent data loss by failing
> > > > > > + * zswap_store(), and saving the data in the backing swap device.
> > > > > > */
> > > > >
> > > > > I believe this can be dropped. I don't think we can have any store/load
> > > > > operations on a CPU before it's fully onlined, and we should always
> have
> > > > > a reference on the pool here, so the resources cannot go away.
> > > > >
> > > > > So unless I missed something we can drop this completely now and
> just
> > > > > hold the mutex directly in the load/store paths.
> > > >
> > > > Based on the above explanations, please let me know if it is a good idea
> > > > to keep the __online, or if you think further simplification is possible.
> > >
> > > I still think it's not needed. Let me know if I missed anything.
> >
> > Let me think some more about whether it is feasible to not have cpuhotplug
> > manage the acomp_ctx resource allocation, and instead have this be done
> > through the pool creation/deletion routines.
>
> I don't think this is necessary, see my comments above.
If the highlighted memory cost for non-batching compressors is acceptable,
I believe we can aim for a simplification.
Thanks,
Kanchana
* Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage.
2025-03-17 21:15 ` Sridhar, Kanchana P
@ 2025-03-17 23:25 ` Herbert Xu
2025-03-18 14:23 ` Yosry Ahmed
1 sibling, 0 replies; 44+ messages in thread
From: Herbert Xu @ 2025-03-17 23:25 UTC (permalink / raw)
To: Sridhar, Kanchana P
Cc: Yosry Ahmed, linux-kernel, linux-mm, hannes, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, linux-crypto, davem, clabbe, ardb, ebiggers, surenb,
Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh
On Mon, Mar 17, 2025 at 09:15:09PM +0000, Sridhar, Kanchana P wrote:
>
> The problem however is that, in the current architecture, CPU onlining/
> zswap_pool creation, and CPU offlining/zswap_pool deletion have the
> same semantics as far as these resources are concerned. Hence, although
> zswap_cpu_comp_prepare() is run on a control CPU, the CPU for which
> the "hotplug" code is called is in fact online. It is possible for the memory
> allocation calls in zswap_cpu_comp_prepare() to recurse into
> zswap_compress(), which now needs to be handled by the current pool,
> since the new pool has not yet been added to the zswap_pools, as you
> pointed out.
Please hold onto your patch-set for a while because I intend to
get rid of the per-cpu pool in zswap after conversion to acomp.
There is no reason to have a per-cpu pool at all, except for the
stream memory used by the algorithm. Since we've already moved
that into the Crypto API for acomp, this means zswap no longer
needs to have any per-cpu pools. In fact, with LZO decompression
it could go all the way through with no per-cpu resources at all.
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage.
2025-03-17 21:15 ` Sridhar, Kanchana P
2025-03-17 23:25 ` Herbert Xu
@ 2025-03-18 14:23 ` Yosry Ahmed
2025-03-18 17:38 ` Sridhar, Kanchana P
1 sibling, 1 reply; 44+ messages in thread
From: Yosry Ahmed @ 2025-03-18 14:23 UTC (permalink / raw)
To: Sridhar, Kanchana P
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh
On Mon, Mar 17, 2025 at 09:15:09PM +0000, Sridhar, Kanchana P wrote:
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > Sent: Monday, March 10, 2025 10:31 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux-
> > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> > Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource
> > allocation/deletion and mutex lock usage.
> >
> > On Sat, Mar 08, 2025 at 02:47:15AM +0000, Sridhar, Kanchana P wrote:
> > >
> > [..]
> > > > > > > u8 *buffer;
> > > > > > > + u8 nr_reqs;
> > > > > > > + struct crypto_wait wait;
> > > > > > > struct mutex mutex;
> > > > > > > bool is_sleepable;
> > > > > > > + bool __online;
> > > > > >
> > > > > > I don't believe we need this.
> > > > > >
> > > > > > If we are not freeing resources during CPU offlining, then we do not
> > > > > > need a CPU offline callback and acomp_ctx->__online serves no
> > purpose.
> > > > > >
> > > > > > The whole point of synchronizing between offlining and
> > > > > > compress/decompress operations is to avoid UAF. If offlining does not
> > > > > > free resources, then we can hold the mutex directly in the
> > > > > > compress/decompress path and drop the hotunplug callback
> > completely.
> > > > > >
> > > > > > I also believe nr_reqs can be dropped from this patch, as it seems like
> > > > > > > it's only used to know when to set __online.
> > > > >
> > > > > All great points! In fact, that was the original solution I had implemented
> > > > > (not having an offline callback). But then, I spent some time
> > understanding
> > > > > the v6.13 hotfix for synchronizing freeing of resources, and this comment
> > > > > in zswap_cpu_comp_prepare():
> > > > >
> > > > > /*
> > > > > * Only hold the mutex after completing allocations, otherwise we may
> > > > > * recurse into zswap through reclaim and attempt to hold the mutex
> > > > > * again resulting in a deadlock.
> > > > > */
> > > > >
> > > > > Hence, I figured the constraint of "recurse into zswap through reclaim"
> > was
> > > > > something to comprehend in the simplification (even though I had a
> > tough
> > > > > time imagining how this could happen).
> > > >
> > > > The constraint here is about zswap_cpu_comp_prepare() holding the
> > mutex,
> > > > making an allocation which internally triggers reclaim, then recursing
> > > > into zswap and trying to hold the same mutex again causing a deadlock.
> > > >
> > > > If zswap_cpu_comp_prepare() does not need to hold the mutex to begin
> > > > with, the constraint naturally goes away.
> > >
> > > Actually, if it is possible for the allocations in zswap_cpu_comp_prepare()
> > > to trigger reclaim, then I believe we need some way for reclaim to know if
> > > the acomp_ctx resources are available. Hence, this seems like a potential
> > > for deadlock regardless of the mutex.
> >
> > I took a closer look and I believe my hotfix was actually unnecessary. I
> > sent it out in response to a syzbot report, but upon closer look it
> > seems like it was not an actual problem. Sorry if my patch confused you.
> >
> > Looking at enum cpuhp_state in include/linux/cpuhotplug.h, it seems like
> > CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section. The comment
> > above
> > says:
> >
> > * PREPARE: The callbacks are invoked on a control CPU before the
> > * hotplugged CPU is started up or after the hotplugged CPU has died.
> >
> > So even if we go into reclaim during zswap_cpu_comp_prepare(), it will
> > never be on the CPU that we are allocating resources for.
> >
> > The other case where zswap_cpu_comp_prepare() could race with
> > compression/decompression is when a pool is being created. In this case,
> > reclaim from zswap_cpu_comp_prepare() can recurse into zswap on the
> > same
> > CPU AFAICT. However, because the pool is still under creation, it will
> > not be used (i.e. zswap_pool_current_get() won't find it).
> >
> > So I think we don't need to worry about zswap_cpu_comp_prepare() racing
> > with compression or decompression for the same pool and CPU.
>
> Thanks Yosry, for this observation! You are right, when considered purely
> from a CPU hotplug perspective, zswap_cpu_comp_prepare() and
> zswap_cpu_comp_dead() in fact run on a control CPU, because the state is
> registered in the PREPARE section of "enum cpuhp_state" in cpuhotplug.h.
>
> The problem however is that, in the current architecture, CPU onlining/
> zswap_pool creation, and CPU offlining/zswap_pool deletion have the
> same semantics as far as these resources are concerned. Hence, although
> zswap_cpu_comp_prepare() is run on a control CPU, the CPU for which
> the "hotplug" code is called is in fact online. It is possible for the memory
> allocation calls in zswap_cpu_comp_prepare() to recurse into
> zswap_compress(), which now needs to be handled by the current pool,
> since the new pool has not yet been added to the zswap_pools, as you
> pointed out.
>
> The ref on the current pool has not yet been dropped. Could there be
> a potential for a deadlock at pool transition time: the new pool is blocked
> from allocating acomp_ctx resources, triggering reclaim, which the old
> pool needs to handle?
I am not sure how this could lead to a deadlock. The compression will be
happening in a different pool with a different acomp_ctx.
>
> I see other places in the kernel that use CPU hotplug for resource allocation,
> outside of the context of CPU onlining. IIUC, it is difficult to guarantee that
> the startup/teardown callbacks are modifying acomp_ctx resources for a
> dysfunctional CPU.
IIUC, outside the context of CPU onlining, CPU hotplug callbacks get
called when they are added. In this case, only the newly added callbacks
will be executed. IOW, zswap's hotplug callback should not be randomly
getting called when irrelevant code adds hotplug callbacks. It should
only happen during zswap pool initialization or CPU onlining.
Please correct me if I am wrong.
>
> Now that I think about it, the only real constraint is that the acomp_ctx
> resources are guaranteed to exist for a functional CPU which can run zswap
> compress/decompress.
I believe this is already the case as I previously described, because
the hotplug callback can only be called in two scenarios:
- Zswap pool initialization, in which case compress/decompress
operations cannot run on the pool we are initializing.
- CPU onlining, in which case compress/decompress operations cannot run
on the CPU we are onlining.
Please correct me if I am wrong.
>
> I think we can simplify this as follows, and would welcome suggestions
> to improve the proposed solution:
>
> 1) We dis-associate the acomp_ctx from the pool, and instead, have this
> be a global percpu zswap resource that gets allocated once in zswap_setup(),
> just like the zswap_entry_cache.
> 2) The acomp_ctx resources will get allocated during zswap_setup(), using
> the cpuhp_setup_state_multi callback() in zswap_setup(), that registers
> zswap_cpu_comp_prepare(), but no teardown callback.
> 3) We call cpuhp_state_add_instance() for_each_possible_cpu(cpu) in
> zswap_setup().
> 4) The acomp_ctx resources persist through subsequent "real CPU offline/online
> state transitions".
> 5) zswap_[de]compress() can go ahead and lock the mutex, and use the
> reqs/buffers without worrying about whether these resources are
> initialized or still exist/are being deleted.
> 6) "struct zswap_pool" is now de-coupled from this global percpu zswap
> acomp_ctx.
> 7) To address the issue of how many reqs/buffers to allocate, there could
> potentially be a memory cost for non-batching compressors, if we want
> to always allocate ZSWAP_MAX_BATCH_SIZE acomp_reqs and buffers.
> This would allow the acomp_ctx to seamlessly handle batching
> compressors, non-batching compressors, and transitions among the
> two compressor types in a pretty general manner, that relies only on
> the ZSWAP_MAX_BATCH_SIZE, which we define anyway.
>
> I believe we can maximize the chances of success for the allocation of
> the acomp_ctx resources if this is done in zswap_setup(), but please
> correct me if I am wrong.
>
> The added memory cost for platforms without IAA would be
> ~57 KB per cpu, on x86. Would this be acceptable?
I think that's a lot of memory to waste per-CPU, and I don't see a good
reason for it.
> If not, I don't believe this simplification would be worth it, because
> allocating for one req/buffer, then dynamically adding more resources
> if a newly selected compressor requires more resources, would run
> into the same race conditions and added checks as in
> acomp_ctx_get_cpu_lock(), which I believe, seem to be necessary because
> CPU onlining/zswap_pool creation and CPU offlining/zswap_pool
> deletion have the same semantics for these resources.
Agree that using a single acomp_ctx per-CPU but making the resources
resizable is not a win.
>
> The only other fallback solution in lieu of the proposed simplification that
> I can think of is to keep the lifespan of these resources from pool creation
> to deletion, using the CPU hotplug code. Although, it is not totally clear
> to me if there is potential for deadlock during pool transitions, as noted above.
I am not sure what's the deadlock scenario you're worried about, please
clarify.
* RE: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage.
2025-03-18 14:23 ` Yosry Ahmed
@ 2025-03-18 17:38 ` Sridhar, Kanchana P
2025-03-18 19:05 ` Yosry Ahmed
0 siblings, 1 reply; 44+ messages in thread
From: Sridhar, Kanchana P @ 2025-03-18 17:38 UTC (permalink / raw)
To: Yosry Ahmed
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
Kanchana P
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Tuesday, March 18, 2025 7:24 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource
> allocation/deletion and mutex lock usage.
>
> On Mon, Mar 17, 2025 at 09:15:09PM +0000, Sridhar, Kanchana P wrote:
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > Sent: Monday, March 10, 2025 10:31 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > hannes@cmpxchg.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev;
> > > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux-
> > > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > > <kristen.c.accardi@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>;
> > > Gopal, Vinodh <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource
> > > allocation/deletion and mutex lock usage.
> > >
> > > On Sat, Mar 08, 2025 at 02:47:15AM +0000, Sridhar, Kanchana P wrote:
> > > >
> > > [..]
> > > > > > > > u8 *buffer;
> > > > > > > > + u8 nr_reqs;
> > > > > > > > + struct crypto_wait wait;
> > > > > > > > struct mutex mutex;
> > > > > > > > bool is_sleepable;
> > > > > > > > + bool __online;
> > > > > > >
> > > > > > > I don't believe we need this.
> > > > > > >
> > > > > > > If we are not freeing resources during CPU offlining, then we do
> not
> > > > > > > need a CPU offline callback and acomp_ctx->__online serves no
> > > purpose.
> > > > > > >
> > > > > > > The whole point of synchronizing between offlining and
> > > > > > > compress/decompress operations is to avoid UAF. If offlining does
> not
> > > > > > > free resources, then we can hold the mutex directly in the
> > > > > > > compress/decompress path and drop the hotunplug callback
> > > completely.
> > > > > > >
> > > > > > > I also believe nr_reqs can be dropped from this patch, as it seems
> like
> > > > > > > > it's only used to know when to set __online.
> > > > > >
> > > > > > All great points! In fact, that was the original solution I had
> implemented
> > > > > > (not having an offline callback). But then, I spent some time
> > > understanding
> > > > > > the v6.13 hotfix for synchronizing freeing of resources, and this
> comment
> > > > > > in zswap_cpu_comp_prepare():
> > > > > >
> > > > > > /*
> > > > > > * Only hold the mutex after completing allocations, otherwise we may
> > > > > > * recurse into zswap through reclaim and attempt to hold the mutex
> > > > > > * again resulting in a deadlock.
> > > > > > */
> > > > > >
> > > > > > Hence, I figured the constraint of "recurse into zswap through
> reclaim"
> > > was
> > > > > > something to comprehend in the simplification (even though I had a
> > > tough
> > > > > > time imagining how this could happen).
> > > > >
> > > > > The constraint here is about zswap_cpu_comp_prepare() holding the
> > > mutex,
> > > > > making an allocation which internally triggers reclaim, then recursing
> > > > > into zswap and trying to hold the same mutex again causing a
> deadlock.
> > > > >
> > > > > If zswap_cpu_comp_prepare() does not need to hold the mutex to
> begin
> > > > > with, the constraint naturally goes away.
> > > >
> > > > Actually, if it is possible for the allocations in
> zswap_cpu_comp_prepare()
> > > > to trigger reclaim, then I believe we need some way for reclaim to know
> if
> > > > the acomp_ctx resources are available. Hence, this seems like a
> potential
> > > > for deadlock regardless of the mutex.
> > >
> > > I took a closer look and I believe my hotfix was actually unnecessary. I
> > > sent it out in response to a syzbot report, but upon closer look it
> > > seems like it was not an actual problem. Sorry if my patch confused you.
> > >
> > > Looking at enum cpuhp_state in include/linux/cpuhotplug.h, it seems like
> > > CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section. The
> comment
> > > above
> > > says:
> > >
> > > * PREPARE: The callbacks are invoked on a control CPU before the
> > > * hotplugged CPU is started up or after the hotplugged CPU has died.
> > >
> > > So even if we go into reclaim during zswap_cpu_comp_prepare(), it will
> > > never be on the CPU that we are allocating resources for.
> > >
> > > The other case where zswap_cpu_comp_prepare() could race with
> > > compression/decompression is when a pool is being created. In this case,
> > > reclaim from zswap_cpu_comp_prepare() can recurse into zswap on the
> > > same
> > > CPU AFAICT. However, because the pool is still under creation, it will
> > > not be used (i.e. zswap_pool_current_get() won't find it).
> > >
> > > So I think we don't need to worry about zswap_cpu_comp_prepare()
> racing
> > > with compression or decompression for the same pool and CPU.
> >
> > Thanks Yosry, for this observation! You are right, when considered purely
> > from a CPU hotplug perspective, zswap_cpu_comp_prepare() and
> > zswap_cpu_comp_dead() in fact run on a control CPU, because the state is
> > registered in the PREPARE section of "enum cpuhp_state" in cpuhotplug.h.
> >
> > The problem however is that, in the current architecture, CPU onlining/
> > zswap_pool creation, and CPU offlining/zswap_pool deletion have the
> > same semantics as far as these resources are concerned. Hence, although
> > zswap_cpu_comp_prepare() is run on a control CPU, the CPU for which
> > the "hotplug" code is called is in fact online. It is possible for the memory
> > allocation calls in zswap_cpu_comp_prepare() to recurse into
> > zswap_compress(), which now needs to be handled by the current pool,
> > since the new pool has not yet been added to the zswap_pools, as you
> > pointed out.
> >
> > The ref on the current pool has not yet been dropped. Could there be
> > a potential for a deadlock at pool transition time: the new pool is blocked
> > from allocating acomp_ctx resources, triggering reclaim, which the old
> > pool needs to handle?
>
> I am not sure how this could lead to a deadlock. The compression will be
> happening in a different pool with a different acomp_ctx.
I was thinking about this from the perspective of comparing the trade-offs
between these two approaches:
a) Allocating acomp_ctx resources for a pool when a CPU is functional, vs.
b) Allocating acomp_ctx resources once upfront.
With (a), when the user switches zswap to use a new compressor, it is possible
that the system is already in a low memory situation and the CPU could be doing
a lot of swapouts. It occurred to me that in theory, the call to switch the
compressor through the sysfs interface could never return if the acomp_ctx
allocations trigger direct reclaim in this scenario. This was in the context of
exploring if a better design is possible, while acknowledging that this could still
happen today.
With (b), this situation is avoided by design, and we can switch to a new pool
without triggering additional reclaim. Sorry, I should have articulated this better.
>
> >
> > I see other places in the kernel that use CPU hotplug for resource allocation,
> > outside of the context of CPU onlining. IIUC, it is difficult to guarantee that
> > the startup/teardown callbacks are modifying acomp_ctx resources for a
> > dysfunctional CPU.
>
> IIUC, outside the context of CPU onlining, CPU hotplug callbacks get
> called when they are added. In this case, only the newly added callbacks
> will be executed. IOW, zswap's hotplug callback should not be randomly
> getting called when irrelevant code adds hotplug callbacks. It should
> only happen during zswap pool initialization or CPU onlining.
>
> Please correct me if I am wrong.
Yes, this is my understanding as well.
>
> >
> > Now that I think about it, the only real constraint is that the acomp_ctx
> > resources are guaranteed to exist for a functional CPU which can run zswap
> > compress/decompress.
>
> I believe this is already the case as I previously described, because
> the hotplug callback can only be called in two scenarios:
> - Zswap pool initialization, in which case compress/decompress
> operations cannot run on the pool we are initializing.
> - CPU onlining, in which case compress/decompress operations cannot run
> on the CPU we are onlining.
>
> Please correct me if I am wrong.
Agreed, this is my understanding too.
>
> >
> > I think we can simplify this as follows, and would welcome suggestions
> > to improve the proposed solution:
> >
> > 1) We dis-associate the acomp_ctx from the pool, and instead, have this
> > be a global percpu zswap resource that gets allocated once in
> zswap_setup(),
> > just like the zswap_entry_cache.
> > 2) The acomp_ctx resources will get allocated during zswap_setup(), using
> > the cpuhp_setup_state_multi callback() in zswap_setup(), that registers
> > zswap_cpu_comp_prepare(), but no teardown callback.
> > 3) We call cpuhp_state_add_instance() for_each_possible_cpu(cpu) in
> > zswap_setup().
> > 4) The acomp_ctx resources persist through subsequent "real CPU
> offline/online
> > state transitions".
> > 5) zswap_[de]compress() can go ahead and lock the mutex, and use the
> > reqs/buffers without worrying about whether these resources are
> > initialized or still exist/are being deleted.
> > 6) "struct zswap_pool" is now de-coupled from this global percpu zswap
> > acomp_ctx.
> > 7) To address the issue of how many reqs/buffers to allocate, there could
> > potentially be a memory cost for non-batching compressors, if we want
> > to always allocate ZSWAP_MAX_BATCH_SIZE acomp_reqs and buffers.
> > This would allow the acomp_ctx to seamlessly handle batching
> > compressors, non-batching compressors, and transitions among the
> > two compressor types in a pretty general manner, that relies only on
> > the ZSWAP_MAX_BATCH_SIZE, which we define anyway.
> >
> > I believe we can maximize the chances of success for the allocation of
> > the acomp_ctx resources if this is done in zswap_setup(), but please
> > correct me if I am wrong.
> >
> > The added memory cost for platforms without IAA would be
> > ~57 KB per cpu, on x86. Would this be acceptable?
>
> I think that's a lot of memory to waste per-CPU, and I don't see a good
> reason for it.
Yes, it appears so. In trying to see whether a better design is possible
by de-coupling the acomp_ctx from the zswap_pool:
Would this cost be acceptable if it is incurred based on a build config
option, say CONFIG_ALLOC_ZSWAP_BATCHING_RESOURCES (default OFF)?
If this is set, we go ahead and allocate ZSWAP_MAX_BATCH_SIZE
acomp_ctx resources once, during zswap_setup(). If not, we allocate
only one req/buffer in the global percpu acomp_ctx?
Since we are comprehending that other compressors may want to
do batching in future, I thought a more general config option name
would be more appropriate.
>
> > If not, I don't believe this simplification would be worth it, because
> > allocating for one req/buffer, then dynamically adding more resources
> > if a newly selected compressor requires more resources, would run
> > into the same race conditions and added checks as in
> > acomp_ctx_get_cpu_lock(), which I believe, seem to be necessary
> because
> > CPU onlining/zswap_pool creation and CPU offlining/zswap_pool
> > deletion have the same semantics for these resources.
>
> Agree that using a single acomp_ctx per-CPU but making the resources
> resizable is not a win.
Yes, this makes sense: resizing is not the way to go.
>
> >
> > The only other fallback solution in lieu of the proposed simplification that
> > I can think of is to keep the lifespan of these resources from pool creation
> > to deletion, using the CPU hotplug code. Although, it is not totally clear
> > to me if there is potential for deadlock during pool transitions, as noted
> above.
>
> I am not sure what's the deadlock scenario you're worried about, please
> clarify.
My apologies: I was referring to triggering direct reclaim during pool creation,
which could, in theory, run into a scenario where the pool switching would
have to wait for reclaim to free up enough memory for the acomp_ctx
resource allocation. This could cause the system to hang, but not deadlock.
This can happen even today, hence my trying to see whether a better design is possible.
Thanks,
Kanchana
* Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage.
2025-03-18 17:38 ` Sridhar, Kanchana P
@ 2025-03-18 19:05 ` Yosry Ahmed
2025-03-18 21:09 ` Sridhar, Kanchana P
0 siblings, 1 reply; 44+ messages in thread
From: Yosry Ahmed @ 2025-03-18 19:05 UTC (permalink / raw)
To: Sridhar, Kanchana P
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh
On Tue, Mar 18, 2025 at 05:38:49PM +0000, Sridhar, Kanchana P wrote:
[..]
> > > Thanks Yosry, for this observation! You are right, when considered purely
> > > from a CPU hotplug perspective, zswap_cpu_comp_prepare() and
> > > zswap_cpu_comp_dead() in fact run on a control CPU, because the state is
> > > registered in the PREPARE section of "enum cpuhp_state" in cpuhotplug.h.
> > >
> > > The problem however is that, in the current architecture, CPU onlining/
> > > zswap_pool creation, and CPU offlining/zswap_pool deletion have the
> > > same semantics as far as these resources are concerned. Hence, although
> > > zswap_cpu_comp_prepare() is run on a control CPU, the CPU for which
> > > the "hotplug" code is called is in fact online. It is possible for the memory
> > > allocation calls in zswap_cpu_comp_prepare() to recurse into
> > > zswap_compress(), which now needs to be handled by the current pool,
> > > since the new pool has not yet been added to the zswap_pools, as you
> > > pointed out.
> > >
> > > The ref on the current pool has not yet been dropped. Could there be
> > > a potential for a deadlock at pool transition time: the new pool is blocked
> > > from allocating acomp_ctx resources, triggering reclaim, which the old
> > > pool needs to handle?
> >
> > I am not sure how this could lead to a deadlock. The compression will be
> > happening in a different pool with a different acomp_ctx.
>
> I was thinking about this from the perspective of comparing the trade-offs
> between these two approaches:
> a) Allocating acomp_ctx resources for a pool when a CPU is functional, vs.
> b) Allocating acomp_ctx resources once upfront.
>
> With (a), when the user switches zswap to use a new compressor, it is possible
> that the system is already in a low memory situation and the CPU could be doing
> a lot of swapouts. It occurred to me that in theory, the call to switch the
> compressor through the sysfs interface could never return if the acomp_ctx
> allocations trigger direct reclaim in this scenario. This was in the context of
> exploring if a better design is possible, while acknowledging that this could still
> happen today.
If the system is already in a low memory situation a lot of things will
hang. Switching the compressor is not a common operation at all and we
shouldn't really worry about that. Even if we remove the acomp_ctx
allocation, we still need to make some allocations in that path anyway.
>
> With (b), this situation is avoided by design, and we can switch to a new pool
> without triggering additional reclaim. Sorry, I should have articulated this better.
But we have to either allocate more memory unnecessarily or add config
options and make batching a build-time decision. This is unwarranted
imo.
FWIW, the mutexes and buffers used to be per-CPU not per-acomp_ctx, but
they were changed in commit 8ba2f844f050 ("mm/zswap: change per-cpu
mutex and buffer to per-acomp_ctx"). What you're suggesting is not quite
the same as what we had before that commit, it's moving the acomp_ctx
itself to be per-CPU but not per-pool, including the mtuex and buffer.
But I thought the context may be useful.
[..]
> > > 7) To address the issue of how many reqs/buffers to allocate, there could
> > > potentially be a memory cost for non-batching compressors, if we want
> > > to always allocate ZSWAP_MAX_BATCH_SIZE acomp_reqs and buffers.
> > > This would allow the acomp_ctx to seamlessly handle batching
> > > compressors, non-batching compressors, and transitions among the
> > > two compressor types in a pretty general manner, that relies only on
> > > the ZSWAP_MAX_BATCH_SIZE, which we define anyway.
> > >
> > > I believe we can maximize the chances of success for the allocation of
> > > the acomp_ctx resources if this is done in zswap_setup(), but please
> > > correct me if I am wrong.
> > >
> > > The added memory cost for platforms without IAA would be
> > > ~57 KB per cpu, on x86. Would this be acceptable?
> >
> > I think that's a lot of memory to waste per-CPU, and I don't see a good
> > reason for it.
>
> Yes, it appears so. Towards trying to see if a better design is possible
> by de-coupling the acomp_ctx from zswap_pool:
> Would this cost be acceptable if it is incurred based on a build config
> option, saying CONFIG_ALLOC_ZSWAP_BATCHING_RESOURCES (default OFF)?
> If this is set, we go ahead and allocate ZSWAP_MAX_BATCH_SIZE
> acomp_ctx resources once, during zswap_setup(). If not, we allocate
> only one req/buffer in the global percpu acomp_ctx?
We should avoid making batching a build time decision if we can help it.
A lot of kernels are shipped to different hardware that may or may not
support batching, so users will have to either decide to turn off
batching completely or eat the overhead even for hardware that does not
support batching (or for users that use SW compression).
[..]
> > >
> > > The only other fallback solution in lieu of the proposed simplification that
> > > I can think of is to keep the lifespan of these resources from pool creation
> > > to deletion, using the CPU hotplug code. Although, it is not totally clear
> > > to me if there is potential for deadlock during pool transitions, as noted
> > above.
> >
> > I am not sure what's the deadlock scenario you're worried about, please
> > clarify.
>
> My apologies: I was referring to triggering direct reclaim during pool creation,
> which could, in theory, run into a scenario where the pool switching would
> have to wait for reclaim to free up enough memory for the acomp_ctx
> resources allocation: this could cause the system to hang, but not a deadlock.
> This can happen even today, hence trying to see if a better design is possible.
I think the concern here is unfounded. We shouldn't care about the
performance of zswap compressor switching, especially under memory
pressure. A lot of things will slow down under memory pressure, and
compressor switching should be the least of our concerns.
* RE: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage.
2025-03-18 19:05 ` Yosry Ahmed
@ 2025-03-18 21:09 ` Sridhar, Kanchana P
2025-03-18 23:14 ` Yosry Ahmed
0 siblings, 1 reply; 44+ messages in thread
From: Sridhar, Kanchana P @ 2025-03-18 21:09 UTC (permalink / raw)
To: Yosry Ahmed
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
Kanchana P
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Tuesday, March 18, 2025 12:06 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource
> allocation/deletion and mutex lock usage.
>
> On Tue, Mar 18, 2025 at 05:38:49PM +0000, Sridhar, Kanchana P wrote:
> [..]
> > > > Thanks Yosry, for this observation! You are right, when considered purely
> > > > from a CPU hotplug perspective, zswap_cpu_comp_prepare() and
> > > > zswap_cpu_comp_dead() in fact run on a control CPU, because the state
> is
> > > > registered in the PREPARE section of "enum cpuhp_state" in
> cpuhotplug.h.
> > > >
> > > > The problem however is that, in the current architecture, CPU onlining/
> > > > zswap_pool creation, and CPU offlining/zswap_pool deletion have the
> > > > same semantics as far as these resources are concerned. Hence,
> although
> > > > zswap_cpu_comp_prepare() is run on a control CPU, the CPU for which
> > > > the "hotplug" code is called is in fact online. It is possible for the memory
> > > > allocation calls in zswap_cpu_comp_prepare() to recurse into
> > > > zswap_compress(), which now needs to be handled by the current pool,
> > > > since the new pool has not yet been added to the zswap_pools, as you
> > > > pointed out.
> > > >
> > > > The ref on the current pool has not yet been dropped. Could there be
> > > > a potential for a deadlock at pool transition time: the new pool is
> blocked
> > > > from allocating acomp_ctx resources, triggering reclaim, which the old
> > > > pool needs to handle?
> > >
> > > I am not sure how this could lead to a deadlock. The compression will be
> > > happening in a different pool with a different acomp_ctx.
> >
> > I was thinking about this from the perspective of comparing the trade-offs
> > between these two approaches:
> > a) Allocating acomp_ctx resources for a pool when a CPU is functional, vs.
> > b) Allocating acomp_ctx resources once upfront.
> >
> > With (a), when the user switches zswap to use a new compressor, it is
> possible
> > that the system is already in a low memory situation and the CPU could be
> doing
> > a lot of swapouts. It occurred to me that in theory, the call to switch the
> > compressor through the sysfs interface could never return if the acomp_ctx
> > allocations trigger direct reclaim in this scenario. This was in the context of
> > exploring if a better design is possible, while acknowledging that this could
> still
> > happen today.
>
> If the system is already in a low memory situation a lot of things will
> hang. Switching the compressor is not a common operation at all and we
> shouldn't really worry about that. Even if we remove the acomp_ctx
> allocation, we still need to make some allocations in that path anyway.
Ok, these are good points.
>
> >
> > With (b), this situation is avoided by design, and we can switch to a new pool
> > without triggering additional reclaim. Sorry, I should have articulated this
> better.
>
> But we have to either allocate more memory unnecessarily or add config
> options and make batching a build-time decision. This is unwarranted
> imo.
>
> FWIW, the mutexes and buffers used to be per-CPU not per-acomp_ctx, but
> they were changed in commit 8ba2f844f050 ("mm/zswap: change per-cpu
> mutex and buffer to per-acomp_ctx"). What you're suggesting is not quite
> the same as what we had before that commit, it's moving the acomp_ctx
> itself to be per-CPU but not per-pool, including the mutex and buffer.
> But I thought the context may be useful.
Thanks for sharing the context.
>
> [..]
> > > > 7) To address the issue of how many reqs/buffers to allocate, there could
> > > > potentially be a memory cost for non-batching compressors, if we
> want
> > > > to always allocate ZSWAP_MAX_BATCH_SIZE acomp_reqs and
> buffers.
> > > > This would allow the acomp_ctx to seamlessly handle batching
> > > > compressors, non-batching compressors, and transitions among the
> > > > two compressor types in a pretty general manner, that relies only on
> > > > the ZSWAP_MAX_BATCH_SIZE, which we define anyway.
> > > >
> > > > I believe we can maximize the chances of success for the allocation of
> > > > the acomp_ctx resources if this is done in zswap_setup(), but please
> > > > correct me if I am wrong.
> > > >
> > > > The added memory cost for platforms without IAA would be
> > > > ~57 KB per cpu, on x86. Would this be acceptable?
> > >
> > > I think that's a lot of memory to waste per-CPU, and I don't see a good
> > > reason for it.
> >
> > Yes, it appears so. Towards trying to see if a better design is possible
> > by de-coupling the acomp_ctx from zswap_pool:
> > Would this cost be acceptable if it is incurred based on a build config
> > option, saying CONFIG_ALLOC_ZSWAP_BATCHING_RESOURCES (default
> OFF)?
> > If this is set, we go ahead and allocate ZSWAP_MAX_BATCH_SIZE
> > acomp_ctx resources once, during zswap_setup(). If not, we allocate
> > only one req/buffer in the global percpu acomp_ctx?
>
> We should avoid making batching a build time decision if we can help it.
> A lot of kernels are shipped to different hardware that may or may not
> support batching, so users will have to either decide to turn off
> batching completely or eat the overhead even for hardware that does not
> support batching (or for users that use SW compression).
>
> [..]
> > > >
> > > > The only other fallback solution in lieu of the proposed simplification
> that
> > > > I can think of is to keep the lifespan of these resources from pool
> creation
> > > > to deletion, using the CPU hotplug code. Although, it is not totally
> clear
> > > > to me if there is potential for deadlock during pool transitions, as
> noted
> > > above.
> > >
> > > I am not sure what's the deadlock scenario you're worried about, please
> > > clarify.
> >
> > My apologies: I was referring to triggering direct reclaim during pool
> creation,
> > which could, in theory, run into a scenario where the pool switching would
> > have to wait for reclaim to free up enough memory for the acomp_ctx
> > resources allocation: this could cause the system to hang, but not a
> deadlock.
> > This can happen even today, hence trying to see if a better design is
> possible.
>
> I think the concern here is unfounded. We shouldn't care about the
> performance of zswap compressor switching, especially under memory
> pressure. A lot of things will slow down under memory pressure, and
> compressor switching should be the least of our concerns.
Sounds good. It then appears that making the per-CPU acomp_ctx resources'
lifetime track that of the zswap_pool is the way to go. These resources
will be allocated as per the requirements of the compressor, i.e., the zswap_pool,
and will persist across CPU online/offline transitions via the hotplug
interface. If this plan is acceptable, it appears that acomp_ctx_get_cpu_lock()
and acomp_ctx_put_unlock() can be replaced with mutex_lock()/unlock() in
zswap_[de]compress()? I will incorporate these changes in v9 if this sounds Ok.
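For illustration, a minimal sketch of the simplified locking on the compress side
under this plan (schematic only; the exact function names and signatures in v9 may
differ):

static bool zswap_compress(struct page *page, struct zswap_entry *entry,
			   struct zswap_pool *pool)
{
	struct crypto_acomp_ctx *acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);

	/*
	 * The per-CPU acomp_ctx now lives as long as the pool, so the mutex
	 * can be taken directly; no __online check or retry loop is needed.
	 * Even if we migrate after raw_cpu_ptr(), the chosen ctx still exists
	 * and the mutex serializes its use.
	 */
	mutex_lock(&acomp_ctx->mutex);

	/*
	 * ... set up the sg lists, call crypto_acomp_compress() (or the
	 * batching path), copy the compressed data into the zpool ...
	 */

	mutex_unlock(&acomp_ctx->mutex);
	return true;
}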
Thanks,
Kanchana
* Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage.
2025-03-18 21:09 ` Sridhar, Kanchana P
@ 2025-03-18 23:14 ` Yosry Ahmed
2025-03-18 23:37 ` Sridhar, Kanchana P
0 siblings, 1 reply; 44+ messages in thread
From: Yosry Ahmed @ 2025-03-18 23:14 UTC (permalink / raw)
To: Sridhar, Kanchana P
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh
On Tue, Mar 18, 2025 at 09:09:05PM +0000, Sridhar, Kanchana P wrote:
[..]
> > > > >
> > > > > The only other fallback solution in lieu of the proposed simplification
> > that
> > > > > I can think of is to keep the lifespan of these resources from pool
> > creation
> > > > > to deletion, using the CPU hotplug code. Although, it is not totally
> > clear
> > > > > to me if there is potential for deadlock during pool transitions, as
> > noted
> > > > above.
> > > >
> > > > I am not sure what's the deadlock scenario you're worried about, please
> > > > clarify.
> > >
> > > My apologies: I was referring to triggering direct reclaim during pool
> > creation,
> > > which could, in theory, run into a scenario where the pool switching would
> > > have to wait for reclaim to free up enough memory for the acomp_ctx
> > > resources allocation: this could cause the system to hang, but not a
> > deadlock.
> > > This can happen even today, hence trying to see if a better design is
> > possible.
> >
> > I think the concern here is unfounded. We shouldn't care about the
> > performance of zswap compressor switching, especially under memory
> > pressure. A lot of things will slow down under memory pressure, and
> > compressor switching should be the least of our concerns.
>
> Sounds good. It then appears that making the per-cpu acomp_ctx resources'
> lifetime track that of the zswap_pool, is the way to go. These resources
> will be allocated as per the requirements of the compressor, i.e., zswap_pool,
> and will persist through CPU online/offline transitions through the hotplug
> interface. If this plan is acceptable, it appears that acomp_ctx_get_cpu_lock()
> and acomp_ctx_put_unlock() can be replaced with mutex_lock()/unlock() in
> zswap_[de]compress()? I will incorporate these changes in v9 if this sounds Ok.
Sounds good to me. Thanks!
* RE: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage.
2025-03-18 23:14 ` Yosry Ahmed
@ 2025-03-18 23:37 ` Sridhar, Kanchana P
2025-04-20 21:01 ` Andrew Morton
0 siblings, 1 reply; 44+ messages in thread
From: Sridhar, Kanchana P @ 2025-03-18 23:37 UTC (permalink / raw)
To: Yosry Ahmed
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
Kanchana P
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Tuesday, March 18, 2025 4:14 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource
> allocation/deletion and mutex lock usage.
>
> On Tue, Mar 18, 2025 at 09:09:05PM +0000, Sridhar, Kanchana P wrote:
> [..]
> > > > > >
> > > > > > The only other fallback solution in lieu of the proposed
> simplification
> > > that
> > > > > > I can think of is to keep the lifespan of these resources from pool
> > > creation
> > > > > > to deletion, using the CPU hotplug code. Although, it is not totally
> > > clear
> > > > > > to me if there is potential for deadlock during pool transitions, as
> > > noted
> > > > > above.
> > > > >
> > > > > I am not sure what's the deadlock scenario you're worried about,
> please
> > > > > clarify.
> > > >
> > > > My apologies: I was referring to triggering direct reclaim during pool
> > > creation,
> > > > which could, in theory, run into a scenario where the pool switching
> would
> > > > have to wait for reclaim to free up enough memory for the acomp_ctx
> > > > resources allocation: this could cause the system to hang, but not a
> > > deadlock.
> > > > This can happen even today, hence trying to see if a better design is
> > > possible.
> > >
> > > I think the concern here is unfounded. We shouldn't care about the
> > > performance of zswap compressor switching, especially under memory
> > > pressure. A lot of things will slow down under memory pressure, and
> > > compressor switching should be the least of our concerns.
> >
> > Sounds good. It then appears that making the per-cpu acomp_ctx resources'
> > lifetime track that of the zswap_pool, is the way to go. These resources
> > will be allocated as per the requirements of the compressor, i.e.,
> zswap_pool,
> > and will persist through CPU online/offline transitions through the hotplug
> > interface. If this plan is acceptable, it appears that
> acomp_ctx_get_cpu_lock()
> > and acomp_ctx_put_unlock() can be replaced with mutex_lock()/unlock() in
> > zswap_[de]compress()? I will incorporate these changes in v9 if this sounds
> Ok.
>
> Sounds good to me. Thanks!
Thanks Yosry! Will proceed accordingly.
Thanks,
Kanchana
* Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage.
2025-03-18 23:37 ` Sridhar, Kanchana P
@ 2025-04-20 21:01 ` Andrew Morton
2025-04-20 23:35 ` Nhat Pham
0 siblings, 1 reply; 44+ messages in thread
From: Andrew Morton @ 2025-04-20 21:01 UTC (permalink / raw)
To: Sridhar, Kanchana P
Cc: Yosry Ahmed, linux-kernel, linux-mm, hannes, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh, m95d
https://bugzilla.kernel.org/show_bug.cgi?id=220035 might be related to
this patchset?
* Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage.
2025-04-20 21:01 ` Andrew Morton
@ 2025-04-20 23:35 ` Nhat Pham
2025-04-21 3:31 ` [PATCH] crypto: scomp - Fix off-by-one bug when calculating last page Herbert Xu
0 siblings, 1 reply; 44+ messages in thread
From: Nhat Pham @ 2025-04-20 23:35 UTC (permalink / raw)
To: Andrew Morton
Cc: Sridhar, Kanchana P, Yosry Ahmed, linux-kernel, linux-mm, hannes,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh, m95d
On Sun, Apr 20, 2025 at 2:01 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
>
> https://bugzilla.kernel.org/show_bug.cgi?id=220035 might be related to
> this patchset?
The bug report is a bit confusing to me. The dmesg trace says the
kernel version is 6.14, but the reporter says the bisected broken
commit is one of the 6.15-rc1 merge requests?
Anyhow, this looks like a crypto/compression infra bug. Herbert, does
this ring any bell for you?
* [PATCH] crypto: scomp - Fix off-by-one bug when calculating last page
2025-04-20 23:35 ` Nhat Pham
@ 2025-04-21 3:31 ` Herbert Xu
2025-04-21 19:02 ` Marius Dinu
0 siblings, 1 reply; 44+ messages in thread
From: Herbert Xu @ 2025-04-21 3:31 UTC (permalink / raw)
To: Nhat Pham
Cc: Andrew Morton, Sridhar, Kanchana P, Yosry Ahmed, linux-kernel,
linux-mm, hannes, chengming.zhou, usamaarif642, ryan.roberts,
21cnbao, ying.huang, linux-crypto, davem, clabbe, ardb, ebiggers,
surenb, Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh,
m95d
On Sun, Apr 20, 2025 at 04:35:44PM -0700, Nhat Pham wrote:
>
> Anyhow, this looks like a crypto/compression infra bug. Herbert, does
> this ring any bell for you?
Yes this looks like an off-by-one bug in the new scomp scratch
code.
---8<---
Fix off-by-one bug in the last page calculation for src and dst.
Reported-by: Nhat Pham <nphamcs@gmail.com>
Fixes: 2d3553ecb4e3 ("crypto: scomp - Remove support for some non-trivial SG lists")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/crypto/scompress.c b/crypto/scompress.c
index 5762fcc63b51..36934c78d127 100644
--- a/crypto/scompress.c
+++ b/crypto/scompress.c
@@ -215,8 +215,8 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
spage = nth_page(spage, soff / PAGE_SIZE);
soff = offset_in_page(soff);
- n = slen / PAGE_SIZE;
- n += (offset_in_page(slen) + soff - 1) / PAGE_SIZE;
+ n = (slen - 1) / PAGE_SIZE;
+ n += (offset_in_page(slen - 1) + soff) / PAGE_SIZE;
if (PageHighMem(nth_page(spage, n)) &&
size_add(soff, slen) > PAGE_SIZE)
break;
@@ -243,9 +243,9 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
dpage = nth_page(dpage, doff / PAGE_SIZE);
doff = offset_in_page(doff);
- n = dlen / PAGE_SIZE;
- n += (offset_in_page(dlen) + doff - 1) / PAGE_SIZE;
- if (PageHighMem(dpage + n) &&
+ n = (dlen - 1) / PAGE_SIZE;
+ n += (offset_in_page(dlen - 1) + doff) / PAGE_SIZE;
+ if (PageHighMem(nth_page(dpage, n)) &&
size_add(doff, dlen) > PAGE_SIZE)
break;
dst = kmap_local_page(dpage) + doff;
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
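
To see the off-by-one concretely, here is a small standalone user-space sketch (my own illustration, not kernel code) that evaluates the old and new last-page formulas for a page-aligned source length with a zero page offset, assuming 4 KiB pages. The old expression underflows the unsigned subtraction and lands far past the last page actually touched, while the new one correctly yields page 0.

/* Standalone sketch: compare the old and new "last page" index formulas
 * from scomp_acomp_comp_decomp(), assuming 4 KiB pages.
 */
#include <stdio.h>

#define PAGE_SIZE 4096u
#define offset_in_page(x) ((x) & (PAGE_SIZE - 1))

static unsigned int last_page_old(unsigned int off, unsigned int len)
{
	unsigned int n = len / PAGE_SIZE;

	n += (offset_in_page(len) + off - 1) / PAGE_SIZE;
	return n;
}

static unsigned int last_page_new(unsigned int off, unsigned int len)
{
	unsigned int n = (len - 1) / PAGE_SIZE;

	n += (offset_in_page(len - 1) + off) / PAGE_SIZE;
	return n;
}

int main(void)
{
	/* Bytes 0..4095 live entirely in page 0, so the expected index is 0. */
	unsigned int off = 0, len = PAGE_SIZE;

	printf("old formula: %u\n", last_page_old(off, len)); /* 1048576: unsigned wrap-around */
	printf("new formula: %u\n", last_page_new(off, len)); /* 0: correct */
	return 0;
}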
* Re: [PATCH] crypto: scomp - Fix off-by-one bug when calculating last page
2025-04-21 3:31 ` [PATCH] crypto: scomp - Fix off-by-one bug when calculating last page Herbert Xu
@ 2025-04-21 19:02 ` Marius Dinu
0 siblings, 0 replies; 44+ messages in thread
From: Marius Dinu @ 2025-04-21 19:02 UTC (permalink / raw)
To: Herbert Xu
Cc: Nhat Pham, Andrew Morton, Sridhar, Kanchana P, Yosry Ahmed,
linux-kernel, linux-mm, hannes, chengming.zhou, usamaarif642,
ryan.roberts, 21cnbao, ying.huang, linux-crypto, davem, clabbe,
ardb, ebiggers, surenb, Accardi, Kristen C, Feghali, Wajdi K,
Gopal, Vinodh, m95d
On Mon, 2025-04-21 11:31:31 +0800, Herbert Xu wrote:
> On Sun, Apr 20, 2025 at 04:35:44PM -0700, Nhat Pham wrote:
> >
> > Anyhow, this looks like a crypto/compression infra bug. Herbert, does
> > this ring any bell for you?
>
> Yes this looks like an off-by-one bug in the new scomp scratch
> code.
>
> ---8<---
> Fix off-by-one bug in the last page calculation for src and dst.
>
> Reported-by: Nhat Pham <nphamcs@gmail.com>
> Fixes: 2d3553ecb4e3 ("crypto: scomp - Remove support for some non-trivial SG lists")
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
>
> diff --git a/crypto/scompress.c b/crypto/scompress.c
> index 5762fcc63b51..36934c78d127 100644
> --- a/crypto/scompress.c
> +++ b/crypto/scompress.c
> @@ -215,8 +215,8 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
> spage = nth_page(spage, soff / PAGE_SIZE);
> soff = offset_in_page(soff);
>
> - n = slen / PAGE_SIZE;
> - n += (offset_in_page(slen) + soff - 1) / PAGE_SIZE;
> + n = (slen - 1) / PAGE_SIZE;
> + n += (offset_in_page(slen - 1) + soff) / PAGE_SIZE;
> if (PageHighMem(nth_page(spage, n)) &&
> size_add(soff, slen) > PAGE_SIZE)
> break;
> @@ -243,9 +243,9 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
> dpage = nth_page(dpage, doff / PAGE_SIZE);
> doff = offset_in_page(doff);
>
> - n = dlen / PAGE_SIZE;
> - n += (offset_in_page(dlen) + doff - 1) / PAGE_SIZE;
> - if (PageHighMem(dpage + n) &&
> + n = (dlen - 1) / PAGE_SIZE;
> + n += (offset_in_page(dlen - 1) + doff) / PAGE_SIZE;
> + if (PageHighMem(nth_page(dpage, n)) &&
> size_add(doff, dlen) > PAGE_SIZE)
> break;
> dst = kmap_local_page(dpage) + doff;
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Tested the patch with git master branch on the same SBC as my bug report.
It works. stress-ng --pageswap doesn't crash anymore.
Thank you!
Marius Dinu
* RE: [PATCH v8 14/14] mm: zswap: Compress batching with request chaining in zswap_store() of large folios.
2025-03-06 21:17 ` Yosry Ahmed
@ 2025-04-30 21:05 ` Sridhar, Kanchana P
0 siblings, 0 replies; 44+ messages in thread
From: Sridhar, Kanchana P @ 2025-04-30 21:05 UTC (permalink / raw)
To: Yosry Ahmed
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
Kanchana P
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Thursday, March 6, 2025 1:17 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v8 14/14] mm: zswap: Compress batching with request
> chaining in zswap_store() of large folios.
>
> On Mon, Mar 03, 2025 at 12:47:24AM -0800, Kanchana P Sridhar wrote:
> > This patch introduces three new procedures:
> >
> > zswap_store_folio()
> > zswap_store_pages()
> > zswap_batch_compress()
> >
> > zswap_store_folio() stores a folio in chunks of "batch_size" pages. If the
> > compressor supports batching and has multiple acomp_ctx->nr_reqs, the
> > batch_size will be the acomp_ctx->nr_reqs. If the compressor does not
> > support batching, the batch_size will be ZSWAP_MAX_BATCH_SIZE. The
> > compressor having multiple acomp_ctx->nr_reqs is passed as a bool
> > "batching" parameter to zswap_store_pages() and
> zswap_batch_compress().
> > This refactorization allows us to move the loop over the folio's pages from
> > zswap_store() to zswap_store_folio(), and also enables batching.
> >
> > zswap_store_pages() implements all the computes done earlier in
> > zswap_store_page() for a single-page, for multiple pages in a folio, namely
> > a "batch". zswap_store_pages() starts by allocating all zswap entries
> > required to store the batch. Next, it calls zswap_batch_compress() to
> > compress the batch. Finally, it adds the batch's zswap entries to the
> > xarray and LRU, charges zswap memory and increments zswap stats.
> >
> > The error handling and cleanup required for all failure scenarios that can
> > occur while storing a batch in zswap are consolidated to a single
> > "store_pages_failed" label in zswap_store_pages().
> >
> > And finally, this patch introduces zswap_batch_compress(), which does the
> > following:
> > - If the compressor supports batching, sets up a request chain for
> > compressing the batch in one shot. If Intel IAA is the zswap
> > compressor, the request chain will be compressed in parallel in
> > hardware. If all requests in the chain are compressed without errors,
> > the compressed buffers are then stored in zpool.
> > - If the compressor does not support batching, each page in the batch is
> > compressed and stored sequentially. zswap_batch_compress() replaces
> > zswap_compress(), thereby eliminating code duplication and facilitating
> > maintainability of the code with the introduction of batching.
> >
> > The call to the crypto layer is exactly the same in both cases: when batch
> > compressing a request chain or when sequentially compressing each page in
> > the batch.
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
>
> High-level comment: It may be better to break this down into two
> patches, one reworking the code to process folios in batches, and one to
> add support for compression batching and request chaining. This may make
> it more digestable.
Hi Yosry,
Thanks a lot for all your suggestions!
I have addressed this in v9.
>
> > mm/zswap.c | 396 ++++++++++++++++++++++++++++++++++++-------------
> ----
> > 1 file changed, 270 insertions(+), 126 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index fae59d6d5147..135d5792ce50 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1051,74 +1051,141 @@ static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx)
> > mutex_unlock(&acomp_ctx->mutex);
> > }
> >
> > -static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> > - struct zswap_pool *pool)
> > +/*
> > + * Unified code paths for compressors that do and do not support
> > + * batching. This procedure will compress multiple @nr_pages in @folio,
> > + * starting from @index.
> > + * If @batching is set to true, it will create a request chain for
> > + * compression batching. It is assumed that the caller has verified
> > + * that the acomp_ctx->nr_reqs is at least @nr_pages.
> > + * If @batching is set to false, it will process each page sequentially.
> > + * In both cases, if all compressions were successful, it will proceed
> > + * to store the compressed buffers in zpool.
> > + */
> > +static bool zswap_batch_compress(struct folio *folio,
> > + long index,
> > + unsigned int nr_pages,
> > + struct zswap_entry *entries[],
> > + struct zswap_pool *pool,
> > + struct crypto_acomp_ctx *acomp_ctx,
> > + bool batching)
>
> If we have a single compress function let's keep it called
> zswap_compress() please.
Yes, I have addressed this in v9.
>
> > {
> > - struct crypto_acomp_ctx *acomp_ctx;
> > - struct scatterlist input, output;
> > - int comp_ret = 0, alloc_ret = 0;
> > - unsigned int dlen = PAGE_SIZE;
> > - unsigned long handle;
> > - struct zpool *zpool;
> > - char *buf;
> > + struct scatterlist inputs[ZSWAP_MAX_BATCH_SIZE];
> > + struct scatterlist outputs[ZSWAP_MAX_BATCH_SIZE];
> > + struct zpool *zpool = pool->zpool;
> > + int acomp_idx = 0, nr_to_store = 1;
> > + unsigned int i, j;
> > + int err = 0;
> > gfp_t gfp;
> > - u8 *dst;
> >
> > - acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> > - dst = acomp_ctx->buffers[0];
> > - sg_init_table(&input, 1);
> > - sg_set_page(&input, page, PAGE_SIZE, 0);
> > + lockdep_assert_held(&acomp_ctx->mutex);
> >
> > - /*
> > -	 * We need PAGE_SIZE * 2 here since there maybe over-compression case,
> > -	 * and hardware-accelerators may won't check the dst buffer size, so
> > -	 * giving the dst buffer with enough length to avoid buffer overflow.
> > -	 */
> > -	sg_init_one(&output, dst, PAGE_SIZE * 2);
> > -	acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, PAGE_SIZE, dlen);
> > -
> > -	/*
> > -	 * it maybe looks a little bit silly that we send an asynchronous request,
> > -	 * then wait for its completion synchronously. This makes the process look
> > -	 * synchronous in fact.
> > -	 * Theoretically, acomp supports users send multiple acomp requests in one
> > -	 * acomp instance, then get those requests done simultaneously. but in this
> > -	 * case, zswap actually does store and load page by page, there is no
> > -	 * existing method to send the second page before the first page is done
> > -	 * in one thread doing zwap.
> > -	 * but in different threads running on different cpu, we have different
> > -	 * acomp instance, so multiple threads can do (de)compression in parallel.
> > -	 */
> > -	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
> > - dlen = acomp_ctx->reqs[0]->dlen;
> > - if (comp_ret)
> > - goto unlock;
> > -
> > - zpool = pool->zpool;
> >  	gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
> > if (zpool_malloc_support_movable(zpool))
> > gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
> > - alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle);
> > - if (alloc_ret)
> > - goto unlock;
> >
> > - buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
> > - memcpy(buf, dst, dlen);
> > - zpool_unmap_handle(zpool, handle);
> > + for (i = 0; i < nr_pages; ++i) {
> > + struct page *page = folio_page(folio, index + i);
> >
> > - entry->handle = handle;
> > - entry->length = dlen;
> > + sg_init_table(&inputs[acomp_idx], 1);
> > + sg_set_page(&inputs[acomp_idx], page, PAGE_SIZE, 0);
> >
> > -unlock:
> > - if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
> > - zswap_reject_compress_poor++;
> > - else if (comp_ret)
> > - zswap_reject_compress_fail++;
> > - else if (alloc_ret)
> > - zswap_reject_alloc_fail++;
> > + /*
> > + * Each dst buffer should be of size (PAGE_SIZE * 2).
> > + * Reflect same in sg_list.
>
> The existing comment about overcompression was useful, please move it
> as-is.
Done.
>
> > + */
> > +		sg_init_one(&outputs[acomp_idx], acomp_ctx->buffers[acomp_idx], PAGE_SIZE * 2);
> > +		acomp_request_set_params(acomp_ctx->reqs[acomp_idx], &inputs[acomp_idx],
> > +						 &outputs[acomp_idx], PAGE_SIZE, PAGE_SIZE);
> > +
> > + if (batching) {
> > + /* Add the acomp request to the chain. */
> > +			if (likely(i))
> > +				acomp_request_chain(acomp_ctx->reqs[acomp_idx], acomp_ctx->reqs[0]);
> > +			else
> > +				acomp_reqchain_init(acomp_ctx->reqs[0], 0, crypto_req_done,
> > +						    &acomp_ctx->wait);
> > +
> > +			if (i == (nr_pages - 1)) {
> > +				/* Process the request chain. */
> > +				err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
> > +
> > +				/*
> > +				 * Get the individual compress errors from request chaining.
> > +				 */
> > +				for (j = 0; j < nr_pages; ++j) {
> > +					if (unlikely(acomp_request_err(acomp_ctx->reqs[j]))) {
> > +						err = -EINVAL;
> > +						if (acomp_request_err(acomp_ctx->reqs[j]) == -ENOSPC)
> > +							zswap_reject_compress_poor++;
> > +						else
> > +							zswap_reject_compress_fail++;
> > +					}
> > +				}
> > +				/*
> > +				 * Request chaining cleanup:
> > +				 *
> > +				 * - Clear the CRYPTO_TFM_REQ_CHAIN bit on acomp_ctx->reqs[0].
> > +				 * - Reset the acomp_ctx->wait to notify acomp_ctx->reqs[0].
> > +				 */
> > +				acomp_reqchain_clear(acomp_ctx->reqs[0], &acomp_ctx->wait);
> > + if (unlikely(err))
> > + return false;
> > + j = 0;
> > + nr_to_store = nr_pages;
> > + goto store_zpool;
> > + }
> > +
> > + ++acomp_idx;
> > + continue;
>
> The code structure here is really hard to read:
>
> (1) Why do we need 'batching' to begin with? Can't we use the same
> request chaining APIs even for the '!batching' case? Are there any
> regressions for doing so?
>
> (2) Instead of the control flow having 'continue', 'goto', and nested
> loops, can we just have separate loops?
>
> IOW can we do something like:
>
> for (..) {
> if (i == 0)
> acomp_reqchain_init(..);
> else
> acomp_request_chain(..);
> }
>
> crypto_wait_req(..);
> /* collect errors */
> acomp_reqchain_clear(..);
>
> for (..) {
> /* store */
> }
>
Thanks for these suggestions! Hopefully the v9 version is much simpler.
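
For readers skimming the thread, here is a rough sketch of the separated-loop shape suggested above, written against the v8 request-chaining helpers quoted in this patch (acomp_reqchain_init(), acomp_request_chain(), acomp_request_err(), acomp_reqchain_clear()). It is illustrative only, with the per-request scatterlist setup and the zpool stores elided, and is not the code that landed in v9.

/* Illustrative sketch: two-phase structure, assuming the scatterlists and
 * acomp_request_set_params() calls for reqs[0..nr_pages-1] were already
 * done, one page per request.
 */
static int compress_batch_sketch(struct crypto_acomp_ctx *acomp_ctx,
				 unsigned int nr_pages)
{
	unsigned int i;
	int err;

	/* Phase 1: build the chain with reqs[0] as the head request. */
	for (i = 0; i < nr_pages; ++i) {
		if (i == 0)
			acomp_reqchain_init(acomp_ctx->reqs[0], 0,
					    crypto_req_done, &acomp_ctx->wait);
		else
			acomp_request_chain(acomp_ctx->reqs[i],
					    acomp_ctx->reqs[0]);
	}

	/* Phase 2: one call on the head request compresses the whole batch. */
	err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]),
			      &acomp_ctx->wait);

	/* Collect per-request errors before tearing the chain down. */
	for (i = 0; i < nr_pages; ++i)
		if (acomp_request_err(acomp_ctx->reqs[i]))
			err = -EINVAL;

	acomp_reqchain_clear(acomp_ctx->reqs[0], &acomp_ctx->wait);

	/* On success, each reqs[i]->dlen bytes of buffers[i] would then be
	 * copied into the zpool (omitted here).
	 */
	return err;
}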
> > + } else {
> > +			err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
> > +
> > + if (unlikely(err)) {
> > + if (err == -ENOSPC)
> > + zswap_reject_compress_poor++;
> > + else
> > + zswap_reject_compress_fail++;
> > + return false;
> > + }
> > + j = i;
> > + nr_to_store = 1;
> > + }
> >
> > - acomp_ctx_put_unlock(acomp_ctx);
> > - return comp_ret == 0 && alloc_ret == 0;
> > +store_zpool:
> > + /*
> > + * All batch pages were successfully compressed.
> > + * Store the pages in zpool.
> > + */
> > + acomp_idx = -1;
> > + while (nr_to_store--) {
> > + unsigned long handle;
> > + char *buf;
> > +
> > + ++acomp_idx;
> > + prefetchw(entries[j]);
> > +			err = zpool_malloc(zpool, acomp_ctx->reqs[acomp_idx]->dlen, gfp, &handle);
> > +
> > + if (unlikely(err)) {
> > + if (err == -ENOSPC)
> > + zswap_reject_compress_poor++;
> > + else
> > + zswap_reject_alloc_fail++;
> > +
> > + return false;
> > + }
> > +
> > +			buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
> > +			memcpy(buf, acomp_ctx->buffers[acomp_idx], acomp_ctx->reqs[acomp_idx]->dlen);
> > + zpool_unmap_handle(zpool, handle);
> > +
> > + entries[j]->handle = handle;
> > +			entries[j]->length = acomp_ctx->reqs[acomp_idx]->dlen;
> > + ++j;
> > + }
> > + }
> > +
> > + return true;
> > }
> >
> >  static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> > @@ -1581,84 +1648,165 @@ static void shrink_worker(struct work_struct *w)
> > * main API
> > **********************************/
> >
> > -static bool zswap_store_page(struct page *page,
> > - struct obj_cgroup *objcg,
> > - struct zswap_pool *pool)
> > +/*
> > + * Store multiple pages in @folio, starting from the page at index @si up to
> > + * and including the page at index @ei.
>
> 'start' and 'end'?
Incorporated this.
>
> > + *
> > + * The error handling from all failure points is consolidated to the
> > + * "store_pages_failed" label, based on the initialization of the zswap entries'
> > + * handles to ERR_PTR(-EINVAL) at allocation time, and the fact that the
> > + * entry's handle is subsequently modified only upon a successful zpool_malloc()
> > + * after the page is compressed.
>
> This comment is not useful here. Instead comment on specific parts of
> the function where the intention of the code is not clear.
Done.
>
> > + */
> > +static bool zswap_store_pages(struct folio *folio,
> > + long si,
> > + long ei,
> > + struct obj_cgroup *objcg,
> > + struct zswap_pool *pool,
> > + struct crypto_acomp_ctx *acomp_ctx,
> > + bool batching)
> > {
> > - swp_entry_t page_swpentry = page_swap_entry(page);
> > - struct zswap_entry *entry, *old;
> > + struct zswap_entry *entries[ZSWAP_MAX_BATCH_SIZE];
> > + int node_id = folio_nid(folio);
> > + u8 i, from_i = 0, nr_pages = ei - si + 1;
> >
> > - /* allocate entry */
> > - entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
> > - if (!entry) {
> > - zswap_reject_kmemcache_fail++;
> > - return false;
> > + for (i = 0; i < nr_pages; ++i) {
> > + entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
> > +
> > + if (unlikely(!entries[i])) {
> > + zswap_reject_kmemcache_fail++;
> > + nr_pages = i;
> > + goto store_pages_failed;
> > + }
> > +
> > + entries[i]->handle = (unsigned long)ERR_PTR(-EINVAL);
> > }
> >
> > - if (!zswap_compress(page, entry, pool))
> > - goto compress_failed;
> > +	if (!zswap_batch_compress(folio, si, nr_pages, entries, pool, acomp_ctx, batching))
> > + goto store_pages_failed;
> >
> > - old = xa_store(swap_zswap_tree(page_swpentry),
> > - swp_offset(page_swpentry),
> > - entry, GFP_KERNEL);
> > - if (xa_is_err(old)) {
> > - int err = xa_err(old);
> > + for (i = 0; i < nr_pages; ++i) {
> > +		swp_entry_t page_swpentry = page_swap_entry(folio_page(folio, si + i));
> > + struct zswap_entry *old, *entry = entries[i];
> >
> > -		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> > - zswap_reject_alloc_fail++;
> > - goto store_failed;
> > - }
> > + old = xa_store(swap_zswap_tree(page_swpentry),
> > + swp_offset(page_swpentry),
> > + entry, GFP_KERNEL);
> > + if (unlikely(xa_is_err(old))) {
> > + int err = xa_err(old);
> >
> > - /*
> > - * We may have had an existing entry that became stale when
> > - * the folio was redirtied and now the new version is being
> > - * swapped out. Get rid of the old.
> > - */
> > - if (old)
> > - zswap_entry_free(old);
> > +			WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> > + zswap_reject_alloc_fail++;
> > + from_i = i;
>
> 'error_idx' or 'store_fail_idx'?
Done.
>
> > + goto store_pages_failed;
> > + }
> >
> > - /*
> > - * The entry is successfully compressed and stored in the tree, there is
> > - * no further possibility of failure. Grab refs to the pool and objcg,
> > - * charge zswap memory, and increment zswap_stored_pages.
> > - * The opposite actions will be performed by zswap_entry_free()
> > - * when the entry is removed from the tree.
> > - */
> > - zswap_pool_get(pool);
> > - if (objcg) {
> > - obj_cgroup_get(objcg);
> > - obj_cgroup_charge_zswap(objcg, entry->length);
> > - }
> > - atomic_long_inc(&zswap_stored_pages);
> > + /*
> > + * We may have had an existing entry that became stale when
> > + * the folio was redirtied and now the new version is being
> > + * swapped out. Get rid of the old.
> > + */
> > + if (unlikely(old))
> > + zswap_entry_free(old);
> >
> > - /*
> > - * We finish initializing the entry while it's already in xarray.
> > - * This is safe because:
> > - *
> > - * 1. Concurrent stores and invalidations are excluded by folio lock.
> > - *
> > - * 2. Writeback is excluded by the entry not being on the LRU yet.
> > - * The publishing order matters to prevent writeback from seeing
> > - * an incoherent entry.
> > - */
> > - entry->pool = pool;
> > - entry->swpentry = page_swpentry;
> > - entry->objcg = objcg;
> > - entry->referenced = true;
> > - if (entry->length) {
> > - INIT_LIST_HEAD(&entry->lru);
> > - zswap_lru_add(&zswap_list_lru, entry);
> > + /*
> > +		 * The entry is successfully compressed and stored in the tree, there is
> > +		 * no further possibility of failure. Grab refs to the pool and objcg,
> > +		 * charge zswap memory, and increment zswap_stored_pages.
> > +		 * The opposite actions will be performed by zswap_entry_free()
> > + * when the entry is removed from the tree.
> > + */
> > + zswap_pool_get(pool);
> > + if (objcg) {
> > + obj_cgroup_get(objcg);
> > + obj_cgroup_charge_zswap(objcg, entry->length);
> > + }
> > + atomic_long_inc(&zswap_stored_pages);
> > +
> > + /*
> > + * We finish initializing the entry while it's already in xarray.
> > + * This is safe because:
> > + *
> > +		 * 1. Concurrent stores and invalidations are excluded by folio lock.
> > +		 *
> > +		 * 2. Writeback is excluded by the entry not being on the LRU yet.
> > +		 *    The publishing order matters to prevent writeback from seeing
> > + * an incoherent entry.
> > + */
> > + entry->pool = pool;
> > + entry->swpentry = page_swpentry;
> > + entry->objcg = objcg;
> > + entry->referenced = true;
> > + if (likely(entry->length)) {
> > + INIT_LIST_HEAD(&entry->lru);
> > + zswap_lru_add(&zswap_list_lru, entry);
> > + }
> > }
> >
> > return true;
> >
> > -store_failed:
> > - zpool_free(pool->zpool, entry->handle);
> > -compress_failed:
> > - zswap_entry_cache_free(entry);
> > +store_pages_failed:
> > + for (i = from_i; i < nr_pages; ++i) {
> > + if (!IS_ERR_VALUE(entries[i]->handle))
> > + zpool_free(pool->zpool, entries[i]->handle);
> > +
> > + zswap_entry_cache_free(entries[i]);
> > + }
> > +
> > return false;
> > }
> >
> > +/*
> > + * Store all pages in a folio by calling zswap_batch_compress().
> > + * If the compressor supports batching, i.e., has multiple acomp requests,
> > + * the folio will be compressed in batches of "acomp_ctx->nr_reqs" using
> > + * request chaining.
> > + * If the compressor has only one acomp request, the folio will be compressed
> > + * in batches of ZSWAP_MAX_BATCH_SIZE pages, where each page in the batch is
> > + * compressed sequentially.
> > + */
> > +static bool zswap_store_folio(struct folio *folio,
> > + struct obj_cgroup *objcg,
> > + struct zswap_pool *pool)
> > +{
> > + long nr_pages = folio_nr_pages(folio);
> > + struct crypto_acomp_ctx *acomp_ctx;
> > + unsigned int batch_size;
> > + bool ret = true, batching;
> > + long si, ei;
> > +
> > + acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> > +
> > + batching = ((acomp_ctx->nr_reqs > 1) && (nr_pages > 1));
> > +
> > +	batch_size = batching ? acomp_ctx->nr_reqs : ZSWAP_MAX_BATCH_SIZE;
>
> acomp_ctx->nr_reqs is a function of the compressor. Can we store it in
> the pool to avoid having to hold the mutex here just for this?
>
> This would also save a tiny bit of memory as right now we have it
> replicated for all CPUs.
Great suggestion! I have incorporated this in v9.
>
> > +
> > + if (!batching)
> > + acomp_ctx_put_unlock(acomp_ctx);
>
> So in the batching case we keep the mutex held throughout, otherwise we
> hold it for every batch. Why?
>
> IIRC you mentioned that only holding it for every batch removes the zstd
> regressions, perhaps because holding the mutex for too long is a
> problem. But why not do the same for the batching case too?
>
> Also, the current behavior is that we only hold the mutex during
> compression. Now we changed this to do it per-folio or per-batch. Does
> the current behavior cause regressions? Ideally we'd like to keep the
> mutex held as little as possible.
>
> This is worth explaining in detail.
In v9, the mutex lock is held within zswap_compress() for the duration
of a batch. This was one of the optimizations that uniformly improved
performance for both non-batching and batching compressors.
With the latest code, other options that I tried, such as locking per folio
or locking per page, caused regressions for zstd.
>
> > +
> > + /* Store the folio in batches of "batch_size" pages. */
> > + for (si = 0, ei = min(si + batch_size - 1, nr_pages - 1);
> > + ((si < nr_pages) && (ei < nr_pages));
> > + si = ei + 1, ei = min(si + batch_size - 1, nr_pages - 1)) {
>
> This is too complicated. Can we do:
>
> for (start = 0; start < nr_pages; start += batch_size) {
> end = min(start + batch_size - 1, nr_pages - 1);
> ...
> }
>
> Also, we subtract 1 from ei here and add 1 in zswap_store_pages(), so
> perhaps it's better to make it exclusive:
>
> for (start = 0; start < nr_pages; start += batch_size) {
> end = min(start + batch_size, nr_pages);
> ...
> }
>
Done.
> > +
> > + if (!batching)
> > + acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> > +
> > +		if (!zswap_store_pages(folio, si, ei, objcg, pool, acomp_ctx, batching)) {
> > + ret = false;
> > + break;
> > + }
> > +
> > + if (!batching)
> > + acomp_ctx_put_unlock(acomp_ctx);
> > + }
> > +
> > + if (batching || !ret)
> > + acomp_ctx_put_unlock(acomp_ctx);
> > +
> > + return ret;
> > +}
> > +
> > bool zswap_store(struct folio *folio)
> > {
> > long nr_pages = folio_nr_pages(folio);
> > @@ -1667,7 +1815,6 @@ bool zswap_store(struct folio *folio)
> > struct mem_cgroup *memcg = NULL;
> > struct zswap_pool *pool;
> > bool ret = false;
> > - long index;
>
> These seems like an unrelated change.
I now have start/end, and reuse start in the error handling.
>
> >
> > VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > @@ -1701,12 +1848,8 @@ bool zswap_store(struct folio *folio)
> > mem_cgroup_put(memcg);
> > }
> >
> > - for (index = 0; index < nr_pages; ++index) {
> > - struct page *page = folio_page(folio, index);
> > -
> > - if (!zswap_store_page(page, objcg, pool))
> > - goto put_pool;
> > - }
> > + if (!zswap_store_folio(folio, objcg, pool))
> > + goto put_pool;
>
> If we move the locking out of zswap_store_folio() and simplify the loop
> there, I belive it will be simple enough for us to inline it here.
Yes indeed. Incorporated in v9.
>
> >
> > if (objcg)
> > count_objcg_events(objcg, ZSWPOUT, nr_pages);
> > @@ -1733,6 +1876,7 @@ bool zswap_store(struct folio *folio)
> > pgoff_t offset = swp_offset(swp);
> > struct zswap_entry *entry;
> > struct xarray *tree;
> > + long index;
> >
> > for (index = 0; index < nr_pages; ++index) {
> > 		tree = swap_zswap_tree(swp_entry(type, offset + index));
> > --
> > 2.27.0
> >
* RE: [PATCH v8 14/14] mm: zswap: Compress batching with request chaining in zswap_store() of large folios.
2025-03-06 21:20 ` Yosry Ahmed
@ 2025-04-30 21:07 ` Sridhar, Kanchana P
0 siblings, 0 replies; 44+ messages in thread
From: Sridhar, Kanchana P @ 2025-04-30 21:07 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Nhat Pham, lkp, linux-kernel, linux-mm, hannes, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
Accardi, Kristen C, llvm, oe-kbuild-all, Feghali, Wajdi K, Gopal,
Vinodh, Sridhar, Kanchana P
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Thursday, March 6, 2025 1:21 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Nhat Pham <nphamcs@gmail.com>; lkp <lkp@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; llvm@lists.linux.dev; oe-kbuild-
> all@lists.linux.dev; Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal,
> Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v8 14/14] mm: zswap: Compress batching with request
> chaining in zswap_store() of large folios.
>
> On Mon, Mar 03, 2025 at 09:34:04PM +0000, Sridhar, Kanchana P wrote:
> >
> > > -----Original Message-----
> > > From: Nhat Pham <nphamcs@gmail.com>
> > > Sent: Monday, March 3, 2025 10:22 AM
> > > To: lkp <lkp@intel.com>
> > > Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> > > kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> > > yosry.ahmed@linux.dev; chengming.zhou@linux.dev;
> > > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux-
> > > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > > <kristen.c.accardi@intel.com>; llvm@lists.linux.dev; oe-kbuild-
> > > all@lists.linux.dev; Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal,
> > > Vinodh <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v8 14/14] mm: zswap: Compress batching with
> request
> > > chaining in zswap_store() of large folios.
> > >
> > > On Mon, Mar 3, 2025 at 3:07 AM kernel test robot <lkp@intel.com>
> wrote:
> > > >
> > > > Hi Kanchana,
> > > >
> > > > kernel test robot noticed the following build errors:
> > > >
> > > > > 1166 prefetchw(entries[j]);
> > > > --
> > >
> > > Why are we doing this anyway? Does it have a notable performance
> > > difference? At the very least, leave a comment explaining why we're
> > > prefetching this (although the build error suggests that we have to
> > > remove it anyway).
> >
> > Hi Nhat,
> >
> > Yes, it does. The use of prefetchw reduces sys time by ~1.5% because
> > it minimizes cache-miss latency by moving the zswap entry to the cache
> > before it is written to.
> >
> > This is data with kernel compilation test, v8 without prefetchw and v8 as-is:
> >
> > --------------------------------------------------------------------------------
> > Kernel compile v8 without v8 v8 without v8
> > allmodconfig prefetchw prefetchw
> > 2M folios
> > --------------------------------------------------------------------------------
> > zswap compressor deflate-iaa deflate-iaa zstd zstd
> > --------------------------------------------------------------------------------
> > real_sec 732.89 735.63 768.53 758.21
> > user_sec 15,708.37 15,699.84 15,702.64 15,678.73
> > sys_sec 4,632.58 4,563.70 5,735.06 5,635.69
> > --------------------------------------------------------------------------------
> > Max_Res_Set_Size_KB     1,874,672    1,867,516    1,874,684    1,872,888
> > --------------------------------------------------------------------------------
> > memcg_high 0 0 0 0
> > memcg_swap_fail 0 0 0 0
> > zswpout 114,742,930 112,836,725 92,904,961 89,596,085
> > zswpin 41,184,897 39,983,793 31,018,149 29,163,932
> > pswpout 625 1,069 558 1,059
> > pswpin 599 1,056 540 1,051
> > thp_swpout 1 2 1 2
> > thp_swpout_fallback 10,967 10,195 6,918 6,141
> > pgmajfault 42,588,331 41,349,069 31,931,882 30,006,422
> > ZSWPOUT-2048kB 7,661 8,710 6,799 7,480
> > SWPOUT-2048kB 1 2 1 2
> > --------------------------------------------------------------------------------
> >
> >
> > Sure, I will add a comment, and also "#include <linux/prefetch.h>" in zswap.c
> > that will resolve the build error. This is similar to how these files handle prefetchw:
> > mm/vmscan.c, kernel/locking/qspinlock.c, include/asm-generic/xor.h, etc.
>
> Please also explicitly mention that the prefetch and likely/unlikely
> annotations prevent regressions with software compression like zstd, and
> generally improve the performance with the batching code by ~1.5%.
Yes, I have mentioned this in the comments and commit log, and likewise
for the mutex locking.
>
> >
> > Thanks,
> > Kanchana
> >
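
As a side note for readers, the build fix mentioned above amounts to including the prefetch header before calling prefetchw(). A minimal, hypothetical illustration of the usage pattern (the helper name is made up; the fields match the zswap_entry usage quoted above):

#include <linux/prefetch.h>

/* Hypothetical helper, for illustration only: pull the entry's cacheline
 * in for writing before populating its fields, as done with
 * prefetchw(entries[j]) in the batching loop above.
 */
static void zswap_fill_entry_sketch(struct zswap_entry *entry,
				    unsigned long handle, unsigned int dlen)
{
	prefetchw(entry);
	entry->handle = handle;
	entry->length = dlen;
}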
* RE: [PATCH v8 13/14] mm: zswap: Allocate pool batching resources if the compressor supports batching.
2025-03-06 20:00 ` Yosry Ahmed
@ 2025-04-30 21:15 ` Sridhar, Kanchana P
0 siblings, 0 replies; 44+ messages in thread
From: Sridhar, Kanchana P @ 2025-04-30 21:15 UTC (permalink / raw)
To: Yosry Ahmed
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
Accardi, Kristen C, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
Kanchana P
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Thursday, March 6, 2025 12:01 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v8 13/14] mm: zswap: Allocate pool batching resources if
> the compressor supports batching.
>
> On Mon, Mar 03, 2025 at 12:47:23AM -0800, Kanchana P Sridhar wrote:
> > This patch adds support for the per-CPU acomp_ctx to track multiple
> > compression/decompression requests and multiple compression destination
> > buffers. The zswap_cpu_comp_prepare() CPU onlining code will get the
> > maximum batch-size the compressor supports. If so, it will allocate the
> > necessary batching resources.
> >
> > However, zswap does not use more than one request yet. Follow-up patches
> > will actually utilize the multiple acomp_ctx requests/buffers for batch
> > compression/decompression of multiple pages.
> >
> > The newly added ZSWAP_MAX_BATCH_SIZE limits the amount of extra memory used
> > for batching. There is a small extra memory overhead of allocating the
> > "reqs" and "buffers" arrays for compressors that do not support batching.
>
> That's two pointers per-CPU (i.e. 16 bytes on x86_64), right? Please
> call that out in the commit log.
Yes, this is done.
Thanks,
Kanchana
>
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> > mm/zswap.c | 99 +++++++++++++++++++++++++++++++++++++-------------
> ----
> > 1 file changed, 69 insertions(+), 30 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index cff96df1df8b..fae59d6d5147 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -78,6 +78,16 @@ static bool zswap_pool_reached_full;
> >
> > #define ZSWAP_PARAM_UNSET ""
> >
> > +/*
> > + * For compression batching of large folios:
> > + * Maximum number of acomp compress requests that will be processed
> > + * in a batch, iff the zswap compressor supports batching.
> > + * This limit exists because we preallocate enough requests and buffers
> > + * in the per-cpu acomp_ctx accordingly. Hence, a higher limit means higher
> > + * memory usage.
> > + */
>
> That's too verbose. Let's do something like:
>
> /* Limit the batch size to limit per-CPU memory usage for reqs and buffers */
> #define ZSWAP_MAX_BATCH_SIZE 8U
Addressed in v9.
>
> > +#define ZSWAP_MAX_BATCH_SIZE 8U
> > +
> > static int zswap_setup(void);
> >
> > /* Enable/disable zswap */
> > @@ -143,8 +153,8 @@ bool zswap_never_enabled(void)
> >
> > struct crypto_acomp_ctx {
> > struct crypto_acomp *acomp;
> > - struct acomp_req *req;
> > - u8 *buffer;
> > + struct acomp_req **reqs;
> > + u8 **buffers;
> > u8 nr_reqs;
> > struct crypto_wait wait;
> > struct mutex mutex;
> > @@ -251,13 +261,22 @@ static void __zswap_pool_empty(struct percpu_ref *ref);
> > static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
> > {
> > if (!IS_ERR_OR_NULL(acomp_ctx) && acomp_ctx->nr_reqs) {
> > + u8 i;
> > +
> > + if (acomp_ctx->reqs) {
> > + for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> > + if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
>
> Hmm I just realized we check IS_ERR_OR_NULL() here for the requests, but
> only a NULL check in zswap_cpu_comp_prepare(). We also check
> IS_ERR_OR_NULL here for acomp, but only IS_ERR() in
> zswap_cpu_comp_prepare().
>
> This doesn't make sense. Would you be able to include a patch before
> this one to make these consistent? I can also send a follow up patch.
No worries, I have included this as patch 16 in the v9 series.
>
> > +					acomp_request_free(acomp_ctx->reqs[i]);
>
> Please add braces for the for loop here for readability, since the body
> has more than one line, even if it's technically not required.
Done.
>
> > + kfree(acomp_ctx->reqs);
> > + acomp_ctx->reqs = NULL;
> > + }
> >
> > - if (!IS_ERR_OR_NULL(acomp_ctx->req))
> > - acomp_request_free(acomp_ctx->req);
> > - acomp_ctx->req = NULL;
> > -
> > - kfree(acomp_ctx->buffer);
> > - acomp_ctx->buffer = NULL;
> > + if (acomp_ctx->buffers) {
> > + for (i = 0; i < acomp_ctx->nr_reqs; ++i)
> > + kfree(acomp_ctx->buffers[i]);
> > + kfree(acomp_ctx->buffers);
> > + acomp_ctx->buffers = NULL;
> > + }
> >
> > if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> > crypto_free_acomp(acomp_ctx->acomp);
> > @@ -271,6 +290,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> >  	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> >  	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> > int ret = -ENOMEM;
> > + u8 i;
> >
> > /*
> >  	 * Just to be even more fail-safe against changes in assumptions and/or
> > @@ -292,22 +312,41 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> > goto fail;
> > }
> >
> > - acomp_ctx->nr_reqs = 1;
> > + acomp_ctx->nr_reqs = min(ZSWAP_MAX_BATCH_SIZE,
> > +				 crypto_acomp_batch_size(acomp_ctx->acomp));
> >
> > - acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
> > - if (!acomp_ctx->req) {
> > - pr_err("could not alloc crypto acomp_request %s\n",
> > - pool->tfm_name);
> > - ret = -ENOMEM;
> > +	acomp_ctx->reqs = kcalloc_node(acomp_ctx->nr_reqs, sizeof(struct acomp_req *),
> > + GFP_KERNEL, cpu_to_node(cpu));
> > + if (!acomp_ctx->reqs)
> > goto fail;
> > +
> > + for (i = 0; i < acomp_ctx->nr_reqs; ++i) {
> > +		acomp_ctx->reqs[i] = acomp_request_alloc(acomp_ctx->acomp);
> > +		if (!acomp_ctx->reqs[i]) {
> > +			pr_err("could not alloc crypto acomp_request reqs[%d] %s\n",
> > + i, pool->tfm_name);
> > + goto fail;
> > + }
> > }
> >
> > -	acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
> > -	if (!acomp_ctx->buffer) {
> > -		ret = -ENOMEM;
> > +	acomp_ctx->buffers = kcalloc_node(acomp_ctx->nr_reqs, sizeof(u8 *),
> > +					  GFP_KERNEL, cpu_to_node(cpu));
> > + if (!acomp_ctx->buffers)
> > goto fail;
> > +
> > + for (i = 0; i < acomp_ctx->nr_reqs; ++i) {
> > +		acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
> > + cpu_to_node(cpu));
> > + if (!acomp_ctx->buffers[i])
> > + goto fail;
> > }
> >
> > + /*
> > + * The crypto_wait is used only in fully synchronous, i.e., with scomp
> > + * or non-poll mode of acomp, hence there is only one "wait" per
> > + * acomp_ctx, with callback set to reqs[0], under the assumption that
> > + * there is at least 1 request per acomp_ctx.
> > + */
>
> I am not sure I understand. Does this say that we assume that scomp or
> non-poll acomp will never use batching so having a single "wait" is
> fine?
>
> If so, this needs to be enforced at runtime or at least have a warning,
> and not just mentioned in a comment, in case batching support is ever
> added for these. Please clarify.
This was pertaining to the request chaining batching implementation and
is no longer relevant. I have deleted this comment in v9, in which
crypto_acomp_batch_[de]compress() do not take a "struct crypto_wait"
parameter.
>
> We should also probably merge the comments above crypto_init_wait() and
> acomp_request_set_callback() now.
Done, and clarified the use of the single "wait" in zswap calls to
crypto_acomp_[de]compress().
Thanks,
Kanchana
>
> > crypto_init_wait(&acomp_ctx->wait);
> >
> > /*
Thread overview: 44+ messages
2025-03-03 8:47 [PATCH v8 00/14] zswap IAA compress batching Kanchana P Sridhar
2025-03-03 8:47 ` [PATCH v8 01/14] crypto: acomp - Add synchronous/asynchronous acomp request chaining Kanchana P Sridhar
2025-03-03 8:47 ` [PATCH v8 02/14] crypto: acomp - New interfaces to facilitate batching support in acomp & drivers Kanchana P Sridhar
2025-03-03 8:47 ` [PATCH v8 03/14] crypto: iaa - Add an acomp_req flag CRYPTO_ACOMP_REQ_POLL to enable async mode Kanchana P Sridhar
2025-03-03 8:47 ` [PATCH v8 04/14] crypto: iaa - Implement batch compression/decompression with request chaining Kanchana P Sridhar
2025-03-03 8:47 ` [PATCH v8 05/14] crypto: iaa - Enable async mode and make it the default Kanchana P Sridhar
2025-03-03 8:47 ` [PATCH v8 06/14] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
2025-03-03 8:47 ` [PATCH v8 07/14] crypto: iaa - Re-organize the iaa_crypto driver code Kanchana P Sridhar
2025-03-03 8:47 ` [PATCH v8 08/14] crypto: iaa - Map IAA devices/wqs to cores based on packages instead of NUMA Kanchana P Sridhar
2025-03-03 8:47 ` [PATCH v8 09/14] crypto: iaa - Distribute compress jobs from all cores to all IAAs on a package Kanchana P Sridhar
2025-03-03 8:47 ` [PATCH v8 10/14] crypto: iaa - Descriptor allocation timeouts with mitigations in iaa_crypto Kanchana P Sridhar
2025-03-03 8:47 ` [PATCH v8 11/14] crypto: iaa - Fix for "deflate_generic_tfm" global being accessed without locks Kanchana P Sridhar
2025-03-03 8:47 ` [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage Kanchana P Sridhar
2025-03-06 19:35 ` Yosry Ahmed
2025-03-06 19:46 ` Yosry Ahmed
2025-03-07 0:01 ` Sridhar, Kanchana P
2025-03-07 19:30 ` Yosry Ahmed
2025-03-08 2:47 ` Sridhar, Kanchana P
2025-03-10 17:31 ` Yosry Ahmed
2025-03-17 21:15 ` Sridhar, Kanchana P
2025-03-17 23:25 ` Herbert Xu
2025-03-18 14:23 ` Yosry Ahmed
2025-03-18 17:38 ` Sridhar, Kanchana P
2025-03-18 19:05 ` Yosry Ahmed
2025-03-18 21:09 ` Sridhar, Kanchana P
2025-03-18 23:14 ` Yosry Ahmed
2025-03-18 23:37 ` Sridhar, Kanchana P
2025-04-20 21:01 ` Andrew Morton
2025-04-20 23:35 ` Nhat Pham
2025-04-21 3:31 ` [PATCH] crypto: scomp - Fix off-by-one bug when calculating last page Herbert Xu
2025-04-21 19:02 ` Marius Dinu
2025-03-06 19:42 ` [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource allocation/deletion and mutex lock usage Yosry Ahmed
2025-03-03 8:47 ` [PATCH v8 13/14] mm: zswap: Allocate pool batching resources if the compressor supports batching Kanchana P Sridhar
2025-03-06 20:00 ` Yosry Ahmed
2025-04-30 21:15 ` Sridhar, Kanchana P
2025-03-03 8:47 ` [PATCH v8 14/14] mm: zswap: Compress batching with request chaining in zswap_store() of large folios Kanchana P Sridhar
2025-03-03 11:07 ` kernel test robot
2025-03-03 18:21 ` Nhat Pham
2025-03-03 21:34 ` Sridhar, Kanchana P
2025-03-06 21:20 ` Yosry Ahmed
2025-04-30 21:07 ` Sridhar, Kanchana P
2025-03-03 11:07 ` kernel test robot
2025-03-06 21:17 ` Yosry Ahmed
2025-04-30 21:05 ` Sridhar, Kanchana P