* [PATCH v13 01/22] crypto: iaa - Reorganize the iaa_crypto driver code.
2025-11-04 9:12 [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
@ 2025-11-04 9:12 ` Kanchana P Sridhar
2025-11-04 9:12 ` [PATCH v13 02/22] crypto: iaa - New architecture for IAA device WQ comp/decomp usage & core mapping Kanchana P Sridhar
` (21 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Kanchana P Sridhar @ 2025-11-04 9:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, herbert, davem,
clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
vinicius.gomes
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch merely reorganizes the code in iaa_crypto_main.c so that
the functions are consolidated into logically related sub-sections of
code, without requiring forward declarations.
This is expected to make the code more maintainable and to make it
easier to replace functional layers and/or add new features.
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
drivers/crypto/intel/iaa/iaa_crypto_main.c | 677 +++++++++++----------
1 file changed, 350 insertions(+), 327 deletions(-)
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 23f585219fb4..760997eee8fe 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -24,6 +24,10 @@
#define IAA_ALG_PRIORITY 300
+/**************************************
+ * Driver internal global variables.
+ **************************************/
+
/* number of iaa instances probed */
static unsigned int nr_iaa;
static unsigned int nr_cpus;
@@ -36,54 +40,6 @@ static unsigned int cpus_per_iaa;
/* Per-cpu lookup table for balanced wqs */
static struct wq_table_entry __percpu *wq_table;
-static struct idxd_wq *wq_table_next_wq(int cpu)
-{
- struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
- if (++entry->cur_wq >= entry->n_wqs)
- entry->cur_wq = 0;
-
- if (!entry->wqs[entry->cur_wq])
- return NULL;
-
- pr_debug("%s: returning wq at idx %d (iaa wq %d.%d) from cpu %d\n", __func__,
- entry->cur_wq, entry->wqs[entry->cur_wq]->idxd->id,
- entry->wqs[entry->cur_wq]->id, cpu);
-
- return entry->wqs[entry->cur_wq];
-}
-
-static void wq_table_add(int cpu, struct idxd_wq *wq)
-{
- struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
- if (WARN_ON(entry->n_wqs == entry->max_wqs))
- return;
-
- entry->wqs[entry->n_wqs++] = wq;
-
- pr_debug("%s: added iaa wq %d.%d to idx %d of cpu %d\n", __func__,
- entry->wqs[entry->n_wqs - 1]->idxd->id,
- entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
-}
-
-static void wq_table_free_entry(int cpu)
-{
- struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
- kfree(entry->wqs);
- memset(entry, 0, sizeof(*entry));
-}
-
-static void wq_table_clear_entry(int cpu)
-{
- struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-
- entry->n_wqs = 0;
- entry->cur_wq = 0;
- memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
-}
-
LIST_HEAD(iaa_devices);
DEFINE_MUTEX(iaa_devices_lock);
@@ -91,36 +47,11 @@ DEFINE_MUTEX(iaa_devices_lock);
static bool iaa_crypto_enabled;
static bool iaa_crypto_registered;
+static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
+
/* Verify results of IAA compress or not */
static bool iaa_verify_compress = true;
-static ssize_t verify_compress_show(struct device_driver *driver, char *buf)
-{
- return sprintf(buf, "%d\n", iaa_verify_compress);
-}
-
-static ssize_t verify_compress_store(struct device_driver *driver,
- const char *buf, size_t count)
-{
- int ret = -EBUSY;
-
- mutex_lock(&iaa_devices_lock);
-
- if (iaa_crypto_enabled)
- goto out;
-
- ret = kstrtobool(buf, &iaa_verify_compress);
- if (ret)
- goto out;
-
- ret = count;
-out:
- mutex_unlock(&iaa_devices_lock);
-
- return ret;
-}
-static DRIVER_ATTR_RW(verify_compress);
-
/*
* The iaa crypto driver supports three 'sync' methods determining how
* compressions and decompressions are performed:
@@ -155,6 +86,37 @@ static bool async_mode;
/* Use interrupts */
static bool use_irq;
+/**************************************************
+ * Driver attributes along with get/set functions.
+ **************************************************/
+
+static ssize_t verify_compress_show(struct device_driver *driver, char *buf)
+{
+ return sprintf(buf, "%d\n", iaa_verify_compress);
+}
+
+static ssize_t verify_compress_store(struct device_driver *driver,
+ const char *buf, size_t count)
+{
+ int ret = -EBUSY;
+
+ mutex_lock(&iaa_devices_lock);
+
+ if (iaa_crypto_enabled)
+ goto out;
+
+ ret = kstrtobool(buf, &iaa_verify_compress);
+ if (ret)
+ goto out;
+
+ ret = count;
+out:
+ mutex_unlock(&iaa_devices_lock);
+
+ return ret;
+}
+static DRIVER_ATTR_RW(verify_compress);
+
/**
* set_iaa_sync_mode - Set IAA sync mode
* @name: The name of the sync mode
@@ -217,7 +179,9 @@ static ssize_t sync_mode_store(struct device_driver *driver,
}
static DRIVER_ATTR_RW(sync_mode);
-static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
+/****************************
+ * Driver compression modes.
+ ****************************/
static int find_empty_iaa_compression_mode(void)
{
@@ -409,11 +373,6 @@ static void free_device_compression_mode(struct iaa_device *iaa_device,
IDXD_OP_FLAG_WR_SRC2_AECS_COMP | \
IDXD_OP_FLAG_AECS_RW_TGLS)
-static int check_completion(struct device *dev,
- struct iax_completion_record *comp,
- bool compress,
- bool only_once);
-
static int init_device_compression_mode(struct iaa_device *iaa_device,
struct iaa_compression_mode *mode,
int idx, struct idxd_wq *wq)
@@ -500,6 +459,11 @@ static void remove_device_compression_modes(struct iaa_device *iaa_device)
}
}
+/***********************************************************
+ * Functions for use in crypto probe and remove interfaces:
+ * allocate/init/query/deallocate devices/wqs.
+ ***********************************************************/
+
static struct iaa_device *iaa_device_alloc(void)
{
struct iaa_device *iaa_device;
@@ -513,18 +477,6 @@ static struct iaa_device *iaa_device_alloc(void)
return iaa_device;
}
-static bool iaa_has_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
-{
- struct iaa_wq *iaa_wq;
-
- list_for_each_entry(iaa_wq, &iaa_device->wqs, list) {
- if (iaa_wq->wq == wq)
- return true;
- }
-
- return false;
-}
-
static struct iaa_device *add_iaa_device(struct idxd_device *idxd)
{
struct iaa_device *iaa_device;
@@ -560,6 +512,27 @@ static void del_iaa_device(struct iaa_device *iaa_device)
nr_iaa--;
}
+static void free_iaa_device(struct iaa_device *iaa_device)
+{
+ if (!iaa_device)
+ return;
+
+ remove_device_compression_modes(iaa_device);
+ kfree(iaa_device);
+}
+
+static bool iaa_has_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
+{
+ struct iaa_wq *iaa_wq;
+
+ list_for_each_entry(iaa_wq, &iaa_device->wqs, list) {
+ if (iaa_wq->wq == wq)
+ return true;
+ }
+
+ return false;
+}
+
static int add_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq,
struct iaa_wq **new_wq)
{
@@ -612,23 +585,23 @@ static void del_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
}
}
-static void clear_wq_table(void)
+static void remove_iaa_wq(struct idxd_wq *wq)
{
- int cpu;
-
- for (cpu = 0; cpu < nr_cpus; cpu++)
- wq_table_clear_entry(cpu);
-
- pr_debug("cleared wq table\n");
-}
+ struct iaa_device *iaa_device;
-static void free_iaa_device(struct iaa_device *iaa_device)
-{
- if (!iaa_device)
- return;
+ list_for_each_entry(iaa_device, &iaa_devices, list) {
+ if (iaa_has_wq(iaa_device, wq)) {
+ del_iaa_wq(iaa_device, wq);
+ break;
+ }
+ }
- remove_device_compression_modes(iaa_device);
- kfree(iaa_device);
+ if (nr_iaa) {
+ cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
+ if (!cpus_per_iaa)
+ cpus_per_iaa = 1;
+ } else
+ cpus_per_iaa = 1;
}
static void __free_iaa_wq(struct iaa_wq *iaa_wq)
@@ -655,6 +628,75 @@ static void free_iaa_wq(struct iaa_wq *iaa_wq)
idxd_wq_set_private(wq, NULL);
}
+static int save_iaa_wq(struct idxd_wq *wq)
+{
+ struct iaa_device *iaa_device, *found = NULL;
+ struct idxd_device *idxd;
+ struct pci_dev *pdev;
+ struct device *dev;
+ int ret = 0;
+
+ list_for_each_entry(iaa_device, &iaa_devices, list) {
+ if (iaa_device->idxd == wq->idxd) {
+ idxd = iaa_device->idxd;
+ pdev = idxd->pdev;
+ dev = &pdev->dev;
+ /*
+ * Check to see that we don't already have this wq.
+ * Shouldn't happen but we don't control probing.
+ */
+ if (iaa_has_wq(iaa_device, wq)) {
+ dev_dbg(dev, "same wq probed multiple times for iaa_device %p\n",
+ iaa_device);
+ goto out;
+ }
+
+ found = iaa_device;
+
+ ret = add_iaa_wq(iaa_device, wq, NULL);
+ if (ret)
+ goto out;
+
+ break;
+ }
+ }
+
+ if (!found) {
+ struct iaa_device *new_device;
+ struct iaa_wq *new_wq;
+
+ new_device = add_iaa_device(wq->idxd);
+ if (!new_device) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ ret = add_iaa_wq(new_device, wq, &new_wq);
+ if (ret) {
+ del_iaa_device(new_device);
+ free_iaa_device(new_device);
+ goto out;
+ }
+
+ ret = init_iaa_device(new_device, new_wq);
+ if (ret) {
+ del_iaa_wq(new_device, new_wq->wq);
+ del_iaa_device(new_device);
+ free_iaa_wq(new_wq);
+ goto out;
+ }
+ }
+
+ if (WARN_ON(nr_iaa == 0))
+ return -EINVAL;
+
+ cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
+ if (!cpus_per_iaa)
+ cpus_per_iaa = 1;
+out:
+ return 0;
+}
+
static int iaa_wq_get(struct idxd_wq *wq)
{
struct idxd_device *idxd = wq->idxd;
@@ -702,6 +744,37 @@ static int iaa_wq_put(struct idxd_wq *wq)
return ret;
}
+/***************************************************************
+ * Mapping IAA devices and wqs to cores with per-cpu wq_tables.
+ ***************************************************************/
+
+static void wq_table_free_entry(int cpu)
+{
+ struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+ kfree(entry->wqs);
+ memset(entry, 0, sizeof(*entry));
+}
+
+static void wq_table_clear_entry(int cpu)
+{
+ struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+ entry->n_wqs = 0;
+ entry->cur_wq = 0;
+ memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
+}
+
+static void clear_wq_table(void)
+{
+ int cpu;
+
+ for (cpu = 0; cpu < nr_cpus; cpu++)
+ wq_table_clear_entry(cpu);
+
+ pr_debug("cleared wq table\n");
+}
+
static void free_wq_table(void)
{
int cpu;
@@ -739,92 +812,18 @@ static int alloc_wq_table(int max_wqs)
return 0;
}
-static int save_iaa_wq(struct idxd_wq *wq)
+static void wq_table_add(int cpu, struct idxd_wq *wq)
{
- struct iaa_device *iaa_device, *found = NULL;
- struct idxd_device *idxd;
- struct pci_dev *pdev;
- struct device *dev;
- int ret = 0;
-
- list_for_each_entry(iaa_device, &iaa_devices, list) {
- if (iaa_device->idxd == wq->idxd) {
- idxd = iaa_device->idxd;
- pdev = idxd->pdev;
- dev = &pdev->dev;
- /*
- * Check to see that we don't already have this wq.
- * Shouldn't happen but we don't control probing.
- */
- if (iaa_has_wq(iaa_device, wq)) {
- dev_dbg(dev, "same wq probed multiple times for iaa_device %p\n",
- iaa_device);
- goto out;
- }
-
- found = iaa_device;
-
- ret = add_iaa_wq(iaa_device, wq, NULL);
- if (ret)
- goto out;
-
- break;
- }
- }
-
- if (!found) {
- struct iaa_device *new_device;
- struct iaa_wq *new_wq;
-
- new_device = add_iaa_device(wq->idxd);
- if (!new_device) {
- ret = -ENOMEM;
- goto out;
- }
-
- ret = add_iaa_wq(new_device, wq, &new_wq);
- if (ret) {
- del_iaa_device(new_device);
- free_iaa_device(new_device);
- goto out;
- }
-
- ret = init_iaa_device(new_device, new_wq);
- if (ret) {
- del_iaa_wq(new_device, new_wq->wq);
- del_iaa_device(new_device);
- free_iaa_wq(new_wq);
- goto out;
- }
- }
-
- if (WARN_ON(nr_iaa == 0))
- return -EINVAL;
-
- cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
- if (!cpus_per_iaa)
- cpus_per_iaa = 1;
-out:
- return 0;
-}
+ struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
-static void remove_iaa_wq(struct idxd_wq *wq)
-{
- struct iaa_device *iaa_device;
+ if (WARN_ON(entry->n_wqs == entry->max_wqs))
+ return;
- list_for_each_entry(iaa_device, &iaa_devices, list) {
- if (iaa_has_wq(iaa_device, wq)) {
- del_iaa_wq(iaa_device, wq);
- break;
- }
- }
+ entry->wqs[entry->n_wqs++] = wq;
- if (nr_iaa) {
- cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
- if (!cpus_per_iaa)
- cpus_per_iaa = 1;
- } else
- cpus_per_iaa = 1;
+ pr_debug("%s: added iaa wq %d.%d to idx %d of cpu %d\n", __func__,
+ entry->wqs[entry->n_wqs - 1]->idxd->id,
+ entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
}
static int wq_table_add_wqs(int iaa, int cpu)
@@ -930,6 +929,44 @@ static void rebalance_wq_table(void)
pr_debug("could not add any wqs for iaa %d to cpu %d!\n", iaa, cpu);
}
+/***************************************************************
+ * Assign work-queues for driver ops using per-cpu wq_tables.
+ ***************************************************************/
+
+static struct idxd_wq *wq_table_next_wq(int cpu)
+{
+ struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+
+ if (++entry->cur_wq >= entry->n_wqs)
+ entry->cur_wq = 0;
+
+ if (!entry->wqs[entry->cur_wq])
+ return NULL;
+
+ pr_debug("%s: returning wq at idx %d (iaa wq %d.%d) from cpu %d\n", __func__,
+ entry->cur_wq, entry->wqs[entry->cur_wq]->idxd->id,
+ entry->wqs[entry->cur_wq]->id, cpu);
+
+ return entry->wqs[entry->cur_wq];
+}
+
+/*************************************************
+ * Core iaa_crypto compress/decompress functions.
+ *************************************************/
+
+static int deflate_generic_decompress(struct acomp_req *req)
+{
+ ACOMP_FBREQ_ON_STACK(fbreq, req);
+ int ret;
+
+ ret = crypto_acomp_decompress(fbreq);
+ req->dlen = fbreq->dlen;
+
+ update_total_sw_decomp_calls();
+
+ return ret;
+}
+
static inline int check_completion(struct device *dev,
struct iax_completion_record *comp,
bool compress,
@@ -990,27 +1027,132 @@ static inline int check_completion(struct device *dev,
return ret;
}
-static int deflate_generic_decompress(struct acomp_req *req)
+static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
+ struct acomp_req *req,
+ dma_addr_t *src_addr, dma_addr_t *dst_addr)
{
- ACOMP_FBREQ_ON_STACK(fbreq, req);
- int ret;
+ int ret = 0;
+ int nr_sgs;
- ret = crypto_acomp_decompress(fbreq);
- req->dlen = fbreq->dlen;
+ dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
+ dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
- update_total_sw_decomp_calls();
+ nr_sgs = dma_map_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
+ if (nr_sgs <= 0 || nr_sgs > 1) {
+ dev_dbg(dev, "verify: couldn't map src sg for iaa device %d,"
+ " wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
+ iaa_wq->wq->id, ret);
+ ret = -EIO;
+ goto out;
+ }
+ *src_addr = sg_dma_address(req->src);
+ dev_dbg(dev, "verify: dma_map_sg, src_addr %llx, nr_sgs %d, req->src %p,"
+ " req->slen %d, sg_dma_len(sg) %d\n", *src_addr, nr_sgs,
+ req->src, req->slen, sg_dma_len(req->src));
+ nr_sgs = dma_map_sg(dev, req->dst, sg_nents(req->dst), DMA_TO_DEVICE);
+ if (nr_sgs <= 0 || nr_sgs > 1) {
+ dev_dbg(dev, "verify: couldn't map dst sg for iaa device %d,"
+ " wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
+ iaa_wq->wq->id, ret);
+ ret = -EIO;
+ dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
+ goto out;
+ }
+ *dst_addr = sg_dma_address(req->dst);
+ dev_dbg(dev, "verify: dma_map_sg, dst_addr %llx, nr_sgs %d, req->dst %p,"
+ " req->dlen %d, sg_dma_len(sg) %d\n", *dst_addr, nr_sgs,
+ req->dst, req->dlen, sg_dma_len(req->dst));
+out:
return ret;
}
-static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
- struct acomp_req *req,
- dma_addr_t *src_addr, dma_addr_t *dst_addr);
-
static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
struct idxd_wq *wq,
dma_addr_t src_addr, unsigned int slen,
- dma_addr_t dst_addr, unsigned int *dlen);
+ dma_addr_t dst_addr, unsigned int *dlen)
+{
+ struct iaa_device_compression_mode *active_compression_mode;
+ struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
+ u32 *compression_crc = acomp_request_ctx(req);
+ struct iaa_device *iaa_device;
+ struct idxd_desc *idxd_desc;
+ struct iax_hw_desc *desc;
+ struct idxd_device *idxd;
+ struct iaa_wq *iaa_wq;
+ struct pci_dev *pdev;
+ struct device *dev;
+ int ret = 0;
+
+ iaa_wq = idxd_wq_get_private(wq);
+ iaa_device = iaa_wq->iaa_device;
+ idxd = iaa_device->idxd;
+ pdev = idxd->pdev;
+ dev = &pdev->dev;
+
+ active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
+
+ idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
+ if (IS_ERR(idxd_desc)) {
+ dev_dbg(dev, "idxd descriptor allocation failed\n");
+ dev_dbg(dev, "iaa compress failed: ret=%ld\n",
+ PTR_ERR(idxd_desc));
+ return PTR_ERR(idxd_desc);
+ }
+ desc = idxd_desc->iax_hw;
+
+ /* Verify (optional) - decompress and check crc, suppress dest write */
+
+ desc->flags = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CC;
+ desc->opcode = IAX_OPCODE_DECOMPRESS;
+ desc->decompr_flags = IAA_DECOMP_FLAGS | IAA_DECOMP_SUPPRESS_OUTPUT;
+ desc->priv = 0;
+
+ desc->src1_addr = (u64)dst_addr;
+ desc->src1_size = *dlen;
+ desc->dst_addr = (u64)src_addr;
+ desc->max_dst_size = slen;
+ desc->completion_addr = idxd_desc->compl_dma;
+
+ dev_dbg(dev, "(verify) compression mode %s,"
+ " desc->src1_addr %llx, desc->src1_size %d,"
+ " desc->dst_addr %llx, desc->max_dst_size %d,"
+ " desc->src2_addr %llx, desc->src2_size %d\n",
+ active_compression_mode->name,
+ desc->src1_addr, desc->src1_size, desc->dst_addr,
+ desc->max_dst_size, desc->src2_addr, desc->src2_size);
+
+ ret = idxd_submit_desc(wq, idxd_desc);
+ if (ret) {
+ dev_dbg(dev, "submit_desc (verify) failed ret=%d\n", ret);
+ goto err;
+ }
+
+ ret = check_completion(dev, idxd_desc->iax_completion, false, false);
+ if (ret) {
+ dev_dbg(dev, "(verify) check_completion failed ret=%d\n", ret);
+ goto err;
+ }
+
+ if (*compression_crc != idxd_desc->iax_completion->crc) {
+ ret = -EINVAL;
+ dev_dbg(dev, "(verify) iaa comp/decomp crc mismatch:"
+ " comp=0x%x, decomp=0x%x\n", *compression_crc,
+ idxd_desc->iax_completion->crc);
+ print_hex_dump(KERN_INFO, "cmp-rec: ", DUMP_PREFIX_OFFSET,
+ 8, 1, idxd_desc->iax_completion, 64, 0);
+ goto err;
+ }
+
+ idxd_free_desc(wq, idxd_desc);
+out:
+ return ret;
+err:
+ idxd_free_desc(wq, idxd_desc);
+ dev_dbg(dev, "iaa compress failed: ret=%d\n", ret);
+
+ goto out;
+}
static void iaa_desc_complete(struct idxd_desc *idxd_desc,
enum idxd_complete_type comp_type,
@@ -1226,133 +1368,6 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
goto out;
}
-static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
- struct acomp_req *req,
- dma_addr_t *src_addr, dma_addr_t *dst_addr)
-{
- int ret = 0;
- int nr_sgs;
-
- dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
- dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
-
- nr_sgs = dma_map_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
- if (nr_sgs <= 0 || nr_sgs > 1) {
- dev_dbg(dev, "verify: couldn't map src sg for iaa device %d,"
- " wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
- iaa_wq->wq->id, ret);
- ret = -EIO;
- goto out;
- }
- *src_addr = sg_dma_address(req->src);
- dev_dbg(dev, "verify: dma_map_sg, src_addr %llx, nr_sgs %d, req->src %p,"
- " req->slen %d, sg_dma_len(sg) %d\n", *src_addr, nr_sgs,
- req->src, req->slen, sg_dma_len(req->src));
-
- nr_sgs = dma_map_sg(dev, req->dst, sg_nents(req->dst), DMA_TO_DEVICE);
- if (nr_sgs <= 0 || nr_sgs > 1) {
- dev_dbg(dev, "verify: couldn't map dst sg for iaa device %d,"
- " wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
- iaa_wq->wq->id, ret);
- ret = -EIO;
- dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
- goto out;
- }
- *dst_addr = sg_dma_address(req->dst);
- dev_dbg(dev, "verify: dma_map_sg, dst_addr %llx, nr_sgs %d, req->dst %p,"
- " req->dlen %d, sg_dma_len(sg) %d\n", *dst_addr, nr_sgs,
- req->dst, req->dlen, sg_dma_len(req->dst));
-out:
- return ret;
-}
-
-static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
- struct idxd_wq *wq,
- dma_addr_t src_addr, unsigned int slen,
- dma_addr_t dst_addr, unsigned int *dlen)
-{
- struct iaa_device_compression_mode *active_compression_mode;
- struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
- u32 *compression_crc = acomp_request_ctx(req);
- struct iaa_device *iaa_device;
- struct idxd_desc *idxd_desc;
- struct iax_hw_desc *desc;
- struct idxd_device *idxd;
- struct iaa_wq *iaa_wq;
- struct pci_dev *pdev;
- struct device *dev;
- int ret = 0;
-
- iaa_wq = idxd_wq_get_private(wq);
- iaa_device = iaa_wq->iaa_device;
- idxd = iaa_device->idxd;
- pdev = idxd->pdev;
- dev = &pdev->dev;
-
- active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
-
- idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
- if (IS_ERR(idxd_desc)) {
- dev_dbg(dev, "idxd descriptor allocation failed\n");
- dev_dbg(dev, "iaa compress failed: ret=%ld\n",
- PTR_ERR(idxd_desc));
- return PTR_ERR(idxd_desc);
- }
- desc = idxd_desc->iax_hw;
-
- /* Verify (optional) - decompress and check crc, suppress dest write */
-
- desc->flags = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CC;
- desc->opcode = IAX_OPCODE_DECOMPRESS;
- desc->decompr_flags = IAA_DECOMP_FLAGS | IAA_DECOMP_SUPPRESS_OUTPUT;
- desc->priv = 0;
-
- desc->src1_addr = (u64)dst_addr;
- desc->src1_size = *dlen;
- desc->dst_addr = (u64)src_addr;
- desc->max_dst_size = slen;
- desc->completion_addr = idxd_desc->compl_dma;
-
- dev_dbg(dev, "(verify) compression mode %s,"
- " desc->src1_addr %llx, desc->src1_size %d,"
- " desc->dst_addr %llx, desc->max_dst_size %d,"
- " desc->src2_addr %llx, desc->src2_size %d\n",
- active_compression_mode->name,
- desc->src1_addr, desc->src1_size, desc->dst_addr,
- desc->max_dst_size, desc->src2_addr, desc->src2_size);
-
- ret = idxd_submit_desc(wq, idxd_desc);
- if (ret) {
- dev_dbg(dev, "submit_desc (verify) failed ret=%d\n", ret);
- goto err;
- }
-
- ret = check_completion(dev, idxd_desc->iax_completion, false, false);
- if (ret) {
- dev_dbg(dev, "(verify) check_completion failed ret=%d\n", ret);
- goto err;
- }
-
- if (*compression_crc != idxd_desc->iax_completion->crc) {
- ret = -EINVAL;
- dev_dbg(dev, "(verify) iaa comp/decomp crc mismatch:"
- " comp=0x%x, decomp=0x%x\n", *compression_crc,
- idxd_desc->iax_completion->crc);
- print_hex_dump(KERN_INFO, "cmp-rec: ", DUMP_PREFIX_OFFSET,
- 8, 1, idxd_desc->iax_completion, 64, 0);
- goto err;
- }
-
- idxd_free_desc(wq, idxd_desc);
-out:
- return ret;
-err:
- idxd_free_desc(wq, idxd_desc);
- dev_dbg(dev, "iaa compress failed: ret=%d\n", ret);
-
- goto out;
-}
-
static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
struct idxd_wq *wq,
dma_addr_t src_addr, unsigned int slen,
@@ -1662,6 +1677,10 @@ static void compression_ctx_init(struct iaa_compression_ctx *ctx)
ctx->use_irq = use_irq;
}
+/*********************************************
+ * Interfaces to crypto_alg and crypto_acomp.
+ *********************************************/
+
static int iaa_comp_init_fixed(struct crypto_acomp *acomp_tfm)
{
struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
@@ -1864,6 +1883,10 @@ static struct idxd_device_driver iaa_crypto_driver = {
.desc_complete = iaa_desc_complete,
};
+/********************
+ * Module init/exit.
+ ********************/
+
static int __init iaa_crypto_init_module(void)
{
int ret = 0;
--
2.27.0
* [PATCH v13 02/22] crypto: iaa - New architecture for IAA device WQ comp/decomp usage & core mapping.
2025-11-04 9:12 [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
2025-11-04 9:12 ` [PATCH v13 01/22] crypto: iaa - Reorganize the iaa_crypto driver code Kanchana P Sridhar
@ 2025-11-04 9:12 ` Kanchana P Sridhar
2025-11-04 9:12 ` [PATCH v13 03/22] crypto: iaa - Simplify, consistency of function parameters, minor stats bug fix Kanchana P Sridhar
` (20 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Kanchana P Sridhar @ 2025-11-04 9:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, herbert, davem,
clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
vinicius.gomes
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch re-architects the iaa_crypto driver in three main aspects, to
make it more robust, stable, generic and functionally versatile enough to
support zswap users on platforms with different numbers of cores/IAAs
running workloads with different swap characteristics, and, most
importantly, to improve performance.
Summary of latency improvement for large folio compression:
===========================================================
When measured in zswap using a simple madvise workload, where 64K
Folios are stored using IAA batch compressions, this is how the
per-page compress latency changes just by setting the
"distribute_comps" driver parameter to "1":
--------------------------------------------------------------
zswap compressor: deflate-iaa
64K Folios: zswap_store() latency normalized to per-page
--------------------------------------------------------------
                                       p50 (ns)    p99 (ns)
--------------------------------------------------------------
Sequential store                          3,503       3,695
Batch compress, distribute_comps=0        1,356       1,384
Batch compress, distribute_comps=1          706         763
--------------------------------------------------------------
The rearchitecting aspects are:
A) Map IAA devices/wqs to cores based on packages instead of NUMA.
B) The WQ rebalancing algorithm that is invoked as WQs are
discovered/deleted has been made very general and flexible so that
the user can control exactly how IAA WQs are used, for optimizing
performance.
C) Additionally, the "iaa_crypto_enabled" driver global has been
converted to an atomic, and is used to synchronize dynamic/asynchronous
WQ discovery/deletion with the fundamental routines
comp_wq_table_next_wq() and decomp_wq_table_next_wq() that are queried
by compress/decompress job submissions (a minimal sketch of this gating
follows below).
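For reference, a minimal sketch of this gating (illustrative only, not
the driver's actual comp_wq_table_next_wq(); it reuses the round-robin
lookup of the existing wq_table_next_wq() together with the
@iaa_crypto_enabled and @cpu_comp_wqs names introduced by this patch,
and omits the package-level distribution logic):

static struct idxd_wq *example_comp_wq_next(int cpu)
{
        struct wq_table_entry *entry;

        /* Bail out while the per-CPU wq tables are being (re)built. */
        if (!atomic_read(&iaa_crypto_enabled))
                return NULL;

        entry = per_cpu_ptr(cpu_comp_wqs, cpu);
        if (!entry->n_wqs)
                return NULL;

        /* Round-robin over the comp wqs mapped to this cpu. */
        if (++entry->cur_wq >= entry->n_wqs)
                entry->cur_wq = 0;

        return entry->wqs[entry->cur_wq];
}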
Description/motivation for (A):
===============================
This patch modifies the algorithm for mapping available IAA devices and
WQs to cores based on packages instead of NUMA nodes. This leads to a
more realistic mapping of IAA devices as compression/decompression
resources for a package, rather than for a NUMA node. This also resolves
problems that were observed during internal validation on Intel Granite
Rapids platforms with many more NUMA nodes than packages: for such
cases, the earlier NUMA based allocation caused some IAAs to be
over-subscribed and some to not be utilized at all.
As a result of this change from NUMA to packages, some of the core
functions used by the iaa_crypto driver's "probe" and "remove" API
have been re-written. The new infrastructure maintains a static mapping
of wqs per IAA device, in the "struct iaa_device" itself. The earlier
implementation would allocate memory per-cpu for this data, which never
changes once the IAA devices/wqs have been initialized.
Two main outcomes from this new iaa_crypto driver infrastructure are:
1) It resolves "task blocked for more than x seconds" errors observed
during internal validation on Intel systems with the earlier NUMA node
based mappings; these were root-caused to the non-optimal IAA-to-core
mappings described earlier.
2) It results in a NUM_THREADS factor reduction in the memory footprint
cost of initializing IAA devices/wqs, due to eliminating the per-cpu
copies of each IAA device's wqs. On a 384-core Intel Granite Rapids
server with 8 IAA devices, this saves 140MiB.
An auxiliary change included in this patch is that the driver's "nr_iaa",
"nr_iaa_per_package" and "cpus_per_iaa" global variables are made
atomic, because iaa_crypto_probe() and iaa_crypto_remove() change the
values of these variables asynchronously and concurrently as wqs get
added/deleted and rebalance_wq_table() is called. This change allows the
rebalance_wq_table() code to see consistent values of the number of IAA
devices.
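To illustrate (A), here is a simplified sketch of the package-based
core-to-IAA mapping (the cpu_to_iaa() helper added by this patch also
handles the nr_iaa == 0 and out-of-range cases, which are omitted here):

static int example_cpu_to_iaa(int cpu)
{
        int package_id = topology_logical_package_id(cpu);
        int base_iaa = package_id * atomic_read(&nr_iaa_per_package);

        /*
         * e.g. on a 2-package Sapphire Rapids server with 4 IAAs and
         * 112 logical cpus per package:
         * cpus_per_iaa = (2 * 112) / 8 = 28, i.e., each IAA is the
         * "mapped" device for 28 logical cpus on its package.
         */
        return base_iaa + (cpu % nr_cpus_per_package) /
                          atomic_read(&cpus_per_iaa);
}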
Description/motivation for (B):
===============================
This builds upon the package-based driver infrastructure to provide
more flexibility in using particular WQs for compress-only or
decompress-only jobs. It also introduces the notion of using all the IAA
devices on a package as resources that are shared by all cores on the
package: this significantly improves batching latency (batching is added
in subsequent patches) and compress/decompress throughput. sysfs driver
parameters provide configurability of these features.
Two main concepts are introduced as part of the rebalancing changes:
1) An IAA WQ can be used for specific ops, which determines a WQ "type"
that the iaa_crypto driver uses when submitting compress/decompress jobs:
- compress only
- decompress only
- generic, i.e, for both compresses and decompresses
The WQ type is decided based on the number of WQs configured for a
given IAA device, and the new "g_comp_wqs_per_iaa" driver parameter.
2) An IAA WQ can be mapped to cores using either of the following
balancing techniques:
a) Shared by all cores on a package. The iaa_crypto driver will
dispatch compress/decompress jobs to all WQs of the same type,
across all IAA devices on the package:
- IAA compress jobs will be distributed to all same-package IAA
compress-only/generic WQs.
- IAA decompress jobs will be distributed to all same-package IAA
decompress-only/generic WQs.
b) Handles compress/decompress jobs only from "mapped cores", i.e.,
the cores obtained by evenly dividing each package's cores among
that package's IAAs.
Server setups that are moderately to highly contended can benefit from
(2.a). When the mix of workloads running on a system needs high compress
throughput and has relatively lower decompress activity, (2.b) might be
the better choice for decompressions.
These approaches can be accomplished with the following new iaa_crypto
driver parameters. These parameters are global settings and will apply
to all IAAs on a package, interpreted in the context of the number of
WQs configured per IAA device.
g_comp_wqs_per_iaa:
===================
Number of compress-only WQs. The default is 1, but is applicable only
if the device has more than 1 WQ. If the device has exactly 1 WQ
configured, "g_comp_wqs_per_iaa" is a don't care.
If the IAA device has more than "g_comp_wqs_per_iaa" WQs configured,
the last "g_comp_wqs_per_iaa" number of WQs will be considered as
"compress only". The remaining WQs will be considered as
"decompress only".
If the device has less than or equal to "g_comp_wqs_per_iaa" WQs, all
the device's WQs will be considered "generic", i.e., the driver will
submit compress and decompress jobs to all the WQs configured for the
device.
For example, if an IAA "X" has 2 WQs, this will set up 1 decompress WQ and
1 compress WQ:
echo 1 > /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa
wqX.0: decompress jobs only.
wqX.1: compress jobs only.
This setting would typically benefit workloads that see a high
level of compress and decompress activity.
If an IAA has 1 WQ, that WQ will be considered "generic": the driver
will submit compress and decompress jobs to the same WQ (this is
independent of the "g_comp_wqs_per_iaa" setting):
wqX.0: compress and decompress jobs.
This would typically benefit workloads that see significant cold
memory being reclaimed, and consequently, high swapout and low swapin
activity.
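The WQ-type rule above can be summarized with this sketch (illustrative
only; "wq_idx" and "n_wqs" are hypothetical names for a WQ's index
within its IAA device and the number of WQs configured for that device,
not names used by the driver):

static bool example_wq_is_comp_only(unsigned int wq_idx,
                                    unsigned int n_wqs)
{
        /* <= g_comp_wqs_per_iaa WQs: all WQs are generic. */
        if (n_wqs <= g_comp_wqs_per_iaa)
                return false;

        /* Otherwise, the last g_comp_wqs_per_iaa WQs are compress-only. */
        return wq_idx >= n_wqs - g_comp_wqs_per_iaa;
}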
distribute_comps:
=================
Distribute compressions to all IAAs on package (default is Y).
Assuming the WQ type has been established as
compress-only/decompress-only/generic, this setting will determine if
the driver will distribute compress jobs to all IAAs on a package
(default behavior) or not.
If this is turned off, the driver will dispatch compress jobs to a
given IAA "compression enabled" WQ only from cores that are mapped to
that IAA using an algorithm that evenly distributes IAAs per package
to cores per package. For example, on a Sapphire Rapids server with
56 physical cores and 4 IAAs per package, with Hyperthreading, 28
logical cores will be assigned to each IAA. With the
"distribute_comps" driver parameter turned off, the driver will send
each core's compress jobs only to its assigned IAA device.
Enabling "distribute_comps" would typically benefit workloads in
terms of batch compress latency and throughput.
distribute_decomps:
===================
Distribute decompressions to all IAAs on package (default is N).
Assuming the WQ type has been established as
compress-only/decompress-only/generic, this setting will determine if
the driver will distribute decompress jobs to all IAAs on a package
or not (default behavior).
We recommend leaving this parameter at its default setting of "N".
Enabling "distribute_decomps = Y" can be evaluated for workloads that
are sensitive to p99 decompress latency, and see a high level of
compress and decompress activity (e.g., warm memory reclaim/swapin).
Recommended settings for best compress/decompress latency, throughput
and hence memory savings for a moderately contended server, are:
2 WQs per IAA
g_comp_wqs_per_iaa = 1 (separate WQ for comps/decomps per IAA)
distribute_decomps = N
distribute_comps = Y
For systems that have one IAA device, the distribute_[de]comps settings
will be a no-op. Even for such systems, as long as considerable swapout
and swapin activity is expected, we recommend setting up 2 WQs
for the IAA, one each for compressions/decompressions. If swapouts are
significantly more than swapins, 1 WQ would be a better configuration,
as mentioned earlier.
Examples:
=========
For a Sapphire Rapids server with 2 packages, 56 cores and 4 IAAs per
package, each IAA has 2 WQs, and these settings are in effect:
echo 1 > /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa
echo 1 > /sys/bus/dsa/drivers/crypto/distribute_comps
echo 0 > /sys/bus/dsa/drivers/crypto/distribute_decomps
wqX.0: decompress jobs only.
wqX.1: compress jobs only.
Compress jobs from all cores on package-0 will be distributed in a
round-robin manner to [iax1, iax3, iax5, iax7]'s wqX.1, to maximize
compression throughput/latency/memory savings:
wq1.1
wq3.1
wq5.1
wq7.1
Likewise, compress jobs from all cores on package-1 will be
distributed in a round-robin manner to [iax9, iax11, iax13, iax15]'s
wqX.1, to maximize compression throughput/latency/memory savings for
workloads running on package-1:
wq9.1
wq11.1
wq13.1
wq15.1
Decompress jobs will be submitted from mapped logical cores only, as
follows:
package-0:
CPU   0-13,112-125   14-27,126-139   28-41,140-153   42-55,154-167
IAA:  iax1            iax3            iax5            iax7
WQ:   wq1.0           wq3.0           wq5.0           wq7.0
package-1:
CPU   56-69,168-181   70-83,182-195   84-97,196-209   98-111,210-223
IAA:  iax9             iax11           iax13           iax15
WQ:   wq9.0            wq11.0          wq13.0          wq15.0
IAA WQs can be configured using higher level scripts as described in
Documentation/driver-api/crypto/iaa/iaa-crypto.rst. This documentation
has been updated for the above new sysfs parameters.
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
.../driver-api/crypto/iaa/iaa-crypto.rst | 136 +++
drivers/crypto/intel/iaa/iaa_crypto.h | 18 +-
drivers/crypto/intel/iaa/iaa_crypto_main.c | 905 ++++++++++++++----
3 files changed, 884 insertions(+), 175 deletions(-)
diff --git a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
index f815d4fd8372..0ff4ec603b43 100644
--- a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
+++ b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
@@ -290,6 +290,142 @@ The available attributes are:
'sync' mode. This is to ensure correct iaa_crypto behavior until true
async polling without interrupts is enabled in iaa_crypto.
+ - g_comp_wqs_per_iaa
+
+ Number of compress-only WQs. The default is 1, but is applicable only
+ if the device has more than 1 WQ. If the device has exactly 1 WQ
+ configured, "g_comp_wqs_per_iaa" is a don't care.
+
+ If the IAA device has more than "g_comp_wqs_per_iaa" WQs configured,
+ the last "g_comp_wqs_per_iaa" number of WQs will be considered as
+ "compress only". The remaining WQs will be considered as "decomp only".
+
+ If the device has less than or equal to "g_comp_wqs_per_iaa" WQs, all
+ the device's WQs will be considered "generic", i.e., the driver will
+ submit compress and decompress jobs to all the WQs configured for the
+ device.
+
+ For e.g., if an IAA "X" has 2 WQs, this will set up 1 decompress WQ and
+ 1 compress WQ::
+
+ echo 1 > /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa
+
+ wqX.0: decompress jobs only.
+ wqX.1: compress jobs only.
+
+ This setting would typically benefit workloads that see a high
+ level of compress and decompress activity.
+
+ If an IAA has 1 WQ, that WQ will be considered "generic": the driver
+ will submit compress and decompress jobs to the same WQ (this is
+ independent of the "g_comp_wqs_per_iaa" setting):
+
+ wqX.0: compress and decompress jobs.
+
+ This would typically benefit workloads that see significant cold
+ memory being reclaimed, and consequently, high swapout and low swapin
+ activity.
+
+ - distribute_comps
+
+ Distribute compressions to all IAAs on package (default is Y).
+
+ Assuming the WQ type has been established as
+ compress-only/decompress-only/generic, this setting will determine if
+ the driver will distribute compress jobs to all IAAs on a package
+ (default behavior) or not.
+
+ If this is turned off, the driver will dispatch compress jobs to a
+ given IAA "compression enabled" WQ only from cores that are mapped to
+ that IAA using an algorithm that evenly distributes IAAs per package
+ to cores per package. For e.g., on a Sapphire Rapids server with
+ 56-physical-cores and 4 IAAs per package, with Hyperthreading, 28
+ logical cores will be assigned to each IAA. With the
+ "distribute_comps" driver parameter turned off, the driver will send
+ compress jobs only to its assigned IAA device.
+
+ Enabling "distribute_comps" would typically benefit workloads in
+ terms of batch compress latency and throughput.
+
+ - distribute_decomps
+
+ Distribute decompressions to all IAAs on package (default is N).
+
+ Assuming the WQ type has been established as
+ compress-only/decompress-only/generic, this setting will determine if
+ the driver will distribute decompress jobs to all IAAs on a package
+ or not (default behavior).
+
+ Enabling "distribute_decomps" would typically benefit workloads that
+ see a high level of compress and decompress activity, especially
+ p99 decompress latency.
+
+ Recommended settings for best compress/decompress latency, throughput
+ and hence memory savings for a moderately contended server that
+ has more than 1 IAA device enabled on a given package:
+
+ 2 WQs per IAA
+ g_comp_wqs_per_iaa = 1 (separate WQ for comps/decomps per IAA)
+ distribute_decomps = Y
+ distribute_comps = Y
+
+ For a system that has only 1 IAA device enabled on a given package,
+ the recommended settings are:
+
+ 1 WQ per IAA
+ g_comp_wqs_per_iaa = 0 (same WQ for comps/decomps)
+ distribute_decomps = N
+ distribute_comps = N
+
+ Examples:
+
+ For a Sapphire Rapids server with 2 packages, 56 cores and 4 IAAs per
+ package, each IAA has 2 WQs, and these settings are in effect::
+
+ echo 1 > /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa
+ echo 1 > /sys/bus/dsa/drivers/crypto/distribute_comps
+ echo 0 > /sys/bus/dsa/drivers/crypto/distribute_decomps
+
+ This enables the following behavior:
+
+ wqX.0: decompress jobs only.
+ wqX.1: compress jobs only.
+
+ Compress jobs from all cores on package-0 will be distributed in
+ round-robin manner to [iax1, iax3, iax5, iax7]'s wqX.1, to maximize
+ compression throughput/latency/memory savings:
+
+ wq1.1
+ wq3.1
+ wq5.1
+ wq7.1
+
+ Likewise, compress jobs from all cores on package-1 will be
+ distributed in round-robin manner to [iax9, iax11, iax13, iax15]'s
+ wqX.1, to maximize compression throughput/latency/memory savings for
+ workloads running on package-1:
+
+ wq9.1
+ wq11.1
+ wq13.1
+ wq15.1
+
+ Decompress jobs will be submitted from mapped logical cores only, as
+ follows:
+
+ package-0:
+
+ CPU 0-13,112-125 14-27,126-139 28-41,140-153 42-55,154-167
+ IAA: iax1 iax3 iax5 iax7
+ WQ: wq1.0 wq3.0 wq5.0 wq7.0
+
+ package-1:
+
+ CPU 56-69,168-181 70-83,182-195 84-97,196-209 98-111,210-223
+ IAA: iax9 iax11 iax13 iax15
+ WQ: wq9.0 wq11.0 wq13.0 wq15.0
+
+
.. _iaa_default_config:
IAA Default Configuration
diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index 56985e395263..549ac98a9366 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -46,6 +46,7 @@ struct iaa_wq {
struct idxd_wq *wq;
int ref;
bool remove;
+ bool mapped;
struct iaa_device *iaa_device;
@@ -63,6 +64,13 @@ struct iaa_device_compression_mode {
dma_addr_t aecs_comp_table_dma_addr;
};
+struct wq_table_entry {
+ struct idxd_wq **wqs;
+ unsigned int max_wqs;
+ unsigned int n_wqs;
+ unsigned int cur_wq;
+};
+
/* Representation of IAA device with wqs, populated by probe */
struct iaa_device {
struct list_head list;
@@ -73,19 +81,15 @@ struct iaa_device {
int n_wq;
struct list_head wqs;
+ struct wq_table_entry *generic_wq_table;
+ struct wq_table_entry *comp_wq_table;
+
atomic64_t comp_calls;
atomic64_t comp_bytes;
atomic64_t decomp_calls;
atomic64_t decomp_bytes;
};
-struct wq_table_entry {
- struct idxd_wq **wqs;
- int max_wqs;
- int n_wqs;
- int cur_wq;
-};
-
#define IAA_AECS_ALIGN 32
/*
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 760997eee8fe..9de7a8a4d7a8 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -23,32 +23,86 @@
#define pr_fmt(fmt) "idxd: " IDXD_SUBDRIVER_NAME ": " fmt
#define IAA_ALG_PRIORITY 300
+#define MAX_PKG_IAA 8
+#define MAX_IAA_WQ 8
/**************************************
* Driver internal global variables.
**************************************/
/* number of iaa instances probed */
-static unsigned int nr_iaa;
+static atomic_t nr_iaa = ATOMIC_INIT(0);
static unsigned int nr_cpus;
-static unsigned int nr_nodes;
-static unsigned int nr_cpus_per_node;
+static unsigned int nr_packages;
+static unsigned int nr_cpus_per_package;
+static atomic_t nr_iaa_per_package = ATOMIC_INIT(0);
/* Number of physical cpus sharing each iaa instance */
-static unsigned int cpus_per_iaa;
+static atomic_t cpus_per_iaa = ATOMIC_INIT(0);
-/* Per-cpu lookup table for balanced wqs */
-static struct wq_table_entry __percpu *wq_table;
+/* Per-cpu lookup table for decomp wqs. */
+static struct wq_table_entry __percpu *cpu_decomp_wqs;
+
+/* Per-cpu lookup table for comp wqs. */
+static struct wq_table_entry __percpu *cpu_comp_wqs;
+
+/* All decomp wqs from IAAs on a package. */
+static struct wq_table_entry **pkg_global_decomp_wqs;
+/* All comp wqs from IAAs on a package. */
+static struct wq_table_entry **pkg_global_comp_wqs;
LIST_HEAD(iaa_devices);
DEFINE_MUTEX(iaa_devices_lock);
-/* If enabled, IAA hw crypto algos are registered, unavailable otherwise */
-static bool iaa_crypto_enabled;
+/*
+ * If enabled, IAA hw crypto algos are registered, unavailable otherwise:
+ *
+ * We use the atomic @iaa_crypto_enabled to know if the per-CPU
+ * compress/decompress wq tables have been setup successfully.
+ * Since @iaa_crypto_enabled is atomic, the core functions that
+ * return a wq for compression/decompression, namely,
+ * comp_wq_table_next_wq() and decomp_wq_table_next_wq() will
+ * test this atomic before proceeding to query the per-cpu wq tables.
+ *
+ * These events will set @iaa_crypto_enabled to 1:
+ * - Successful rebalance_wq_table() after individual wq addition/removal.
+ *
+ * These events will set @iaa_crypto_enabled to 0:
+ * - Error during rebalance_wq_table() after individual wq addition/removal.
+ * - check_completion() timeouts.
+ * - @nr_iaa is 0.
+ * - module cleanup.
+ */
+static atomic_t iaa_crypto_enabled = ATOMIC_INIT(0);
+
+/*
+ * First wq probed, to use until @iaa_crypto_enabled is 1:
+ *
+ * The first wq probed will be entered in the per-CPU comp/decomp wq tables
+ * until the IAA compression modes are registered. This is done to facilitate
+ * the compress/decompress calls from the crypto testmgr resulting from
+ * calling crypto_register_acomp().
+ *
+ * With the new dynamic package-level rebalancing of WQs being
+ * discovered asynchronously and concurrently with tests
+ * triggered from device registration, this is needed to
+ * determine when it is safe for the rebalancing of decomp/comp
+ * WQs to de-allocate the per-package WQs and re-allocate them
+ * based on the latest number of IAA devices and WQs.
+ */
+static struct idxd_wq *first_wq_found;
+DEFINE_MUTEX(first_wq_found_lock);
+
static bool iaa_crypto_registered;
static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
+/* Distribute decompressions across all IAAs on the package. */
+static bool iaa_distribute_decomps;
+
+/* Distribute compressions across all IAAs on the package. */
+static bool iaa_distribute_comps = true;
+
/* Verify results of IAA compress or not */
static bool iaa_verify_compress = true;
@@ -86,6 +140,9 @@ static bool async_mode;
/* Use interrupts */
static bool use_irq;
+/* Number of compress-only wqs per iaa */
+static unsigned int g_comp_wqs_per_iaa = 1;
+
/**************************************************
* Driver attributes along with get/set functions.
**************************************************/
@@ -102,7 +159,7 @@ static ssize_t verify_compress_store(struct device_driver *driver,
mutex_lock(&iaa_devices_lock);
- if (iaa_crypto_enabled)
+ if (atomic_read(&iaa_crypto_enabled))
goto out;
ret = kstrtobool(buf, &iaa_verify_compress);
@@ -166,7 +223,7 @@ static ssize_t sync_mode_store(struct device_driver *driver,
mutex_lock(&iaa_devices_lock);
- if (iaa_crypto_enabled)
+ if (atomic_read(&iaa_crypto_enabled))
goto out;
ret = set_iaa_sync_mode(buf);
@@ -179,6 +236,87 @@ static ssize_t sync_mode_store(struct device_driver *driver,
}
static DRIVER_ATTR_RW(sync_mode);
+static ssize_t g_comp_wqs_per_iaa_show(struct device_driver *driver, char *buf)
+{
+ return sprintf(buf, "%u\n", g_comp_wqs_per_iaa);
+}
+
+static ssize_t g_comp_wqs_per_iaa_store(struct device_driver *driver,
+ const char *buf, size_t count)
+{
+ int ret = -EBUSY;
+
+ mutex_lock(&iaa_devices_lock);
+
+ if (atomic_read(&iaa_crypto_enabled))
+ goto out;
+
+ ret = kstrtouint(buf, 10, &g_comp_wqs_per_iaa);
+ if (ret)
+ goto out;
+
+ ret = count;
+out:
+ mutex_unlock(&iaa_devices_lock);
+
+ return ret;
+}
+static DRIVER_ATTR_RW(g_comp_wqs_per_iaa);
+
+static ssize_t distribute_decomps_show(struct device_driver *driver, char *buf)
+{
+ return sprintf(buf, "%d\n", iaa_distribute_decomps);
+}
+
+static ssize_t distribute_decomps_store(struct device_driver *driver,
+ const char *buf, size_t count)
+{
+ int ret = -EBUSY;
+
+ mutex_lock(&iaa_devices_lock);
+
+ if (atomic_read(&iaa_crypto_enabled))
+ goto out;
+
+ ret = kstrtobool(buf, &iaa_distribute_decomps);
+ if (ret)
+ goto out;
+
+ ret = count;
+out:
+ mutex_unlock(&iaa_devices_lock);
+
+ return ret;
+}
+static DRIVER_ATTR_RW(distribute_decomps);
+
+static ssize_t distribute_comps_show(struct device_driver *driver, char *buf)
+{
+ return sprintf(buf, "%d\n", iaa_distribute_comps);
+}
+
+static ssize_t distribute_comps_store(struct device_driver *driver,
+ const char *buf, size_t count)
+{
+ int ret = -EBUSY;
+
+ mutex_lock(&iaa_devices_lock);
+
+ if (atomic_read(&iaa_crypto_enabled))
+ goto out;
+
+ ret = kstrtobool(buf, &iaa_distribute_comps);
+ if (ret)
+ goto out;
+
+ ret = count;
+out:
+ mutex_unlock(&iaa_devices_lock);
+
+ return ret;
+}
+static DRIVER_ATTR_RW(distribute_comps);
+
/****************************
* Driver compression modes.
****************************/
@@ -464,32 +602,81 @@ static void remove_device_compression_modes(struct iaa_device *iaa_device)
* allocate/init/query/deallocate devices/wqs.
***********************************************************/
-static struct iaa_device *iaa_device_alloc(void)
+static struct iaa_device *iaa_device_alloc(struct idxd_device *idxd)
{
struct iaa_device *iaa_device;
+ struct wq_table_entry *wqt;
iaa_device = kzalloc(sizeof(*iaa_device), GFP_KERNEL);
if (!iaa_device)
- return NULL;
+ goto err;
+
+ iaa_device->idxd = idxd;
+
+ /* IAA device's generic/decomp wqs. */
+ iaa_device->generic_wq_table = kzalloc(sizeof(struct wq_table_entry), GFP_KERNEL);
+ if (!iaa_device->generic_wq_table)
+ goto err;
+
+ wqt = iaa_device->generic_wq_table;
+
+ wqt->wqs = kcalloc(iaa_device->idxd->max_wqs, sizeof(struct idxd_wq *), GFP_KERNEL);
+ if (!wqt->wqs)
+ goto err;
+
+ wqt->max_wqs = iaa_device->idxd->max_wqs;
+ wqt->n_wqs = 0;
+
+ /*
+ * IAA device's comp wqs (optional). If the device has more than
+ * "g_comp_wqs_per_iaa" WQs configured, the last "g_comp_wqs_per_iaa"
+ * number of WQs will be considered as "comp only". The remaining
+ * WQs will be considered as "decomp only".
+ * If the device has <= "g_comp_wqs_per_iaa" WQs, all the
+ * device's WQs will be considered "generic", i.e., cores can submit
+ * comp and decomp jobs to all the WQs configured for the device.
+ */
+ iaa_device->comp_wq_table = kzalloc(sizeof(struct wq_table_entry), GFP_KERNEL);
+ if (!iaa_device->comp_wq_table)
+ goto err;
+
+ wqt = iaa_device->comp_wq_table;
+
+ wqt->wqs = kcalloc(iaa_device->idxd->max_wqs, sizeof(struct idxd_wq *), GFP_KERNEL);
+ if (!wqt->wqs)
+ goto err;
+
+ wqt->max_wqs = iaa_device->idxd->max_wqs;
+ wqt->n_wqs = 0;
INIT_LIST_HEAD(&iaa_device->wqs);
return iaa_device;
+
+err:
+ if (iaa_device) {
+ if (iaa_device->generic_wq_table) {
+ kfree(iaa_device->generic_wq_table->wqs);
+ kfree(iaa_device->generic_wq_table);
+ }
+ kfree(iaa_device->comp_wq_table);
+ kfree(iaa_device);
+ }
+
+ return NULL;
}
static struct iaa_device *add_iaa_device(struct idxd_device *idxd)
{
struct iaa_device *iaa_device;
- iaa_device = iaa_device_alloc();
+ iaa_device = iaa_device_alloc(idxd);
if (!iaa_device)
return NULL;
- iaa_device->idxd = idxd;
-
list_add_tail(&iaa_device->list, &iaa_devices);
- nr_iaa++;
+ atomic_inc(&nr_iaa);
return iaa_device;
}
@@ -509,7 +696,7 @@ static void del_iaa_device(struct iaa_device *iaa_device)
{
list_del(&iaa_device->list);
- nr_iaa--;
+ atomic_dec(&nr_iaa);
}
static void free_iaa_device(struct iaa_device *iaa_device)
@@ -518,6 +705,17 @@ static void free_iaa_device(struct iaa_device *iaa_device)
return;
remove_device_compression_modes(iaa_device);
+
+ if (iaa_device->generic_wq_table) {
+ kfree(iaa_device->generic_wq_table->wqs);
+ kfree(iaa_device->generic_wq_table);
+ }
+
+ if (iaa_device->comp_wq_table) {
+ kfree(iaa_device->comp_wq_table->wqs);
+ kfree(iaa_device->comp_wq_table);
+ }
+
kfree(iaa_device);
}
@@ -567,16 +765,16 @@ static void del_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
struct idxd_device *idxd = iaa_device->idxd;
struct pci_dev *pdev = idxd->pdev;
struct device *dev = &pdev->dev;
- struct iaa_wq *iaa_wq;
+ struct iaa_wq *iaa_wq, *next_iaa_wq;
- list_for_each_entry(iaa_wq, &iaa_device->wqs, list) {
+ list_for_each_entry_safe(iaa_wq, next_iaa_wq, &iaa_device->wqs, list) {
if (iaa_wq->wq == wq) {
list_del(&iaa_wq->list);
iaa_device->n_wq--;
dev_dbg(dev, "removed wq %d from iaa_device %d, n_wq %d, nr_iaa %d\n",
wq->id, iaa_device->idxd->id,
- iaa_device->n_wq, nr_iaa);
+ iaa_device->n_wq, atomic_read(&nr_iaa));
if (iaa_device->n_wq == 0)
del_iaa_device(iaa_device);
@@ -587,21 +785,30 @@ static void del_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
static void remove_iaa_wq(struct idxd_wq *wq)
{
- struct iaa_device *iaa_device;
+ struct iaa_device *iaa_device, *next_iaa_device;
+ unsigned int num_pkg_iaa = 0;
- list_for_each_entry(iaa_device, &iaa_devices, list) {
+ list_for_each_entry_safe(iaa_device, next_iaa_device, &iaa_devices, list) {
if (iaa_has_wq(iaa_device, wq)) {
del_iaa_wq(iaa_device, wq);
break;
}
}
- if (nr_iaa) {
- cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
- if (!cpus_per_iaa)
- cpus_per_iaa = 1;
- } else
- cpus_per_iaa = 1;
+ if (atomic_read(&nr_iaa)) {
+ atomic_set(&cpus_per_iaa, (nr_packages * nr_cpus_per_package) / atomic_read(&nr_iaa));
+ if (!atomic_read(&cpus_per_iaa))
+ atomic_set(&cpus_per_iaa, 1);
+
+ num_pkg_iaa = atomic_read(&nr_iaa) / nr_packages;
+ if (!num_pkg_iaa)
+ num_pkg_iaa = 1;
+ } else {
+ atomic_set(&cpus_per_iaa, 1);
+ num_pkg_iaa = 1;
+ }
+
+ atomic_set(&nr_iaa_per_package, num_pkg_iaa);
}
static void __free_iaa_wq(struct iaa_wq *iaa_wq)
@@ -635,6 +842,7 @@ static int save_iaa_wq(struct idxd_wq *wq)
struct pci_dev *pdev;
struct device *dev;
int ret = 0;
+ unsigned int num_pkg_iaa = 0;
list_for_each_entry(iaa_device, &iaa_devices, list) {
if (iaa_device->idxd == wq->idxd) {
@@ -687,12 +895,19 @@ static int save_iaa_wq(struct idxd_wq *wq)
}
}
- if (WARN_ON(nr_iaa == 0))
+ if (WARN_ON(atomic_read(&nr_iaa) == 0))
return -EINVAL;
- cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa;
- if (!cpus_per_iaa)
- cpus_per_iaa = 1;
+ atomic_set(&cpus_per_iaa, (nr_packages * nr_cpus_per_package) / atomic_read(&nr_iaa));
+ if (!atomic_read(&cpus_per_iaa))
+ atomic_set(&cpus_per_iaa, 1);
+
+ num_pkg_iaa = atomic_read(&nr_iaa) / nr_packages;
+ if (!num_pkg_iaa)
+ num_pkg_iaa = 1;
+
+ atomic_set(&nr_iaa_per_package, num_pkg_iaa);
+
out:
return 0;
}
@@ -748,105 +963,290 @@ static int iaa_wq_put(struct idxd_wq *wq)
* Mapping IAA devices and wqs to cores with per-cpu wq_tables.
***************************************************************/
-static void wq_table_free_entry(int cpu)
+/*
+ * Given a cpu, find the closest IAA instance.
+ */
+static inline int cpu_to_iaa(int cpu)
{
- struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+ int package_id, base_iaa, iaa = 0;
- kfree(entry->wqs);
- memset(entry, 0, sizeof(*entry));
+ if (!nr_packages || !atomic_read(&nr_iaa_per_package) || !atomic_read(&nr_iaa))
+ return -1;
+
+ package_id = topology_logical_package_id(cpu);
+ base_iaa = package_id * atomic_read(&nr_iaa_per_package);
+ iaa = base_iaa + ((cpu % nr_cpus_per_package) / atomic_read(&cpus_per_iaa));
+
+ pr_debug("cpu = %d, package_id = %d, base_iaa = %d, iaa = %d",
+ cpu, package_id, base_iaa, iaa);
+
+ if (iaa >= 0 && iaa < atomic_read(&nr_iaa))
+ return iaa;
+
+ return (atomic_read(&nr_iaa) - 1);
}
-static void wq_table_clear_entry(int cpu)
+static void free_wq_tables(void)
{
- struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+ if (cpu_decomp_wqs) {
+ free_percpu(cpu_decomp_wqs);
+ cpu_decomp_wqs = NULL;
+ }
- entry->n_wqs = 0;
- entry->cur_wq = 0;
- memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
+ if (cpu_comp_wqs) {
+ free_percpu(cpu_comp_wqs);
+ cpu_comp_wqs = NULL;
+ }
+
+ pr_debug("freed comp/decomp wq tables\n");
}
-static void clear_wq_table(void)
+static void pkg_global_wqs_dealloc(void)
{
- int cpu;
+ int i;
- for (cpu = 0; cpu < nr_cpus; cpu++)
- wq_table_clear_entry(cpu);
+ if (pkg_global_decomp_wqs) {
+ for (i = 0; i < nr_packages; ++i) {
+ kfree(pkg_global_decomp_wqs[i]->wqs);
+ kfree(pkg_global_decomp_wqs[i]);
+ }
+ kfree(pkg_global_decomp_wqs);
+ pkg_global_decomp_wqs = NULL;
+ }
- pr_debug("cleared wq table\n");
+ if (pkg_global_comp_wqs) {
+ for (i = 0; i < nr_packages; ++i) {
+ kfree(pkg_global_comp_wqs[i]->wqs);
+ kfree(pkg_global_comp_wqs[i]);
+ }
+ kfree(pkg_global_comp_wqs);
+ pkg_global_comp_wqs = NULL;
+ }
}
-static void free_wq_table(void)
+static bool pkg_global_wqs_alloc(void)
{
- int cpu;
+ int i;
+
+ pkg_global_decomp_wqs = kcalloc(nr_packages, sizeof(*pkg_global_decomp_wqs), GFP_KERNEL);
+ if (!pkg_global_decomp_wqs)
+ return false;
+
+ for (i = 0; i < nr_packages; ++i) {
+ pkg_global_decomp_wqs[i] = kzalloc(sizeof(struct wq_table_entry), GFP_KERNEL);
+ if (!pkg_global_decomp_wqs[i])
+ goto err;
+
+ pkg_global_decomp_wqs[i]->wqs = kcalloc(MAX_PKG_IAA * MAX_IAA_WQ, sizeof(struct idxd_wq *), GFP_KERNEL);
+ if (!pkg_global_decomp_wqs[i]->wqs)
+ goto err;
+
+ pkg_global_decomp_wqs[i]->max_wqs = MAX_PKG_IAA * MAX_IAA_WQ;
+ }
+
+ pkg_global_comp_wqs = kcalloc(nr_packages, sizeof(*pkg_global_comp_wqs), GFP_KERNEL);
+ if (!pkg_global_comp_wqs)
+ goto err;
- for (cpu = 0; cpu < nr_cpus; cpu++)
- wq_table_free_entry(cpu);
+ for (i = 0; i < nr_packages; ++i) {
+ pkg_global_comp_wqs[i] = kzalloc(sizeof(struct wq_table_entry), GFP_KERNEL);
+ if (!pkg_global_comp_wqs[i])
+ goto err;
+
+ pkg_global_comp_wqs[i]->wqs = kcalloc(MAX_PKG_IAA * MAX_IAA_WQ, sizeof(struct idxd_wq *), GFP_KERNEL);
+ if (!pkg_global_comp_wqs[i]->wqs)
+ goto err;
+
+ pkg_global_comp_wqs[i]->max_wqs = MAX_PKG_IAA * MAX_IAA_WQ;
+ }
- free_percpu(wq_table);
+ return true;
- pr_debug("freed wq table\n");
+err:
+ pkg_global_wqs_dealloc();
+ return false;
}
static int alloc_wq_table(int max_wqs)
{
- struct wq_table_entry *entry;
- int cpu;
-
- wq_table = alloc_percpu(struct wq_table_entry);
- if (!wq_table)
+ cpu_decomp_wqs = alloc_percpu_gfp(struct wq_table_entry, GFP_KERNEL | __GFP_ZERO);
+ if (!cpu_decomp_wqs)
return -ENOMEM;
- for (cpu = 0; cpu < nr_cpus; cpu++) {
- entry = per_cpu_ptr(wq_table, cpu);
- entry->wqs = kcalloc(max_wqs, sizeof(*entry->wqs), GFP_KERNEL);
- if (!entry->wqs) {
- free_wq_table();
- return -ENOMEM;
- }
+ cpu_comp_wqs = alloc_percpu_gfp(struct wq_table_entry, GFP_KERNEL | __GFP_ZERO);
+ if (!cpu_comp_wqs)
+ goto err;
- entry->max_wqs = max_wqs;
- }
+ if (!pkg_global_wqs_alloc())
+ goto err;
pr_debug("initialized wq table\n");
return 0;
+
+err:
+ free_wq_tables();
+ return -ENOMEM;
+}
+
+/*
+ * The caller should have established that @device_wq_table is not empty,
+ * i.e., every IAA device in "iaa_devices" has at least one WQ.
+ */
+static void add_device_wqs_to_wq_table(struct wq_table_entry *dst_wq_table,
+ struct wq_table_entry *device_wq_table)
+{
+ int i;
+
+ for (i = 0; i < device_wq_table->n_wqs; ++i)
+ dst_wq_table->wqs[dst_wq_table->n_wqs++] = device_wq_table->wqs[i];
}
-static void wq_table_add(int cpu, struct idxd_wq *wq)
+static bool reinit_pkg_global_wqs(bool comp)
{
- struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+ int cur_iaa = 0, pkg = 0;
+ struct iaa_device *iaa_device;
+ struct wq_table_entry **pkg_wqs = comp ? pkg_global_comp_wqs : pkg_global_decomp_wqs;
+
+ for (pkg = 0; pkg < nr_packages; ++pkg)
+ pkg_wqs[pkg]->n_wqs = 0;
+
+ pkg = 0;
+
+one_iaa_special_case:
+ /* Re-initialize per-package wqs. */
+ list_for_each_entry(iaa_device, &iaa_devices, list) {
+ struct wq_table_entry *device_wq_table = comp ?
+ ((iaa_device->comp_wq_table->n_wqs > 0) ?
+ iaa_device->comp_wq_table : iaa_device->generic_wq_table) :
+ iaa_device->generic_wq_table;
+
+ if (pkg_wqs[pkg]->n_wqs + device_wq_table->n_wqs > pkg_wqs[pkg]->max_wqs) {
+ pkg_wqs[pkg]->wqs = krealloc(pkg_wqs[pkg]->wqs,
+ ksize(pkg_wqs[pkg]->wqs) +
+ max((MAX_PKG_IAA * MAX_IAA_WQ), iaa_device->n_wq) * sizeof(struct idxd_wq *),
+ GFP_KERNEL | __GFP_ZERO);
+ if (!pkg_wqs[pkg]->wqs)
+ return false;
+
+ pkg_wqs[pkg]->max_wqs = ksize(pkg_wqs[pkg]->wqs)/sizeof(struct idxd_wq *);
+ }
+
+ add_device_wqs_to_wq_table(pkg_wqs[pkg], device_wq_table);
+
+ pr_debug("pkg_global_%s_wqs[%d] has %u n_wqs %u max_wqs",
+ (comp ? "comp" : "decomp"), pkg, pkg_wqs[pkg]->n_wqs, pkg_wqs[pkg]->max_wqs);
+
+ if (++cur_iaa == atomic_read(&nr_iaa_per_package)) {
+ if (++pkg == nr_packages)
+ break;
+ cur_iaa = 0;
+ if (atomic_read(&nr_iaa) == 1)
+ goto one_iaa_special_case;
+ }
+ }
+
+ return true;
+}
+
+static void create_cpu_wq_table(int cpu, struct wq_table_entry *wq_table, bool comp)
+{
+ struct wq_table_entry *entry = comp ?
+ per_cpu_ptr(cpu_comp_wqs, cpu) :
+ per_cpu_ptr(cpu_decomp_wqs, cpu);
+
+ if (!atomic_read(&iaa_crypto_enabled)) {
+ mutex_lock(&first_wq_found_lock);
+
+ if (WARN_ON(!first_wq_found && !wq_table->n_wqs)) {
+ mutex_unlock(&first_wq_found_lock);
+ return;
+ }
+
+ if (!first_wq_found)
+ first_wq_found = wq_table->wqs[0];
- if (WARN_ON(entry->n_wqs == entry->max_wqs))
+ mutex_unlock(&first_wq_found_lock);
+
+ entry->wqs = &first_wq_found;
+ entry->max_wqs = 1;
+ entry->n_wqs = 1;
+ entry->cur_wq = 0;
+ pr_debug("%s: cpu %d: added %u first_wq_found for %s wqs up to wq %d.%d\n", __func__,
+ cpu, entry->n_wqs, comp ? "comp":"decomp",
+ entry->wqs[entry->n_wqs - 1]->idxd->id,
+ entry->wqs[entry->n_wqs - 1]->id);
+ return;
+ }
+
+ entry->wqs = wq_table->wqs;
+ entry->max_wqs = wq_table->max_wqs;
+ entry->n_wqs = wq_table->n_wqs;
+ entry->cur_wq = 0;
+
+ if (entry->n_wqs)
+ pr_debug("%s: cpu %d: added %u iaa %s wqs up to wq %d.%d: entry->max_wqs = %u\n", __func__,
+ cpu, entry->n_wqs, comp ? "comp":"decomp",
+ entry->wqs[entry->n_wqs - 1]->idxd->id, entry->wqs[entry->n_wqs - 1]->id,
+ entry->max_wqs);
+}
+
+static void set_cpu_wq_table_start_wq(int cpu, bool comp)
+{
+ struct wq_table_entry *entry = comp ?
+ per_cpu_ptr(cpu_comp_wqs, cpu) :
+ per_cpu_ptr(cpu_decomp_wqs, cpu);
+ unsigned int num_pkg_iaa = atomic_read(&nr_iaa_per_package);
+
+ if (!num_pkg_iaa)
return;
- entry->wqs[entry->n_wqs++] = wq;
+ int start_wq = (entry->n_wqs / num_pkg_iaa) * (cpu_to_iaa(cpu) % num_pkg_iaa);
+
+ if ((start_wq >= 0) && (start_wq < entry->n_wqs))
+ entry->cur_wq = start_wq;
+}
+
+static void create_cpu_wq_table_from_pkg_wqs(bool comp)
+{
+ int cpu;
- pr_debug("%s: added iaa wq %d.%d to idx %d of cpu %d\n", __func__,
- entry->wqs[entry->n_wqs - 1]->idxd->id,
- entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
+ /*
+ * All CPU on the same package share the same "package global"
+ * [de]comp_wqs.
+ */
+ for (cpu = 0; cpu < nr_cpus; cpu += nr_cpus_per_package) {
+ int package_id = topology_logical_package_id(cpu);
+ struct wq_table_entry *pkg_wq_table = comp ?
+ ((pkg_global_comp_wqs[package_id]->n_wqs > 0) ?
+ pkg_global_comp_wqs[package_id] : pkg_global_decomp_wqs[package_id])
+ : pkg_global_decomp_wqs[package_id];
+ int pkg_cpu;
+
+ for (pkg_cpu = cpu; pkg_cpu < cpu + nr_cpus_per_package; ++pkg_cpu) {
+ /* Initialize decomp/comp wq_table for CPU. */
+ create_cpu_wq_table(pkg_cpu, pkg_wq_table, comp);
+ /* Stagger the starting WQ in the package WQ table, for each CPU. */
+ set_cpu_wq_table_start_wq(pkg_cpu, comp);
+ }
+ }
}
-static int wq_table_add_wqs(int iaa, int cpu)
+static int add_mapped_device_wq_table_for_cpu(int iaa, int cpu, bool comp)
{
struct iaa_device *iaa_device, *found_device = NULL;
- int ret = 0, cur_iaa = 0, n_wqs_added = 0;
- struct idxd_device *idxd;
- struct iaa_wq *iaa_wq;
- struct pci_dev *pdev;
- struct device *dev;
+ struct wq_table_entry *device_wq_table;
+ int ret = 0, cur_iaa = 0;
list_for_each_entry(iaa_device, &iaa_devices, list) {
- idxd = iaa_device->idxd;
- pdev = idxd->pdev;
- dev = &pdev->dev;
-
if (cur_iaa != iaa) {
cur_iaa++;
continue;
}
found_device = iaa_device;
- dev_dbg(dev, "getting wq from iaa_device %d, cur_iaa %d\n",
+ dev_dbg(&found_device->idxd->pdev->dev,
+ "getting wq from iaa_device %d, cur_iaa %d\n",
found_device->idxd->id, cur_iaa);
break;
}
@@ -861,93 +1261,219 @@ static int wq_table_add_wqs(int iaa, int cpu)
}
cur_iaa = 0;
- idxd = found_device->idxd;
- pdev = idxd->pdev;
- dev = &pdev->dev;
- dev_dbg(dev, "getting wq from only iaa_device %d, cur_iaa %d\n",
+ dev_dbg(&found_device->idxd->pdev->dev,
+ "getting wq from only iaa_device %d, cur_iaa %d\n",
found_device->idxd->id, cur_iaa);
}
- list_for_each_entry(iaa_wq, &found_device->wqs, list) {
- wq_table_add(cpu, iaa_wq->wq);
- pr_debug("rebalance: added wq for cpu=%d: iaa wq %d.%d\n",
- cpu, iaa_wq->wq->idxd->id, iaa_wq->wq->id);
- n_wqs_added++;
+ device_wq_table = comp ?
+ ((found_device->comp_wq_table->n_wqs > 0) ?
+ found_device->comp_wq_table : found_device->generic_wq_table) :
+ found_device->generic_wq_table;
+
+ create_cpu_wq_table(cpu, device_wq_table, comp);
+
+out:
+ return ret;
+}
+
+static void create_cpu_wq_table_from_mapped_device(bool comp)
+{
+ int cpu, iaa;
+
+ for_each_possible_cpu(cpu) {
+ iaa = cpu_to_iaa(cpu);
+ pr_debug("rebalance: cpu=%d iaa=%d\n", cpu, iaa);
+
+ if (WARN_ON(iaa == -1)) {
+ pr_debug("rebalance (cpu_to_iaa(%d)) failed!\n", cpu);
+ return;
+ }
+
+ if (WARN_ON(add_mapped_device_wq_table_for_cpu(iaa, cpu, comp))) {
+ pr_debug("could not add any wqs of iaa %d to cpu %d!\n", iaa, cpu);
+ return;
+ }
+ }
+}
+
+static int map_iaa_device_wqs(struct iaa_device *iaa_device)
+{
+ struct wq_table_entry *generic, *for_comps;
+ int ret = 0, n_wqs_added = 0;
+ struct iaa_wq *iaa_wq;
+
+ generic = iaa_device->generic_wq_table;
+ for_comps = iaa_device->comp_wq_table;
+
+ list_for_each_entry(iaa_wq, &iaa_device->wqs, list) {
+ if (iaa_wq->mapped && ++n_wqs_added)
+ continue;
+
+ pr_debug("iaa_device %p: processing wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+
+ if ((!n_wqs_added || ((n_wqs_added + g_comp_wqs_per_iaa) < iaa_device->n_wq)) &&
+ (generic->n_wqs < generic->max_wqs)) {
+
+ generic->wqs[generic->n_wqs++] = iaa_wq->wq;
+ pr_debug("iaa_device %p: added decomp wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+ } else {
+ if (WARN_ON(for_comps->n_wqs == for_comps->max_wqs))
+ break;
+
+ for_comps->wqs[for_comps->n_wqs++] = iaa_wq->wq;
+ pr_debug("iaa_device %p: added comp wq %d.%d\n", iaa_device, iaa_device->idxd->id, iaa_wq->wq->id);
+ }
+
+ iaa_wq->mapped = true;
+ ++n_wqs_added;
}
- if (!n_wqs_added) {
- pr_debug("couldn't find any iaa wqs!\n");
+ if (!n_wqs_added && !iaa_device->n_wq) {
+ pr_debug("iaa_device %d: couldn't find any iaa wqs!\n", iaa_device->idxd->id);
ret = -EINVAL;
- goto out;
}
-out:
+
return ret;
}
+static void map_iaa_devices(void)
+{
+ struct iaa_device *iaa_device;
+
+ list_for_each_entry(iaa_device, &iaa_devices, list) {
+ WARN_ON(map_iaa_device_wqs(iaa_device));
+ }
+}
+
/*
- * Rebalance the wq table so that given a cpu, it's easy to find the
- * closest IAA instance. The idea is to try to choose the most
- * appropriate IAA instance for a caller and spread available
- * workqueues around to clients.
+ * Rebalance the per-cpu wq table based on available IAA devices/WQs.
+ * Three driver parameters control how this algorithm works:
+ *
+ * - g_comp_wqs_per_iaa:
+ *
+ * If multiple WQs are configured for a given device, this setting determines
+ * the number of WQs to be used as "compress only" WQs. The remaining WQs will
+ * be used as "decompress only" WQs.
+ * Note that the comp WQ can be the same as the decomp WQ, e.g., if
+ * g_comp_wqs_per_iaa is 0 (regardless of the # of available WQs per device), or
+ * if there is only 1 WQ configured for a device (regardless of
+ * g_comp_wqs_per_iaa).
+ *
+ * - distribute_decomps, distribute_comps:
+ *
+ * If enabled, all [de]comp WQs found on the IAA devices in a
+ * package will be aggregated into pkg_global_[de]comp_wqs, then assigned to
+ * each CPU on the package.
+ *
+ * Note:
+ * -----
+ * rebalance_wq_table() will return true if it was able to successfully
+ * configure comp/decomp wqs for all CPUs, without changing the
+ * @iaa_crypto_enabled atomic. The caller can re-enable the use of the wq
+ * tables after rebalance_wq_table() returns true, by setting the
+ * @iaa_crypto_enabled atomic to 1.
+ * In case of any errors, the @iaa_crypto_enabled atomic will be set to 0,
+ * and rebalance_wq_table() will return false.
*/
-static void rebalance_wq_table(void)
+static bool rebalance_wq_table(void)
{
- const struct cpumask *node_cpus;
- int node_cpu, node, cpu, iaa = 0;
+ int cpu;
- if (nr_iaa == 0)
- return;
+ if (atomic_read(&nr_iaa) == 0)
+ goto err;
- pr_debug("rebalance: nr_nodes=%d, nr_cpus %d, nr_iaa %d, cpus_per_iaa %d\n",
- nr_nodes, nr_cpus, nr_iaa, cpus_per_iaa);
+ map_iaa_devices();
- clear_wq_table();
+ pr_info("rebalance: nr_packages=%d, nr_cpus %d, nr_iaa %d, nr_iaa_per_package %d, cpus_per_iaa %d\n",
+ nr_packages, nr_cpus, atomic_read(&nr_iaa),
+ atomic_read(&nr_iaa_per_package), atomic_read(&cpus_per_iaa));
- if (nr_iaa == 1) {
- for_each_possible_cpu(cpu) {
- if (WARN_ON(wq_table_add_wqs(0, cpu)))
- goto err;
- }
+ if (iaa_distribute_decomps) {
+ /* Each CPU uses all IAA devices on package for decomps. */
+ if (!reinit_pkg_global_wqs(false))
+ goto err;
+ create_cpu_wq_table_from_pkg_wqs(false);
+ } else {
+ /*
+ * Each CPU uses the decomp WQ on the mapped IAA device using
+ * a balanced mapping of cores to IAA.
+ */
+ create_cpu_wq_table_from_mapped_device(false);
+ }
- return;
+ if (iaa_distribute_comps) {
+ /* Each CPU uses all IAA devices on package for comps. */
+ if (!reinit_pkg_global_wqs(true))
+ goto err;
+ create_cpu_wq_table_from_pkg_wqs(true);
+ } else {
+ /*
+ * Each CPU uses the comp WQ on the mapped IAA device using
+ * a balanced mapping of cores to IAA.
+ */
+ create_cpu_wq_table_from_mapped_device(true);
}
- for_each_node_with_cpus(node) {
- cpu = 0;
- node_cpus = cpumask_of_node(node);
+ /* Verify that each cpu has comp and decomp wqs.*/
+ for_each_possible_cpu(cpu) {
+ struct wq_table_entry *entry = per_cpu_ptr(cpu_decomp_wqs, cpu);
- for_each_cpu(node_cpu, node_cpus) {
- iaa = cpu / cpus_per_iaa;
- if (WARN_ON(wq_table_add_wqs(iaa, node_cpu)))
- goto err;
- cpu++;
+ if (!entry->wqs || !entry->n_wqs) {
+ pr_err("%s: cpu %d does not have decomp_wqs", __func__, cpu);
+ goto err;
+ }
+
+ entry = per_cpu_ptr(cpu_comp_wqs, cpu);
+ if (!entry->wqs || !entry->n_wqs) {
+ pr_err("%s: cpu %d does not have comp_wqs", __func__, cpu);
+ goto err;
}
}
- return;
+ pr_debug("Finished rebalance decomp/comp wqs.");
+ return true;
+
err:
- pr_debug("could not add any wqs for iaa %d to cpu %d!\n", iaa, cpu);
+ atomic_set(&iaa_crypto_enabled, 0);
+ pr_debug("Error during rebalance decomp/comp wqs.");
+ return false;
}
/***************************************************************
* Assign work-queues for driver ops using per-cpu wq_tables.
***************************************************************/
-static struct idxd_wq *wq_table_next_wq(int cpu)
+static struct idxd_wq *decomp_wq_table_next_wq(int cpu)
{
- struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
+ struct wq_table_entry *entry = per_cpu_ptr(cpu_decomp_wqs, cpu);
+ struct idxd_wq *wq;
+
+ if (!atomic_read(&iaa_crypto_enabled))
+ return NULL;
+
+ wq = entry->wqs[entry->cur_wq];
- if (++entry->cur_wq >= entry->n_wqs)
+ if (++entry->cur_wq == entry->n_wqs)
entry->cur_wq = 0;
- if (!entry->wqs[entry->cur_wq])
+ return wq;
+}
+
+static struct idxd_wq *comp_wq_table_next_wq(int cpu)
+{
+ struct wq_table_entry *entry = per_cpu_ptr(cpu_comp_wqs, cpu);
+ struct idxd_wq *wq;
+
+ if (!atomic_read(&iaa_crypto_enabled))
return NULL;
- pr_debug("%s: returning wq at idx %d (iaa wq %d.%d) from cpu %d\n", __func__,
- entry->cur_wq, entry->wqs[entry->cur_wq]->idxd->id,
- entry->wqs[entry->cur_wq]->id, cpu);
+ wq = entry->wqs[entry->cur_wq];
- return entry->wqs[entry->cur_wq];
+ if (++entry->cur_wq == entry->n_wqs)
+ entry->cur_wq = 0;
+
+ return wq;
}
/*************************************************
@@ -985,7 +1511,7 @@ static inline int check_completion(struct device *dev,
dev_err(dev, "%s completion timed out - "
"assuming broken hw, iaa_crypto now DISABLED\n",
op_str);
- iaa_crypto_enabled = false;
+ atomic_set(&iaa_crypto_enabled, 0);
ret = -ETIMEDOUT;
goto out;
}
@@ -1501,18 +2027,13 @@ static int iaa_comp_acompress(struct acomp_req *req)
compression_ctx = crypto_tfm_ctx(tfm);
- if (!iaa_crypto_enabled) {
- pr_debug("iaa_crypto disabled, not compressing\n");
- return -ENODEV;
- }
-
if (!req->src || !req->slen) {
pr_debug("invalid src, not compressing\n");
return -EINVAL;
}
cpu = get_cpu();
- wq = wq_table_next_wq(cpu);
+ wq = comp_wq_table_next_wq(cpu);
put_cpu();
if (!wq) {
pr_debug("no wq configured for cpu=%d\n", cpu);
@@ -1599,18 +2120,13 @@ static int iaa_comp_adecompress(struct acomp_req *req)
struct device *dev;
struct idxd_wq *wq;
- if (!iaa_crypto_enabled) {
- pr_debug("iaa_crypto disabled, not decompressing\n");
- return -ENODEV;
- }
-
if (!req->src || !req->slen) {
pr_debug("invalid src, not decompressing\n");
return -EINVAL;
}
cpu = get_cpu();
- wq = wq_table_next_wq(cpu);
+ wq = decomp_wq_table_next_wq(cpu);
put_cpu();
if (!wq) {
pr_debug("no wq configured for cpu=%d\n", cpu);
@@ -1725,6 +2241,8 @@ static int iaa_register_compression_device(void)
static int iaa_unregister_compression_device(void)
{
+ atomic_set(&iaa_crypto_enabled, 0);
+
if (iaa_crypto_registered)
crypto_unregister_acomp(&iaa_acomp_fixed_deflate);
@@ -1746,10 +2264,13 @@ static int iaa_crypto_probe(struct idxd_dev *idxd_dev)
if (data->type != IDXD_TYPE_IAX)
return -ENODEV;
+ mutex_lock(&iaa_devices_lock);
+
mutex_lock(&wq->wq_lock);
if (idxd_wq_get_private(wq)) {
mutex_unlock(&wq->wq_lock);
+ mutex_unlock(&iaa_devices_lock);
return -EBUSY;
}
@@ -1771,8 +2292,6 @@ static int iaa_crypto_probe(struct idxd_dev *idxd_dev)
goto err;
}
- mutex_lock(&iaa_devices_lock);
-
if (list_empty(&iaa_devices)) {
ret = alloc_wq_table(wq->idxd->max_wqs);
if (ret)
@@ -1784,24 +2303,33 @@ static int iaa_crypto_probe(struct idxd_dev *idxd_dev)
if (ret)
goto err_save;
- rebalance_wq_table();
+ if (!rebalance_wq_table()) {
+ dev_dbg(dev, "%s: IAA rebalancing device wq tables failed\n", __func__);
+ goto err_register;
+ }
+ atomic_set(&iaa_crypto_enabled, 1);
if (first_wq) {
- iaa_crypto_enabled = true;
ret = iaa_register_compression_device();
if (ret != 0) {
- iaa_crypto_enabled = false;
dev_dbg(dev, "IAA compression device registration failed\n");
goto err_register;
}
+
+ if (!rebalance_wq_table()) {
+ dev_dbg(dev, "%s: Rerun after registration: IAA rebalancing device wq tables failed\n", __func__);
+ goto err_register;
+ }
+ atomic_set(&iaa_crypto_enabled, 1);
+
try_module_get(THIS_MODULE);
pr_info("iaa_crypto now ENABLED\n");
}
- mutex_unlock(&iaa_devices_lock);
out:
mutex_unlock(&wq->wq_lock);
+ mutex_unlock(&iaa_devices_lock);
return ret;
@@ -1810,9 +2338,8 @@ static int iaa_crypto_probe(struct idxd_dev *idxd_dev)
free_iaa_wq(idxd_wq_get_private(wq));
err_save:
if (first_wq)
- free_wq_table();
+ free_wq_tables();
err_alloc:
- mutex_unlock(&iaa_devices_lock);
idxd_drv_disable_wq(wq);
err:
wq->type = IDXD_WQT_NONE;
@@ -1827,13 +2354,17 @@ static void iaa_crypto_remove(struct idxd_dev *idxd_dev)
struct iaa_wq *iaa_wq;
bool free = false;
+ atomic_set(&iaa_crypto_enabled, 0);
idxd_wq_quiesce(wq);
- mutex_lock(&wq->wq_lock);
mutex_lock(&iaa_devices_lock);
+ mutex_lock(&wq->wq_lock);
remove_iaa_wq(wq);
+ if (!rebalance_wq_table())
+ pr_debug("%s: IAA rebalancing device wq tables failed\n", __func__);
+
spin_lock(&idxd->dev_lock);
iaa_wq = idxd_wq_get_private(wq);
if (!iaa_wq) {
@@ -1856,18 +2387,24 @@ static void iaa_crypto_remove(struct idxd_dev *idxd_dev)
}
idxd_drv_disable_wq(wq);
- rebalance_wq_table();
- if (nr_iaa == 0) {
- iaa_crypto_enabled = false;
- free_wq_table();
+ if (atomic_read(&nr_iaa) == 0) {
+ atomic_set(&iaa_crypto_enabled, 0);
+ pkg_global_wqs_dealloc();
+ free_wq_tables();
+ WARN_ON(!list_empty(&iaa_devices));
+ INIT_LIST_HEAD(&iaa_devices);
module_put(THIS_MODULE);
pr_info("iaa_crypto now DISABLED\n");
+ } else if (rebalance_wq_table()) {
+ atomic_set(&iaa_crypto_enabled, 1);
+ } else {
+ pr_debug("%s: IAA re-rebalancing device wq tables failed\n", __func__);
}
out:
- mutex_unlock(&iaa_devices_lock);
mutex_unlock(&wq->wq_lock);
+ mutex_unlock(&iaa_devices_lock);
}
static enum idxd_dev_type dev_types[] = {
@@ -1890,16 +2427,12 @@ static struct idxd_device_driver iaa_crypto_driver = {
static int __init iaa_crypto_init_module(void)
{
int ret = 0;
- int node;
+
+ INIT_LIST_HEAD(&iaa_devices);
nr_cpus = num_possible_cpus();
- for_each_node_with_cpus(node)
- nr_nodes++;
- if (!nr_nodes) {
- pr_err("IAA couldn't find any nodes with cpus\n");
- return -ENODEV;
- }
- nr_cpus_per_node = nr_cpus / nr_nodes;
+ nr_cpus_per_package = topology_num_cores_per_package();
+ nr_packages = topology_max_packages();
ret = iaa_aecs_init_fixed();
if (ret < 0) {
@@ -1913,6 +2446,27 @@ static int __init iaa_crypto_init_module(void)
goto err_driver_reg;
}
+ ret = driver_create_file(&iaa_crypto_driver.drv,
+ &driver_attr_g_comp_wqs_per_iaa);
+ if (ret) {
+ pr_debug("IAA g_comp_wqs_per_iaa attr creation failed\n");
+ goto err_g_comp_wqs_per_iaa_attr_create;
+ }
+
+ ret = driver_create_file(&iaa_crypto_driver.drv,
+ &driver_attr_distribute_decomps);
+ if (ret) {
+ pr_debug("IAA distribute_decomps attr creation failed\n");
+ goto err_distribute_decomps_attr_create;
+ }
+
+ ret = driver_create_file(&iaa_crypto_driver.drv,
+ &driver_attr_distribute_comps);
+ if (ret) {
+ pr_debug("IAA distribute_comps attr creation failed\n");
+ goto err_distribute_comps_attr_create;
+ }
+
ret = driver_create_file(&iaa_crypto_driver.drv,
&driver_attr_verify_compress);
if (ret) {
@@ -1938,6 +2492,15 @@ static int __init iaa_crypto_init_module(void)
driver_remove_file(&iaa_crypto_driver.drv,
&driver_attr_verify_compress);
err_verify_attr_create:
+ driver_remove_file(&iaa_crypto_driver.drv,
+ &driver_attr_distribute_comps);
+err_distribute_comps_attr_create:
+ driver_remove_file(&iaa_crypto_driver.drv,
+ &driver_attr_distribute_decomps);
+err_distribute_decomps_attr_create:
+ driver_remove_file(&iaa_crypto_driver.drv,
+ &driver_attr_g_comp_wqs_per_iaa);
+err_g_comp_wqs_per_iaa_attr_create:
idxd_driver_unregister(&iaa_crypto_driver);
err_driver_reg:
iaa_aecs_cleanup_fixed();
@@ -1956,6 +2519,12 @@ static void __exit iaa_crypto_cleanup_module(void)
&driver_attr_sync_mode);
driver_remove_file(&iaa_crypto_driver.drv,
&driver_attr_verify_compress);
+ driver_remove_file(&iaa_crypto_driver.drv,
+ &driver_attr_distribute_comps);
+ driver_remove_file(&iaa_crypto_driver.drv,
+ &driver_attr_distribute_decomps);
+ driver_remove_file(&iaa_crypto_driver.drv,
+ &driver_attr_g_comp_wqs_per_iaa);
idxd_driver_unregister(&iaa_crypto_driver);
iaa_aecs_cleanup_fixed();
--
2.27.0
* [PATCH v13 03/22] crypto: iaa - Simplify, consistency of function parameters, minor stats bug fix.
2025-11-04 9:12 [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
2025-11-04 9:12 ` [PATCH v13 01/22] crypto: iaa - Reorganize the iaa_crypto driver code Kanchana P Sridhar
2025-11-04 9:12 ` [PATCH v13 02/22] crypto: iaa - New architecture for IAA device WQ comp/decomp usage & core mapping Kanchana P Sridhar
@ 2025-11-04 9:12 ` Kanchana P Sridhar
2025-11-04 9:12 ` [PATCH v13 04/22] crypto: iaa - Descriptor allocation timeouts with mitigations Kanchana P Sridhar
` (19 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Kanchana P Sridhar @ 2025-11-04 9:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, herbert, davem,
clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
vinicius.gomes
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch further simplifies the code in some places and makes it more
consistent and readable:
1) Change iaa_compress_verify() @dlen parameter to be a value instead of
a pointer, because @dlen's value is only read, not modified by this
procedure.
2) Simplify the success/error return paths in iaa_compress(),
iaa_decompress() and iaa_compress_verify().
3) Delete dev_dbg() statements to make the code more readable.
4) Change the return value for descriptor allocation failures to
-ENODEV, for better maintainability.
5) Fix a minor statistics bug in iaa_decompress(), where decomp_bytes
was updated even in case of errors.
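As an illustration of point 5, the success branch of iaa_decompress() now
looks roughly as follows (per the hunk below), so decomp_bytes is no longer
accounted on error or fallback paths:

    if (ret) {
            /* error / software-fallback path: no decomp_bytes accounting */
    } else {
            req->dlen = idxd_desc->iax_completion->output_size;

            /* Update stats only when the hardware op actually succeeded. */
            update_total_decomp_bytes_in(slen);
            update_wq_decomp_bytes(wq, slen);
    }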
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
drivers/crypto/intel/iaa/iaa_crypto_main.c | 107 +++++----------------
1 file changed, 22 insertions(+), 85 deletions(-)
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 9de7a8a4d7a8..44d4e2494bf3 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1596,7 +1596,7 @@ static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
struct idxd_wq *wq,
dma_addr_t src_addr, unsigned int slen,
- dma_addr_t dst_addr, unsigned int *dlen)
+ dma_addr_t dst_addr, unsigned int dlen)
{
struct iaa_device_compression_mode *active_compression_mode;
struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
@@ -1620,10 +1620,8 @@ static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
if (IS_ERR(idxd_desc)) {
- dev_dbg(dev, "idxd descriptor allocation failed\n");
- dev_dbg(dev, "iaa compress failed: ret=%ld\n",
- PTR_ERR(idxd_desc));
- return PTR_ERR(idxd_desc);
+ dev_dbg(dev, "iaa compress_verify failed: idxd descriptor allocation failure: ret=%ld\n", PTR_ERR(idxd_desc));
+ return -ENODEV;
}
desc = idxd_desc->iax_hw;
@@ -1635,19 +1633,11 @@ static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
desc->priv = 0;
desc->src1_addr = (u64)dst_addr;
- desc->src1_size = *dlen;
+ desc->src1_size = dlen;
desc->dst_addr = (u64)src_addr;
desc->max_dst_size = slen;
desc->completion_addr = idxd_desc->compl_dma;
- dev_dbg(dev, "(verify) compression mode %s,"
- " desc->src1_addr %llx, desc->src1_size %d,"
- " desc->dst_addr %llx, desc->max_dst_size %d,"
- " desc->src2_addr %llx, desc->src2_size %d\n",
- active_compression_mode->name,
- desc->src1_addr, desc->src1_size, desc->dst_addr,
- desc->max_dst_size, desc->src2_addr, desc->src2_size);
-
ret = idxd_submit_desc(wq, idxd_desc);
if (ret) {
dev_dbg(dev, "submit_desc (verify) failed ret=%d\n", ret);
@@ -1670,14 +1660,10 @@ static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
goto err;
}
- idxd_free_desc(wq, idxd_desc);
-out:
- return ret;
err:
idxd_free_desc(wq, idxd_desc);
- dev_dbg(dev, "iaa compress failed: ret=%d\n", ret);
- goto out;
+ return ret;
}
static void iaa_desc_complete(struct idxd_desc *idxd_desc,
@@ -1757,7 +1743,7 @@ static void iaa_desc_complete(struct idxd_desc *idxd_desc,
}
ret = iaa_compress_verify(ctx->tfm, ctx->req, iaa_wq->wq, src_addr,
- ctx->req->slen, dst_addr, &ctx->req->dlen);
+ ctx->req->slen, dst_addr, ctx->req->dlen);
if (ret) {
dev_dbg(dev, "%s: compress verify failed ret=%d\n", __func__, ret);
err = -EIO;
@@ -1783,7 +1769,7 @@ static void iaa_desc_complete(struct idxd_desc *idxd_desc,
iaa_wq_put(idxd_desc->wq);
}
-static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
+static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
struct idxd_wq *wq,
dma_addr_t src_addr, unsigned int slen,
dma_addr_t dst_addr, unsigned int *dlen)
@@ -1810,9 +1796,9 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
if (IS_ERR(idxd_desc)) {
- dev_dbg(dev, "idxd descriptor allocation failed\n");
- dev_dbg(dev, "iaa compress failed: ret=%ld\n", PTR_ERR(idxd_desc));
- return PTR_ERR(idxd_desc);
+ dev_dbg(dev, "iaa compress failed: idxd descriptor allocation failure: ret=%ld\n",
+ PTR_ERR(idxd_desc));
+ return -ENODEV;
}
desc = idxd_desc->iax_hw;
@@ -1838,21 +1824,8 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
idxd_desc->crypto.src_addr = src_addr;
idxd_desc->crypto.dst_addr = dst_addr;
idxd_desc->crypto.compress = true;
-
- dev_dbg(dev, "%s use_async_irq: compression mode %s,"
- " src_addr %llx, dst_addr %llx\n", __func__,
- active_compression_mode->name,
- src_addr, dst_addr);
}
- dev_dbg(dev, "%s: compression mode %s,"
- " desc->src1_addr %llx, desc->src1_size %d,"
- " desc->dst_addr %llx, desc->max_dst_size %d,"
- " desc->src2_addr %llx, desc->src2_size %d\n", __func__,
- active_compression_mode->name,
- desc->src1_addr, desc->src1_size, desc->dst_addr,
- desc->max_dst_size, desc->src2_addr, desc->src2_size);
-
ret = idxd_submit_desc(wq, idxd_desc);
if (ret) {
dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
@@ -1865,7 +1838,6 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
if (ctx->async_mode) {
ret = -EINPROGRESS;
- dev_dbg(dev, "%s: returning -EINPROGRESS\n", __func__);
goto out;
}
@@ -1883,15 +1855,10 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
*compression_crc = idxd_desc->iax_completion->crc;
- if (!ctx->async_mode)
- idxd_free_desc(wq, idxd_desc);
-out:
- return ret;
err:
idxd_free_desc(wq, idxd_desc);
- dev_dbg(dev, "iaa compress failed: ret=%d\n", ret);
-
- goto out;
+out:
+ return ret;
}
static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
@@ -1920,10 +1887,10 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
if (IS_ERR(idxd_desc)) {
- dev_dbg(dev, "idxd descriptor allocation failed\n");
- dev_dbg(dev, "iaa decompress failed: ret=%ld\n",
+ ret = -ENODEV;
+ dev_dbg(dev, "%s: idxd descriptor allocation failed: ret=%ld\n", __func__,
PTR_ERR(idxd_desc));
- return PTR_ERR(idxd_desc);
+ return ret;
}
desc = idxd_desc->iax_hw;
@@ -1947,21 +1914,8 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
idxd_desc->crypto.src_addr = src_addr;
idxd_desc->crypto.dst_addr = dst_addr;
idxd_desc->crypto.compress = false;
-
- dev_dbg(dev, "%s: use_async_irq compression mode %s,"
- " src_addr %llx, dst_addr %llx\n", __func__,
- active_compression_mode->name,
- src_addr, dst_addr);
}
- dev_dbg(dev, "%s: decompression mode %s,"
- " desc->src1_addr %llx, desc->src1_size %d,"
- " desc->dst_addr %llx, desc->max_dst_size %d,"
- " desc->src2_addr %llx, desc->src2_size %d\n", __func__,
- active_compression_mode->name,
- desc->src1_addr, desc->src1_size, desc->dst_addr,
- desc->max_dst_size, desc->src2_addr, desc->src2_size);
-
ret = idxd_submit_desc(wq, idxd_desc);
if (ret) {
dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
@@ -1974,7 +1928,6 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
if (ctx->async_mode) {
ret = -EINPROGRESS;
- dev_dbg(dev, "%s: returning -EINPROGRESS\n", __func__);
goto out;
}
@@ -1996,23 +1949,19 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
}
} else {
req->dlen = idxd_desc->iax_completion->output_size;
+
+ /* Update stats */
+ update_total_decomp_bytes_in(slen);
+ update_wq_decomp_bytes(wq, slen);
}
*dlen = req->dlen;
- if (!ctx->async_mode)
+err:
+ if (idxd_desc)
idxd_free_desc(wq, idxd_desc);
-
- /* Update stats */
- update_total_decomp_bytes_in(slen);
- update_wq_decomp_bytes(wq, slen);
out:
return ret;
-err:
- idxd_free_desc(wq, idxd_desc);
- dev_dbg(dev, "iaa decompress failed: ret=%d\n", ret);
-
- goto out;
}
static int iaa_comp_acompress(struct acomp_req *req)
@@ -2059,9 +2008,6 @@ static int iaa_comp_acompress(struct acomp_req *req)
goto out;
}
src_addr = sg_dma_address(req->src);
- dev_dbg(dev, "dma_map_sg, src_addr %llx, nr_sgs %d, req->src %p,"
- " req->slen %d, sg_dma_len(sg) %d\n", src_addr, nr_sgs,
- req->src, req->slen, sg_dma_len(req->src));
nr_sgs = dma_map_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
if (nr_sgs <= 0 || nr_sgs > 1) {
@@ -2072,9 +2018,6 @@ static int iaa_comp_acompress(struct acomp_req *req)
goto err_map_dst;
}
dst_addr = sg_dma_address(req->dst);
- dev_dbg(dev, "dma_map_sg, dst_addr %llx, nr_sgs %d, req->dst %p,"
- " req->dlen %d, sg_dma_len(sg) %d\n", dst_addr, nr_sgs,
- req->dst, req->dlen, sg_dma_len(req->dst));
ret = iaa_compress(tfm, req, wq, src_addr, req->slen, dst_addr,
&req->dlen);
@@ -2089,7 +2032,7 @@ static int iaa_comp_acompress(struct acomp_req *req)
}
ret = iaa_compress_verify(tfm, req, wq, src_addr, req->slen,
- dst_addr, &req->dlen);
+ dst_addr, req->dlen);
if (ret)
dev_dbg(dev, "asynchronous compress verification failed ret=%d\n", ret);
@@ -2152,9 +2095,6 @@ static int iaa_comp_adecompress(struct acomp_req *req)
goto out;
}
src_addr = sg_dma_address(req->src);
- dev_dbg(dev, "dma_map_sg, src_addr %llx, nr_sgs %d, req->src %p,"
- " req->slen %d, sg_dma_len(sg) %d\n", src_addr, nr_sgs,
- req->src, req->slen, sg_dma_len(req->src));
nr_sgs = dma_map_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
if (nr_sgs <= 0 || nr_sgs > 1) {
@@ -2165,9 +2105,6 @@ static int iaa_comp_adecompress(struct acomp_req *req)
goto err_map_dst;
}
dst_addr = sg_dma_address(req->dst);
- dev_dbg(dev, "dma_map_sg, dst_addr %llx, nr_sgs %d, req->dst %p,"
- " req->dlen %d, sg_dma_len(sg) %d\n", dst_addr, nr_sgs,
- req->dst, req->dlen, sg_dma_len(req->dst));
ret = iaa_decompress(tfm, req, wq, src_addr, req->slen,
dst_addr, &req->dlen);
--
2.27.0
* [PATCH v13 04/22] crypto: iaa - Descriptor allocation timeouts with mitigations.
2025-11-04 9:12 [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
` (2 preceding siblings ...)
2025-11-04 9:12 ` [PATCH v13 03/22] crypto: iaa - Simplify, consistency of function parameters, minor stats bug fix Kanchana P Sridhar
@ 2025-11-04 9:12 ` Kanchana P Sridhar
2025-11-04 9:12 ` [PATCH v13 05/22] crypto: iaa - iaa_wq uses percpu_refs for get/put reference counting Kanchana P Sridhar
` (18 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Kanchana P Sridhar @ 2025-11-04 9:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, herbert, davem,
clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
vinicius.gomes
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch modifies the descriptor allocation from blocking to
non-blocking with bounded retries or "timeouts".
This is necessary to prevent task blocked errors in high contention
scenarios, for instance, when the platform has only 1 IAA device
enabled. With 1 IAA device enabled per package on a dual-package
Sapphire Rapids with 56 cores/package, there are 112 logical cores
mapped to this single IAA device. In this scenario, the task blocked
errors can occur because idxd_alloc_desc() is called with
IDXD_OP_BLOCK. With batching, multiple descriptors will need to be
allocated per batch. Any process that is able to do so can cause
contention for allocating descriptors for all other processes that share
the use of the same sbitmap_queue. Under IDXD_OP_BLOCK, this causes
compress/decompress jobs to stall in stress test scenarios
(e.g. zswap_store() of 2M folios).
To make the iaa_crypto driver more fail-safe, this commit
implements the following:
1) Change compress/decompress descriptor allocations to be non-blocking
with retries ("timeouts").
2) Return a compression error to zswap if descriptor allocation with timeouts
fails during compress ops. zswap_store() will return an error and the
folio gets stored in the backing swap device.
3) Fall back to software decompression if descriptor allocation with timeouts
fails during decompress ops.
With these fixes, there are no task blocked errors seen under stress
testing conditions, and no performance degradation observed.
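For illustration, a consolidated sketch of the bounded-retry allocation
pattern used in the hunks below (the compress path bounds the loop with
ctx->alloc_comp_desc_timeout; the decompress and verify paths use
ctx->alloc_decomp_desc_timeout):

    struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
    u16 alloc_desc_retries = 0;

    /* Bounded, non-blocking allocation instead of IDXD_OP_BLOCK. */
    while ((idxd_desc == ERR_PTR(-EAGAIN)) &&
           (alloc_desc_retries++ < ctx->alloc_comp_desc_timeout)) {
            idxd_desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
            cpu_relax();
    }

    if (IS_ERR(idxd_desc))
            return -ENODEV; /* compress: error to zswap; decompress: software fallback */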
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
drivers/crypto/intel/iaa/iaa_crypto.h | 5 ++
drivers/crypto/intel/iaa/iaa_crypto_main.c | 58 +++++++++++++++-------
2 files changed, 44 insertions(+), 19 deletions(-)
diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index 549ac98a9366..cc76a047b54a 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -21,6 +21,9 @@
#define IAA_COMPLETION_TIMEOUT 1000000
+#define IAA_ALLOC_DESC_COMP_TIMEOUT 1000
+#define IAA_ALLOC_DESC_DECOMP_TIMEOUT 500
+
#define IAA_ANALYTICS_ERROR 0x0a
#define IAA_ERROR_DECOMP_BUF_OVERFLOW 0x0b
#define IAA_ERROR_COMP_BUF_OVERFLOW 0x19
@@ -141,6 +144,8 @@ enum iaa_mode {
struct iaa_compression_ctx {
enum iaa_mode mode;
+ u16 alloc_comp_desc_timeout;
+ u16 alloc_decomp_desc_timeout;
bool verify_compress;
bool async_mode;
bool use_irq;
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 44d4e2494bf3..89e59ef89a69 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1602,7 +1602,8 @@ static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
u32 *compression_crc = acomp_request_ctx(req);
struct iaa_device *iaa_device;
- struct idxd_desc *idxd_desc;
+ struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
+ u16 alloc_desc_retries = 0;
struct iax_hw_desc *desc;
struct idxd_device *idxd;
struct iaa_wq *iaa_wq;
@@ -1618,7 +1619,11 @@ static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
- idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
+ while ((idxd_desc == ERR_PTR(-EAGAIN)) && (alloc_desc_retries++ < ctx->alloc_decomp_desc_timeout)) {
+ idxd_desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
+ cpu_relax();
+ }
+
if (IS_ERR(idxd_desc)) {
dev_dbg(dev, "iaa compress_verify failed: idxd descriptor allocation failure: ret=%ld\n", PTR_ERR(idxd_desc));
return -ENODEV;
@@ -1778,7 +1783,8 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
u32 *compression_crc = acomp_request_ctx(req);
struct iaa_device *iaa_device;
- struct idxd_desc *idxd_desc;
+ struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
+ u16 alloc_desc_retries = 0;
struct iax_hw_desc *desc;
struct idxd_device *idxd;
struct iaa_wq *iaa_wq;
@@ -1794,7 +1800,11 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
- idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
+ while ((idxd_desc == ERR_PTR(-EAGAIN)) && (alloc_desc_retries++ < ctx->alloc_comp_desc_timeout)) {
+ idxd_desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
+ cpu_relax();
+ }
+
if (IS_ERR(idxd_desc)) {
dev_dbg(dev, "iaa compress failed: idxd descriptor allocation failure: ret=%ld\n",
PTR_ERR(idxd_desc));
@@ -1869,7 +1879,8 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
struct iaa_device_compression_mode *active_compression_mode;
struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
struct iaa_device *iaa_device;
- struct idxd_desc *idxd_desc;
+ struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
+ u16 alloc_desc_retries = 0;
struct iax_hw_desc *desc;
struct idxd_device *idxd;
struct iaa_wq *iaa_wq;
@@ -1885,12 +1896,17 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
- idxd_desc = idxd_alloc_desc(wq, IDXD_OP_BLOCK);
+ while ((idxd_desc == ERR_PTR(-EAGAIN)) && (alloc_desc_retries++ < ctx->alloc_decomp_desc_timeout)) {
+ idxd_desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
+ cpu_relax();
+ }
+
if (IS_ERR(idxd_desc)) {
ret = -ENODEV;
dev_dbg(dev, "%s: idxd descriptor allocation failed: ret=%ld\n", __func__,
PTR_ERR(idxd_desc));
- return ret;
+ idxd_desc = NULL;
+ goto fallback_software_decomp;
}
desc = idxd_desc->iax_hw;
@@ -1919,7 +1935,7 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
ret = idxd_submit_desc(wq, idxd_desc);
if (ret) {
dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
- goto err;
+ goto fallback_software_decomp;
}
/* Update stats */
@@ -1932,19 +1948,21 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
}
ret = check_completion(dev, idxd_desc->iax_completion, false, false);
+
+fallback_software_decomp:
if (ret) {
- dev_dbg(dev, "%s: check_completion failed ret=%d\n", __func__, ret);
- if (idxd_desc->iax_completion->status == IAA_ANALYTICS_ERROR) {
+ dev_dbg(dev, "%s: desc allocation/submission/check_completion failed ret=%d\n", __func__, ret);
+ if (idxd_desc && idxd_desc->iax_completion->status == IAA_ANALYTICS_ERROR) {
pr_warn("%s: falling back to deflate-generic decompress, "
"analytics error code %x\n", __func__,
idxd_desc->iax_completion->error_code);
- ret = deflate_generic_decompress(req);
- if (ret) {
- dev_dbg(dev, "%s: deflate-generic failed ret=%d\n",
- __func__, ret);
- goto err;
- }
- } else {
+ }
+
+ ret = deflate_generic_decompress(req);
+
+ if (ret) {
+ pr_err("%s: iaa decompress failed: deflate-generic fallback error ret=%d\n",
+ __func__, ret);
goto err;
}
} else {
@@ -2125,6 +2143,8 @@ static int iaa_comp_adecompress(struct acomp_req *req)
static void compression_ctx_init(struct iaa_compression_ctx *ctx)
{
+ ctx->alloc_comp_desc_timeout = IAA_ALLOC_DESC_COMP_TIMEOUT;
+ ctx->alloc_decomp_desc_timeout = IAA_ALLOC_DESC_DECOMP_TIMEOUT;
ctx->verify_compress = iaa_verify_compress;
ctx->async_mode = async_mode;
ctx->use_irq = use_irq;
@@ -2139,10 +2159,10 @@ static int iaa_comp_init_fixed(struct crypto_acomp *acomp_tfm)
struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
- compression_ctx_init(ctx);
-
ctx->mode = IAA_MODE_FIXED;
+ compression_ctx_init(ctx);
+
return 0;
}
--
2.27.0
* [PATCH v13 05/22] crypto: iaa - iaa_wq uses percpu_refs for get/put reference counting.
2025-11-04 9:12 [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
` (3 preceding siblings ...)
2025-11-04 9:12 ` [PATCH v13 04/22] crypto: iaa - Descriptor allocation timeouts with mitigations Kanchana P Sridhar
@ 2025-11-04 9:12 ` Kanchana P Sridhar
2025-11-04 9:12 ` [PATCH v13 06/22] crypto: iaa - Simplify the code flow in iaa_compress() and iaa_decompress() Kanchana P Sridhar
` (17 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Kanchana P Sridhar @ 2025-11-04 9:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, herbert, davem,
clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
vinicius.gomes
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch modifies the reference counting on "struct iaa_wq" to be a
percpu_ref in atomic mode, instead of an "int refcount" combined with
the "idxd->dev_lock" spin_lock currently used as a synchronization
mechanism to achieve get/put semantics.
This enables a lighter-weight, cleaner and more effective refcount
implementation for the iaa_wq that prevents race conditions and
significantly reduces the latency of batch compress/decompress jobs
submitted to the IAA accelerator.
For a single-threaded madvise-based workload with the Silesia.tar
dataset, these are the before/after batch compression latencies for a
compress batch of 8 pages:
==================================
p50 (ns) p99 (ns)
==================================
before 5,576 5,992
after 5,472 5,848
Change -104 -144
==================================
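A consolidated sketch of the new iaa_wq reference lifecycle, mirroring the
hunks below:

    /* add_iaa_wq(): the ref starts in atomic mode; release marks the wq free. */
    ret = percpu_ref_init(&iaa_wq->ref, __iaa_wq_release,
                          PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);

    /* Per compress/decompress op: lock-free get/put, no idxd->dev_lock. */
    iaa_wq = wq ? idxd_wq_get_private(wq) : NULL;
    if (unlikely(!iaa_wq || !percpu_ref_tryget(&iaa_wq->ref)))
            return -ENODEV;
    /* ... submit the descriptor and wait for completion ... */
    percpu_ref_put(&iaa_wq->ref);

    /* iaa_crypto_remove(): drop the initial ref, wait for in-flight users. */
    percpu_ref_kill(&iaa_wq->ref);
    while (!iaa_wq->free)
            cpu_relax();
    __free_iaa_wq(iaa_wq);  /* checks the ref is zero and calls percpu_ref_exit() */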
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
drivers/crypto/intel/iaa/iaa_crypto.h | 4 +-
drivers/crypto/intel/iaa/iaa_crypto_main.c | 119 +++++++--------------
2 files changed, 41 insertions(+), 82 deletions(-)
diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index cc76a047b54a..9611f2518f42 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -47,8 +47,8 @@ struct iaa_wq {
struct list_head list;
struct idxd_wq *wq;
- int ref;
- bool remove;
+ struct percpu_ref ref;
+ bool free;
bool mapped;
struct iaa_device *iaa_device;
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 89e59ef89a69..ca53445a0a7f 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -701,7 +701,7 @@ static void del_iaa_device(struct iaa_device *iaa_device)
static void free_iaa_device(struct iaa_device *iaa_device)
{
- if (!iaa_device)
+ if (!iaa_device || iaa_device->n_wq)
return;
remove_device_compression_modes(iaa_device);
@@ -731,6 +731,13 @@ static bool iaa_has_wq(struct iaa_device *iaa_device, struct idxd_wq *wq)
return false;
}
+static void __iaa_wq_release(struct percpu_ref *ref)
+{
+ struct iaa_wq *iaa_wq = container_of(ref, typeof(*iaa_wq), ref);
+
+ iaa_wq->free = true;
+}
+
static int add_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq,
struct iaa_wq **new_wq)
{
@@ -738,11 +745,20 @@ static int add_iaa_wq(struct iaa_device *iaa_device, struct idxd_wq *wq,
struct pci_dev *pdev = idxd->pdev;
struct device *dev = &pdev->dev;
struct iaa_wq *iaa_wq;
+ int ret;
iaa_wq = kzalloc(sizeof(*iaa_wq), GFP_KERNEL);
if (!iaa_wq)
return -ENOMEM;
+ ret = percpu_ref_init(&iaa_wq->ref, __iaa_wq_release,
+ PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);
+
+ if (ret) {
+ kfree(iaa_wq);
+ return -ENOMEM;
+ }
+
iaa_wq->wq = wq;
iaa_wq->iaa_device = iaa_device;
idxd_wq_set_private(wq, iaa_wq);
@@ -818,6 +834,9 @@ static void __free_iaa_wq(struct iaa_wq *iaa_wq)
if (!iaa_wq)
return;
+ WARN_ON(!percpu_ref_is_zero(&iaa_wq->ref));
+ percpu_ref_exit(&iaa_wq->ref);
+
iaa_device = iaa_wq->iaa_device;
if (iaa_device->n_wq == 0)
free_iaa_device(iaa_wq->iaa_device);
@@ -912,53 +931,6 @@ static int save_iaa_wq(struct idxd_wq *wq)
return 0;
}
-static int iaa_wq_get(struct idxd_wq *wq)
-{
- struct idxd_device *idxd = wq->idxd;
- struct iaa_wq *iaa_wq;
- int ret = 0;
-
- spin_lock(&idxd->dev_lock);
- iaa_wq = idxd_wq_get_private(wq);
- if (iaa_wq && !iaa_wq->remove) {
- iaa_wq->ref++;
- idxd_wq_get(wq);
- } else {
- ret = -ENODEV;
- }
- spin_unlock(&idxd->dev_lock);
-
- return ret;
-}
-
-static int iaa_wq_put(struct idxd_wq *wq)
-{
- struct idxd_device *idxd = wq->idxd;
- struct iaa_wq *iaa_wq;
- bool free = false;
- int ret = 0;
-
- spin_lock(&idxd->dev_lock);
- iaa_wq = idxd_wq_get_private(wq);
- if (iaa_wq) {
- iaa_wq->ref--;
- if (iaa_wq->ref == 0 && iaa_wq->remove) {
- idxd_wq_set_private(wq, NULL);
- free = true;
- }
- idxd_wq_put(wq);
- } else {
- ret = -ENODEV;
- }
- spin_unlock(&idxd->dev_lock);
- if (free) {
- __free_iaa_wq(iaa_wq);
- kfree(iaa_wq);
- }
-
- return ret;
-}
-
/***************************************************************
* Mapping IAA devices and wqs to cores with per-cpu wq_tables.
***************************************************************/
@@ -1771,7 +1743,7 @@ static void iaa_desc_complete(struct idxd_desc *idxd_desc,
if (free_desc)
idxd_free_desc(idxd_desc->wq, idxd_desc);
- iaa_wq_put(idxd_desc->wq);
+ percpu_ref_put(&iaa_wq->ref);
}
static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
@@ -2002,19 +1974,13 @@ static int iaa_comp_acompress(struct acomp_req *req)
cpu = get_cpu();
wq = comp_wq_table_next_wq(cpu);
put_cpu();
- if (!wq) {
- pr_debug("no wq configured for cpu=%d\n", cpu);
- return -ENODEV;
- }
- ret = iaa_wq_get(wq);
- if (ret) {
+ iaa_wq = wq ? idxd_wq_get_private(wq) : NULL;
+ if (unlikely(!iaa_wq || !percpu_ref_tryget(&iaa_wq->ref))) {
pr_debug("no wq available for cpu=%d\n", cpu);
return -ENODEV;
}
- iaa_wq = idxd_wq_get_private(wq);
-
dev = &wq->idxd->pdev->dev;
nr_sgs = dma_map_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
@@ -2067,7 +2033,7 @@ static int iaa_comp_acompress(struct acomp_req *req)
err_map_dst:
dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
out:
- iaa_wq_put(wq);
+ percpu_ref_put(&iaa_wq->ref);
return ret;
}
@@ -2089,19 +2055,13 @@ static int iaa_comp_adecompress(struct acomp_req *req)
cpu = get_cpu();
wq = decomp_wq_table_next_wq(cpu);
put_cpu();
- if (!wq) {
- pr_debug("no wq configured for cpu=%d\n", cpu);
- return -ENODEV;
- }
- ret = iaa_wq_get(wq);
- if (ret) {
+ iaa_wq = wq ? idxd_wq_get_private(wq) : NULL;
+ if (unlikely(!iaa_wq || !percpu_ref_tryget(&iaa_wq->ref))) {
pr_debug("no wq available for cpu=%d\n", cpu);
- return -ENODEV;
+ return deflate_generic_decompress(req);
}
- iaa_wq = idxd_wq_get_private(wq);
-
dev = &wq->idxd->pdev->dev;
nr_sgs = dma_map_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
@@ -2136,7 +2096,7 @@ static int iaa_comp_adecompress(struct acomp_req *req)
err_map_dst:
dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
out:
- iaa_wq_put(wq);
+ percpu_ref_put(&iaa_wq->ref);
return ret;
}
@@ -2309,7 +2269,6 @@ static void iaa_crypto_remove(struct idxd_dev *idxd_dev)
struct idxd_wq *wq = idxd_dev_to_wq(idxd_dev);
struct idxd_device *idxd = wq->idxd;
struct iaa_wq *iaa_wq;
- bool free = false;
atomic_set(&iaa_crypto_enabled, 0);
idxd_wq_quiesce(wq);
@@ -2330,18 +2289,18 @@ static void iaa_crypto_remove(struct idxd_dev *idxd_dev)
goto out;
}
- if (iaa_wq->ref) {
- iaa_wq->remove = true;
- } else {
- wq = iaa_wq->wq;
- idxd_wq_set_private(wq, NULL);
- free = true;
- }
+ /* Drop the initial reference. */
+ percpu_ref_kill(&iaa_wq->ref);
+
+ while (!iaa_wq->free)
+ cpu_relax();
+
+ __free_iaa_wq(iaa_wq);
+
+ idxd_wq_set_private(wq, NULL);
spin_unlock(&idxd->dev_lock);
- if (free) {
- __free_iaa_wq(iaa_wq);
- kfree(iaa_wq);
- }
+
+ kfree(iaa_wq);
idxd_drv_disable_wq(wq);
--
2.27.0
* [PATCH v13 06/22] crypto: iaa - Simplify the code flow in iaa_compress() and iaa_decompress().
2025-11-04 9:12 [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
` (4 preceding siblings ...)
2025-11-04 9:12 ` [PATCH v13 05/22] crypto: iaa - iaa_wq uses percpu_refs for get/put reference counting Kanchana P Sridhar
@ 2025-11-04 9:12 ` Kanchana P Sridhar
2025-11-04 9:12 ` [PATCH v13 07/22] crypto: iaa - Refactor hardware descriptor setup into separate procedures Kanchana P Sridhar
` (16 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Kanchana P Sridhar @ 2025-11-04 9:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, herbert, davem,
clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
vinicius.gomes
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This commit simplifies and streamlines the logic in the core
iaa_compress() and iaa_decompress() routines and eliminates branches.
This makes it easier to add improvements such as polling for job
completions, which is essential for batching with hardware
parallelism.
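The resulting shape of the submit path in the common (non-irq) case, sketched
here for iaa_compress(); iaa_decompress() follows the same structure with a
fallback_software_decomp label instead of a plain error exit:

    if (likely(!ctx->use_irq)) {
            ret = idxd_submit_desc(wq, idxd_desc);
            if (ret)
                    goto out;

            /* Async (poll-later) mode: the caller reaps the completion. */
            if (ctx->async_mode)
                    return -EINPROGRESS;

            /* Synchronous mode: poll the completion record in place. */
            ret = check_completion(dev, idxd_desc->iax_completion, true, false);
            if (ret)
                    goto out;

            *dlen = idxd_desc->iax_completion->output_size;
    } else {
            /* IRQ mode: record completion context, submit, return -EINPROGRESS. */
    }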
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
drivers/crypto/intel/iaa/iaa_crypto_main.c | 114 ++++++++++++---------
1 file changed, 67 insertions(+), 47 deletions(-)
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index ca53445a0a7f..74d5b451e34b 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1798,7 +1798,34 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
desc->src2_size = sizeof(struct aecs_comp_table_record);
desc->completion_addr = idxd_desc->compl_dma;
- if (ctx->use_irq) {
+ if (likely(!ctx->use_irq)) {
+ ret = idxd_submit_desc(wq, idxd_desc);
+ if (ret) {
+ dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
+ goto out;
+ }
+
+ /* Update stats */
+ update_total_comp_calls();
+ update_wq_comp_calls(wq);
+
+ if (ctx->async_mode)
+ return -EINPROGRESS;
+
+ ret = check_completion(dev, idxd_desc->iax_completion, true, false);
+ if (ret) {
+ dev_dbg(dev, "check_completion failed ret=%d\n", ret);
+ goto out;
+ }
+
+ *dlen = idxd_desc->iax_completion->output_size;
+
+ /* Update stats */
+ update_total_comp_bytes_out(*dlen);
+ update_wq_comp_bytes(wq, *dlen);
+
+ *compression_crc = idxd_desc->iax_completion->crc;
+ } else {
desc->flags |= IDXD_OP_FLAG_RCI;
idxd_desc->crypto.req = req;
@@ -1806,40 +1833,23 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
idxd_desc->crypto.src_addr = src_addr;
idxd_desc->crypto.dst_addr = dst_addr;
idxd_desc->crypto.compress = true;
- }
-
- ret = idxd_submit_desc(wq, idxd_desc);
- if (ret) {
- dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
- goto err;
- }
- /* Update stats */
- update_total_comp_calls();
- update_wq_comp_calls(wq);
+ ret = idxd_submit_desc(wq, idxd_desc);
+ if (ret) {
+ dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
+ goto out;
+ }
- if (ctx->async_mode) {
- ret = -EINPROGRESS;
- goto out;
- }
+ /* Update stats */
+ update_total_comp_calls();
+ update_wq_comp_calls(wq);
- ret = check_completion(dev, idxd_desc->iax_completion, true, false);
- if (ret) {
- dev_dbg(dev, "check_completion failed ret=%d\n", ret);
- goto err;
+ return -EINPROGRESS;
}
- *dlen = idxd_desc->iax_completion->output_size;
-
- /* Update stats */
- update_total_comp_bytes_out(*dlen);
- update_wq_comp_bytes(wq, *dlen);
-
- *compression_crc = idxd_desc->iax_completion->crc;
-
-err:
- idxd_free_desc(wq, idxd_desc);
out:
+ idxd_free_desc(wq, idxd_desc);
+
return ret;
}
@@ -1894,7 +1904,22 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
desc->src1_size = slen;
desc->completion_addr = idxd_desc->compl_dma;
- if (ctx->use_irq) {
+ if (likely(!ctx->use_irq)) {
+ ret = idxd_submit_desc(wq, idxd_desc);
+ if (ret) {
+ dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
+ goto fallback_software_decomp;
+ }
+
+ /* Update stats */
+ update_total_decomp_calls();
+ update_wq_decomp_calls(wq);
+
+ if (ctx->async_mode)
+ return -EINPROGRESS;
+
+ ret = check_completion(dev, idxd_desc->iax_completion, false, false);
+ } else {
desc->flags |= IDXD_OP_FLAG_RCI;
idxd_desc->crypto.req = req;
@@ -1902,25 +1927,20 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
idxd_desc->crypto.src_addr = src_addr;
idxd_desc->crypto.dst_addr = dst_addr;
idxd_desc->crypto.compress = false;
- }
- ret = idxd_submit_desc(wq, idxd_desc);
- if (ret) {
- dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
- goto fallback_software_decomp;
- }
+ ret = idxd_submit_desc(wq, idxd_desc);
+ if (ret) {
+ dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
+ goto fallback_software_decomp;
+ }
- /* Update stats */
- update_total_decomp_calls();
- update_wq_decomp_calls(wq);
+ /* Update stats */
+ update_total_decomp_calls();
+ update_wq_decomp_calls(wq);
- if (ctx->async_mode) {
- ret = -EINPROGRESS;
- goto out;
+ return -EINPROGRESS;
}
- ret = check_completion(dev, idxd_desc->iax_completion, false, false);
-
fallback_software_decomp:
if (ret) {
dev_dbg(dev, "%s: desc allocation/submission/check_completion failed ret=%d\n", __func__, ret);
@@ -1935,7 +1955,7 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
if (ret) {
pr_err("%s: iaa decompress failed: deflate-generic fallback error ret=%d\n",
__func__, ret);
- goto err;
+ goto out;
}
} else {
req->dlen = idxd_desc->iax_completion->output_size;
@@ -1947,10 +1967,10 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
*dlen = req->dlen;
-err:
+out:
if (idxd_desc)
idxd_free_desc(wq, idxd_desc);
-out:
+
return ret;
}
--
2.27.0
* [PATCH v13 07/22] crypto: iaa - Refactor hardware descriptor setup into separate procedures.
2025-11-04 9:12 [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
` (5 preceding siblings ...)
2025-11-04 9:12 ` [PATCH v13 06/22] crypto: iaa - Simplify the code flow in iaa_compress() and iaa_decompress() Kanchana P Sridhar
@ 2025-11-04 9:12 ` Kanchana P Sridhar
2025-11-04 9:12 ` [PATCH v13 08/22] crypto: iaa - Simplified, efficient job submissions for non-irq mode Kanchana P Sridhar
` (15 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Kanchana P Sridhar @ 2025-11-04 9:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, herbert, davem,
clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
vinicius.gomes
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch refactors the code that sets up the "struct iax_hw_desc" for
compress/decompress ops into distinct procedures, to make the code more
readable.
Also, get_iaa_device_compression_mode() is deleted and the compression
mode is accessed directly from the iaa_device in the calling procedures.
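For example, the call site in iaa_compress() reduces to a single helper call
(the arguments after *dlen are inferred from the new helper's signature, since
the quoted hunk is truncated below):

    desc = iaa_setup_compress_hw_desc(idxd_desc, src_addr, slen, dst_addr, *dlen,
                                      ctx->mode,
                                      iaa_device->compression_modes[ctx->mode]);

The decompress path is expected to use the simpler helper, which needs no AECS
source-2 setup:

    desc = iaa_setup_decompress_hw_desc(idxd_desc, src_addr, slen, dst_addr, *dlen);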
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
drivers/crypto/intel/iaa/iaa_crypto_main.c | 99 ++++++++++++----------
1 file changed, 56 insertions(+), 43 deletions(-)
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 74d5b451e34b..697e98785335 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -483,12 +483,6 @@ int add_iaa_compression_mode(const char *name,
}
EXPORT_SYMBOL_GPL(add_iaa_compression_mode);
-static struct iaa_device_compression_mode *
-get_iaa_device_compression_mode(struct iaa_device *iaa_device, int idx)
-{
- return iaa_device->compression_modes[idx];
-}
-
static void free_device_compression_mode(struct iaa_device *iaa_device,
struct iaa_device_compression_mode *device_mode)
{
@@ -1570,7 +1564,6 @@ static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
dma_addr_t src_addr, unsigned int slen,
dma_addr_t dst_addr, unsigned int dlen)
{
- struct iaa_device_compression_mode *active_compression_mode;
struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
u32 *compression_crc = acomp_request_ctx(req);
struct iaa_device *iaa_device;
@@ -1589,8 +1582,6 @@ static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
pdev = idxd->pdev;
dev = &pdev->dev;
- active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
-
while ((idxd_desc == ERR_PTR(-EAGAIN)) && (alloc_desc_retries++ < ctx->alloc_decomp_desc_timeout)) {
idxd_desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
cpu_relax();
@@ -1666,8 +1657,7 @@ static void iaa_desc_complete(struct idxd_desc *idxd_desc,
pdev = idxd->pdev;
dev = &pdev->dev;
- active_compression_mode = get_iaa_device_compression_mode(iaa_device,
- compression_ctx->mode);
+ active_compression_mode = iaa_device->compression_modes[compression_ctx->mode];
dev_dbg(dev, "%s: compression mode %s,"
" ctx->src_addr %llx, ctx->dst_addr %llx\n", __func__,
active_compression_mode->name,
@@ -1746,12 +1736,63 @@ static void iaa_desc_complete(struct idxd_desc *idxd_desc,
percpu_ref_put(&iaa_wq->ref);
}
+static struct iax_hw_desc *
+iaa_setup_compress_hw_desc(struct idxd_desc *idxd_desc,
+ dma_addr_t src_addr,
+ unsigned int slen,
+ dma_addr_t dst_addr,
+ unsigned int dlen,
+ enum iaa_mode mode,
+ struct iaa_device_compression_mode *active_compression_mode)
+{
+ struct iax_hw_desc *desc = idxd_desc->iax_hw;
+
+ desc->flags = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CC;
+ desc->opcode = IAX_OPCODE_COMPRESS;
+ desc->compr_flags = IAA_COMP_FLAGS;
+ desc->priv = 0;
+
+ desc->src1_addr = (u64)src_addr;
+ desc->src1_size = slen;
+ desc->dst_addr = (u64)dst_addr;
+ desc->max_dst_size = dlen;
+ desc->flags |= IDXD_OP_FLAG_RD_SRC2_AECS;
+ desc->src2_addr = active_compression_mode->aecs_comp_table_dma_addr;
+ desc->src2_size = sizeof(struct aecs_comp_table_record);
+ desc->completion_addr = idxd_desc->compl_dma;
+
+ return desc;
+}
+
+static struct iax_hw_desc *
+iaa_setup_decompress_hw_desc(struct idxd_desc *idxd_desc,
+ dma_addr_t src_addr,
+ unsigned int slen,
+ dma_addr_t dst_addr,
+ unsigned int dlen)
+{
+ struct iax_hw_desc *desc = idxd_desc->iax_hw;
+
+ desc->flags = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CC;
+ desc->opcode = IAX_OPCODE_DECOMPRESS;
+ desc->max_dst_size = PAGE_SIZE;
+ desc->decompr_flags = IAA_DECOMP_FLAGS;
+ desc->priv = 0;
+
+ desc->src1_addr = (u64)src_addr;
+ desc->dst_addr = (u64)dst_addr;
+ desc->max_dst_size = dlen;
+ desc->src1_size = slen;
+ desc->completion_addr = idxd_desc->compl_dma;
+
+ return desc;
+}
+
static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
struct idxd_wq *wq,
dma_addr_t src_addr, unsigned int slen,
dma_addr_t dst_addr, unsigned int *dlen)
{
- struct iaa_device_compression_mode *active_compression_mode;
struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
u32 *compression_crc = acomp_request_ctx(req);
struct iaa_device *iaa_device;
@@ -1770,8 +1811,6 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
pdev = idxd->pdev;
dev = &pdev->dev;
- active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
-
while ((idxd_desc == ERR_PTR(-EAGAIN)) && (alloc_desc_retries++ < ctx->alloc_comp_desc_timeout)) {
idxd_desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
cpu_relax();
@@ -1782,21 +1821,9 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
PTR_ERR(idxd_desc));
return -ENODEV;
}
- desc = idxd_desc->iax_hw;
- desc->flags = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR |
- IDXD_OP_FLAG_RD_SRC2_AECS | IDXD_OP_FLAG_CC;
- desc->opcode = IAX_OPCODE_COMPRESS;
- desc->compr_flags = IAA_COMP_FLAGS;
- desc->priv = 0;
-
- desc->src1_addr = (u64)src_addr;
- desc->src1_size = slen;
- desc->dst_addr = (u64)dst_addr;
- desc->max_dst_size = *dlen;
- desc->src2_addr = active_compression_mode->aecs_comp_table_dma_addr;
- desc->src2_size = sizeof(struct aecs_comp_table_record);
- desc->completion_addr = idxd_desc->compl_dma;
+ desc = iaa_setup_compress_hw_desc(idxd_desc, src_addr, slen, dst_addr, *dlen,
+ ctx->mode, iaa_device->compression_modes[ctx->mode]);
if (likely(!ctx->use_irq)) {
ret = idxd_submit_desc(wq, idxd_desc);
@@ -1858,7 +1885,6 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
dma_addr_t src_addr, unsigned int slen,
dma_addr_t dst_addr, unsigned int *dlen)
{
- struct iaa_device_compression_mode *active_compression_mode;
struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
struct iaa_device *iaa_device;
struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
@@ -1876,8 +1902,6 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
pdev = idxd->pdev;
dev = &pdev->dev;
- active_compression_mode = get_iaa_device_compression_mode(iaa_device, ctx->mode);
-
while ((idxd_desc == ERR_PTR(-EAGAIN)) && (alloc_desc_retries++ < ctx->alloc_decomp_desc_timeout)) {
idxd_desc = idxd_alloc_desc(wq, IDXD_OP_NONBLOCK);
cpu_relax();
@@ -1890,19 +1914,8 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
idxd_desc = NULL;
goto fallback_software_decomp;
}
- desc = idxd_desc->iax_hw;
- desc->flags = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CC;
- desc->opcode = IAX_OPCODE_DECOMPRESS;
- desc->max_dst_size = PAGE_SIZE;
- desc->decompr_flags = IAA_DECOMP_FLAGS;
- desc->priv = 0;
-
- desc->src1_addr = (u64)src_addr;
- desc->dst_addr = (u64)dst_addr;
- desc->max_dst_size = *dlen;
- desc->src1_size = slen;
- desc->completion_addr = idxd_desc->compl_dma;
+ desc = iaa_setup_decompress_hw_desc(idxd_desc, src_addr, slen, dst_addr, *dlen);
if (likely(!ctx->use_irq)) {
ret = idxd_submit_desc(wq, idxd_desc);
--
2.27.0
* [PATCH v13 08/22] crypto: iaa - Simplified, efficient job submissions for non-irq mode.
2025-11-04 9:12 [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
` (6 preceding siblings ...)
2025-11-04 9:12 ` [PATCH v13 07/22] crypto: iaa - Refactor hardware descriptor setup into separate procedures Kanchana P Sridhar
@ 2025-11-04 9:12 ` Kanchana P Sridhar
2025-11-04 9:12 ` [PATCH v13 09/22] crypto: iaa - Deprecate exporting add/remove IAA compression modes Kanchana P Sridhar
` (14 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Kanchana P Sridhar @ 2025-11-04 9:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, herbert, davem,
clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
vinicius.gomes
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch adds a new procedure, iaa_submit_desc_movdir64b(), that
directly calls movdir64b. The core iaa_crypto routines that submit
compress and decompress jobs now invoke iaa_submit_desc_movdir64b() in
non-irq driver modes, instead of idxd_submit_desc().
idxd_submit_desc() is called only in irq mode.
This improves latency for the most common iaa_crypto usage in zswap
(i.e., async non-irq) by eliminating redundant computation that would
otherwise be incurred in idxd_submit_desc():
For a single-threaded madvise-based workload with the Silesia.tar
dataset, these are the before/after batch compression latencies for a
compress batch of 8 pages:
==================================
            p50 (ns)    p99 (ns)
==================================
before         5,568       6,056
after          5,472       5,848
change           -96        -208
==================================
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
drivers/crypto/intel/iaa/iaa_crypto_main.c | 30 ++++++++++++++--------
1 file changed, 20 insertions(+), 10 deletions(-)
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 697e98785335..dfc67109e81e 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1788,6 +1788,24 @@ iaa_setup_decompress_hw_desc(struct idxd_desc *idxd_desc,
return desc;
}
+/*
+ * Call this for non-irq, non-enqcmds job submissions.
+ */
+static __always_inline void iaa_submit_desc_movdir64b(struct idxd_wq *wq,
+ struct idxd_desc *desc)
+{
+ void __iomem *portal = idxd_wq_portal_addr(wq);
+
+ /*
+ * The wmb() flushes writes to coherent DMA data before
+ * possibly triggering a DMA read. The wmb() is necessary
+ * even on UP because the recipient is a device.
+ */
+ wmb();
+
+ iosubmit_cmds512(portal, desc->hw, 1);
+}
+
static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
struct idxd_wq *wq,
dma_addr_t src_addr, unsigned int slen,
@@ -1826,11 +1844,7 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
ctx->mode, iaa_device->compression_modes[ctx->mode]);
if (likely(!ctx->use_irq)) {
- ret = idxd_submit_desc(wq, idxd_desc);
- if (ret) {
- dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
- goto out;
- }
+ iaa_submit_desc_movdir64b(wq, idxd_desc);
/* Update stats */
update_total_comp_calls();
@@ -1918,11 +1932,7 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
desc = iaa_setup_decompress_hw_desc(idxd_desc, src_addr, slen, dst_addr, *dlen);
if (likely(!ctx->use_irq)) {
- ret = idxd_submit_desc(wq, idxd_desc);
- if (ret) {
- dev_dbg(dev, "submit_desc failed ret=%d\n", ret);
- goto fallback_software_decomp;
- }
+ iaa_submit_desc_movdir64b(wq, idxd_desc);
/* Update stats */
update_total_decomp_calls();
--
2.27.0
* [PATCH v13 09/22] crypto: iaa - Deprecate exporting add/remove IAA compression modes.
2025-11-04 9:12 [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
` (7 preceding siblings ...)
2025-11-04 9:12 ` [PATCH v13 08/22] crypto: iaa - Simplified, efficient job submissions for non-irq mode Kanchana P Sridhar
@ 2025-11-04 9:12 ` Kanchana P Sridhar
2025-11-04 9:12 ` [PATCH v13 10/22] crypto: iaa - Expect a single scatterlist for a [de]compress request's src/dst Kanchana P Sridhar
` (13 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Kanchana P Sridhar @ 2025-11-04 9:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, herbert, davem,
clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
vinicius.gomes
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
There is no use case right now for kernel users to dynamically
add/remove IAA compression modes; hence this commit deletes the symbol
exports of add_iaa_compression_mode() and remove_iaa_compression_mode().
The only supported usage model of IAA compression modes is for the code
to be statically linked during the iaa_crypto module build,
e.g. iaa_crypto_comp_fixed.c, and for available modes to be registered
when the first IAA device wq is probed.
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
drivers/crypto/intel/iaa/iaa_crypto_main.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index dfc67109e81e..061e3403d365 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -367,10 +367,6 @@ static void free_iaa_compression_mode(struct iaa_compression_mode *mode)
* These tables are typically generated and captured using statistics
* collected from running actual compress/decompress workloads.
*
- * A module or other kernel code can add and remove compression modes
- * with a given name using the exported @add_iaa_compression_mode()
- * and @remove_iaa_compression_mode functions.
- *
* When a new compression mode is added, the tables are saved in a
* global compression mode list. When IAA devices are added, a
* per-IAA device dma mapping is created for each IAA device, for each
@@ -404,7 +400,6 @@ void remove_iaa_compression_mode(const char *name)
out:
mutex_unlock(&iaa_devices_lock);
}
-EXPORT_SYMBOL_GPL(remove_iaa_compression_mode);
/**
* add_iaa_compression_mode - Add an IAA compression mode
@@ -481,7 +476,6 @@ int add_iaa_compression_mode(const char *name,
free_iaa_compression_mode(mode);
goto out;
}
-EXPORT_SYMBOL_GPL(add_iaa_compression_mode);
static void free_device_compression_mode(struct iaa_device *iaa_device,
struct iaa_device_compression_mode *device_mode)
--
2.27.0
* [PATCH v13 10/22] crypto: iaa - Expect a single scatterlist for a [de]compress request's src/dst.
2025-11-04 9:12 [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
` (8 preceding siblings ...)
2025-11-04 9:12 ` [PATCH v13 09/22] crypto: iaa - Deprecate exporting add/remove IAA compression modes Kanchana P Sridhar
@ 2025-11-04 9:12 ` Kanchana P Sridhar
2025-11-04 9:12 ` [PATCH v13 11/22] crypto: iaa - Rearchitect iaa_crypto to have clean interfaces with crypto_acomp Kanchana P Sridhar
` (12 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Kanchana P Sridhar @ 2025-11-04 9:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, herbert, davem,
clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
vinicius.gomes
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
The calls to dma_map_sg() were passing sg_nents() for the @nents
parameter, then erroring out if the returned @nr_sgs was greater than
one. Furthermore, there are no use cases for iaa_crypto that allow
multiple SG lists to be mapped for DMA at once.
Moreover, as per Herbert's direction in [1] for the batching API from
higher mm layers to interface with crypto using SG lists, batching
within iaa_crypto will rely on there being exactly one SG list per
"unit" of [de]compression in a batch, where the component SG lists are
obtained by breaking down the @req->src and @req->dst.
Given all of the above, this patch simplifies the design by expecting
only 1 @nents in req->src and req->dst, which aligns with the current
use cases and with the batching use cases developed in subsequent patches.
This alleviates the latency penalty of calling sg_nents() per
[de]compress op submitted to the hardware.
Some unlikely() annotations are added to conditionals in the core
[de]compress routines to further improve latency per op.
[1]: https://lore.kernel.org/all/aJ7Fk6RpNc815Ivd@gondor.apana.org.au/T/#m99aea2ce3d284e6c5a3253061d97b08c4752a798
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
drivers/crypto/intel/iaa/iaa_crypto_main.c | 54 +++++++++++-----------
1 file changed, 27 insertions(+), 27 deletions(-)
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 061e3403d365..04602df8d173 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1520,11 +1520,11 @@ static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
int ret = 0;
int nr_sgs;
- dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
- dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
+ dma_unmap_sg(dev, req->dst, 1, DMA_FROM_DEVICE);
+ dma_unmap_sg(dev, req->src, 1, DMA_TO_DEVICE);
- nr_sgs = dma_map_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
- if (nr_sgs <= 0 || nr_sgs > 1) {
+ nr_sgs = dma_map_sg(dev, req->src, 1, DMA_FROM_DEVICE);
+ if (unlikely(nr_sgs <= 0 || nr_sgs > 1)) {
dev_dbg(dev, "verify: couldn't map src sg for iaa device %d,"
" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
iaa_wq->wq->id, ret);
@@ -1536,13 +1536,13 @@ static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
" req->slen %d, sg_dma_len(sg) %d\n", *src_addr, nr_sgs,
req->src, req->slen, sg_dma_len(req->src));
- nr_sgs = dma_map_sg(dev, req->dst, sg_nents(req->dst), DMA_TO_DEVICE);
- if (nr_sgs <= 0 || nr_sgs > 1) {
+ nr_sgs = dma_map_sg(dev, req->dst, 1, DMA_TO_DEVICE);
+ if (unlikely(nr_sgs <= 0 || nr_sgs > 1)) {
dev_dbg(dev, "verify: couldn't map dst sg for iaa device %d,"
" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
iaa_wq->wq->id, ret);
ret = -EIO;
- dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
+ dma_unmap_sg(dev, req->src, 1, DMA_FROM_DEVICE);
goto out;
}
*dst_addr = sg_dma_address(req->dst);
@@ -1710,14 +1710,14 @@ static void iaa_desc_complete(struct idxd_desc *idxd_desc,
err = -EIO;
}
- dma_unmap_sg(dev, ctx->req->dst, sg_nents(ctx->req->dst), DMA_TO_DEVICE);
- dma_unmap_sg(dev, ctx->req->src, sg_nents(ctx->req->src), DMA_FROM_DEVICE);
+ dma_unmap_sg(dev, ctx->req->dst, 1, DMA_TO_DEVICE);
+ dma_unmap_sg(dev, ctx->req->src, 1, DMA_FROM_DEVICE);
goto out;
}
err:
- dma_unmap_sg(dev, ctx->req->dst, sg_nents(ctx->req->dst), DMA_FROM_DEVICE);
- dma_unmap_sg(dev, ctx->req->src, sg_nents(ctx->req->src), DMA_TO_DEVICE);
+ dma_unmap_sg(dev, ctx->req->dst, 1, DMA_FROM_DEVICE);
+ dma_unmap_sg(dev, ctx->req->src, 1, DMA_TO_DEVICE);
out:
if (ret != 0)
dev_dbg(dev, "asynchronous compress failed ret=%d\n", ret);
@@ -2020,8 +2020,8 @@ static int iaa_comp_acompress(struct acomp_req *req)
dev = &wq->idxd->pdev->dev;
- nr_sgs = dma_map_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
- if (nr_sgs <= 0 || nr_sgs > 1) {
+ nr_sgs = dma_map_sg(dev, req->src, 1, DMA_TO_DEVICE);
+ if (unlikely(nr_sgs <= 0 || nr_sgs > 1)) {
dev_dbg(dev, "couldn't map src sg for iaa device %d,"
" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
iaa_wq->wq->id, ret);
@@ -2030,8 +2030,8 @@ static int iaa_comp_acompress(struct acomp_req *req)
}
src_addr = sg_dma_address(req->src);
- nr_sgs = dma_map_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
- if (nr_sgs <= 0 || nr_sgs > 1) {
+ nr_sgs = dma_map_sg(dev, req->dst, 1, DMA_FROM_DEVICE);
+ if (unlikely(nr_sgs <= 0 || nr_sgs > 1)) {
dev_dbg(dev, "couldn't map dst sg for iaa device %d,"
" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
iaa_wq->wq->id, ret);
@@ -2057,18 +2057,18 @@ static int iaa_comp_acompress(struct acomp_req *req)
if (ret)
dev_dbg(dev, "asynchronous compress verification failed ret=%d\n", ret);
- dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_TO_DEVICE);
- dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_FROM_DEVICE);
+ dma_unmap_sg(dev, req->dst, 1, DMA_TO_DEVICE);
+ dma_unmap_sg(dev, req->src, 1, DMA_FROM_DEVICE);
goto out;
}
- if (ret)
+ if (unlikely(ret))
dev_dbg(dev, "asynchronous compress failed ret=%d\n", ret);
- dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
+ dma_unmap_sg(dev, req->dst, 1, DMA_FROM_DEVICE);
err_map_dst:
- dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
+ dma_unmap_sg(dev, req->src, 1, DMA_TO_DEVICE);
out:
percpu_ref_put(&iaa_wq->ref);
@@ -2101,8 +2101,8 @@ static int iaa_comp_adecompress(struct acomp_req *req)
dev = &wq->idxd->pdev->dev;
- nr_sgs = dma_map_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
- if (nr_sgs <= 0 || nr_sgs > 1) {
+ nr_sgs = dma_map_sg(dev, req->src, 1, DMA_TO_DEVICE);
+ if (unlikely(nr_sgs <= 0 || nr_sgs > 1)) {
dev_dbg(dev, "couldn't map src sg for iaa device %d,"
" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
iaa_wq->wq->id, ret);
@@ -2111,8 +2111,8 @@ static int iaa_comp_adecompress(struct acomp_req *req)
}
src_addr = sg_dma_address(req->src);
- nr_sgs = dma_map_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
- if (nr_sgs <= 0 || nr_sgs > 1) {
+ nr_sgs = dma_map_sg(dev, req->dst, 1, DMA_FROM_DEVICE);
+ if (unlikely(nr_sgs <= 0 || nr_sgs > 1)) {
dev_dbg(dev, "couldn't map dst sg for iaa device %d,"
" wq %d: ret=%d\n", iaa_wq->iaa_device->idxd->id,
iaa_wq->wq->id, ret);
@@ -2126,12 +2126,12 @@ static int iaa_comp_adecompress(struct acomp_req *req)
if (ret == -EINPROGRESS)
return ret;
- if (ret != 0)
+ if (unlikely(ret != 0))
dev_dbg(dev, "asynchronous decompress failed ret=%d\n", ret);
- dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
+ dma_unmap_sg(dev, req->dst, 1, DMA_FROM_DEVICE);
err_map_dst:
- dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
+ dma_unmap_sg(dev, req->src, 1, DMA_TO_DEVICE);
out:
percpu_ref_put(&iaa_wq->ref);
--
2.27.0
* [PATCH v13 11/22] crypto: iaa - Rearchitect iaa_crypto to have clean interfaces with crypto_acomp
2025-11-04 9:12 [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
` (9 preceding siblings ...)
2025-11-04 9:12 ` [PATCH v13 10/22] crypto: iaa - Expect a single scatterlist for a [de]compress request's src/dst Kanchana P Sridhar
@ 2025-11-04 9:12 ` Kanchana P Sridhar
2025-11-04 9:12 ` [PATCH v13 12/22] crypto: acomp - Define a unit_size in struct acomp_req to enable batching Kanchana P Sridhar
` (11 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Kanchana P Sridhar @ 2025-11-04 9:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, herbert, davem,
clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
vinicius.gomes
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch modifies the core functions in the iaa_crypto driver to be
independent of crypto_acomp, by adding a layer between the core driver
functionality and the crypto API. The core driver code is moved under
this layer and relies only on idxd, DMA and scatterlist. This leads to
a cleaner interface.
We introduce a new "struct iaa_req" data structure and lightweight
internal translation routines to/from crypto_acomp, namely,
acomp_to_iaa() and iaa_to_acomp().
The exception is that the driver defines a "static struct crypto_acomp
*deflate_crypto_acomp" for the software decompress fallback path.
The acomp_alg .compress() and .decompress() interfaces call into
iaa_comp_acompress_main() and iaa_comp_adecompress_main(), which are
wrappers around the core crypto-independent driver functions.
These iaa_crypto interfaces will continue to be available through
crypto_acomp for use in zswap (a minimal synchronous-usage sketch
follows the signatures below):
int crypto_acomp_compress(struct acomp_req *req);
int crypto_acomp_decompress(struct acomp_req *req);
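The following is a minimal sketch of that sequential, synchronous usage
(illustrative only: example_compress_one() and its parameters are
hypothetical; the crypto_acomp/crypto_wait_req calls are the existing
kernel interfaces, not new code added by this patch):

#include <crypto/acompress.h>
#include <linux/crypto.h>

/* Compress one src SG list into one dst SG list, waiting synchronously. */
static int example_compress_one(struct acomp_req *req,
				struct scatterlist *src,
				struct scatterlist *dst,
				unsigned int slen, unsigned int dlen)
{
	DECLARE_CRYPTO_WAIT(wait);

	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
				   crypto_req_done, &wait);
	acomp_request_set_params(req, src, dst, slen, dlen);

	/* crypto_wait_req() blocks on -EINPROGRESS/-EBUSY via 'wait'. */
	return crypto_wait_req(crypto_acomp_compress(req), &wait);
}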
Additionally, this patch resolves a race condition triggered when
IAA wqs and devices are continuously disabled/enabled while workloads
are using IAA for compression/decompression. This commit, in combination
with patches 0002 ("crypto: iaa - New architecture for IAA device WQ
comp/decomp usage & core mapping.") and 0005 ("crypto: iaa - iaa_wq uses
percpu_refs for get/put reference counting.") in this series, fixes the
race condition. This has been verified by bisecting.
One other change made towards a cleaner architecture is that the
iaa_crypto symbol namespace is changed from "IDXD" to
"CRYPTO_DEV_IAA_CRYPTO".
Fixes: ea7a5cbb4369 ("crypto: iaa - Add Intel IAA Compression Accelerator crypto driver core")
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
drivers/crypto/intel/iaa/Makefile | 2 +-
drivers/crypto/intel/iaa/iaa_crypto.h | 24 +-
drivers/crypto/intel/iaa/iaa_crypto_main.c | 275 +++++++++++++++++----
3 files changed, 240 insertions(+), 61 deletions(-)
diff --git a/drivers/crypto/intel/iaa/Makefile b/drivers/crypto/intel/iaa/Makefile
index 55bda7770fac..ebfa1a425f80 100644
--- a/drivers/crypto/intel/iaa/Makefile
+++ b/drivers/crypto/intel/iaa/Makefile
@@ -3,7 +3,7 @@
# Makefile for IAA crypto device drivers
#
-ccflags-y += -I $(srctree)/drivers/dma/idxd -DDEFAULT_SYMBOL_NAMESPACE='"IDXD"'
+ccflags-y += -I $(srctree)/drivers/dma/idxd -DDEFAULT_SYMBOL_NAMESPACE='"CRYPTO_DEV_IAA_CRYPTO"'
obj-$(CONFIG_CRYPTO_DEV_IAA_CRYPTO) := iaa_crypto.o
diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index 9611f2518f42..4dfb65c88f83 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -7,6 +7,7 @@
#include <linux/crypto.h>
#include <linux/idxd.h>
#include <uapi/linux/idxd.h>
+#include <linux/scatterlist.h>
#define IDXD_SUBDRIVER_NAME "crypto"
@@ -29,8 +30,6 @@
#define IAA_ERROR_COMP_BUF_OVERFLOW 0x19
#define IAA_ERROR_WATCHDOG_EXPIRED 0x24
-#define IAA_COMP_MODES_MAX 2
-
#define FIXED_HDR 0x2
#define FIXED_HDR_SIZE 3
@@ -42,6 +41,23 @@
IAA_DECOMP_CHECK_FOR_EOB | \
IAA_DECOMP_STOP_ON_EOB)
+#define IAA_COMP_MODES_MAX IAA_MODE_NONE
+
+enum iaa_mode {
+ IAA_MODE_FIXED = 0,
+ IAA_MODE_NONE = 1,
+};
+
+struct iaa_req {
+ struct scatterlist *src;
+ struct scatterlist *dst;
+ unsigned int slen;
+ unsigned int dlen;
+ u32 flags;
+ u32 compression_crc;
+ void *drv_data; /* for driver internal use */
+};
+
/* Representation of IAA workqueue */
struct iaa_wq {
struct list_head list;
@@ -138,10 +154,6 @@ int add_iaa_compression_mode(const char *name,
void remove_iaa_compression_mode(const char *name);
-enum iaa_mode {
- IAA_MODE_FIXED,
-};
-
struct iaa_compression_ctx {
enum iaa_mode mode;
u16 alloc_comp_desc_timeout;
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 04602df8d173..75bd455b3b34 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -51,6 +51,10 @@ static struct wq_table_entry **pkg_global_decomp_wqs;
/* All comp wqs from IAAs on a package. */
static struct wq_table_entry **pkg_global_comp_wqs;
+/* For software deflate fallback compress/decompress. */
+static struct crypto_acomp *deflate_crypto_acomp;
+DEFINE_MUTEX(deflate_crypto_acomp_lock);
+
LIST_HEAD(iaa_devices);
DEFINE_MUTEX(iaa_devices_lock);
@@ -93,9 +97,18 @@ static atomic_t iaa_crypto_enabled = ATOMIC_INIT(0);
static struct idxd_wq *first_wq_found;
DEFINE_MUTEX(first_wq_found_lock);
-static bool iaa_crypto_registered;
+const char *iaa_compression_mode_names[IAA_COMP_MODES_MAX] = {
+ "fixed",
+};
+
+const char *iaa_compression_alg_names[IAA_COMP_MODES_MAX] = {
+ "deflate-iaa",
+};
static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
+static struct iaa_compression_ctx *iaa_ctx[IAA_COMP_MODES_MAX];
+static bool iaa_mode_registered[IAA_COMP_MODES_MAX];
+static u8 num_iaa_modes_registered;
/* Distribute decompressions across all IAAs on the package. */
static bool iaa_distribute_decomps;
@@ -353,6 +366,20 @@ static struct iaa_compression_mode *find_iaa_compression_mode(const char *name,
return NULL;
}
+static bool iaa_alg_is_registered(const char *name, int *idx)
+{
+ int i;
+
+ for (i = 0; i < IAA_COMP_MODES_MAX; ++i) {
+ if (!strcmp(name, iaa_compression_alg_names[i]) && iaa_mode_registered[i]) {
+ *idx = i;
+ return true;
+ }
+ }
+
+ return false;
+}
+
static void free_iaa_compression_mode(struct iaa_compression_mode *mode)
{
kfree(mode->name);
@@ -466,6 +493,7 @@ int add_iaa_compression_mode(const char *name,
mode->name, idx);
iaa_compression_modes[idx] = mode;
+ ++num_iaa_modes_registered;
ret = 0;
out:
@@ -1440,19 +1468,46 @@ static struct idxd_wq *comp_wq_table_next_wq(int cpu)
* Core iaa_crypto compress/decompress functions.
*************************************************/
-static int deflate_generic_decompress(struct acomp_req *req)
+static int deflate_generic_decompress(struct iaa_req *req)
{
- ACOMP_FBREQ_ON_STACK(fbreq, req);
+ ACOMP_REQUEST_ON_STACK(fbreq, deflate_crypto_acomp);
int ret;
+ acomp_request_set_callback(fbreq, 0, NULL, NULL);
+ acomp_request_set_params(fbreq, req->src, req->dst, req->slen,
+ PAGE_SIZE);
+
+ mutex_lock(&deflate_crypto_acomp_lock);
+
ret = crypto_acomp_decompress(fbreq);
req->dlen = fbreq->dlen;
+ mutex_unlock(&deflate_crypto_acomp_lock);
+
update_total_sw_decomp_calls();
return ret;
}
+static __always_inline void acomp_to_iaa(struct acomp_req *areq,
+ struct iaa_req *req,
+ struct iaa_compression_ctx *ctx)
+{
+ req->src = areq->src;
+ req->dst = areq->dst;
+ req->slen = areq->slen;
+ req->dlen = areq->dlen;
+ req->flags = areq->base.flags;
+ if (unlikely(ctx->use_irq))
+ req->drv_data = areq;
+}
+
+static __always_inline void iaa_to_acomp(int dlen, struct acomp_req *areq)
+{
+ areq->dst->length = dlen;
+ areq->dlen = dlen;
+}
+
static inline int check_completion(struct device *dev,
struct iax_completion_record *comp,
bool compress,
@@ -1514,7 +1569,7 @@ static inline int check_completion(struct device *dev,
}
static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
- struct acomp_req *req,
+ struct iaa_req *req,
dma_addr_t *src_addr, dma_addr_t *dst_addr)
{
int ret = 0;
@@ -1553,13 +1608,11 @@ static int iaa_remap_for_verify(struct device *dev, struct iaa_wq *iaa_wq,
return ret;
}
-static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
+static int iaa_compress_verify(struct iaa_compression_ctx *ctx, struct iaa_req *req,
struct idxd_wq *wq,
dma_addr_t src_addr, unsigned int slen,
dma_addr_t dst_addr, unsigned int dlen)
{
- struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
- u32 *compression_crc = acomp_request_ctx(req);
struct iaa_device *iaa_device;
struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
u16 alloc_desc_retries = 0;
@@ -1612,10 +1665,10 @@ static int iaa_compress_verify(struct crypto_tfm *tfm, struct acomp_req *req,
goto err;
}
- if (*compression_crc != idxd_desc->iax_completion->crc) {
+ if (req->compression_crc != idxd_desc->iax_completion->crc) {
ret = -EINVAL;
dev_dbg(dev, "(verify) iaa comp/decomp crc mismatch:"
- " comp=0x%x, decomp=0x%x\n", *compression_crc,
+ " comp=0x%x, decomp=0x%x\n", req->compression_crc,
idxd_desc->iax_completion->crc);
print_hex_dump(KERN_INFO, "cmp-rec: ", DUMP_PREFIX_OFFSET,
8, 1, idxd_desc->iax_completion, 64, 0);
@@ -1641,6 +1694,7 @@ static void iaa_desc_complete(struct idxd_desc *idxd_desc,
struct iaa_wq *iaa_wq;
struct pci_dev *pdev;
struct device *dev;
+ struct iaa_req req;
int ret, err = 0;
compression_ctx = crypto_tfm_ctx(ctx->tfm);
@@ -1666,12 +1720,18 @@ static void iaa_desc_complete(struct idxd_desc *idxd_desc,
pr_warn("%s: falling back to deflate-generic decompress, "
"analytics error code %x\n", __func__,
idxd_desc->iax_completion->error_code);
- ret = deflate_generic_decompress(ctx->req);
+
+ acomp_to_iaa(ctx->req, &req, compression_ctx);
+ ret = deflate_generic_decompress(&req);
+ iaa_to_acomp(req.dlen, ctx->req);
+
if (ret) {
dev_dbg(dev, "%s: deflate-generic failed ret=%d\n",
__func__, ret);
err = -EIO;
goto err;
+ } else {
+ goto verify;
}
} else {
err = -EIO;
@@ -1690,21 +1750,26 @@ static void iaa_desc_complete(struct idxd_desc *idxd_desc,
update_wq_decomp_bytes(iaa_wq->wq, ctx->req->slen);
}
+verify:
if (ctx->compress && compression_ctx->verify_compress) {
- u32 *compression_crc = acomp_request_ctx(ctx->req);
dma_addr_t src_addr, dst_addr;
- *compression_crc = idxd_desc->iax_completion->crc;
+ acomp_to_iaa(ctx->req, &req, compression_ctx);
+ req.compression_crc = idxd_desc->iax_completion->crc;
+
+ ret = iaa_remap_for_verify(dev, iaa_wq, &req, &src_addr, &dst_addr);
+ iaa_to_acomp(req.dlen, ctx->req);
- ret = iaa_remap_for_verify(dev, iaa_wq, ctx->req, &src_addr, &dst_addr);
if (ret) {
dev_dbg(dev, "%s: compress verify remap failed ret=%d\n", __func__, ret);
err = -EIO;
goto out;
}
- ret = iaa_compress_verify(ctx->tfm, ctx->req, iaa_wq->wq, src_addr,
+ ret = iaa_compress_verify(compression_ctx, &req, iaa_wq->wq, src_addr,
ctx->req->slen, dst_addr, ctx->req->dlen);
+ iaa_to_acomp(req.dlen, ctx->req);
+
if (ret) {
dev_dbg(dev, "%s: compress verify failed ret=%d\n", __func__, ret);
err = -EIO;
@@ -1800,13 +1865,11 @@ static __always_inline void iaa_submit_desc_movdir64b(struct idxd_wq *wq,
iosubmit_cmds512(portal, desc->hw, 1);
}
-static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
+static int iaa_compress(struct iaa_compression_ctx *ctx, struct iaa_req *req,
struct idxd_wq *wq,
dma_addr_t src_addr, unsigned int slen,
dma_addr_t dst_addr, unsigned int *dlen)
{
- struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
- u32 *compression_crc = acomp_request_ctx(req);
struct iaa_device *iaa_device;
struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
u16 alloc_desc_retries = 0;
@@ -1854,17 +1917,18 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
}
*dlen = idxd_desc->iax_completion->output_size;
+ req->compression_crc = idxd_desc->iax_completion->crc;
/* Update stats */
update_total_comp_bytes_out(*dlen);
update_wq_comp_bytes(wq, *dlen);
-
- *compression_crc = idxd_desc->iax_completion->crc;
} else {
+ struct acomp_req *areq = req->drv_data;
+
desc->flags |= IDXD_OP_FLAG_RCI;
- idxd_desc->crypto.req = req;
- idxd_desc->crypto.tfm = tfm;
+ idxd_desc->crypto.req = areq;
+ idxd_desc->crypto.tfm = areq->base.tfm;
idxd_desc->crypto.src_addr = src_addr;
idxd_desc->crypto.dst_addr = dst_addr;
idxd_desc->crypto.compress = true;
@@ -1888,12 +1952,11 @@ static int iaa_compress(struct crypto_tfm *tfm, struct acomp_req *req,
return ret;
}
-static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
+static int iaa_decompress(struct iaa_compression_ctx *ctx, struct iaa_req *req,
struct idxd_wq *wq,
dma_addr_t src_addr, unsigned int slen,
dma_addr_t dst_addr, unsigned int *dlen)
{
- struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
struct iaa_device *iaa_device;
struct idxd_desc *idxd_desc = ERR_PTR(-EAGAIN);
u16 alloc_desc_retries = 0;
@@ -1937,10 +2000,12 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
ret = check_completion(dev, idxd_desc->iax_completion, false, false);
} else {
+ struct acomp_req *areq = req->drv_data;
+
desc->flags |= IDXD_OP_FLAG_RCI;
- idxd_desc->crypto.req = req;
- idxd_desc->crypto.tfm = tfm;
+ idxd_desc->crypto.req = areq;
+ idxd_desc->crypto.tfm = areq->base.tfm;
idxd_desc->crypto.src_addr = src_addr;
idxd_desc->crypto.dst_addr = dst_addr;
idxd_desc->crypto.compress = false;
@@ -1991,20 +2056,16 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
return ret;
}
-static int iaa_comp_acompress(struct acomp_req *req)
+static int iaa_comp_acompress(struct iaa_compression_ctx *ctx, struct iaa_req *req)
{
- struct iaa_compression_ctx *compression_ctx;
- struct crypto_tfm *tfm = req->base.tfm;
dma_addr_t src_addr, dst_addr;
int nr_sgs, cpu, ret = 0;
struct iaa_wq *iaa_wq;
struct idxd_wq *wq;
struct device *dev;
- compression_ctx = crypto_tfm_ctx(tfm);
-
- if (!req->src || !req->slen) {
- pr_debug("invalid src, not compressing\n");
+ if (!req->src || !req->slen || !req->dst) {
+ pr_debug("invalid src/dst, not compressing\n");
return -EINVAL;
}
@@ -2040,19 +2101,19 @@ static int iaa_comp_acompress(struct acomp_req *req)
}
dst_addr = sg_dma_address(req->dst);
- ret = iaa_compress(tfm, req, wq, src_addr, req->slen, dst_addr,
+ ret = iaa_compress(ctx, req, wq, src_addr, req->slen, dst_addr,
&req->dlen);
if (ret == -EINPROGRESS)
return ret;
- if (!ret && compression_ctx->verify_compress) {
+ if (!ret && ctx->verify_compress) {
ret = iaa_remap_for_verify(dev, iaa_wq, req, &src_addr, &dst_addr);
if (ret) {
dev_dbg(dev, "%s: compress verify remap failed ret=%d\n", __func__, ret);
goto out;
}
- ret = iaa_compress_verify(tfm, req, wq, src_addr, req->slen,
+ ret = iaa_compress_verify(ctx, req, wq, src_addr, req->slen,
dst_addr, req->dlen);
if (ret)
dev_dbg(dev, "asynchronous compress verification failed ret=%d\n", ret);
@@ -2075,9 +2136,8 @@ static int iaa_comp_acompress(struct acomp_req *req)
return ret;
}
-static int iaa_comp_adecompress(struct acomp_req *req)
+static int iaa_comp_adecompress(struct iaa_compression_ctx *ctx, struct iaa_req *req)
{
- struct crypto_tfm *tfm = req->base.tfm;
dma_addr_t src_addr, dst_addr;
int nr_sgs, cpu, ret = 0;
struct iaa_wq *iaa_wq;
@@ -2121,7 +2181,7 @@ static int iaa_comp_adecompress(struct acomp_req *req)
}
dst_addr = sg_dma_address(req->dst);
- ret = iaa_decompress(tfm, req, wq, src_addr, req->slen,
+ ret = iaa_decompress(ctx, req, wq, src_addr, req->slen,
dst_addr, &req->dlen);
if (ret == -EINPROGRESS)
return ret;
@@ -2138,8 +2198,9 @@ static int iaa_comp_adecompress(struct acomp_req *req)
return ret;
}
-static void compression_ctx_init(struct iaa_compression_ctx *ctx)
+static void compression_ctx_init(struct iaa_compression_ctx *ctx, enum iaa_mode mode)
{
+ ctx->mode = mode;
ctx->alloc_comp_desc_timeout = IAA_ALLOC_DESC_COMP_TIMEOUT;
ctx->alloc_decomp_desc_timeout = IAA_ALLOC_DESC_DECOMP_TIMEOUT;
ctx->verify_compress = iaa_verify_compress;
@@ -2151,22 +2212,56 @@ static void compression_ctx_init(struct iaa_compression_ctx *ctx)
* Interfaces to crypto_alg and crypto_acomp.
*********************************************/
+static int iaa_comp_acompress_main(struct acomp_req *areq)
+{
+ struct crypto_tfm *tfm = areq->base.tfm;
+ struct iaa_compression_ctx *ctx;
+ struct iaa_req req;
+ int ret = -ENODEV, idx;
+
+ if (iaa_alg_is_registered(crypto_tfm_alg_driver_name(tfm), &idx)) {
+ ctx = iaa_ctx[idx];
+
+ acomp_to_iaa(areq, &req, ctx);
+ ret = iaa_comp_acompress(ctx, &req);
+ iaa_to_acomp(unlikely(ret) ? ret : req.dlen, areq);
+ }
+
+ return ret;
+}
+
+static int iaa_comp_adecompress_main(struct acomp_req *areq)
+{
+ struct crypto_tfm *tfm = areq->base.tfm;
+ struct iaa_compression_ctx *ctx;
+ struct iaa_req req;
+ int ret = -ENODEV, idx;
+
+ if (iaa_alg_is_registered(crypto_tfm_alg_driver_name(tfm), &idx)) {
+ ctx = iaa_ctx[idx];
+
+ acomp_to_iaa(areq, &req, ctx);
+ ret = iaa_comp_adecompress(ctx, &req);
+ iaa_to_acomp(unlikely(ret) ? ret : req.dlen, areq);
+ }
+
+ return ret;
+}
+
static int iaa_comp_init_fixed(struct crypto_acomp *acomp_tfm)
{
struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
- ctx->mode = IAA_MODE_FIXED;
-
- compression_ctx_init(ctx);
+ ctx = iaa_ctx[IAA_MODE_FIXED];
return 0;
}
static struct acomp_alg iaa_acomp_fixed_deflate = {
.init = iaa_comp_init_fixed,
- .compress = iaa_comp_acompress,
- .decompress = iaa_comp_adecompress,
+ .compress = iaa_comp_acompress_main,
+ .decompress = iaa_comp_adecompress_main,
.base = {
.cra_name = "deflate",
.cra_driver_name = "deflate-iaa",
@@ -2178,29 +2273,76 @@ static struct acomp_alg iaa_acomp_fixed_deflate = {
}
};
+/*******************************************
+ * Implement idxd_device_driver interfaces.
+ *******************************************/
+
+static void iaa_unregister_compression_device(void)
+{
+ unsigned int i;
+
+ atomic_set(&iaa_crypto_enabled, 0);
+
+ for (i = 0; i < IAA_COMP_MODES_MAX; ++i) {
+ iaa_mode_registered[i] = false;
+ kfree(iaa_ctx[i]);
+ iaa_ctx[i] = NULL;
+ }
+
+ num_iaa_modes_registered = 0;
+}
+
static int iaa_register_compression_device(void)
{
- int ret;
+ struct iaa_compression_mode *mode;
+ int i, idx;
+
+ for (i = 0; i < IAA_COMP_MODES_MAX; ++i) {
+ iaa_mode_registered[i] = false;
+ mode = find_iaa_compression_mode(iaa_compression_mode_names[i], &idx);
+ if (mode) {
+ iaa_ctx[i] = kmalloc(sizeof(struct iaa_compression_ctx), GFP_KERNEL);
+ if (!iaa_ctx[i])
+ goto err;
+
+ compression_ctx_init(iaa_ctx[i], (enum iaa_mode)i);
+ iaa_mode_registered[i] = true;
+ }
+ }
+
+ if (iaa_mode_registered[IAA_MODE_FIXED])
+ return 0;
+
+ pr_err("%s: IAA_MODE_FIXED is not registered.", __func__);
+
+err:
+ iaa_unregister_compression_device();
+ return -ENODEV;
+}
+
+static int iaa_register_acomp_compression_device(void)
+{
+ int ret = -ENOMEM;
ret = crypto_register_acomp(&iaa_acomp_fixed_deflate);
if (ret) {
pr_err("deflate algorithm acomp fixed registration failed (%d)\n", ret);
- goto out;
+ goto err_fixed;
}
- iaa_crypto_registered = true;
-out:
+ return 0;
+
+err_fixed:
+ iaa_unregister_compression_device();
return ret;
}
-static int iaa_unregister_compression_device(void)
+static void iaa_unregister_acomp_compression_device(void)
{
atomic_set(&iaa_crypto_enabled, 0);
- if (iaa_crypto_registered)
+ if (iaa_mode_registered[IAA_MODE_FIXED])
crypto_unregister_acomp(&iaa_acomp_fixed_deflate);
-
- return 0;
}
static int iaa_crypto_probe(struct idxd_dev *idxd_dev)
@@ -2270,6 +2412,12 @@ static int iaa_crypto_probe(struct idxd_dev *idxd_dev)
goto err_register;
}
+ ret = iaa_register_acomp_compression_device();
+ if (ret != 0) {
+ dev_dbg(dev, "IAA compression device acomp registration failed\n");
+ goto err_register;
+ }
+
if (!rebalance_wq_table()) {
dev_dbg(dev, "%s: Rerun after registration: IAA rebalancing device wq tables failed\n", __func__);
goto err_register;
@@ -2346,6 +2494,8 @@ static void iaa_crypto_remove(struct idxd_dev *idxd_dev)
pkg_global_wqs_dealloc();
free_wq_tables();
WARN_ON(!list_empty(&iaa_devices));
+ iaa_unregister_acomp_compression_device();
+ iaa_unregister_compression_device();
INIT_LIST_HEAD(&iaa_devices);
module_put(THIS_MODULE);
@@ -2387,6 +2537,13 @@ static int __init iaa_crypto_init_module(void)
nr_cpus_per_package = topology_num_cores_per_package();
nr_packages = topology_max_packages();
+ /* Software fallback compressor */
+ deflate_crypto_acomp = crypto_alloc_acomp("deflate", 0, 0);
+ if (IS_ERR_OR_NULL(deflate_crypto_acomp)) {
+ ret = -ENODEV;
+ goto err_deflate_acomp;
+ }
+
ret = iaa_aecs_init_fixed();
if (ret < 0) {
pr_debug("IAA fixed compression mode init failed\n");
@@ -2458,14 +2615,19 @@ static int __init iaa_crypto_init_module(void)
err_driver_reg:
iaa_aecs_cleanup_fixed();
err_aecs_init:
+ if (!IS_ERR_OR_NULL(deflate_crypto_acomp)) {
+ crypto_free_acomp(deflate_crypto_acomp);
+ deflate_crypto_acomp = NULL;
+ }
+err_deflate_acomp:
goto out;
}
static void __exit iaa_crypto_cleanup_module(void)
{
- if (iaa_unregister_compression_device())
- pr_debug("IAA compression device unregister failed\n");
+ iaa_unregister_acomp_compression_device();
+ iaa_unregister_compression_device();
iaa_crypto_debugfs_cleanup();
driver_remove_file(&iaa_crypto_driver.drv,
@@ -2481,6 +2643,11 @@ static void __exit iaa_crypto_cleanup_module(void)
idxd_driver_unregister(&iaa_crypto_driver);
iaa_aecs_cleanup_fixed();
+ if (!IS_ERR_OR_NULL(deflate_crypto_acomp)) {
+ crypto_free_acomp(deflate_crypto_acomp);
+ deflate_crypto_acomp = NULL;
+ }
+
pr_debug("cleaned up\n");
}
--
2.27.0
* [PATCH v13 12/22] crypto: acomp - Define a unit_size in struct acomp_req to enable batching.
2025-11-04 9:12 [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
` (10 preceding siblings ...)
2025-11-04 9:12 ` [PATCH v13 11/22] crypto: iaa - Rearchitect iaa_crypto to have clean interfaces with crypto_acomp Kanchana P Sridhar
@ 2025-11-04 9:12 ` Kanchana P Sridhar
2025-11-04 9:12 ` [PATCH v13 13/22] crypto: iaa - IAA Batching for parallel compressions/decompressions Kanchana P Sridhar
` (10 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Kanchana P Sridhar @ 2025-11-04 9:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, herbert, davem,
clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
vinicius.gomes
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
In mm/zswap.c, this patch also sets the unit size for zswap to PAGE_SIZE.
We add a new @unit_size data member to struct acomp_req along with a
helper function acomp_request_set_unit_size() for kernel modules to set
the unit size to use while breaking down the request's src/dst
scatterlists.
An acomp_alg can implement batching by using the @req->unit_size to
break down the SG lists passed in via @req->dst and/or @req->src, to
submit individual @req->slen/@req->unit_size compress jobs or
@req->dlen/@req->unit_size decompress jobs, for batch compression and
batch decompression respectively.
In case of batch compression, the folio's pages for the batch can be
retrieved from the @req->src scatterlist by using a struct sg_page_iter
after determining the number of pages as @req->slen/@req->unit_size.
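As a minimal sketch (illustrative only; example_for_each_src_page() is a
hypothetical helper, not part of this patch), an implementation could
walk the batch's source pages like this:

#include <crypto/acompress.h>
#include <linux/scatterlist.h>

static unsigned int example_for_each_src_page(struct acomp_req *req)
{
	unsigned int nr_pages = req->slen / req->unit_size;
	struct sg_page_iter iter;
	unsigned int i = 0;

	/* Walk the pages backing @req->src, one compress unit per page. */
	for_each_sg_page(req->src, &iter, sg_nents(req->src), 0) {
		struct page *page = sg_page_iter_page(&iter);

		/* Submit the i-th per-page compress job for 'page' here. */
		(void)page;
		if (++i == nr_pages)
			break;
	}

	return i;	/* number of source pages found for the batch */
}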
As per Herbert's suggestion:
1) acomp_request_set_callback() sets the @req->unit_size to 0.
2) In zswap_cpu_comp_prepare(), after the call to
acomp_request_set_callback(), we call:
acomp_request_set_unit_size(acomp_ctx->req, PAGE_SIZE);
to set the unit size for zswap to PAGE_SIZE.
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
include/crypto/acompress.h | 36 ++++++++++++++++++++++++++++++++++++
mm/zswap.c | 3 +++
2 files changed, 39 insertions(+)
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index 9eacb9fa375d..0f1334168f1b 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -79,6 +79,7 @@ struct acomp_req_chain {
* @dvirt: Destination virtual address
* @slen: Size of the input buffer
* @dlen: Size of the output buffer and number of bytes produced
+ * @unit_size: Unit size for the request for use in batching
* @chain: Private API code data, do not use
* @__ctx: Start of private context data
*/
@@ -94,6 +95,7 @@ struct acomp_req {
};
unsigned int slen;
unsigned int dlen;
+ unsigned int unit_size;
struct acomp_req_chain chain;
@@ -328,9 +330,43 @@ static inline void acomp_request_set_callback(struct acomp_req *req,
{
flgs &= ~CRYPTO_ACOMP_REQ_PRIVATE;
flgs |= req->base.flags & CRYPTO_ACOMP_REQ_PRIVATE;
+ req->unit_size = 0;
crypto_request_set_callback(&req->base, flgs, cmpl, data);
}
+/**
+ * acomp_request_set_unit_size() -- Sets the unit size for the request.
+ *
+ * As suggested by Herbert Xu, this is a new helper function that enables
+ * batching for zswap, IPComp, etc.
+ *
+ * Example usage model:
+ *
+ * A module like zswap that wants to use batch compression of @nr_pages with
+ * crypto_acomp must create an output SG table for the batch, initialized to
+ * contain @nr_pages SG lists. Each scatterlist is mapped to the nth
+ * destination buffer for the batch.
+ *
+ * An acomp_alg can implement batching by using the @req->unit_size to
+ * break down the SG lists passed in via @req->dst and/or @req->src, to
+ * submit individual @req->slen/@req->unit_size compress jobs or
+ * @req->dlen/@req->unit_size decompress jobs, for batch compression and
+ * batch decompression respectively.
+ *
+ * This API must be called after acomp_request_set_callback(),
+ * which sets @req->unit_size to 0.
+ *
+ * @du would be PAGE_SIZE for zswap, it could be the MTU for IPsec.
+ *
+ * @req: asynchronous compress request
+ * @du: data unit size of the input buffer scatterlist.
+ */
+static inline void acomp_request_set_unit_size(struct acomp_req *req,
+ unsigned int du)
+{
+ req->unit_size = du;
+}
+
/**
* acomp_request_set_params() -- Sets request parameters
*
diff --git a/mm/zswap.c b/mm/zswap.c
index 5d0f8b13a958..4897ed689b9f 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -783,6 +783,9 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
acomp_ctx->acomp = acomp;
acomp_ctx->is_sleepable = acomp_is_async(acomp);
acomp_ctx->req = req;
+
+ acomp_request_set_unit_size(acomp_ctx->req, PAGE_SIZE);
+
mutex_unlock(&acomp_ctx->mutex);
return 0;
--
2.27.0
* [PATCH v13 13/22] crypto: iaa - IAA Batching for parallel compressions/decompressions.
2025-11-04 9:12 [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
` (11 preceding siblings ...)
2025-11-04 9:12 ` [PATCH v13 12/22] crypto: acomp - Define a unit_size in struct acomp_req to enable batching Kanchana P Sridhar
@ 2025-11-04 9:12 ` Kanchana P Sridhar
2025-11-14 9:59 ` Herbert Xu
2025-11-04 9:12 ` [PATCH v13 14/22] crypto: iaa - Enable async mode and make it the default Kanchana P Sridhar
` (9 subsequent siblings)
22 siblings, 1 reply; 47+ messages in thread
From: Kanchana P Sridhar @ 2025-11-04 9:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, herbert, davem,
clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
vinicius.gomes
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch adds capabilities in the IAA driver for kernel users to
compress/decompress multiple jobs in parallel using IAA hardware
acceleration, without the use of interrupts. Instead, this is
accomplished using an async "submit-poll" mechanism.
To achieve this, we break down a compress/decompress job into two
separate activities if the driver is configured for non-irq async mode:
1) Submit a descriptor after caching the "idxd_desc" descriptor in the
req->drv_data, and return -EINPROGRESS.
2) Poll: Given a request, retrieve the descriptor and poll its completion
status for success/error.
This is enabled by the following additions in the driver (a short sketch
of the resulting submit-poll flow appears after this list):
1) The idxd_desc is cached in the "drv_data" member of "struct iaa_req".
2) IAA_REQ_POLL_FLAG: if set in the iaa_req's flags, this tells
the driver that it should submit the descriptor and return
-EINPROGRESS. If not set, the driver will proceed to call
check_completion() in fully synchronous mode, until the hardware
returns a completion status.
3) iaa_comp_poll() procedure: This routine is intended to be called
after submission returns -EINPROGRESS. It will check the completion
status once, and return -EAGAIN if the job has not completed. If the
job has completed, it will return the completion status.
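A minimal sketch of that submit-poll flow (illustrative only: it assumes
the driver-internal iaa_comp_acompress()/iaa_comp_poll() routines and the
definitions added to drivers/crypto/intel/iaa/iaa_crypto.h by this patch;
the loop structure shown here is an assumption, not the driver's exact
code, and nr_reqs is assumed to be <= IAA_CRYPTO_MAX_BATCH_SIZE):

#include "iaa_crypto.h"

static int example_submit_and_poll(struct iaa_compression_ctx *ctx,
				   struct iaa_req *reqs[], int nr_reqs)
{
	int errors[IAA_CRYPTO_MAX_BATCH_SIZE];
	bool pending;
	int i, err = 0;

	/* 1) Submit every job; each normally returns -EINPROGRESS. */
	for (i = 0; i < nr_reqs; ++i) {
		reqs[i]->flags |= IAA_REQ_POLL_FLAG;
		errors[i] = iaa_comp_acompress(ctx, reqs[i]);
		if (errors[i] && errors[i] != -EINPROGRESS)
			err = -EINVAL;
	}

	/* 2) Poll non-blocking until every submitted job has completed. */
	do {
		pending = false;
		for (i = 0; i < nr_reqs; ++i) {
			if (errors[i] != -EINPROGRESS && errors[i] != -EAGAIN)
				continue;
			errors[i] = iaa_comp_poll(ctx, reqs[i]);
			if (errors[i] == -EAGAIN)
				pending = true;
			else if (errors[i])
				err = -EINVAL;
		}
	} while (pending);

	return err;
}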
The purpose of this commit is to allow kernel users of iaa_crypto, such
as zswap, to be able to invoke the crypto_acomp_compress() API in fully
synchronous mode for sequential/non-batching use cases (i.e. today's
status-quo), wherein zswap calls:
crypto_wait_req(crypto_acomp_compress(req), wait);
and to seamlessly invoke fully asynchronous batch
compress/decompress functionality. Both use cases need to reuse the same
code paths in the driver to interface with hardware: the
IAA_REQ_POLL_FLAG allows this shared code to determine whether we need
to process an iaa_req synchronously/asynchronously. The idea is to
simplify iaa_crypto's sequential/batching interfaces for use by swap
modules.
Thus, regardless of the iaa_crypto driver's 'sync_mode' setting, it
can still be forced to use synchronous mode by *not setting* the
IAA_REQ_POLL_FLAG in iaa_req->flags: this is the default to support
sequential use cases in zswap today. In other words, both these
conditions need to be met for a request to be processed in fully async
submit-poll mode:
1) use_irq should be "false"
2) iaa_req->flags & IAA_REQ_POLL_FLAG should be "true"
The IAA batching functionality introduced in this patch will set
the IAA_REQ_POLL_FLAG for the requests in a batch. We will submit the
descriptors for each request in the batch in iaa_[de]compress(), and
return -EINPROGRESS. The hardware will begin processing each request as
soon as it is submitted; essentially all compress/decompress jobs will
be parallelized. The polling function, "iaa_comp_poll()", will retrieve
the descriptor from each iaa_req->drv_data to check its completion
status. This enables the iaa_crypto driver to implement true async
"submit-polling" for parallel compressions and decompressions in the IAA
hardware accelerator.
This patch introduces batch compressions/decompressions in
iaa_crypto, which zswap can invoke using the same crypto API mentioned
earlier.
IAA Batching allows the kernel swap modules to compress/decompress
multiple pages/buffers in parallel in hardware, significantly improving
swapout/swapin latency and throughput.
The patch defines an iaa_crypto constant, IAA_CRYPTO_MAX_BATCH_SIZE
(currently set to 8U). This is the maximum batch size for IAA, and
represents the maximum number of pages/buffers that can be
compressed/decompressed in parallel.
In order to support IAA batching, the iaa_crypto driver allocates a
per-CPU array of IAA_CRYPTO_MAX_BATCH_SIZE "struct iaa_req" requests
upon initialization. Notably, the task of allocating multiple requests
to submit to the hardware for parallel [de]compressions is taken over by
iaa_crypto, so that zswap doesn't need to allocate the reqs.
Compress batching is expected to be called by kernel modules such as
zswap by passing the folio pages as the "source" SG list of the
acomp_req, and by constructing an SG table of SG lists for the output
buffers and setting the acomp_req's "dst" to the head of this list of
scatterlists. Thanks to Herbert Xu for suggesting this batching
architecture.
Within the iaa_crypto driver's compress batching function:
1) The per-CPU iaa_reqs are populated from the acomp_req's src/dst SG
lists.
2) All iaa_reqs are submitted to the hardware in async mode, using
movdir64b. This enables hardware parallelism, because we don't wait
for one compress/decompress job to finish before submitting the next
one.
3) The iaa_reqs submitted are polled for completion statuses in a
non-blocking manner in a while loop: each request that is still
pending is polled once, and this repeats, until all requests have
completed.
The core IAA batching functions are:
static int iaa_comp_acompress_batch(
struct iaa_compression_ctx *ctx,
struct iaa_req *parent_req,
unsigned int unit_size);
static int iaa_comp_adecompress_batch(
struct iaa_compression_ctx *ctx,
struct iaa_req *parent_req,
unsigned int unit_size);
The parameter @unit_size represents the unit size in bytes used for
disassembling the source or destination lengths (@parent_req->slen or
@parent_req->dlen) and the SG lists passed in through
@parent_req->src and @parent_req->dst.
zswap will interface with this batching functionality by setting up the
acomp_req through these crypto API:
acomp_request_set_src_folio()
acomp_request_set_dst_sg()
acomp_request_set_unit_size()
before proceeding to invoke batch compression/decompression using the
existing crypto_acomp_compress()/crypto_acomp_decompress() interfaces,
as sketched below.
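A minimal caller-side sketch (illustrative only: example_batch_compress(),
@dst_table and @nr_pages are hypothetical, and the setup simply mirrors
the API calls listed above rather than zswap's actual code):

#include <crypto/acompress.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>

/*
 * Compress @nr_pages pages of @folio as one batch. @dst_table is assumed
 * to be an sg_table with @nr_pages entries, one per destination buffer.
 */
static int example_batch_compress(struct acomp_req *req, struct folio *folio,
				  struct sg_table *dst_table,
				  unsigned int nr_pages)
{
	acomp_request_set_callback(req, 0, NULL, NULL);
	acomp_request_set_unit_size(req, PAGE_SIZE);

	/* slen/unit_size tells the driver how many units are in the batch. */
	acomp_request_set_src_folio(req, folio, 0, nr_pages * PAGE_SIZE);
	acomp_request_set_dst_sg(req, dst_table->sgl, nr_pages * PAGE_SIZE);

	/* 0 on success; the caller then checks each dst sg->length. */
	return crypto_acomp_compress(req);
}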
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
drivers/crypto/intel/iaa/iaa_crypto.h | 30 ++
drivers/crypto/intel/iaa/iaa_crypto_main.c | 399 ++++++++++++++++++++-
2 files changed, 420 insertions(+), 9 deletions(-)
diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index 4dfb65c88f83..a99cd421f918 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -41,6 +41,35 @@
IAA_DECOMP_CHECK_FOR_EOB | \
IAA_DECOMP_STOP_ON_EOB)
+/*
+ * If set, the driver must have a way to submit the req, then
+ * poll its completion status for success/error.
+ */
+#define IAA_REQ_POLL_FLAG 0x00000002
+
+/*
+ * The maximum compress/decompress batch size for IAA's batch compression
+ * and batch decompression functionality.
+ */
+#define IAA_CRYPTO_MAX_BATCH_SIZE 8U
+
+/*
+ * Used to create per-CPU structure comprising of IAA_CRYPTO_MAX_BATCH_SIZE
+ * reqs for batch [de]compressions.
+ *
+ * @reqs: Used to submit up to IAA_CRYPTO_MAX_BATCH_SIZE parallel
+ * compress/decompress jobs to the accelerator.
+ * @mutex: Used to protect the per-CPU batch compression/decompression context
+ * from preemption/process migration; and to allow upper layers in the
+ * kernel to use synchronous/asynchronous compress/decompress calls to
+ * IAA. In other words, don't make any assumptions, and protect
+ * compression/decompression data.
+ */
+struct iaa_batch_ctx {
+ struct iaa_req **reqs;
+ struct mutex mutex;
+};
+
#define IAA_COMP_MODES_MAX IAA_MODE_NONE
enum iaa_mode {
@@ -51,6 +80,7 @@ enum iaa_mode {
struct iaa_req {
struct scatterlist *src;
struct scatterlist *dst;
+ struct scatterlist sg_src;
unsigned int slen;
unsigned int dlen;
u32 flags;
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 75bd455b3b34..910598405c5c 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -55,6 +55,9 @@ static struct wq_table_entry **pkg_global_comp_wqs;
static struct crypto_acomp *deflate_crypto_acomp;
DEFINE_MUTEX(deflate_crypto_acomp_lock);
+/* Per-cpu iaa_reqs for batching. */
+static struct iaa_batch_ctx __percpu *iaa_batch_ctx;
+
LIST_HEAD(iaa_devices);
DEFINE_MUTEX(iaa_devices_lock);
@@ -1901,13 +1904,14 @@ static int iaa_compress(struct iaa_compression_ctx *ctx, struct iaa_req *req,
ctx->mode, iaa_device->compression_modes[ctx->mode]);
if (likely(!ctx->use_irq)) {
+ req->drv_data = idxd_desc;
iaa_submit_desc_movdir64b(wq, idxd_desc);
/* Update stats */
update_total_comp_calls();
update_wq_comp_calls(wq);
- if (ctx->async_mode)
+ if (req->flags & IAA_REQ_POLL_FLAG)
return -EINPROGRESS;
ret = check_completion(dev, idxd_desc->iax_completion, true, false);
@@ -1989,13 +1993,14 @@ static int iaa_decompress(struct iaa_compression_ctx *ctx, struct iaa_req *req,
desc = iaa_setup_decompress_hw_desc(idxd_desc, src_addr, slen, dst_addr, *dlen);
if (likely(!ctx->use_irq)) {
+ req->drv_data = idxd_desc;
iaa_submit_desc_movdir64b(wq, idxd_desc);
/* Update stats */
update_total_decomp_calls();
update_wq_decomp_calls(wq);
- if (ctx->async_mode)
+ if (req->flags & IAA_REQ_POLL_FLAG)
return -EINPROGRESS;
ret = check_completion(dev, idxd_desc->iax_completion, false, false);
@@ -2198,6 +2203,301 @@ static int iaa_comp_adecompress(struct iaa_compression_ctx *ctx, struct iaa_req
return ret;
}
+static int iaa_comp_poll(struct iaa_compression_ctx *ctx, struct iaa_req *req)
+{
+ struct idxd_desc *idxd_desc;
+ struct idxd_device *idxd;
+ struct iaa_wq *iaa_wq;
+ struct pci_dev *pdev;
+ struct device *dev;
+ struct idxd_wq *wq;
+ bool compress_op;
+ int ret;
+
+ idxd_desc = req->drv_data;
+ if (!idxd_desc)
+ return -EAGAIN;
+
+ compress_op = (idxd_desc->iax_hw->opcode == IAX_OPCODE_COMPRESS);
+ wq = idxd_desc->wq;
+ iaa_wq = idxd_wq_get_private(wq);
+ idxd = iaa_wq->iaa_device->idxd;
+ pdev = idxd->pdev;
+ dev = &pdev->dev;
+
+ ret = check_completion(dev, idxd_desc->iax_completion, compress_op, true);
+ if (ret == -EAGAIN)
+ return ret;
+ if (ret)
+ goto out;
+
+ req->dlen = idxd_desc->iax_completion->output_size;
+
+ /* Update stats */
+ if (compress_op) {
+ update_total_comp_bytes_out(req->dlen);
+ update_wq_comp_bytes(wq, req->dlen);
+ } else {
+ update_total_decomp_bytes_in(req->slen);
+ update_wq_decomp_bytes(wq, req->slen);
+ }
+
+ if (compress_op && ctx->verify_compress) {
+ dma_addr_t src_addr, dst_addr;
+
+ req->compression_crc = idxd_desc->iax_completion->crc;
+
+ dma_sync_sg_for_device(dev, req->dst, 1, DMA_FROM_DEVICE);
+ dma_sync_sg_for_device(dev, req->src, 1, DMA_TO_DEVICE);
+
+ src_addr = sg_dma_address(req->src);
+ dst_addr = sg_dma_address(req->dst);
+
+ ret = iaa_compress_verify(ctx, req, wq, src_addr, req->slen,
+ dst_addr, req->dlen);
+ }
+
+out:
+ /* caller doesn't call crypto_wait_req, so no acomp_request_complete() */
+ dma_unmap_sg(dev, req->dst, 1, DMA_FROM_DEVICE);
+ dma_unmap_sg(dev, req->src, 1, DMA_TO_DEVICE);
+
+ idxd_free_desc(idxd_desc->wq, idxd_desc);
+ percpu_ref_put(&iaa_wq->ref);
+
+ return ret;
+}
+
+static __always_inline void iaa_set_req_poll(
+ struct iaa_req *reqs[],
+ int nr_reqs,
+ bool set_flag)
+{
+ int i;
+
+ for (i = 0; i < nr_reqs; ++i) {
+ set_flag ? (reqs[i]->flags |= IAA_REQ_POLL_FLAG) :
+ (reqs[i]->flags &= ~IAA_REQ_POLL_FLAG);
+ }
+}
+
+/**
+ * This API provides IAA compress batching functionality for use by swap
+ * modules.
+ *
+ * @ctx: compression ctx for the requested IAA mode (fixed/dynamic).
+ * @parent_req: The "parent" iaa_req that contains SG lists for the batch's
+ * inputs and outputs.
+ * @unit_size: The unit size to apply to @parent_req->slen to get the number of
+ * scatterlists it contains.
+ *
+ * The caller should check the individual sg->lengths in the @parent_req for
+ * errors, including incompressible page errors.
+ *
+ * Returns 0 if all compress requests in the batch complete successfully,
+ * -EINVAL otherwise.
+ */
+static int iaa_comp_acompress_batch(
+ struct iaa_compression_ctx *ctx,
+ struct iaa_req *parent_req,
+ unsigned int unit_size)
+{
+ struct iaa_batch_ctx *cpu_ctx = raw_cpu_ptr(iaa_batch_ctx);
+ int nr_reqs = parent_req->slen / unit_size;
+ int errors[IAA_CRYPTO_MAX_BATCH_SIZE];
+ int *dlens[IAA_CRYPTO_MAX_BATCH_SIZE];
+ bool compressions_done = false;
+ struct sg_page_iter sgiter;
+ struct scatterlist *sg;
+ struct iaa_req **reqs;
+ int i, err = 0;
+
+ mutex_lock(&cpu_ctx->mutex);
+
+ reqs = cpu_ctx->reqs;
+
+ __sg_page_iter_start(&sgiter, parent_req->src, nr_reqs,
+ parent_req->src->offset/unit_size);
+
+ for (i = 0; i < nr_reqs; ++i, ++sgiter.sg_pgoffset) {
+ sg_set_page(reqs[i]->src, sg_page_iter_page(&sgiter), PAGE_SIZE, 0);
+ reqs[i]->slen = PAGE_SIZE;
+ }
+
+ for_each_sg(parent_req->dst, sg, nr_reqs, i) {
+ sg->length = PAGE_SIZE;
+ dlens[i] = &sg->length;
+ reqs[i]->dst = sg;
+ reqs[i]->dlen = PAGE_SIZE;
+ }
+
+ iaa_set_req_poll(reqs, nr_reqs, true);
+
+ /*
+ * Prepare and submit the batch of iaa_reqs to IAA. IAA will process
+ * these compress jobs in parallel.
+ */
+ for (i = 0; i < nr_reqs; ++i) {
+ errors[i] = iaa_comp_acompress(ctx, reqs[i]);
+
+ if (likely(errors[i] == -EINPROGRESS)) {
+ errors[i] = -EAGAIN;
+ } else if (unlikely(errors[i])) {
+ *dlens[i] = errors[i];
+ err = -EINVAL;
+ } else {
+ *dlens[i] = reqs[i]->dlen;
+ }
+ }
+
+ /*
+ * Asynchronously poll for and process IAA compress job completions.
+ */
+ while (!compressions_done) {
+ compressions_done = true;
+
+ for (i = 0; i < nr_reqs; ++i) {
+ /*
+ * Skip, if the compression has already completed
+ * successfully or with an error.
+ */
+ if (errors[i] != -EAGAIN)
+ continue;
+
+ errors[i] = iaa_comp_poll(ctx, reqs[i]);
+
+ if (errors[i]) {
+ if (likely(errors[i] == -EAGAIN)) {
+ compressions_done = false;
+ } else {
+ *dlens[i] = errors[i];
+ err = -EINVAL;
+ }
+ } else {
+ *dlens[i] = reqs[i]->dlen;
+ }
+ }
+ }
+
+ /*
+ * For the same 'reqs[]' to be usable by
+ * iaa_comp_acompress()/iaa_comp_adecompress(),
+ * clear the IAA_REQ_POLL_FLAG bit on all iaa_reqs.
+ */
+ iaa_set_req_poll(reqs, nr_reqs, false);
+
+ mutex_unlock(&cpu_ctx->mutex);
+ return err;
+}
+
+/**
+ * This API provides IAA decompress batching functionality for use by swap
+ * modules.
+ *
+ * @ctx: compression ctx for the requested IAA mode (fixed/dynamic).
+ * @parent_req: The "parent" iaa_req that contains SG lists for the batch's
+ * inputs and outputs.
+ * @unit_size: The unit size to apply to @parent_req->dlen to get the number of
+ * scatterlists it contains.
+ *
+ * The caller should check @parent_req->dst scatterlist's component SG lists'
+ * @length for errors and handle @length != PAGE_SIZE.
+ *
+ * Returns 0 if all decompress requests complete successfully,
+ * -EINVAL otherwise.
+ */
+static int iaa_comp_adecompress_batch(
+ struct iaa_compression_ctx *ctx,
+ struct iaa_req *parent_req,
+ unsigned int unit_size)
+{
+ struct iaa_batch_ctx *cpu_ctx = raw_cpu_ptr(iaa_batch_ctx);
+ int nr_reqs = parent_req->dlen / unit_size;
+ int errors[IAA_CRYPTO_MAX_BATCH_SIZE];
+ int *dlens[IAA_CRYPTO_MAX_BATCH_SIZE];
+ bool decompressions_done = false;
+ struct scatterlist *sg;
+ struct iaa_req **reqs;
+ int i, err = 0;
+
+ mutex_lock(&cpu_ctx->mutex);
+
+ reqs = cpu_ctx->reqs;
+
+ for_each_sg(parent_req->src, sg, nr_reqs, i) {
+ reqs[i]->src = sg;
+ reqs[i]->slen = sg->length;
+ }
+
+ for_each_sg(parent_req->dst, sg, nr_reqs, i) {
+ dlens[i] = &sg->length;
+ reqs[i]->dst = sg;
+ reqs[i]->dlen = PAGE_SIZE;
+ }
+
+ iaa_set_req_poll(reqs, nr_reqs, true);
+
+ /*
+ * Prepare and submit the batch of iaa_reqs to IAA. IAA will process
+ * these decompress jobs in parallel.
+ */
+ for (i = 0; i < nr_reqs; ++i) {
+ errors[i] = iaa_comp_adecompress(ctx, reqs[i]);
+
+ /*
+ * If desc allocation/submission failed, errors[i] can be 0 or
+ * an error value from the software decompress fallback.
+ */
+ if (likely(errors[i] == -EINPROGRESS)) {
+ errors[i] = -EAGAIN;
+ } else if (unlikely(errors[i])) {
+ *dlens[i] = errors[i];
+ err = -EINVAL;
+ } else {
+ *dlens[i] = reqs[i]->dlen;
+ }
+ }
+
+ /*
+ * Asynchronously poll for and process IAA decompress job completions.
+ */
+ while (!decompressions_done) {
+ decompressions_done = true;
+
+ for (i = 0; i < nr_reqs; ++i) {
+ /*
+ * Skip, if the decompression has already completed
+ * successfully or with an error.
+ */
+ if (errors[i] != -EAGAIN)
+ continue;
+
+ errors[i] = iaa_comp_poll(ctx, reqs[i]);
+
+ if (errors[i]) {
+ if (likely(errors[i] == -EAGAIN)) {
+ decompressions_done = false;
+ } else {
+ *dlens[i] = errors[i];
+ err = -EINVAL;
+ }
+ } else {
+ *dlens[i] = reqs[i]->dlen;
+ }
+ }
+ }
+
+ /*
+ * For the same 'reqs[]' to be usable by
+ * iaa_comp_acompress()/iaa_comp_adecompress(),
+ * clear the IAA_REQ_POLL_FLAG bit on all iaa_reqs.
+ */
+ iaa_set_req_poll(reqs, nr_reqs, false);
+
+ mutex_unlock(&cpu_ctx->mutex);
+ return err;
+}
+
static void compression_ctx_init(struct iaa_compression_ctx *ctx, enum iaa_mode mode)
{
ctx->mode = mode;
@@ -2222,9 +2522,19 @@ static int iaa_comp_acompress_main(struct acomp_req *areq)
if (iaa_alg_is_registered(crypto_tfm_alg_driver_name(tfm), &idx)) {
ctx = iaa_ctx[idx];
- acomp_to_iaa(areq, &req, ctx);
- ret = iaa_comp_acompress(ctx, &req);
- iaa_to_acomp(unlikely(ret) ? ret : req.dlen, areq);
+ if (likely(areq->slen == areq->unit_size) || !areq->unit_size) {
+ acomp_to_iaa(areq, &req, ctx);
+ ret = iaa_comp_acompress(ctx, &req);
+ iaa_to_acomp(unlikely(ret) ? ret : req.dlen, areq);
+ } else {
+ acomp_to_iaa(areq, &req, ctx);
+ ret = iaa_comp_acompress_batch(ctx, &req, areq->unit_size);
+ /*
+ * Set the acomp_req's dlen to be the first SG list's
+ * compressed length/error value.
+ */
+ areq->dlen = req.dst->length;
+ }
}
return ret;
@@ -2240,9 +2550,19 @@ static int iaa_comp_adecompress_main(struct acomp_req *areq)
if (iaa_alg_is_registered(crypto_tfm_alg_driver_name(tfm), &idx)) {
ctx = iaa_ctx[idx];
- acomp_to_iaa(areq, &req, ctx);
- ret = iaa_comp_adecompress(ctx, &req);
- iaa_to_acomp(unlikely(ret) ? ret : req.dlen, areq);
+ if (likely(areq->dlen == areq->unit_size) || !areq->unit_size) {
+ acomp_to_iaa(areq, &req, ctx);
+ ret = iaa_comp_adecompress(ctx, &req);
+ iaa_to_acomp(unlikely(ret) ? ret : req.dlen, areq);
+ } else {
+ acomp_to_iaa(areq, &req, ctx);
+ ret = iaa_comp_adecompress_batch(ctx, &req, areq->unit_size);
+ /*
+ * Set the acomp_req's dlen to be the first SG list's
+ * decompressed length/error value.
+ */
+ areq->dlen = req.dst->length;
+ }
}
return ret;
@@ -2527,9 +2847,31 @@ static struct idxd_device_driver iaa_crypto_driver = {
* Module init/exit.
********************/
+static void iaa_batch_ctx_dealloc(void)
+{
+ int cpu;
+ u8 i;
+
+ if (!iaa_batch_ctx)
+ return;
+
+ for_each_possible_cpu(cpu) {
+ struct iaa_batch_ctx *cpu_ctx = per_cpu_ptr(iaa_batch_ctx, cpu);
+
+ if (cpu_ctx && cpu_ctx->reqs) {
+ for (i = 0; i < IAA_CRYPTO_MAX_BATCH_SIZE; ++i)
+ kfree(cpu_ctx->reqs[i]);
+ kfree(cpu_ctx->reqs);
+ }
+ }
+
+ free_percpu(iaa_batch_ctx);
+}
+
static int __init iaa_crypto_init_module(void)
{
- int ret = 0;
+ int cpu, ret = 0;
+ u8 i;
INIT_LIST_HEAD(&iaa_devices);
@@ -2591,6 +2933,39 @@ static int __init iaa_crypto_init_module(void)
goto err_sync_attr_create;
}
+ /* Allocate batching resources for iaa_crypto. */
+ iaa_batch_ctx = alloc_percpu_gfp(struct iaa_batch_ctx, GFP_KERNEL | __GFP_ZERO);
+ if (!iaa_batch_ctx) {
+ pr_debug("Failed to allocate per-cpu iaa_batch_ctx\n");
+ goto batch_ctx_fail;
+ }
+
+ for_each_possible_cpu(cpu) {
+ struct iaa_batch_ctx *cpu_ctx = per_cpu_ptr(iaa_batch_ctx, cpu);
+ int nid = cpu_to_node(cpu);
+
+ cpu_ctx->reqs = kcalloc_node(IAA_CRYPTO_MAX_BATCH_SIZE,
+ sizeof(struct iaa_req *),
+ GFP_KERNEL, nid);
+
+ if (!cpu_ctx->reqs)
+ goto reqs_fail;
+
+ for (i = 0; i < IAA_CRYPTO_MAX_BATCH_SIZE; ++i) {
+ cpu_ctx->reqs[i] = kzalloc_node(sizeof(struct iaa_req),
+ GFP_KERNEL, nid);
+ if (!cpu_ctx->reqs[i]) {
+ pr_debug("Could not alloc iaa_req reqs[%d]\n", i);
+ goto reqs_fail;
+ }
+
+ sg_init_table(&cpu_ctx->reqs[i]->sg_src, 1);
+ cpu_ctx->reqs[i]->src = &cpu_ctx->reqs[i]->sg_src;
+ }
+
+ mutex_init(&cpu_ctx->mutex);
+ }
+
if (iaa_crypto_debugfs_init())
pr_warn("debugfs init failed, stats not available\n");
@@ -2598,6 +2973,11 @@ static int __init iaa_crypto_init_module(void)
out:
return ret;
+reqs_fail:
+ iaa_batch_ctx_dealloc();
+batch_ctx_fail:
+ driver_remove_file(&iaa_crypto_driver.drv,
+ &driver_attr_sync_mode);
err_sync_attr_create:
driver_remove_file(&iaa_crypto_driver.drv,
&driver_attr_verify_compress);
@@ -2629,6 +3009,7 @@ static void __exit iaa_crypto_cleanup_module(void)
iaa_unregister_acomp_compression_device();
iaa_unregister_compression_device();
+ iaa_batch_ctx_dealloc();
iaa_crypto_debugfs_cleanup();
driver_remove_file(&iaa_crypto_driver.drv,
&driver_attr_sync_mode);
--
2.27.0
* Re: [PATCH v13 13/22] crypto: iaa - IAA Batching for parallel compressions/decompressions.
2025-11-04 9:12 ` [PATCH v13 13/22] crypto: iaa - IAA Batching for parallel compressions/decompressions Kanchana P Sridhar
@ 2025-11-14 9:59 ` Herbert Xu
2025-11-16 18:53 ` Sridhar, Kanchana P
0 siblings, 1 reply; 47+ messages in thread
From: Herbert Xu @ 2025-11-14 9:59 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, davem, clabbe, ardb,
ebiggers, surenb, kristen.c.accardi, vinicius.gomes,
wajdi.k.feghali, vinodh.gopal
On Tue, Nov 04, 2025 at 01:12:26AM -0800, Kanchana P Sridhar wrote:
>
> +/**
> + * This API provides IAA compress batching functionality for use by swap
> + * modules.
> + *
> + * @ctx: compression ctx for the requested IAA mode (fixed/dynamic).
> + * @parent_req: The "parent" iaa_req that contains SG lists for the batch's
> + * inputs and outputs.
> + * @unit_size: The unit size to apply to @parent_req->slen to get the number of
> + * scatterlists it contains.
> + *
> + * The caller should check the individual sg->lengths in the @parent_req for
> + * errors, including incompressible page errors.
> + *
> + * Returns 0 if all compress requests in the batch complete successfully,
> + * -EINVAL otherwise.
> + */
> +static int iaa_comp_acompress_batch(
> + struct iaa_compression_ctx *ctx,
> + struct iaa_req *parent_req,
> + unsigned int unit_size)
> +{
> + struct iaa_batch_ctx *cpu_ctx = raw_cpu_ptr(iaa_batch_ctx);
> + int nr_reqs = parent_req->slen / unit_size;
> + int errors[IAA_CRYPTO_MAX_BATCH_SIZE];
> + int *dlens[IAA_CRYPTO_MAX_BATCH_SIZE];
> + bool compressions_done = false;
> + struct sg_page_iter sgiter;
> + struct scatterlist *sg;
> + struct iaa_req **reqs;
> + int i, err = 0;
> +
> + mutex_lock(&cpu_ctx->mutex);
> +
> + reqs = cpu_ctx->reqs;
> +
> + __sg_page_iter_start(&sgiter, parent_req->src, nr_reqs,
> + parent_req->src->offset/unit_size);
> +
> + for (i = 0; i < nr_reqs; ++i, ++sgiter.sg_pgoffset) {
> + sg_set_page(reqs[i]->src, sg_page_iter_page(&sgiter), PAGE_SIZE, 0);
> + reqs[i]->slen = PAGE_SIZE;
> + }
> +
> + for_each_sg(parent_req->dst, sg, nr_reqs, i) {
> + sg->length = PAGE_SIZE;
> + dlens[i] = &sg->length;
> + reqs[i]->dst = sg;
> + reqs[i]->dlen = PAGE_SIZE;
> + }
> +
> + iaa_set_req_poll(reqs, nr_reqs, true);
> +
> + /*
> + * Prepare and submit the batch of iaa_reqs to IAA. IAA will process
> + * these compress jobs in parallel.
> + */
> + for (i = 0; i < nr_reqs; ++i) {
> + errors[i] = iaa_comp_acompress(ctx, reqs[i]);
> +
> + if (likely(errors[i] == -EINPROGRESS)) {
> + errors[i] = -EAGAIN;
> + } else if (unlikely(errors[i])) {
> + *dlens[i] = errors[i];
> + err = -EINVAL;
> + } else {
> + *dlens[i] = reqs[i]->dlen;
> + }
> + }
> +
> + /*
> + * Asynchronously poll for and process IAA compress job completions.
> + */
> + while (!compressions_done) {
> + compressions_done = true;
> +
> + for (i = 0; i < nr_reqs; ++i) {
> + /*
> + * Skip, if the compression has already completed
> + * successfully or with an error.
> + */
> + if (errors[i] != -EAGAIN)
> + continue;
> +
> + errors[i] = iaa_comp_poll(ctx, reqs[i]);
> +
> + if (errors[i]) {
> + if (likely(errors[i] == -EAGAIN)) {
> + compressions_done = false;
> + } else {
> + *dlens[i] = errors[i];
> + err = -EINVAL;
> + }
> + } else {
> + *dlens[i] = reqs[i]->dlen;
> + }
> + }
> + }
Why is this polling necessary?
The crypto_acomp interface is async, even if the only user that
you're proposing is synchronous.
IOW the driver shouldn't care about synchronous polling at all.
Just invoke the callback once all the sub-requests are complete
and the wait call in zswap will take care of the rest.
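For reference, the caller side then just uses the standard wait pattern,
roughly like this (the acomp/src/dst/slen/dlen names are placeholders):

#include <crypto/acompress.h>
#include <linux/crypto.h>
#include <linux/scatterlist.h>

static int example_acomp_compress(struct crypto_acomp *acomp,
				  struct scatterlist *src,
				  struct scatterlist *dst,
				  unsigned int slen, unsigned int dlen)
{
	struct acomp_req *req = acomp_request_alloc(acomp);
	DECLARE_CRYPTO_WAIT(wait);
	int err;

	if (!req)
		return -ENOMEM;

	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
				   crypto_req_done, &wait);
	acomp_request_set_params(req, src, dst, slen, dlen);

	/*
	 * The driver may return -EINPROGRESS/-EBUSY here; crypto_wait_req()
	 * sleeps until the driver invokes the completion callback with the
	 * final error code for the whole (batched) request.
	 */
	err = crypto_wait_req(crypto_acomp_compress(req), &wait);

	acomp_request_free(req);
	return err;
}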
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* RE: [PATCH v13 13/22] crypto: iaa - IAA Batching for parallel compressions/decompressions.
2025-11-14 9:59 ` Herbert Xu
@ 2025-11-16 18:53 ` Sridhar, Kanchana P
2025-11-17 3:12 ` Herbert Xu
0 siblings, 1 reply; 47+ messages in thread
From: Sridhar, Kanchana P @ 2025-11-16 18:53 UTC (permalink / raw)
To: Herbert Xu
Cc: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, davem, clabbe, ardb,
ebiggers, surenb, Accardi, Kristen C, Gomes, Vinicius, Feghali,
Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P
> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Friday, November 14, 2025 1:59 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> Kristen C <kristen.c.accardi@intel.com>; Gomes, Vinicius
> <vinicius.gomes@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 13/22] crypto: iaa - IAA Batching for parallel
> compressions/decompressions.
>
> On Tue, Nov 04, 2025 at 01:12:26AM -0800, Kanchana P Sridhar wrote:
> >
> > +/**
> > + * This API provides IAA compress batching functionality for use by swap
> > + * modules.
> > + *
> > + * @ctx: compression ctx for the requested IAA mode (fixed/dynamic).
> > + * @parent_req: The "parent" iaa_req that contains SG lists for the batch's
> > + * inputs and outputs.
> > + * @unit_size: The unit size to apply to @parent_req->slen to get the
> number of
> > + * scatterlists it contains.
> > + *
> > + * The caller should check the individual sg->lengths in the @parent_req
> for
> > + * errors, including incompressible page errors.
> > + *
> > + * Returns 0 if all compress requests in the batch complete successfully,
> > + * -EINVAL otherwise.
> > + */
> > +static int iaa_comp_acompress_batch(
> > + struct iaa_compression_ctx *ctx,
> > + struct iaa_req *parent_req,
> > + unsigned int unit_size)
> > +{
> > + struct iaa_batch_ctx *cpu_ctx = raw_cpu_ptr(iaa_batch_ctx);
> > + int nr_reqs = parent_req->slen / unit_size;
> > + int errors[IAA_CRYPTO_MAX_BATCH_SIZE];
> > + int *dlens[IAA_CRYPTO_MAX_BATCH_SIZE];
> > + bool compressions_done = false;
> > + struct sg_page_iter sgiter;
> > + struct scatterlist *sg;
> > + struct iaa_req **reqs;
> > + int i, err = 0;
> > +
> > + mutex_lock(&cpu_ctx->mutex);
> > +
> > + reqs = cpu_ctx->reqs;
> > +
> > + __sg_page_iter_start(&sgiter, parent_req->src, nr_reqs,
> > + parent_req->src->offset/unit_size);
> > +
> > + for (i = 0; i < nr_reqs; ++i, ++sgiter.sg_pgoffset) {
> > + sg_set_page(reqs[i]->src, sg_page_iter_page(&sgiter),
> PAGE_SIZE, 0);
> > + reqs[i]->slen = PAGE_SIZE;
> > + }
> > +
> > + for_each_sg(parent_req->dst, sg, nr_reqs, i) {
> > + sg->length = PAGE_SIZE;
> > + dlens[i] = &sg->length;
> > + reqs[i]->dst = sg;
> > + reqs[i]->dlen = PAGE_SIZE;
> > + }
> > +
> > + iaa_set_req_poll(reqs, nr_reqs, true);
> > +
> > + /*
> > + * Prepare and submit the batch of iaa_reqs to IAA. IAA will process
> > + * these compress jobs in parallel.
> > + */
> > + for (i = 0; i < nr_reqs; ++i) {
> > + errors[i] = iaa_comp_acompress(ctx, reqs[i]);
> > +
> > + if (likely(errors[i] == -EINPROGRESS)) {
> > + errors[i] = -EAGAIN;
> > + } else if (unlikely(errors[i])) {
> > + *dlens[i] = errors[i];
> > + err = -EINVAL;
> > + } else {
> > + *dlens[i] = reqs[i]->dlen;
> > + }
> > + }
> > +
> > + /*
> > + * Asynchronously poll for and process IAA compress job completions.
> > + */
> > + while (!compressions_done) {
> > + compressions_done = true;
> > +
> > + for (i = 0; i < nr_reqs; ++i) {
> > + /*
> > + * Skip, if the compression has already completed
> > + * successfully or with an error.
> > + */
> > + if (errors[i] != -EAGAIN)
> > + continue;
> > +
> > + errors[i] = iaa_comp_poll(ctx, reqs[i]);
> > +
> > + if (errors[i]) {
> > + if (likely(errors[i] == -EAGAIN)) {
> > + compressions_done = false;
> > + } else {
> > + *dlens[i] = errors[i];
> > + err = -EINVAL;
> > + }
> > + } else {
> > + *dlens[i] = reqs[i]->dlen;
> > + }
> > + }
> > + }
>
> Why is this polling necessary?
>
> The crypto_acomp interface is async, even if the only user that
> you're proposing is synchronous.
>
> IOW the driver shouldn't care about synchronous polling at all.
> Just invoke the callback once all the sub-requests are complete
> and the wait call in zswap will take care of the rest.
Hi Herbert,
This is a simple/low-overhead implementation that tries to avail of
hardware parallelism by launching multiple compress/decompress jobs
to the accelerator. Each job runs independently of the other from a
driver perspective. For e.g., no assumptions are made in the driver
about submission order vis-à-vis completion order. Completions can
occur asynchronously.
The polling is intended for exactly the purpose you mention, namely,
to know when all the sub-requests are complete and to set the sg->length
as each sub-request completes. Please let me know if I understood your
question correctly.
Thanks,
Kanchana
>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH v13 13/22] crypto: iaa - IAA Batching for parallel compressions/decompressions.
2025-11-16 18:53 ` Sridhar, Kanchana P
@ 2025-11-17 3:12 ` Herbert Xu
2025-11-17 5:47 ` Sridhar, Kanchana P
0 siblings, 1 reply; 47+ messages in thread
From: Herbert Xu @ 2025-11-17 3:12 UTC (permalink / raw)
To: Sridhar, Kanchana P
Cc: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, davem, clabbe, ardb,
ebiggers, surenb, Accardi, Kristen C, Gomes, Vinicius, Feghali,
Wajdi K, Gopal, Vinodh
On Sun, Nov 16, 2025 at 06:53:08PM +0000, Sridhar, Kanchana P wrote:
>
> This is a simple/low-overhead implementation that tries to avail of
> hardware parallelism by launching multiple compress/decompress jobs
> to the accelerator. Each job runs independently of the other from a
> driver perspective. For e.g., no assumptions are made in the driver
> about submission order vis-à-vis completion order. Completions can
> occur asynchronously.
>
> The polling is intended for exactly the purpose you mention, namely,
> to know when all the sub-requests are complete and to set the sg->length
> as each sub-request completes. Please let me know if I understood your
> question correctly.
The issue here is that this code is being plugged into the acomp
API, but it isn't implementing the acomp API correctly. The acomp
API is supposed to be asynchronous and you should return immediately
here and then invoke the callback when every sub-request is complete.
I know that the ultimate user is synchronous, but still the driver
needs to implement the acomp API correctly.
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* RE: [PATCH v13 13/22] crypto: iaa - IAA Batching for parallel compressions/decompressions.
2025-11-17 3:12 ` Herbert Xu
@ 2025-11-17 5:47 ` Sridhar, Kanchana P
0 siblings, 0 replies; 47+ messages in thread
From: Sridhar, Kanchana P @ 2025-11-17 5:47 UTC (permalink / raw)
To: Herbert Xu
Cc: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, davem, clabbe, ardb,
ebiggers, surenb, Accardi, Kristen C, Gomes, Vinicius, Feghali,
Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P
> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Sunday, November 16, 2025 7:13 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> Kristen C <kristen.c.accardi@intel.com>; Gomes, Vinicius
> <vinicius.gomes@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 13/22] crypto: iaa - IAA Batching for parallel
> compressions/decompressions.
>
> On Sun, Nov 16, 2025 at 06:53:08PM +0000, Sridhar, Kanchana P wrote:
> >
> > This is a simple/low-overhead implementation that tries to avail of
> > hardware parallelism by launching multiple compress/decompress jobs
> > to the accelerator. Each job runs independently of the other from a
> > driver perspective. For e.g., no assumptions are made in the driver
> > about submission order vis-à-vis completion order. Completions can
> > occur asynchronously.
> >
> > The polling is intended for exactly the purpose you mention, namely,
> > to know when all the sub-requests are complete and to set the sg->length
> > as each sub-request completes. Please let me know if I understood your
> > question correctly.
>
> The issue here is that this code is being plugged into the acomp
> API, but it isn't implementing the acomp API correctly. The acomp
> API is supposed to be asynchronous and you should return immediately
> here and then invoke the callback when every sub-request is complete.
>
> I know that the ultimate user is synchronous, but still the driver
> needs to implement the acomp API correctly.
Thanks Herbert, for this explanation. I think the main problem to solve
is how to signal the callback with the "err" of all sub-requests, noting
which of them are still -EINPROGRESS, vs. having completed with
error/success, set the sg->lengths, etc. I am wondering how this can
be done since I have already returned after submitting the requests.
It seems there needs to be a polling mechanism after returning from
crypto_acomp_compress() once I have submitted the sub-requests.
This polling would ultimately call acomp_request_complete() after all
sub-requests have completed. Do you have any suggestions on how
this can be accomplished?
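For instance, would a self-rearming poller along these lines be a
reasonable direction? This is purely a sketch; the 'struct iaa_batch'
bookkeeping is hypothetical and not in this patch, and it reuses
iaa_comp_poll() from this patch:

struct iaa_batch {
	struct acomp_req *parent_areq;
	struct iaa_compression_ctx *ctx;
	struct iaa_req *reqs[IAA_CRYPTO_MAX_BATCH_SIZE];
	int errors[IAA_CRYPTO_MAX_BATCH_SIZE];
	int nr_reqs;
	int err;
	struct delayed_work poll_work;
};

static void iaa_batch_poll_work(struct work_struct *work)
{
	struct iaa_batch *batch = container_of(to_delayed_work(work),
					       struct iaa_batch, poll_work);
	bool done = true;
	int i;

	for (i = 0; i < batch->nr_reqs; ++i) {
		if (batch->errors[i] != -EAGAIN)
			continue;

		batch->errors[i] = iaa_comp_poll(batch->ctx, batch->reqs[i]);
		if (batch->errors[i] == -EAGAIN)
			done = false;
		else if (batch->errors[i])
			batch->err = -EINVAL;
	}

	if (!done) {
		/* Not all sub-requests are done yet; poll again. */
		schedule_delayed_work(&batch->poll_work, 0);
		return;
	}

	/* All sub-requests have completed; signal the caller's wait. */
	acomp_request_complete(batch->parent_areq, batch->err);
}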
Thanks,
Kanchana
>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* [PATCH v13 14/22] crypto: iaa - Enable async mode and make it the default.
2025-11-04 9:12 [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
` (12 preceding siblings ...)
2025-11-04 9:12 ` [PATCH v13 13/22] crypto: iaa - IAA Batching for parallel compressions/decompressions Kanchana P Sridhar
@ 2025-11-04 9:12 ` Kanchana P Sridhar
2025-11-04 9:12 ` [PATCH v13 15/22] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
` (8 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Kanchana P Sridhar @ 2025-11-04 9:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, herbert, davem,
clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
vinicius.gomes
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch enables the 'async' sync_mode in the driver and makes it the
default. This lets the iaa_crypto driver load by default in the most
efficient/recommended 'async' mode for parallel
compressions/decompressions, namely asynchronous submission of
descriptors followed by polling for job completions. Previously, 'sync'
mode was the default.
The iaa_crypto driver documentation has been updated with these
changes.
This way, anyone who wants to use IAA for zswap/zram can do so after
building the kernel, and without having to go through these steps to use
async mode:
1) disable all the IAA device/wq bindings that happen at boot time
2) rmmod iaa_crypto
3) modprobe iaa_crypto
4) echo async > /sys/bus/dsa/drivers/crypto/sync_mode
5) re-run initialization of the IAA devices and wqs
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
Documentation/driver-api/crypto/iaa/iaa-crypto.rst | 11 ++---------
drivers/crypto/intel/iaa/iaa_crypto_main.c | 4 ++--
2 files changed, 4 insertions(+), 11 deletions(-)
diff --git a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
index 0ff4ec603b43..d5e610ef4612 100644
--- a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
+++ b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
@@ -272,7 +272,7 @@ The available attributes are:
echo async_irq > /sys/bus/dsa/drivers/crypto/sync_mode
Async mode without interrupts (caller must poll) can be enabled by
- writing 'async' to it (please see Caveat)::
+ writing 'async' to it::
echo async > /sys/bus/dsa/drivers/crypto/sync_mode
@@ -281,14 +281,7 @@ The available attributes are:
echo sync > /sys/bus/dsa/drivers/crypto/sync_mode
- The default mode is 'sync'.
-
- Caveat: since the only mechanism that iaa_crypto currently implements
- for async polling without interrupts is via the 'sync' mode as
- described earlier, writing 'async' to
- '/sys/bus/dsa/drivers/crypto/sync_mode' will internally enable the
- 'sync' mode. This is to ensure correct iaa_crypto behavior until true
- async polling without interrupts is enabled in iaa_crypto.
+ The default mode is 'async'.
- g_comp_wqs_per_iaa
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 910598405c5c..8f477577dbd1 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -152,7 +152,7 @@ static bool iaa_verify_compress = true;
*/
/* Use async mode */
-static bool async_mode;
+static bool async_mode = true;
/* Use interrupts */
static bool use_irq;
@@ -206,7 +206,7 @@ static int set_iaa_sync_mode(const char *name)
async_mode = false;
use_irq = false;
} else if (sysfs_streq(name, "async")) {
- async_mode = false;
+ async_mode = true;
use_irq = false;
} else if (sysfs_streq(name, "async_irq")) {
async_mode = true;
--
2.27.0
* [PATCH v13 15/22] crypto: iaa - Disable iaa_verify_compress by default.
2025-11-04 9:12 [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
` (13 preceding siblings ...)
2025-11-04 9:12 ` [PATCH v13 14/22] crypto: iaa - Enable async mode and make it the default Kanchana P Sridhar
@ 2025-11-04 9:12 ` Kanchana P Sridhar
2025-11-04 9:12 ` [PATCH v13 16/22] crypto: iaa - Submit the two largest source buffers first in decompress batching Kanchana P Sridhar
` (7 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Kanchana P Sridhar @ 2025-11-04 9:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, herbert, davem,
clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
vinicius.gomes
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch loads the iaa_crypto driver with "iaa_verify_compress"
disabled by default, to facilitate performance comparisons with software
compressors (which do not run compress verification). Previously,
iaa_crypto compress verification was enabled by default.
The iaa_crypto driver documentation has been updated with this change.
With this patch, if users want to enable compress verification, they can do
so with these steps:
1) disable all the IAA device/wq bindings that happen at boot time
2) rmmod iaa_crypto
3) modprobe iaa_crypto
4) echo 1 > /sys/bus/dsa/drivers/crypto/verify_compress
5) re-run initialization of the IAA devices and wqs
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
Documentation/driver-api/crypto/iaa/iaa-crypto.rst | 2 +-
drivers/crypto/intel/iaa/iaa_crypto_main.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
index d5e610ef4612..81a7dbd15f8b 100644
--- a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
+++ b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
@@ -239,7 +239,7 @@ The available attributes are:
echo 0 > /sys/bus/dsa/drivers/crypto/verify_compress
- The default setting is '1' - verify all compresses.
+ The default setting is '0' - to not verify compresses.
- sync_mode
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 8f477577dbd1..349fea0af454 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -120,7 +120,7 @@ static bool iaa_distribute_decomps;
static bool iaa_distribute_comps = true;
/* Verify results of IAA compress or not */
-static bool iaa_verify_compress = true;
+static bool iaa_verify_compress;
/*
* The iaa crypto driver supports three 'sync' methods determining how
--
2.27.0
* [PATCH v13 16/22] crypto: iaa - Submit the two largest source buffers first in decompress batching.
2025-11-04 9:12 [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
` (14 preceding siblings ...)
2025-11-04 9:12 ` [PATCH v13 15/22] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
@ 2025-11-04 9:12 ` Kanchana P Sridhar
2025-11-04 9:12 ` [PATCH v13 17/22] crypto: iaa - Add deflate-iaa-dynamic compression mode Kanchana P Sridhar
` (6 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Kanchana P Sridhar @ 2025-11-04 9:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, herbert, davem,
clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
vinicius.gomes
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch finds the two largest source buffers in a given decompression
batch, and submits them first to the IAA decompress engines.
This improves decompress batching latency because the hardware has a
head start on decompressing the highest latency source buffers in the
batch. Workload performance is also significantly improved as a result
of this optimization.
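As a concrete illustration of the selection logic (a standalone userspace
sketch with hypothetical compressed lengths; it mirrors
decomp_batch_get_max_slens_idx() introduced below):

#include <stdio.h>

int main(void)
{
	unsigned int slens[] = { 1200, 3900, 2100, 3300 };
	int i, n = 4, max_i = 0, next_max_i = 0;

	for (i = 0; i < n; ++i) {
		if (slens[i] >= slens[max_i]) {
			next_max_i = max_i;
			max_i = i;
		} else if (next_max_i == max_i ||
			   slens[i] > slens[next_max_i]) {
			next_max_i = i;
		}
	}

	/*
	 * Prints "max_i=1 next_max_i=3": the batch is then submitted in
	 * the order 1, 3, 0, 2 before polling for completions.
	 */
	printf("max_i=%d next_max_i=%d\n", max_i, next_max_i);

	return 0;
}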
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
drivers/crypto/intel/iaa/iaa_crypto_main.c | 61 +++++++++++++++++++++-
1 file changed, 59 insertions(+), 2 deletions(-)
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 349fea0af454..cc0d82154ff6 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -2390,6 +2390,36 @@ static int iaa_comp_acompress_batch(
return err;
}
+/*
+ * Find the two largest source buffers in @slens for a decompress batch,
+ * and pass their indices back in @idx_max and @idx_next_max.
+ *
+ * Returns true if there is no second largest source buffer, only a max buffer.
+ */
+static bool decomp_batch_get_max_slens_idx(
+ struct iaa_req *reqs[],
+ int nr_pages,
+ int *idx_max,
+ int *idx_next_max)
+{
+ int i, max_i = 0, next_max_i = 0;
+
+ for (i = 0; i < nr_pages; ++i) {
+ if (reqs[i]->slen >= reqs[max_i]->slen) {
+ next_max_i = max_i;
+ max_i = i;
+ } else if ((next_max_i == max_i) ||
+ (reqs[i]->slen > reqs[next_max_i]->slen)) {
+ next_max_i = i;
+ }
+ }
+
+ *idx_max = max_i;
+ *idx_next_max = next_max_i;
+
+ return (next_max_i == max_i);
+}
+
/**
* This API provides IAA decompress batching functionality for use by swap
* modules.
@@ -2412,13 +2442,14 @@ static int iaa_comp_adecompress_batch(
unsigned int unit_size)
{
struct iaa_batch_ctx *cpu_ctx = raw_cpu_ptr(iaa_batch_ctx);
+ bool max_processed = false, next_max_processed = false;
int nr_reqs = parent_req->dlen / unit_size;
int errors[IAA_CRYPTO_MAX_BATCH_SIZE];
int *dlens[IAA_CRYPTO_MAX_BATCH_SIZE];
+ int i = 0, max_i, next_max_i, err = 0;
bool decompressions_done = false;
struct scatterlist *sg;
struct iaa_req **reqs;
- int i, err = 0;
mutex_lock(&cpu_ctx->mutex);
@@ -2437,11 +2468,28 @@ static int iaa_comp_adecompress_batch(
iaa_set_req_poll(reqs, nr_reqs, true);
+ /*
+ * Get the indices of the two largest decomp buffers in the batch.
+ * Submit them first. This improves latency of the batch.
+ */
+ next_max_processed = decomp_batch_get_max_slens_idx(reqs, nr_reqs,
+ &max_i, &next_max_i);
+
+ i = max_i;
+
/*
* Prepare and submit the batch of iaa_reqs to IAA. IAA will process
* these decompress jobs in parallel.
*/
- for (i = 0; i < nr_reqs; ++i) {
+ for (; i < nr_reqs; ++i) {
+ if ((i == max_i) && max_processed)
+ continue;
+ if ((i == next_max_i) && max_processed && next_max_processed)
+ continue;
+
+ if (max_processed && !next_max_processed)
+ i = next_max_i;
+
errors[i] = iaa_comp_adecompress(ctx, reqs[i]);
/*
@@ -2456,6 +2504,15 @@ static int iaa_comp_adecompress_batch(
} else {
*dlens[i] = reqs[i]->dlen;
}
+
+ if (i == max_i) {
+ max_processed = true;
+ i = -1;
+ }
+ if (i == next_max_i) {
+ next_max_processed = true;
+ i = -1;
+ }
}
/*
--
2.27.0
* [PATCH v13 17/22] crypto: iaa - Add deflate-iaa-dynamic compression mode.
2025-11-04 9:12 [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
` (15 preceding siblings ...)
2025-11-04 9:12 ` [PATCH v13 16/22] crypto: iaa - Submit the two largest source buffers first in decompress batching Kanchana P Sridhar
@ 2025-11-04 9:12 ` Kanchana P Sridhar
2025-11-04 9:12 ` [PATCH v13 18/22] crypto: acomp - Add crypto_acomp_batch_size() to get an algorithm's batch-size Kanchana P Sridhar
` (5 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Kanchana P Sridhar @ 2025-11-04 9:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, herbert, davem,
clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
vinicius.gomes
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
Some versions of Intel IAA, such as Granite Rapids, support dynamic
compression, where the hardware dynamically computes the Huffman tables
and generates a Deflate header if the input size is no larger than
4KB. This patch uses IAA for dynamic compression if an appropriate IAA
is present and the input size is no greater than 4KB. If an appropriate
IAA is not present, the algorithm will not be available. Otherwise, if
the size of the input is greater than PAGE_SIZE, zlib is used to do the
compression. If the algorithm is selected, IAA will be used for
decompression. If the compressed stream contains a reference whose
distance is greater than 4KB, hardware decompression will fail, and the
decompression will be done with zlib.
Intel IAA dynamic compression results in a compression ratio that is
better than or equal to the currently supported "fixed" compression mode
on the same data set. Compressing a data set of 4300 4KB pages sampled
from SPEC CPU17 workloads produces a compression ratio of 3.14 for IAA
dynamic compression and 2.69 for IAA fixed compression.
If an appropriate IAA exists, dynamic mode can be chosen as the IAA
compression mode by selecting the corresponding algorithm.
For example, to use IAA dynamic mode in zswap:
echo deflate-iaa-dynamic > /sys/module/zswap/parameters/compressor
This patch also adds a deflate_generic_compress() fallback when dynamic
mode is selected and the input size is over 4KB; along with stats
support that will count these software fallback calls as
"total_sw_comp_calls" in the driver's global_stats.
Furthermore, we define IAA_DYN_ALLOC_DESC_COMP_TIMEOUT as 2000 for
dynamic mode compression on Granite Rapids.
Signed-off-by: Andre Glover <andre.glover@linux.intel.com>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
.../driver-api/crypto/iaa/iaa-crypto.rst | 21 ++++
crypto/testmgr.c | 10 ++
crypto/testmgr.h | 74 +++++++++++++
drivers/crypto/intel/iaa/Makefile | 2 +-
drivers/crypto/intel/iaa/iaa_crypto.h | 8 +-
.../intel/iaa/iaa_crypto_comp_dynamic.c | 22 ++++
drivers/crypto/intel/iaa/iaa_crypto_main.c | 102 ++++++++++++++++--
drivers/crypto/intel/iaa/iaa_crypto_stats.c | 8 ++
drivers/crypto/intel/iaa/iaa_crypto_stats.h | 2 +
9 files changed, 239 insertions(+), 10 deletions(-)
create mode 100644 drivers/crypto/intel/iaa/iaa_crypto_comp_dynamic.c
diff --git a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
index 81a7dbd15f8b..e841a33564db 100644
--- a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
+++ b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst
@@ -33,6 +33,8 @@ compresses and decompresses.
Currently, there is only one compression modes available, 'fixed'
mode.
+'dynamic' mode is available on certain generations of IAA hardware.
+
The 'fixed' compression mode implements the compression scheme
specified by RFC 1951 and is given the crypto algorithm name
'deflate-iaa'. (Because the IAA hardware has a 4k history-window
@@ -43,6 +45,25 @@ the IAA fixed mode deflate algorithm is given its own algorithm name
rather than simply 'deflate').
+The 'dynamic' compression mode implements a compression scheme where
+the IAA hardware will internally do one pass through the data, compute the
+Huffman tables and generate a Deflate header, then automatically do a
+second pass through the data, generating the final compressed output. IAA
+dynamic compression can be used if an appropriate IAA is present and the
+input size is not too big. If an appropriate IAA is not present, the
+algorithm will not be available. Otherwise, if the size of the input is too
+big, zlib is used to do the compression. If the algorithm is selected,
+IAA will be used for decompression. If the compressed stream contains a
+reference whose distance is greater than 4KB, hardware decompression will
+fail, and the decompression will be done with zlib. If an appropriate IAA
+exists, 'dynamic' compression is implemented by the
+'deflate-iaa-dynamic' crypto algorithm.
+
+A zswap device can select the IAA 'dynamic' mode represented by
+selecting the 'deflate-iaa-dynamic' crypto compression algorithm::
+
+ # echo deflate-iaa-dynamic > /sys/module/zswap/parameters/compressor
+
Config options and other setup
==============================
diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 6a490aaa71b9..7c58a9163429 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -4664,6 +4664,16 @@ static const struct alg_test_desc alg_test_descs[] = {
.decomp = __VECS(deflate_decomp_tv_template)
}
}
+ }, {
+ .alg = "deflate-iaa-dynamic",
+ .test = alg_test_comp,
+ .fips_allowed = 1,
+ .suite = {
+ .comp = {
+ .comp = __VECS(deflate_iaa_dynamic_comp_tv_template),
+ .decomp = __VECS(deflate_iaa_dynamic_decomp_tv_template)
+ }
+ }
}, {
.alg = "dh",
.test = alg_test_kpp,
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 268231227282..f2d75008c408 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -33350,6 +33350,80 @@ static const struct comp_testvec deflate_decomp_tv_template[] = {
},
};
+static const struct comp_testvec deflate_iaa_dynamic_comp_tv_template[] = {
+ {
+ .inlen = 70,
+ .outlen = 46,
+ .input = "Join us now and share the software "
+ "Join us now and share the software ",
+ .output = "\x85\xca\xc1\x09\x00\x20\x08\x05"
+ "\xd0\x55\xfe\x3c\x6e\x21\x64\xd8"
+ "\x45\x21\x0d\xd7\xb7\x26\xe8\xf8"
+ "\xe0\x91\x2f\xc3\x09\x98\x17\xd8"
+ "\x06\x42\x79\x0b\x52\x05\xe1\x33"
+ "\xeb\x81\x3e\xe5\xa2\x01",
+ }, {
+ .inlen = 191,
+ .outlen = 121,
+ .input = "This document describes a compression method based on the DEFLATE "
+ "compression algorithm. This document defines the application of "
+ "the DEFLATE algorithm to the IP Payload Compression Protocol.",
+ .output = "\x5d\x8d\xc1\x0d\xc2\x30\x10\x04"
+ "\x5b\xd9\x0a\xd2\x03\x82\x20\x21"
+ "\xf1\xf0\x23\x0d\x5c\xec\x0b\xb6"
+ "\x64\xfb\x2c\xdf\xf1\xa0\x7b\x12"
+ "\x3e\x58\x79\xae\x76\x67\x76\x89"
+ "\x49\x11\xc4\xbf\x0b\x57\x43\x60"
+ "\xf5\x3d\xad\xac\x20\x78\x29\xad"
+ "\xb3\x6a\x92\x8a\xc2\x16\x25\x60"
+ "\x25\xe5\x80\x3d\x5b\x64\xdc\xe6"
+ "\xfb\xf3\xb2\xcc\xe3\x8c\xf2\x4b"
+ "\x7a\xb2\x58\x26\xe0\x2c\xde\x52"
+ "\xdd\xb5\x07\x48\xad\xe5\xe4\xc9"
+ "\x0e\x42\xb6\xd1\xf5\x17\xc0\xe4"
+ "\x57\x3c\x1c\x1c\x7d\xb2\x50\xc0"
+ "\x75\x38\x72\x5d\x4c\xbc\xe4\xe9"
+ "\x0b",
+ },
+};
+
+static const struct comp_testvec deflate_iaa_dynamic_decomp_tv_template[] = {
+ {
+ .inlen = 121,
+ .outlen = 191,
+ .input = "\x5d\x8d\xc1\x0d\xc2\x30\x10\x04"
+ "\x5b\xd9\x0a\xd2\x03\x82\x20\x21"
+ "\xf1\xf0\x23\x0d\x5c\xec\x0b\xb6"
+ "\x64\xfb\x2c\xdf\xf1\xa0\x7b\x12"
+ "\x3e\x58\x79\xae\x76\x67\x76\x89"
+ "\x49\x11\xc4\xbf\x0b\x57\x43\x60"
+ "\xf5\x3d\xad\xac\x20\x78\x29\xad"
+ "\xb3\x6a\x92\x8a\xc2\x16\x25\x60"
+ "\x25\xe5\x80\x3d\x5b\x64\xdc\xe6"
+ "\xfb\xf3\xb2\xcc\xe3\x8c\xf2\x4b"
+ "\x7a\xb2\x58\x26\xe0\x2c\xde\x52"
+ "\xdd\xb5\x07\x48\xad\xe5\xe4\xc9"
+ "\x0e\x42\xb6\xd1\xf5\x17\xc0\xe4"
+ "\x57\x3c\x1c\x1c\x7d\xb2\x50\xc0"
+ "\x75\x38\x72\x5d\x4c\xbc\xe4\xe9"
+ "\x0b",
+ .output = "This document describes a compression method based on the DEFLATE "
+ "compression algorithm. This document defines the application of "
+ "the DEFLATE algorithm to the IP Payload Compression Protocol.",
+ }, {
+ .inlen = 46,
+ .outlen = 70,
+ .input = "\x85\xca\xc1\x09\x00\x20\x08\x05"
+ "\xd0\x55\xfe\x3c\x6e\x21\x64\xd8"
+ "\x45\x21\x0d\xd7\xb7\x26\xe8\xf8"
+ "\xe0\x91\x2f\xc3\x09\x98\x17\xd8"
+ "\x06\x42\x79\x0b\x52\x05\xe1\x33"
+ "\xeb\x81\x3e\xe5\xa2\x01",
+ .output = "Join us now and share the software "
+ "Join us now and share the software ",
+ },
+};
+
/*
* LZO test vectors (null-terminated strings).
*/
diff --git a/drivers/crypto/intel/iaa/Makefile b/drivers/crypto/intel/iaa/Makefile
index ebfa1a425f80..96f22cd39924 100644
--- a/drivers/crypto/intel/iaa/Makefile
+++ b/drivers/crypto/intel/iaa/Makefile
@@ -7,6 +7,6 @@ ccflags-y += -I $(srctree)/drivers/dma/idxd -DDEFAULT_SYMBOL_NAMESPACE='"CRYPTO_
obj-$(CONFIG_CRYPTO_DEV_IAA_CRYPTO) := iaa_crypto.o
-iaa_crypto-y := iaa_crypto_main.o iaa_crypto_comp_fixed.o
+iaa_crypto-y := iaa_crypto_main.o iaa_crypto_comp_fixed.o iaa_crypto_comp_dynamic.o
iaa_crypto-$(CONFIG_CRYPTO_DEV_IAA_CRYPTO_STATS) += iaa_crypto_stats.o
diff --git a/drivers/crypto/intel/iaa/iaa_crypto.h b/drivers/crypto/intel/iaa/iaa_crypto.h
index a99cd421f918..0e8dadd84a92 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto.h
@@ -19,12 +19,15 @@
#define IAA_COMP_FLUSH_OUTPUT BIT(1)
#define IAA_COMP_APPEND_EOB BIT(2)
+#define IAA_COMP_GEN_HDR_1_PASS (BIT(12) | BIT(13))
#define IAA_COMPLETION_TIMEOUT 1000000
#define IAA_ALLOC_DESC_COMP_TIMEOUT 1000
#define IAA_ALLOC_DESC_DECOMP_TIMEOUT 500
+#define IAA_DYN_ALLOC_DESC_COMP_TIMEOUT 2000
+
#define IAA_ANALYTICS_ERROR 0x0a
#define IAA_ERROR_DECOMP_BUF_OVERFLOW 0x0b
#define IAA_ERROR_COMP_BUF_OVERFLOW 0x19
@@ -74,7 +77,8 @@ struct iaa_batch_ctx {
enum iaa_mode {
IAA_MODE_FIXED = 0,
- IAA_MODE_NONE = 1,
+ IAA_MODE_DYNAMIC = 1,
+ IAA_MODE_NONE = 2,
};
struct iaa_req {
@@ -160,6 +164,8 @@ struct aecs_comp_table_record {
int iaa_aecs_init_fixed(void);
void iaa_aecs_cleanup_fixed(void);
+int iaa_aecs_init_dynamic(void);
+void iaa_aecs_cleanup_dynamic(void);
typedef int (*iaa_dev_comp_init_fn_t) (struct iaa_device_compression_mode *mode);
typedef int (*iaa_dev_comp_free_fn_t) (struct iaa_device_compression_mode *mode);
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_comp_dynamic.c b/drivers/crypto/intel/iaa/iaa_crypto_comp_dynamic.c
new file mode 100644
index 000000000000..3a93d7913443
--- /dev/null
+++ b/drivers/crypto/intel/iaa/iaa_crypto_comp_dynamic.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2023 Intel Corporation. All rights rsvd. */
+
+#include "idxd.h"
+#include "iaa_crypto.h"
+
+int iaa_aecs_init_dynamic(void)
+{
+ int ret;
+
+ ret = add_iaa_compression_mode("dynamic", NULL, 0, NULL, 0, NULL, NULL);
+
+ if (!ret)
+ pr_debug("IAA dynamic compression mode initialized\n");
+
+ return ret;
+}
+
+void iaa_aecs_cleanup_dynamic(void)
+{
+ remove_iaa_compression_mode("dynamic");
+}
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index cc0d82154ff6..37e1cc720e5d 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -102,10 +102,12 @@ DEFINE_MUTEX(first_wq_found_lock);
const char *iaa_compression_mode_names[IAA_COMP_MODES_MAX] = {
"fixed",
+ "dynamic",
};
const char *iaa_compression_alg_names[IAA_COMP_MODES_MAX] = {
"deflate-iaa",
+ "deflate-iaa-dynamic",
};
static struct iaa_compression_mode *iaa_compression_modes[IAA_COMP_MODES_MAX];
@@ -1492,6 +1494,27 @@ static int deflate_generic_decompress(struct iaa_req *req)
return ret;
}
+static int deflate_generic_compress(struct iaa_req *req)
+{
+ ACOMP_REQUEST_ON_STACK(fbreq, deflate_crypto_acomp);
+ int ret;
+
+ acomp_request_set_callback(fbreq, 0, NULL, NULL);
+ acomp_request_set_params(fbreq, req->src, req->dst, req->slen,
+ PAGE_SIZE);
+
+ mutex_lock(&deflate_crypto_acomp_lock);
+
+ ret = crypto_acomp_compress(fbreq);
+ req->dlen = fbreq->dlen;
+
+ mutex_unlock(&deflate_crypto_acomp_lock);
+
+ update_total_sw_comp_calls();
+
+ return ret;
+}
+
static __always_inline void acomp_to_iaa(struct acomp_req *areq,
struct iaa_req *req,
struct iaa_compression_ctx *ctx)
@@ -1818,9 +1841,13 @@ iaa_setup_compress_hw_desc(struct idxd_desc *idxd_desc,
desc->src1_size = slen;
desc->dst_addr = (u64)dst_addr;
desc->max_dst_size = dlen;
- desc->flags |= IDXD_OP_FLAG_RD_SRC2_AECS;
- desc->src2_addr = active_compression_mode->aecs_comp_table_dma_addr;
- desc->src2_size = sizeof(struct aecs_comp_table_record);
+ if (mode == IAA_MODE_DYNAMIC) {
+ desc->compr_flags |= IAA_COMP_GEN_HDR_1_PASS;
+ } else {
+ desc->flags |= IDXD_OP_FLAG_RD_SRC2_AECS;
+ desc->src2_addr = active_compression_mode->aecs_comp_table_dma_addr;
+ desc->src2_size = sizeof(struct aecs_comp_table_record);
+ }
desc->completion_addr = idxd_desc->compl_dma;
return desc;
@@ -2074,6 +2101,9 @@ static int iaa_comp_acompress(struct iaa_compression_ctx *ctx, struct iaa_req *r
return -EINVAL;
}
+ if (ctx->mode == IAA_MODE_DYNAMIC && req->slen > PAGE_SIZE)
+ return deflate_generic_compress(req);
+
cpu = get_cpu();
wq = comp_wq_table_next_wq(cpu);
put_cpu();
@@ -2558,7 +2588,9 @@ static int iaa_comp_adecompress_batch(
static void compression_ctx_init(struct iaa_compression_ctx *ctx, enum iaa_mode mode)
{
ctx->mode = mode;
- ctx->alloc_comp_desc_timeout = IAA_ALLOC_DESC_COMP_TIMEOUT;
+ ctx->alloc_comp_desc_timeout = (mode == IAA_MODE_DYNAMIC ?
+ IAA_DYN_ALLOC_DESC_COMP_TIMEOUT :
+ IAA_ALLOC_DESC_COMP_TIMEOUT);
ctx->alloc_decomp_desc_timeout = IAA_ALLOC_DESC_DECOMP_TIMEOUT;
ctx->verify_compress = iaa_verify_compress;
ctx->async_mode = async_mode;
@@ -2650,6 +2682,30 @@ static struct acomp_alg iaa_acomp_fixed_deflate = {
}
};
+static int iaa_comp_init_dynamic(struct crypto_acomp *acomp_tfm)
+{
+ struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
+ struct iaa_compression_ctx *ctx = crypto_tfm_ctx(tfm);
+
+ ctx = iaa_ctx[IAA_MODE_DYNAMIC];
+
+ return 0;
+}
+
+static struct acomp_alg iaa_acomp_dynamic_deflate = {
+ .init = iaa_comp_init_dynamic,
+ .compress = iaa_comp_acompress_main,
+ .decompress = iaa_comp_adecompress_main,
+ .base = {
+ .cra_name = "deflate",
+ .cra_driver_name = "deflate-iaa-dynamic",
+ .cra_flags = CRYPTO_ALG_ASYNC,
+ .cra_ctxsize = sizeof(struct iaa_compression_ctx),
+ .cra_module = THIS_MODULE,
+ .cra_priority = IAA_ALG_PRIORITY + 1,
+ }
+};
+
/*******************************************
* Implement idxd_device_driver interfaces.
*******************************************/
@@ -2669,7 +2725,7 @@ static void iaa_unregister_compression_device(void)
num_iaa_modes_registered = 0;
}
-static int iaa_register_compression_device(void)
+static int iaa_register_compression_device(struct idxd_device *idxd)
{
struct iaa_compression_mode *mode;
int i, idx;
@@ -2678,6 +2734,13 @@ static int iaa_register_compression_device(void)
iaa_mode_registered[i] = false;
mode = find_iaa_compression_mode(iaa_compression_mode_names[i], &idx);
if (mode) {
+ /* Header Generation Capability is required for the dynamic algorithm. */
+ if ((!strcmp(mode->name, "dynamic")) && !idxd->hw.iaa_cap.header_gen) {
+ if (num_iaa_modes_registered > 0)
+ --num_iaa_modes_registered;
+ continue;
+ }
+
iaa_ctx[i] = kmalloc(sizeof(struct iaa_compression_ctx), GFP_KERNEL);
if (!iaa_ctx[i])
goto err;
@@ -2697,7 +2760,7 @@ static int iaa_register_compression_device(void)
return -ENODEV;
}
-static int iaa_register_acomp_compression_device(void)
+static int iaa_register_acomp_compression_device(struct idxd_device *idxd)
{
int ret = -ENOMEM;
@@ -2707,8 +2770,19 @@ static int iaa_register_acomp_compression_device(void)
goto err_fixed;
}
+ if (iaa_mode_registered[IAA_MODE_DYNAMIC]) {
+ ret = crypto_register_acomp(&iaa_acomp_dynamic_deflate);
+ if (ret) {
+ pr_err("deflate algorithm acomp dynamic registration failed (%d)\n", ret);
+ goto err_dynamic;
+ }
+ }
+
return 0;
+err_dynamic:
+ crypto_unregister_acomp(&iaa_acomp_fixed_deflate);
+
err_fixed:
iaa_unregister_compression_device();
return ret;
@@ -2720,6 +2794,9 @@ static void iaa_unregister_acomp_compression_device(void)
if (iaa_mode_registered[IAA_MODE_FIXED])
crypto_unregister_acomp(&iaa_acomp_fixed_deflate);
+
+ if (iaa_mode_registered[IAA_MODE_DYNAMIC])
+ crypto_unregister_acomp(&iaa_acomp_dynamic_deflate);
}
static int iaa_crypto_probe(struct idxd_dev *idxd_dev)
@@ -2783,13 +2860,13 @@ static int iaa_crypto_probe(struct idxd_dev *idxd_dev)
atomic_set(&iaa_crypto_enabled, 1);
if (first_wq) {
- ret = iaa_register_compression_device();
+ ret = iaa_register_compression_device(idxd);
if (ret != 0) {
dev_dbg(dev, "IAA compression device registration failed\n");
goto err_register;
}
- ret = iaa_register_acomp_compression_device();
+ ret = iaa_register_acomp_compression_device(idxd);
if (ret != 0) {
dev_dbg(dev, "IAA compression device acomp registration failed\n");
goto err_register;
@@ -2949,6 +3026,12 @@ static int __init iaa_crypto_init_module(void)
goto err_aecs_init;
}
+ ret = iaa_aecs_init_dynamic();
+ if (ret < 0) {
+ pr_debug("IAA dynamic compression mode init failed\n");
+ goto err_dynamic;
+ }
+
ret = idxd_driver_register(&iaa_crypto_driver);
if (ret) {
pr_debug("IAA wq sub-driver registration failed\n");
@@ -3050,6 +3133,8 @@ static int __init iaa_crypto_init_module(void)
err_g_comp_wqs_per_iaa_attr_create:
idxd_driver_unregister(&iaa_crypto_driver);
err_driver_reg:
+ iaa_aecs_cleanup_dynamic();
+err_dynamic:
iaa_aecs_cleanup_fixed();
err_aecs_init:
if (!IS_ERR_OR_NULL(deflate_crypto_acomp)) {
@@ -3079,6 +3164,7 @@ static void __exit iaa_crypto_cleanup_module(void)
driver_remove_file(&iaa_crypto_driver.drv,
&driver_attr_g_comp_wqs_per_iaa);
idxd_driver_unregister(&iaa_crypto_driver);
+ iaa_aecs_cleanup_dynamic();
iaa_aecs_cleanup_fixed();
if (!IS_ERR_OR_NULL(deflate_crypto_acomp)) {
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_stats.c b/drivers/crypto/intel/iaa/iaa_crypto_stats.c
index f5cc3d29ca19..42aae8a738ac 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_stats.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_stats.c
@@ -19,6 +19,7 @@
static atomic64_t total_comp_calls;
static atomic64_t total_decomp_calls;
+static atomic64_t total_sw_comp_calls;
static atomic64_t total_sw_decomp_calls;
static atomic64_t total_comp_bytes_out;
static atomic64_t total_decomp_bytes_in;
@@ -43,6 +44,11 @@ void update_total_decomp_calls(void)
atomic64_inc(&total_decomp_calls);
}
+void update_total_sw_comp_calls(void)
+{
+ atomic64_inc(&total_sw_comp_calls);
+}
+
void update_total_sw_decomp_calls(void)
{
atomic64_inc(&total_sw_decomp_calls);
@@ -174,6 +180,8 @@ static int global_stats_show(struct seq_file *m, void *v)
atomic64_read(&total_comp_calls));
seq_printf(m, " total_decomp_calls: %llu\n",
atomic64_read(&total_decomp_calls));
+ seq_printf(m, " total_sw_comp_calls: %llu\n",
+ atomic64_read(&total_sw_comp_calls));
seq_printf(m, " total_sw_decomp_calls: %llu\n",
atomic64_read(&total_sw_decomp_calls));
seq_printf(m, " total_comp_bytes_out: %llu\n",
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_stats.h b/drivers/crypto/intel/iaa/iaa_crypto_stats.h
index 3787a5f507eb..6e0c6f9939bf 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_stats.h
+++ b/drivers/crypto/intel/iaa/iaa_crypto_stats.h
@@ -11,6 +11,7 @@ void iaa_crypto_debugfs_cleanup(void);
void update_total_comp_calls(void);
void update_total_comp_bytes_out(int n);
void update_total_decomp_calls(void);
+void update_total_sw_comp_calls(void);
void update_total_sw_decomp_calls(void);
void update_total_decomp_bytes_in(int n);
void update_completion_einval_errs(void);
@@ -29,6 +30,7 @@ static inline void iaa_crypto_debugfs_cleanup(void) {}
static inline void update_total_comp_calls(void) {}
static inline void update_total_comp_bytes_out(int n) {}
static inline void update_total_decomp_calls(void) {}
+static inline void update_total_sw_comp_calls(void) {}
static inline void update_total_sw_decomp_calls(void) {}
static inline void update_total_decomp_bytes_in(int n) {}
static inline void update_completion_einval_errs(void) {}
--
2.27.0
^ permalink raw reply [flat|nested] 47+ messages in thread
* [PATCH v13 18/22] crypto: acomp - Add crypto_acomp_batch_size() to get an algorithm's batch-size.
2025-11-04 9:12 [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
` (16 preceding siblings ...)
2025-11-04 9:12 ` [PATCH v13 17/22] crypto: iaa - Add deflate-iaa-dynamic compression mode Kanchana P Sridhar
@ 2025-11-04 9:12 ` Kanchana P Sridhar
2025-11-04 9:12 ` [PATCH v13 19/22] mm: zswap: Per-CPU acomp_ctx resources exist from pool creation to deletion Kanchana P Sridhar
` (4 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Kanchana P Sridhar @ 2025-11-04 9:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, herbert, davem,
clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
vinicius.gomes
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This commit adds a @batch_size data member to struct acomp_alg.
An acomp_alg compression algorithm that supports batching of
compressions and decompressions must register a @batch_size greater
than one, representing the maximum batch-size the compressor supports.
This allows kernel users of crypto_acomp, such as zswap, to allocate
resources for submitting multiple compress/decompress jobs that can be
batched, and to invoke batching of [de]compressions.
The new crypto_acomp_batch_size() API queries the crypto_acomp's
acomp_alg for the batch-size. If the acomp_alg has registered a
@batch_size greater than 1, this is returned. If not, a default of "1"
is returned.
zswap can invoke crypto_acomp_batch_size() to query the maximum number
of requests that can be batch [de]compressed. Based on this, zswap can
allocate batching resources for the minimum of its own upper limit on
batch-size and the compressor's max @batch_size.
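For example, a crypto_acomp user could size its batching resources
roughly as follows (a minimal sketch; the zswap-side wiring lands in a
later patch of this series, and the MY_MAX_BATCH_SIZE cap and
my_batch_size() helper below are purely illustrative):

    #include <linux/minmax.h>
    #include <crypto/acompress.h>

    /* Illustrative user-specific upper limit on the batch size. */
    #define MY_MAX_BATCH_SIZE 8U

    static unsigned int my_batch_size(struct crypto_acomp *acomp)
    {
            /*
             * crypto_acomp_batch_size() returns the algorithm's
             * @batch_size if it registered one greater than 1, and 1
             * otherwise (i.e., for non-batching compressors).
             */
            return min(MY_MAX_BATCH_SIZE, crypto_acomp_batch_size(acomp));
    }

The user would then allocate that many per-CPU dst buffers, which is
what zswap does in a later patch of this series.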
The IAA acomp_algs Fixed ("deflate-iaa") and Dynamic
("deflate-iaa-dynamic") register @batch_size as
IAA_CRYPTO_MAX_BATCH_SIZE.
This enables zswap to compress/decompress pages in parallel in the IAA
hardware accelerator to improve swapout/swapin performance and memory
savings.
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
crypto/acompress.c | 14 ++++++++++++++
drivers/crypto/intel/iaa/iaa_crypto_main.c | 2 ++
include/crypto/acompress.h | 12 ++++++++++++
include/crypto/internal/acompress.h | 3 +++
4 files changed, 31 insertions(+)
diff --git a/crypto/acompress.c b/crypto/acompress.c
index be28cbfd22e3..61ad81b06f49 100644
--- a/crypto/acompress.c
+++ b/crypto/acompress.c
@@ -305,6 +305,20 @@ int crypto_acomp_decompress(struct acomp_req *req)
}
EXPORT_SYMBOL_GPL(crypto_acomp_decompress);
+unsigned int crypto_acomp_batch_size(struct crypto_acomp *tfm)
+{
+ if (acomp_is_async(tfm) &&
+ (crypto_comp_alg_common(tfm)->base.cra_flags & CRYPTO_ALG_TYPE_ACOMPRESS)) {
+ struct acomp_alg *alg = crypto_acomp_alg(tfm);
+
+ if (alg && alg->batch_size > 1)
+ return alg->batch_size;
+ }
+
+ return 1;
+}
+EXPORT_SYMBOL_GPL(crypto_acomp_batch_size);
+
void comp_prepare_alg(struct comp_alg_common *alg)
{
struct crypto_alg *base = &alg->base;
diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 37e1cc720e5d..2db2ddd4cb49 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -2671,6 +2671,7 @@ static struct acomp_alg iaa_acomp_fixed_deflate = {
.init = iaa_comp_init_fixed,
.compress = iaa_comp_acompress_main,
.decompress = iaa_comp_adecompress_main,
+ .batch_size = IAA_CRYPTO_MAX_BATCH_SIZE,
.base = {
.cra_name = "deflate",
.cra_driver_name = "deflate-iaa",
@@ -2696,6 +2697,7 @@ static struct acomp_alg iaa_acomp_dynamic_deflate = {
.init = iaa_comp_init_dynamic,
.compress = iaa_comp_acompress_main,
.decompress = iaa_comp_adecompress_main,
+ .batch_size = IAA_CRYPTO_MAX_BATCH_SIZE,
.base = {
.cra_name = "deflate",
.cra_driver_name = "deflate-iaa-dynamic",
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index 0f1334168f1b..6385f9b78a0d 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -578,6 +578,18 @@ int crypto_acomp_compress(struct acomp_req *req);
*/
int crypto_acomp_decompress(struct acomp_req *req);
+/**
+ * crypto_acomp_batch_size() -- Get the algorithm's batch size
+ *
+ * Function returns the algorithm's batch size for batching operations
+ *
+ * @tfm: ACOMPRESS tfm handle allocated with crypto_alloc_acomp()
+ *
+ * Return: @tfm's acomp_alg's @batch_size, if it has defined a
+ * @batch_size greater than 1; else return 1.
+ */
+unsigned int crypto_acomp_batch_size(struct crypto_acomp *tfm);
+
static inline struct acomp_req *acomp_request_on_stack_init(
char *buf, struct crypto_acomp *tfm)
{
diff --git a/include/crypto/internal/acompress.h b/include/crypto/internal/acompress.h
index 2d97440028ff..e451e0ae3b9b 100644
--- a/include/crypto/internal/acompress.h
+++ b/include/crypto/internal/acompress.h
@@ -28,6 +28,8 @@
*
* @compress: Function performs a compress operation
* @decompress: Function performs a de-compress operation
+ * @batch_size: Maximum batch-size for batching compress/decompress
+ * operations.
* @init: Initialize the cryptographic transformation object.
* This function is used to initialize the cryptographic
* transformation object. This function is called only once at
@@ -46,6 +48,7 @@
struct acomp_alg {
int (*compress)(struct acomp_req *req);
int (*decompress)(struct acomp_req *req);
+ unsigned int batch_size;
int (*init)(struct crypto_acomp *tfm);
void (*exit)(struct crypto_acomp *tfm);
--
2.27.0
^ permalink raw reply [flat|nested] 47+ messages in thread
* [PATCH v13 19/22] mm: zswap: Per-CPU acomp_ctx resources exist from pool creation to deletion.
2025-11-04 9:12 [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
` (17 preceding siblings ...)
2025-11-04 9:12 ` [PATCH v13 18/22] crypto: acomp - Add crypto_acomp_batch_size() to get an algorithm's batch-size Kanchana P Sridhar
@ 2025-11-04 9:12 ` Kanchana P Sridhar
2025-11-13 20:24 ` Yosry Ahmed
2025-11-04 9:12 ` [PATCH v13 20/22] mm: zswap: Consistently use IS_ERR_OR_NULL() to check acomp_ctx resources Kanchana P Sridhar
` (3 subsequent siblings)
22 siblings, 1 reply; 47+ messages in thread
From: Kanchana P Sridhar @ 2025-11-04 9:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, herbert, davem,
clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
vinicius.gomes
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch simplifies the zswap_pool's per-CPU acomp_ctx resource
management. Similar to the per-CPU acomp_ctx itself, the per-CPU
acomp_ctx's resources' (acomp, req, buffer) lifetime will also be from
pool creation to pool deletion. These resources will persist through CPU
hotplug operations instead of being destroyed/recreated. The
zswap_cpu_comp_dead() teardown callback has been deleted from the call
to cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE). As a result, CPU
offline hotplug operations will be no-ops as far as the acomp_ctx
resources are concerned.
This commit refactors the code from zswap_cpu_comp_dead() into a
new function acomp_ctx_dealloc() that is called to clean up acomp_ctx
resources from:
1) zswap_cpu_comp_prepare() when an error is encountered,
2) zswap_pool_create() when an error is encountered, and
3) from zswap_pool_destroy().
The main benefit of using the CPU hotplug multi state instance startup
callback to allocate the acomp_ctx resources is that it prevents the
cores from being offlined until the multi state instance addition call
returns.
From Documentation/core-api/cpu_hotplug.rst:
"The node list add/remove operations and the callback invocations are
serialized against CPU hotplug operations."
Furthermore, zswap_[de]compress() cannot contend with
zswap_cpu_comp_prepare() because:
- During pool creation/deletion, the pool is not in the zswap_pools
list.
- During CPU hot[un]plug, the CPU is not yet online, as Yosry pointed
out. zswap_cpu_comp_prepare() will be run on a control CPU,
since CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section of "enum
cpuhp_state". Thanks Yosry for sharing this observation!
In both these cases, any recursions into zswap reclaim from
zswap_cpu_comp_prepare() will be handled by the old pool.
The above two observations enable the following simplifications:
1) zswap_cpu_comp_prepare(): CPU cannot be offlined. Reclaim cannot use
the pool. Considerations for mutex init/locking and handling
subsequent CPU hotplug online-offline-online:
Should we lock the mutex of current CPU's acomp_ctx from start to
end? It doesn't seem like this is required. The CPU hotplug
operations acquire a "cpuhp_state_mutex" before proceeding, hence
they are serialized against CPU hotplug operations.
If the process gets migrated while zswap_cpu_comp_prepare() is
running, it will complete on the new CPU. In case of failures, we
pass the acomp_ctx pointer obtained at the start of
zswap_cpu_comp_prepare() to acomp_ctx_dealloc(), which again, can
only undergo migration. There appear to be no contention scenarios
that might cause inconsistent values of acomp_ctx's members. Hence,
it seems there is no need for mutex_lock(&acomp_ctx->mutex) in
zswap_cpu_comp_prepare().
Since the pool is not yet on zswap_pools list, we don't need to
initialize the per-CPU acomp_ctx mutex in zswap_pool_create(). This
has been restored to occur in zswap_cpu_comp_prepare().
zswap_cpu_comp_prepare() checks upfront if acomp_ctx->acomp is
valid. If so, it returns success. This should handle any CPU
hotplug online-offline transitions after pool creation is done.
2) CPU offline vis-a-vis zswap ops: Let's suppose the process is
migrated to another CPU before the current CPU is dysfunctional. If
zswap_[de]compress() holds the acomp_ctx->mutex lock of the offlined
CPU, that mutex will be released once it completes on the new
CPU. Since there is no teardown callback, there is no possibility of
UAF.
3) Pool creation/deletion and process migration to another CPU:
- During pool creation/deletion, the pool is not in the zswap_pools
list. Hence it cannot contend with zswap ops on that CPU. However,
the process can get migrated.
Pool creation --> zswap_cpu_comp_prepare()
--> process migrated:
* CPU offline: no-op.
* zswap_cpu_comp_prepare() continues
to run on the new CPU to finish
allocating acomp_ctx resources for
the offlined CPU.
Pool deletion --> acomp_ctx_dealloc()
--> process migrated:
* CPU offline: no-op.
* acomp_ctx_dealloc() continues
to run on the new CPU to finish
de-allocating acomp_ctx resources
for the offlined CPU.
4) Pool deletion vis-a-vis CPU onlining:
The call to cpuhp_state_remove_instance() cannot race with
zswap_cpu_comp_prepare() because of hotplug synchronization.
This patch deletes acomp_ctx_get_cpu_lock()/acomp_ctx_put_unlock().
Instead, zswap_[de]compress() directly call
mutex_[un]lock(&acomp_ctx->mutex).
The per-CPU memory cost of not deleting the acomp_ctx resources upon CPU
offlining, and only deleting them when the pool is destroyed, is as
follows, on x86_64:
IAA with 8 dst buffers for batching: 64.34 KB
Software compressors with 1 dst buffer: 8.28 KB
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
mm/zswap.c | 164 +++++++++++++++++++++--------------------------------
1 file changed, 64 insertions(+), 100 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index 4897ed689b9f..87d50786f61f 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -242,6 +242,20 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
**********************************/
static void __zswap_pool_empty(struct percpu_ref *ref);
+static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
+{
+ if (IS_ERR_OR_NULL(acomp_ctx))
+ return;
+
+ if (!IS_ERR_OR_NULL(acomp_ctx->req))
+ acomp_request_free(acomp_ctx->req);
+
+ if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
+ crypto_free_acomp(acomp_ctx->acomp);
+
+ kfree(acomp_ctx->buffer);
+}
+
static struct zswap_pool *zswap_pool_create(char *compressor)
{
struct zswap_pool *pool;
@@ -263,19 +277,26 @@ static struct zswap_pool *zswap_pool_create(char *compressor)
strscpy(pool->tfm_name, compressor, sizeof(pool->tfm_name));
- pool->acomp_ctx = alloc_percpu(*pool->acomp_ctx);
+ /* Many things rely on the zero-initialization. */
+ pool->acomp_ctx = alloc_percpu_gfp(*pool->acomp_ctx,
+ GFP_KERNEL | __GFP_ZERO);
if (!pool->acomp_ctx) {
pr_err("percpu alloc failed\n");
goto error;
}
- for_each_possible_cpu(cpu)
- mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
-
+ /*
+ * This is serialized against CPU hotplug operations. Hence, cores
+ * cannot be offlined until this finishes.
+ * In case of errors, we need to goto "ref_fail" instead of "error"
+ * because there is no teardown callback registered anymore, for
+ * cpuhp_state_add_instance() to de-allocate resources as it rolls back
+ * state on cores before the CPU on which error was encountered.
+ */
ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
&pool->node);
if (ret)
- goto error;
+ goto ref_fail;
/* being the current pool takes 1 ref; this func expects the
* caller to always add the new pool as the current pool
@@ -292,6 +313,9 @@ static struct zswap_pool *zswap_pool_create(char *compressor)
ref_fail:
cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
+
+ for_each_possible_cpu(cpu)
+ acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
error:
if (pool->acomp_ctx)
free_percpu(pool->acomp_ctx);
@@ -322,9 +346,15 @@ static struct zswap_pool *__zswap_pool_create_fallback(void)
static void zswap_pool_destroy(struct zswap_pool *pool)
{
+ int cpu;
+
zswap_pool_debug("destroying", pool);
cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
+
+ for_each_possible_cpu(cpu)
+ acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
+
free_percpu(pool->acomp_ctx);
zs_destroy_pool(pool->zs_pool);
@@ -736,39 +766,35 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
{
struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
- struct crypto_acomp *acomp = NULL;
- struct acomp_req *req = NULL;
- u8 *buffer = NULL;
- int ret;
+ int ret = -ENOMEM;
- buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
- if (!buffer) {
- ret = -ENOMEM;
- goto fail;
- }
+ /*
+ * To handle cases where the CPU goes through online-offline-online
+ * transitions, we return if the acomp_ctx has already been initialized.
+ */
+ if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
+ return 0;
- acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
- if (IS_ERR(acomp)) {
+ acomp_ctx->buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
+ if (!acomp_ctx->buffer)
+ return ret;
+
+ acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
+ if (IS_ERR(acomp_ctx->acomp)) {
pr_err("could not alloc crypto acomp %s : %ld\n",
- pool->tfm_name, PTR_ERR(acomp));
- ret = PTR_ERR(acomp);
+ pool->tfm_name, PTR_ERR(acomp_ctx->acomp));
+ ret = PTR_ERR(acomp_ctx->acomp);
goto fail;
}
+ acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp);
- req = acomp_request_alloc(acomp);
- if (!req) {
+ acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
+ if (!acomp_ctx->req) {
pr_err("could not alloc crypto acomp_request %s\n",
pool->tfm_name);
- ret = -ENOMEM;
goto fail;
}
- /*
- * Only hold the mutex after completing allocations, otherwise we may
- * recurse into zswap through reclaim and attempt to hold the mutex
- * again resulting in a deadlock.
- */
- mutex_lock(&acomp_ctx->mutex);
crypto_init_wait(&acomp_ctx->wait);
/*
@@ -776,84 +802,19 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
* crypto_wait_req(); if the backend of acomp is scomp, the callback
* won't be called, crypto_wait_req() will return without blocking.
*/
- acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+ acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG,
crypto_req_done, &acomp_ctx->wait);
- acomp_ctx->buffer = buffer;
- acomp_ctx->acomp = acomp;
- acomp_ctx->is_sleepable = acomp_is_async(acomp);
- acomp_ctx->req = req;
-
acomp_request_set_unit_size(acomp_ctx->req, PAGE_SIZE);
- mutex_unlock(&acomp_ctx->mutex);
+ mutex_init(&acomp_ctx->mutex);
return 0;
fail:
- if (acomp)
- crypto_free_acomp(acomp);
- kfree(buffer);
+ acomp_ctx_dealloc(acomp_ctx);
return ret;
}
-static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
-{
- struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
- struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
- struct acomp_req *req;
- struct crypto_acomp *acomp;
- u8 *buffer;
-
- if (IS_ERR_OR_NULL(acomp_ctx))
- return 0;
-
- mutex_lock(&acomp_ctx->mutex);
- req = acomp_ctx->req;
- acomp = acomp_ctx->acomp;
- buffer = acomp_ctx->buffer;
- acomp_ctx->req = NULL;
- acomp_ctx->acomp = NULL;
- acomp_ctx->buffer = NULL;
- mutex_unlock(&acomp_ctx->mutex);
-
- /*
- * Do the actual freeing after releasing the mutex to avoid subtle
- * locking dependencies causing deadlocks.
- */
- if (!IS_ERR_OR_NULL(req))
- acomp_request_free(req);
- if (!IS_ERR_OR_NULL(acomp))
- crypto_free_acomp(acomp);
- kfree(buffer);
-
- return 0;
-}
-
-static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
-{
- struct crypto_acomp_ctx *acomp_ctx;
-
- for (;;) {
- acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
- mutex_lock(&acomp_ctx->mutex);
- if (likely(acomp_ctx->req))
- return acomp_ctx;
- /*
- * It is possible that we were migrated to a different CPU after
- * getting the per-CPU ctx but before the mutex was acquired. If
- * the old CPU got offlined, zswap_cpu_comp_dead() could have
- * already freed ctx->req (among other things) and set it to
- * NULL. Just try again on the new CPU that we ended up on.
- */
- mutex_unlock(&acomp_ctx->mutex);
- }
-}
-
-static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx)
-{
- mutex_unlock(&acomp_ctx->mutex);
-}
-
static bool zswap_compress(struct page *page, struct zswap_entry *entry,
struct zswap_pool *pool)
{
@@ -866,7 +827,9 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
u8 *dst;
bool mapped = false;
- acomp_ctx = acomp_ctx_get_cpu_lock(pool);
+ acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
+ mutex_lock(&acomp_ctx->mutex);
+
dst = acomp_ctx->buffer;
sg_init_table(&input, 1);
sg_set_page(&input, page, PAGE_SIZE, 0);
@@ -929,7 +892,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
else if (alloc_ret)
zswap_reject_alloc_fail++;
- acomp_ctx_put_unlock(acomp_ctx);
+ mutex_unlock(&acomp_ctx->mutex);
return comp_ret == 0 && alloc_ret == 0;
}
@@ -941,7 +904,8 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
int decomp_ret = 0, dlen = PAGE_SIZE;
u8 *src, *obj;
- acomp_ctx = acomp_ctx_get_cpu_lock(pool);
+ acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
+ mutex_lock(&acomp_ctx->mutex);
obj = zs_obj_read_begin(pool->zs_pool, entry->handle, acomp_ctx->buffer);
/* zswap entries of length PAGE_SIZE are not compressed. */
@@ -972,7 +936,7 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
read_done:
zs_obj_read_end(pool->zs_pool, entry->handle, obj);
- acomp_ctx_put_unlock(acomp_ctx);
+ mutex_unlock(&acomp_ctx->mutex);
if (!decomp_ret && dlen == PAGE_SIZE)
return true;
@@ -1798,7 +1762,7 @@ static int zswap_setup(void)
ret = cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE,
"mm/zswap_pool:prepare",
zswap_cpu_comp_prepare,
- zswap_cpu_comp_dead);
+ NULL);
if (ret)
goto hp_fail;
--
2.27.0
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH v13 19/22] mm: zswap: Per-CPU acomp_ctx resources exist from pool creation to deletion.
2025-11-04 9:12 ` [PATCH v13 19/22] mm: zswap: Per-CPU acomp_ctx resources exist from pool creation to deletion Kanchana P Sridhar
@ 2025-11-13 20:24 ` Yosry Ahmed
0 siblings, 0 replies; 47+ messages in thread
From: Yosry Ahmed @ 2025-11-13 20:24 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
senozhatsky, sj, kasong, linux-crypto, herbert, davem, clabbe,
ardb, ebiggers, surenb, kristen.c.accardi, vinicius.gomes,
wajdi.k.feghali, vinodh.gopal
On Tue, Nov 04, 2025 at 01:12:32AM -0800, Kanchana P Sridhar wrote:
The subject can be shortened to:
"mm: zswap: Tie per-CPU acomp_ctx lifetime to the pool"
> This patch simplifies the zswap_pool's per-CPU acomp_ctx resource
> management. Similar to the per-CPU acomp_ctx itself, the per-CPU
> acomp_ctx's resources' (acomp, req, buffer) lifetime will also be from
> pool creation to pool deletion. These resources will persist through CPU
> hotplug operations instead of being destroyed/recreated. The
> zswap_cpu_comp_dead() teardown callback has been deleted from the call
> to cpuhp_setup_state_multi(CPUHP_MM_ZSWP_POOL_PREPARE). As a result, CPU
> offline hotplug operations will be no-ops as far as the acomp_ctx
> resources are concerned.
Currently, per-CPU acomp_ctx are allocated on pool creation and/or CPU
hotplug, and destroyed on pool destruction or CPU hotunplug. This
complicates the lifetime management to save memory while a CPU is
offlined, which is not very common.
Simplify lifetime management by allocating per-CPU acomp_ctx once on
pool creation (or CPU hotplug for CPUs onlined later), and keeping them
allocated until the pool is destroyed.
>
> This commit refactors the code from zswap_cpu_comp_dead() into a
> new function acomp_ctx_dealloc() that is called to clean up acomp_ctx
> resources from:
>
> 1) zswap_cpu_comp_prepare() when an error is encountered,
> 2) zswap_pool_create() when an error is encountered, and
> 3) from zswap_pool_destroy().
Refactor cleanup code from zswap_cpu_comp_dead() into
acomp_ctx_dealloc() to be used elsewhere.
>
> The main benefit of using the CPU hotplug multi state instance startup
> callback to allocate the acomp_ctx resources is that it prevents the
> cores from being offlined until the multi state instance addition call
> returns.
>
> From Documentation/core-api/cpu_hotplug.rst:
>
> "The node list add/remove operations and the callback invocations are
> serialized against CPU hotplug operations."
>
> Furthermore, zswap_[de]compress() cannot contend with
> zswap_cpu_comp_prepare() because:
>
> - During pool creation/deletion, the pool is not in the zswap_pools
> list.
>
> - During CPU hot[un]plug, the CPU is not yet online, as Yosry pointed
> out. zswap_cpu_comp_prepare() will be run on a control CPU,
> since CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section of "enum
> cpuhp_state". Thanks Yosry for sharing this observation!
>
> In both these cases, any recursions into zswap reclaim from
> zswap_cpu_comp_prepare() will be handled by the old pool.
>
> The above two observations enable the following simplifications:
>
> 1) zswap_cpu_comp_prepare(): CPU cannot be offlined. Reclaim cannot use
> the pool. Considerations for mutex init/locking and handling
> subsequent CPU hotplug online-offline-online:
>
> Should we lock the mutex of current CPU's acomp_ctx from start to
> end? It doesn't seem like this is required. The CPU hotplug
> operations acquire a "cpuhp_state_mutex" before proceeding, hence
> they are serialized against CPU hotplug operations.
>
> If the process gets migrated while zswap_cpu_comp_prepare() is
> running, it will complete on the new CPU. In case of failures, we
> pass the acomp_ctx pointer obtained at the start of
> zswap_cpu_comp_prepare() to acomp_ctx_dealloc(), which again, can
> only undergo migration. There appear to be no contention scenarios
> that might cause inconsistent values of acomp_ctx's members. Hence,
> it seems there is no need for mutex_lock(&acomp_ctx->mutex) in
> zswap_cpu_comp_prepare().
>
> Since the pool is not yet on zswap_pools list, we don't need to
> initialize the per-CPU acomp_ctx mutex in zswap_pool_create(). This
> has been restored to occur in zswap_cpu_comp_prepare().
>
> zswap_cpu_comp_prepare() checks upfront if acomp_ctx->acomp is
> valid. If so, it returns success. This should handle any CPU
> hotplug online-offline transitions after pool creation is done.
>
> 2) CPU offline vis-a-vis zswap ops: Let's suppose the process is
> migrated to another CPU before the current CPU is dysfunctional. If
> zswap_[de]compress() holds the acomp_ctx->mutex lock of the offlined
> CPU, that mutex will be released once it completes on the new
> CPU. Since there is no teardown callback, there is no possibility of
> UAF.
>
> 3) Pool creation/deletion and process migration to another CPU:
>
> - During pool creation/deletion, the pool is not in the zswap_pools
> list. Hence it cannot contend with zswap ops on that CPU. However,
> the process can get migrated.
>
> Pool creation --> zswap_cpu_comp_prepare()
> --> process migrated:
> * CPU offline: no-op.
> * zswap_cpu_comp_prepare() continues
> to run on the new CPU to finish
> allocating acomp_ctx resources for
> the offlined CPU.
>
> Pool deletion --> acomp_ctx_dealloc()
> --> process migrated:
> * CPU offline: no-op.
> * acomp_ctx_dealloc() continues
> to run on the new CPU to finish
> de-allocating acomp_ctx resources
> for the offlined CPU.
>
> 4) Pool deletion vis-a-vis CPU onlining:
> The call to cpuhp_state_remove_instance() cannot race with
> zswap_cpu_comp_prepare() because of hotplug synchronization.
>
> This patch deletes acomp_ctx_get_cpu_lock()/acomp_ctx_put_unlock().
> Instead, zswap_[de]compress() directly call
> mutex_[un]lock(&acomp_ctx->mutex).
I am not sure why all of this is needed. We should just describe why
it's safe to drop holding the mutex while initializing per-CPU
acomp_ctx:
It is no longer possible for CPU hotplug to race against allocation or
usage of per-CPU acomp_ctx, as they are only allocated once before the
pool can be used, and remain allocated as long as the pool is used.
Hence, stop holding the lock during acomp_ctx initialization, and drop
acomp_ctx_get_cpu_lock()//acomp_ctx_put_unlock().
>
> The per-CPU memory cost of not deleting the acomp_ctx resources upon CPU
> offlining, and only deleting them when the pool is destroyed, is as
> follows, on x86_64:
>
> IAA with 8 dst buffers for batching: 64.34 KB
> Software compressors with 1 dst buffer: 8.28 KB
This cost is only paid when a CPU is offlined, until it is onlined
again.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
> mm/zswap.c | 164 +++++++++++++++++++++--------------------------------
> 1 file changed, 64 insertions(+), 100 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 4897ed689b9f..87d50786f61f 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -242,6 +242,20 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
> **********************************/
> static void __zswap_pool_empty(struct percpu_ref *ref);
>
> +static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
> +{
> + if (IS_ERR_OR_NULL(acomp_ctx))
> + return;
> +
> + if (!IS_ERR_OR_NULL(acomp_ctx->req))
> + acomp_request_free(acomp_ctx->req);
> +
> + if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> + crypto_free_acomp(acomp_ctx->acomp);
> +
> + kfree(acomp_ctx->buffer);
> +}
> +
> static struct zswap_pool *zswap_pool_create(char *compressor)
> {
> struct zswap_pool *pool;
> @@ -263,19 +277,26 @@ static struct zswap_pool *zswap_pool_create(char *compressor)
>
> strscpy(pool->tfm_name, compressor, sizeof(pool->tfm_name));
>
> - pool->acomp_ctx = alloc_percpu(*pool->acomp_ctx);
> + /* Many things rely on the zero-initialization. */
> + pool->acomp_ctx = alloc_percpu_gfp(*pool->acomp_ctx,
> + GFP_KERNEL | __GFP_ZERO);
> if (!pool->acomp_ctx) {
> pr_err("percpu alloc failed\n");
> goto error;
> }
>
> - for_each_possible_cpu(cpu)
> - mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
> -
> + /*
> + * This is serialized against CPU hotplug operations. Hence, cores
> + * cannot be offlined until this finishes.
> + * In case of errors, we need to goto "ref_fail" instead of "error"
> + * because there is no teardown callback registered anymore, for
> + * cpuhp_state_add_instance() to de-allocate resources as it rolls back
> + * state on cores before the CPU on which error was encountered.
> + */
Do we need to manually call acomp_ctx_dealloc() on each CPU on failure
because cpuhp_state_add_instance() relies on the hotunplug callback for
cleanup, and we don't have any?
If that's the case:
/*
* cpuhp_state_add_instance() will not cleanup on failure since
* we don't register a hotunplug callback.
*/
Describing what the code does is not helpful, and things like "anymore"
do not make sense once the code is merged.
> ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
> &pool->node);
> if (ret)
> - goto error;
> + goto ref_fail;
IIUC we shouldn't call cpuhp_state_remove_instance() on failure, we
probably should add a new label.
>
> /* being the current pool takes 1 ref; this func expects the
> * caller to always add the new pool as the current pool
> @@ -292,6 +313,9 @@ static struct zswap_pool *zswap_pool_create(char *compressor)
>
> ref_fail:
> cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
> +
> + for_each_possible_cpu(cpu)
> + acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
> error:
> if (pool->acomp_ctx)
> free_percpu(pool->acomp_ctx);
> @@ -322,9 +346,15 @@ static struct zswap_pool *__zswap_pool_create_fallback(void)
>
> static void zswap_pool_destroy(struct zswap_pool *pool)
> {
> + int cpu;
> +
> zswap_pool_debug("destroying", pool);
>
> cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
> +
> + for_each_possible_cpu(cpu)
> + acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
> +
> free_percpu(pool->acomp_ctx);
>
> zs_destroy_pool(pool->zs_pool);
> @@ -736,39 +766,35 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> {
> struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> - struct crypto_acomp *acomp = NULL;
> - struct acomp_req *req = NULL;
> - u8 *buffer = NULL;
> - int ret;
> + int ret = -ENOMEM;
>
> - buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
> - if (!buffer) {
> - ret = -ENOMEM;
> - goto fail;
> - }
> + /*
> + * To handle cases where the CPU goes through online-offline-online
> + * transitions, we return if the acomp_ctx has already been initialized.
> + */
> + if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> + return 0;
Is it possible for acomp_ctx->acomp to be an ERR value here? If it is,
then zswap initialization should have failed. Maybe WARN_ON_ONCE() for
that case?
>
> - acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
> - if (IS_ERR(acomp)) {
> + acomp_ctx->buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
> + if (!acomp_ctx->buffer)
> + return ret;
> +
> + acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
> + if (IS_ERR(acomp_ctx->acomp)) {
> pr_err("could not alloc crypto acomp %s : %ld\n",
> - pool->tfm_name, PTR_ERR(acomp));
> - ret = PTR_ERR(acomp);
> + pool->tfm_name, PTR_ERR(acomp_ctx->acomp));
> + ret = PTR_ERR(acomp_ctx->acomp);
> goto fail;
> }
> + acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp);
>
> - req = acomp_request_alloc(acomp);
> - if (!req) {
> + acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
> + if (!acomp_ctx->req) {
> pr_err("could not alloc crypto acomp_request %s\n",
> pool->tfm_name);
> - ret = -ENOMEM;
> goto fail;
> }
>
> - /*
> - * Only hold the mutex after completing allocations, otherwise we may
> - * recurse into zswap through reclaim and attempt to hold the mutex
> - * again resulting in a deadlock.
> - */
> - mutex_lock(&acomp_ctx->mutex);
> crypto_init_wait(&acomp_ctx->wait);
>
> /*
[..]
^ permalink raw reply [flat|nested] 47+ messages in thread
* [PATCH v13 20/22] mm: zswap: Consistently use IS_ERR_OR_NULL() to check acomp_ctx resources.
2025-11-04 9:12 [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
` (18 preceding siblings ...)
2025-11-04 9:12 ` [PATCH v13 19/22] mm: zswap: Per-CPU acomp_ctx resources exist from pool creation to deletion Kanchana P Sridhar
@ 2025-11-04 9:12 ` Kanchana P Sridhar
2025-11-13 20:25 ` Yosry Ahmed
2025-11-04 9:12 ` [PATCH v13 21/22] mm: zswap: zswap_store() will process a large folio in batches Kanchana P Sridhar
` (2 subsequent siblings)
22 siblings, 1 reply; 47+ messages in thread
From: Kanchana P Sridhar @ 2025-11-04 9:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, herbert, davem,
clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
vinicius.gomes
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch uses IS_ERR_OR_NULL() in zswap_cpu_comp_prepare() to check
for valid acomp/req, thereby making it consistent with
acomp_ctx_dealloc().
This is based on this earlier comment [1] from Yosry, when reviewing v8.
[1] https://patchwork.kernel.org/comment/26282128/
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
mm/zswap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index 87d50786f61f..cb384eb7c815 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -780,7 +780,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
return ret;
acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
- if (IS_ERR(acomp_ctx->acomp)) {
+ if (IS_ERR_OR_NULL(acomp_ctx->acomp)) {
pr_err("could not alloc crypto acomp %s : %ld\n",
pool->tfm_name, PTR_ERR(acomp_ctx->acomp));
ret = PTR_ERR(acomp_ctx->acomp);
@@ -789,7 +789,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp);
acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
- if (!acomp_ctx->req) {
+ if (IS_ERR_OR_NULL(acomp_ctx->req)) {
pr_err("could not alloc crypto acomp_request %s\n",
pool->tfm_name);
goto fail;
--
2.27.0
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH v13 20/22] mm: zswap: Consistently use IS_ERR_OR_NULL() to check acomp_ctx resources.
2025-11-04 9:12 ` [PATCH v13 20/22] mm: zswap: Consistently use IS_ERR_OR_NULL() to check acomp_ctx resources Kanchana P Sridhar
@ 2025-11-13 20:25 ` Yosry Ahmed
0 siblings, 0 replies; 47+ messages in thread
From: Yosry Ahmed @ 2025-11-13 20:25 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
senozhatsky, sj, kasong, linux-crypto, herbert, davem, clabbe,
ardb, ebiggers, surenb, kristen.c.accardi, vinicius.gomes,
wajdi.k.feghali, vinodh.gopal
On Tue, Nov 04, 2025 at 01:12:33AM -0800, Kanchana P Sridhar wrote:
> This patch uses IS_ERR_OR_NULL() in zswap_cpu_comp_prepare() to check
> for valid acomp/req, thereby making it consistent with
> acomp_ctx_dealloc().
Instead of "This patch..":
Use IS_ERR_OR_NULL() in zswap_cpu_comp_prepare() to check for valid
acomp/req, making it consistent with acomp_ctx_dealloc().
>
> This is based on this earlier comment [1] from Yosry, when reviewing v8.
Drop this statement, it loses its meaning after the code is merged.
With those changes:
Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev>
>
> [1] https://patchwork.kernel.org/comment/26282128/
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
> mm/zswap.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 87d50786f61f..cb384eb7c815 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -780,7 +780,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> return ret;
>
> acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
> - if (IS_ERR(acomp_ctx->acomp)) {
> + if (IS_ERR_OR_NULL(acomp_ctx->acomp)) {
> pr_err("could not alloc crypto acomp %s : %ld\n",
> pool->tfm_name, PTR_ERR(acomp_ctx->acomp));
> ret = PTR_ERR(acomp_ctx->acomp);
> @@ -789,7 +789,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp);
>
> acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
> - if (!acomp_ctx->req) {
> + if (IS_ERR_OR_NULL(acomp_ctx->req)) {
> pr_err("could not alloc crypto acomp_request %s\n",
> pool->tfm_name);
> goto fail;
> --
> 2.27.0
>
^ permalink raw reply [flat|nested] 47+ messages in thread
* [PATCH v13 21/22] mm: zswap: zswap_store() will process a large folio in batches.
2025-11-04 9:12 [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
` (19 preceding siblings ...)
2025-11-04 9:12 ` [PATCH v13 20/22] mm: zswap: Consistently use IS_ERR_OR_NULL() to check acomp_ctx resources Kanchana P Sridhar
@ 2025-11-04 9:12 ` Kanchana P Sridhar
2025-11-06 17:45 ` Nhat Pham
2025-11-13 20:51 ` Yosry Ahmed
2025-11-04 9:12 ` [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios Kanchana P Sridhar
2025-11-13 18:14 ` [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Sridhar, Kanchana P
22 siblings, 2 replies; 47+ messages in thread
From: Kanchana P Sridhar @ 2025-11-04 9:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, herbert, davem,
clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
vinicius.gomes
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch makes two major changes:
First, we allocate pool batching resources if the compressor supports
batching:
zswap is set up to allocate per-CPU resources optimally for both
non-batching and batching compressors.
A new ZSWAP_MAX_BATCH_SIZE constant is defined as 8U, to set an upper
limit on the number of pages in large folios that will be batch
compressed.
It is up to the compressor to manage multiple requests, as needed, to
accomplish batch parallelism. zswap only needs to allocate the per-CPU
dst buffers according to the batch size supported by the compressor.
A "u8 compr_batch_size" member is added to "struct zswap_pool", as per
Yosry's suggestion. pool->compr_batch_size is set as the minimum of
the compressor's max batch-size and ZSWAP_MAX_BATCH_SIZE. Accordingly,
pool->compr_batch_size compression dst buffers are allocated in the
per-CPU acomp_ctx.
zswap does not use more than one dst buffer yet. Follow-up patches
will actually utilize the multiple acomp_ctx buffers for batch
compression/decompression of multiple pages.
Thus, ZSWAP_MAX_BATCH_SIZE limits the amount of extra memory used for
batching. There is a small extra memory overhead of allocating the
acomp_ctx->buffers array for compressors that do not support batching:
on x86_64, the overhead is one pointer per CPU (i.e., 8 bytes).
Next, we store the folio in batches:
This patch modifies zswap_store() to store the pages of a large folio
in batches, instead of one page at a time. It does this by calling a
new procedure, zswap_store_pages(), with a range of indices in the
folio: for batching compressors, this range contains up to
pool->compr_batch_size pages. For non-batching compressors, up to
ZSWAP_MAX_BATCH_SIZE pages are sent to zswap_store_pages() to be
sequentially compressed and stored.
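Roughly, the new storing loop in zswap_store() has the following shape
(a condensed, illustrative sketch of the flow described above rather
than the literal diff; the error label name is assumed):

    /*
     * Batching compressors get batches of pool->compr_batch_size pages;
     * non-batching compressors still get up to ZSWAP_MAX_BATCH_SIZE
     * pages per call, compressed sequentially inside
     * zswap_store_pages().
     */
    store_batch_size = (pool->compr_batch_size > 1) ?
                       pool->compr_batch_size : ZSWAP_MAX_BATCH_SIZE;

    for (start = 0; start < nr_pages; start = end) {
            end = min(start + store_batch_size, nr_pages);
            if (!zswap_store_pages(folio, start, end, objcg, pool,
                                   nid, wb_enabled))
                    goto put_pool;  /* label name is illustrative */
    }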
zswap_store_pages() performs, for multiple pages in a folio (the
"batch"), all the computations done earlier in zswap_store_page() for a
single page:
1) It starts by allocating all zswap entries required to store the
batch. New procedures, zswap_entries_cache_alloc_batch() and
zswap_entries_cache_free_batch(), call kmem_cache_alloc_bulk() and
kmem_cache_free_bulk() respectively, to optimize the performance of
this step.
2) The entry doesn't have to be allocated on the same node as the page
being stored in zswap: we let the slab allocator decide this in
kmem_cache_alloc_bulk(). However, to make sure the current zswap
LRU list/shrinker behavior is preserved, we store the folio's nid as
a new @nid member in the entry to enable adding it to the correct
LRU list (and deleting it from the right LRU list). This ensures
that when the folio's allocating NUMA node is under memory
pressure, the entries corresponding to its pages are written back.
The memory footprint of struct zswap_entry remains unchanged at
56 bytes despite the addition of the "int nid" member: "length" and
"referenced" are condensed into 4 bytes using bit fields, and
"int nid" occupies the 4 bytes freed up after "referenced". Thanks
to Nhat and Yosry for these suggestions!
3) Next, the entries' fields are written: these are computations that
need to happen anyway, and they do not modify the zswap xarray/LRU
publishing order. Doing this in one place avoids bringing the entries
into the cache for writing in different code blocks within this
procedure, which improves latency.
4) Next, it calls zswap_compress() to sequentially compress each page in
the batch.
5) Finally, it adds the batch's zswap entries to the xarray and LRU,
charges zswap memory and increments zswap stats.
6) The error handling and cleanup required for all failure scenarios
that can occur while storing a batch in zswap are consolidated to a
single "store_pages_failed" label in zswap_store_pages(). Here again,
we optimize performance by calling kmem_cache_free_bulk().
This commit also makes a minor optimization: zswap_compress() now
takes a "bool wb_enabled" argument, computed once in zswap_store()
rather than for each page in the folio.
Suggested-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
mm/zswap.c | 336 ++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 232 insertions(+), 104 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index cb384eb7c815..257567edc587 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -82,6 +82,9 @@ static bool zswap_pool_reached_full;
#define ZSWAP_PARAM_UNSET ""
+/* Limit the batch size to limit per-CPU memory usage for dst buffers. */
+#define ZSWAP_MAX_BATCH_SIZE 8U
+
static int zswap_setup(void);
/* Enable/disable zswap */
@@ -139,7 +142,7 @@ struct crypto_acomp_ctx {
struct crypto_acomp *acomp;
struct acomp_req *req;
struct crypto_wait wait;
- u8 *buffer;
+ u8 **buffers;
struct mutex mutex;
bool is_sleepable;
};
@@ -149,6 +152,9 @@ struct crypto_acomp_ctx {
* The only case where lru_lock is not acquired while holding tree.lock is
* when a zswap_entry is taken off the lru for writeback, in that case it
* needs to be verified that it's still valid in the tree.
+ *
+ * @compr_batch_size: The max batch size of the compression algorithm,
+ * bounded by ZSWAP_MAX_BATCH_SIZE.
*/
struct zswap_pool {
struct zs_pool *zs_pool;
@@ -158,6 +164,7 @@ struct zswap_pool {
struct work_struct release_work;
struct hlist_node node;
char tfm_name[CRYPTO_MAX_ALG_NAME];
+ u8 compr_batch_size;
};
/* Global LRU lists shared by all zswap pools. */
@@ -182,6 +189,7 @@ static struct shrinker *zswap_shrinker;
* writeback logic. The entry is only reclaimed by the writeback
* logic if referenced is unset. See comments in the shrinker
* section for context.
+ * nid - NUMA node id of the page for which this is the zswap entry.
* pool - the zswap_pool the entry's data is in
* handle - zsmalloc allocation handle that stores the compressed page data
* objcg - the obj_cgroup that the compressed memory is charged to
@@ -189,8 +197,11 @@ static struct shrinker *zswap_shrinker;
*/
struct zswap_entry {
swp_entry_t swpentry;
- unsigned int length;
- bool referenced;
+ struct {
+ unsigned int length:31;
+ bool referenced:1;
+ };
+ int nid;
struct zswap_pool *pool;
unsigned long handle;
struct obj_cgroup *objcg;
@@ -242,8 +253,10 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
**********************************/
static void __zswap_pool_empty(struct percpu_ref *ref);
-static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
+static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx, u8 nr_buffers)
{
+ u8 i;
+
if (IS_ERR_OR_NULL(acomp_ctx))
return;
@@ -253,7 +266,11 @@ static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
crypto_free_acomp(acomp_ctx->acomp);
- kfree(acomp_ctx->buffer);
+ if (acomp_ctx->buffers) {
+ for (i = 0; i < nr_buffers; ++i)
+ kfree(acomp_ctx->buffers[i]);
+ kfree(acomp_ctx->buffers);
+ }
}
static struct zswap_pool *zswap_pool_create(char *compressor)
@@ -265,6 +282,7 @@ static struct zswap_pool *zswap_pool_create(char *compressor)
if (!zswap_has_pool && !strcmp(compressor, ZSWAP_PARAM_UNSET))
return NULL;
+ /* Many things rely on the zero-initialization. */
pool = kzalloc(sizeof(*pool), GFP_KERNEL);
if (!pool)
return NULL;
@@ -315,7 +333,9 @@ static struct zswap_pool *zswap_pool_create(char *compressor)
cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
for_each_possible_cpu(cpu)
- acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
+ acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu),
+ pool->compr_batch_size);
+
error:
if (pool->acomp_ctx)
free_percpu(pool->acomp_ctx);
@@ -353,7 +373,8 @@ static void zswap_pool_destroy(struct zswap_pool *pool)
cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
for_each_possible_cpu(cpu)
- acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
+ acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu),
+ pool->compr_batch_size);
free_percpu(pool->acomp_ctx);
@@ -644,14 +665,8 @@ static inline struct mem_cgroup *mem_cgroup_from_entry(struct zswap_entry *entry
}
#endif
-static inline int entry_to_nid(struct zswap_entry *entry)
-{
- return page_to_nid(virt_to_page(entry));
-}
-
static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
{
- int nid = entry_to_nid(entry);
struct mem_cgroup *memcg;
/*
@@ -668,19 +683,18 @@ static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
rcu_read_lock();
memcg = mem_cgroup_from_entry(entry);
/* will always succeed */
- list_lru_add(list_lru, &entry->lru, nid, memcg);
+ list_lru_add(list_lru, &entry->lru, entry->nid, memcg);
rcu_read_unlock();
}
static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
{
- int nid = entry_to_nid(entry);
struct mem_cgroup *memcg;
rcu_read_lock();
memcg = mem_cgroup_from_entry(entry);
/* will always succeed */
- list_lru_del(list_lru, &entry->lru, nid, memcg);
+ list_lru_del(list_lru, &entry->lru, entry->nid, memcg);
rcu_read_unlock();
}
@@ -740,6 +754,29 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
kmem_cache_free(zswap_entry_cache, entry);
}
+/*
+ * Returns 0 if kmem_cache_alloc_bulk() failed and a positive number otherwise.
+ * The code for __kmem_cache_alloc_bulk() indicates that this positive number
+ * will be the @size requested, i.e., @nr_entries.
+ */
+static __always_inline int zswap_entries_cache_alloc_batch(void **entries,
+ unsigned int nr_entries,
+ gfp_t gfp)
+{
+ int nr_alloc = kmem_cache_alloc_bulk(zswap_entry_cache, gfp,
+ nr_entries, entries);
+
+ WARN_ON(!nr_alloc || (nr_alloc != nr_entries));
+
+ return nr_alloc;
+}
+
+static __always_inline void zswap_entries_cache_free_batch(void **entries,
+ unsigned int nr_entries)
+{
+ kmem_cache_free_bulk(zswap_entry_cache, nr_entries, entries);
+}
+
/*
* Carries out the common pattern of freeing an entry's zsmalloc allocation,
* freeing the entry itself, and decrementing the number of stored pages.
@@ -766,7 +803,9 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
{
struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
+ int nid = cpu_to_node(cpu);
int ret = -ENOMEM;
+ u8 i;
/*
* To handle cases where the CPU goes through online-offline-online
@@ -775,11 +814,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
return 0;
- acomp_ctx->buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
- if (!acomp_ctx->buffer)
- return ret;
-
- acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
+ acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, nid);
if (IS_ERR_OR_NULL(acomp_ctx->acomp)) {
pr_err("could not alloc crypto acomp %s : %ld\n",
pool->tfm_name, PTR_ERR(acomp_ctx->acomp));
@@ -788,20 +823,39 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
}
acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp);
+ /*
+ * Allocate up to ZSWAP_MAX_BATCH_SIZE dst buffers if the
+ * compressor supports batching.
+ */
+ pool->compr_batch_size = min(ZSWAP_MAX_BATCH_SIZE,
+ crypto_acomp_batch_size(acomp_ctx->acomp));
+
acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
+
if (IS_ERR_OR_NULL(acomp_ctx->req)) {
pr_err("could not alloc crypto acomp_request %s\n",
pool->tfm_name);
goto fail;
}
- crypto_init_wait(&acomp_ctx->wait);
+ acomp_ctx->buffers = kcalloc_node(pool->compr_batch_size, sizeof(u8 *),
+ GFP_KERNEL, nid);
+ if (!acomp_ctx->buffers)
+ goto fail;
+
+ for (i = 0; i < pool->compr_batch_size; ++i) {
+ acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE, GFP_KERNEL, nid);
+ if (!acomp_ctx->buffers[i])
+ goto fail;
+ }
/*
* if the backend of acomp is async zip, crypto_req_done() will wakeup
* crypto_wait_req(); if the backend of acomp is scomp, the callback
* won't be called, crypto_wait_req() will return without blocking.
*/
+ crypto_init_wait(&acomp_ctx->wait);
+
acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG,
crypto_req_done, &acomp_ctx->wait);
@@ -811,12 +865,12 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
return 0;
fail:
- acomp_ctx_dealloc(acomp_ctx);
+ acomp_ctx_dealloc(acomp_ctx, pool->compr_batch_size);
return ret;
}
static bool zswap_compress(struct page *page, struct zswap_entry *entry,
- struct zswap_pool *pool)
+ struct zswap_pool *pool, bool wb_enabled)
{
struct crypto_acomp_ctx *acomp_ctx;
struct scatterlist input, output;
@@ -830,7 +884,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
mutex_lock(&acomp_ctx->mutex);
- dst = acomp_ctx->buffer;
+ dst = acomp_ctx->buffers[0];
sg_init_table(&input, 1);
sg_set_page(&input, page, PAGE_SIZE, 0);
@@ -860,8 +914,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
* to the active LRU list in the case.
*/
if (comp_ret || !dlen || dlen >= PAGE_SIZE) {
- if (!mem_cgroup_zswap_writeback_enabled(
- folio_memcg(page_folio(page)))) {
+ if (!wb_enabled) {
comp_ret = comp_ret ? comp_ret : -EINVAL;
goto unlock;
}
@@ -906,7 +959,7 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
mutex_lock(&acomp_ctx->mutex);
- obj = zs_obj_read_begin(pool->zs_pool, entry->handle, acomp_ctx->buffer);
+ obj = zs_obj_read_begin(pool->zs_pool, entry->handle, acomp_ctx->buffers[0]);
/* zswap entries of length PAGE_SIZE are not compressed. */
if (entry->length == PAGE_SIZE) {
@@ -916,15 +969,15 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
/*
* zs_obj_read_begin() might return a kmap address of highmem when
- * acomp_ctx->buffer is not used. However, sg_init_one() does not
- * handle highmem addresses, so copy the object to acomp_ctx->buffer.
+ * acomp_ctx->buffers[0] is not used. However, sg_init_one() does not
+ * handle highmem addresses, so copy the object to acomp_ctx->buffers[0].
*/
if (virt_addr_valid(obj)) {
src = obj;
} else {
- WARN_ON_ONCE(obj == acomp_ctx->buffer);
- memcpy(acomp_ctx->buffer, obj, entry->length);
- src = acomp_ctx->buffer;
+ WARN_ON_ONCE(obj == acomp_ctx->buffers[0]);
+ memcpy(acomp_ctx->buffers[0], obj, entry->length);
+ src = acomp_ctx->buffers[0];
}
sg_init_one(&input, src, entry->length);
@@ -1378,95 +1431,156 @@ static void shrink_worker(struct work_struct *w)
* main API
**********************************/
-static bool zswap_store_page(struct page *page,
- struct obj_cgroup *objcg,
- struct zswap_pool *pool)
+/*
+ * Store multiple pages in @folio, starting from the page at index @start up to
+ * the page at index @end-1.
+ */
+static bool zswap_store_pages(struct folio *folio,
+ long start,
+ long end,
+ struct obj_cgroup *objcg,
+ struct zswap_pool *pool,
+ int nid,
+ bool wb_enabled)
{
- swp_entry_t page_swpentry = page_swap_entry(page);
- struct zswap_entry *entry, *old;
-
- /* allocate entry */
- entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
- if (!entry) {
- zswap_reject_kmemcache_fail++;
- return false;
+ struct zswap_entry *entries[ZSWAP_MAX_BATCH_SIZE];
+ u8 i, store_fail_idx = 0, nr_pages = end - start;
+
+ VM_WARN_ON_ONCE(nr_pages > ZSWAP_MAX_BATCH_SIZE);
+
+ if (unlikely(!zswap_entries_cache_alloc_batch((void **)&entries[0],
+ nr_pages, GFP_KERNEL))) {
+ for (i = 0; i < nr_pages; ++i) {
+ entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, nid);
+
+ if (unlikely(!entries[i])) {
+ zswap_reject_kmemcache_fail++;
+ /*
+ * While handling this error, we only need to
+ * call zswap_entries_cache_free_batch() for
+ * entries[0 .. @i-1].
+ */
+ nr_pages = i;
+ goto store_pages_failed;
+ }
+ }
}
- if (!zswap_compress(page, entry, pool))
- goto compress_failed;
+ /*
+ * We colocate entry initialization as much as possible here to
+ * minimize potential cache misses.
+ *
+ * With kmem_cache_alloc_bulk(), the batch's entries will be created
+ * on the NUMA node of the CPU on which zswap_store() is called, which
+ * might not be the same as @nid, the NUMA node on which @folio was
+ * allocated. In order for the @folio's entries to be written back when
+ * @nid experiences memory pressure, we store @nid in @entry->nid.
+ * This ensures that the entry is added to and deleted from the LRU
+ * list of the correct node, namely @nid.
+ */
+ for (i = 0; i < nr_pages; ++i) {
+ entries[i]->handle = (unsigned long)ERR_PTR(-EINVAL);
+ entries[i]->pool = pool;
+ entries[i]->swpentry = page_swap_entry(folio_page(folio, start + i));
+ entries[i]->objcg = objcg;
+ entries[i]->referenced = true;
+ entries[i]->nid = nid;
+ INIT_LIST_HEAD(&entries[i]->lru);
+ }
- old = xa_store(swap_zswap_tree(page_swpentry),
- swp_offset(page_swpentry),
- entry, GFP_KERNEL);
- if (xa_is_err(old)) {
- int err = xa_err(old);
+ for (i = 0; i < nr_pages; ++i) {
+ struct page *page = folio_page(folio, start + i);
- WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
- zswap_reject_alloc_fail++;
- goto store_failed;
+ if (!zswap_compress(page, entries[i], pool, wb_enabled))
+ goto store_pages_failed;
}
- /*
- * We may have had an existing entry that became stale when
- * the folio was redirtied and now the new version is being
- * swapped out. Get rid of the old.
- */
- if (old)
- zswap_entry_free(old);
+ for (i = 0; i < nr_pages; ++i) {
+ struct zswap_entry *old, *entry = entries[i];
- /*
- * The entry is successfully compressed and stored in the tree, there is
- * no further possibility of failure. Grab refs to the pool and objcg,
- * charge zswap memory, and increment zswap_stored_pages.
- * The opposite actions will be performed by zswap_entry_free()
- * when the entry is removed from the tree.
- */
- zswap_pool_get(pool);
- if (objcg) {
- obj_cgroup_get(objcg);
- obj_cgroup_charge_zswap(objcg, entry->length);
- }
- atomic_long_inc(&zswap_stored_pages);
- if (entry->length == PAGE_SIZE)
- atomic_long_inc(&zswap_stored_incompressible_pages);
+ old = xa_store(swap_zswap_tree(entry->swpentry),
+ swp_offset(entry->swpentry),
+ entry, GFP_KERNEL);
+ if (unlikely(xa_is_err(old))) {
+ int err = xa_err(old);
- /*
- * We finish initializing the entry while it's already in xarray.
- * This is safe because:
- *
- * 1. Concurrent stores and invalidations are excluded by folio lock.
- *
- * 2. Writeback is excluded by the entry not being on the LRU yet.
- * The publishing order matters to prevent writeback from seeing
- * an incoherent entry.
- */
- entry->pool = pool;
- entry->swpentry = page_swpentry;
- entry->objcg = objcg;
- entry->referenced = true;
- if (entry->length) {
- INIT_LIST_HEAD(&entry->lru);
- zswap_lru_add(&zswap_list_lru, entry);
+ WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
+ zswap_reject_alloc_fail++;
+ /*
+ * Entries up to this point have been stored in the
+ * xarray. zswap_store() will erase them from the xarray
+ * and call zswap_entry_free(). Local cleanup in
+ * 'store_pages_failed' only needs to happen for
+ * entries from [@i to @nr_pages).
+ */
+ store_fail_idx = i;
+ goto store_pages_failed;
+ }
+
+ /*
+ * We may have had an existing entry that became stale when
+ * the folio was redirtied and now the new version is being
+ * swapped out. Get rid of the old.
+ */
+ if (unlikely(old))
+ zswap_entry_free(old);
+
+ /*
+ * The entry is successfully compressed and stored in the tree,
+ * and further failures will be cleaned up in zswap_store().
+ * Grab refs to the pool and objcg, charge zswap memory, and
+ * increment zswap_stored_pages. The opposite actions will be
+ * performed by zswap_entry_free() when the entry is removed
+ * from the tree.
+ */
+ zswap_pool_get(pool);
+ if (objcg) {
+ obj_cgroup_get(objcg);
+ obj_cgroup_charge_zswap(objcg, entry->length);
+ }
+ atomic_long_inc(&zswap_stored_pages);
+ if (entry->length == PAGE_SIZE)
+ atomic_long_inc(&zswap_stored_incompressible_pages);
+
+ /*
+ * We finish by adding the entry to the LRU while it's already
+ * in xarray. This is safe because:
+ *
+ * 1. Concurrent stores and invalidations are excluded by folio lock.
+ *
+ * 2. Writeback is excluded by the entry not being on the LRU yet.
+ * The publishing order matters to prevent writeback from seeing
+ * an incoherent entry.
+ */
+ if (likely(entry->length))
+ zswap_lru_add(&zswap_list_lru, entry);
}
return true;
-store_failed:
- zs_free(pool->zs_pool, entry->handle);
-compress_failed:
- zswap_entry_cache_free(entry);
+store_pages_failed:
+ for (i = store_fail_idx; i < nr_pages; ++i) {
+ if (!IS_ERR_VALUE(entries[i]->handle))
+ zs_free(pool->zs_pool, entries[i]->handle);
+ }
+ zswap_entries_cache_free_batch((void **)&entries[store_fail_idx],
+ nr_pages - store_fail_idx);
+
return false;
}
bool zswap_store(struct folio *folio)
{
+ bool wb_enabled = mem_cgroup_zswap_writeback_enabled(folio_memcg(folio));
long nr_pages = folio_nr_pages(folio);
swp_entry_t swp = folio->swap;
struct obj_cgroup *objcg = NULL;
struct mem_cgroup *memcg = NULL;
+ int nid = folio_nid(folio);
struct zswap_pool *pool;
+ u8 store_batch_size;
bool ret = false;
- long index;
+ long start, end;
VM_WARN_ON_ONCE(!folio_test_locked(folio));
VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
@@ -1500,10 +1614,24 @@ bool zswap_store(struct folio *folio)
mem_cgroup_put(memcg);
}
- for (index = 0; index < nr_pages; ++index) {
- struct page *page = folio_page(folio, index);
+ /*
+ * For batching compressors, store the folio in batches of the
+ * compressor's batch_size.
+ *
+ * For non-batching compressors, store the folio in batches
+ * of ZSWAP_MAX_BATCH_SIZE, where each page in the batch is
+ * compressed sequentially. This gives better performance than
+ * invoking zswap_store_pages() per-page, due to cache locality
+ * of working set structures.
+ */
+ store_batch_size = (pool->compr_batch_size > 1) ?
+ pool->compr_batch_size : ZSWAP_MAX_BATCH_SIZE;
+
+ for (start = 0; start < nr_pages; start += store_batch_size) {
+ end = min(start + store_batch_size, nr_pages);
- if (!zswap_store_page(page, objcg, pool))
+ if (!zswap_store_pages(folio, start, end, objcg, pool,
+ nid, wb_enabled))
goto put_pool;
}
@@ -1533,9 +1661,9 @@ bool zswap_store(struct folio *folio)
struct zswap_entry *entry;
struct xarray *tree;
- for (index = 0; index < nr_pages; ++index) {
- tree = swap_zswap_tree(swp_entry(type, offset + index));
- entry = xa_erase(tree, offset + index);
+ for (start = 0; start < nr_pages; ++start) {
+ tree = swap_zswap_tree(swp_entry(type, offset + start));
+ entry = xa_erase(tree, offset + start);
if (entry)
zswap_entry_free(entry);
}
--
2.27.0
^ permalink raw reply [flat|nested] 47+ messages in thread

* Re: [PATCH v13 21/22] mm: zswap: zswap_store() will process a large folio in batches.
2025-11-04 9:12 ` [PATCH v13 21/22] mm: zswap: zswap_store() will process a large folio in batches Kanchana P Sridhar
@ 2025-11-06 17:45 ` Nhat Pham
2025-11-07 2:28 ` Sridhar, Kanchana P
2025-11-13 20:51 ` Yosry Ahmed
1 sibling, 1 reply; 47+ messages in thread
From: Nhat Pham @ 2025-11-06 17:45 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, yosry.ahmed, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
senozhatsky, sj, kasong, linux-crypto, herbert, davem, clabbe,
ardb, ebiggers, surenb, kristen.c.accardi, vinicius.gomes,
wajdi.k.feghali, vinodh.gopal
On Tue, Nov 4, 2025 at 1:12 AM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> This patch makes two major changes:
>
> First, we allocate pool batching resources if the compressor supports
> batching:
>
> This patch sets up zswap for allocating per-CPU resources optimally
> for non-batching and batching compressors.
>
> A new ZSWAP_MAX_BATCH_SIZE constant is defined as 8U, to set an upper
> limit on the number of pages in large folios that will be batch
> compressed.
>
> It is up to the compressor to manage multiple requests, as needed, to
> accomplish batch parallelism. zswap only needs to allocate the per-CPU
> dst buffers according to the batch size supported by the compressor.
>
> A "u8 compr_batch_size" member is added to "struct zswap_pool", as per
> Yosry's suggestion. pool->compr_batch_size is set as the minimum of
> the compressor's max batch-size and ZSWAP_MAX_BATCH_SIZE. Accordingly,
> pool->compr_batch_size compression dst buffers are allocated in the
> per-CPU acomp_ctx.
>
> zswap does not use more than one dst buffer yet. Follow-up patches
> will actually utilize the multiple acomp_ctx buffers for batch
> compression/decompression of multiple pages.
>
> Thus, ZSWAP_MAX_BATCH_SIZE limits the amount of extra memory used for
> batching. There is a small extra memory overhead of allocating
> the acomp_ctx->buffers array for compressors that do not support
> batching: On x86_64, the overhead is 1 pointer per-CPU (i.e. 8 bytes).
>
> Next, we store the folio in batches:
>
> This patch modifies zswap_store() to store a batch of pages in large
> folios at a time, instead of storing one page at a time. It does this by
> calling a new procedure zswap_store_pages() with a range of indices in
> the folio: for batching compressors, this range contains up to
> pool->compr_batch_size pages. For non-batching compressors, we send up
> to ZSWAP_MAX_BATCH_SIZE pages to be sequentially compressed and stored
> in zswap_store_pages().
>
> zswap_store_pages() implements all the computes done earlier in
> zswap_store_page() for a single-page, for multiple pages in a folio,
> namely the "batch":
>
> 1) It starts by allocating all zswap entries required to store the
> batch. New procedures, zswap_entries_cache_alloc_batch() and
> zswap_entries_cache_free_batch() call kmem_cache_[free]alloc_bulk()
> to optimize the performance of this step.
>
> 2) The entry doesn't have to be allocated on the same node as the page
> being stored in zswap: we let the slab allocator decide this in
> kmem_cache_alloc_bulk(). However, to make sure the current zswap
> LRU list/shrinker behavior is preserved, we store the folio's nid as
> a new @nid member in the entry to enable adding it to the correct
> LRU list (and deleting it from the right LRU list). This ensures
> that when the folio's allocating NUMA node is under memory
> pressure, the entries corresponding to its pages are written back.
>
> The memory footprint of struct zswap_entry remains unchanged at
> 56 bytes with the addition of the "int nid" member by condensing
> "length" and "referenced" into 4 bytes using bit fields and using
> the 4 bytes available after "referenced" for the "int nid". Thanks
> to Nhat and Yosry for these suggestions!
>
> 3) Next, the entries' fields are written, computations that need to happen
> anyway, without modifying the zswap xarray/LRU publishing order. This
> avoids bringing the entries into the cache for writing in different
> code blocks within this procedure, hence improves latency.
>
> 4) Next, it calls zswap_compress() to sequentially compress each page in
> the batch.
>
> 5) Finally, it adds the batch's zswap entries to the xarray and LRU,
> charges zswap memory and increments zswap stats.
>
> 6) The error handling and cleanup required for all failure scenarios
> that can occur while storing a batch in zswap are consolidated to a
> single "store_pages_failed" label in zswap_store_pages(). Here again,
> we optimize performance by calling kmem_cache_free_bulk().
>
> This commit also makes a minor optimization in zswap_compress(), that
> takes a "bool wb_enabled" argument; computed once in zswap_store()
> rather than for each page in the folio.
>
> Suggested-by: Nhat Pham <nphamcs@gmail.com>
> Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
> mm/zswap.c | 336 ++++++++++++++++++++++++++++++++++++-----------------
> 1 file changed, 232 insertions(+), 104 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index cb384eb7c815..257567edc587 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -82,6 +82,9 @@ static bool zswap_pool_reached_full;
>
> #define ZSWAP_PARAM_UNSET ""
>
> +/* Limit the batch size to limit per-CPU memory usage for dst buffers. */
> +#define ZSWAP_MAX_BATCH_SIZE 8U
> +
> static int zswap_setup(void);
>
> /* Enable/disable zswap */
> @@ -139,7 +142,7 @@ struct crypto_acomp_ctx {
> struct crypto_acomp *acomp;
> struct acomp_req *req;
> struct crypto_wait wait;
> - u8 *buffer;
> + u8 **buffers;
> struct mutex mutex;
> bool is_sleepable;
> };
> @@ -149,6 +152,9 @@ struct crypto_acomp_ctx {
> * The only case where lru_lock is not acquired while holding tree.lock is
> * when a zswap_entry is taken off the lru for writeback, in that case it
> * needs to be verified that it's still valid in the tree.
> + *
> + * @compr_batch_size: The max batch size of the compression algorithm,
> + * bounded by ZSWAP_MAX_BATCH_SIZE.
> */
> struct zswap_pool {
> struct zs_pool *zs_pool;
> @@ -158,6 +164,7 @@ struct zswap_pool {
> struct work_struct release_work;
> struct hlist_node node;
> char tfm_name[CRYPTO_MAX_ALG_NAME];
> + u8 compr_batch_size;
> };
>
> /* Global LRU lists shared by all zswap pools. */
> @@ -182,6 +189,7 @@ static struct shrinker *zswap_shrinker;
> * writeback logic. The entry is only reclaimed by the writeback
> * logic if referenced is unset. See comments in the shrinker
> * section for context.
> + * nid - NUMA node id of the page for which this is the zswap entry.
> * pool - the zswap_pool the entry's data is in
> * handle - zsmalloc allocation handle that stores the compressed page data
> * objcg - the obj_cgroup that the compressed memory is charged to
> @@ -189,8 +197,11 @@ static struct shrinker *zswap_shrinker;
> */
> struct zswap_entry {
> swp_entry_t swpentry;
> - unsigned int length;
> - bool referenced;
> + struct {
> + unsigned int length:31;
> + bool referenced:1;
> + };
Maybe make these macro-defined constants?
Code mostly LGTM otherwise.
^ permalink raw reply [flat|nested] 47+ messages in thread

* RE: [PATCH v13 21/22] mm: zswap: zswap_store() will process a large folio in batches.
2025-11-06 17:45 ` Nhat Pham
@ 2025-11-07 2:28 ` Sridhar, Kanchana P
2025-11-13 20:52 ` Yosry Ahmed
0 siblings, 1 reply; 47+ messages in thread
From: Sridhar, Kanchana P @ 2025-11-07 2:28 UTC (permalink / raw)
To: Nhat Pham
Cc: linux-kernel, linux-mm, hannes, yosry.ahmed, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
senozhatsky, sj, kasong, linux-crypto, herbert, davem, clabbe,
ardb, ebiggers, surenb, Accardi, Kristen C, Gomes, Vinicius,
Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P
> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Thursday, November 6, 2025 9:46 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 21/22] mm: zswap: zswap_store() will process a
> large folio in batches.
>
> On Tue, Nov 4, 2025 at 1:12 AM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > This patch makes two major changes:
> >
> > First, we allocate pool batching resources if the compressor supports
> > batching:
> >
> > This patch sets up zswap for allocating per-CPU resources optimally
> > for non-batching and batching compressors.
> >
> > A new ZSWAP_MAX_BATCH_SIZE constant is defined as 8U, to set an upper
> > limit on the number of pages in large folios that will be batch
> > compressed.
> >
> > It is up to the compressor to manage multiple requests, as needed, to
> > accomplish batch parallelism. zswap only needs to allocate the per-CPU
> > dst buffers according to the batch size supported by the compressor.
> >
> > A "u8 compr_batch_size" member is added to "struct zswap_pool", as per
> > Yosry's suggestion. pool->compr_batch_size is set as the minimum of
> > the compressor's max batch-size and ZSWAP_MAX_BATCH_SIZE.
> Accordingly,
> > pool->compr_batch_size compression dst buffers are allocated in the
> > per-CPU acomp_ctx.
> >
> > zswap does not use more than one dst buffer yet. Follow-up patches
> > will actually utilize the multiple acomp_ctx buffers for batch
> > compression/decompression of multiple pages.
> >
> > Thus, ZSWAP_MAX_BATCH_SIZE limits the amount of extra memory used
> for
> > batching. There is a small extra memory overhead of allocating
> > the acomp_ctx->buffers array for compressors that do not support
> > batching: On x86_64, the overhead is 1 pointer per-CPU (i.e. 8 bytes).
> >
> > Next, we store the folio in batches:
> >
> > This patch modifies zswap_store() to store a batch of pages in large
> > folios at a time, instead of storing one page at a time. It does this by
> > calling a new procedure zswap_store_pages() with a range of indices in
> > the folio: for batching compressors, this range contains up to
> > pool->compr_batch_size pages. For non-batching compressors, we send up
> > to ZSWAP_MAX_BATCH_SIZE pages to be sequentially compressed and
> stored
> > in zswap_store_pages().
> >
> > zswap_store_pages() implements all the computes done earlier in
> > zswap_store_page() for a single-page, for multiple pages in a folio,
> > namely the "batch":
> >
> > 1) It starts by allocating all zswap entries required to store the
> > batch. New procedures, zswap_entries_cache_alloc_batch() and
> > zswap_entries_cache_free_batch() call kmem_cache_[free]alloc_bulk()
> > to optimize the performance of this step.
> >
> > 2) The entry doesn't have to be allocated on the same node as the page
> > being stored in zswap: we let the slab allocator decide this in
> > kmem_cache_alloc_bulk(). However, to make sure the current zswap
> > LRU list/shrinker behavior is preserved, we store the folio's nid as
> > a new @nid member in the entry to enable adding it to the correct
> > LRU list (and deleting it from the right LRU list). This ensures
> > that when the folio's allocating NUMA node is under memory
> > pressure, the entries corresponding to its pages are written back.
> >
> > The memory footprint of struct zswap_entry remains unchanged at
> > 56 bytes with the addition of the "int nid" member by condensing
> > "length" and "referenced" into 4 bytes using bit fields and using
> > the 4 bytes available after "referenced" for the "int nid". Thanks
> > to Nhat and Yosry for these suggestions!
> >
> > > 3) Next, the entries' fields are written, computations that need to happen
> > anyway, without modifying the zswap xarray/LRU publishing order. This
> > avoids bringing the entries into the cache for writing in different
> > code blocks within this procedure, hence improves latency.
> >
> > 4) Next, it calls zswap_compress() to sequentially compress each page in
> > the batch.
> >
> > 5) Finally, it adds the batch's zswap entries to the xarray and LRU,
> > charges zswap memory and increments zswap stats.
> >
> > 6) The error handling and cleanup required for all failure scenarios
> > that can occur while storing a batch in zswap are consolidated to a
> > single "store_pages_failed" label in zswap_store_pages(). Here again,
> > we optimize performance by calling kmem_cache_free_bulk().
> >
> > This commit also makes a minor optimization in zswap_compress(), that
> > takes a "bool wb_enabled" argument; computed once in zswap_store()
> > rather than for each page in the folio.
> >
> > Suggested-by: Nhat Pham <nphamcs@gmail.com>
> > Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> > mm/zswap.c | 336 ++++++++++++++++++++++++++++++++++++-------------
> ----
> > 1 file changed, 232 insertions(+), 104 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index cb384eb7c815..257567edc587 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -82,6 +82,9 @@ static bool zswap_pool_reached_full;
> >
> > #define ZSWAP_PARAM_UNSET ""
> >
> > +/* Limit the batch size to limit per-CPU memory usage for dst buffers. */
> > +#define ZSWAP_MAX_BATCH_SIZE 8U
> > +
> > static int zswap_setup(void);
> >
> > /* Enable/disable zswap */
> > @@ -139,7 +142,7 @@ struct crypto_acomp_ctx {
> > struct crypto_acomp *acomp;
> > struct acomp_req *req;
> > struct crypto_wait wait;
> > - u8 *buffer;
> > + u8 **buffers;
> > struct mutex mutex;
> > bool is_sleepable;
> > };
> > @@ -149,6 +152,9 @@ struct crypto_acomp_ctx {
> > * The only case where lru_lock is not acquired while holding tree.lock is
> > * when a zswap_entry is taken off the lru for writeback, in that case it
> > * needs to be verified that it's still valid in the tree.
> > + *
> > + * @compr_batch_size: The max batch size of the compression algorithm,
> > + * bounded by ZSWAP_MAX_BATCH_SIZE.
> > */
> > struct zswap_pool {
> > struct zs_pool *zs_pool;
> > @@ -158,6 +164,7 @@ struct zswap_pool {
> > struct work_struct release_work;
> > struct hlist_node node;
> > char tfm_name[CRYPTO_MAX_ALG_NAME];
> > + u8 compr_batch_size;
> > };
> >
> > /* Global LRU lists shared by all zswap pools. */
> > @@ -182,6 +189,7 @@ static struct shrinker *zswap_shrinker;
> > * writeback logic. The entry is only reclaimed by the writeback
> > * logic if referenced is unset. See comments in the shrinker
> > * section for context.
> > + * nid - NUMA node id of the page for which this is the zswap entry.
> > * pool - the zswap_pool the entry's data is in
> > * handle - zsmalloc allocation handle that stores the compressed page data
> > * objcg - the obj_cgroup that the compressed memory is charged to
> > @@ -189,8 +197,11 @@ static struct shrinker *zswap_shrinker;
> > */
> > struct zswap_entry {
> > swp_entry_t swpentry;
> > - unsigned int length;
> > - bool referenced;
> > + struct {
> > + unsigned int length:31;
> > + bool referenced:1;
> > + };
>
> Maybe make these macro-defined constants?
>
> Code mostly LGTM otherwise.
Thanks, Nhat! With respect to the suggestion to make the bit-fields
as macro-defined constants, I was browsing through kernel headers
that use bit-fields, and it appears the convention is to use integers
rather than constants.
I then started browsing mm code and saw the struct zspage { .. } definition
in zsmalloc.c that uses macro-defined constants. But that seems to be
the exception. vmscan.c uses integer valued bit-fields in
struct scan_control { .. }.
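For illustration only (not code from the patch; the macro names below are
made up), the two conventions being compared look roughly like this:

	/* As in this patch: integer-literal widths. */
	struct {
		unsigned int length:31;
		bool referenced:1;
	};

	/*
	 * Alternative, as in zsmalloc's struct zspage: macro-defined
	 * widths. ZSWAP_LENGTH_BITS/ZSWAP_REFERENCED_BITS are
	 * hypothetical names, used here just for illustration.
	 */
	#define ZSWAP_LENGTH_BITS	31
	#define ZSWAP_REFERENCED_BITS	1

	struct {
		unsigned int length:ZSWAP_LENGTH_BITS;
		bool referenced:ZSWAP_REFERENCED_BITS;
	};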
If you would still like the bit-fields to be constants, I am happy to make
the change but just wanted to share these observations.
Thanks,
Kanchana
^ permalink raw reply [flat|nested] 47+ messages in thread

* Re: [PATCH v13 21/22] mm: zswap: zswap_store() will process a large folio in batches.
2025-11-07 2:28 ` Sridhar, Kanchana P
@ 2025-11-13 20:52 ` Yosry Ahmed
0 siblings, 0 replies; 47+ messages in thread
From: Yosry Ahmed @ 2025-11-13 20:52 UTC (permalink / raw)
To: Sridhar, Kanchana P
Cc: Nhat Pham, linux-kernel, linux-mm, hannes, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
senozhatsky, sj, kasong, linux-crypto, herbert, davem, clabbe,
ardb, ebiggers, surenb, Accardi, Kristen C, Gomes, Vinicius,
Feghali, Wajdi K, Gopal, Vinodh
On Fri, Nov 07, 2025 at 02:28:23AM +0000, Sridhar, Kanchana P wrote:
>
> > -----Original Message-----
> > From: Nhat Pham <nphamcs@gmail.com>
> > Sent: Thursday, November 6, 2025 9:46 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; yosry.ahmed@linux.dev; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v13 21/22] mm: zswap: zswap_store() will process a
> > large folio in batches.
> >
> > On Tue, Nov 4, 2025 at 1:12 AM Kanchana P Sridhar
> > <kanchana.p.sridhar@intel.com> wrote:
> > >
> > > This patch makes two major changes:
> > >
> > > First, we allocate pool batching resources if the compressor supports
> > > batching:
> > >
> > > This patch sets up zswap for allocating per-CPU resources optimally
> > > for non-batching and batching compressors.
> > >
> > > A new ZSWAP_MAX_BATCH_SIZE constant is defined as 8U, to set an upper
> > > limit on the number of pages in large folios that will be batch
> > > compressed.
> > >
> > > It is up to the compressor to manage multiple requests, as needed, to
> > > accomplish batch parallelism. zswap only needs to allocate the per-CPU
> > > dst buffers according to the batch size supported by the compressor.
> > >
> > > A "u8 compr_batch_size" member is added to "struct zswap_pool", as per
> > > Yosry's suggestion. pool->compr_batch_size is set as the minimum of
> > > the compressor's max batch-size and ZSWAP_MAX_BATCH_SIZE.
> > Accordingly,
> > > pool->compr_batch_size compression dst buffers are allocated in the
> > > per-CPU acomp_ctx.
> > >
> > > zswap does not use more than one dst buffer yet. Follow-up patches
> > > will actually utilize the multiple acomp_ctx buffers for batch
> > > compression/decompression of multiple pages.
> > >
> > > Thus, ZSWAP_MAX_BATCH_SIZE limits the amount of extra memory used
> > for
> > > batching. There is a small extra memory overhead of allocating
> > > the acomp_ctx->buffers array for compressors that do not support
> > > batching: On x86_64, the overhead is 1 pointer per-CPU (i.e. 8 bytes).
> > >
> > > Next, we store the folio in batches:
> > >
> > > This patch modifies zswap_store() to store a batch of pages in large
> > > folios at a time, instead of storing one page at a time. It does this by
> > > calling a new procedure zswap_store_pages() with a range of indices in
> > > the folio: for batching compressors, this range contains up to
> > > pool->compr_batch_size pages. For non-batching compressors, we send up
> > > to ZSWAP_MAX_BATCH_SIZE pages to be sequentially compressed and
> > stored
> > > in zswap_store_pages().
> > >
> > > zswap_store_pages() implements all the computes done earlier in
> > > zswap_store_page() for a single-page, for multiple pages in a folio,
> > > namely the "batch":
> > >
> > > 1) It starts by allocating all zswap entries required to store the
> > > batch. New procedures, zswap_entries_cache_alloc_batch() and
> > > zswap_entries_cache_free_batch() call kmem_cache_[free]alloc_bulk()
> > > to optimize the performance of this step.
> > >
> > > 2) The entry doesn't have to be allocated on the same node as the page
> > > being stored in zswap: we let the slab allocator decide this in
> > > kmem_cache_alloc_bulk(). However, to make sure the current zswap
> > > LRU list/shrinker behavior is preserved, we store the folio's nid as
> > > a new @nid member in the entry to enable adding it to the correct
> > > LRU list (and deleting it from the right LRU list). This ensures
> > > that when the folio's allocating NUMA node is under memory
> > > pressure, the entries corresponding to its pages are written back.
> > >
> > > The memory footprint of struct zswap_entry remains unchanged at
> > > 56 bytes with the addition of the "int nid" member by condensing
> > > "length" and "referenced" into 4 bytes using bit fields and using
> > > the 4 bytes available after "referenced" for the "int nid". Thanks
> > > to Nhat and Yosry for these suggestions!
> > >
> > > 3) Next, the entries' fields are written, computations that need to happen
> > > anyway, without modifying the zswap xarray/LRU publishing order. This
> > > avoids bringing the entries into the cache for writing in different
> > > code blocks within this procedure, hence improves latency.
> > >
> > > 4) Next, it calls zswap_compress() to sequentially compress each page in
> > > the batch.
> > >
> > > 5) Finally, it adds the batch's zswap entries to the xarray and LRU,
> > > charges zswap memory and increments zswap stats.
> > >
> > > 6) The error handling and cleanup required for all failure scenarios
> > > that can occur while storing a batch in zswap are consolidated to a
> > > single "store_pages_failed" label in zswap_store_pages(). Here again,
> > > we optimize performance by calling kmem_cache_free_bulk().
> > >
> > > This commit also makes a minor optimization in zswap_compress(), that
> > > takes a "bool wb_enabled" argument; computed once in zswap_store()
> > > rather than for each page in the folio.
> > >
> > > Suggested-by: Nhat Pham <nphamcs@gmail.com>
> > > Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > > ---
> > > mm/zswap.c | 336 ++++++++++++++++++++++++++++++++++++-------------
> > ----
> > > 1 file changed, 232 insertions(+), 104 deletions(-)
> > >
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index cb384eb7c815..257567edc587 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -82,6 +82,9 @@ static bool zswap_pool_reached_full;
> > >
> > > #define ZSWAP_PARAM_UNSET ""
> > >
> > > +/* Limit the batch size to limit per-CPU memory usage for dst buffers. */
> > > +#define ZSWAP_MAX_BATCH_SIZE 8U
> > > +
> > > static int zswap_setup(void);
> > >
> > > /* Enable/disable zswap */
> > > @@ -139,7 +142,7 @@ struct crypto_acomp_ctx {
> > > struct crypto_acomp *acomp;
> > > struct acomp_req *req;
> > > struct crypto_wait wait;
> > > - u8 *buffer;
> > > + u8 **buffers;
> > > struct mutex mutex;
> > > bool is_sleepable;
> > > };
> > > @@ -149,6 +152,9 @@ struct crypto_acomp_ctx {
> > > * The only case where lru_lock is not acquired while holding tree.lock is
> > > * when a zswap_entry is taken off the lru for writeback, in that case it
> > > * needs to be verified that it's still valid in the tree.
> > > + *
> > > + * @compr_batch_size: The max batch size of the compression algorithm,
> > > + * bounded by ZSWAP_MAX_BATCH_SIZE.
> > > */
> > > struct zswap_pool {
> > > struct zs_pool *zs_pool;
> > > @@ -158,6 +164,7 @@ struct zswap_pool {
> > > struct work_struct release_work;
> > > struct hlist_node node;
> > > char tfm_name[CRYPTO_MAX_ALG_NAME];
> > > + u8 compr_batch_size;
> > > };
> > >
> > > /* Global LRU lists shared by all zswap pools. */
> > > @@ -182,6 +189,7 @@ static struct shrinker *zswap_shrinker;
> > > * writeback logic. The entry is only reclaimed by the writeback
> > > * logic if referenced is unset. See comments in the shrinker
> > > * section for context.
> > > + * nid - NUMA node id of the page for which this is the zswap entry.
> > > * pool - the zswap_pool the entry's data is in
> > > * handle - zsmalloc allocation handle that stores the compressed page data
> > > * objcg - the obj_cgroup that the compressed memory is charged to
> > > @@ -189,8 +197,11 @@ static struct shrinker *zswap_shrinker;
> > > */
> > > struct zswap_entry {
> > > swp_entry_t swpentry;
> > > - unsigned int length;
> > > - bool referenced;
> > > + struct {
> > > + unsigned int length:31;
> > > + bool referenced:1;
> > > + };
> >
> > Maybe make these macro-defined constants?
> >
> > Code mostly LGTM otherwise.
>
> Thanks, Nhat! With respect to the suggestion to make the bit-fields
> as macro-defined constants, I was browsing through kernel headers
> that use bit-fields, and it appears the convention is to use integers
> rather than constants.
Yeah I think that's the common case, let's keep the numbers as-is.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH v13 21/22] mm: zswap: zswap_store() will process a large folio in batches.
2025-11-04 9:12 ` [PATCH v13 21/22] mm: zswap: zswap_store() will process a large folio in batches Kanchana P Sridhar
2025-11-06 17:45 ` Nhat Pham
@ 2025-11-13 20:51 ` Yosry Ahmed
1 sibling, 0 replies; 47+ messages in thread
From: Yosry Ahmed @ 2025-11-13 20:51 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
senozhatsky, sj, kasong, linux-crypto, herbert, davem, clabbe,
ardb, ebiggers, surenb, kristen.c.accardi, vinicius.gomes,
wajdi.k.feghali, vinodh.gopal
On Tue, Nov 04, 2025 at 01:12:34AM -0800, Kanchana P Sridhar wrote:
Subject:
"mm: zswap: Store large folios in batches"
> This patch makes two major changes:
>
> First, we allocate pool batching resources if the compressor supports
> batching:
>
> This patch sets up zswap for allocating per-CPU resources optimally
> for non-batching and batching compressors.
>
> A new ZSWAP_MAX_BATCH_SIZE constant is defined as 8U, to set an upper
> limit on the number of pages in large folios that will be batch
> compressed.
>
> It is up to the compressor to manage multiple requests, as needed, to
> accomplish batch parallelism. zswap only needs to allocate the per-CPU
> dst buffers according to the batch size supported by the compressor.
>
> A "u8 compr_batch_size" member is added to "struct zswap_pool", as per
> Yosry's suggestion. pool->compr_batch_size is set as the minimum of
> the compressor's max batch-size and ZSWAP_MAX_BATCH_SIZE. Accordingly,
> pool->compr_batch_size compression dst buffers are allocated in the
> per-CPU acomp_ctx.
>
> zswap does not use more than one dst buffer yet. Follow-up patches
> will actually utilize the multiple acomp_ctx buffers for batch
> compression/decompression of multiple pages.
>
> Thus, ZSWAP_MAX_BATCH_SIZE limits the amount of extra memory used for
> batching. There is a small extra memory overhead of allocating
> the acomp_ctx->buffers array for compressors that do not support
> batching: On x86_64, the overhead is 1 pointer per-CPU (i.e. 8 bytes).
Support batching when storing large folios in zswap. If the underlying
compressor supports batching (e.g. HW parallel compression), allocate
multiple compression buffers, otherwise allocate one. The number of
buffers is bounded by a new constant, ZSWAP_MAX_BATCH_SIZE, to limit the
memory overhead. For existing software compressors, the only extra
overhead is the extra 'buffers' pointer, so 8 bytes per-CPU on x86_64.
Only the first buffer is currently used, but subsequent changes will use
the remaining buffers for HW compression batching.
>
> Next, we store the folio in batches:
>
> This patch modifies zswap_store() to store a batch of pages in large
> folios at a time, instead of storing one page at a time. It does this by
> calling a new procedure zswap_store_pages() with a range of indices in
> the folio: for batching compressors, this range contains up to
> pool->compr_batch_size pages. For non-batching compressors, we send up
> to ZSWAP_MAX_BATCH_SIZE pages to be sequentially compressed and stored
> in zswap_store_pages().
>
> zswap_store_pages() implements all the computes done earlier in
> zswap_store_page() for a single-page, for multiple pages in a folio,
> namely the "batch":
>
> 1) It starts by allocating all zswap entries required to store the
> batch. New procedures, zswap_entries_cache_alloc_batch() and
> zswap_entries_cache_free_batch() call kmem_cache_[free]alloc_bulk()
> to optimize the performance of this step.
>
> 2) The entry doesn't have to be allocated on the same node as the page
> being stored in zswap: we let the slab allocator decide this in
> kmem_cache_alloc_bulk(). However, to make sure the current zswap
> LRU list/shrinker behavior is preserved, we store the folio's nid as
> a new @nid member in the entry to enable adding it to the correct
> LRU list (and deleting it from the right LRU list). This ensures
> that when the folio's allocating NUMA node is under memory
> pressure, the entries corresponding to its pages are written back.
>
> The memory footprint of struct zswap_entry remains unchanged at
> 56 bytes with the addition of the "int nid" member by condensing
> "length" and "referenced" into 4 bytes using bit fields and using
> the 4 bytes available after "referenced" for the "int nid". Thanks
> to Nhat and Yosry for these suggestions!
>
> 3) Next, the entries' fields are written, computations that need to happen
> anyway, without modifying the zswap xarray/LRU publishing order. This
> avoids bringing the entries into the cache for writing in different
> code blocks within this procedure, hence improves latency.
>
> 4) Next, it calls zswap_compress() to sequentially compress each page in
> the batch.
>
> 5) Finally, it adds the batch's zswap entries to the xarray and LRU,
> charges zswap memory and increments zswap stats.
>
> 6) The error handling and cleanup required for all failure scenarios
> that can occur while storing a batch in zswap are consolidated to a
> single "store_pages_failed" label in zswap_store_pages(). Here again,
> we optimize performance by calling kmem_cache_free_bulk().
Regardless of compression batching, always process large folios in
batches. For HW compressors, the batch size is the compressor batch
size, otherwise ZSWAP_MAX_BATCH_SIZE is used.
zswap_store_page() is replaced with zswap_store_pages(), which processes
a batch of pages and allows for batching optimizations. For now, only
optimize allocating entries by using batch allocations from the slab
cache.
Since batch allocations do not support specifying a node id, store the
node id in the zswap entry instead of relying on the zswap_entry being
allocated on the same node. The size of the zswap_entry remains
unchanged as 'referenced' is lumped in with the length (as it doesn't
need a full unsigned int anyway).
Avoid repeatedly calling mem_cgroup_zswap_writeback_enabled() for every
page and only call it once for the folio, since the entire folio is
charged to a single memcg.
>
> This commit also makes a minor optimization in zswap_compress(), that
> takes a "bool wb_enabled" argument; computed once in zswap_store()
> rather than for each page in the folio.
>
> Suggested-by: Nhat Pham <nphamcs@gmail.com>
> Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
> mm/zswap.c | 336 ++++++++++++++++++++++++++++++++++++-----------------
> 1 file changed, 232 insertions(+), 104 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index cb384eb7c815..257567edc587 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -82,6 +82,9 @@ static bool zswap_pool_reached_full;
>
> #define ZSWAP_PARAM_UNSET ""
>
> +/* Limit the batch size to limit per-CPU memory usage for dst buffers. */
> +#define ZSWAP_MAX_BATCH_SIZE 8U
> +
> static int zswap_setup(void);
>
> /* Enable/disable zswap */
> @@ -139,7 +142,7 @@ struct crypto_acomp_ctx {
> struct crypto_acomp *acomp;
> struct acomp_req *req;
> struct crypto_wait wait;
> - u8 *buffer;
> + u8 **buffers;
> struct mutex mutex;
> bool is_sleepable;
> };
> @@ -149,6 +152,9 @@ struct crypto_acomp_ctx {
> * The only case where lru_lock is not acquired while holding tree.lock is
> * when a zswap_entry is taken off the lru for writeback, in that case it
> * needs to be verified that it's still valid in the tree.
> + *
> + * @compr_batch_size: The max batch size of the compression algorithm,
> + * bounded by ZSWAP_MAX_BATCH_SIZE.
> */
> struct zswap_pool {
> struct zs_pool *zs_pool;
> @@ -158,6 +164,7 @@ struct zswap_pool {
> struct work_struct release_work;
> struct hlist_node node;
> char tfm_name[CRYPTO_MAX_ALG_NAME];
> + u8 compr_batch_size;
> };
>
> /* Global LRU lists shared by all zswap pools. */
> @@ -182,6 +189,7 @@ static struct shrinker *zswap_shrinker;
> * writeback logic. The entry is only reclaimed by the writeback
> * logic if referenced is unset. See comments in the shrinker
> * section for context.
> + * nid - NUMA node id of the page for which this is the zswap entry.
> * pool - the zswap_pool the entry's data is in
> * handle - zsmalloc allocation handle that stores the compressed page data
> * objcg - the obj_cgroup that the compressed memory is charged to
> @@ -189,8 +197,11 @@ static struct shrinker *zswap_shrinker;
> */
> struct zswap_entry {
> swp_entry_t swpentry;
> - unsigned int length;
> - bool referenced;
> + struct {
> + unsigned int length:31;
> + bool referenced:1;
> + };
> + int nid;
> struct zswap_pool *pool;
> unsigned long handle;
> struct obj_cgroup *objcg;
> @@ -242,8 +253,10 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
> **********************************/
> static void __zswap_pool_empty(struct percpu_ref *ref);
>
> -static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
> +static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx, u8 nr_buffers)
> {
> + u8 i;
> +
> if (IS_ERR_OR_NULL(acomp_ctx))
> return;
>
> @@ -253,7 +266,11 @@ static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx)
> if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> crypto_free_acomp(acomp_ctx->acomp);
>
> - kfree(acomp_ctx->buffer);
> + if (acomp_ctx->buffers) {
> + for (i = 0; i < nr_buffers; ++i)
> + kfree(acomp_ctx->buffers[i]);
> + kfree(acomp_ctx->buffers);
> + }
> }
>
> static struct zswap_pool *zswap_pool_create(char *compressor)
> @@ -265,6 +282,7 @@ static struct zswap_pool *zswap_pool_create(char *compressor)
> if (!zswap_has_pool && !strcmp(compressor, ZSWAP_PARAM_UNSET))
> return NULL;
>
> + /* Many things rely on the zero-initialization. */
> pool = kzalloc(sizeof(*pool), GFP_KERNEL);
> if (!pool)
> return NULL;
> @@ -315,7 +333,9 @@ static struct zswap_pool *zswap_pool_create(char *compressor)
> cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
>
> for_each_possible_cpu(cpu)
> - acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
> + acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu),
> + pool->compr_batch_size);
> +
> error:
> if (pool->acomp_ctx)
> free_percpu(pool->acomp_ctx);
> @@ -353,7 +373,8 @@ static void zswap_pool_destroy(struct zswap_pool *pool)
> cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
>
> for_each_possible_cpu(cpu)
> - acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu));
> + acomp_ctx_dealloc(per_cpu_ptr(pool->acomp_ctx, cpu),
> + pool->compr_batch_size);
>
> free_percpu(pool->acomp_ctx);
>
> @@ -644,14 +665,8 @@ static inline struct mem_cgroup *mem_cgroup_from_entry(struct zswap_entry *entry
> }
> #endif
>
> -static inline int entry_to_nid(struct zswap_entry *entry)
> -{
> - return page_to_nid(virt_to_page(entry));
> -}
> -
> static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
> {
> - int nid = entry_to_nid(entry);
> struct mem_cgroup *memcg;
>
> /*
> @@ -668,19 +683,18 @@ static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
> rcu_read_lock();
> memcg = mem_cgroup_from_entry(entry);
> /* will always succeed */
> - list_lru_add(list_lru, &entry->lru, nid, memcg);
> + list_lru_add(list_lru, &entry->lru, entry->nid, memcg);
> rcu_read_unlock();
> }
>
> static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
> {
> - int nid = entry_to_nid(entry);
> struct mem_cgroup *memcg;
>
> rcu_read_lock();
> memcg = mem_cgroup_from_entry(entry);
> /* will always succeed */
> - list_lru_del(list_lru, &entry->lru, nid, memcg);
> + list_lru_del(list_lru, &entry->lru, entry->nid, memcg);
> rcu_read_unlock();
> }
>
> @@ -740,6 +754,29 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
> kmem_cache_free(zswap_entry_cache, entry);
> }
>
Instead of this:
> +/*
> + * Returns 0 if kmem_cache_alloc_bulk() failed and a positive number otherwise.
> + * The code for __kmem_cache_alloc_bulk() indicates that this positive number
> + * will be the @size requested, i.e., @nr_entries.
> + */
> +static __always_inline int zswap_entries_cache_alloc_batch(void **entries,
> + unsigned int nr_entries,
> + gfp_t gfp)
> +{
> + int nr_alloc = kmem_cache_alloc_bulk(zswap_entry_cache, gfp,
> + nr_entries, entries);
> +
Add this here:
/*
* kmem_cache_alloc_bulk() should return nr_entries on success
* and 0 on failure.
*/
> + WARN_ON(!nr_alloc || (nr_alloc != nr_entries));
WARN_ON_ONCE() is sufficient, and why do we WARN if
kmem_cache_alloc_bulk() fails? I thought that was expected in some
cases.
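A minimal sketch of that suggestion might be (assuming, per the comment
being discussed, that kmem_cache_alloc_bulk() returns either 0 or the
full @nr_entries):

	nr_alloc = kmem_cache_alloc_bulk(zswap_entry_cache, gfp,
					 nr_entries, entries);

	/*
	 * Only flag an unexpected partial allocation; a return of 0 is a
	 * normal failure that the caller handles by falling back to
	 * per-entry allocation.
	 */
	WARN_ON_ONCE(nr_alloc && nr_alloc != nr_entries);

	return nr_alloc;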
> +
> + return nr_alloc;
> +}
> +
Please document that it's okay to use this to free entries allocated
separately by zswap_entry_cache_alloc().
> +static __always_inline void zswap_entries_cache_free_batch(void **entries,
> + unsigned int nr_entries)
> +{
> + kmem_cache_free_bulk(zswap_entry_cache, nr_entries, entries);
> +}
> +
> /*
> * Carries out the common pattern of freeing an entry's zsmalloc allocation,
> * freeing the entry itself, and decrementing the number of stored pages.
> @@ -766,7 +803,9 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> {
> struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> + int nid = cpu_to_node(cpu);
> int ret = -ENOMEM;
> + u8 i;
>
> /*
> * To handle cases where the CPU goes through online-offline-online
> @@ -775,11 +814,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> return 0;
>
> - acomp_ctx->buffer = kmalloc_node(PAGE_SIZE, GFP_KERNEL, cpu_to_node(cpu));
> - if (!acomp_ctx->buffer)
> - return ret;
> -
> - acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
> + acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, nid);
> if (IS_ERR_OR_NULL(acomp_ctx->acomp)) {
> pr_err("could not alloc crypto acomp %s : %ld\n",
> pool->tfm_name, PTR_ERR(acomp_ctx->acomp));
> @@ -788,20 +823,39 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> }
> acomp_ctx->is_sleepable = acomp_is_async(acomp_ctx->acomp);
>
> + /*
> + * Allocate up to ZSWAP_MAX_BATCH_SIZE dst buffers if the
> + * compressor supports batching.
> + */
> + pool->compr_batch_size = min(ZSWAP_MAX_BATCH_SIZE,
> + crypto_acomp_batch_size(acomp_ctx->acomp));
> +
> acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
> +
> if (IS_ERR_OR_NULL(acomp_ctx->req)) {
> pr_err("could not alloc crypto acomp_request %s\n",
> pool->tfm_name);
> goto fail;
> }
>
> - crypto_init_wait(&acomp_ctx->wait);
> + acomp_ctx->buffers = kcalloc_node(pool->compr_batch_size, sizeof(u8 *),
> + GFP_KERNEL, nid);
> + if (!acomp_ctx->buffers)
> + goto fail;
> +
> + for (i = 0; i < pool->compr_batch_size; ++i) {
> + acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE, GFP_KERNEL, nid);
> + if (!acomp_ctx->buffers[i])
> + goto fail;
> + }
>
> /*
> * if the backend of acomp is async zip, crypto_req_done() will wakeup
> * crypto_wait_req(); if the backend of acomp is scomp, the callback
> * won't be called, crypto_wait_req() will return without blocking.
> */
> + crypto_init_wait(&acomp_ctx->wait);
> +
> acomp_request_set_callback(acomp_ctx->req, CRYPTO_TFM_REQ_MAY_BACKLOG,
> crypto_req_done, &acomp_ctx->wait);
>
> @@ -811,12 +865,12 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> return 0;
>
> fail:
> - acomp_ctx_dealloc(acomp_ctx);
> + acomp_ctx_dealloc(acomp_ctx, pool->compr_batch_size);
> return ret;
> }
>
> static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> - struct zswap_pool *pool)
> + struct zswap_pool *pool, bool wb_enabled)
> {
> struct crypto_acomp_ctx *acomp_ctx;
> struct scatterlist input, output;
> @@ -830,7 +884,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> mutex_lock(&acomp_ctx->mutex);
>
> - dst = acomp_ctx->buffer;
> + dst = acomp_ctx->buffers[0];
> sg_init_table(&input, 1);
> sg_set_page(&input, page, PAGE_SIZE, 0);
>
> @@ -860,8 +914,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> * to the active LRU list in the case.
> */
> if (comp_ret || !dlen || dlen >= PAGE_SIZE) {
> - if (!mem_cgroup_zswap_writeback_enabled(
> - folio_memcg(page_folio(page)))) {
> + if (!wb_enabled) {
> comp_ret = comp_ret ? comp_ret : -EINVAL;
> goto unlock;
> }
> @@ -906,7 +959,7 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>
> acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> mutex_lock(&acomp_ctx->mutex);
> - obj = zs_obj_read_begin(pool->zs_pool, entry->handle, acomp_ctx->buffer);
> + obj = zs_obj_read_begin(pool->zs_pool, entry->handle, acomp_ctx->buffers[0]);
>
> /* zswap entries of length PAGE_SIZE are not compressed. */
> if (entry->length == PAGE_SIZE) {
> @@ -916,15 +969,15 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>
> /*
> * zs_obj_read_begin() might return a kmap address of highmem when
> - * acomp_ctx->buffer is not used. However, sg_init_one() does not
> - * handle highmem addresses, so copy the object to acomp_ctx->buffer.
> + * acomp_ctx->buffers[0] is not used. However, sg_init_one() does not
> + * handle highmem addresses, so copy the object to acomp_ctx->buffers[0].
> */
> if (virt_addr_valid(obj)) {
> src = obj;
> } else {
> - WARN_ON_ONCE(obj == acomp_ctx->buffer);
> - memcpy(acomp_ctx->buffer, obj, entry->length);
> - src = acomp_ctx->buffer;
> + WARN_ON_ONCE(obj == acomp_ctx->buffers[0]);
> + memcpy(acomp_ctx->buffers[0], obj, entry->length);
> + src = acomp_ctx->buffers[0];
> }
>
> sg_init_one(&input, src, entry->length);
> @@ -1378,95 +1431,156 @@ static void shrink_worker(struct work_struct *w)
> * main API
> **********************************/
>
> -static bool zswap_store_page(struct page *page,
> - struct obj_cgroup *objcg,
> - struct zswap_pool *pool)
> +/*
> + * Store multiple pages in @folio, starting from the page at index @start up to
> + * the page at index @end-1.
> + */
> +static bool zswap_store_pages(struct folio *folio,
> + long start,
> + long end,
> + struct obj_cgroup *objcg,
> + struct zswap_pool *pool,
> + int nid,
> + bool wb_enabled)
> {
> - swp_entry_t page_swpentry = page_swap_entry(page);
> - struct zswap_entry *entry, *old;
> -
> - /* allocate entry */
> - entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
> - if (!entry) {
> - zswap_reject_kmemcache_fail++;
> - return false;
> + struct zswap_entry *entries[ZSWAP_MAX_BATCH_SIZE];
> + u8 i, store_fail_idx = 0, nr_pages = end - start;
> +
> + VM_WARN_ON_ONCE(nr_pages > ZSWAP_MAX_BATCH_SIZE);
> +
> + if (unlikely(!zswap_entries_cache_alloc_batch((void **)&entries[0],
Is this equivalent to just passing in 'entries'?
> + nr_pages, GFP_KERNEL))) {
> + for (i = 0; i < nr_pages; ++i) {
> + entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, nid);
> +
> + if (unlikely(!entries[i])) {
> + zswap_reject_kmemcache_fail++;
> + /*
> + * While handling this error, we only need to
> + * call zswap_entries_cache_free_batch() for
> + * entries[0 .. @i-1].
> + */
> + nr_pages = i;
> + goto store_pages_failed;
> + }
> + }
Maybe move the fallback loop into zswap_entries_cache_alloc_batch()?
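One possible shape for that refactor (a sketch only, assuming the helper
grows an @nid parameter so it can do the per-entry fallback itself):

static int zswap_entries_cache_alloc_batch(struct zswap_entry **entries,
					   unsigned int nr_entries,
					   int nid, gfp_t gfp)
{
	unsigned int i;

	if (kmem_cache_alloc_bulk(zswap_entry_cache, gfp, nr_entries,
				  (void **)entries))
		return nr_entries;

	/* Bulk allocation failed: fall back to per-entry allocation. */
	for (i = 0; i < nr_entries; i++) {
		entries[i] = zswap_entry_cache_alloc(gfp, nid);
		if (!entries[i]) {
			zswap_reject_kmemcache_fail++;
			kmem_cache_free_bulk(zswap_entry_cache, i,
					     (void **)entries);
			return 0;
		}
	}

	return nr_entries;
}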
> }
>
> - if (!zswap_compress(page, entry, pool))
> - goto compress_failed;
> + /*
> + * We colocate entry initialization as much as possible here to
> + * minimize potential cache misses.
s/colocate/co-locate
Please only keep the portion above and drop the rest of the comment.
> + *
> + * With kmem_cache_alloc_bulk(), the batch's entries will be created
> + * on the NUMA node of the CPU on which zswap_store() is called, which
> + * might not be the same as @nid, the NUMA node on which @folio was
> + * allocated. In order for the @folio's entries to be written back when
> + * @nid experiences memory pressure, we store @nid in @entry->nid.
> + * This ensures that the entry is added to and deleted from the LRU
> + * list of the correct node, namely @nid.
> + */
> + for (i = 0; i < nr_pages; ++i) {
> + entries[i]->handle = (unsigned long)ERR_PTR(-EINVAL);
> + entries[i]->pool = pool;
> + entries[i]->swpentry = page_swap_entry(folio_page(folio, start + i));
> + entries[i]->objcg = objcg;
> + entries[i]->referenced = true;
> + entries[i]->nid = nid;
> + INIT_LIST_HEAD(&entries[i]->lru);
> + }
>
> - old = xa_store(swap_zswap_tree(page_swpentry),
> - swp_offset(page_swpentry),
> - entry, GFP_KERNEL);
> - if (xa_is_err(old)) {
> - int err = xa_err(old);
> + for (i = 0; i < nr_pages; ++i) {
> + struct page *page = folio_page(folio, start + i);
>
> - WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> - zswap_reject_alloc_fail++;
> - goto store_failed;
> + if (!zswap_compress(page, entries[i], pool, wb_enabled))
> + goto store_pages_failed;
> }
[..]
^ permalink raw reply [flat|nested] 47+ messages in thread
* [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.
2025-11-04 9:12 [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
` (20 preceding siblings ...)
2025-11-04 9:12 ` [PATCH v13 21/22] mm: zswap: zswap_store() will process a large folio in batches Kanchana P Sridhar
@ 2025-11-04 9:12 ` Kanchana P Sridhar
2025-11-13 21:34 ` Yosry Ahmed
2025-11-13 18:14 ` [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Sridhar, Kanchana P
22 siblings, 1 reply; 47+ messages in thread
From: Kanchana P Sridhar @ 2025-11-04 9:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, herbert, davem,
clabbe, ardb, ebiggers, surenb, kristen.c.accardi,
vinicius.gomes
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch introduces a new unified implementation of zswap_compress()
for compressors that do and do not support batching. This eliminates
code duplication and facilitates code maintainability with the
introduction of compress batching.
The earlier approach of calling zswap_compress() sequentially, one page
at a time from zswap_store_pages(), is replaced with this new version of
zswap_compress(), which accepts multiple pages to compress as a batch.
If the compressor does not support batching, each page in the batch is
compressed and stored sequentially. If the compressor supports batching,
e.g. 'deflate-iaa' on the Intel IAA hardware accelerator, the batch is
compressed in parallel in hardware. If the batch is compressed without
errors, the compressed buffers are then stored in zsmalloc. In case of
compression errors, the current behavior is preserved for the batching
zswap_compress(): if the folio's memcg is writeback enabled, pages with
compression errors are stored uncompressed in zsmalloc; if not, we
return an error for the folio in zswap_store().
As per Herbert's suggestion in [1] that batching be based on SG lists
to interface with the crypto API, a "struct sg_table *sg_outputs" is
added to the per-CPU acomp_ctx. In zswap_cpu_comp_prepare(), memory is
allocated for @pool->compr_batch_size scatterlists in
@acomp_ctx->sg_outputs. The per-CPU @acomp_ctx->buffers' addresses are
statically mapped to the respective SG lists. The existing non-NUMA
sg_alloc_table() was found to give better performance than a NUMA-aware
allocation function, and hence is used in this patch.
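A rough sketch of that setup in zswap_cpu_comp_prepare(), based on the
description above (error handling and exact field layout are simplified
and assumed):

	struct scatterlist *sg;
	unsigned int i;

	acomp_ctx->sg_outputs = kmalloc_node(sizeof(*acomp_ctx->sg_outputs),
					     GFP_KERNEL, nid);
	if (!acomp_ctx->sg_outputs)
		goto fail;

	/* One output scatterlist per per-CPU dst buffer in the batch. */
	if (sg_alloc_table(acomp_ctx->sg_outputs, pool->compr_batch_size,
			   GFP_KERNEL))
		goto fail;

	/* Statically map each dst buffer to its scatterlist entry. */
	for_each_sg(acomp_ctx->sg_outputs->sgl, sg, pool->compr_batch_size, i)
		sg_set_buf(sg, acomp_ctx->buffers[i], PAGE_SIZE);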
Batching compressors should initialize the output SG lengths to
PAGE_SIZE as part of the internal compress batching setup, to avoid
having to do multiple traversals over the @acomp_ctx->sg_outputs->sgl.
This is exactly how batching is implemented in the iaa_crypto driver's
compress batching procedure, iaa_comp_acompress_batch().
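Illustratively, the driver-side expectation is along these lines (a
sketch, not the actual iaa_crypto code; @nr_reqs stands in for however
many pages are in the batch):

	struct scatterlist *sg;
	int i;

	/* Reset every output length before dispatching the batch. */
	for_each_sg(req->dst, sg, nr_reqs, i)
		sg->length = PAGE_SIZE;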
The batched zswap_compress() implementation is generalized as much as
possible for non-batching and batching compressors, so that the
subsequent incompressible page handling, zs_pool writes, and error
handling code is seamless for both, without the use of conditionals to
switch to specialized code for either.
The new batching implementation of zswap_compress() is called with a
batch of @nr_pages sent from zswap_store() to zswap_store_pages().
zswap_compress() steps through the batch in increments of the
compressor's batch-size, sets up the acomp_ctx->req's src/dst SG lists
to contain the folio pages and output buffers, before calling
crypto_acomp_compress().
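A condensed sketch of this control flow (not the patch's exact code: the
function signature, the slen/dlen values passed to
acomp_request_set_params(), and the error/incompressible-page handling
are simplified or assumed; the per-sg error convention is the one
described in the list below):

	struct crypto_acomp_ctx *acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
	struct scatterlist inputs[ZSWAP_MAX_BATCH_SIZE];
	struct scatterlist *sg;
	unsigned int i, j, n;
	int err;

	/* Lock once per batch instead of once per page. */
	mutex_lock(&acomp_ctx->mutex);

	for (i = 0; i < nr_pages; i += pool->compr_batch_size) {
		n = min_t(unsigned int, nr_pages - i, pool->compr_batch_size);

		/* Source SG list: the next @n pages of the folio. */
		sg_init_table(inputs, n);
		for (j = 0; j < n; j++)
			sg_set_page(&inputs[j],
				    folio_page(folio, start + i + j),
				    PAGE_SIZE, 0);

		/*
		 * Destination: the pre-mapped per-CPU dst buffers. A
		 * batching driver resets each output sg->length to
		 * PAGE_SIZE as part of its own batch setup.
		 */
		acomp_request_set_params(acomp_ctx->req, inputs,
					 acomp_ctx->sg_outputs->sgl,
					 n * PAGE_SIZE, n * PAGE_SIZE);

		err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
				      &acomp_ctx->wait);

		for_each_sg(acomp_ctx->sg_outputs->sgl, sg, n, j) {
			int dlen = (int)sg->length;

			if (err || dlen < 0 || dlen >= PAGE_SIZE) {
				/*
				 * Incompressible page or compression error:
				 * store the page uncompressed if writeback is
				 * enabled for the memcg, else fail the folio.
				 */
			}
			/* ... write acomp_ctx->buffers[j] (dlen bytes) to zsmalloc ... */
		}
	}

	mutex_unlock(&acomp_ctx->mutex);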
Some important requirements of this batching architecture for batching
compressors:
1) The output SG lengths for each sg in the acomp_req->dst should be
initialized to PAGE_SIZE as part of other batch setup in the batch
compression function. zswap will not take care of this in the
interest of avoiding repetitive traversals of the
@acomp_ctx->sg_outputs->sgl so as to not lose the benefits of
batching.
2) In case of a compression error for any page in the batch, the
batching compressor should set the corresponding @sg->length to a
negative error number, as suggested by Herbert. Otherwise, the
@sg->length will contain the compressed output length.
3) Batching compressors should set acomp_req->dlen to
acomp_req->dst->length, i.e., the sg->length of the first SG in
acomp_req->dst.
Another important change this patch makes is to the acomp_ctx mutex
locking in zswap_compress(). Earlier, the mutex was held for each page's
compression. With the new code, [un]locking the mutex per page caused
regressions for software compressors when testing with 30 usemem
processes, and also with kernel compilation using the 'allmod' config.
The regressions were more egregious when PMD folios were stored. The
implementation in this commit locks/unlocks the mutex once per batch,
which resolves the regression.
Architectural considerations for the zswap batching framework:
==============================================================
We have designed the zswap batching framework to be
hardware-agnostic. It has no dependencies on Intel-specific features and
can be leveraged by any hardware accelerator or software-based
compressor. In other words, the framework is open and inclusive by
design.
Other ongoing work that can use batching:
=========================================
This patch-series demonstrates the performance benefits of compress
batching when used in zswap_store() of large folios. shrink_folio_list()
"reclaim batching" of any-order folios is the major next work that uses
the zswap compress batching framework: our testing of kernel_compilation
with writeback and the zswap shrinker indicates 10X fewer pages get
written back when we reclaim 32 folios as a batch, as compared to one
folio at a time: this is with deflate-iaa and with zstd. We expect to
submit a patch-series with this data and the resulting performance
improvements shortly. Reclaim batching relieves memory pressure faster
than reclaiming one folio at a time, hence alleviating the need to scan
slab memory for writeback.
Nhat has given ideas on using batching with the ongoing kcompressd work,
as well as beneficially using decompression batching & block IO batching
to improve zswap writeback efficiency.
Experiments that combine zswap compress batching, reclaim batching,
swapin_readahead() decompression batching of prefetched pages, and
writeback batching show that 0 pages are written back with deflate-iaa
and zstd. For comparison, the baselines for these compressors see
200K-800K pages written to disk (kernel compilation 'allmod' config).
To summarize, these are future clients of the batching framework:
- shrink_folio_list() reclaim batching of multiple folios:
Implemented, will submit patch-series.
- zswap writeback with decompress batching:
Implemented, will submit patch-series.
- zram:
Implemented, will submit patch-series.
- kcompressd:
Not yet implemented.
- file systems:
Not yet implemented.
- swapin_readahead() decompression batching of prefetched pages:
Implemented, will submit patch-series.
Additionally, any place where we have folios that need to be compressed
can potentially be parallelized.
Performance data:
=================
As suggested by Barry, this is the performance data gathered on Intel
Sapphire Rapids with usemem 30 processes running at 50% memory pressure
and kernel_compilation/allmod config run with 2G limit using 32
threads. To keep comparisons simple, all testing was done without the
zswap shrinker.
usemem30 with 64K folios:
=========================
zswap shrinker_enabled = N.
-----------------------------------------------------------------------
                            mm-unstable-10-24-2025   v13
-----------------------------------------------------------------------
zswap compressor               deflate-iaa    deflate-iaa  IAA Batching
                                                                vs.
                                                         IAA Sequential
-----------------------------------------------------------------------
Total throughput (KB/s)          6,118,675      9,901,216          62%
Average throughput (KB/s)          203,955        330,040          62%
elapsed time (sec)                   98.94          70.90         -28%
sys time (sec)                    2,379.29       1,686.18         -29%
-----------------------------------------------------------------------

-----------------------------------------------------------------------
                            mm-unstable-10-24-2025   v13
-----------------------------------------------------------------------
zswap compressor                      zstd           zstd      v13 zstd
                                                             improvement
-----------------------------------------------------------------------
Total throughput (KB/s)          5,983,561      6,003,851          0.3%
Average throughput (KB/s)          199,452        200,128          0.3%
elapsed time (sec)                  100.93          96.62         -4.3%
sys time (sec)                    2,532.49       2,395.83           -5%
-----------------------------------------------------------------------
usemem30 with 2M folios:
========================
-----------------------------------------------------------------------
                            mm-unstable-10-24-2025   v13
-----------------------------------------------------------------------
zswap compressor               deflate-iaa    deflate-iaa  IAA Batching
                                                                vs.
                                                         IAA Sequential
-----------------------------------------------------------------------
Total throughput (KB/s)          6,309,635     10,558,225          67%
Average throughput (KB/s)          210,321        351,940          67%
elapsed time (sec)                   88.70          67.84         -24%
sys time (sec)                    2,059.83       1,581.07         -23%
-----------------------------------------------------------------------

-----------------------------------------------------------------------
                            mm-unstable-10-24-2025   v13
-----------------------------------------------------------------------
zswap compressor                      zstd           zstd      v13 zstd
                                                             improvement
-----------------------------------------------------------------------
Total throughput (KB/s)          6,562,687      6,567,946          0.1%
Average throughput (KB/s)          218,756        218,931          0.1%
elapsed time (sec)                   94.69          88.79           -6%
sys time (sec)                    2,253.97       2,083.43           -8%
-----------------------------------------------------------------------
The main takeaway from usemem, a workload that is mostly compression
dominated (very few swapins), is that the higher the number of batches,
such as with larger folios, the greater the benefit of batching cost
amortization, as shown by the PMD usemem data. This aligns well
with the future direction for batching.
kernel_compilation/allmodconfig, 64K folios:
============================================
--------------------------------------------------------------------------
                              mm-unstable-10-24-2025   v13
--------------------------------------------------------------------------
zswap compressor                 deflate-iaa    deflate-iaa   IAA Batching
                                                                   vs.
                                                            IAA Sequential
--------------------------------------------------------------------------
real_sec                              836.64         806.94          -3.5%
sys_sec                             3,897.57       3,661.83            -6%
--------------------------------------------------------------------------

--------------------------------------------------------------------------
                              mm-unstable-10-24-2025   v13
--------------------------------------------------------------------------
zswap compressor                        zstd           zstd    Improvement
--------------------------------------------------------------------------
real_sec                              880.62         850.41          -3.4%
sys_sec                             5,171.90       5,076.51          -1.8%
--------------------------------------------------------------------------
kernel_compilation/allmodconfig, PMD folios:
============================================
--------------------------------------------------------------------------
                              mm-unstable-10-24-2025   v13
--------------------------------------------------------------------------
zswap compressor                 deflate-iaa    deflate-iaa   IAA Batching
                                                                   vs.
                                                            IAA Sequential
--------------------------------------------------------------------------
real_sec                              818.48         779.67          -4.7%
sys_sec                             4,226.52       4,245.18           0.4%
--------------------------------------------------------------------------

--------------------------------------------------------------------------
                              mm-unstable-10-24-2025   v13
--------------------------------------------------------------------------
zswap compressor                        zstd           zstd    Improvement
--------------------------------------------------------------------------
real_sec                              888.45         849.54          -4.4%
sys_sec                             5,866.72       5,847.17          -0.3%
--------------------------------------------------------------------------
[1]: https://lore.kernel.org/all/aJ7Fk6RpNc815Ivd@gondor.apana.org.au/T/#m99aea2ce3d284e6c5a3253061d97b08c4752a798
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
mm/zswap.c | 249 ++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 181 insertions(+), 68 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index 257567edc587..c5487dd69ec6 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -143,6 +143,7 @@ struct crypto_acomp_ctx {
struct acomp_req *req;
struct crypto_wait wait;
u8 **buffers;
+ struct sg_table *sg_outputs;
struct mutex mutex;
bool is_sleepable;
};
@@ -271,6 +272,11 @@ static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx, u8 nr_buffers)
kfree(acomp_ctx->buffers[i]);
kfree(acomp_ctx->buffers);
}
+
+ if (acomp_ctx->sg_outputs) {
+ sg_free_table(acomp_ctx->sg_outputs);
+ kfree(acomp_ctx->sg_outputs);
+ }
}
static struct zswap_pool *zswap_pool_create(char *compressor)
@@ -804,6 +810,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
int nid = cpu_to_node(cpu);
+ struct scatterlist *sg;
int ret = -ENOMEM;
u8 i;
@@ -849,6 +856,22 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
goto fail;
}
+ acomp_ctx->sg_outputs = kmalloc(sizeof(*acomp_ctx->sg_outputs),
+ GFP_KERNEL);
+ if (!acomp_ctx->sg_outputs)
+ goto fail;
+
+ if (sg_alloc_table(acomp_ctx->sg_outputs, pool->compr_batch_size,
+ GFP_KERNEL))
+ goto fail;
+
+ /*
+ * Statically map the per-CPU destination buffers to the per-CPU
+ * SG lists.
+ */
+ for_each_sg(acomp_ctx->sg_outputs->sgl, sg, pool->compr_batch_size, i)
+ sg_set_buf(sg, acomp_ctx->buffers[i], PAGE_SIZE);
+
/*
* if the backend of acomp is async zip, crypto_req_done() will wakeup
* crypto_wait_req(); if the backend of acomp is scomp, the callback
@@ -869,84 +892,177 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
return ret;
}
-static bool zswap_compress(struct page *page, struct zswap_entry *entry,
- struct zswap_pool *pool, bool wb_enabled)
+/*
+ * Unified code path for compressors that do and do not support batching. This
+ * procedure will compress multiple @nr_pages in @folio starting from the
+ * @start index.
+ *
+ * It is assumed that @nr_pages <= ZSWAP_MAX_BATCH_SIZE. zswap_store() makes
+ * sure of this by design and zswap_store_pages() warns if this is not
+ * true.
+ *
+ * @nr_pages can be in (1, ZSWAP_MAX_BATCH_SIZE] even if the compressor does not
+ * support batching.
+ *
+ * If @pool->compr_batch_size is 1, each page is processed sequentially.
+ *
+ * If @pool->compr_batch_size is > 1, compression batching is invoked within
+ * the algorithm's driver, except if @nr_pages is 1: if so, the driver can
+ * choose to call the sequential/non-batching compress API.
+ *
+ * In both cases, if all compressions are successful, the compressed buffers
+ * are stored in zsmalloc.
+ *
+ * Traversing multiple SG lists when @nr_comps is > 1 is expensive, and impacts
+ * batching performance if we were to repeat this operation multiple times,
+ * such as:
+ * - to map destination buffers to each SG list in the @acomp_ctx->sg_outputs
+ * sg_table.
+ * - to initialize each output SG list's @sg->length to PAGE_SIZE.
+ * - to get the compressed output length in each @sg->length.
+ *
+ * These are some design choices made to optimize batching with SG lists:
+ *
+ * 1) The source folio pages in the batch are directly submitted to
+ * crypto_acomp via acomp_request_set_src_folio().
+ *
+ * 2) The per-CPU @acomp_ctx->sg_outputs scatterlists are used to set up
+ * destination buffers for interfacing with crypto_acomp.
+ *
+ * 3) To optimize performance, we map the per-CPU @acomp_ctx->buffers to the
+ * @acomp_ctx->sg_outputs->sgl SG lists at pool creation time. The only task
+ * remaining to be done for the output SG lists in zswap_compress() is to
+ * set each @sg->length to PAGE_SIZE. This is done in zswap_compress()
+ * for non-batching compressors. This needs to be done within the compress
+ * batching driver procedure as part of iterating through the SG lists for
+ * batch setup, so as to minimize expensive traversals through the SG lists.
+ *
+ * 4) Important requirements for batching compressors:
+ * - Each @sg->length in @acomp_ctx->req->sg_outputs->sgl should reflect the
+ * compression outcome for that specific page, and be set to:
+ * - the page's compressed length, or
+ * - the compression error value for that page.
+ * - The @acomp_ctx->req->dlen should be set to the first page's
+ * @sg->length. This enables code generalization in zswap_compress()
+ * for non-batching and batching compressors.
+ *
+ * acomp_ctx mutex locking:
+ * Earlier, the mutex was held per page compression. With the new code,
+ * [un]locking the mutex per page caused regressions for software
+ * compressors. We now lock the mutex once per batch, which resolves the
+ * regression.
+ */
+static bool zswap_compress(struct folio *folio, long start, unsigned int nr_pages,
+ struct zswap_entry *entries[], struct zswap_pool *pool,
+ int nid, bool wb_enabled)
{
+ gfp_t gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
+ unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
+ unsigned int slen = nr_comps * PAGE_SIZE;
struct crypto_acomp_ctx *acomp_ctx;
- struct scatterlist input, output;
- int comp_ret = 0, alloc_ret = 0;
- unsigned int dlen = PAGE_SIZE;
+ int err = 0, err_sg = 0;
+ struct scatterlist *sg;
+ unsigned int i, j, k;
unsigned long handle;
- gfp_t gfp;
- u8 *dst;
- bool mapped = false;
+ int *errp, dlen;
+ void *dst;
acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
mutex_lock(&acomp_ctx->mutex);
- dst = acomp_ctx->buffers[0];
- sg_init_table(&input, 1);
- sg_set_page(&input, page, PAGE_SIZE, 0);
-
- sg_init_one(&output, dst, PAGE_SIZE);
- acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
+ errp = (pool->compr_batch_size == 1) ? &err : &err_sg;
/*
- * it maybe looks a little bit silly that we send an asynchronous request,
- * then wait for its completion synchronously. This makes the process look
- * synchronous in fact.
- * Theoretically, acomp supports users send multiple acomp requests in one
- * acomp instance, then get those requests done simultaneously. but in this
- * case, zswap actually does store and load page by page, there is no
- * existing method to send the second page before the first page is done
- * in one thread doing zswap.
- * but in different threads running on different cpu, we have different
- * acomp instance, so multiple threads can do (de)compression in parallel.
+ * [i] refers to the incoming batch space and is used to
+ * index into the folio pages.
+ *
+ * [j] refers to the incoming batch space and is used to
+ * index into the @entries for the folio's pages in this
+ * batch, per compress call while iterating over the output SG
+ * lists. Also used to index into the folio's pages from @start,
+ * in case of compress errors.
+ *
+ * [k] refers to the @acomp_ctx space, as determined by
+ * @pool->compr_batch_size, and is used to index into
+ * @acomp_ctx->sg_outputs->sgl and @acomp_ctx->buffers.
*/
- comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
- dlen = acomp_ctx->req->dlen;
+ for (i = 0; i < nr_pages; i += nr_comps) {
+ acomp_request_set_src_folio(acomp_ctx->req, folio,
+ (start + i) * PAGE_SIZE,
+ slen);
- /*
- * If a page cannot be compressed into a size smaller than PAGE_SIZE,
- * save the content as is without a compression, to keep the LRU order
- * of writebacks. If writeback is disabled, reject the page since it
- * only adds metadata overhead. swap_writeout() will put the page back
- * to the active LRU list in the case.
- */
- if (comp_ret || !dlen || dlen >= PAGE_SIZE) {
- if (!wb_enabled) {
- comp_ret = comp_ret ? comp_ret : -EINVAL;
- goto unlock;
- }
- comp_ret = 0;
- dlen = PAGE_SIZE;
- dst = kmap_local_page(page);
- mapped = true;
- }
+ acomp_ctx->sg_outputs->sgl->length = slen;
- gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
- handle = zs_malloc(pool->zs_pool, dlen, gfp, page_to_nid(page));
- if (IS_ERR_VALUE(handle)) {
- alloc_ret = PTR_ERR((void *)handle);
- goto unlock;
- }
+ acomp_request_set_dst_sg(acomp_ctx->req,
+ acomp_ctx->sg_outputs->sgl,
+ slen);
+
+ err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
+ &acomp_ctx->wait);
+
+ acomp_ctx->sg_outputs->sgl->length = acomp_ctx->req->dlen;
+
+ /*
+ * If a page cannot be compressed into a size smaller than
+ * PAGE_SIZE, save the content as is without a compression, to
+ * keep the LRU order of writebacks. If writeback is disabled,
+ * reject the page since it only adds metadata overhead.
+ * swap_writeout() will put the page back to the active LRU list
+ * in the case.
+ *
+ * It is assumed that any compressor that sets the output length
+ * to 0 or a value >= PAGE_SIZE will also return a negative
+ * error status in @err; i.e, will not return a successful
+ * compression status in @err in this case.
+ */
+ if (err && !wb_enabled)
+ goto compress_error;
+
+ for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
+ j = k + i;
+ dst = acomp_ctx->buffers[k];
+ dlen = sg->length | *errp;
+
+ if (dlen < 0) {
+ dlen = PAGE_SIZE;
+ dst = kmap_local_page(folio_page(folio, start + j));
+ }
+
+ handle = zs_malloc(pool->zs_pool, dlen, gfp, nid);
- zs_obj_write(pool->zs_pool, handle, dst, dlen);
- entry->handle = handle;
- entry->length = dlen;
+ if (IS_ERR_VALUE(handle)) {
+ if (PTR_ERR((void *)handle) == -ENOSPC)
+ zswap_reject_compress_poor++;
+ else
+ zswap_reject_alloc_fail++;
-unlock:
- if (mapped)
- kunmap_local(dst);
- if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
- zswap_reject_compress_poor++;
- else if (comp_ret)
- zswap_reject_compress_fail++;
- else if (alloc_ret)
- zswap_reject_alloc_fail++;
+ goto err_unlock;
+ }
+
+ zs_obj_write(pool->zs_pool, handle, dst, dlen);
+ entries[j]->handle = handle;
+ entries[j]->length = dlen;
+ if (dst != acomp_ctx->buffers[k])
+ kunmap_local(dst);
+ }
+ } /* finished compress and store nr_pages. */
+
+ mutex_unlock(&acomp_ctx->mutex);
+ return true;
+
+compress_error:
+ for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
+ if ((int)sg->length < 0) {
+ if ((int)sg->length == -ENOSPC)
+ zswap_reject_compress_poor++;
+ else
+ zswap_reject_compress_fail++;
+ }
+ }
+err_unlock:
mutex_unlock(&acomp_ctx->mutex);
- return comp_ret == 0 && alloc_ret == 0;
+ return false;
}
static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
@@ -1488,12 +1604,9 @@ static bool zswap_store_pages(struct folio *folio,
INIT_LIST_HEAD(&entries[i]->lru);
}
- for (i = 0; i < nr_pages; ++i) {
- struct page *page = folio_page(folio, start + i);
-
- if (!zswap_compress(page, entries[i], pool, wb_enabled))
- goto store_pages_failed;
- }
+ if (unlikely(!zswap_compress(folio, start, nr_pages, entries, pool,
+ nid, wb_enabled)))
+ goto store_pages_failed;
for (i = 0; i < nr_pages; ++i) {
struct zswap_entry *old, *entry = entries[i];
--
2.27.0
^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.
2025-11-04 9:12 ` [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios Kanchana P Sridhar
@ 2025-11-13 21:34 ` Yosry Ahmed
2025-11-13 23:55 ` Sridhar, Kanchana P
0 siblings, 1 reply; 47+ messages in thread
From: Yosry Ahmed @ 2025-11-13 21:34 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
senozhatsky, sj, kasong, linux-crypto, herbert, davem, clabbe,
ardb, ebiggers, surenb, kristen.c.accardi, vinicius.gomes,
wajdi.k.feghali, vinodh.gopal
On Tue, Nov 04, 2025 at 01:12:35AM -0800, Kanchana P Sridhar wrote:
> This patch introduces a new unified implementation of zswap_compress()
> for compressors that do and do not support batching. This eliminates
> code duplication and facilitates code maintainability with the
> introduction of compress batching.
>
> The vectorized implementation of calling the earlier zswap_compress()
> sequentially, one page at a time in zswap_store_pages(), is replaced
> with this new version of zswap_compress() that accepts multiple pages to
> compress as a batch.
>
> If the compressor does not support batching, each page in the batch is
> compressed and stored sequentially. If the compressor supports batching,
> for e.g., 'deflate-iaa', the Intel IAA hardware accelerator, the batch
> is compressed in parallel in hardware. If the batch is compressed
> without errors, the compressed buffers are then stored in zsmalloc. In
> case of compression errors, the current behavior is preserved for the
> batching zswap_compress(): if the folio's memcg is writeback enabled,
> pages with compression errors are stored uncompressed in zsmalloc; if
> not, we return an error for the folio in zswap_store().
>
> As per Herbert's suggestion in [1] for batching to be based on SG lists
> to interface with the crypto API, a "struct sg_table *sg_outputs" is
> added to the per-CPU acomp_ctx. In zswap_cpu_comp_prepare(), memory is
> allocated for @pool->compr_batch_size scatterlists in
> @acomp_ctx->sg_outputs. The per-CPU @acomp_ctx->buffers' addresses are
> statically mapped to the respective SG lists. The existing non-NUMA
> sg_alloc_table() was found to give better performance than a NUMA-aware
> allocation function, hence is used in this patch.
>
> Batching compressors should initialize the output SG lengths to
> PAGE_SIZE as part of the internal compress batching setup, to avoid
> having to do multiple traversals over the @acomp_ctx->sg_outputs->sgl.
> This is exactly how batching is implemented in the iaa_crypto driver's
> compress batching procedure, iaa_comp_acompress_batch().
>
> The batched zswap_compress() implementation is generalized as much as
> possible for non-batching and batching compressors, so that the
> subsequent incompressible page handling, zs_pool writes, and error
> handling code is seamless for both, without the use of conditionals to
> switch to specialized code for either.
>
> The new batching implementation of zswap_compress() is called with a
> batch of @nr_pages sent from zswap_store() to zswap_store_pages().
> zswap_compress() steps through the batch in increments of the
> compressor's batch-size, sets up the acomp_ctx->req's src/dst SG lists
> to contain the folio pages and output buffers, before calling
> crypto_acomp_compress().
>
> Some important requirements of this batching architecture for batching
> compressors:
>
> 1) The output SG lengths for each sg in the acomp_req->dst should be
> initialized to PAGE_SIZE as part of other batch setup in the batch
> compression function. zswap will not take care of this in the
> interest of avoiding repetitive traversals of the
> @acomp_ctx->sg_outputs->sgl so as to not lose the benefits of
> batching.
>
> 2) In case of a compression error for any page in the batch, the
> batching compressor should set the corresponding @sg->length to a
> negative error number, as suggested by Herbert. Otherwise, the
> @sg->length will contain the compressed output length.
>
> 3) Batching compressors should set acomp_req->dlen to
> acomp_req->dst->length, i.e., the sg->length of the first SG in
> acomp_req->dst.
>
> Another important change this patch makes is with the acomp_ctx mutex
> locking in zswap_compress(). Earlier, the mutex was held per page's
> compression. With the new code, [un]locking the mutex per page caused
> regressions for software compressors when testing with 30 usemem
> processes, and also kernel compilation with 'allmod' config. The
> regressions were more egregious when PMD folios were stored. The
> implementation in this commit locks/unlocks the mutex once per batch,
> that resolves the regression.
>
> Architectural considerations for the zswap batching framework:
> ==============================================================
> We have designed the zswap batching framework to be
> hardware-agnostic. It has no dependencies on Intel-specific features and
> can be leveraged by any hardware accelerator or software-based
> compressor. In other words, the framework is open and inclusive by
> design.
>
> Other ongoing work that can use batching:
> =========================================
> This patch-series demonstrates the performance benefits of compress
> batching when used in zswap_store() of large folios. shrink_folio_list()
> "reclaim batching" of any-order folios is the major next work that uses
> the zswap compress batching framework: our testing of kernel_compilation
> with writeback and the zswap shrinker indicates 10X fewer pages get
> written back when we reclaim 32 folios as a batch, as compared to one
> folio at a time: this is with deflate-iaa and with zstd. We expect to
> submit a patch-series with this data and the resulting performance
> improvements shortly. Reclaim batching relieves memory pressure faster
> than reclaiming one folio at a time, hence alleviates the need to scan
> slab memory for writeback.
>
> Nhat has given ideas on using batching with the ongoing kcompressd work,
> as well as beneficially using decompression batching & block IO batching
> to improve zswap writeback efficiency.
>
> Experiments that combine zswap compress batching, reclaim batching,
> swapin_readahead() decompression batching of prefetched pages, and
> writeback batching show that 0 pages are written back with deflate-iaa
> and zstd. For comparison, the baselines for these compressors see
> 200K-800K pages written to disk (kernel compilation 'allmod' config).
>
> To summarize, these are future clients of the batching framework:
>
> - shrink_folio_list() reclaim batching of multiple folios:
> Implemented, will submit patch-series.
> - zswap writeback with decompress batching:
> Implemented, will submit patch-series.
> - zram:
> Implemented, will submit patch-series.
> - kcompressd:
> Not yet implemented.
> - file systems:
> Not yet implemented.
> - swapin_readahead() decompression batching of prefetched pages:
> Implemented, will submit patch-series.
>
> Additionally, any place we have folios that need to be compressed, can
> potentially be parallelized.
>
> Performance data:
> =================
>
> As suggested by Barry, this is the performance data gathered on Intel
> Sapphire Rapids with usemem 30 processes running at 50% memory pressure
> and kernel_compilation/allmod config run with 2G limit using 32
> threads. To keep comparisons simple, all testing was done without the
> zswap shrinker.
>
> usemem30 with 64K folios:
> =========================
>
> zswap shrinker_enabled = N.
>
> -----------------------------------------------------------------------
> mm-unstable-10-24-2025 v13
> -----------------------------------------------------------------------
> zswap compressor deflate-iaa deflate-iaa IAA Batching
> vs.
> IAA Sequential
> -----------------------------------------------------------------------
> Total throughput (KB/s) 6,118,675 9,901,216 62%
> Average throughput (KB/s) 203,955 330,040 62%
> elapsed time (sec) 98.94 70.90 -28%
> sys time (sec) 2,379.29 1,686.18 -29%
> -----------------------------------------------------------------------
>
> -----------------------------------------------------------------------
> mm-unstable-10-24-2025 v13
> -----------------------------------------------------------------------
> zswap compressor zstd zstd v13 zstd
> improvement
> -----------------------------------------------------------------------
> Total throughput (KB/s) 5,983,561 6,003,851 0.3%
> Average throughput (KB/s) 199,452 200,128 0.3%
> elapsed time (sec) 100.93 96.62 -4.3%
> sys time (sec) 2,532.49 2,395.83 -5%
> -----------------------------------------------------------------------
>
> usemem30 with 2M folios:
> ========================
>
> -----------------------------------------------------------------------
> mm-unstable-10-24-2025 v13
> -----------------------------------------------------------------------
> zswap compressor deflate-iaa deflate-iaa IAA Batching
> vs.
> IAA Sequential
> -----------------------------------------------------------------------
> Total throughput (KB/s) 6,309,635 10,558,225 67%
> Average throughput (KB/s) 210,321 351,940 67%
> elapsed time (sec) 88.70 67.84 -24%
> sys time (sec) 2,059.83 1,581.07 -23%
> -----------------------------------------------------------------------
>
> -----------------------------------------------------------------------
> mm-unstable-10-24-2025 v13
> -----------------------------------------------------------------------
> zswap compressor zstd zstd v13 zstd
> improvement
> -----------------------------------------------------------------------
> Total throughput (KB/s) 6,562,687 6,567,946 0.1%
> Average throughput (KB/s) 218,756 218,931 0.1%
> elapsed time (sec) 94.69 88.79 -6%
> sys time (sec) 2,253.97 2,083.43 -8%
> -----------------------------------------------------------------------
>
> The main takeaway from usemem, a workload that is mostly compression
> dominated (very few swapins) is that the higher the number of batches,
> such as with larger folios, the more the benefit of batching cost
> amortization, as shown by the PMD usemem data. This aligns well
> with the future direction for batching.
>
> kernel_compilation/allmodconfig, 64K folios:
> ============================================
>
> --------------------------------------------------------------------------
> mm-unstable-10-24-2025 v13
> --------------------------------------------------------------------------
> zswap compressor deflate-iaa deflate-iaa IAA Batching
> vs.
> IAA Sequential
> --------------------------------------------------------------------------
> real_sec 836.64 806.94 -3.5%
> sys_sec 3,897.57 3,661.83 -6%
> --------------------------------------------------------------------------
>
> --------------------------------------------------------------------------
> mm-unstable-10-24-2025 v13
> --------------------------------------------------------------------------
> zswap compressor zstd zstd Improvement
> --------------------------------------------------------------------------
> real_sec 880.62 850.41 -3.4%
> sys_sec 5,171.90 5,076.51 -1.8%
> --------------------------------------------------------------------------
>
> kernel_compilation/allmodconfig, PMD folios:
> ============================================
>
> --------------------------------------------------------------------------
> mm-unstable-10-24-2025 v13
> --------------------------------------------------------------------------
> zswap compressor deflate-iaa deflate-iaa IAA Batching
> vs.
> IAA Sequential
> --------------------------------------------------------------------------
> real_sec 818.48 779.67 -4.7%
> sys_sec 4,226.52 4,245.18 0.4%
> --------------------------------------------------------------------------
>
> --------------------------------------------------------------------------
> mm-unstable-10-24-2025 v13
> --------------------------------------------------------------------------
> zswap compressor zstd zstd Improvement
> --------------------------------------------------------------------------
> real_sec 888.45 849.54 -4.4%
> sys_sec 5,866.72 5,847.17 -0.3%
> --------------------------------------------------------------------------
>
> [1]: https://lore.kernel.org/all/aJ7Fk6RpNc815Ivd@gondor.apana.org.au/T/#m99aea2ce3d284e6c5a3253061d97b08c4752a798
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
I won't go through the commit log and rewrite it for this one too, but
please do so similarly to how I did for the previous patches. Do not
describe the code; give a high-level overview of what is happening and
why it's happening, as well as very concise performance results.
Do not include things that only make sense in the context of a patch and
won't make sense as part of git history.
That being said, I'd like Herbert to review this patch and make sure the
scatterlist and crypto APIs are being used correctly as he advised
earlier. I do have some comments on the zswap side though.
> ---
> mm/zswap.c | 249 ++++++++++++++++++++++++++++++++++++++---------------
> 1 file changed, 181 insertions(+), 68 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 257567edc587..c5487dd69ec6 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -143,6 +143,7 @@ struct crypto_acomp_ctx {
> struct acomp_req *req;
> struct crypto_wait wait;
> u8 **buffers;
> + struct sg_table *sg_outputs;
> struct mutex mutex;
> bool is_sleepable;
> };
> @@ -271,6 +272,11 @@ static void acomp_ctx_dealloc(struct crypto_acomp_ctx *acomp_ctx, u8 nr_buffers)
> kfree(acomp_ctx->buffers[i]);
> kfree(acomp_ctx->buffers);
> }
> +
> + if (acomp_ctx->sg_outputs) {
> + sg_free_table(acomp_ctx->sg_outputs);
> + kfree(acomp_ctx->sg_outputs);
> + }
> }
>
> static struct zswap_pool *zswap_pool_create(char *compressor)
> @@ -804,6 +810,7 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> int nid = cpu_to_node(cpu);
> + struct scatterlist *sg;
> int ret = -ENOMEM;
> u8 i;
>
> @@ -849,6 +856,22 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> goto fail;
> }
>
> + acomp_ctx->sg_outputs = kmalloc(sizeof(*acomp_ctx->sg_outputs),
> + GFP_KERNEL);
> + if (!acomp_ctx->sg_outputs)
> + goto fail;
> +
> + if (sg_alloc_table(acomp_ctx->sg_outputs, pool->compr_batch_size,
> + GFP_KERNEL))
> + goto fail;
> +
> + /*
> + * Statically map the per-CPU destination buffers to the per-CPU
> + * SG lists.
> + */
> + for_each_sg(acomp_ctx->sg_outputs->sgl, sg, pool->compr_batch_size, i)
> + sg_set_buf(sg, acomp_ctx->buffers[i], PAGE_SIZE);
> +
> /*
> * if the backend of acomp is async zip, crypto_req_done() will wakeup
> * crypto_wait_req(); if the backend of acomp is scomp, the callback
> @@ -869,84 +892,177 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> return ret;
> }
>
> -static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> - struct zswap_pool *pool, bool wb_enabled)
> +/*
> + * Unified code path for compressors that do and do not support batching. This
> + * procedure will compress multiple @nr_pages in @folio starting from the
> + * @start index.
> + *
> + * It is assumed that @nr_pages <= ZSWAP_MAX_BATCH_SIZE. zswap_store() makes
> + * sure of this by design and zswap_store_pages() warns if this is not
> + * true.
> + *
> + * @nr_pages can be in (1, ZSWAP_MAX_BATCH_SIZE] even if the compressor does not
> + * support batching.
> + *
> + * If @pool->compr_batch_size is 1, each page is processed sequentially.
> + *
> + * If @pool->compr_batch_size is > 1, compression batching is invoked within
> + * the algorithm's driver, except if @nr_pages is 1: if so, the driver can
> + * choose to call the sequential/non-batching compress API.
> + *
> + * In both cases, if all compressions are successful, the compressed buffers
> + * are stored in zsmalloc.
> + *
> + * Traversing multiple SG lists when @nr_comps is > 1 is expensive, and impacts
> + * batching performance if we were to repeat this operation multiple times,
> + * such as:
> + * - to map destination buffers to each SG list in the @acomp_ctx->sg_outputs
> + * sg_table.
> + * - to initialize each output SG list's @sg->length to PAGE_SIZE.
> + * - to get the compressed output length in each @sg->length.
> + *
> + * These are some design choices made to optimize batching with SG lists:
> + *
> + * 1) The source folio pages in the batch are directly submitted to
> + * crypto_acomp via acomp_request_set_src_folio().
> + *
> + * 2) The per-CPU @acomp_ctx->sg_outputs scatterlists are used to set up
> + * destination buffers for interfacing with crypto_acomp.
> + *
> + * 3) To optimize performance, we map the per-CPU @acomp_ctx->buffers to the
> + * @acomp_ctx->sg_outputs->sgl SG lists at pool creation time. The only task
> + * remaining to be done for the output SG lists in zswap_compress() is to
> + * set each @sg->length to PAGE_SIZE. This is done in zswap_compress()
> + * for non-batching compressors. This needs to be done within the compress
> + * batching driver procedure as part of iterating through the SG lists for
> + * batch setup, so as to minimize expensive traversals through the SG lists.
> + *
> + * 4) Important requirements for batching compressors:
> + * - Each @sg->length in @acomp_ctx->req->sg_outputs->sgl should reflect the
> + * compression outcome for that specific page, and be set to:
> + * - the page's compressed length, or
> + * - the compression error value for that page.
> + * - The @acomp_ctx->req->dlen should be set to the first page's
> + * @sg->length. This enables code generalization in zswap_compress()
> + * for non-batching and batching compressors.
> + *
> + * acomp_ctx mutex locking:
> + * Earlier, the mutex was held per page compression. With the new code,
> + * [un]locking the mutex per page caused regressions for software
> + * compressors. We now lock the mutex once per batch, which resolves the
> + * regression.
> + */
Please, no huge comments describing what the code is doing. If there's
anything that is not clear from reading the code or needs to be
explained or documented, please do so **concisely** in the relevant part
of the function.
> +static bool zswap_compress(struct folio *folio, long start, unsigned int nr_pages,
> + struct zswap_entry *entries[], struct zswap_pool *pool,
> + int nid, bool wb_enabled)
> {
> + gfp_t gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
> + unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
> + unsigned int slen = nr_comps * PAGE_SIZE;
> struct crypto_acomp_ctx *acomp_ctx;
> - struct scatterlist input, output;
> - int comp_ret = 0, alloc_ret = 0;
> - unsigned int dlen = PAGE_SIZE;
> + int err = 0, err_sg = 0;
> + struct scatterlist *sg;
> + unsigned int i, j, k;
> unsigned long handle;
> - gfp_t gfp;
> - u8 *dst;
> - bool mapped = false;
> + int *errp, dlen;
> + void *dst;
>
> acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> mutex_lock(&acomp_ctx->mutex);
>
> - dst = acomp_ctx->buffers[0];
> - sg_init_table(&input, 1);
> - sg_set_page(&input, page, PAGE_SIZE, 0);
> -
> - sg_init_one(&output, dst, PAGE_SIZE);
> - acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
> + errp = (pool->compr_batch_size == 1) ? &err : &err_sg;
err_sg is not used anywhere, so *errp could end up being garbage. Why do
we need this?
>
> /*
> - * it maybe looks a little bit silly that we send an asynchronous request,
> - * then wait for its completion synchronously. This makes the process look
> - * synchronous in fact.
> - * Theoretically, acomp supports users send multiple acomp requests in one
> - * acomp instance, then get those requests done simultaneously. but in this
> - * case, zswap actually does store and load page by page, there is no
> - * existing method to send the second page before the first page is done
> - * in one thread doing zswap.
> - * but in different threads running on different cpu, we have different
> - * acomp instance, so multiple threads can do (de)compression in parallel.
> + * [i] refers to the incoming batch space and is used to
> + * index into the folio pages.
> + *
> + * [j] refers to the incoming batch space and is used to
> + * index into the @entries for the folio's pages in this
> + * batch, per compress call while iterating over the output SG
> + * lists. Also used to index into the folio's pages from @start,
> + * in case of compress errors.
> + *
> + * [k] refers to the @acomp_ctx space, as determined by
> + * @pool->compr_batch_size, and is used to index into
> + * @acomp_ctx->sg_outputs->sgl and @acomp_ctx->buffers.
> */
> - comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> - dlen = acomp_ctx->req->dlen;
> + for (i = 0; i < nr_pages; i += nr_comps) {
What are we looping over here? I thought zswap_compress() takes in exactly
one batch.
> + acomp_request_set_src_folio(acomp_ctx->req, folio,
> + (start + i) * PAGE_SIZE,
> + slen);
>
> - /*
> - * If a page cannot be compressed into a size smaller than PAGE_SIZE,
> - * save the content as is without a compression, to keep the LRU order
> - * of writebacks. If writeback is disabled, reject the page since it
> - * only adds metadata overhead. swap_writeout() will put the page back
> - * to the active LRU list in the case.
> - */
> - if (comp_ret || !dlen || dlen >= PAGE_SIZE) {
> - if (!wb_enabled) {
> - comp_ret = comp_ret ? comp_ret : -EINVAL;
> - goto unlock;
> - }
> - comp_ret = 0;
> - dlen = PAGE_SIZE;
> - dst = kmap_local_page(page);
> - mapped = true;
> - }
> + acomp_ctx->sg_outputs->sgl->length = slen;
>
> - gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
> - handle = zs_malloc(pool->zs_pool, dlen, gfp, page_to_nid(page));
> - if (IS_ERR_VALUE(handle)) {
> - alloc_ret = PTR_ERR((void *)handle);
> - goto unlock;
> - }
> + acomp_request_set_dst_sg(acomp_ctx->req,
> + acomp_ctx->sg_outputs->sgl,
> + slen);
> +
> + err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
> + &acomp_ctx->wait);
> +
> + acomp_ctx->sg_outputs->sgl->length = acomp_ctx->req->dlen;
> +
> + /*
> + * If a page cannot be compressed into a size smaller than
> + * PAGE_SIZE, save the content as is without a compression, to
> + * keep the LRU order of writebacks. If writeback is disabled,
> + * reject the page since it only adds metadata overhead.
> + * swap_writeout() will put the page back to the active LRU list
> + * in the case.
> + *
> + * It is assumed that any compressor that sets the output length
> + * to 0 or a value >= PAGE_SIZE will also return a negative
> + * error status in @err; i.e, will not return a successful
> + * compression status in @err in this case.
> + */
Ugh, checking the compression error and checking the compression length
are now in separate places, so we need to check whether writeback is
disabled in separate places and store the page as-is. It's ugly, and I
think the current code is not correct.
> + if (err && !wb_enabled)
> + goto compress_error;
> +
> + for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> + j = k + i;
Please use meaningful iterator names rather than i, j, and k plus the
huge comment explaining what they are.
> + dst = acomp_ctx->buffers[k];
> + dlen = sg->length | *errp;
Why are we doing this?
> +
> + if (dlen < 0) {
We should do the incompressible page handling also if dlen is PAGE_SIZE,
or if the compression failed (I guess that's the intention of bit OR'ing
with *errp?)
> + dlen = PAGE_SIZE;
> + dst = kmap_local_page(folio_page(folio, start + j));
> + }
> +
> + handle = zs_malloc(pool->zs_pool, dlen, gfp, nid);
>
> - zs_obj_write(pool->zs_pool, handle, dst, dlen);
> - entry->handle = handle;
> - entry->length = dlen;
> + if (IS_ERR_VALUE(handle)) {
> + if (PTR_ERR((void *)handle) == -ENOSPC)
> + zswap_reject_compress_poor++;
> + else
> + zswap_reject_alloc_fail++;
>
> -unlock:
> - if (mapped)
> - kunmap_local(dst);
> - if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
> - zswap_reject_compress_poor++;
> - else if (comp_ret)
> - zswap_reject_compress_fail++;
> - else if (alloc_ret)
> - zswap_reject_alloc_fail++;
> + goto err_unlock;
> + }
> +
> + zs_obj_write(pool->zs_pool, handle, dst, dlen);
> + entries[j]->handle = handle;
> + entries[j]->length = dlen;
> + if (dst != acomp_ctx->buffers[k])
> + kunmap_local(dst);
> + }
> + } /* finished compress and store nr_pages. */
> +
> + mutex_unlock(&acomp_ctx->mutex);
> + return true;
> +
> +compress_error:
> + for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> + if ((int)sg->length < 0) {
> + if ((int)sg->length == -ENOSPC)
> + zswap_reject_compress_poor++;
> + else
> + zswap_reject_compress_fail++;
> + }
> + }
>
> +err_unlock:
> mutex_unlock(&acomp_ctx->mutex);
> - return comp_ret == 0 && alloc_ret == 0;
> + return false;
> }
>
> static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> @@ -1488,12 +1604,9 @@ static bool zswap_store_pages(struct folio *folio,
> INIT_LIST_HEAD(&entries[i]->lru);
> }
>
> - for (i = 0; i < nr_pages; ++i) {
> - struct page *page = folio_page(folio, start + i);
> -
> - if (!zswap_compress(page, entries[i], pool, wb_enabled))
> - goto store_pages_failed;
> - }
> + if (unlikely(!zswap_compress(folio, start, nr_pages, entries, pool,
> + nid, wb_enabled)))
> + goto store_pages_failed;
>
> for (i = 0; i < nr_pages; ++i) {
> struct zswap_entry *old, *entry = entries[i];
> --
> 2.27.0
>
^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.
2025-11-13 21:34 ` Yosry Ahmed
@ 2025-11-13 23:55 ` Sridhar, Kanchana P
2025-11-14 0:46 ` Yosry Ahmed
2025-11-14 5:52 ` Yosry Ahmed
0 siblings, 2 replies; 47+ messages in thread
From: Sridhar, Kanchana P @ 2025-11-13 23:55 UTC (permalink / raw)
To: Yosry Ahmed
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
senozhatsky, sj, kasong, linux-crypto, herbert, davem, clabbe,
ardb, ebiggers, surenb, Accardi, Kristen C, Gomes, Vinicius,
Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Thursday, November 13, 2025 1:35 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> compress batching of large folios.
>
> On Tue, Nov 04, 2025 at 01:12:35AM -0800, Kanchana P Sridhar wrote:
> > This patch introduces a new unified implementation of zswap_compress()
> > for compressors that do and do not support batching. This eliminates
> > code duplication and facilitates code maintainability with the
> > introduction of compress batching.
> >
> > The vectorized implementation of calling the earlier zswap_compress()
> > sequentially, one page at a time in zswap_store_pages(), is replaced
> > with this new version of zswap_compress() that accepts multiple pages to
> > compress as a batch.
> >
> > If the compressor does not support batching, each page in the batch is
> > compressed and stored sequentially. If the compressor supports batching,
> > for e.g., 'deflate-iaa', the Intel IAA hardware accelerator, the batch
> > is compressed in parallel in hardware. If the batch is compressed
> > without errors, the compressed buffers are then stored in zsmalloc. In
> > case of compression errors, the current behavior is preserved for the
> > batching zswap_compress(): if the folio's memcg is writeback enabled,
> > pages with compression errors are stored uncompressed in zsmalloc; if
> > not, we return an error for the folio in zswap_store().
> >
> > As per Herbert's suggestion in [1] for batching to be based on SG lists
> > to interface with the crypto API, a "struct sg_table *sg_outputs" is
> > added to the per-CPU acomp_ctx. In zswap_cpu_comp_prepare(), memory
> is
> > allocated for @pool->compr_batch_size scatterlists in
> > @acomp_ctx->sg_outputs. The per-CPU @acomp_ctx->buffers' addresses
> are
> > statically mapped to the respective SG lists. The existing non-NUMA
> > sg_alloc_table() was found to give better performance than a NUMA-aware
> > allocation function, hence is used in this patch.
> >
> > Batching compressors should initialize the output SG lengths to
> > PAGE_SIZE as part of the internal compress batching setup, to avoid
> > having to do multiple traversals over the @acomp_ctx->sg_outputs->sgl.
> > This is exactly how batching is implemented in the iaa_crypto driver's
> > compress batching procedure, iaa_comp_acompress_batch().
> >
> > The batched zswap_compress() implementation is generalized as much as
> > possible for non-batching and batching compressors, so that the
> > subsequent incompressible page handling, zs_pool writes, and error
> > handling code is seamless for both, without the use of conditionals to
> > switch to specialized code for either.
> >
> > The new batching implementation of zswap_compress() is called with a
> > batch of @nr_pages sent from zswap_store() to zswap_store_pages().
> > zswap_compress() steps through the batch in increments of the
> > compressor's batch-size, sets up the acomp_ctx->req's src/dst SG lists
> > to contain the folio pages and output buffers, before calling
> > crypto_acomp_compress().
> >
> > Some important requirements of this batching architecture for batching
> > compressors:
> >
> > 1) The output SG lengths for each sg in the acomp_req->dst should be
> > initialized to PAGE_SIZE as part of other batch setup in the batch
> > compression function. zswap will not take care of this in the
> > interest of avoiding repetitive traversals of the
> > @acomp_ctx->sg_outputs->sgl so as to not lose the benefits of
> > batching.
> >
> > 2) In case of a compression error for any page in the batch, the
> > batching compressor should set the corresponding @sg->length to a
> > negative error number, as suggested by Herbert. Otherwise, the
> > @sg->length will contain the compressed output length.
> >
> > 3) Batching compressors should set acomp_req->dlen to
> > acomp_req->dst->length, i.e., the sg->length of the first SG in
> > acomp_req->dst.
> >
> > Another important change this patch makes is with the acomp_ctx mutex
> > locking in zswap_compress(). Earlier, the mutex was held per page's
> > compression. With the new code, [un]locking the mutex per page caused
> > regressions for software compressors when testing with 30 usemem
> > processes, and also kernel compilation with 'allmod' config. The
> > regressions were more egregious when PMD folios were stored. The
> > implementation in this commit locks/unlocks the mutex once per batch,
> > that resolves the regression.
> >
> > Architectural considerations for the zswap batching framework:
> >
> ==============================================================
> > We have designed the zswap batching framework to be
> > hardware-agnostic. It has no dependencies on Intel-specific features and
> > can be leveraged by any hardware accelerator or software-based
> > compressor. In other words, the framework is open and inclusive by
> > design.
> >
> > Other ongoing work that can use batching:
> > =========================================
> > This patch-series demonstrates the performance benefits of compress
> > batching when used in zswap_store() of large folios. shrink_folio_list()
> > "reclaim batching" of any-order folios is the major next work that uses
> > the zswap compress batching framework: our testing of kernel_compilation
> > with writeback and the zswap shrinker indicates 10X fewer pages get
> > written back when we reclaim 32 folios as a batch, as compared to one
> > folio at a time: this is with deflate-iaa and with zstd. We expect to
> > submit a patch-series with this data and the resulting performance
> > improvements shortly. Reclaim batching relieves memory pressure faster
> > than reclaiming one folio at a time, hence alleviates the need to scan
> > slab memory for writeback.
> >
> > Nhat has given ideas on using batching with the ongoing kcompressd work,
> > as well as beneficially using decompression batching & block IO batching
> > to improve zswap writeback efficiency.
> >
> > Experiments that combine zswap compress batching, reclaim batching,
> > swapin_readahead() decompression batching of prefetched pages, and
> > writeback batching show that 0 pages are written back with deflate-iaa
> > and zstd. For comparison, the baselines for these compressors see
> > 200K-800K pages written to disk (kernel compilation 'allmod' config).
> >
> > To summarize, these are future clients of the batching framework:
> >
> > - shrink_folio_list() reclaim batching of multiple folios:
> > Implemented, will submit patch-series.
> > - zswap writeback with decompress batching:
> > Implemented, will submit patch-series.
> > - zram:
> > Implemented, will submit patch-series.
> > - kcompressd:
> > Not yet implemented.
> > - file systems:
> > Not yet implemented.
> > - swapin_readahead() decompression batching of prefetched pages:
> > Implemented, will submit patch-series.
> >
> > Additionally, any place we have folios that need to be compressed, can
> > potentially be parallelized.
> >
> > Performance data:
> > =================
> >
> > As suggested by Barry, this is the performance data gathered on Intel
> > Sapphire Rapids with usemem 30 processes running at 50% memory
> pressure
> > and kernel_compilation/allmod config run with 2G limit using 32
> > threads. To keep comparisons simple, all testing was done without the
> > zswap shrinker.
> >
> > usemem30 with 64K folios:
> > =========================
> >
> > zswap shrinker_enabled = N.
> >
> > -----------------------------------------------------------------------
> > mm-unstable-10-24-2025 v13
> > -----------------------------------------------------------------------
> > zswap compressor deflate-iaa deflate-iaa IAA Batching
> > vs.
> > IAA Sequential
> > -----------------------------------------------------------------------
> > Total throughput (KB/s) 6,118,675 9,901,216 62%
> > Average throughput (KB/s) 203,955 330,040 62%
> > elapsed time (sec) 98.94 70.90 -28%
> > sys time (sec) 2,379.29 1,686.18 -29%
> > -----------------------------------------------------------------------
> >
> > -----------------------------------------------------------------------
> > mm-unstable-10-24-2025 v13
> > -----------------------------------------------------------------------
> > zswap compressor zstd zstd v13 zstd
> > improvement
> > -----------------------------------------------------------------------
> > Total throughput (KB/s) 5,983,561 6,003,851 0.3%
> > Average throughput (KB/s) 199,452 200,128 0.3%
> > elapsed time (sec) 100.93 96.62 -4.3%
> > sys time (sec) 2,532.49 2,395.83 -5%
> > -----------------------------------------------------------------------
> >
> > usemem30 with 2M folios:
> > ========================
> >
> > -----------------------------------------------------------------------
> > mm-unstable-10-24-2025 v13
> > -----------------------------------------------------------------------
> > zswap compressor deflate-iaa deflate-iaa IAA Batching
> > vs.
> > IAA Sequential
> > -----------------------------------------------------------------------
> > Total throughput (KB/s) 6,309,635 10,558,225 67%
> > Average throughput (KB/s) 210,321 351,940 67%
> > elapsed time (sec) 88.70 67.84 -24%
> > sys time (sec) 2,059.83 1,581.07 -23%
> > -----------------------------------------------------------------------
> >
> > -----------------------------------------------------------------------
> > mm-unstable-10-24-2025 v13
> > -----------------------------------------------------------------------
> > zswap compressor zstd zstd v13 zstd
> > improvement
> > -----------------------------------------------------------------------
> > Total throughput (KB/s) 6,562,687 6,567,946 0.1%
> > Average throughput (KB/s) 218,756 218,931 0.1%
> > elapsed time (sec) 94.69 88.79 -6%
> > sys time (sec) 2,253.97 2,083.43 -8%
> > -----------------------------------------------------------------------
> >
> > The main takeaway from usemem, a workload that is mostly compression
> > dominated (very few swapins) is that the higher the number of batches,
> > such as with larger folios, the more the benefit of batching cost
> > amortization, as shown by the PMD usemem data. This aligns well
> > with the future direction for batching.
> >
> > kernel_compilation/allmodconfig, 64K folios:
> > ============================================
> >
> > --------------------------------------------------------------------------
> > mm-unstable-10-24-2025 v13
> > --------------------------------------------------------------------------
> > zswap compressor deflate-iaa deflate-iaa IAA Batching
> > vs.
> > IAA Sequential
> > --------------------------------------------------------------------------
> > real_sec 836.64 806.94 -3.5%
> > sys_sec 3,897.57 3,661.83 -6%
> > --------------------------------------------------------------------------
> >
> > --------------------------------------------------------------------------
> > mm-unstable-10-24-2025 v13
> > --------------------------------------------------------------------------
> > zswap compressor zstd zstd Improvement
> > --------------------------------------------------------------------------
> > real_sec 880.62 850.41 -3.4%
> > sys_sec 5,171.90 5,076.51 -1.8%
> > --------------------------------------------------------------------------
> >
> > kernel_compilation/allmodconfig, PMD folios:
> > ============================================
> >
> > --------------------------------------------------------------------------
> > mm-unstable-10-24-2025 v13
> > --------------------------------------------------------------------------
> > zswap compressor deflate-iaa deflate-iaa IAA Batching
> > vs.
> > IAA Sequential
> > --------------------------------------------------------------------------
> > real_sec 818.48 779.67 -4.7%
> > sys_sec 4,226.52 4,245.18 0.4%
> > --------------------------------------------------------------------------
> >
> > --------------------------------------------------------------------------
> > mm-unstable-10-24-2025 v13
> > --------------------------------------------------------------------------
> > zswap compressor zstd zstd Improvement
> > --------------------------------------------------------------------------
> > real_sec 888.45 849.54 -4.4%
> > sys_sec 5,866.72 5,847.17 -0.3%
> > --------------------------------------------------------------------------
> >
> > [1]: https://lore.kernel.org/all/aJ7Fk6RpNc815Ivd@gondor.apana.org.au/T/#m99aea2ce3d284e6c5a3253061d97b08c4752a798
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
>
> I won't go through the commit log and rewrite it for this one too, but
> please do so similarly to how I did for the previous patches. Do not
> describe the code; give a high-level overview of what is happening and
> why it's happening, as well as very concise performance results.
With all due respect, I am not describing the code. zswap compress batching
is a major architectural change and I am documenting the changes from the
status quo, for other zswap developers. Yes, some of this might involve
weaving in repetition of current behavior, again to stress the backward
compatibility of main concepts.
I believe there is not one redundant datapoint when it comes to performance
metrics in this summary - please elaborate. Thanks.
>
> Do not include things that only make sense in the context of a patch and
> won't make sense as part of git history.
This makes sense, duly noted and will be addressed.
>
> That being said, I'd like Herbert to review this patch and make sure the
> scatterlist and crypto APIs are being used correctly as he advised
> earlier. I do have some comments on the zswap side though.
>
> > ---
> > mm/zswap.c | 249 ++++++++++++++++++++++++++++++++++++++----------
> -----
> > 1 file changed, 181 insertions(+), 68 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 257567edc587..c5487dd69ec6 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -143,6 +143,7 @@ struct crypto_acomp_ctx {
> > struct acomp_req *req;
> > struct crypto_wait wait;
> > u8 **buffers;
> > + struct sg_table *sg_outputs;
> > struct mutex mutex;
> > bool is_sleepable;
> > };
> > @@ -271,6 +272,11 @@ static void acomp_ctx_dealloc(struct
> crypto_acomp_ctx *acomp_ctx, u8 nr_buffers)
> > kfree(acomp_ctx->buffers[i]);
> > kfree(acomp_ctx->buffers);
> > }
> > +
> > + if (acomp_ctx->sg_outputs) {
> > + sg_free_table(acomp_ctx->sg_outputs);
> > + kfree(acomp_ctx->sg_outputs);
> > + }
> > }
> >
> > static struct zswap_pool *zswap_pool_create(char *compressor)
> > @@ -804,6 +810,7 @@ static int zswap_cpu_comp_prepare(unsigned int
> cpu, struct hlist_node *node)
> > struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> node);
> > struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool-
> >acomp_ctx, cpu);
> > int nid = cpu_to_node(cpu);
> > + struct scatterlist *sg;
> > int ret = -ENOMEM;
> > u8 i;
> >
> > @@ -849,6 +856,22 @@ static int zswap_cpu_comp_prepare(unsigned int
> cpu, struct hlist_node *node)
> > goto fail;
> > }
> >
> > + acomp_ctx->sg_outputs = kmalloc(sizeof(*acomp_ctx->sg_outputs),
> > + GFP_KERNEL);
> > + if (!acomp_ctx->sg_outputs)
> > + goto fail;
> > +
> > + if (sg_alloc_table(acomp_ctx->sg_outputs, pool->compr_batch_size,
> > + GFP_KERNEL))
> > + goto fail;
> > +
> > + /*
> > + * Statically map the per-CPU destination buffers to the per-CPU
> > + * SG lists.
> > + */
> > + for_each_sg(acomp_ctx->sg_outputs->sgl, sg, pool-
> >compr_batch_size, i)
> > + sg_set_buf(sg, acomp_ctx->buffers[i], PAGE_SIZE);
> > +
> > /*
> > * if the backend of acomp is async zip, crypto_req_done() will
> wakeup
> > * crypto_wait_req(); if the backend of acomp is scomp, the callback
> > @@ -869,84 +892,177 @@ static int zswap_cpu_comp_prepare(unsigned
> int cpu, struct hlist_node *node)
> > return ret;
> > }
> >
> > -static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> > - struct zswap_pool *pool, bool wb_enabled)
> > +/*
> > + * Unified code path for compressors that do and do not support batching.
> This
> > + * procedure will compress multiple @nr_pages in @folio starting from the
> > + * @start index.
> > + *
> > + * It is assumed that @nr_pages <= ZSWAP_MAX_BATCH_SIZE.
> zswap_store() makes
> > + * sure of this by design and zswap_store_pages() warns if this is not
> > + * true.
> > + *
> > + * @nr_pages can be in (1, ZSWAP_MAX_BATCH_SIZE] even if the
> compressor does not
> > + * support batching.
> > + *
> > + * If @pool->compr_batch_size is 1, each page is processed sequentially.
> > + *
> > + * If @pool->compr_batch_size is > 1, compression batching is invoked
> within
> > + * the algorithm's driver, except if @nr_pages is 1: if so, the driver can
> > + * choose to call the sequential/non-batching compress API.
> > + *
> > + * In both cases, if all compressions are successful, the compressed buffers
> > + * are stored in zsmalloc.
> > + *
> > + * Traversing multiple SG lists when @nr_comps is > 1 is expensive, and
> impacts
> > + * batching performance if we were to repeat this operation multiple
> times,
> > + * such as:
> > + * - to map destination buffers to each SG list in the @acomp_ctx-
> >sg_outputs
> > + * sg_table.
> > + * - to initialize each output SG list's @sg->length to PAGE_SIZE.
> > + * - to get the compressed output length in each @sg->length.
> > + *
> > + * These are some design choices made to optimize batching with SG lists:
> > + *
> > + * 1) The source folio pages in the batch are directly submitted to
> > + * crypto_acomp via acomp_request_set_src_folio().
> > + *
> > + * 2) The per-CPU @acomp_ctx->sg_outputs scatterlists are used to set up
> > + * destination buffers for interfacing with crypto_acomp.
> > + *
> > + * 3) To optimize performance, we map the per-CPU @acomp_ctx->buffers
> to the
> > + * @acomp_ctx->sg_outputs->sgl SG lists at pool creation time. The only
> task
> > + * remaining to be done for the output SG lists in zswap_compress() is to
> > + * set each @sg->length to PAGE_SIZE. This is done in zswap_compress()
> > + * for non-batching compressors. This needs to be done within the
> compress
> > + * batching driver procedure as part of iterating through the SG lists for
> > + * batch setup, so as to minimize expensive traversals through the SG
> lists.
> > + *
> > + * 4) Important requirements for batching compressors:
> > + * - Each @sg->length in @acomp_ctx->req->sg_outputs->sgl should
> reflect the
> > + * compression outcome for that specific page, and be set to:
> > + * - the page's compressed length, or
> > + * - the compression error value for that page.
> > + * - The @acomp_ctx->req->dlen should be set to the first page's
> > + * @sg->length. This enables code generalization in zswap_compress()
> > + * for non-batching and batching compressors.
> > + *
> > + * acomp_ctx mutex locking:
> > + * Earlier, the mutex was held per page compression. With the new code,
> > + * [un]locking the mutex per page caused regressions for software
> > + * compressors. We now lock the mutex once per batch, which resolves
> the
> > + * regression.
> > + */
>
> Please, no huge comments describing what the code is doing. If there's
> anything that is not clear from reading the code or needs to be
> explained or documented, please do so **concisely** in the relevant part
> of the function.
Again, these are important requirements related to the major change, i.e.,
batching, wrt why/how. I think it is important to note considerations for the
next batching algorithm, just like I have done within the IAA driver. To be very
clear, I am not describing code.
If questions arise as to why the mutex is being locked per batch as against
per page, I think the comment above is helpful and saves time for folks to
understand the "why".
>
> > +static bool zswap_compress(struct folio *folio, long start, unsigned int
> nr_pages,
> > + struct zswap_entry *entries[], struct zswap_pool
> *pool,
> > + int nid, bool wb_enabled)
> > {
> > + gfp_t gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM |
> __GFP_MOVABLE;
> > + unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
> > + unsigned int slen = nr_comps * PAGE_SIZE;
> > struct crypto_acomp_ctx *acomp_ctx;
> > - struct scatterlist input, output;
> > - int comp_ret = 0, alloc_ret = 0;
> > - unsigned int dlen = PAGE_SIZE;
> > + int err = 0, err_sg = 0;
> > + struct scatterlist *sg;
> > + unsigned int i, j, k;
> > unsigned long handle;
> > - gfp_t gfp;
> > - u8 *dst;
> > - bool mapped = false;
> > + int *errp, dlen;
> > + void *dst;
> >
> > acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> > mutex_lock(&acomp_ctx->mutex);
> >
> > - dst = acomp_ctx->buffers[0];
> > - sg_init_table(&input, 1);
> > - sg_set_page(&input, page, PAGE_SIZE, 0);
> > -
> > - sg_init_one(&output, dst, PAGE_SIZE);
> > - acomp_request_set_params(acomp_ctx->req, &input, &output,
> PAGE_SIZE, dlen);
> > + errp = (pool->compr_batch_size == 1) ? &err : &err_sg;
>
> err_sg is not used anywhere, so *errp could end up being garbage. Why do
> we need this?
err_sg is initialized to 0 and never changes. It can never be garbage.
We need this because of the current dichotomy between software compressors
and IAA in the sg->length based error handling per Herbert's suggestions,
included in the huge function comment block. It is needed to avoid branches
and have the zswap_compress() code look seamless for all compressors.
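For illustration only, here is a minimal user-space sketch of that
error-folding idea (names and values are made up; this is not the zswap
code): when the backend has no batching support, errp aliases err, so a
failed crypto call forces dlen negative; when the backend batches, errp
points at a constant zero and sg->length alone carries either the
compressed length or a negative errno.

#include <stdio.h>

/*
 * Toy model of "dlen = sg->length | *errp" from the patch above.
 * OR-ing with a negative int sets the sign bit, so a negative result
 * means "this page failed or must be stored uncompressed".
 */
static int fold_status(int len_or_err, int err, unsigned int batch_size)
{
	int err_zero = 0;
	int *errp = (batch_size == 1) ? &err : &err_zero;

	return len_or_err | *errp;
}

int main(void)
{
	/* software compressor: call failed with -EINVAL, length still 4096 */
	printf("%d\n", fold_status(4096, -22, 1));	/* negative */
	/* batching compressor: per-page slot already holds -ENOSPC */
	printf("%d\n", fold_status(-28, 0, 8));		/* negative */
	/* batching compressor: page compressed to 1234 bytes */
	printf("%d\n", fold_status(1234, 0, 8));	/* 1234 */
	return 0;
}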
>
> >
> > /*
> > - * it maybe looks a little bit silly that we send an asynchronous
> request,
> > - * then wait for its completion synchronously. This makes the process
> look
> > - * synchronous in fact.
> > - * Theoretically, acomp supports users send multiple acomp requests
> in one
> > - * acomp instance, then get those requests done simultaneously. but
> in this
> > - * case, zswap actually does store and load page by page, there is no
> > - * existing method to send the second page before the first page is
> done
> > - * in one thread doing zswap.
> > - * but in different threads running on different cpu, we have different
> > - * acomp instance, so multiple threads can do (de)compression in
> parallel.
> > + * [i] refers to the incoming batch space and is used to
> > + * index into the folio pages.
> > + *
> > + * [j] refers to the incoming batch space and is used to
> > + * index into the @entries for the folio's pages in this
> > + * batch, per compress call while iterating over the output SG
> > + * lists. Also used to index into the folio's pages from @start,
> > + * in case of compress errors.
> > + *
> > + * [k] refers to the @acomp_ctx space, as determined by
> > + * @pool->compr_batch_size, and is used to index into
> > + * @acomp_ctx->sg_outputs->sgl and @acomp_ctx->buffers.
> > */
> > - comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx-
> >req), &acomp_ctx->wait);
> > - dlen = acomp_ctx->req->dlen;
> > + for (i = 0; i < nr_pages; i += nr_comps) {
>
> What are looping over here? I thought zswap_compress() takes in exactly
> one batch.
We iterate over the batch once (a single compress call) for batching
compressors, and one page at a time for software compressors.
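To make that concrete, a toy sketch (not the zswap code) of how the loop
stride collapses to a single iteration when the compressor batches,
assuming nr_comps = min(nr_pages, compr_batch_size) as in the patch:

#include <stdio.h>

/* Illustrative only: iteration count of the discussed loop shape. */
static unsigned int count_compress_calls(unsigned int nr_pages,
					 unsigned int compr_batch_size)
{
	unsigned int nr_comps = nr_pages < compr_batch_size ?
				nr_pages : compr_batch_size;
	unsigned int i, calls = 0;

	for (i = 0; i < nr_pages; i += nr_comps)
		calls++;	/* one crypto_acomp_compress() per iteration */

	return calls;
}

int main(void)
{
	printf("%u\n", count_compress_calls(8, 8));	/* batching HW: 1 call  */
	printf("%u\n", count_compress_calls(8, 1));	/* software:    8 calls */
	return 0;
}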
>
> > + acomp_request_set_src_folio(acomp_ctx->req, folio,
> > + (start + i) * PAGE_SIZE,
> > + slen);
> >
> > - /*
> > - * If a page cannot be compressed into a size smaller than PAGE_SIZE,
> > - * save the content as is without a compression, to keep the LRU
> order
> > - * of writebacks. If writeback is disabled, reject the page since it
> > - * only adds metadata overhead. swap_writeout() will put the page
> back
> > - * to the active LRU list in the case.
> > - */
> > - if (comp_ret || !dlen || dlen >= PAGE_SIZE) {
> > - if (!wb_enabled) {
> > - comp_ret = comp_ret ? comp_ret : -EINVAL;
> > - goto unlock;
> > - }
> > - comp_ret = 0;
> > - dlen = PAGE_SIZE;
> > - dst = kmap_local_page(page);
> > - mapped = true;
> > - }
> > + acomp_ctx->sg_outputs->sgl->length = slen;
> >
> > - gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM |
> __GFP_MOVABLE;
> > - handle = zs_malloc(pool->zs_pool, dlen, gfp, page_to_nid(page));
> > - if (IS_ERR_VALUE(handle)) {
> > - alloc_ret = PTR_ERR((void *)handle);
> > - goto unlock;
> > - }
> > + acomp_request_set_dst_sg(acomp_ctx->req,
> > + acomp_ctx->sg_outputs->sgl,
> > + slen);
> > +
> > + err = crypto_wait_req(crypto_acomp_compress(acomp_ctx-
> >req),
> > + &acomp_ctx->wait);
> > +
> > + acomp_ctx->sg_outputs->sgl->length = acomp_ctx->req-
> >dlen;
> > +
> > + /*
> > + * If a page cannot be compressed into a size smaller than
> > + * PAGE_SIZE, save the content as is without a compression,
> to
> > + * keep the LRU order of writebacks. If writeback is disabled,
> > + * reject the page since it only adds metadata overhead.
> > + * swap_writeout() will put the page back to the active LRU
> list
> > + * in the case.
> > + *
> > + * It is assumed that any compressor that sets the output
> length
> > + * to 0 or a value >= PAGE_SIZE will also return a negative
> > + * error status in @err; i.e, will not return a successful
> > + * compression status in @err in this case.
> > + */
>
> Ugh, checking the compression error and checking the compression length
> are now in separate places so we need to check if writeback is disabled
> in separate places and store the page as-is. It's ugly, and I think the
> current code is not correct.
The code is 100% correct. You need to spend more time understanding
the code. I have stated my assumption above in the comments to
help in understanding the "why".
From a maintainer, I would expect more responsible statements than
this. A flippant remark made without understanding the code (and,
disparaging the comments intended to help you do this), can impact
someone's career. I am held accountable in my job based on your
comments.
That said, I have worked tirelessly and innovated to make the code
compliant with Herbert's suggestions (which btw have enabled an
elegant batching implementation and code commonality for IAA and
software compressors), validated it thoroughly for IAA and ZSTD to
ensure that both demonstrate performance improvements, which
are crucial for memory savings. I am proud of this work.
>
> > + if (err && !wb_enabled)
> > + goto compress_error;
> > +
> > + for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > + j = k + i;
>
> Please use meaningful iterator names rather than i, j, and k and the huge
> comment explaining what they are.
I happen to have a different view: having longer iterator names firstly makes
code seem "verbose" and detracts from readability, not to mention exceeding the
80-character line limit. The comments are essential for code maintainability
and avoid out-of-bounds errors when the next zswap developer wants to
optimize the code.
One drawback of i/j/k iterators is mis-typing errors which cannot be caught
at compile time. Let me think some more about how to strike a good balance.
>
> > + dst = acomp_ctx->buffers[k];
> > + dlen = sg->length | *errp;
>
> Why are we doing this?
>
> > +
> > + if (dlen < 0) {
>
> We should do the incompressible page handling also if dlen is PAGE_SIZE,
> or if the compression failed (I guess that's the intention of bit OR'ing
> with *errp?)
Yes, indeed: that's the intention of bit OR'ing with *errp.
>
> > + dlen = PAGE_SIZE;
> > + dst = kmap_local_page(folio_page(folio, start
> + j));
> > + }
> > +
> > + handle = zs_malloc(pool->zs_pool, dlen, gfp, nid);
> >
> > - zs_obj_write(pool->zs_pool, handle, dst, dlen);
> > - entry->handle = handle;
> > - entry->length = dlen;
> > + if (IS_ERR_VALUE(handle)) {
> > + if (PTR_ERR((void *)handle) == -ENOSPC)
> > + zswap_reject_compress_poor++;
> > + else
> > + zswap_reject_alloc_fail++;
> >
> > -unlock:
> > - if (mapped)
> > - kunmap_local(dst);
> > - if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
> > - zswap_reject_compress_poor++;
> > - else if (comp_ret)
> > - zswap_reject_compress_fail++;
> > - else if (alloc_ret)
> > - zswap_reject_alloc_fail++;
> > + goto err_unlock;
> > + }
> > +
> > + zs_obj_write(pool->zs_pool, handle, dst, dlen);
> > + entries[j]->handle = handle;
> > + entries[j]->length = dlen;
> > + if (dst != acomp_ctx->buffers[k])
> > + kunmap_local(dst);
> > + }
> > + } /* finished compress and store nr_pages. */
> > +
> > + mutex_unlock(&acomp_ctx->mutex);
> > + return true;
> > +
> > +compress_error:
> > + for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > + if ((int)sg->length < 0) {
> > + if ((int)sg->length == -ENOSPC)
> > + zswap_reject_compress_poor++;
> > + else
> > + zswap_reject_compress_fail++;
> > + }
> > + }
> >
> > +err_unlock:
> > mutex_unlock(&acomp_ctx->mutex);
> > - return comp_ret == 0 && alloc_ret == 0;
> > + return false;
> > }
> >
> > static bool zswap_decompress(struct zswap_entry *entry, struct folio
> *folio)
> > @@ -1488,12 +1604,9 @@ static bool zswap_store_pages(struct folio
> *folio,
> > INIT_LIST_HEAD(&entries[i]->lru);
> > }
> >
> > - for (i = 0; i < nr_pages; ++i) {
> > - struct page *page = folio_page(folio, start + i);
> > -
> > - if (!zswap_compress(page, entries[i], pool, wb_enabled))
> > - goto store_pages_failed;
> > - }
> > + if (unlikely(!zswap_compress(folio, start, nr_pages, entries, pool,
> > + nid, wb_enabled)))
> > + goto store_pages_failed;
> >
> > for (i = 0; i < nr_pages; ++i) {
> > struct zswap_entry *old, *entry = entries[i];
> > --
> > 2.27.0
> >
^ permalink raw reply [flat|nested] 47+ messages in thread

* Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.
2025-11-13 23:55 ` Sridhar, Kanchana P
@ 2025-11-14 0:46 ` Yosry Ahmed
2025-11-14 5:52 ` Yosry Ahmed
1 sibling, 0 replies; 47+ messages in thread
From: Yosry Ahmed @ 2025-11-14 0:46 UTC (permalink / raw)
To: Sridhar, Kanchana P
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
senozhatsky, sj, kasong, linux-crypto, herbert, davem, clabbe,
ardb, ebiggers, surenb, Accardi, Kristen C, Gomes, Vinicius,
Feghali, Wajdi K, Gopal, Vinodh
On Thu, Nov 13, 2025 at 11:55:10PM +0000, Sridhar, Kanchana P wrote:
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > Sent: Thursday, November 13, 2025 1:35 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> > compress batching of large folios.
> >
> > On Tue, Nov 04, 2025 at 01:12:35AM -0800, Kanchana P Sridhar wrote:
> > > This patch introduces a new unified implementation of zswap_compress()
> > > for compressors that do and do not support batching. This eliminates
> > > code duplication and facilitates code maintainability with the
> > > introduction of compress batching.
> > >
> > > The vectorized implementation of calling the earlier zswap_compress()
> > > sequentially, one page at a time in zswap_store_pages(), is replaced
> > > with this new version of zswap_compress() that accepts multiple pages to
> > > compress as a batch.
> > >
> > > If the compressor does not support batching, each page in the batch is
> > > compressed and stored sequentially. If the compressor supports batching,
> > > for e.g., 'deflate-iaa', the Intel IAA hardware accelerator, the batch
> > > is compressed in parallel in hardware. If the batch is compressed
> > > without errors, the compressed buffers are then stored in zsmalloc. In
> > > case of compression errors, the current behavior is preserved for the
> > > batching zswap_compress(): if the folio's memcg is writeback enabled,
> > > pages with compression errors are stored uncompressed in zsmalloc; if
> > > not, we return an error for the folio in zswap_store().
> > >
> > > As per Herbert's suggestion in [1] for batching to be based on SG lists
> > > to interface with the crypto API, a "struct sg_table *sg_outputs" is
> > > added to the per-CPU acomp_ctx. In zswap_cpu_comp_prepare(), memory
> > is
> > > allocated for @pool->compr_batch_size scatterlists in
> > > @acomp_ctx->sg_outputs. The per-CPU @acomp_ctx->buffers' addresses
> > are
> > > statically mapped to the respective SG lists. The existing non-NUMA
> > > sg_alloc_table() was found to give better performance than a NUMA-aware
> > > allocation function, hence is used in this patch.
> > >
> > > Batching compressors should initialize the output SG lengths to
> > > PAGE_SIZE as part of the internal compress batching setup, to avoid
> > > having to do multiple traversals over the @acomp_ctx->sg_outputs->sgl.
> > > This is exactly how batching is implemented in the iaa_crypto driver's
> > > compress batching procedure, iaa_comp_acompress_batch().
> > >
> > > The batched zswap_compress() implementation is generalized as much as
> > > possible for non-batching and batching compressors, so that the
> > > subsequent incompressible page handling, zs_pool writes, and error
> > > handling code is seamless for both, without the use of conditionals to
> > > switch to specialized code for either.
> > >
> > > The new batching implementation of zswap_compress() is called with a
> > > batch of @nr_pages sent from zswap_store() to zswap_store_pages().
> > > zswap_compress() steps through the batch in increments of the
> > > compressor's batch-size, sets up the acomp_ctx->req's src/dst SG lists
> > > to contain the folio pages and output buffers, before calling
> > > crypto_acomp_compress().
> > >
> > > Some important requirements of this batching architecture for batching
> > > compressors:
> > >
> > > 1) The output SG lengths for each sg in the acomp_req->dst should be
> > > initialized to PAGE_SIZE as part of other batch setup in the batch
> > > compression function. zswap will not take care of this in the
> > > interest of avoiding repetitive traversals of the
> > > @acomp_ctx->sg_outputs->sgl so as to not lose the benefits of
> > > batching.
> > >
> > > 2) In case of a compression error for any page in the batch, the
> > > batching compressor should set the corresponding @sg->length to a
> > > negative error number, as suggested by Herbert. Otherwise, the
> > > @sg->length will contain the compressed output length.
> > >
> > > 3) Batching compressors should set acomp_req->dlen to
> > > acomp_req->dst->length, i.e., the sg->length of the first SG in
> > > acomp_req->dst.
> > >
> > > Another important change this patch makes is with the acomp_ctx mutex
> > > locking in zswap_compress(). Earlier, the mutex was held per page's
> > > compression. With the new code, [un]locking the mutex per page caused
> > > regressions for software compressors when testing with 30 usemem
> > > processes, and also kernel compilation with 'allmod' config. The
> > > regressions were more egregious when PMD folios were stored. The
> > > implementation in this commit locks/unlocks the mutex once per batch,
> > > which resolves the regression.
> > >
> > > Architectural considerations for the zswap batching framework:
> > >
> > ==============================================================
> > > We have designed the zswap batching framework to be
> > > hardware-agnostic. It has no dependencies on Intel-specific features and
> > > can be leveraged by any hardware accelerator or software-based
> > > compressor. In other words, the framework is open and inclusive by
> > > design.
> > >
> > > Other ongoing work that can use batching:
> > > =========================================
> > > This patch-series demonstrates the performance benefits of compress
> > > batching when used in zswap_store() of large folios. shrink_folio_list()
> > > "reclaim batching" of any-order folios is the major next work that uses
> > > the zswap compress batching framework: our testing of kernel_compilation
> > > with writeback and the zswap shrinker indicates 10X fewer pages get
> > > written back when we reclaim 32 folios as a batch, as compared to one
> > > folio at a time: this is with deflate-iaa and with zstd. We expect to
> > > submit a patch-series with this data and the resulting performance
> > > improvements shortly. Reclaim batching relieves memory pressure faster
> > > than reclaiming one folio at a time, hence alleviates the need to scan
> > > slab memory for writeback.
> > >
> > > Nhat has given ideas on using batching with the ongoing kcompressd work,
> > > as well as beneficially using decompression batching & block IO batching
> > > to improve zswap writeback efficiency.
> > >
> > > Experiments that combine zswap compress batching, reclaim batching,
> > > swapin_readahead() decompression batching of prefetched pages, and
> > > writeback batching show that 0 pages are written back with deflate-iaa
> > > and zstd. For comparison, the baselines for these compressors see
> > > 200K-800K pages written to disk (kernel compilation 'allmod' config).
> > >
> > > To summarize, these are future clients of the batching framework:
> > >
> > > - shrink_folio_list() reclaim batching of multiple folios:
> > > Implemented, will submit patch-series.
> > > - zswap writeback with decompress batching:
> > > Implemented, will submit patch-series.
> > > - zram:
> > > Implemented, will submit patch-series.
> > > - kcompressd:
> > > Not yet implemented.
> > > - file systems:
> > > Not yet implemented.
> > > - swapin_readahead() decompression batching of prefetched pages:
> > > Implemented, will submit patch-series.
> > >
> > > Additionally, any place we have folios that need to be compressed, can
> > > potentially be parallelized.
> > >
> > > Performance data:
> > > =================
> > >
> > > As suggested by Barry, this is the performance data gathered on Intel
> > > Sapphire Rapids with usemem 30 processes running at 50% memory
> > pressure
> > > and kernel_compilation/allmod config run with 2G limit using 32
> > > threads. To keep comparisons simple, all testing was done without the
> > > zswap shrinker.
> > >
> > > usemem30 with 64K folios:
> > > =========================
> > >
> > > zswap shrinker_enabled = N.
> > >
> > > -----------------------------------------------------------------------
> > > mm-unstable-10-24-2025 v13
> > > -----------------------------------------------------------------------
> > > zswap compressor deflate-iaa deflate-iaa IAA Batching
> > > vs.
> > > IAA Sequential
> > > -----------------------------------------------------------------------
> > > Total throughput (KB/s) 6,118,675 9,901,216 62%
> > > Average throughput (KB/s) 203,955 330,040 62%
> > > elapsed time (sec) 98.94 70.90 -28%
> > > sys time (sec) 2,379.29 1,686.18 -29%
> > > -----------------------------------------------------------------------
> > >
> > > -----------------------------------------------------------------------
> > > mm-unstable-10-24-2025 v13
> > > -----------------------------------------------------------------------
> > > zswap compressor zstd zstd v13 zstd
> > > improvement
> > > -----------------------------------------------------------------------
> > > Total throughput (KB/s) 5,983,561 6,003,851 0.3%
> > > Average throughput (KB/s) 199,452 200,128 0.3%
> > > elapsed time (sec) 100.93 96.62 -4.3%
> > > sys time (sec) 2,532.49 2,395.83 -5%
> > > -----------------------------------------------------------------------
> > >
> > > usemem30 with 2M folios:
> > > ========================
> > >
> > > -----------------------------------------------------------------------
> > > mm-unstable-10-24-2025 v13
> > > -----------------------------------------------------------------------
> > > zswap compressor deflate-iaa deflate-iaa IAA Batching
> > > vs.
> > > IAA Sequential
> > > -----------------------------------------------------------------------
> > > Total throughput (KB/s) 6,309,635 10,558,225 67%
> > > Average throughput (KB/s) 210,321 351,940 67%
> > > elapsed time (sec) 88.70 67.84 -24%
> > > sys time (sec) 2,059.83 1,581.07 -23%
> > > -----------------------------------------------------------------------
> > >
> > > -----------------------------------------------------------------------
> > > mm-unstable-10-24-2025 v13
> > > -----------------------------------------------------------------------
> > > zswap compressor zstd zstd v13 zstd
> > > improvement
> > > -----------------------------------------------------------------------
> > > Total throughput (KB/s) 6,562,687 6,567,946 0.1%
> > > Average throughput (KB/s) 218,756 218,931 0.1%
> > > elapsed time (sec) 94.69 88.79 -6%
> > > sys time (sec) 2,253.97 2,083.43 -8%
> > > -----------------------------------------------------------------------
> > >
> > > The main takeaway from usemem, a workload that is mostly
> > > compression-dominated (very few swapins), is that the higher the number
> > > of batches, such as with larger folios, the greater the benefit of
> > > batching cost amortization, as shown by the PMD usemem data. This
> > > aligns well with the future direction for batching.
> > >
> > > kernel_compilation/allmodconfig, 64K folios:
> > > ============================================
> > >
> > > --------------------------------------------------------------------------
> > > mm-unstable-10-24-2025 v13
> > > --------------------------------------------------------------------------
> > > zswap compressor deflate-iaa deflate-iaa IAA Batching
> > > vs.
> > > IAA Sequential
> > > --------------------------------------------------------------------------
> > > real_sec 836.64 806.94 -3.5%
> > > sys_sec 3,897.57 3,661.83 -6%
> > > --------------------------------------------------------------------------
> > >
> > > --------------------------------------------------------------------------
> > > mm-unstable-10-24-2025 v13
> > > --------------------------------------------------------------------------
> > > zswap compressor zstd zstd Improvement
> > > --------------------------------------------------------------------------
> > > real_sec 880.62 850.41 -3.4%
> > > sys_sec 5,171.90 5,076.51 -1.8%
> > > --------------------------------------------------------------------------
> > >
> > > kernel_compilation/allmodconfig, PMD folios:
> > > ============================================
> > >
> > > --------------------------------------------------------------------------
> > > mm-unstable-10-24-2025 v13
> > > --------------------------------------------------------------------------
> > > zswap compressor deflate-iaa deflate-iaa IAA Batching
> > > vs.
> > > IAA Sequential
> > > --------------------------------------------------------------------------
> > > real_sec 818.48 779.67 -4.7%
> > > sys_sec 4,226.52 4,245.18 0.4%
> > > --------------------------------------------------------------------------
> > >
> > > --------------------------------------------------------------------------
> > > mm-unstable-10-24-2025 v13
> > > --------------------------------------------------------------------------
> > > zswap compressor zstd zstd Improvement
> > > --------------------------------------------------------------------------
> > > real_sec 888.45 849.54 -4.4%
> > > sys_sec 5,866.72 5,847.17 -0.3%
> > > --------------------------------------------------------------------------
> > >
> > > [1]: https://lore.kernel.org/all/aJ7Fk6RpNc815Ivd@gondor.apana.org.au/T/#m99aea2ce3d284e6c5a3253061d97b08c4752a798
> > >
> > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> >
> > I won't go through the commit log and rewrite it for this one too, but
> > please do so similarly to how I did for the previous patches. Do not
> > describe the code; give a high-level overview of what is happening and
> > why it's happening, as well as very concise performance results.
>
> With all due respect, I am not describing the code. zswap compress batching
> is a major architectural change and I am documenting the changes from the
> status quo, for other zswap developers. Yes, some of this might involve
> weaving in repetition of current behavior, again to stress the backward
> compatibility of main concepts.
As I said, I did not go through the commit log as I did for previous
ones, which did include unnecessary description of the code. What I
asked is for you to do similar changes here, if needed, because the
commit log is too big.
For example, you should remove mentions of ongoing work and future work,
simply because things change and they may not land. Just briefly
mentioning that there are future use cases (with maybe an example) is
sufficient.
>
> I believe there is not one redundant datapoint when it comes to performance
> metrics in this summary - please elaborate. Thanks.
I never said they were redundant, I said we should make them more
concise. For example, the first table can be replaced by stating that
throughput improves by ~62% and the time is reduced by 28-29% and so on.
>
> >
> > Do not include things that only make sense in the context of a patch and
> > won't make sense as part of git history.
>
> This makes sense, duly noted and will be addressed.
>
> >
> > That being said, I'd like Herbert to review this patch and make sure the
> > scatterlist and crypto APIs are being used correctly as he advised
> > earlier. I do have some comments on the zswap side though.
> >
[..]
> > > @@ -869,84 +892,177 @@ static int zswap_cpu_comp_prepare(unsigned
> > int cpu, struct hlist_node *node)
> > > return ret;
> > > }
> > >
> > > -static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> > > - struct zswap_pool *pool, bool wb_enabled)
> > > +/*
> > > + * Unified code path for compressors that do and do not support batching.
> > This
> > > + * procedure will compress multiple @nr_pages in @folio starting from the
> > > + * @start index.
> > > + *
> > > + * It is assumed that @nr_pages <= ZSWAP_MAX_BATCH_SIZE.
> > zswap_store() makes
> > > + * sure of this by design and zswap_store_pages() warns if this is not
> > > + * true.
> > > + *
> > > + * @nr_pages can be in (1, ZSWAP_MAX_BATCH_SIZE] even if the
> > compressor does not
> > > + * support batching.
> > > + *
> > > + * If @pool->compr_batch_size is 1, each page is processed sequentially.
> > > + *
> > > + * If @pool->compr_batch_size is > 1, compression batching is invoked
> > within
> > > + * the algorithm's driver, except if @nr_pages is 1: if so, the driver can
> > > + * choose to call the sequential/non-batching compress API.
> > > + *
> > > + * In both cases, if all compressions are successful, the compressed buffers
> > > + * are stored in zsmalloc.
> > > + *
> > > + * Traversing multiple SG lists when @nr_comps is > 1 is expensive, and
> > impacts
> > > + * batching performance if we were to repeat this operation multiple
> > times,
> > > + * such as:
> > > + * - to map destination buffers to each SG list in the @acomp_ctx-
> > >sg_outputs
> > > + * sg_table.
> > > + * - to initialize each output SG list's @sg->length to PAGE_SIZE.
> > > + * - to get the compressed output length in each @sg->length.
> > > + *
> > > + * These are some design choices made to optimize batching with SG lists:
> > > + *
> > > + * 1) The source folio pages in the batch are directly submitted to
> > > + * crypto_acomp via acomp_request_set_src_folio().
> > > + *
> > > + * 2) The per-CPU @acomp_ctx->sg_outputs scatterlists are used to set up
> > > + * destination buffers for interfacing with crypto_acomp.
> > > + *
> > > + * 3) To optimize performance, we map the per-CPU @acomp_ctx->buffers
> > to the
> > > + * @acomp_ctx->sg_outputs->sgl SG lists at pool creation time. The only
> > task
> > > + * remaining to be done for the output SG lists in zswap_compress() is to
> > > + * set each @sg->length to PAGE_SIZE. This is done in zswap_compress()
> > > + * for non-batching compressors. This needs to be done within the
> > compress
> > > + * batching driver procedure as part of iterating through the SG lists for
> > > + * batch setup, so as to minimize expensive traversals through the SG
> > lists.
> > > + *
> > > + * 4) Important requirements for batching compressors:
> > > + * - Each @sg->length in @acomp_ctx->req->sg_outputs->sgl should
> > reflect the
> > > + * compression outcome for that specific page, and be set to:
> > > + * - the page's compressed length, or
> > > + * - the compression error value for that page.
> > > + * - The @acomp_ctx->req->dlen should be set to the first page's
> > > + * @sg->length. This enables code generalization in zswap_compress()
> > > + * for non-batching and batching compressors.
> > > + *
> > > + * acomp_ctx mutex locking:
> > > + * Earlier, the mutex was held per page compression. With the new code,
> > > + * [un]locking the mutex per page caused regressions for software
> > > + * compressors. We now lock the mutex once per batch, which resolves
> > the
> > > + * regression.
> > > + */
> >
> > Please, no huge comments describing what the code is doing. If there's
> > anything that is not clear from reading the code or needs to be
> > explained or documented, please do so **concisely** in the relevant part
> > of the function.
>
> Again, these are important requirements related to the major change, i.e.,
> batching, wrt why/how. I think it is important to note considerations for the
> next batching algorithm, just like I have done within the IAA driver. To be very
> clear, I am not describing code.
>
> If questions arise as to why the mutex is being locked per batch as against
> per page, I think the comment above is helpful and saves time for folks to
> understand the "why".
Having a huge comment above the function does not help. For things like
this, you should add a brief comment above the mutex locking (where it's
relevant). Otherwise it's easy for someone to move the mutex locking
without reading this comment.
Same applies for other things. I am not saying we should throw away the
entire comment, but it's not helpful in its current form. Concise
comments in the relevant parts are much more helpful. Keep comments
above the function to general notes and things that are important to
callers, not implementation details.
>
> >
> > > +static bool zswap_compress(struct folio *folio, long start, unsigned int
> > nr_pages,
> > > + struct zswap_entry *entries[], struct zswap_pool
> > *pool,
> > > + int nid, bool wb_enabled)
> > > {
> > > + gfp_t gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM |
> > __GFP_MOVABLE;
> > > + unsigned int nr_comps = min(nr_pages, pool->compr_batch_size);
> > > + unsigned int slen = nr_comps * PAGE_SIZE;
> > > struct crypto_acomp_ctx *acomp_ctx;
> > > - struct scatterlist input, output;
> > > - int comp_ret = 0, alloc_ret = 0;
> > > - unsigned int dlen = PAGE_SIZE;
> > > + int err = 0, err_sg = 0;
> > > + struct scatterlist *sg;
> > > + unsigned int i, j, k;
> > > unsigned long handle;
> > > - gfp_t gfp;
> > > - u8 *dst;
> > > - bool mapped = false;
> > > + int *errp, dlen;
> > > + void *dst;
> > >
> > > acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> > > mutex_lock(&acomp_ctx->mutex);
> > >
> > > - dst = acomp_ctx->buffers[0];
> > > - sg_init_table(&input, 1);
> > > - sg_set_page(&input, page, PAGE_SIZE, 0);
> > > -
> > > - sg_init_one(&output, dst, PAGE_SIZE);
> > > - acomp_request_set_params(acomp_ctx->req, &input, &output,
> > PAGE_SIZE, dlen);
> > > + errp = (pool->compr_batch_size == 1) ? &err : &err_sg;
> >
> > err_sg is not used anywhere, so *errp could end up being garbage. Why do
> > we need this?
>
> err_sg is initialized to 0 and never changes. It can never be garbage.
> We need this because of the current dichotomy between software compressors
> and IAA in the sg->length based error handling per Herbert's suggestions,
> included in the huge function comment block. It is needed to avoid branches
> and have the zswap_compress() code look seamless for all compressors.
This is exactly what I meant by saying the huge comment doesn't help. It
should be documented where it is implemented.
That being said, the code is confusing and not readable: why do we need
to do such maneuvering with the error codes? It's really hard to track.
>
> >
> > >
> > > /*
> > > - * it maybe looks a little bit silly that we send an asynchronous
> > request,
> > > - * then wait for its completion synchronously. This makes the process
> > look
> > > - * synchronous in fact.
> > > - * Theoretically, acomp supports users send multiple acomp requests
> > in one
> > > - * acomp instance, then get those requests done simultaneously. but
> > in this
> > > - * case, zswap actually does store and load page by page, there is no
> > > - * existing method to send the second page before the first page is
> > done
> > > - * in one thread doing zswap.
> > > - * but in different threads running on different cpu, we have different
> > > - * acomp instance, so multiple threads can do (de)compression in
> > parallel.
> > > + * [i] refers to the incoming batch space and is used to
> > > + * index into the folio pages.
> > > + *
> > > + * [j] refers to the incoming batch space and is used to
> > > + * index into the @entries for the folio's pages in this
> > > + * batch, per compress call while iterating over the output SG
> > > + * lists. Also used to index into the folio's pages from @start,
> > > + * in case of compress errors.
> > > + *
> > > + * [k] refers to the @acomp_ctx space, as determined by
> > > + * @pool->compr_batch_size, and is used to index into
> > > + * @acomp_ctx->sg_outputs->sgl and @acomp_ctx->buffers.
> > > */
> > > - comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx-
> > >req), &acomp_ctx->wait);
> > > - dlen = acomp_ctx->req->dlen;
> > > + for (i = 0; i < nr_pages; i += nr_comps) {
> >
> > What are looping over here? I thought zswap_compress() takes in exactly
> > one batch.
>
> We iterate over the batch once (a single compress call) for batching
> compressors, and one page at a time for software compressors.
I thought we wanted to have a single acomp API that takes in a batch of
pages, and then either hands them over to HW compressors, or loops over
them for SW compressors. This would simplify users like zswap
because the differences between SW and HW compressors would be handled
internally.
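Purely as an illustration of that idea (a hypothetical user-space model,
not the acomp API; all names here are invented), one entry point could
hide the looping from the caller:

#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical backend descriptor; batch_size == 1 means no HW batching. */
struct toy_backend {
	unsigned int batch_size;
	int (*compress_one)(const void *src, void *dst, size_t *dlen);
	int (*compress_batch)(const void **src, void **dst, size_t *dlens,
			      unsigned int n);
};

/* One call site for callers; looping for non-batching backends is internal. */
static int toy_compress(const struct toy_backend *be, const void **src,
			void **dst, size_t *dlens, unsigned int nr_pages)
{
	unsigned int i;
	int err;

	if (be->batch_size > 1 && be->compress_batch)
		return be->compress_batch(src, dst, dlens, nr_pages);

	for (i = 0; i < nr_pages; i++) {
		err = be->compress_one(src[i], dst[i], &dlens[i]);
		if (err)
			return err;
	}
	return 0;
}

/* Dummy "compressor" so the sketch runs: it just copies the input. */
static int copy_one(const void *src, void *dst, size_t *dlen)
{
	memcpy(dst, src, *dlen);
	return 0;
}

int main(void)
{
	char in[4] = "abc", out[4];
	const void *src[1] = { in };
	void *dst[1] = { out };
	size_t dlens[1] = { sizeof(in) };
	const struct toy_backend sw = { .batch_size = 1, .compress_one = copy_one };

	printf("%d %s\n", toy_compress(&sw, src, dst, dlens, 1), out);
	return 0;
}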
>
> >
> > > + acomp_request_set_src_folio(acomp_ctx->req, folio,
> > > + (start + i) * PAGE_SIZE,
> > > + slen);
> > >
> > > - /*
> > > - * If a page cannot be compressed into a size smaller than PAGE_SIZE,
> > > - * save the content as is without a compression, to keep the LRU
> > order
> > > - * of writebacks. If writeback is disabled, reject the page since it
> > > - * only adds metadata overhead. swap_writeout() will put the page
> > back
> > > - * to the active LRU list in the case.
> > > - */
> > > - if (comp_ret || !dlen || dlen >= PAGE_SIZE) {
> > > - if (!wb_enabled) {
> > > - comp_ret = comp_ret ? comp_ret : -EINVAL;
> > > - goto unlock;
> > > - }
> > > - comp_ret = 0;
> > > - dlen = PAGE_SIZE;
> > > - dst = kmap_local_page(page);
> > > - mapped = true;
> > > - }
> > > + acomp_ctx->sg_outputs->sgl->length = slen;
> > >
> > > - gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM |
> > __GFP_MOVABLE;
> > > - handle = zs_malloc(pool->zs_pool, dlen, gfp, page_to_nid(page));
> > > - if (IS_ERR_VALUE(handle)) {
> > > - alloc_ret = PTR_ERR((void *)handle);
> > > - goto unlock;
> > > - }
> > > + acomp_request_set_dst_sg(acomp_ctx->req,
> > > + acomp_ctx->sg_outputs->sgl,
> > > + slen);
> > > +
> > > + err = crypto_wait_req(crypto_acomp_compress(acomp_ctx-
> > >req),
> > > + &acomp_ctx->wait);
> > > +
> > > + acomp_ctx->sg_outputs->sgl->length = acomp_ctx->req-
> > >dlen;
> > > +
> > > + /*
> > > + * If a page cannot be compressed into a size smaller than
> > > + * PAGE_SIZE, save the content as is without a compression,
> > to
> > > + * keep the LRU order of writebacks. If writeback is disabled,
> > > + * reject the page since it only adds metadata overhead.
> > > + * swap_writeout() will put the page back to the active LRU
> > list
> > > + * in the case.
> > > + *
> > > + * It is assumed that any compressor that sets the output
> > length
> > > + * to 0 or a value >= PAGE_SIZE will also return a negative
> > > + * error status in @err; i.e, will not return a successful
> > > + * compression status in @err in this case.
> > > + */
> >
> > Ugh, checking the compression error and checking the compression length
> > are now in separate places so we need to check if writeback is disabled
> > in separate places and store the page as-is. It's ugly, and I think the
> > current code is not correct.
>
> The code is 100% correct. You need to spend more time understanding
> the code. I have stated my assumption above in the comments to
> help in understanding the "why".
>
> From a maintainer, I would expect more responsible statements than
> this. A flippant remark made without understanding the code (and,
> disparaging the comments intended to help you do this), can impact
> someone's career. I am held accountable in my job based on your
> comments.
>
> That said, I have worked tirelessly and innovated to make the code
> compliant with Herbert's suggestions (which btw have enabled an
> elegant batching implementation and code commonality for IAA and
> software compressors), validated it thoroughly for IAA and ZSTD to
> ensure that both demonstrate performance improvements, which
> are crucial for memory savings. I am proud of this work.
I really do NOT appreciate the personal attack here. I am not sure why
my comment came across as a "flippant remark".
Let me be clear, I never said anything bad about "this work", or
expressed that I do not want to see it merged. You did a good job and
you should be proud of your work.
That being said, code review is part of the process, and you should know
better than anyone given how much this series evolved over 13 revisions
of careful reviews. I spent a considerable amount of time reviewing
previous revisions, pointing out problems, and helping this series
evolve. Telling me that I "should spend more time understanding the
code" is enraging at this point.
To be even more clear, I gain NOTHING by reviewing your code and helping
you land this work. I also have a job, and it's not reviewing your code.
I would tread very carefully if I were you.
Let's keep the discussion technical and civil. I will NOT tolerate such
comments going forward.
>
>
> >
> > > + if (err && !wb_enabled)
> > > + goto compress_error;
> > > +
> > > + for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > > + j = k + i;
> >
> > Please use meaningful iterator names rather than i, j, and k and the huge
> > comment explaining what they are.
>
> I happen to have a different view: having longer iterator names firstly makes
> code seem "verbose" and detracts from readability, not to mention exceeding the
> 80-character line limit. The comments are essential for code maintainability
> and avoid out-of-bounds errors when the next zswap developer wants to
> optimize the code.
>
> One drawback of i/j/k iterators is mis-typing errors which cannot be caught
> at compile time. Let me think some more about how to strike a good balance.
I think if we get rid of the outer loop, things will get much simpler. I
initially thought the acomp API would handle the looping internally for
SW compressors.
>
> >
> > > + dst = acomp_ctx->buffers[k];
> > > + dlen = sg->length | *errp;
> >
> > Why are we doing this?
> >
> > > +
> > > + if (dlen < 0) {
> >
> > We should do the incompressible page handling also if dlen is PAGE_SIZE,
> > or if the compression failed (I guess that's the intention of bit OR'ing
> > with *errp?)
>
> Yes, indeed: that's the intention of bit OR'ing with *errp.
This is not very readable.
^ permalink raw reply [flat|nested] 47+ messages in thread

* Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.
2025-11-13 23:55 ` Sridhar, Kanchana P
2025-11-14 0:46 ` Yosry Ahmed
@ 2025-11-14 5:52 ` Yosry Ahmed
2025-11-14 6:43 ` Sridhar, Kanchana P
1 sibling, 1 reply; 47+ messages in thread
From: Yosry Ahmed @ 2025-11-14 5:52 UTC (permalink / raw)
To: Sridhar, Kanchana P
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
senozhatsky, sj, kasong, linux-crypto, herbert, davem, clabbe,
ardb, ebiggers, surenb, Accardi, Kristen C, Gomes, Vinicius,
Feghali, Wajdi K, Gopal, Vinodh
On Thu, Nov 13, 2025 at 11:55:10PM +0000, Sridhar, Kanchana P wrote:
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > Sent: Thursday, November 13, 2025 1:35 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> > compress batching of large folios.
> >
[..]
> > > + /*
> > > + * If a page cannot be compressed into a size smaller than
> > > + * PAGE_SIZE, save the content as is without a compression,
> > to
> > > + * keep the LRU order of writebacks. If writeback is disabled,
> > > + * reject the page since it only adds metadata overhead.
> > > + * swap_writeout() will put the page back to the active LRU
> > list
> > > + * in the case.
> > > + *
> > > + * It is assumed that any compressor that sets the output
> > length
> > > + * to 0 or a value >= PAGE_SIZE will also return a negative
> > > + * error status in @err; i.e, will not return a successful
> > > + * compression status in @err in this case.
> > > + */
> >
> > Ugh, checking the compression error and checking the compression length
> > are now in separate places so we need to check if writeback is disabled
> > in separate places and store the page as-is. It's ugly, and I think the
> > current code is not correct.
>
> The code is 100% correct. You need to spend more time understanding
> the code. I have stated my assumption above in the comments to
> help in understanding the "why".
>
> From a maintainer, I would expect more responsible statements than
> this. A flippant remark made without understanding the code (and,
> disparaging the comments intended to help you do this), can impact
> someone's career. I am held accountable in my job based on your
> comments.
>
> That said, I have worked tirelessly and innovated to make the code
> compliant with Herbert's suggestions (which btw have enabled an
> elegant batching implementation and code commonality for IAA and
> software compressors), validated it thoroughly for IAA and ZSTD to
> ensure that both demonstrate performance improvements, which
> are crucial for memory savings. I am proud of this work.
>
>
> >
> > > + if (err && !wb_enabled)
> > > + goto compress_error;
> > > +
> > > + for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > > + j = k + i;
> >
> > Please use meaningful iterator names rather than i, j, and k and the huge
> > comment explaining what they are.
>
> I happen to have a different view: having longer iterator names firstly makes
> code seem "verbose" and detracts from readability, not to mention exceeding the
> 80-character line limit. The comments are essential for code maintainability
> and avoid out-of-bounds errors when the next zswap developer wants to
> optimize the code.
>
> One drawback of i/j/k iterators is mis-typing errors which cannot be caught
> at compile time. Let me think some more about how to strike a good balance.
>
> >
> > > + dst = acomp_ctx->buffers[k];
> > > + dlen = sg->length | *errp;
> >
> > Why are we doing this?
> >
> > > +
> > > + if (dlen < 0) {
> >
> > We should do the incompressible page handling also if dlen is PAGE_SIZE,
> > or if the compression failed (I guess that's the intention of bit OR'ing
> > with *errp?)
>
> Yes, indeed: that's the intention of bit OR'ing with *errp.
...and you never really answered my question. In the existing code we
store the page as incompressible if writeback is enabled AND
crypto_wait_req() fails or dlen is zero or PAGE_SIZE. We check above
if crypto_wait_req() fails and writeback is disabled, but what about the
rest?
We don't check again whether writeback is enabled before storing the page
as incompressible, and we do not check if dlen is zero or PAGE_SIZE. Are
these cases no longer possible?
Also, why use errp; why not explicitly use the appropriate error code?
It's also unclear to me why the error code is always zero with HW
compression.
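For reference, the check in the pre-batching code (the removed lines
visible in the diff above) boils down to the following toy model;
illustrative only, with made-up return codes:

#include <stdbool.h>
#include <stdio.h>

#define TOY_PAGE_SIZE 4096

/* 1 = store compressed, 0 = store the page as-is, -1 = reject the page. */
static int toy_store_decision(int comp_ret, unsigned int dlen, bool wb_enabled)
{
	if (comp_ret || !dlen || dlen >= TOY_PAGE_SIZE)
		return wb_enabled ? 0 : -1;
	return 1;
}

int main(void)
{
	printf("%d\n", toy_store_decision(0, 1200, true));		/* compressed */
	printf("%d\n", toy_store_decision(0, TOY_PAGE_SIZE, true));	/* as-is */
	printf("%d\n", toy_store_decision(-22, 0, false));		/* rejected */
	return 0;
}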
>
> >
> > > + dlen = PAGE_SIZE;
> > > + dst = kmap_local_page(folio_page(folio, start
> > + j));
> > > + }
> > > +
> > > + handle = zs_malloc(pool->zs_pool, dlen, gfp, nid);
> > >
> > > - zs_obj_write(pool->zs_pool, handle, dst, dlen);
> > > - entry->handle = handle;
> > > - entry->length = dlen;
> > > + if (IS_ERR_VALUE(handle)) {
> > > + if (PTR_ERR((void *)handle) == -ENOSPC)
> > > + zswap_reject_compress_poor++;
> > > + else
> > > + zswap_reject_alloc_fail++;
> > >
> > > -unlock:
> > > - if (mapped)
> > > - kunmap_local(dst);
> > > - if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
> > > - zswap_reject_compress_poor++;
> > > - else if (comp_ret)
> > > - zswap_reject_compress_fail++;
> > > - else if (alloc_ret)
> > > - zswap_reject_alloc_fail++;
> > > + goto err_unlock;
> > > + }
> > > +
> > > + zs_obj_write(pool->zs_pool, handle, dst, dlen);
> > > + entries[j]->handle = handle;
> > > + entries[j]->length = dlen;
> > > + if (dst != acomp_ctx->buffers[k])
> > > + kunmap_local(dst);
> > > + }
> > > + } /* finished compress and store nr_pages. */
> > > +
> > > + mutex_unlock(&acomp_ctx->mutex);
> > > + return true;
> > > +
> > > +compress_error:
> > > + for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > > + if ((int)sg->length < 0) {
> > > + if ((int)sg->length == -ENOSPC)
> > > + zswap_reject_compress_poor++;
> > > + else
> > > + zswap_reject_compress_fail++;
> > > + }
> > > + }
> > >
> > > +err_unlock:
> > > mutex_unlock(&acomp_ctx->mutex);
> > > - return comp_ret == 0 && alloc_ret == 0;
> > > + return false;
> > > }
> > >
> > > static bool zswap_decompress(struct zswap_entry *entry, struct folio
> > *folio)
> > > @@ -1488,12 +1604,9 @@ static bool zswap_store_pages(struct folio
> > *folio,
> > > INIT_LIST_HEAD(&entries[i]->lru);
> > > }
> > >
> > > - for (i = 0; i < nr_pages; ++i) {
> > > - struct page *page = folio_page(folio, start + i);
> > > -
> > > - if (!zswap_compress(page, entries[i], pool, wb_enabled))
> > > - goto store_pages_failed;
> > > - }
> > > + if (unlikely(!zswap_compress(folio, start, nr_pages, entries, pool,
> > > + nid, wb_enabled)))
> > > + goto store_pages_failed;
> > >
> > > for (i = 0; i < nr_pages; ++i) {
> > > struct zswap_entry *old, *entry = entries[i];
> > > --
> > > 2.27.0
> > >
^ permalink raw reply [flat|nested] 47+ messages in thread

* RE: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.
2025-11-14 5:52 ` Yosry Ahmed
@ 2025-11-14 6:43 ` Sridhar, Kanchana P
2025-11-14 15:37 ` Yosry Ahmed
0 siblings, 1 reply; 47+ messages in thread
From: Sridhar, Kanchana P @ 2025-11-14 6:43 UTC (permalink / raw)
To: Yosry Ahmed
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
senozhatsky, sj, kasong, linux-crypto, herbert, davem, clabbe,
ardb, ebiggers, surenb, Accardi, Kristen C, Gomes, Vinicius,
Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Thursday, November 13, 2025 9:52 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> compress batching of large folios.
>
> On Thu, Nov 13, 2025 at 11:55:10PM +0000, Sridhar, Kanchana P wrote:
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > Sent: Thursday, November 13, 2025 1:35 PM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > hannes@cmpxchg.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev;
> > > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > > senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> > > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > > <kristen.c.accardi@intel.com>; Gomes, Vinicius
> <vinicius.gomes@intel.com>;
> > > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > > <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress()
> with
> > > compress batching of large folios.
> > >
> [..]
> > > > + /*
> > > > + * If a page cannot be compressed into a size smaller than
> > > > + * PAGE_SIZE, save the content as is without a compression,
> > > to
> > > > + * keep the LRU order of writebacks. If writeback is disabled,
> > > > + * reject the page since it only adds metadata overhead.
> > > > + * swap_writeout() will put the page back to the active LRU
> > > list
> > > > + * in the case.
> > > > + *
> > > > + * It is assumed that any compressor that sets the output
> > > length
> > > > + * to 0 or a value >= PAGE_SIZE will also return a negative
> > > > + * error status in @err; i.e, will not return a successful
> > > > + * compression status in @err in this case.
> > > > + */
> > >
> > > Ugh, checking the compression error and checking the compression length
> > > are now in separate places so we need to check if writeback is disabled
> > > in separate places and store the page as-is. It's ugly, and I think the
> > > current code is not correct.
> >
> > The code is 100% correct. You need to spend more time understanding
> > the code. I have stated my assumption above in the comments to
> > help in understanding the "why".
> >
> > From a maintainer, I would expect more responsible statements than
> > this. A flippant remark made without understanding the code (and,
> > disparaging the comments intended to help you do this), can impact
> > someone's career. I am held accountable in my job based on your
> > comments.
> >
> > That said, I have worked tirelessly and innovated to make the code
> > compliant with Herbert's suggestions (which btw have enabled an
> > elegant batching implementation and code commonality for IAA and
> > software compressors), validated it thoroughly for IAA and ZSTD to
> > ensure that both demonstrate performance improvements, which
> > are crucial for memory savings. I am proud of this work.
> >
> >
> > >
> > > > + if (err && !wb_enabled)
> > > > + goto compress_error;
> > > > +
> > > > + for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > > > + j = k + i;
> > >
> > > Please use meaningful iterator names rather than i, j, and k and the huge
> > > comment explaining what they are.
> >
> > I happen to have a different view: having longer iterator names firstly makes
> > code seem "verbose" and detracts from readability, not to mention
> exceeding the
> > 80-character line limit. The comments are essential for code maintainability
> > and avoid out-of-bounds errors when the next zswap developer wants to
> > optimize the code.
> >
> > One drawback of i/j/k iterators is mis-typing errors which cannot be caught
> > at compile time. Let me think some more about how to strike a good
> balance.
> >
> > >
> > > > + dst = acomp_ctx->buffers[k];
> > > > + dlen = sg->length | *errp;
> > >
> > > Why are we doing this?
> > >
> > > > +
> > > > + if (dlen < 0) {
> > >
> > > We should do the incompressible page handling also if dlen is PAGE_SIZE,
> > > or if the compression failed (I guess that's the intention of bit OR'ing
> > > with *errp?)
> >
> > Yes, indeed: that's the intention of bit OR'ing with *errp.
>
> ..and you never really answered my question. In the exising code we
> store the page as incompressible if writeback is enabled AND
> crypto_wait_req() fails or dlen is zero or PAGE_SIZE. We check above
> if crypto_wait_req() fails and writeback is disabled, but what about the
> rest?
Let me explain this some more. The new code relies only on the assumption
that if dlen is zero or >= PAGE_SIZE, the compressor will not return 0
(a "successful" status). In other words, the compressor is expected to return
an error status in this case, i.e., a negative error code.
Under these (hopefully valid) assumptions, the simple case where the compressor
returns an error status and writeback is disabled is handled by the
"goto compress_error".
The rest is handled as follows:
1) First, I need to adapt sg_outputs->sgl->length to represent the compressed
length for software compressors, so I do this after crypto_wait_req()
returns:
acomp_ctx->sg_outputs->sgl->length = acomp_ctx->req->dlen;
I did not want to propose any changes to the crypto software compressors' protocols.
2) After the check for the "if (err && !wb_enabled)" case, the new code has this:

	for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
		j = k + i;
		dst = acomp_ctx->buffers[k];
		dlen = sg->length | *errp;

		if (dlen < 0) {
			dlen = PAGE_SIZE;
			dst = kmap_local_page(folio_page(folio, start + j));
		}
For the batching compressor, namely iaa_crypto, the individual output SG
entries follow Herbert's requirements: each sg->length is either the
compressed length or the error status (a negative error code).
Then, to know whether to store the page as incompressible, all I need is to
test either whether sg->length is negative (for batching compressors), or
whether sg->length bit-OR'ed with the crypto_wait_req() return status ("err")
is negative. This is accomplished by the "dlen = sg->length | *errp;".
I believe this maintains backward compatibility with the existing code.
Please let me know if you agree.
>
> We don't check again if writeback is enabled before storing the page is
> incompressible, and we do not check if dlen is zero or PAGE_SIZE. Are
> these cases no longer possible?
Hope the above explanation clarifies things some more. These cases
are possible, and as long as the compressor returns an error status, they
should be correctly handled by the new code.
>
> Also, why use errp, why not explicitly use the appropriate error code?
> It's also unclear to me why the error code is always zero with HW
> compression?
This is because of the sg->length requirements (compressed length/error)
for the batching interface suggested by Herbert. Hence, I define err_sg as 0
up front and set errp to &err_sg for batching compressors. For software
compressors, errp is set to &err, i.e., the above check will always apply
the software compressor's error status to the compressed length via
the bit-OR to determine if the page needs to be stored uncompressed.
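If it helps, here is a small self-contained userspace model of just the errp
indirection and the bit-OR check (plain ints stand in for sg->length and the
crypto_wait_req() status, and all names here are mine for illustration; this
is not the patch code):

	#include <stdio.h>

	/*
	 * Model of the errp indirection: for a batching compressor the
	 * per-page error is already in the length slot, so a constant 0 is
	 * OR'ed in; for a software compressor the single crypto_wait_req()
	 * status is OR'ed into the (only) length.
	 */
	static void classify(int batching, int err, const int *len, int n)
	{
		int err_sg = 0;
		const int *errp = batching ? &err_sg : &err;
		int k;

		for (k = 0; k < n; k++) {
			int dlen = len[k] | *errp;

			if (dlen < 0)
				printf("page %d: store uncompressed\n", k);
			else
				printf("page %d: store %d compressed bytes\n", k, dlen);
		}
	}

	int main(void)
	{
		int hw_len[] = { 1500, -28, 2000 };	/* batching: -ENOSPC recorded per page */
		int sw_len[] = { 1500 };		/* software: the length alone looks fine */

		classify(1, 0, hw_len, 3);		/* err stays 0 in the batching case */
		classify(0, -22, sw_len, 1);		/* -EINVAL is reported only via err */
		return 0;
	}

The only difference between the two calls is where the error is reported;
the per-page decision is the same either way.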
>
> >
> > >
> > > > + dlen = PAGE_SIZE;
> > > > + dst = kmap_local_page(folio_page(folio, start
> > > + j));
> > > > + }
> > > > +
> > > > + handle = zs_malloc(pool->zs_pool, dlen, gfp, nid);
> > > >
> > > > - zs_obj_write(pool->zs_pool, handle, dst, dlen);
> > > > - entry->handle = handle;
> > > > - entry->length = dlen;
> > > > + if (IS_ERR_VALUE(handle)) {
> > > > + if (PTR_ERR((void *)handle) == -ENOSPC)
> > > > + zswap_reject_compress_poor++;
> > > > + else
> > > > + zswap_reject_alloc_fail++;
> > > >
> > > > -unlock:
> > > > - if (mapped)
> > > > - kunmap_local(dst);
> > > > - if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
> > > > - zswap_reject_compress_poor++;
> > > > - else if (comp_ret)
> > > > - zswap_reject_compress_fail++;
> > > > - else if (alloc_ret)
> > > > - zswap_reject_alloc_fail++;
> > > > + goto err_unlock;
> > > > + }
> > > > +
> > > > + zs_obj_write(pool->zs_pool, handle, dst, dlen);
> > > > + entries[j]->handle = handle;
> > > > + entries[j]->length = dlen;
> > > > + if (dst != acomp_ctx->buffers[k])
> > > > + kunmap_local(dst);
> > > > + }
> > > > + } /* finished compress and store nr_pages. */
> > > > +
> > > > + mutex_unlock(&acomp_ctx->mutex);
> > > > + return true;
> > > > +
> > > > +compress_error:
> > > > + for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > > > + if ((int)sg->length < 0) {
> > > > + if ((int)sg->length == -ENOSPC)
> > > > + zswap_reject_compress_poor++;
> > > > + else
> > > > + zswap_reject_compress_fail++;
> > > > + }
> > > > + }
> > > >
> > > > +err_unlock:
> > > > mutex_unlock(&acomp_ctx->mutex);
> > > > - return comp_ret == 0 && alloc_ret == 0;
> > > > + return false;
> > > > }
> > > >
> > > > static bool zswap_decompress(struct zswap_entry *entry, struct folio
> > > *folio)
> > > > @@ -1488,12 +1604,9 @@ static bool zswap_store_pages(struct folio
> > > *folio,
> > > > INIT_LIST_HEAD(&entries[i]->lru);
> > > > }
> > > >
> > > > - for (i = 0; i < nr_pages; ++i) {
> > > > - struct page *page = folio_page(folio, start + i);
> > > > -
> > > > - if (!zswap_compress(page, entries[i], pool, wb_enabled))
> > > > - goto store_pages_failed;
> > > > - }
> > > > + if (unlikely(!zswap_compress(folio, start, nr_pages, entries, pool,
> > > > + nid, wb_enabled)))
> > > > + goto store_pages_failed;
> > > >
> > > > for (i = 0; i < nr_pages; ++i) {
> > > > struct zswap_entry *old, *entry = entries[i];
> > > > --
> > > > 2.27.0
> > > >
^ permalink raw reply	[flat|nested] 47+ messages in thread
* Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.
2025-11-14 6:43 ` Sridhar, Kanchana P
@ 2025-11-14 15:37 ` Yosry Ahmed
2025-11-14 19:23 ` Sridhar, Kanchana P
2025-11-26 5:46 ` Herbert Xu
0 siblings, 2 replies; 47+ messages in thread
From: Yosry Ahmed @ 2025-11-14 15:37 UTC (permalink / raw)
To: Sridhar, Kanchana P, SeongJae Park
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
senozhatsky, sj, kasong, linux-crypto, herbert, davem, clabbe,
ardb, ebiggers, surenb, Accardi, Kristen C, Gomes, Vinicius,
Feghali, Wajdi K, Gopal, Vinodh
On Fri, Nov 14, 2025 at 06:43:21AM +0000, Sridhar, Kanchana P wrote:
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > Sent: Thursday, November 13, 2025 9:52 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> > compress batching of large folios.
> >
> > On Thu, Nov 13, 2025 at 11:55:10PM +0000, Sridhar, Kanchana P wrote:
> > >
> > > > -----Original Message-----
> > > > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > > Sent: Thursday, November 13, 2025 1:35 PM
> > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > > hannes@cmpxchg.org; nphamcs@gmail.com;
> > chengming.zhou@linux.dev;
> > > > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > > > senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> > > > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > > > <kristen.c.accardi@intel.com>; Gomes, Vinicius
> > <vinicius.gomes@intel.com>;
> > > > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > > > <vinodh.gopal@intel.com>
> > > > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress()
> > with
> > > > compress batching of large folios.
> > > >
> > [..]
> > > > > + /*
> > > > > + * If a page cannot be compressed into a size smaller than
> > > > > + * PAGE_SIZE, save the content as is without a compression,
> > > > to
> > > > > + * keep the LRU order of writebacks. If writeback is disabled,
> > > > > + * reject the page since it only adds metadata overhead.
> > > > > + * swap_writeout() will put the page back to the active LRU
> > > > list
> > > > > + * in the case.
> > > > > + *
> > > > > + * It is assumed that any compressor that sets the output
> > > > length
> > > > > + * to 0 or a value >= PAGE_SIZE will also return a negative
> > > > > + * error status in @err; i.e, will not return a successful
> > > > > + * compression status in @err in this case.
> > > > > + */
> > > >
> > > > Ugh, checking the compression error and checking the compression length
> > > > are now in separate places so we need to check if writeback is disabled
> > > > in separate places and store the page as-is. It's ugly, and I think the
> > > > current code is not correct.
> > >
> > > The code is 100% correct. You need to spend more time understanding
> > > the code. I have stated my assumption above in the comments to
> > > help in understanding the "why".
> > >
> > > From a maintainer, I would expect more responsible statements than
> > > this. A flippant remark made without understanding the code (and,
> > > disparaging the comments intended to help you do this), can impact
> > > someone's career. I am held accountable in my job based on your
> > > comments.
> > >
> > > That said, I have worked tirelessly and innovated to make the code
> > > compliant with Herbert's suggestions (which btw have enabled an
> > > elegant batching implementation and code commonality for IAA and
> > > software compressors), validated it thoroughly for IAA and ZSTD to
> > > ensure that both demonstrate performance improvements, which
> > > are crucial for memory savings. I am proud of this work.
> > >
> > >
> > > >
> > > > > + if (err && !wb_enabled)
> > > > > + goto compress_error;
> > > > > +
> > > > > + for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > > > > + j = k + i;
> > > >
> > > > Please use meaningful iterator names rather than i, j, and k and the huge
> > > > comment explaining what they are.
> > >
> > > I happen to have a different view: having longer iterator names firstly makes
> > > code seem "verbose" and detracts from readability, not to mention
> > exceeding the
> > > 80-character line limit. The comments are essential for code maintainability
> > > and avoid out-of-bounds errors when the next zswap developer wants to
> > > optimize the code.
> > >
> > > One drawback of i/j/k iterators is mis-typing errors which cannot be caught
> > > at compile time. Let me think some more about how to strike a good
> > balance.
> > >
> > > >
> > > > > + dst = acomp_ctx->buffers[k];
> > > > > + dlen = sg->length | *errp;
> > > >
> > > > Why are we doing this?
> > > >
> > > > > +
> > > > > + if (dlen < 0) {
> > > >
> > > > We should do the incompressible page handling also if dlen is PAGE_SIZE,
> > > > or if the compression failed (I guess that's the intention of bit OR'ing
> > > > with *errp?)
> > >
> > > Yes, indeed: that's the intention of bit OR'ing with *errp.
> >
> > ..and you never really answered my question. In the exising code we
> > store the page as incompressible if writeback is enabled AND
> > crypto_wait_req() fails or dlen is zero or PAGE_SIZE. We check above
> > if crypto_wait_req() fails and writeback is disabled, but what about the
> > rest?
>
> Let me explain this some more. The new code only relies on the assumption
> that if dlen is zero or >= PAGE_SIZE, the compressor will not return a 0
> ("successful status"). In other words, the compressor will return an error status
> in this case, which is expected to be a negative error code.
I am not sure if all compressors do that, especially for the case where
dlen >= PAGE_SIZE. The existing code does not assume that there will be
an error code in these cases.
For the dlen == 0 case, the check was introduced recently by commit
dca4437a5861 ("mm/zswap: store <PAGE_SIZE compression failed page
as-is"). Looking through the history it seems like it was introduced in
v4 of that patch but I don't see the reasoning.
SeongJae, did you observe any compressors returning dlen == 0 but no
error code? What was the reasoning behind the dlen == 0 check?
>
> Under these (hopefully valid) assumptions, the code handles the simple case
> of an error compression return status and writeback is disabled, by the
> "goto compress_error".
>
> The rest is handled by these:
>
> 1) First, I need to adapt the use of sg_outputs->sgl->length to represent the
> compress length for software compressors, so I do this after crypto_wait_req()
> returns:
>
> acomp_ctx->sg_outputs->sgl->length = acomp_ctx->req->dlen;
For SW compressors, why is acomp_ctx->sg_outputs->sgl->length not set?
IIUC we are using the same API for SW and HW compressors, why is the
output length in different places for each of them?
>
> I did not want to propose any changes to crypto software compressors protocols.
>
> 2) After the check for the "if (err && !wb_enabled)" case, the new code has this:
>
> for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> j = k + i;
> dst = acomp_ctx->buffers[k];
> dlen = sg->length | *errp;
>
> if (dlen < 0) {
> dlen = PAGE_SIZE;
> dst = kmap_local_page(folio_page(folio, start + j));
> }
>
> For batching compressors, namely, iaa_crypto, the individual output SG
> lists sg->length follows the requirements from Herbert: each sg->length
> is the compressed length or the error status (a negative error code).
>
> Then all I need to know whether to store the page as incompressible
> is to either directly test if sg->length is negative (for batching compressors),
> or sg->length bit-OR'ed with the crypto_wait_req() return status ("err")
> is negative. This is accomplished by the "dlen = sg->length | *errp;".
>
> I believe this maintains backward compatibility with the existing code.
> Please let me know if you agree.
For batching compressors, will 'err' be set as well, or just sg->length?
If it's just sg->length, then we need to check again if WB is enabled
here before storing the page uncompressed. Right?
>
> >
> > We don't check again if writeback is enabled before storing the page is
> > incompressible, and we do not check if dlen is zero or PAGE_SIZE. Are
> > these cases no longer possible?
>
> Hope the above explanation clarifies things some more? These case
> are possible, and as long as they return an error status, they should be
> correctly handled by the new code.
As mentioned above, I am not sure if that's correct for dlen >=
PAGE_SIZE.
>
> >
> > Also, why use errp, why not explicitly use the appropriate error code?
> > It's also unclear to me why the error code is always zero with HW
> > compression?
>
> This is because of the sg->length requirements (compressed length/error)
> for the batching interface suggested by Herbert. Hence, I upfront define
> err_sg to 0, and, set errp to &err_sg for batching compressors. For software
> compressors, errp is set to &err, namely, the above check will always apply
> the software compressor's error status to the compressed length via
> the bit-OR to determine if the page needs to be stored uncompressed.
Thanks for the clarification. I understand that the error code has
different sources for SW and HW compressors, but I do not like using
errp as an indirection. It makes the code unclear. I would rather we
explicitly check err for SW compressors and dlen for HW compressors.
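To be concrete, the kind of explicit check I have in mind looks roughly like
this standalone sketch (the names "batching", "req_err" and "page_len" are
made up for illustration; I am not proposing this exact code):

	#include <stdbool.h>
	#include <stdio.h>

	/*
	 * Sketch of the explicit alternative: check the request status
	 * directly for a software compressor, and the per-page length
	 * directly for a hardware batching compressor, with no errp
	 * indirection in between.
	 */
	static bool page_failed(bool batching, int req_err, int page_len)
	{
		if (batching)
			return page_len < 0;	/* per-page status lives in the length */

		return req_err != 0;		/* single status from crypto_wait_req() */
	}

	int main(void)
	{
		printf("%d\n", page_failed(true, 0, -28));	/* 1: HW page failed */
		printf("%d\n", page_failed(false, -22, 1500));	/* 1: SW request failed */
		printf("%d\n", page_failed(false, 0, 1500));	/* 0: both paths succeeded */
		return 0;
	}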
That being said, I thought what Herbert suggested was that the same API
is used for both SW and HW compressors. IOW, either way we submit a
batch of pages (8 pages for SW compressors), and then the crypto API
would either give the entire batch to the compressor if it supports
batching, or loop over them internally and hand them page-by-page to
the compressor.
This would simplify usage as we do not have to handle the differences in
zswap.
If that is not doable, at the very least the API should be consistent.
Right now the error code and length are propagated differently to the
caller based on whether or not the compressor supports batching.
>
>
> >
> > >
> > > >
> > > > > + dlen = PAGE_SIZE;
> > > > > + dst = kmap_local_page(folio_page(folio, start
> > > > + j));
> > > > > + }
> > > > > +
> > > > > + handle = zs_malloc(pool->zs_pool, dlen, gfp, nid);
> > > > >
> > > > > - zs_obj_write(pool->zs_pool, handle, dst, dlen);
> > > > > - entry->handle = handle;
> > > > > - entry->length = dlen;
> > > > > + if (IS_ERR_VALUE(handle)) {
> > > > > + if (PTR_ERR((void *)handle) == -ENOSPC)
> > > > > + zswap_reject_compress_poor++;
> > > > > + else
> > > > > + zswap_reject_alloc_fail++;
> > > > >
> > > > > -unlock:
> > > > > - if (mapped)
> > > > > - kunmap_local(dst);
> > > > > - if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
> > > > > - zswap_reject_compress_poor++;
> > > > > - else if (comp_ret)
> > > > > - zswap_reject_compress_fail++;
> > > > > - else if (alloc_ret)
> > > > > - zswap_reject_alloc_fail++;
> > > > > + goto err_unlock;
> > > > > + }
> > > > > +
> > > > > + zs_obj_write(pool->zs_pool, handle, dst, dlen);
> > > > > + entries[j]->handle = handle;
> > > > > + entries[j]->length = dlen;
> > > > > + if (dst != acomp_ctx->buffers[k])
> > > > > + kunmap_local(dst);
> > > > > + }
> > > > > + } /* finished compress and store nr_pages. */
> > > > > +
> > > > > + mutex_unlock(&acomp_ctx->mutex);
> > > > > + return true;
> > > > > +
> > > > > +compress_error:
> > > > > + for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > > > > + if ((int)sg->length < 0) {
> > > > > + if ((int)sg->length == -ENOSPC)
> > > > > + zswap_reject_compress_poor++;
> > > > > + else
> > > > > + zswap_reject_compress_fail++;
> > > > > + }
> > > > > + }
> > > > >
> > > > > +err_unlock:
> > > > > mutex_unlock(&acomp_ctx->mutex);
> > > > > - return comp_ret == 0 && alloc_ret == 0;
> > > > > + return false;
> > > > > }
> > > > >
> > > > > static bool zswap_decompress(struct zswap_entry *entry, struct folio
> > > > *folio)
> > > > > @@ -1488,12 +1604,9 @@ static bool zswap_store_pages(struct folio
> > > > *folio,
> > > > > INIT_LIST_HEAD(&entries[i]->lru);
> > > > > }
> > > > >
> > > > > - for (i = 0; i < nr_pages; ++i) {
> > > > > - struct page *page = folio_page(folio, start + i);
> > > > > -
> > > > > - if (!zswap_compress(page, entries[i], pool, wb_enabled))
> > > > > - goto store_pages_failed;
> > > > > - }
> > > > > + if (unlikely(!zswap_compress(folio, start, nr_pages, entries, pool,
> > > > > + nid, wb_enabled)))
> > > > > + goto store_pages_failed;
> > > > >
> > > > > for (i = 0; i < nr_pages; ++i) {
> > > > > struct zswap_entry *old, *entry = entries[i];
> > > > > --
> > > > > 2.27.0
> > > > >
^ permalink raw reply	[flat|nested] 47+ messages in thread
* RE: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.
2025-11-14 15:37 ` Yosry Ahmed
@ 2025-11-14 19:23 ` Sridhar, Kanchana P
2025-11-14 19:44 ` Yosry Ahmed
2025-11-26 5:46 ` Herbert Xu
1 sibling, 1 reply; 47+ messages in thread
From: Sridhar, Kanchana P @ 2025-11-14 19:23 UTC (permalink / raw)
To: Yosry Ahmed, SeongJae Park
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, ryan.roberts, 21cnbao, ying.huang, akpm,
senozhatsky, sj, kasong, linux-crypto, herbert, davem, clabbe,
ardb, ebiggers, surenb, Accardi, Kristen C, Gomes, Vinicius,
Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Friday, November 14, 2025 7:38 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; SeongJae Park
> <sj@kernel.org>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> compress batching of large folios.
>
> On Fri, Nov 14, 2025 at 06:43:21AM +0000, Sridhar, Kanchana P wrote:
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > Sent: Thursday, November 13, 2025 9:52 PM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > hannes@cmpxchg.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev;
> > > usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > > senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> > > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > > <kristen.c.accardi@intel.com>; Gomes, Vinicius
> <vinicius.gomes@intel.com>;
> > > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > > <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress()
> with
> > > compress batching of large folios.
> > >
> > > On Thu, Nov 13, 2025 at 11:55:10PM +0000, Sridhar, Kanchana P wrote:
> > > >
> > > > > -----Original Message-----
> > > > > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > > > > Sent: Thursday, November 13, 2025 1:35 PM
> > > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > > > hannes@cmpxchg.org; nphamcs@gmail.com;
> > > chengming.zhou@linux.dev;
> > > > > usamaarif642@gmail.com; ryan.roberts@arm.com;
> 21cnbao@gmail.com;
> > > > > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > > > > senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com;
> linux-
> > > > > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > > > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > > > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > > > > <kristen.c.accardi@intel.com>; Gomes, Vinicius
> > > <vinicius.gomes@intel.com>;
> > > > > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > > > > <vinodh.gopal@intel.com>
> > > > > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched
> zswap_compress()
> > > with
> > > > > compress batching of large folios.
> > > > >
> > > [..]
> > > > > > + /*
> > > > > > + * If a page cannot be compressed into a size smaller
> than
> > > > > > + * PAGE_SIZE, save the content as is without a
> compression,
> > > > > to
> > > > > > + * keep the LRU order of writebacks. If writeback is
> disabled,
> > > > > > + * reject the page since it only adds metadata
> overhead.
> > > > > > + * swap_writeout() will put the page back to the
> active LRU
> > > > > list
> > > > > > + * in the case.
> > > > > > + *
> > > > > > + * It is assumed that any compressor that sets the
> output
> > > > > length
> > > > > > + * to 0 or a value >= PAGE_SIZE will also return a
> negative
> > > > > > + * error status in @err; i.e, will not return a successful
> > > > > > + * compression status in @err in this case.
> > > > > > + */
> > > > >
> > > > > Ugh, checking the compression error and checking the compression
> length
> > > > > are now in separate places so we need to check if writeback is
> disabled
> > > > > in separate places and store the page as-is. It's ugly, and I think the
> > > > > current code is not correct.
> > > >
> > > > The code is 100% correct. You need to spend more time understanding
> > > > the code. I have stated my assumption above in the comments to
> > > > help in understanding the "why".
> > > >
> > > > From a maintainer, I would expect more responsible statements than
> > > > this. A flippant remark made without understanding the code (and,
> > > > disparaging the comments intended to help you do this), can impact
> > > > someone's career. I am held accountable in my job based on your
> > > > comments.
> > > >
> > > > That said, I have worked tirelessly and innovated to make the code
> > > > compliant with Herbert's suggestions (which btw have enabled an
> > > > elegant batching implementation and code commonality for IAA and
> > > > software compressors), validated it thoroughly for IAA and ZSTD to
> > > > ensure that both demonstrate performance improvements, which
> > > > are crucial for memory savings. I am proud of this work.
> > > >
> > > >
> > > > >
> > > > > > + if (err && !wb_enabled)
> > > > > > + goto compress_error;
> > > > > > +
> > > > > > + for_each_sg(acomp_ctx->sg_outputs->sgl, sg,
> nr_comps, k) {
> > > > > > + j = k + i;
> > > > >
> > > > > Please use meaningful iterator names rather than i, j, and k and the
> huge
> > > > > comment explaining what they are.
> > > >
> > > > I happen to have a different view: having longer iterator names firstly
> makes
> > > > code seem "verbose" and detracts from readability, not to mention
> > > exceeding the
> > > > 80-character line limit. The comments are essential for code
> maintainability
> > > > and avoid out-of-bounds errors when the next zswap developer wants
> to
> > > > optimize the code.
> > > >
> > > > One drawback of i/j/k iterators is mis-typing errors which cannot be
> caught
> > > > at compile time. Let me think some more about how to strike a good
> > > balance.
> > > >
> > > > >
> > > > > > + dst = acomp_ctx->buffers[k];
> > > > > > + dlen = sg->length | *errp;
> > > > >
> > > > > Why are we doing this?
> > > > >
> > > > > > +
> > > > > > + if (dlen < 0) {
> > > > >
> > > > > We should do the incompressible page handling also if dlen is
> PAGE_SIZE,
> > > > > or if the compression failed (I guess that's the intention of bit OR'ing
> > > > > with *errp?)
> > > >
> > > > Yes, indeed: that's the intention of bit OR'ing with *errp.
> > >
> > > ..and you never really answered my question. In the exising code we
> > > store the page as incompressible if writeback is enabled AND
> > > crypto_wait_req() fails or dlen is zero or PAGE_SIZE. We check above
> > > if crypto_wait_req() fails and writeback is disabled, but what about the
> > > rest?
> >
> > Let me explain this some more. The new code only relies on the assumption
> > that if dlen is zero or >= PAGE_SIZE, the compressor will not return a 0
> > ("successful status"). In other words, the compressor will return an error
> status
> > in this case, which is expected to be a negative error code.
>
> I am not sure if all compressors do that, especially for the case where
> dlen >= PAGE_SIZE. The existing code does not assume that there will be
> an error code in these cases.
>
> For the dlen == 0 case, the check was introduced recently by commit
> dca4437a5861 ("mm/zswap: store <PAGE_SIZE compression failed page
> as-is"). Looking through the history it seems like it was introduced in
> v4 of that patch but I don't see the reasoning.
The existing code did not check for dlen == 0 and dlen >= PAGE_SIZE
prior to commit dca4437a5861 ("mm/zswap: store <PAGE_SIZE compression
failed page as-is") either. We need SeongJae or Herbert to clarify whether
this check is needed, or if it is sufficient to rely on comp_ret, the return from
crypto_wait_req().
>
> SeongJae, did you observe any compressors returning dlen == 0 but no
> error code? What was the reasoning behind the dlen == 0 check?
>
> >
> > Under these (hopefully valid) assumptions, the code handles the simple case
> > of an error compression return status and writeback is disabled, by the
> > "goto compress_error".
> >
> > The rest is handled by these:
> >
> > 1) First, I need to adapt the use of sg_outputs->sgl->length to represent the
> > compress length for software compressors, so I do this after
> crypto_wait_req()
> > returns:
> >
> > acomp_ctx->sg_outputs->sgl->length = acomp_ctx->req->dlen;
>
> For SW compressors, why is acomp_ctx->sg_outputs->sgl->length not set?
> IIUC we are using the same API for SW and HW compressors, why is the
> output length in different places for each of them?
This is to first implement the SG lists batching interface in iaa_crypto, while
maintaining backward compatibility for SW compressors with the new API.
I believe we may want to adapt the crypto API to SW compressors
at a later point. I also believe this would be outside the scope of this patch.
It would be nice if Herbert could share his vision on this aspect.
>
> >
> > I did not want to propose any changes to crypto software compressors
> protocols.
> >
> > 2) After the check for the "if (err && !wb_enabled)" case, the new code has
> this:
> >
> > for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > j = k + i;
> > dst = acomp_ctx->buffers[k];
> > dlen = sg->length | *errp;
> >
> > if (dlen < 0) {
> > dlen = PAGE_SIZE;
> > dst = kmap_local_page(folio_page(folio, start + j));
> > }
> >
> > For batching compressors, namely, iaa_crypto, the individual output SG
> > lists sg->length follows the requirements from Herbert: each sg->length
> > is the compressed length or the error status (a negative error code).
> >
> > Then all I need to know whether to store the page as incompressible
> > is to either directly test if sg->length is negative (for batching compressors),
> > or sg->length bit-OR'ed with the crypto_wait_req() return status ("err")
> > is negative. This is accomplished by the "dlen = sg->length | *errp;".
> >
> > I believe this maintains backward compatibility with the existing code.
> > Please let me know if you agree.
>
> For batching compressors, will 'err' be set as well, or just sg->length?
> If it's just sg->length, then we need to check again if WB is enabled
> here before storing the page uncompressed. Right?
iaa_crypto will set 'err' and set the sg->length as per the batching interface
spec from Herbert.
>
> >
> > >
> > > We don't check again if writeback is enabled before storing the page is
> > > incompressible, and we do not check if dlen is zero or PAGE_SIZE. Are
> > > these cases no longer possible?
> >
> > Hope the above explanation clarifies things some more? These case
> > are possible, and as long as they return an error status, they should be
> > correctly handled by the new code.
>
> As mentioned above, I am not sure if that's correct for dlen >=
> PAGE_SIZE.
We need to get clarity on this from SeongJae/Herbert.
>
> >
> > >
> > > Also, why use errp, why not explicitly use the appropriate error code?
> > > It's also unclear to me why the error code is always zero with HW
> > > compression?
> >
> > This is because of the sg->length requirements (compressed length/error)
> > for the batching interface suggested by Herbert. Hence, I upfront define
> > err_sg to 0, and, set errp to &err_sg for batching compressors. For software
> > compressors, errp is set to &err, namely, the above check will always apply
> > the software compressor's error status to the compressed length via
> > the bit-OR to determine if the page needs to be stored uncompressed.
>
> Thanks for the clarification. I understand that the error code has
> different sources for SW and HW compressors, but I do not like using
> errp as an indirection. It makes the code unclear. I would rather we
> explicitly check err for SW compressors and dlen for HW compressors.
>
> That being said, I thought what Herbert suggested was that the same API
> is used for both SW and HW compressors. IOW, either way we submit a
> batch of pages (8 pages for SW compressors), and then the crypto API
> would either give the entire batch to the compressor if it supports
> batching, or loop over them internally and hand them page-by-page to
> the compressor.
That was not how I understood Herbert's suggestion for the batching interface.
He did suggest the following:
"Before the call to acomp, the destination SG list should contain as
many elements as the number of units. On return, the dst lengths
should be stored in each destination SG entry."
I have incorporated this suggestion in the iaa_crypto driver. For SW
compressors, I have tried not to propose any API changes, while making
sure that the zswap changes for the SG lists batching API work as expected
for SW without too much special-casing code.
I suppose I always assumed that we would update SW compressors later,
and not as part of this patch-set.
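As a rough standalone illustration of that convention (an int array stands in
for the destination SG entries, and all names are invented; this is not the
driver code): there is one destination slot per unit before the call, and on
return each slot holds either the compressed length or a negative error code
for that unit.

	#include <errno.h>
	#include <stdio.h>

	#define NR_UNITS	4
	#define UNIT_SIZE	4096

	/* dst_len[] stands in for the per-unit destination SG entries. */
	static void model_batch_compress(const int *src_len, int *dst_len, int n)
	{
		int k;

		for (k = 0; k < n; k++) {
			if (src_len[k] > UNIT_SIZE)
				dst_len[k] = -EINVAL;		/* bad input length */
			else if (src_len[k] > UNIT_SIZE / 2)
				dst_len[k] = -ENOSPC;		/* did not compress enough */
			else
				dst_len[k] = src_len[k] / 2;	/* pretend a 2:1 ratio */
		}
	}

	int main(void)
	{
		int src_len[NR_UNITS] = { 1024, 4096, 2000, 100 };	/* unit 1 modeled as incompressible */
		int dst_len[NR_UNITS];
		int k;

		model_batch_compress(src_len, dst_len, NR_UNITS);
		for (k = 0; k < NR_UNITS; k++)
			printf("unit %d: %d\n", k, dst_len[k]);
		return 0;
	}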
>
> This would simplify usage as we do not have to handle the differences in
> zswap.
That's the nice thing about SG lists - I think the zswap_compress() calls into
the new batching API appear agnostic to SW and HW compressors.
Other than the upfront "errp = (pool->compr_batch_size == 1) ? &err : &err_sg;"
the logical code organization of the new zswap_compress() is quite similar to
the existing code. The post-compress "dlen = sg->length | *errp;" handles the rest.
>
> If that is not doable, at the very least the API should be consistent.
> Right now the error code and length are propagated differently to the
> caller based on whether or not the compressor support batching.
Hopefully this minor difference is transitional while we move zswap to
use the new batching interface, with the assumption that the crypto SW API
can be updated later. We would need to get Herbert's thoughts on this.
>
> >
> >
> > >
> > > >
> > > > >
> > > > > > + dlen = PAGE_SIZE;
> > > > > > + dst =
> kmap_local_page(folio_page(folio, start
> > > > > + j));
> > > > > > + }
> > > > > > +
> > > > > > + handle = zs_malloc(pool->zs_pool, dlen, gfp,
> nid);
> > > > > >
> > > > > > - zs_obj_write(pool->zs_pool, handle, dst, dlen);
> > > > > > - entry->handle = handle;
> > > > > > - entry->length = dlen;
> > > > > > + if (IS_ERR_VALUE(handle)) {
> > > > > > + if (PTR_ERR((void *)handle) == -
> ENOSPC)
> > > > > > +
> zswap_reject_compress_poor++;
> > > > > > + else
> > > > > > + zswap_reject_alloc_fail++;
> > > > > >
> > > > > > -unlock:
> > > > > > - if (mapped)
> > > > > > - kunmap_local(dst);
> > > > > > - if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
> > > > > > - zswap_reject_compress_poor++;
> > > > > > - else if (comp_ret)
> > > > > > - zswap_reject_compress_fail++;
> > > > > > - else if (alloc_ret)
> > > > > > - zswap_reject_alloc_fail++;
> > > > > > + goto err_unlock;
> > > > > > + }
> > > > > > +
> > > > > > + zs_obj_write(pool->zs_pool, handle, dst,
> dlen);
> > > > > > + entries[j]->handle = handle;
> > > > > > + entries[j]->length = dlen;
> > > > > > + if (dst != acomp_ctx->buffers[k])
> > > > > > + kunmap_local(dst);
> > > > > > + }
> > > > > > + } /* finished compress and store nr_pages. */
> > > > > > +
> > > > > > + mutex_unlock(&acomp_ctx->mutex);
> > > > > > + return true;
> > > > > > +
> > > > > > +compress_error:
> > > > > > + for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > > > > > + if ((int)sg->length < 0) {
> > > > > > + if ((int)sg->length == -ENOSPC)
> > > > > > + zswap_reject_compress_poor++;
> > > > > > + else
> > > > > > + zswap_reject_compress_fail++;
> > > > > > + }
> > > > > > + }
> > > > > >
> > > > > > +err_unlock:
> > > > > > mutex_unlock(&acomp_ctx->mutex);
> > > > > > - return comp_ret == 0 && alloc_ret == 0;
> > > > > > + return false;
> > > > > > }
> > > > > >
> > > > > > static bool zswap_decompress(struct zswap_entry *entry, struct
> folio
> > > > > *folio)
> > > > > > @@ -1488,12 +1604,9 @@ static bool zswap_store_pages(struct
> folio
> > > > > *folio,
> > > > > > INIT_LIST_HEAD(&entries[i]->lru);
> > > > > > }
> > > > > >
> > > > > > - for (i = 0; i < nr_pages; ++i) {
> > > > > > - struct page *page = folio_page(folio, start + i);
> > > > > > -
> > > > > > - if (!zswap_compress(page, entries[i], pool,
> wb_enabled))
> > > > > > - goto store_pages_failed;
> > > > > > - }
> > > > > > + if (unlikely(!zswap_compress(folio, start, nr_pages, entries,
> pool,
> > > > > > + nid, wb_enabled)))
> > > > > > + goto store_pages_failed;
> > > > > >
> > > > > > for (i = 0; i < nr_pages; ++i) {
> > > > > > struct zswap_entry *old, *entry = entries[i];
> > > > > > --
> > > > > > 2.27.0
> > > > > >
^ permalink raw reply	[flat|nested] 47+ messages in thread
* Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.
2025-11-14 19:23 ` Sridhar, Kanchana P
@ 2025-11-14 19:44 ` Yosry Ahmed
2025-11-14 19:59 ` Sridhar, Kanchana P
0 siblings, 1 reply; 47+ messages in thread
From: Yosry Ahmed @ 2025-11-14 19:44 UTC (permalink / raw)
To: Sridhar, Kanchana P
Cc: SeongJae Park, linux-kernel, linux-mm, hannes, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, kasong, linux-crypto, herbert, davem, clabbe,
ardb, ebiggers, surenb, Accardi, Kristen C, Gomes, Vinicius,
Feghali, Wajdi K, Gopal, Vinodh
On Fri, Nov 14, 2025 at 07:23:42PM +0000, Sridhar, Kanchana P wrote:
[..]
> > > > >
> > > > > > > + if (err && !wb_enabled)
> > > > > > > + goto compress_error;
> > > > > > > +
> > > > > > > + for_each_sg(acomp_ctx->sg_outputs->sgl, sg,
> > nr_comps, k) {
> > > > > > > + j = k + i;
> > > > > >
[..]
> > > > >
> > > > > >
> > > > > > > + dst = acomp_ctx->buffers[k];
> > > > > > > + dlen = sg->length | *errp;
> > > > > >
> > > > > > Why are we doing this?
> > > > > >
> > > > > > > +
> > > > > > > + if (dlen < 0) {
> > > > > >
> > > > > > We should do the incompressible page handling also if dlen is
> > PAGE_SIZE,
> > > > > > or if the compression failed (I guess that's the intention of bit OR'ing
> > > > > > with *errp?)
> > > > >
> > > > > Yes, indeed: that's the intention of bit OR'ing with *errp.
> > > >
> > > > ..and you never really answered my question. In the exising code we
> > > > store the page as incompressible if writeback is enabled AND
> > > > crypto_wait_req() fails or dlen is zero or PAGE_SIZE. We check above
> > > > if crypto_wait_req() fails and writeback is disabled, but what about the
> > > > rest?
> > >
> > > Let me explain this some more. The new code only relies on the assumption
> > > that if dlen is zero or >= PAGE_SIZE, the compressor will not return a 0
> > > ("successful status"). In other words, the compressor will return an error
> > status
> > > in this case, which is expected to be a negative error code.
> >
> > I am not sure if all compressors do that, especially for the case where
> > dlen >= PAGE_SIZE. The existing code does not assume that there will be
> > an error code in these cases.
> >
> > For the dlen == 0 case, the check was introduced recently by commit
> > dca4437a5861 ("mm/zswap: store <PAGE_SIZE compression failed page
> > as-is"). Looking through the history it seems like it was introduced in
> > v4 of that patch but I don't see the reasoning.
>
> The existing code did not check for dlen == 0 and dlen >= PAGE_SIZE
> prior to commit dca4437a5861 ("mm/zswap: store <PAGE_SIZE compression
> failed page as-is") either. We need SeongJae or Herbert to clarify whether
> this check is needed, or if it is sufficient to rely on comp_ret, the return from
> crypto_wait_req().
>
> >
> > SeongJae, did you observe any compressors returning dlen == 0 but no
> > error code? What was the reasoning behind the dlen == 0 check?
> >
> > >
> > > Under these (hopefully valid) assumptions, the code handles the simple case
> > > of an error compression return status and writeback is disabled, by the
> > > "goto compress_error".
> > >
> > > The rest is handled by these:
> > >
> > > 1) First, I need to adapt the use of sg_outputs->sgl->length to represent the
> > > compress length for software compressors, so I do this after
> > crypto_wait_req()
> > > returns:
> > >
> > > acomp_ctx->sg_outputs->sgl->length = acomp_ctx->req->dlen;
> >
> > For SW compressors, why is acomp_ctx->sg_outputs->sgl->length not set?
> > IIUC we are using the same API for SW and HW compressors, why is the
> > output length in different places for each of them?
>
> This is to first implement the SG lists batching interface in iaa_crypto, while
> maintaining backward compatibility for SW compressors with the new API.
> I believe we may want to adapt the crypto API to SW compressors
> at a later point. I also believe this would be outside the scope of this patch.
> It would be nice if Herbert can share his vision on this aspect.
>
> >
> > >
> > > I did not want to propose any changes to crypto software compressors
> > protocols.
> > >
> > > 2) After the check for the "if (err && !wb_enabled)" case, the new code has
> > this:
> > >
> > > for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > > j = k + i;
> > > dst = acomp_ctx->buffers[k];
> > > dlen = sg->length | *errp;
> > >
> > > if (dlen < 0) {
> > > dlen = PAGE_SIZE;
> > > dst = kmap_local_page(folio_page(folio, start + j));
> > > }
> > >
> > > For batching compressors, namely, iaa_crypto, the individual output SG
> > > lists sg->length follows the requirements from Herbert: each sg->length
> > > is the compressed length or the error status (a negative error code).
> > >
> > > Then all I need to know whether to store the page as incompressible
> > > is to either directly test if sg->length is negative (for batching compressors),
> > > or sg->length bit-OR'ed with the crypto_wait_req() return status ("err")
> > > is negative. This is accomplished by the "dlen = sg->length | *errp;".
> > >
> > > I believe this maintains backward compatibility with the existing code.
> > > Please let me know if you agree.
> >
> > For batching compressors, will 'err' be set as well, or just sg->length?
> > If it's just sg->length, then we need to check again if WB is enabled
> > here before storing the page uncompressed. Right?
>
> iaa_crypto will set 'err' and set the sg->length as per the batching interface
> spec from Herbert.
So both 'err' and sg->length will contain the same error? In this case,
why do we need to check if dlen < 0? Shouldn't checking 'err' be
sufficient? It would work for both SW and HW, and we wouldn't need
errp. Did I miss something?
>
> >
> > >
> > > >
> > > > We don't check again if writeback is enabled before storing the page is
> > > > incompressible, and we do not check if dlen is zero or PAGE_SIZE. Are
> > > > these cases no longer possible?
> > >
> > > Hope the above explanation clarifies things some more? These case
> > > are possible, and as long as they return an error status, they should be
> > > correctly handled by the new code.
> >
> > As mentioned above, I am not sure if that's correct for dlen >=
> > PAGE_SIZE.
>
> We need to get clarity on this from SeongJae/Herbert.
>
> >
> > >
> > > >
> > > > Also, why use errp, why not explicitly use the appropriate error code?
> > > > It's also unclear to me why the error code is always zero with HW
> > > > compression?
> > >
> > > This is because of the sg->length requirements (compressed length/error)
> > > for the batching interface suggested by Herbert. Hence, I upfront define
> > > err_sg to 0, and, set errp to &err_sg for batching compressors. For software
> > > compressors, errp is set to &err, namely, the above check will always apply
> > > the software compressor's error status to the compressed length via
> > > the bit-OR to determine if the page needs to be stored uncompressed.
> >
> > Thanks for the clarification. I understand that the error code has
> > different sources for SW and HW compressors, but I do not like using
> > errp as an indirection. It makes the code unclear. I would rather we
> > explicitly check err for SW compressors and dlen for HW compressors.
> >
> > That being said, I thought what Herbert suggested was that the same API
> > is used for both SW and HW compressors. IOW, either way we submit a
> > batch of pages (8 pages for SW compressors), and then the crypto API
> > would either give the entire batch to the compressor if it supports
> > batching, or loop over them internally and hand them page-by-page to
> > the compressor.
>
> That was not how I understood Herbert's suggestion for the batching interface.
> He did suggest the following:
>
> "Before the call to acomp, the destination SG list should contain as
> many elements as the number of units. On return, the dst lengths
> should be stored in each destination SG entry."
>
> I have incorporated this suggestion in the iaa_crypto driver. For SW
> compressors, I have tried not to propose any API changes, while making
> sure that the zswap changes for the SG lists batching API work as expected
> for SW without too much special-casing code.
>
> I suppose I always assumed that we would update SW compressors later,
> and not as part of this patch-set.
I am not sure I understand what changes lie in the crypto layer and what
changes lie in the SW compressors. I am not suggesting we make any
modifications to the SW compressors.
I imagined that the crypto layer would present a uniform API regardless
of whether or not the compressor supports batching. Ideally zswap would
pass in a batch to crypto and it would figure out if it needs to break
it down or not. Then the output length and errors would be presented
uniformly to the caller.
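Roughly, what I am imagining is something like the following standalone
sketch (all names are invented, and I have no idea how closely this maps
onto the actual crypto internals):

	#include <stdio.h>

	#define NR_UNITS 3

	/*
	 * Model of a uniform front end: the caller always submits a batch
	 * and always reads per-unit results; whether the backend batches or
	 * is driven one unit at a time is hidden behind model_submit().
	 */
	struct model_backend {
		int batching;
		/* compress one unit: returns compressed length (or negative error) */
		int (*compress_one)(int src_len);
		/* compress a whole batch, filling one result per unit */
		void (*compress_batch)(const int *src_len, int *result, int n);
	};

	static void model_submit(const struct model_backend *be,
				 const int *src_len, int *result, int n)
	{
		int k;

		if (be->batching) {
			be->compress_batch(src_len, result, n);
			return;
		}

		/* Fallback: drive a non-batching backend unit by unit. */
		for (k = 0; k < n; k++)
			result[k] = be->compress_one(src_len[k]);
	}

	static int fake_compress_one(int src_len)
	{
		return src_len / 2;	/* pretend a 2:1 ratio, never fails */
	}

	int main(void)
	{
		struct model_backend sw = { 0, fake_compress_one, NULL };
		int src_len[NR_UNITS] = { 4096, 2048, 1024 };
		int result[NR_UNITS];
		int k;

		model_submit(&sw, src_len, result, NR_UNITS);
		for (k = 0; k < NR_UNITS; k++)
			printf("unit %d -> %d\n", k, result[k]);
		return 0;
	}

Either way, the caller would submit a batch and read per-unit results,
regardless of what the backend can actually do.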
That being said, I am not at all familiar with how crypto works and how
straightforward that would be. Herbert, WDYT?
>
> >
> > This would simplify usage as we do not have to handle the differences in
> > zswap.
>
> That's the nice thing about SG lists - I think the zswap_compress() calls to
> the new batching API appears agnostic to SW and HW compressors.
> Other than the upfront "errp = (pool->compr_batch_size == 1) ? &err : &err_sg;"
> the logical code organization of the new zswap_compress() is quite similar to
> the existing code. The post-compress "dlen = sg->length | *errp;" handles the rest.
It would be even nicer if the batches were also abstracted by SG lists.
Also, I don't like how the error codes and output lengths are presented
differently for HW and SW compressors.
^ permalink raw reply	[flat|nested] 47+ messages in thread
* RE: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.
2025-11-14 19:44 ` Yosry Ahmed
@ 2025-11-14 19:59 ` Sridhar, Kanchana P
2025-11-14 20:49 ` Yosry Ahmed
0 siblings, 1 reply; 47+ messages in thread
From: Sridhar, Kanchana P @ 2025-11-14 19:59 UTC (permalink / raw)
To: Yosry Ahmed
Cc: SeongJae Park, linux-kernel, linux-mm, hannes, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, kasong, linux-crypto, herbert, davem, clabbe,
ardb, ebiggers, surenb, Accardi, Kristen C, Gomes, Vinicius,
Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Friday, November 14, 2025 11:44 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: SeongJae Park <sj@kernel.org>; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; hannes@cmpxchg.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> compress batching of large folios.
>
> On Fri, Nov 14, 2025 at 07:23:42PM +0000, Sridhar, Kanchana P wrote:
> [..]
> > > > > >
> > > > > > > > + if (err && !wb_enabled)
> > > > > > > > + goto compress_error;
> > > > > > > > +
> > > > > > > > + for_each_sg(acomp_ctx->sg_outputs->sgl, sg,
> > > nr_comps, k) {
> > > > > > > > + j = k + i;
> > > > > > >
> [..]
> > > > > >
> > > > > > >
> > > > > > > > + dst = acomp_ctx->buffers[k];
> > > > > > > > + dlen = sg->length | *errp;
> > > > > > >
> > > > > > > Why are we doing this?
> > > > > > >
> > > > > > > > +
> > > > > > > > + if (dlen < 0) {
> > > > > > >
> > > > > > > We should do the incompressible page handling also if dlen is
> > > PAGE_SIZE,
> > > > > > > or if the compression failed (I guess that's the intention of bit
> OR'ing
> > > > > > > with *errp?)
> > > > > >
> > > > > > Yes, indeed: that's the intention of bit OR'ing with *errp.
> > > > >
> > > > > ..and you never really answered my question. In the exising code we
> > > > > store the page as incompressible if writeback is enabled AND
> > > > > crypto_wait_req() fails or dlen is zero or PAGE_SIZE. We check above
> > > > > if crypto_wait_req() fails and writeback is disabled, but what about the
> > > > > rest?
> > > >
> > > > Let me explain this some more. The new code only relies on the
> assumption
> > > > that if dlen is zero or >= PAGE_SIZE, the compressor will not return a 0
> > > > ("successful status"). In other words, the compressor will return an error
> > > status
> > > > in this case, which is expected to be a negative error code.
> > >
> > > I am not sure if all compressors do that, especially for the case where
> > > dlen >= PAGE_SIZE. The existing code does not assume that there will be
> > > an error code in these cases.
> > >
> > > For the dlen == 0 case, the check was introduced recently by commit
> > > dca4437a5861 ("mm/zswap: store <PAGE_SIZE compression failed page
> > > as-is"). Looking through the history it seems like it was introduced in
> > > v4 of that patch but I don't see the reasoning.
> >
> > The existing code did not check for dlen == 0 and dlen >= PAGE_SIZE
> > prior to commit dca4437a5861 ("mm/zswap: store <PAGE_SIZE compression
> > failed page as-is") either. We need SeongJae or Herbert to clarify whether
> > this check is needed, or if it is sufficient to rely on comp_ret, the return from
> > crypto_wait_req().
> >
> > >
> > > SeongJae, did you observe any compressors returning dlen == 0 but no
> > > error code? What was the reasoning behind the dlen == 0 check?
> > >
> > > >
> > > > Under these (hopefully valid) assumptions, the code handles the simple
> case
> > > > of an error compression return status and writeback is disabled, by the
> > > > "goto compress_error".
> > > >
> > > > The rest is handled by these:
> > > >
> > > > 1) First, I need to adapt the use of sg_outputs->sgl->length to represent
> the
> > > > compress length for software compressors, so I do this after
> > > crypto_wait_req()
> > > > returns:
> > > >
> > > > acomp_ctx->sg_outputs->sgl->length = acomp_ctx->req->dlen;
> > >
> > > For SW compressors, why is acomp_ctx->sg_outputs->sgl->length not set?
> > > IIUC we are using the same API for SW and HW compressors, why is the
> > > output length in different places for each of them?
> >
> > This is to first implement the SG lists batching interface in iaa_crypto, while
> > maintaining backward compatibility for SW compressors with the new API.
> > I believe we may want to adapt the crypto API to SW compressors
> > at a later point. I also believe this would be outside the scope of this patch.
> > It would be nice if Herbert can share his vision on this aspect.
> >
> > >
> > > >
> > > > I did not want to propose any changes to crypto software compressors
> > > protocols.
> > > >
> > > > 2) After the check for the "if (err && !wb_enabled)" case, the new code
> has
> > > this:
> > > >
> > > > for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > > > j = k + i;
> > > > dst = acomp_ctx->buffers[k];
> > > > dlen = sg->length | *errp;
> > > >
> > > > if (dlen < 0) {
> > > > dlen = PAGE_SIZE;
> > > > dst = kmap_local_page(folio_page(folio, start + j));
> > > > }
> > > >
> > > > For batching compressors, namely, iaa_crypto, the individual output SG
> > > > lists sg->length follows the requirements from Herbert: each sg->length
> > > > is the compressed length or the error status (a negative error code).
> > > >
> > > > Then all I need to know whether to store the page as incompressible
> > > > is to either directly test if sg->length is negative (for batching
> compressors),
> > > > or sg->length bit-OR'ed with the crypto_wait_req() return status ("err")
> > > > is negative. This is accomplished by the "dlen = sg->length | *errp;".
> > > >
> > > > I believe this maintains backward compatibility with the existing code.
> > > > Please let me know if you agree.
> > >
> > > For batching compressors, will 'err' be set as well, or just sg->length?
> > > If it's just sg->length, then we need to check again if WB is enabled
> > > here before storing the page uncompressed. Right?
> >
> > iaa_crypto will set 'err' and set the sg->length as per the batching interface
> > spec from Herbert.
>
> So both 'err' and sg->length will contain the same error? In this case
> why do we need to check if dlen < 0? Shouldn't checking 'err' be
> sufficient? and it would work for both SW and HW and we wouldn't need
> errp. Did I miss something?
Great question. For a batching compressor, 'err' will contain an error if any
page in the batch had a compression error. This allows the early bail-out
path for SW and HW compressors if writeback is not enabled for the folio.

Only the sg->length of the specific pages that failed will contain an error
code. The other batch pages that compressed fine will have their compressed
length in sg->length. This lets the post-compression loop bit-OR sg->length
with the errp check; for SW, sg->length has already been brought up to date
with acomp_req->dlen before we reach the wb_enabled code path.
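To make that flow concrete, here is a condensed, illustrative sketch stitched
together from the code fragments quoted earlier in this thread (names such as
pool, acomp_ctx, nr_comps, wb_enabled and errp follow those fragments;
declarations are omitted, and this is not the exact patch hunk):

    /*
     * 'err' is the request-wide status from crypto_wait_req(); batching
     * compressors additionally report per-page status in each dst SG
     * entry's sg->length (compressed length, or a negative error code).
     */
    err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
                          &acomp_ctx->wait);

    /* SW (non-batching): bring the first dst SG entry up to date. */
    if (pool->compr_batch_size == 1)
        acomp_ctx->sg_outputs->sgl->length = acomp_ctx->req->dlen;

    err_sg = 0;
    errp = (pool->compr_batch_size == 1) ? &err : &err_sg;

    /* Early bail-out: something failed and we cannot store pages as-is. */
    if (err && !wb_enabled)
        goto compress_error;

    for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
        j = k + i;
        dst = acomp_ctx->buffers[k];
        /* Negative if this page (or, for SW, the whole request) failed. */
        dlen = sg->length | *errp;

        if (dlen < 0) {
            /* Failed/incompressible page: store it uncompressed. */
            dlen = PAGE_SIZE;
            dst = kmap_local_page(folio_page(folio, start + j));
        }
        /* ... store dst/dlen for page 'start + j' ... */
    }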
>
> >
> > >
> > > >
> > > > >
> > > > > We don't check again if writeback is enabled before storing the page as
> > > > > incompressible, and we do not check if dlen is zero or PAGE_SIZE. Are
> > > > > these cases no longer possible?
> > > >
> > > > Hope the above explanation clarifies things some more? These cases
> > > > are possible, and as long as they return an error status, they should be
> > > > correctly handled by the new code.
> > >
> > > As mentioned above, I am not sure if that's correct for dlen >=
> > > PAGE_SIZE.
> >
> > We need to get clarity on this from SeongJae/Herbert.
> >
> > >
> > > >
> > > > >
> > > > > Also, why use errp, why not explicitly use the appropriate error code?
> > > > > It's also unclear to me why the error code is always zero with HW
> > > > > compression?
> > > >
> > > > This is because of the sg->length requirements (compressed length/error)
> > > > for the batching interface suggested by Herbert. Hence, I upfront define
> > > > err_sg to 0, and, set errp to &err_sg for batching compressors. For software
> > > > compressors, errp is set to &err, namely, the above check will always apply
> > > > the software compressor's error status to the compressed length via
> > > > the bit-OR to determine if the page needs to be stored uncompressed.
> > >
> > > Thanks for the clarification. I understand that the error code has
> > > different sources for SW and HW compressors, but I do not like using
> > > errp as an indirection. It makes the code unclear. I would rather we
> > > explicitly check err for SW compressors and dlen for HW compressors.
> > >
> > > That being said, I thought what Herbert suggested was that the same API
> > > is used for both SW and HW compressors. IOW, either way we submit a
> > > batch of pages (8 pages for SW compressors), and then the crypto API
> > > would either give the entire batch to the compressor if it supports
> > > batching, or loop over them internally and hand them page-by-page to
> > > the compressor.
> >
> > That was not how I understood Herbert's suggestion for the batching
> interface.
> > He did suggest the following:
> >
> > "Before the call to acomp, the destination SG list should contain as
> > many elements as the number of units. On return, the dst lengths
> > should be stored in each destination SG entry."
> >
> > I have incorporated this suggestion in the iaa_crypto driver. For SW
> > compressors, I have tried not to propose any API changes, while making
> > sure that the zswap changes for the SG lists batching API work as expected
> > for SW without too much special-casing code.
> >
> > I suppose I always assumed that we would update SW compressors later,
> > and not as part of this patch-set.
>
> I am not sure I understand what changes lie in the crypto layer and what
> changes lie in the SW compressors. I am not suggesting we do any
> modification to the SW compressors.
>
> I imagined that the crypto layer would present a uniform API regardless
> of whether or not the compressor supports batching. Ideally zswap would
> pass in a batch to crypto and it would figure out if it needs to break
> them down or not. Then the output length and errors would be presented
> uniformly to the caller.
From my understanding, this would require changes to the crypto layer for
SW compressors, which currently (IIUC) does not set sg->length and only sets
acomp_req->dlen (a temporary state until crypto for SW also uses SG lists).

Ideally, batching could be handled similarly by crypto for SW. I believe we
will get there, albeit outside the scope of this patch.
>
> That being said, I am not at all familiar with how crypto works and how
> straightforward that would be. Herbert, WDYT?
>
> >
> > >
> > > This would simplify usage as we do not have to handle the differences in
> > > zswap.
> >
> > That's the nice thing about SG lists - I think the zswap_compress() calls to
> > the new batching API appears agnostic to SW and HW compressors.
> > Other than the upfront "errp = (pool->compr_batch_size == 1) ? &err : &err_sg;"
> > the logical code organization of the new zswap_compress() is quite similar to
> > the existing code. The post-compress "dlen = sg->length | *errp;" handles the rest.
>
> It would be even nicer if the batches are also abstracted by SG lists.
>
> Also, I don't like how the error codes and output lengths are presented
> differently for HW and SW compressors.
I do believe this is short-term and is the first step in implementing batching
in zswap. We should get Herbert's thoughts on this.
^ permalink raw reply	[flat|nested] 47+ messages in thread
* Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.
2025-11-14 19:59 ` Sridhar, Kanchana P
@ 2025-11-14 20:49 ` Yosry Ahmed
0 siblings, 0 replies; 47+ messages in thread
From: Yosry Ahmed @ 2025-11-14 20:49 UTC (permalink / raw)
To: Sridhar, Kanchana P
Cc: SeongJae Park, linux-kernel, linux-mm, hannes, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, kasong, linux-crypto, herbert, davem, clabbe,
ardb, ebiggers, surenb, Accardi, Kristen C, Gomes, Vinicius,
Feghali, Wajdi K, Gopal, Vinodh
On Fri, Nov 14, 2025 at 07:59:57PM +0000, Sridhar, Kanchana P wrote:
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosry.ahmed@linux.dev>
> > Sent: Friday, November 14, 2025 11:44 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: SeongJae Park <sj@kernel.org>; linux-kernel@vger.kernel.org; linux-
> > mm@kvack.org; hannes@cmpxchg.org; nphamcs@gmail.com;
> > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > ryan.roberts@arm.com; 21cnbao@gmail.com;
> > ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> > senozhatsky@chromium.org; kasong@tencent.com; linux-
> > crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>;
> > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> > compress batching of large folios.
> >
> > On Fri, Nov 14, 2025 at 07:23:42PM +0000, Sridhar, Kanchana P wrote:
> > [..]
> > > > > > >
> > > > > > > > > + if (err && !wb_enabled)
> > > > > > > > > + goto compress_error;
> > > > > > > > > +
> > > > > > > > > + for_each_sg(acomp_ctx->sg_outputs->sgl, sg,
> > > > nr_comps, k) {
> > > > > > > > > + j = k + i;
> > > > > > > >
> > [..]
> > > > > > >
> > > > > > > >
> > > > > > > > > + dst = acomp_ctx->buffers[k];
> > > > > > > > > + dlen = sg->length | *errp;
> > > > > > > >
> > > > > > > > Why are we doing this?
> > > > > > > >
> > > > > > > > > +
> > > > > > > > > + if (dlen < 0) {
> > > > > > > >
> > > > > > > > We should do the incompressible page handling also if dlen is
> > > > PAGE_SIZE,
> > > > > > > > or if the compression failed (I guess that's the intention of bit
> > OR'ing
> > > > > > > > with *errp?)
> > > > > > >
> > > > > > > Yes, indeed: that's the intention of bit OR'ing with *errp.
> > > > > >
> > > > > > ..and you never really answered my question. In the exising code we
> > > > > > store the page as incompressible if writeback is enabled AND
> > > > > > crypto_wait_req() fails or dlen is zero or PAGE_SIZE. We check above
> > > > > > if crypto_wait_req() fails and writeback is disabled, but what about the
> > > > > > rest?
> > > > >
> > > > > Let me explain this some more. The new code only relies on the
> > assumption
> > > > > that if dlen is zero or >= PAGE_SIZE, the compressor will not return a 0
> > > > > ("successful status"). In other words, the compressor will return an error
> > > > status
> > > > > in this case, which is expected to be a negative error code.
> > > >
> > > > I am not sure if all compressors do that, especially for the case where
> > > > dlen >= PAGE_SIZE. The existing code does not assume that there will be
> > > > an error code in these cases.
> > > >
> > > > For the dlen == 0 case, the check was introduced recently by commit
> > > > dca4437a5861 ("mm/zswap: store <PAGE_SIZE compression failed page
> > > > as-is"). Looking through the history it seems like it was introduced in
> > > > v4 of that patch but I don't see the reasoning.
> > >
> > > The existing code did not check for dlen == 0 and dlen >= PAGE_SIZE
> > > prior to commit dca4437a5861 ("mm/zswap: store <PAGE_SIZE compression
> > > failed page as-is") either. We need SeongJae or Herbert to clarify whether
> > > this check is needed, or if it is sufficient to rely on comp_ret, the return from
> > > crypto_wait_req().
> > >
> > > >
> > > > SeongJae, did you observe any compressors returning dlen == 0 but no
> > > > error code? What was the reasoning behind the dlen == 0 check?
> > > >
> > > > >
> > > > > Under these (hopefully valid) assumptions, the code handles the simple
> > case
> > > > > of an error compression return status and writeback is disabled, by the
> > > > > "goto compress_error".
> > > > >
> > > > > The rest is handled by these:
> > > > >
> > > > > 1) First, I need to adapt the use of sg_outputs->sgl->length to represent
> > the
> > > > > compress length for software compressors, so I do this after
> > > > crypto_wait_req()
> > > > > returns:
> > > > >
> > > > > acomp_ctx->sg_outputs->sgl->length = acomp_ctx->req->dlen;
> > > >
> > > > For SW compressors, why is acomp_ctx->sg_outputs->sgl->length not set?
> > > > IIUC we are using the same API for SW and HW compressors, why is the
> > > > output length in different places for each of them?
> > >
> > > This is to first implement the SG lists batching interface in iaa_crypto, while
> > > maintaining backward compatibility for SW compressors with the new API.
> > > I believe we may want to adapt the crypto API to SW compressors
> > > at a later point. I also believe this would be outside the scope of this patch.
> > > It would be nice if Herbert can share his vision on this aspect.
> > >
> > > >
> > > > >
> > > > > I did not want to propose any changes to crypto software compressors
> > > > protocols.
> > > > >
> > > > > 2) After the check for the "if (err && !wb_enabled)" case, the new code
> > has
> > > > this:
> > > > >
> > > > > for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
> > > > > j = k + i;
> > > > > dst = acomp_ctx->buffers[k];
> > > > > dlen = sg->length | *errp;
> > > > >
> > > > > if (dlen < 0) {
> > > > > dlen = PAGE_SIZE;
> > > > > dst = kmap_local_page(folio_page(folio, start + j));
> > > > > }
> > > > >
> > > > > For batching compressors, namely, iaa_crypto, the individual output SG
> > > > > lists sg->length follows the requirements from Herbert: each sg->length
> > > > > is the compressed length or the error status (a negative error code).
> > > > >
> > > > > Then all I need to know whether to store the page as incompressible
> > > > > is to either directly test if sg->length is negative (for batching
> > compressors),
> > > > > or sg->length bit-OR'ed with the crypto_wait_req() return status ("err")
> > > > > is negative. This is accomplished by the "dlen = sg->length | *errp;".
> > > > >
> > > > > I believe this maintains backward compatibility with the existing code.
> > > > > Please let me know if you agree.
> > > >
> > > > For batching compressors, will 'err' be set as well, or just sg->length?
> > > > If it's just sg->length, then we need to check again if WB is enabled
> > > > here before storing the page uncompressed. Right?
> > >
> > > iaa_crypto will set 'err' and set the sg->length as per the batching interface
> > > spec from Herbert.
> >
> > So both 'err' and sg->length will contain the same error? In this case
> > why do we need to check if dlen < 0? Shouldn't checking 'err' be
> > sufficient? and it would work for both SW and HW and we wouldn't need
> > errp. Did I miss something?
>
> Great question. For a batching compressor, 'err' will contain an error if any
> page in the batch had a compression error. This allows the early bail-out
> path for SW and HW compressors if writeback is not enabled for the folio.
>
> Only the specific pages' sg->length will contain an error code. The other
> batch pages that compressed fine will have the compressed length in
> sg->length. This enables the post-compression loop with the errp check
> bit-ORed with the sg->length, which for SW, has been brought up to date
> with the acomp_req->dlen before we get to the wb_enabled code path.
>
> >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > We don't check again if writeback is enabled before storing the page is
> > > > > > incompressible, and we do not check if dlen is zero or PAGE_SIZE. Are
> > > > > > these cases no longer possible?
> > > > >
> > > > > Hope the above explanation clarifies things some more? These case
> > > > > are possible, and as long as they return an error status, they should be
> > > > > correctly handled by the new code.
> > > >
> > > > As mentioned above, I am not sure if that's correct for dlen >=
> > > > PAGE_SIZE.
> > >
> > > We need to get clarity on this from SeongJae/Herbert.
> > >
> > > >
> > > > >
> > > > > >
> > > > > > Also, why use errp, why not explicitly use the appropriate error code?
> > > > > > It's also unclear to me why the error code is always zero with HW
> > > > > > compression?
> > > > >
> > > > > This is because of the sg->length requirements (compressed
> > length/error)
> > > > > for the batching interface suggested by Herbert. Hence, I upfront define
> > > > > err_sg to 0, and, set errp to &err_sg for batching compressors. For
> > software
> > > > > compressors, errp is set to &err, namely, the above check will always
> > apply
> > > > > the software compressor's error status to the compressed length via
> > > > > the bit-OR to determine if the page needs to be stored uncompressed.
> > > >
> > > > Thanks for the clarification. I understand that the error code has
> > > > different sources for SW and HW compressors, but I do not like using
> > > > errp as an indirection. It makes the code unclear. I would rather we
> > > > explicitly check err for SW compressors and dlen for HW compressors.
> > > >
> > > > That being said, I thought what Herbert suggested was that the same API
> > > > is used for both SW and HW compressors. IOW, either way we submit a
> > > > batch of pages (8 pages for SW compressors), and then the crypto API
> > > > would either give the entire batch to the compressor if it supports
> > > > batching, or loop over them internally and hand them page-by-page to
> > > > the compressor.
> > >
> > > That was not how I understood Herbert's suggestion for the batching
> > interface.
> > > He did suggest the following:
> > >
> > > "Before the call to acomp, the destination SG list should contain as
> > > many elements as the number of units. On return, the dst lengths
> > > should be stored in each destination SG entry."
> > >
> > > I have incorporated this suggestion in the iaa_crypto driver. For SW
> > > compressors, I have tried not to propose any API changes, while making
> > > sure that the zswap changes for the SG lists batching API work as expected
> > > for SW without too much special-casing code.
> > >
> > > I suppose I always assumed that we would update SW compressors later,
> > > and not as part of this patch-set.
> >
> > I am not sure I understand what changes lie in the crypto layer and what
> > changes lie in the SW compressors. I am not suggesting we do any
> > modification to the SW compressors.
> >
> > I imagined that the crypto layer would present a uniform API regardless
> > of whether or not the compressor supports batching. Ideally zswap would
> > pass in a batch to crypto and it would figure out if it needs to break
> > them down or not. Then the output length and errors would be presented
> > uniformly to the caller.
>
> From my understanding, this would require changes to the crypto layer for
> SW compressors, which again IIUC, does not set the sg->length, only sets
> the acomp_req->dlen (IIUC, a temporary state until crypto for SW also uses
> SG lists).
>
> Ideally, batching could be handled similarly by crypto for SW. I believe we
> will get there, albeit outside the scope of this patch.
>
> >
> > That being said, I am not at all familiar with how crypto works and how
> > straightforward that would be. Herbert, WDYT?
> >
> > >
> > > >
> > > > This would simplify usage as we do not have to handle the differences in
> > > > zswap.
> > >
> > > That's the nice thing about SG lists - I think the zswap_compress() calls to
> > > the new batching API appears agnostic to SW and HW compressors.
> > > Other than the upfront "errp = (pool->compr_batch_size == 1) ? &err :
> > &err_sg;"
> > > the logical code organization of the new zswap_compress() is quite similar to
> > > the existing code. The post-compress "dlen = sg->length | *errp;" handles
> > the rest.
> >
> > It would be even nicer if the batches are also abstracted by SG lists.
> >
> > Also, I don't like how the error codes and output lengths are presented
> > differently for HW and SW compressors.
>
> I do believe this is short-term and is the first step in implementing batching
> in zswap. We should get Herbert's thoughts on this.
If we have to keep the different approaches for now, I would still like
to simplify the error handling. We should remove errp and explicitly
check sg->length or 'err' based on the batch size.
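For illustration, the simplification being asked for might look roughly like
this (a sketch only, reusing the names from the snippets quoted above; not
the actual patch):

    for_each_sg(acomp_ctx->sg_outputs->sgl, sg, nr_comps, k) {
        bool failed;

        /* No errp indirection: branch once on the batch size. */
        if (pool->compr_batch_size == 1)
            failed = err != 0;              /* SW: request-wide status */
        else
            failed = (int)sg->length < 0;   /* HW: per-page status */

        if (failed) {
            /* store the page uncompressed, as in the existing code */
        }
        /* ... */
    }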
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.
2025-11-14 15:37 ` Yosry Ahmed
2025-11-14 19:23 ` Sridhar, Kanchana P
@ 2025-11-26 5:46 ` Herbert Xu
2025-11-26 6:34 ` Yosry Ahmed
1 sibling, 1 reply; 47+ messages in thread
From: Herbert Xu @ 2025-11-26 5:46 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Sridhar, Kanchana P, SeongJae Park, linux-kernel, linux-mm,
hannes, nphamcs, chengming.zhou, usamaarif642, ryan.roberts,
21cnbao, ying.huang, akpm, senozhatsky, kasong, linux-crypto,
davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C, Gomes,
Vinicius, Feghali, Wajdi K, Gopal, Vinodh
On Fri, Nov 14, 2025 at 03:37:53PM +0000, Yosry Ahmed wrote:
>
> Thanks for the clarification. I understand that the error code has
> different sources for SW and HW compressors, but I do not like using
> errp as an indirection. It makes the code unclear. I would rather we
> explicitly check err for SW compressors and dlen for HW compressors.
>
> That being said, I thought what Herbert suggested was that the same API
> is used for both SW and HW compressors. IOW, either way we submit a
> batch of pages (8 pages for SW compressors), and then the crypto API
> would either give the entire batch to the compressor if it supports
> batching, or loop over them internally and hand them page-by-page to
> the compressor.
>
> This would simplify usage as we do not have to handle the differences in
> zswap.
>
> If that is not doable, at the very least the API should be consistent.
> Right now the error code and length are propagated differently to the
> caller based on whether or not the compressor support batching.
Yes we should only have one code path in zswap, regardless of whether
batching is used or not.
The degenerate case of a batch with a single page should be handled
by the Crypto API.
So I will change crypto_acomp to take care of this case.
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.
2025-11-26 5:46 ` Herbert Xu
@ 2025-11-26 6:34 ` Yosry Ahmed
2025-11-26 20:05 ` Sridhar, Kanchana P
0 siblings, 1 reply; 47+ messages in thread
From: Yosry Ahmed @ 2025-11-26 6:34 UTC (permalink / raw)
To: Herbert Xu
Cc: Sridhar, Kanchana P, SeongJae Park, linux-kernel, linux-mm,
hannes, nphamcs, chengming.zhou, usamaarif642, ryan.roberts,
21cnbao, ying.huang, akpm, senozhatsky, kasong, linux-crypto,
davem, clabbe, ardb, ebiggers, surenb, Accardi, Kristen C, Gomes,
Vinicius, Feghali, Wajdi K, Gopal, Vinodh
On Wed, Nov 26, 2025 at 01:46:57PM +0800, Herbert Xu wrote:
> On Fri, Nov 14, 2025 at 03:37:53PM +0000, Yosry Ahmed wrote:
> >
> > Thanks for the clarification. I understand that the error code has
> > different sources for SW and HW compressors, but I do not like using
> > errp as an indirection. It makes the code unclear. I would rather we
> > explicitly check err for SW compressors and dlen for HW compressors.
> >
> > That being said, I thought what Herbert suggested was that the same API
> > is used for both SW and HW compressors. IOW, either way we submit a
> > batch of pages (8 pages for SW compressors), and then the crypto API
> > would either give the entire batch to the compressor if it supports
> > batching, or loop over them internally and hand them page-by-page to
> > the compressor.
> >
> > This would simplify usage as we do not have to handle the differences in
> > zswap.
> >
> > If that is not doable, at the very least the API should be consistent.
> > Right now the error code and length are propagated differently to the
> > caller based on whether or not the compressor support batching.
>
> Yes we should only have one code path in zswap, regardless of whether
> batching is used or not.
>
> The degenerate case of a batch with a single page should be handled
> by the Crypto API.
>
> So I will change crypto_acomp to take care of this case.
Nice :)
>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 47+ messages in thread
* RE: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios.
2025-11-26 6:34 ` Yosry Ahmed
@ 2025-11-26 20:05 ` Sridhar, Kanchana P
0 siblings, 0 replies; 47+ messages in thread
From: Sridhar, Kanchana P @ 2025-11-26 20:05 UTC (permalink / raw)
To: Yosry Ahmed, Herbert Xu
Cc: SeongJae Park, linux-kernel, linux-mm, hannes, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, kasong, linux-crypto, davem, clabbe, ardb,
ebiggers, surenb, Accardi, Kristen C, Gomes, Vinicius, Feghali,
Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Tuesday, November 25, 2025 10:35 PM
> To: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; SeongJae Park
> <sj@kernel.org>; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> Kristen C <kristen.c.accardi@intel.com>; Gomes, Vinicius
> <vinicius.gomes@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with
> compress batching of large folios.
>
> On Wed, Nov 26, 2025 at 01:46:57PM +0800, Herbert Xu wrote:
> > On Fri, Nov 14, 2025 at 03:37:53PM +0000, Yosry Ahmed wrote:
> > >
> > > Thanks for the clarification. I understand that the error code has
> > > different sources for SW and HW compressors, but I do not like using
> > > errp as an indirection. It makes the code unclear. I would rather we
> > > explicitly check err for SW compressors and dlen for HW compressors.
> > >
> > > That being said, I thought what Herbert suggested was that the same API
> > > is used for both SW and HW compressors. IOW, either way we submit a
> > > batch of pages (8 pages for SW compressors), and then the crypto API
> > > would either give the entire batch to the compressor if it supports
> > > batching, or loop over them internally and hand them page-by-page to
> > > the compressor.
> > >
> > > This would simplify usage as we do not have to handle the differences in
> > > zswap.
> > >
> > > If that is not doable, at the very least the API should be consistent.
> > > Right now the error code and length are propagated differently to the
> > > caller based on whether or not the compressor support batching.
> >
> > Yes we should only have one code path in zswap, regardless of whether
> > batching is used or not.
> >
> > The degenerate case of a batch with a single page should be handled
> > by the Crypto API.
> >
> > So I will change crypto_acomp to take care of this case.
>
> Nice :)
Thanks Herbert and Yosry!
Herbert, to make sure I understand, will you be implementing all of these
features in crypto_acomp for software compressors? I would appreciate it
if you can clarify:
1) Error & compressed length propagation to the dst sg->length only for
non-batching compressors.
a) For batching compressors, this wouldn't apply since errors could occur
for any page in the batch, and the first page (dst sg->length) could have
successfully compressed.
2) Will you also be handling the case where zswap sends an SG list batch
with multiple pages to a non-batching compressor, and the crypto_acomp
API internally compresses each page sequentially and propagates
errors/compressed lengths before returning?
If so, this would really standardize the code in zswap for batching and
non-batching compressors.
Thanks,
Kanchana
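For what it's worth, question 2) above seems to describe something along
these lines inside the crypto layer; the sketch below is purely hypothetical
(compress_one_unit() and the exact field usage are invented for illustration,
not an existing API):

    static int acomp_fallback_compress_batch(struct acomp_req *req)
    {
        struct scatterlist *src = req->src, *dst = req->dst;
        int ret = 0;

        /* One unit (e.g. PAGE_SIZE) per dst SG entry, per the batching
         * convention discussed in this thread.
         */
        while (src && dst) {
            int one = compress_one_unit(req, src, dst);

            /* Per-unit result: compressed length, or a negative error
             * code, propagated in the dst SG entry.
             */
            dst->length = one;
            if (one < 0)
                ret = one;

            src = sg_next(src);
            dst = sg_next(dst);
        }
        return ret;
    }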
>
> >
> > Cheers,
> > --
> > Email: Herbert Xu <herbert@gondor.apana.org.au>
> > Home Page: http://gondor.apana.org.au/~herbert/
> > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 47+ messages in thread
* RE: [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver
2025-11-04 9:12 [PATCH v13 00/22] zswap compression batching with optimized iaa_crypto driver Kanchana P Sridhar
` (21 preceding siblings ...)
2025-11-04 9:12 ` [PATCH v13 22/22] mm: zswap: Batched zswap_compress() with compress batching of large folios Kanchana P Sridhar
@ 2025-11-13 18:14 ` Sridhar, Kanchana P
22 siblings, 0 replies; 47+ messages in thread
From: Sridhar, Kanchana P @ 2025-11-13 18:14 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosry.ahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, 21cnbao, ying.huang,
akpm, senozhatsky, sj, kasong, linux-crypto, herbert, davem,
clabbe, ardb, ebiggers, surenb, Accardi, Kristen C, Gomes,
Vinicius
Cc: Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P
> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Tuesday, November 4, 2025 1:12 AM
> To: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosry.ahmed@linux.dev; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> senozhatsky@chromium.org; sj@kernel.org; kasong@tencent.com; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>
> Cc: Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>; Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com>
> Subject: [PATCH v13 00/22] zswap compression batching with optimized
> iaa_crypto driver
>
> v13: zswap compression batching with optimized iaa_crypto driver
> ================================================================
> This updated patch-series further generalizes the batching implementation of
> zswap_compress() for non-batching and batching compressors. It makes sure the
> bulk allocation of zswap entries preserves the current behavior of addition of
> an entry to the LRU list for the nid of the page.
>
> Based on Herbert's suggestions, the batching interfaces from zswap to crypto,
> from crypto to iaa_crypto, and the batching implementation within iaa_crypto now
> use the folio directly as the source (sg_page_iter for retrieving pages), and
> destination SG lists. A unit_size has been added to struct acomp_req, with
> kernel users such as zswap using the new acomp_request_set_unit_size() API to
> set the unit size to use while breaking down the request's src/dst
> scatterlists. zswap sets the unit-size to PAGE_SIZE.
Hi Nhat, Yosry, Herbert,
I just wanted to follow up on whether there are other code review comments
or suggestions you have on this latest patch set. Thanks very much for your time
in reviewing and improving the patch-series.
Nhat, I will change the struct zswap_entry bit-fields to macro-defined
constants, either as an update to this series or as a separate patch with
this change, if that's OK with you.
Thanks,
Kanchana
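As a reference point, the acomp_request_set_unit_size() usage described in
the quoted paragraph above would presumably look something like this in
zswap's per-CPU setup (illustrative only; the exact call site may differ):

    acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
    if (!acomp_ctx->req)
        return -ENOMEM;

    /* Ask the crypto layer to break the request's src/dst scatterlists
     * into PAGE_SIZE units, one unit per page of the folio.
     */
    acomp_request_set_unit_size(acomp_ctx->req, PAGE_SIZE);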
>
> Following Andrew's suggestion, the next two paragraphs emphasize
> generality and
> alignment with current kernel efforts.
>
> Architectural considerations for the zswap batching framework:
> ==============================================================
> We have designed the zswap batching framework to be hardware-agnostic. It has no
> dependencies on Intel-specific features and can be leveraged by any hardware
> accelerator or software-based compressor. In other words, the framework is open
> and inclusive by design.
>
> Other ongoing work that can use batching:
> =========================================
> This patch-series demonstrates the performance benefits of compress
> batching when used in zswap_store() of large folios. shrink_folio_list()
> "reclaim batching" of any-order folios is the next major work that uses
> this zswap compress batching framework: our testing of kernel_compilation
> with writeback and the zswap shrinker indicates 10X fewer pages get
> written back when we reclaim 32 folios as a batch, as compared to one
> folio at a time: this is with deflate-iaa and with zstd. We expect to
> submit a patch-series with this data and the resulting performance
> improvements shortly. Reclaim batching relieves memory pressure faster
> than reclaiming one folio at a time, hence alleviates the need to scan
> slab memory for writeback.
>
> Many thanks to Nhat for suggesting ideas on using batching with the
> ongoing kcompressd work, as well as beneficially using decompression
> batching & block IO batching to improve zswap writeback efficiency.
>
> Experiments with kernel compilation benchmark (allmod config) that
> combine zswap compress batching, reclaim batching, swapin_readahead()
> decompression batching of prefetched pages, and writeback batching show
> that 0 pages are written back to disk with deflate-iaa and zstd. For
> comparison, the baselines for these compressors see 200K-800K pages
> written to disk.
>
> To summarize, these are future clients of the batching framework:
>
> - shrink_folio_list() reclaim batching of multiple folios:
> Implemented, will submit patch-series.
> - zswap writeback with decompress batching:
> Implemented, will submit patch-series.
> - zram:
> Implemented, will submit patch-series.
> - kcompressd:
> Not yet implemented.
> - file systems:
> Not yet implemented.
> - swapin_readahead() decompression batching of prefetched pages:
> Implemented, will submit patch-series.
>
>
> iaa_crypto Driver Rearchitecting and Optimizations:
> ===================================================
>
> The most significant highlight of v13 is a new, lightweight and highly
> optimized iaa_crypto driver, resulting directly in the latency and
> throughput improvements noted later in this cover letter.
>
> 1) Better stability, more functionally versatile to support zswap
> with better performance on different Intel platforms.
>
> a) Patches 0002, 0005 and 0011 together resolve a race condition in
> mainline v6.15, reported from internal validation, when IAA
> wqs/devices are disabled while workloads are using IAA.
>
> b) Patch 0002 introduces a new architecture for mapping cores to
> IAAs based on packages instead of NUMA nodes, and generalizing
> how WQs are used: as package level shared resources for all
> same-package cores (default for compress WQs), or dedicated to
> mapped cores (default for decompress WQs). Further, users are
> able to configure multiple WQs and specify how many of those are
> for compress jobs only vs. decompress jobs only. sysfs iaa_crypto
> driver parameters can be used to change the default settings for
> performance tuning.
>
> c) idxd descriptor allocation moved from blocking to non-blocking
> with retry limits and mitigations if limits are exceeded.
>
> d) Code cleanup for readability and clearer code flow.
>
> e) Fixes IAA re-registration errors upon disabling/enabling IAA wqs
> and devices that exists in the mainline v6.15.
>
> f) Addition of a layer that encapsulates iaa_crypto's core functionality to
> rely only on idxd, dma and scatterlists to provide clean interfaces to
> crypto_acomp.
>
> g) New Dynamic compression mode for Granite Rapids to get better
> compression ratio by echo-ing 'deflate-iaa-dynamic' as the zswap
> compressor.
>
> h) New crypto_acomp API crypto_acomp_batch_size() that will return
> the driver's max batch size if the driver has registered a batch_size
> that's greater than 1; or 1 if there is no driver specific definition of
> batch_size.
>
> Accordingly, iaa_crypto sets the acomp_alg batch_size to its internal
> IAA_CRYPTO_MAX_BATCH_SIZE for fixed and dynamic modes.
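A caller-side sketch of the crypto_acomp_batch_size() helper described in h),
as zswap might use it at pool creation; capping against ZSWAP_MAX_BATCH_SIZE
is an assumption based on the zswap patches later in this series:

    /* 1 for non-batching compressors; the driver's registered batch size
     * (e.g. IAA_CRYPTO_MAX_BATCH_SIZE for deflate-iaa) otherwise.
     */
    pool->compr_batch_size = min(crypto_acomp_batch_size(acomp_ctx->acomp),
                                 ZSWAP_MAX_BATCH_SIZE);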
>
> 2) Performance optimizations (please refer to the latency data per
> optimization in the commit logs):
>
> a) Distributing [de]compress jobs in round-robin manner to available
> IAAs on package.
>
> b) Replacing the compute-intensive iaa_wq_get()/iaa_wq_put() with a
> percpu_ref in struct iaa_wq, thereby eliminating acquiring a
> spinlock in the fast path, while using a combination of the
> iaa_crypto_enabled atomic with spinlocks in the slow path to
> ensure the compress/decompress code sees a consistent state of the
> wq tables.
>
> c) Directly call movdir64b for non-irq use cases, i.e., the most
> common usage. Avoid the overhead of irq-specific computes in
> idxd_submit_desc() to gain latency.
>
> d) Batching of compressions/decompressions using async submit-poll
> mechanism to derive the benefits of hardware parallelism.
>
> e) Batching compressors need to manage their own "requests"
> abstraction, and remove this driver-specific aspect from being
> managed by kernel users such as zswap. iaa_crypto maintains
> per-CPU "struct iaa_req **reqs" to submit multiple jobs to the
> hardware accelerator to run in parallel.
>
> f) Modifies the iaa_crypto batching API and their implementation to expect a
> src SG list that contains the batch's pages and a dst SG list that has
> multiple scatterlists for the batch's output buffers.
>
> g) Submit the two largest data buffers first for decompression
> batching, so that the longest running jobs get a head start,
> reducing latency for the batch.
>
> 3) Compress/decompress batching are implemented using SG lists as the batching
> interface.
>
>
> Main Changes in Zswap Compression Batching:
> ===========================================
>
> Note to zswap maintainers:
> --------------------------
> Patches 19 and 20 can be reviewed and improved/merged independently
> of this series, since they are zswap centric. These 2 patches help
> batching but the crypto_acomp_batch_size() from the iaa_crypto commits
> in this series is not a requirement, unlike patches 21-22.
>
> 1) v13 preserves the pool acomp_ctx resources creation/deletion
> simplification of v11, namely, lasting from pool creation-deletion,
> persisting through CPU hot[un]plug operations. Further, zswap no
> longer needs to create multiple "struct acomp_req" in the per-CPU
> acomp_ctx. zswap only needs to manage multiple "u8 **buffers".
>
> 2) We store the compressor's batch-size (@pool->compr_batch_size) directly in
> struct zswap_pool for quick retrieval in the zswap_store() fast path.
>
> 3) Optimizations to not cause regressions in software compressors with
> the introduction of the new unified zswap_compress() framework that
> implements compression batching for all compressors. These optimizations
> help recover the performance for non-batching compressors:
>
> a) kmem_cache_alloc_bulk(), kmem_cache_free_bulk() to allocate/free
> batch zswap_entry-s. These kmem_cache API allow allocator
> optimizations with internal locks for multiple allocations.
>
> b) The page's nid is stored in a new nid field added to zswap_entry, so the
> zswap_lru_add()/zswap_lru_del() will add/delete the entry from the LRU
> list of the page's nid. This preserves the current behavior wrt the
> shrinker.
>
> c) Writes to the zswap_entry right after it is allocated without
> modifying the publishing order. This avoids different code blocks
> in zswap_store_pages() having to bring the zswap_entries to the
> cache for writing, potentially evicting other working set
> structures, impacting performance.
>
> d) ZSWAP_MAX_BATCH_SIZE is used as the batch-size for software
> compressors, since this gives the best performance with zstd.
>
> e) Minimize branches in zswap_compress().
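To illustrate a) and b), a rough sketch of the bulk allocation with per-page
nid tracking in zswap_store_pages() (illustrative; the actual hunk may differ):

    void *entries[ZSWAP_MAX_BATCH_SIZE];
    unsigned int idx;

    /* kmem_cache_alloc_bulk() either allocates all objects or none. */
    if (kmem_cache_alloc_bulk(zswap_entry_cache, GFP_KERNEL,
                              nr_pages, entries) != nr_pages)
        return -ENOMEM;

    for (idx = 0; idx < nr_pages; idx++) {
        struct zswap_entry *entry = entries[idx];

        /* Record the page's nid so zswap_lru_add()/zswap_lru_del()
         * keep using the LRU list of the page's node, as before.
         */
        entry->nid = page_to_nid(folio_page(folio, start + idx));
    }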
>
> 4) During pool creation, these key additions are allocated as part of the
> per-CPU acomp_ctx so as to recover performance with the new, generalized SG
> lists based zswap_compress() batching interface:
>
> a) An sg_table "acomp_ctx->sg_outputs" is allocated to contain the
> compressor's batch-size number of SG lists that will contain the
> destination buffers/lengths after batch compression.
>
> b) The per-CPU destination buffers are mapped to the per-CPU SG lists: this
> needs to be done only once, and optimizes performance.
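A sketch of a) and b): allocate the per-CPU output sg_table once at pool/CPU
bring-up and point its entries at the per-CPU dst buffers (illustrative;
error unwinding omitted, and 'batch_size' stands in for the pool's compressor
batch size):

    struct scatterlist *sg;
    unsigned int idx;

    acomp_ctx->sg_outputs = kmalloc(sizeof(*acomp_ctx->sg_outputs),
                                    GFP_KERNEL);
    if (!acomp_ctx->sg_outputs)
        return -ENOMEM;

    if (sg_alloc_table(acomp_ctx->sg_outputs, batch_size, GFP_KERNEL))
        return -ENOMEM;

    /* Map each per-CPU dst buffer into its SG entry exactly once, so the
     * zswap_store() fast path never has to rebuild the output SG list.
     */
    for_each_sg(acomp_ctx->sg_outputs->sgl, sg, batch_size, idx)
        sg_set_buf(sg, acomp_ctx->buffers[idx], PAGE_SIZE);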
>
> 5) A unified zswap_compress() API is added to compress multiple pages. Thanks
> to Nhat, Yosry and Johannes for their helpful suggestions to accomplish this.
>
> 6) Finally, zswap_compress() has been re-written to incorporate Herbert's
> suggestions to use source folios and output SG lists for batching. The new
> zswap_compress() code has been made as generic to software and batching
> compressors as possible, so that it is easy to read and maintain. The
> recent changes related to PAGE_SIZE dst buffers, zsmalloc and incompressible
> pages have been incorporated into the batched zswap_compress() as well. To
> resolve regressions with zstd, I took the liberty of not explicitly checking
> for dlen == 0 and dlen > PAGE_SIZE (as in the mainline); instead,
> expecting that a negative err value will be returned by the software
> compressor in such cases.
>
>
> Compression Batching:
> =====================
>
> This patch-series introduces batch compression of pages in large folios to
> improve zswap swapout latency. It preserves the existing zswap protocols
> for non-batching software compressors by calling crypto_acomp sequentially
> per page in the batch. Additionally, in support of hardware accelerators
> that can process a batch as an integral unit, the patch-series allows
> zswap to call crypto_acomp without API changes, for compressors
> that intrinsically support batching. The zswap_compress() code has very
> minimal
> special casing for software/batching compressors.
>
> The patch series provides a proof point by using the Intel Analytics
> Accelerator (IAA) for implementing the compress/decompress batching API
> using hardware parallelism in the iaa_crypto driver and another proof point
> with a sequential software compressor, zstd.
>
> SUMMARY:
> ========
>
> The first proof point is to test with IAA using a sequential call (fully
> synchronous, compress one page at a time) vs. a batching call (fully
> asynchronous, submit a batch to IAA for parallel compression, then poll for
> completion statuses).
>
> The performance testing data with 30 usemem processes/64K folios
> shows 62% throughput gains and 28% elapsed/sys time reductions with
> deflate-iaa; and 5% sys time reduction with zstd for a small
> throughput increase. For PMD folios, a 67% throughput gain and 23%
> elapsed/sys time reduction is seen.
>
> Kernel compilation test with 64K folios using 32 threads and the
> zswap shrinker_enabled set to "N", demonstrates similar
> improvements: zswap_store() large folios using IAA compress batching
> improves the workload performance by 3.5% and reduces sys time by
> 6% as compared to IAA sequential. For zstd, compress batching
> improves workload performance by 3.4% and reduces sys time by
> 1.8% as compared to sequentially calling zswap_compress() per page
> in a folio.
>
> The main takeaway from usemem, a workload that is mostly compression
> dominated (very few swapins) is that the higher the number of batches,
> such as with larger folios, the more the benefit of batching cost
> amortization, as shown by the PMD usemem data. This aligns well with the
> future direction for batching.
>
> The second proof point is to make sure that software algorithms such as
> zstd do not regress. The data indicates that for sequential software
> algorithms a performance gain is achieved.
>
> With the performance optimizations implemented in patches 21-22 of v13:
>
> * zstd usemem metrics with 64K folios are within range of variation
> with a slight sys time improvement. zstd usemem30 workload performance
> with PMD folios improves by 6% and sys time reduces by 8%, for comparable
> throughput as the baseline.
>
> * With kernel compilation, I used zstd without the zswap shrinker to enable
> more direct comparisons with the changes in this series. Subsequent patch
> series I expect to submit in collaboration with Nhat, will enable the
> zswap shrinker to quantify the benefits of decompression batching during
> writeback. With this series' compression batching within large folios, we
> get a 6%-1.8% reduction in sys time, a 3.5%-3.4% improvement in workload
> performance with 64K folios for deflate-iaa/zstd respectively.
>
> These optimizations pertain to ensuring common code paths and removing
> redundant branches/computes. Additionally, using the batching code for
> non-batching compressors to sequentially compress/store batches of up
> to ZSWAP_MAX_BATCH_SIZE pages seems to help, most likely due to
> cache locality of working set structures such as the array of
> zswap_entry-s for the batch.
>
> Our internal validation of zstd with the batching interface vs. IAA with
> the batching interface on Emerald Rapids has shown that IAA
> compress/decompress batching gives 21.3% more memory savings as compared
> to zstd, for 5% performance loss as compared to the baseline without any
> memory pressure. IAA batching demonstrates more than 2X the memory
> savings obtained by zstd at this 95% performance KPI.
> The compression ratio with IAA is 2.23, and with zstd 2.96. Even with
> this compression ratio deficit for IAA, batching is extremely
> beneficial. As we improve the compression ratio of the IAA accelerator,
> we expect to see even better memory savings with IAA as compared to
> software compressors.
>
>
> Batching Roadmap:
> =================
>
> 1) Compression batching within large folios (this series).
>
> 2) zswap writeback decompression batching:
>
> This is being co-developed with Nhat Pham, and shows promising
> results. We plan to submit an RFC shortly.
>
> 3) Reclaim batching of hybrid folios:
>
> We can expect to see even more significant performance and throughput
> improvements if we use the parallelism offered by IAA to do reclaim
> batching of 4K/large folios (really any-order folios), and using the
> zswap_store() high throughput compression pipeline to batch-compress
> pages comprising these folios, not just batching within large
> folios. This is the reclaim batching patch 13 in v1, which we expect
> to submit in a separate patch-series. As mentioned earlier, reclaim
> batching reduces the # of writeback pages by 10X for zstd and
> deflate-iaa.
>
> 4) swapin_readahead() decompression batching:
>
> We have developed a zswap load batching interface to be used
> for parallel decompression batching, using swapin_readahead().
>
> These capabilities are architected so as to be useful to zswap and
> zram. We have integrated these components with zram and expect to submit an
> RFC soon.
>
>
> v13 Performance Summary:
> ========================
>
> This is a performance testing summary of results with usemem30
> (30 usemem processes running in a cgroup limited at 150G, each trying to
> allocate 10G).
>
> usemem30 with 64K folios:
> =========================
>
> zswap shrinker_enabled = N.
>
> ------------------------------------------------------------------------------
>                              mm-unstable-10-24-2025    v13
> ------------------------------------------------------------------------------
> zswap compressor             deflate-iaa               deflate-iaa    IAA Batching
>                                                                       vs.
>                                                                       IAA Sequential
> ------------------------------------------------------------------------------
> Total throughput (KB/s)      6,118,675                 9,901,216      62%
> Average throughput (KB/s)    203,955                   330,040        62%
> elapsed time (sec)           98.94                     70.90          -28%
> sys time (sec)               2,379.29                  1,686.18       -29%
> ------------------------------------------------------------------------------
>
> ------------------------------------------------------------------------------
>                              mm-unstable-10-24-2025    v13
> ------------------------------------------------------------------------------
> zswap compressor             zstd                      zstd           v13 zstd
>                                                                       improvement
> ------------------------------------------------------------------------------
> Total throughput (KB/s)      5,983,561                 6,003,851      0.3%
> Average throughput (KB/s)    199,452                   200,128        0.3%
> elapsed time (sec)           100.93                    96.62          -4.3%
> sys time (sec)               2,532.49                  2,395.83       -5%
> ------------------------------------------------------------------------------
>
> usemem30 with 2M folios:
> ========================
>
> ------------------------------------------------------------------------------
>                              mm-unstable-10-24-2025    v13
> ------------------------------------------------------------------------------
> zswap compressor             deflate-iaa               deflate-iaa    IAA Batching
>                                                                       vs.
>                                                                       IAA Sequential
> ------------------------------------------------------------------------------
> Total throughput (KB/s)      6,309,635                 10,558,225     67%
> Average throughput (KB/s)    210,321                   351,940        67%
> elapsed time (sec)           88.70                     67.84          -24%
> sys time (sec)               2,059.83                  1,581.07       -23%
> ------------------------------------------------------------------------------
>
> ------------------------------------------------------------------------------
>                              mm-unstable-10-24-2025    v13
> ------------------------------------------------------------------------------
> zswap compressor             zstd                      zstd           v13 zstd
>                                                                       improvement
> ------------------------------------------------------------------------------
> Total throughput (KB/s)      6,562,687                 6,567,946      0.1%
> Average throughput (KB/s)    218,756                   218,931        0.1%
> elapsed time (sec)           94.69                     88.79          -6%
> sys time (sec)               2,253.97                  2,083.43       -8%
> ------------------------------------------------------------------------------
>
>
> This is a performance testing summary of results with
> kernel_compilation test (allmod config, 32 cores, cgroup limited to 2G).
>
> zswap shrinker_enabled = N.
>
> kernel_compilation with 64K folios:
> ===================================
>
> ------------------------------------------------------------------------------
>                              mm-unstable-10-24-2025    v13
> ------------------------------------------------------------------------------
> zswap compressor             deflate-iaa               deflate-iaa    IAA Batching
>                                                                       vs.
>                                                                       IAA Sequential
> ------------------------------------------------------------------------------
> real_sec                     836.64                    806.94         -3.5%
> sys_sec                      3,897.57                  3,661.83       -6%
> ------------------------------------------------------------------------------
>
> ------------------------------------------------------------------------------
>                              mm-unstable-10-24-2025    v13
> ------------------------------------------------------------------------------
> zswap compressor             zstd                      zstd           Improvement
> ------------------------------------------------------------------------------
> real_sec                     880.62                    850.41         -3.4%
> sys_sec                      5,171.90                  5,076.51       -1.8%
> ------------------------------------------------------------------------------
>
>
> kernel_compilation with PMD folios:
> ===================================
>
> ------------------------------------------------------------------------------
>                              mm-unstable-10-24-2025    v13
> ------------------------------------------------------------------------------
> zswap compressor             deflate-iaa               deflate-iaa    IAA Batching
>                                                                       vs.
>                                                                       IAA Sequential
> ------------------------------------------------------------------------------
> real_sec                     818.48                    779.67         -4.7%
> sys_sec                      4,226.52                  4,245.18       0.4%
> ------------------------------------------------------------------------------
>
> ------------------------------------------------------------------------------
>                              mm-unstable-10-24-2025    v13
> ------------------------------------------------------------------------------
> zswap compressor             zstd                      zstd           Improvement
> ------------------------------------------------------------------------------
> real_sec                     888.45                    849.54         -4.4%
> sys_sec                      5,866.72                  5,847.17       -0.3%
> ------------------------------------------------------------------------------
>
>
>
> The patch-series is organized as follows:
> =========================================
>
> 1) crypto acomp & iaa_crypto driver enablers for batching: Relevant
> patches are tagged with "crypto:" in the subject:
>
> Patch 1) Reorganizes the iaa_crypto driver code into logically related
> sections and avoids forward declarations, in order to facilitate
> subsequent iaa_crypto patches. This patch makes no
> functional changes.
>
> Patch 2) Makes an infrastructure change in the iaa_crypto driver
> to map IAA devices/work-queues to cores based on packages
> instead of NUMA nodes. This doesn't impact performance on
> the Sapphire Rapids system used for performance
> testing. However, this change fixes functional problems we
> found on Granite Rapids during internal validation, where the
> number of NUMA nodes is greater than the number of packages,
> which was resulting in over-utilization of some IAA devices
> and non-usage of other IAA devices as per the current NUMA
> based mapping infrastructure.
>
> This patch also develops a new architecture that
> generalizes how IAA device WQs are used. It enables
> designating IAA device WQs as either compress-only or
> decompress-only or generic. Once IAA device WQ types are
> thus defined, it also allows the configuration of whether
> device WQs will be shared by all cores on the package, or
> used only by "mapped cores" obtained by a simple allocation
> of available IAAs to cores on the package.
>
> As a result of the overhaul of wq_table definition,
> allocation and rebalancing, this patch eliminates
> duplication of device WQs in per-CPU wq_tables, thereby
> saving 140MiB on a 384 cores dual socket Granite Rapids server
> with 8 IAAs.
>
> Regardless of how the user has configured the WQs' usage,
> the next WQ to use is obtained through a direct look-up in
> per-CPU "cpu_comp_wqs" and "cpu_decomp_wqs" structures so
> as to minimize latency in the critical path driver compress
> and decompress routines.
>
> Patch 3) Code cleanup, consistency of function parameters.
>
> Patch 4) Makes a change to iaa_crypto driver's descriptor allocation,
> from blocking to non-blocking with retries/timeouts and
> mitigations in case of timeouts during compress/decompress
> ops. This prevents tasks getting blocked indefinitely, which
> was observed when testing 30 cores running workloads, with
> only 1 IAA enabled on Sapphire Rapids (out of 4). These
> timeouts are typically only encountered, and associated
> mitigations exercised, only in configurations with 1 IAA
> device shared by 30+ cores.
>
> Patch 5) Optimize iaa_wq refcounts using a percpu_ref instead of
> spinlocks and "int refcount".
>
> Patch 6) Code simplification and restructuring for understandability
> in core iaa_compress() and iaa_decompress() routines.
>
> Patch 7) Refactor hardware descriptor setup to their own procedures
> to reduce code clutter.
>
> Patch 8) Simplify and optimize job submission for the most commonly used
> non-irq async mode by directly calling movdir64b.
>
> Patch 9) Deprecate exporting symbols for adding IAA compression
> modes.
>
> Patch 10) All dma_map_sg() calls will pass in 1 for the nents instead of
> sg_nents(), for these main reasons: performance; no existing
> iaa_crypto use cases that allow multiple SG lists to be mapped for
> dma at once; facilitates new SG lists batching interface through
> crypto.
>
> Patch 11) Move iaa_crypto core functionality to a layer that relies only on
> the idxd driver, dma, and scatterlists. Implement clean interfaces
> to crypto_acomp.
>
> Patch 12) Define a unit_size in struct acomp_req to enable batching, and
> provides acomp_request_set_unit_size() for use by kernel
> modules. zswap_cpu_comp_prepare() calls this API to set the
> unit_size for zswap as PAGE_SIZE.
>
> Patch 13) Implement asynchronous descriptor submit and polling mechanisms,
> enablers for batching. Develop IAA batching of compressions and
> decompressions for deriving hardware parallelism.
>
> Patch 14) Enables the "async" mode, sets it as the default.
>
> Patch 15) Disables verify_compress by default.
>
> Patch 16) Decompress batching optimization: Find the two largest
> buffers in the batch and submit them first.
>
> Patch 17) Add a new Dynamic compression mode that can be used on
> Granite Rapids.
>
> Patch 18) Add a batch_size data member to struct acomp_alg and
> a crypto_acomp_batch_size() API that returns the compressor's
> batch-size, if it has defined one; 1 otherwise.
>
> 2) zswap modifications to enable compress batching in zswap_store()
> of large folios (including pmd-mappable folios):
>
> Patch 19) Simplifies the zswap_pool's per-CPU acomp_ctx resource
> management and lifetime to be from pool creation to pool
> deletion.
>
> Patch 20) Uses IS_ERR_OR_NULL() in zswap_cpu_comp_prepare() to check for
> valid acomp/req, thereby making it consistent with the resource
> de-allocation code.
>
> Patch 21) Defines a zswap-specific ZSWAP_MAX_BATCH_SIZE (currently set
> as 8U) to denote the maximum number of acomp_ctx batching
> resources to allocate, thus limiting the amount of extra
> memory used for batching. Further, the "struct
> crypto_acomp_ctx" is modified to contain multiple buffers.
> New "u8 compr_batch_size" member is added to "struct zswap_pool"
> to track the number of dst buffers associated with the compressor
> (more than 1 if the compressor supports batching).
>
> Modifies zswap_store() to store the folio in batches of
> pool->compr_batch_size (batching compressors) or
> ZSWAP_MAX_BATCH_SIZE (sequential compressors) by calling a new
> zswap_store_pages() that takes a range of indices in the folio to
> be stored.
>
> zswap_store_pages() bulk-allocates zswap entries for the batch,
> calls zswap_compress() for each page in this range, and stores
> the entries in xarray/LRU.
>
> Patch 22) Introduces a new unified batching implementation of
> zswap_compress() for compressors that do and do not support
> batching. This eliminates code duplication and facilitates
> code maintainability with the introduction of compress
> batching. Further, there are many optimizations to this common
> code that result in workload throughput and performance
> improvements with software compressors and hardware accelerators
> such as IAA.
>
> zstd performance is on par with or better than mm-unstable. We
> see impressive throughput/performance improvements with IAA,
> and workload performance/sys time improvements with zstd
> batching vs. no-batching.
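>
> To make the control flow concrete, here is an illustrative-only C
> sketch of the batching loop described in Patches 21 and 22. The names
> zswap_store_pages(), compr_batch_size and ZSWAP_MAX_BATCH_SIZE come
> from this cover letter, but the signature of zswap_store_pages() and
> the helper store_folio_in_batches() are guesses at the shape (in the
> style of mm/zswap.c internals), not the actual code:
>
>     /* Illustrative sketch only -- not the actual zswap_store() code. */
>     static bool store_folio_in_batches(struct zswap_pool *pool,
>                                        struct folio *folio)
>     {
>             long nr_pages = folio_nr_pages(folio);
>             /* Batching compressors use their own batch size; sequential
>              * compressors fall back to the ZSWAP_MAX_BATCH_SIZE bound. */
>             unsigned int batch = pool->compr_batch_size > 1 ?
>                                  pool->compr_batch_size : ZSWAP_MAX_BATCH_SIZE;
>             long index;
>
>             for (index = 0; index < nr_pages; index += batch) {
>                     unsigned int nr = min_t(long, batch, nr_pages - index);
>
>                     /* Bulk-allocates entries, compresses the sub-batch,
>                      * and inserts the entries into the xarray/LRU. */
>                     if (!zswap_store_pages(folio, index, nr, pool))
>                             return false;
>             }
>             return true;
>     }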
>
>
> With v13 of this patch series, the IAA compress batching feature will be
> enabled seamlessly on Intel platforms that have IAA by selecting
> 'deflate-iaa' as the zswap compressor, and using the iaa_crypto 'async'
> sync_mode driver attribute (the default).
>
>
> System setup for testing:
> =========================
> Testing of this patch-series was done with mm-unstable as of 10-24-2025,
> commit 813c0fa931ce, without and with this patch-series. Data was
> gathered on a dual-socket Intel Sapphire Rapids (SPR) server with 56
> cores per socket and 4 IAA devices per socket (each IAA has a total of
> 128 WQ entries), 503 GiB RAM, and a 525G SSD swap partition. Core
> frequency was fixed at 2500 MHz.
>
> Other kernel configuration parameters:
>
> zswap compressor : zstd, deflate-iaa
> zswap allocator : zsmalloc
> vm.page-cluster : 0
>
> IAA "compression verification" is disabled and IAA is run in the async
> mode (the defaults with this series).
>
> I ran experiments with these workloads:
>
> 1) usemem 30 processes with zswap shrinker_enabled=N. Two sets of
> experiments, one with 64K folios, another with PMD folios.
>
> 2) Kernel compilation allmodconfig with 2G max memory, 32 threads, with
> zswap shrinker_enabled=N to test batching performance impact in
> isolation. Two sets of experiments, one with 64K folios, another with PMD
> folios.
>
> IAA configuration is done via a CLI script, which is included at the end
> of the cover letter.
>
>
> Performance testing (usemem30):
> ===============================
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
> processes were run, each allocating and writing 10G of memory, and
> sleeping for 10 sec before exiting:
>
> usemem --init-time -w -O -b 1 -s 10 -n 30 10g
> echo 0 > /sys/module/zswap/parameters/shrinker_enabled
>
> IAA WQ Configuration (script is included at the end of the cover
> letter):
>
> ./enable_iaa.sh -d 4 -q 1
>
> This enables all 4 IAAs on the socket, and configures 1 WQ per IAA
> device, each containing 128 entries. The driver distributes compress
> jobs from each core to wqX.0 of all same-package IAAs in a
> round-robin manner. Decompress jobs are sent to wqX.0 of the
> mapped IAA device.
>
> Since usemem has significantly more swapouts than swapins, this
> configuration is optimal.
>
> 64K folios: usemem30: deflate-iaa:
> ==================================
>
> -------------------------------------------------------------------------------
> mm-unstable-10-24-2025 v13
> -------------------------------------------------------------------------------
> zswap compressor deflate-iaa deflate-iaa IAA Batching
> vs.
> IAA Sequential
> -------------------------------------------------------------------------------
> Total throughput (KB/s) 6,118,675 9,901,216 62%
> Avg throughput (KB/s) 203,955 330,040 62%
> elapsed time (sec) 98.94 70.90 -28%
> sys time (sec) 2,379.29 1,686.18 -29%
>
> -------------------------------------------------------------------------------
> memcg_high 1,263,467 1,404,068
> memcg_swap_fail 1,728 1,377
> 64kB_swpout_fallback 1,728 1,377
> zswpout 58,174,008 64,508,622
> zswpin 43 138
> pswpout 0 0
> pswpin 0 0
> ZSWPOUT-64kB 3,634,162 4,030,643
> SWPOUT-64kB 0 0
> pgmajfault 2,398 2,488
> zswap_reject_compress_fail 0 0
> zswap_reject_reclaim_fail 0 0
> IAA incompressible pages 0 0
> -------------------------------------------------------------------------------
>
>
> 2M folios: usemem30: deflate-iaa:
> =================================
>
> -------------------------------------------------------------------------------
> mm-unstable-10-24-2025 v13
> -------------------------------------------------------------------------------
> zswap compressor deflate-iaa deflate-iaa IAA Batching
> vs.
> IAA Sequential
> -------------------------------------------------------------------------------
> Total throughput (KB/s) 6,309,635 10,558,225 67%
> Avg throughput (KB/s) 210,321 351,940 67%
> elapsed time (sec) 88.70 67.84 -24%
> sys time (sec) 2,059.83 1,581.07 -23%
>
> -------------------------------------------------------------------------------
> memcg_high 116,246 125,218
> memcg_swap_fail 41 177
> thp_swpout_fallback 41 177
> zswpout 59,880,021 64,509,854
> zswpin 69 425
> pswpout 0 0
> pswpin 0 0
> ZSWPOUT-2048kB 116,912 125,822
> thp_swpout 0 0
> pgmajfault 2,408 4,026
> zswap_reject_compress_fail 0 0
> zswap_reject_reclaim_fail 0 0
> IAA incompressible pages 0 0
> -------------------------------------------------------------------------------
>
>
> 64K folios: usemem30: zstd:
> ===========================
>
> -------------------------------------------------------------------------------
> mm-unstable-10-24-2025 v13
> -------------------------------------------------------------------------------
> zswap compressor zstd zstd v13 zstd
> improvement
> -------------------------------------------------------------------------------
> Total throughput (KB/s) 5,983,561 6,003,851 0.3%
> Avg throughput (KB/s) 199,452 200,128 0.3%
> elapsed time (sec) 100.93 96.62 -4.3%
> sys time (sec) 2,532.49 2,395.83 -5%
>
> -------------------------------------------------------------------------------
> memcg_high 1,122,198 1,113,384
> memcg_swap_fail 192 55
> 64kB_swpout_fallback 192 55
> zswpout 48,766,907 48,799,863
> zswpin 89 68
> pswpout 0 0
> pswpin 0 0
> ZSWPOUT-64kB 3,047,702 3,049,908
> SWPOUT-64kB 0 0
> pgmajfault 2,428 2,390
> zswap_reject_compress_fail 0 0
> zswap_reject_reclaim_fail 0 0
> -------------------------------------------------------------------------------
>
>
> 2M folios: usemem30: zstd:
> ==========================
>
> -------------------------------------------------------------------------------
> mm-unstable-10-24-2025 v13
> -------------------------------------------------------------------------------
> zswap compressor zstd zstd v13 zstd
> improvement
> -------------------------------------------------------------------------------
> Total throughput (KB/s) 6,562,687 6,567,946 0.1%
> Avg throughput (KB/s) 218,756 218,931 0.1%
> elapsed time (sec) 94.69 88.79 -6%
> sys time (sec) 2,253.97 2,083.43 -8%
>
> --------------------------------------------------------------------------------
> memcg_high 92,709 92,686
> memcg_swap_fail 33 226
> thp_swpout_fallback 33 226
> zswpout 47,851,601 47,847,171
> zswpin 65 441
> pswpout 0 0
> pswpin 0 0
> ZSWPOUT-2048kB 93,427 93,238
> thp_swpout 0 0
> pgmajfault 2,382 2,767
> zswap_reject_compress_fail 0 0
> zswap_reject_reclaim_fail 0 0
> -------------------------------------------------------------------------------
>
>
> Performance testing (Kernel compilation, allmodconfig):
> =======================================================
>
> The kernel compilation experiments use 32 threads to build
> "allmodconfig", which takes ~14 minutes and has considerable
> swapout/swapin activity. The cgroup's memory.max is set to 2G. zswap
> writeback is not enabled, so as to isolate the performance impact of
> large folio batch compression alone.
>
> echo 0 > /sys/module/zswap/parameters/shrinker_enabled
>
> IAA WQ Configuration (script is at the end of the cover letter):
>
> ./enable_iaa.sh -d 4 -q 2
>
> This enables all 4 IAAs on the socket, and configures 2 WQs per IAA,
> each containing 64 entries. The driver sends decompress jobs to wqX.0 of
> the mapped IAA device, and distributes compress jobs to wqX.1 of all
> same-package IAAs in a round-robin manner.
>
> 64K folios: Kernel compilation/allmodconfig: deflate-iaa:
> =========================================================
>
> -------------------------------------------------------------------------------
> mm-unstable-10-24-2025 v13
> -------------------------------------------------------------------------------
> zswap compressor deflate-iaa deflate-iaa IAA Batching
> vs.
> IAA Sequential
> -------------------------------------------------------------------------------
> real_sec 836.64 806.94 -3.5%
> user_sec 15,702.26 15,695.13
> sys_sec 3,897.57 3,661.83 -6%
> -------------------------------------------------------------------------------
> Max_Res_Set_Size_KB 1,872,500 1,873,144
> -------------------------------------------------------------------------------
> memcg_high 0 0
> memcg_swap_fail 0 0
> 64kB_swpout_fallback 0 0
> zswpout 94,890,390 93,332,527
> zswpin 28,305,656 28,111,525
> pswpout 0 0
> pswpin 0 0
> ZSWPOUT-64kB 3,088,473 3,018,341
> SWPOUT-64kB 0 0
> pgmajfault 29,958,141 29,776,102
> zswap_reject_compress_fail 0 0
> zswap_reject_reclaim_fail 0 0
> IAA incompressible pages 684 442
> -------------------------------------------------------------------------------
>
>
> 2M folios: Kernel compilation/allmodconfig: deflate-iaa:
> ========================================================
>
> -------------------------------------------------------------------------------
> mm-unstable-10-24-2025 v13
> -------------------------------------------------------------------------------
> zswap compressor deflate-iaa deflate-iaa IAA Batching
> vs.
> IAA Sequential
> -------------------------------------------------------------------------------
> real_sec 818.48 779.67 -4.7%
> user_sec 15,798.78 15,807.93
> sys_sec 4,226.52 4,245.18 0.4%
> -------------------------------------------------------------------------------
> Max_Res_Set_Size_KB 1,871,096 1,871,100
> -------------------------------------------------------------------------------
> memcg_high 0 0
> memcg_swap_fail 0 0
> thp_swpout_fallback 0 0
> zswpout 105,675,621 109,930,550
> zswpin 36,537,688 38,205,575
> pswpout 0 0
> pswpin 0 0
> ZSWPOUT-2048kB 15,600 15,800
> thp_swpout 0 0
> pgmajfault 37,843,091 39,540,387
> zswap_reject_compress_fail 0 0
> zswap_reject_reclaim_fail 0 0
> IAA incompressible pages 188 349
> -------------------------------------------------------------------------------
>
>
> With the iaa_crypto driver changes for non-blocking descriptor allocations,
> no timeouts-with-mitigations were seen in compress/decompress jobs in any
> of the above experiments.
>
>
> 64K folios: Kernel compilation/allmodconfig: zstd:
> ==================================================
>
> -------------------------------------------------------------------------------
> mm-unstable-10-24-2025 v13
> -------------------------------------------------------------------------------
> zswap compressor zstd zstd Improvement
> -------------------------------------------------------------------------------
> real_sec 880.62 850.41 -3.4%
> user_sec 15,717.23 15,683.17
> sys_sec 5,171.90 5,076.51 -1.8%
> -------------------------------------------------------------------------------
> Max_Res_Set_Size_KB 1,871,276 1,874,744
> -------------------------------------------------------------------------------
> memcg_high 0 0
> memcg_swap_fail 0 0
> 64kB_swpout_fallback 0 0
> zswpout 76,599,637 76,472,392
> zswpin 21,833,178 22,538,969
> pswpout 0 0
> pswpin 0 0
> ZSWPOUT-64kB 2,462,404 2,446,549
> SWPOUT-64kB 0 0
> pgmajfault 23,027,211 23,830,391
> zswap_reject_compress_fail 0 0
> zswap_reject_reclaim_fail 0 0
> -------------------------------------------------------------------------------
>
>
> 2M folios: Kernel compilation/allmodconfig: zstd:
> =================================================
>
> -------------------------------------------------------------------------------
> mm-unstable-10-24-2025 v13
> -------------------------------------------------------------------------------
> zswap compressor zstd zstd Improvement
> -------------------------------------------------------------------------------
> real_sec 888.45 849.54 -4.4%
> user_sec 15,841.87 15,828.10
> sys_sec 5,866.72 5,847.17 -0.3%
> -------------------------------------------------------------------------------
> Max_Res_Set_Size_KB 1,871,096 1,872,892
> -------------------------------------------------------------------------------
> memcg_high 0 0
> memcg_swap_fail 0 0
> thp_swpout_fallback 0 0
> zswpout 89,891,328 90,847,761
> zswpin 29,249,656 29,999,617
> pswpout 0 0
> pswpin 0 0
> ZSWPOUT-2048kB 12,198 12,481
> thp_swpout 0 0
> pgmajfault 30,077,425 30,915,945
> zswap_reject_compress_fail 0 0
> zswap_reject_reclaim_fail 0 0
> -------------------------------------------------------------------------------
>
>
>
> Changes since v12:
> ==================
> 1) Rebased to mm-unstable as of 10-24-2025, commit 813c0fa931ce.
> 2) Added "int nid" to zswap_entry to store the page's nid, to preserve zswap
> LRU list/shrinker behavior with bulk allocation, as suggested by Nhat and
> Yosry. No change in memory footprint of struct zswap_entry.
> 3) Added a WARN_ON() if kmem_cache_alloc_bulk() returns 0 or a number
> that's different than nr_entries, as suggested by Yosry.
> 4) Confirmed that kmem_cache_bulk_free() works for both bulk and non-bulk
> allocated entries, to follow up on Yosry's comment.
> 5) Moved the call to cpuhp_state_remove_instance() to zswap_pool_destroy(),
> as suggested by Yosry.
> 6) Variable names changed to "nid" and "wb_enabled", per Yosry's suggestion.
> 7) Concise comments in zswap.c, and summarized commit logs, as suggested
> by Yosry.
> 8) Minimized branches in zswap_compress().
> 9) Dropped allocating extra memory in acomp_req->__ctx[] to statically
> store addresses of the SG lists' lengths, as suggested by Herbert.
> 10) Deleted the iaa_comp API and export symbols, as suggested by Herbert.
> 11) Deleted @batch_size in struct crypto_acomp. Instead, the value is
> returned from struct acomp_alg directly, as suggested by Herbert.
> 12) Addressed checkpatch.pl warnings and coding style suggestions in the
> iaa_crypto patches, provided by Vinicius Gomes in internal code
> reviews. Thanks Vinicius!
>
>
> Changes since v11:
> ==================
> 1) Rebased to mm-unstable as of 9-18-2025, commit 1f98191f08b4.
> 2) Incorporated Herbert's suggestions on submitting the folio as the source
> and SG lists for the destination to create the compress batching interface
> from zswap to crypto.
> 3) As per Herbert's suggestion, added a new unit_size member to struct
> acomp_req, along with an acomp_request_set_unit_size() API for kernel
> modules to set the unit size to use while breaking down the request's
> src/dst scatterlists.
> 4) Implemented iaa_crypto batching using the new SG lists based
> architecture and crypto interfaces.
> 5) To make the SG lists based approach functional and performant for IAA,
> I have changed all the calls to dma_map_sg() to use nents of 1. This
> should not be a concern, since it eliminates redundant computes to scan
> an SG list with only one scatterlist for existing kernel users, i.e.
> zswap with the zswap_compress() modifications in this series. This will
> continue to hold true with the zram IAA batching support I am
> developing. There are no kernel use cases for the iaa_crypto driver that
> will break this assumption.
> 6) Addressed Herbert's comment about batch_size being a statically defined
> data member in struct acomp_alg and struct crypto_acomp.
> 7) Addressed Nhat's comment about VM_WARN_ON_ONCE(nr_pages >
> ZSWAP_MAX_BATCH_SIZE) in zswap_store_pages().
> 8) Nhat's comment about deleting struct swap_batch_decomp_data is
> automatically addressed by the SG lists based rewrite of the crypto
> batching interface.
> 9) Addressed Barry's comment about renaming pool->batch_size to
> pool->store_batch_size.
> 10) Incorporated Barry's suggestion to merge patches that introduce data
> members to structures and/or API and their usage.
> 11) Added performance data to patch 0023's commit log, as suggested by
> Barry.
>
> Changes since v10:
> ==================
> 1) Rebased to mm-unstable as of 7-30-2025, commit 01da54f10fdd.
> 2) Added change logging in patch 0024 on there being no Intel-specific
> dependencies in the batching framework, as suggested by
> Andrew Morton. Thanks Andrew!
> 3) Added change logging in patch 0024 on other ongoing work that can use
> batching, as per Andrew's suggestion. Thanks Andrew!
> 4) Added the IAA configuration script in the cover letter, as suggested
> by Nhat Pham. Thanks Nhat!
> 5) As suggested by Nhat, dropped patch 0020 from v10, that moves CPU
> hotplug procedures to pool functions.
> 6) Gathered kernel_compilation 'allmod' config performance data with
> writeback and zswap shrinker_enabled=Y.
> 7) Changed the pool->batch_size for software compressors to be
> ZSWAP_MAX_BATCH_SIZE since this gives better performance with the
> zswap shrinker enabled.
> 8) Was unable to replicate in v11 the issue seen in v10 with higher
> memcg_swap_fail than in the baseline, with usemem30/zstd.
>
> Changes since v9:
> =================
> 1) Rebased to mm-unstable as of 6-24-2025, commit 23b9c0472ea3.
> 2) iaa_crypto rearchitecting, mainline race condition fix, performance
> optimizations, code cleanup.
> 3) Addressed Herbert's comments in v9 patch 10, that an array based
> crypto_acomp interface is not acceptable.
> 4) Optimized the implementation of the batching zswap_compress() and
> zswap_store_pages() added in v9, to recover performance when
> integrated with the changes in commit 56e5a103a721 ("zsmalloc: prefer
> the the original page's node for compressed data").
>
> Changes since v8:
> =================
> 1) Rebased to mm-unstable as of 4-21-2025, commit 2c01d9f3c611.
> 2) Backported commits for reverting request chaining, since these are
> in cryptodev-2.6 but not yet in mm-unstable: without these backports,
> deflate-iaa is non-functional in mm-unstable:
> commit 64929fe8c0a4 ("crypto: acomp - Remove request chaining")
> commit 5976fe19e240 ("Revert "crypto: testmgr - Add multibuffer acomp
> testing"")
> Backported this hotfix as well:
> commit 002ba346e3d7 ("crypto: scomp - Fix off-by-one bug when
> calculating last page").
> 3) crypto_acomp_[de]compress() restored to non-request chained
> implementations since request chaining has been removed from acomp in
> commit 64929fe8c0a4 ("crypto: acomp - Remove request chaining").
> 4) New IAA WQ architecture to denote WQ type and whether or not a WQ
> should be shared among all package cores, or only to the "mapped"
> ones from an even cores-to-IAA distribution scheme.
> 5) Compress/decompress batching are implemented in iaa_crypto using new
> crypto_acomp_batch_compress()/crypto_acomp_batch_decompress() API.
> 6) Defines a "void *data" in struct acomp_req, based on Herbert advising
> against using req->base.data in the driver. This is needed for async
> submit-poll to work.
> 7) In zswap.c, moved the CPU hotplug callbacks to reside in "pool
> functions", per Yosry's suggestion to move procedures in a distinct
> patch before refactoring patches.
> 8) A new "u8 nr_reqs" member is added to "struct zswap_pool" to track
> the number of requests/buffers associated with the per-cpu acomp_ctx,
> as per Yosry's suggestion.
> 9) Simplifications to the acomp_ctx resources allocation, deletion,
> locking, and for these to exist from pool creation to pool deletion,
> based on v8 code review discussions with Yosry.
> 10) Use IS_ERR_OR_NULL() consistently in zswap_cpu_comp_prepare() and
> acomp_ctx_dealloc(), as per Yosry's v8 comment.
> 11) zswap_store_folio() is deleted, and instead, the loop over
> zswap_store_pages() is moved inline in zswap_store(), per Yosry's
> suggestion.
> 12) Better structure in zswap_compress(), a unified procedure that
> compresses/stores a batch of pages for both non-batching and
> batching compressors. Renamed from zswap_batch_compress() to
> zswap_compress(). Thanks Yosry for these suggestions.
>
>
> Changes since v7:
> =================
> 1) Rebased to mm-unstable as of 3-3-2025, commit 5f089a9aa987.
> 2) Changed the acomp_ctx->nr_reqs to be u8 since ZSWAP_MAX_BATCH_SIZE is
> defined as 8U, for saving memory in this per-cpu structure.
> 3) Fixed a typo in code comments in acomp_ctx_get_cpu_lock():
> acomp_ctx->initialized to acomp_ctx->__online.
> 4) Incorporated suggestions from Yosry, Chengming, Nhat and Johannes,
> thanks to all!
> a) zswap_batch_compress() replaces zswap_compress(). Thanks Yosry
> for this suggestion!
> b) Process the folio in sub-batches of ZSWAP_MAX_BATCH_SIZE, regardless
> of whether or not the compressor supports batching. This gets rid of
> the kmalloc(entries), and allows us to allocate an array of
> ZSWAP_MAX_BATCH_SIZE entries on the stack. This is implemented in
> zswap_store_pages().
> c) Use of a common structure and code paths for compressing a folio in
> batches, either as a request chain (in parallel in IAA hardware) or
> sequentially. No code duplication since zswap_compress() has been
> replaced with zswap_batch_compress(), simplifying maintainability.
> 5) A key difference between compressors that support batching and
> those that do not is that for the latter, the acomp_ctx mutex is
> locked/unlocked per ZSWAP_MAX_BATCH_SIZE batch, so that decompressions
> to handle page-faults can make progress. This fixes the zstd kernel
> compilation regression seen in v7. For compressors that support
> batching, e.g. IAA, the mutex is locked/released once for storing
> the folio.
> 6) Used likely/unlikely compiler directives and prefetchw to restore
> performance with the common code paths.
>
> Changes since v6:
> =================
> 1) Rebased to mm-unstable as of 2-27-2025, commit d58172d128ac.
>
> 2) Deleted crypto_acomp_batch_compress() and
> crypto_acomp_batch_decompress() interfaces, as per Herbert's
> suggestion. Batching is instead enabled by chaining the requests. For
> non-batching compressors, there is no request chaining involved. Both
> batching and non-batching compressions are accomplished by zswap by
> calling:
>
> crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
>
> 3) iaa_crypto implementation of batch compressions/decompressions using
> request chaining, as per Herbert's suggestions.
> 4) Simplification of the acomp_ctx resource allocation/deletion with
> respect to CPU hot[un]plug, to address Yosry's suggestions to explore the
> mutex options in zswap_cpu_comp_prepare(). Yosry, please let me know if
> the per-cpu memory cost of this proposed change is acceptable (IAA:
> 64.8KB, Software compressors: 8.2KB). On the positive side, I believe
> restarting reclaim on a CPU after it has been through an offline-online
> transition, will be much faster by not deleting the acomp_ctx resources
> when the CPU gets offlined.
> 5) Use of lockdep assertions rather than comments for internal locking
> rules, as per Yosry's suggestion.
> 6) No specific references to IAA in zswap.c, as suggested by Yosry.
> 7) Explored various solutions other than the v6 zswap_store_folio()
> implementation, to fix the zstd regression seen in v5, to attempt to
> unify common code paths, and to allocate smaller arrays for the zswap
> entries on the stack. All these options were found to cause usemem30
> latency regression with zstd. The v6 version of zswap_store_folio() is
> the only implementation that does not cause zstd regression, confirmed
> by 10 consecutive runs, each giving quite consistent latency
> numbers. Hence, the v6 implementation is carried forward to v7, with
> changes for branching for batching vs. sequential compression API
> calls.
>
>
> Changes since v5:
> =================
> 1) Rebased to mm-unstable as of 2-1-2025, commit 7de6fd8ab650.
>
> Several improvements, regression fixes and bug fixes, based on Yosry's
> v5 comments (Thanks Yosry!):
>
> 2) Fix for zstd performance regression in v5.
> 3) Performance debug and fix for marginal improvements with IAA batching
> vs. sequential.
> 4) Performance testing data compares IAA with and without batching, instead
> of IAA batching against zstd.
> 5) Commit logs/zswap comments not mentioning crypto_acomp
> implementation details.
> 6) Delete the pr_info_once() when batching resources are allocated in
> zswap_cpu_comp_prepare().
> 7) Use kcalloc_node() for the multiple acomp_ctx buffers/reqs in
> zswap_cpu_comp_prepare().
> 8) Simplify and consolidate error handling cleanup code in
> zswap_cpu_comp_prepare().
> 9) Introduce zswap_compress_folio() in a separate patch.
> 10) Bug fix in zswap_store_folio() when xa_store() failure can cause all
> compressed objects and entries to be freed, and UAF when zswap_store()
> tries to free the entries that were already added to the xarray prior
> to the failure.
> 11) Deleting compressed_bytes/bytes. zswap_store_folio() also comprehends
> the recent fixes in commit bf5eaaaf7941 ("mm/zswap: fix inconsistency
> when zswap_store_page() fails") by Hyeonggon Yoo.
>
> iaa_crypto improvements/fixes/changes:
>
> 12) Enables asynchronous mode and makes it the default. With commit
> 4ebd9a5ca478 ("crypto: iaa - Fix IAA disabling that occurs when
> sync_mode is set to 'async'"), async mode was previously just sync. We
> now have true async support.
> 13) Change idxd descriptor allocations from blocking to non-blocking with
> timeouts, and mitigations for compress/decompress ops that fail to
> obtain a descriptor. This is a fix for tasks blocked errors seen in
> configurations where 30+ cores are running workloads under high memory
> pressure, and sending comps/decomps to 1 IAA device.
> 14) Fixes a bug with unprotected access of "deflate_generic_tfm" in
> deflate_generic_decompress(), which can cause data corruption and
> zswap_decompress() kernel crash.
> 15) zswap uses crypto_acomp_batch_compress() with async polling instead of
> request chaining for slightly better latency. However, the request
> chaining framework itself is unchanged, preserved from v5.
>
>
> Changes since v4:
> =================
> 1) Rebased to mm-unstable as of 12-20-2024, commit 5555a83c82d6.
> 2) Added acomp request chaining, as suggested by Herbert. Thanks Herbert!
> 3) Implemented IAA compress batching using request chaining.
> 4) zswap_store() batching simplifications suggested by Chengming, Yosry and
> Nhat, thanks to all!
> - New zswap_compress_folio() that is called by zswap_store().
> - Move the loop over folio's pages out of zswap_store() and into a
> zswap_store_folio() that stores all pages.
> - Allocate all zswap entries for the folio upfront.
> - Added zswap_batch_compress().
> - Branch to call zswap_compress() or zswap_batch_compress() inside
> zswap_compress_folio().
> - All iterations over pages kept in same function level.
> - No helpers other than the newly added zswap_store_folio() and
> zswap_compress_folio().
>
>
> Changes since v3:
> =================
> 1) Rebased to mm-unstable as of 11-18-2024, commit 5a7056135bb6.
> 2) Major re-write of iaa_crypto driver's mapping of IAA devices to cores,
> based on packages instead of NUMA nodes.
> 3) Added acomp_has_async_batching() API to crypto acomp, that allows
> zswap/zram to query if a crypto_acomp has registered batch_compress and
> batch_decompress interfaces.
> 4) Clear the poll bits on the acomp_reqs passed to
> iaa_comp_a[de]compress_batch() so that a module like zswap can be
> confident about the acomp_reqs[0] not having the poll bit set before
> calling the fully synchronous API crypto_acomp_[de]compress().
> Herbert, I would appreciate it if you could review changes 2-4 in patches
> 1-8 of v4. I did not want to introduce too many iaa_crypto changes in
> v4, given that patch 7 is already making a major change. I plan to work
> on incorporating the request chaining using the ahash interface in v5
> (I need to understand the basic crypto ahash better). Thanks Herbert!
> 5) Incorporated Johannes' suggestion to not have a sysctl to enable
> compress batching.
> 6) Incorporated Yosry's suggestion to allocate batching resources in the
> cpu hotplug onlining code, since there is no longer a sysctl to control
> batching. Thanks Yosry!
> 7) Incorporated Johannes' suggestions related to making the overall
> sequence of events between zswap_store() and zswap_batch_store() as
> similar as possible for readability and control flow, better naming of
> procedures, avoiding forward declarations, not inlining error path
> procedures, deleting zswap internal details from zswap.h, etc. Thanks
> Johannes, really appreciate the direction!
> I have tried to explain the minimal future-proofing in terms of the
> zswap_batch_store() signature and the definition of "struct
> zswap_batch_store_sub_batch" in the comments for this struct. I hope the
> new code explains the control flow a bit better.
>
>
> Changes since v2:
> =================
> 1) Rebased to mm-unstable as of 11-5-2024, commit 7994b7ea6ac8.
> 2) Fixed an issue in zswap_create_acomp_ctx() with checking for NULL
> returned by kmalloc_node() for acomp_ctx->buffers and for
> acomp_ctx->reqs.
> 3) Fixed a bug in zswap_pool_can_batch() for returning true if
> pool->can_batch_comp is found to be equal to BATCH_COMP_ENABLED, and
> if the per-cpu acomp_batch_ctx tests true for batching resources having
> been allocated on this cpu. Also, changed from per_cpu_ptr() to
> raw_cpu_ptr().
> 4) Incorporated the zswap_store_propagate_errors() compilation warning fix
> suggested by Dan Carpenter. Thanks Dan!
> 5) Replaced the references to SWAP_CRYPTO_SUB_BATCH_SIZE in comments in
> zswap.h, with SWAP_CRYPTO_BATCH_SIZE.
>
> Changes since v1:
> =================
> 1) Rebased to mm-unstable as of 11-1-2024, commit 5c4cf96cd702.
> 2) Incorporated Herbert's suggestions to use an acomp_req flag to indicate
> async/poll mode, and to encapsulate the polling functionality in the
> iaa_crypto driver. Thanks Herbert!
> 3) Incorporated Herbert's and Yosry's suggestions to implement the batching
> API in iaa_crypto and to make its use seamless from zswap's
> perspective. Thanks Herbert and Yosry!
> 4) Incorporated Yosry's suggestion to make it more convenient for the user
> to enable compress batching, while minimizing the memory footprint
> cost. Thanks Yosry!
> 5) Incorporated Yosry's suggestion to de-couple the shrink_folio_list()
> reclaim batching patch from this series, since it requires a broader
> discussion.
>
>
> IAA configuration script "enable_iaa.sh":
> =========================================
>
> Acknowledgements: Binuraj Ravindran and Rakib Al-Fahad.
>
> Usage:
> ------
>
> ./enable_iaa.sh -d <num_IAAs> -q <num_WQs_per_IAA>
>
>
> #---------------------------------<cut here>----------------------------------
> #!/usr/bin/env bash
> #SPDX-License-Identifier: BSD-3-Clause
> #Copyright (c) 2025, Intel Corporation
> #Description: Configure IAA devices
>
> VERIFY_COMPRESS_PATH="/sys/bus/dsa/drivers/crypto/verify_compress"
>
> iax_dev_id="0cfe"
> num_iaa=$(lspci -d:${iax_dev_id} | wc -l)
> sockets=$(lscpu | grep Socket | awk '{print $2}')
> echo "Found ${num_iaa} instances in ${sockets} sockets(s)"
>
> # The same number of devices will be configured in each socket, if there
> # are more than one socket.
> # Normalize with respect to the number of sockets.
> device_num_per_socket=$(( num_iaa/sockets ))
> num_iaa_per_socket=$(( num_iaa / sockets ))
>
> iaa_wqs=2
> verbose=0
> iaa_engines=8
> mode="dedicated"
> wq_type="kernel"
> iaa_crypto_mode="async"
> verify_compress=0
>
>
> # Function to handle errors
> handle_error() {
> echo "Error: $1"
> exit 1
> }
>
> # Process arguments
>
> while getopts "d:hm:q:vD" opt; do
> case $opt in
> d)
> device_num_per_socket=$OPTARG
> ;;
> m)
> iaa_crypto_mode=$OPTARG
> ;;
> q)
> iaa_wqs=$OPTARG
> ;;
> D)
> verbose=1
> ;;
> v)
> verify_compress=1
> ;;
> h)
> echo "Usage: $0 [-d <device_count>][-q <wq_per_device>][-v]"
> echo " -d - number of devices"
> echo " -q - number of WQs per device"
> echo " -v - verbose mode"
> echo " -h - help"
> exit
> ;;
> \?)
> echo "Invalid option: -$OPTARG" >&2
> exit
> ;;
> esac
> done
>
> LOG="configure_iaa.log"
>
> # Update wq_size based on number of wqs
> wq_size=$(( 128 / iaa_wqs ))
>
> # Take care of the enumeration, if DSA is enabled.
> dsa=`lspci | grep -c 0b25`
> # set first,step counters to correctly enumerate iax devices based on
> # whether running on guest or host with or without dsa
> first=0
> step=1
> [[ $dsa -gt 0 && -d /sys/bus/dsa/devices/dsa0 ]] && first=1 && step=2
> echo "first index: ${first}, step: ${step}"
>
>
> #
> # Switch to software compressors and disable IAAs to have a clean start
> #
> COMPRESSOR=/sys/module/zswap/parameters/compressor
> last_comp=`cat ${COMPRESSOR}`
> echo lzo > ${COMPRESSOR}
>
> echo "Disable IAA devices before configuring"
>
> for ((i = ${first}; i < ${step} * ${num_iaa}; i += ${step})); do
> for ((j = 0; j < ${iaa_wqs}; j += 1)); do
> cmd="accel-config disable-wq iax${i}/wq${i}.${j} >& /dev/null"
> [[ $verbose == 1 ]] && echo $cmd; eval $cmd
> done
> cmd="accel-config disable-device iax${i} >& /dev/null"
> [[ $verbose == 1 ]] && echo $cmd; eval $cmd
> done
>
> rmmod iaa_crypto
> modprobe iaa_crypto
>
> # apply crypto parameters
> echo $verify_compress > ${VERIFY_COMPRESS_PATH} || handle_error "did not change verify_compress"
> # Note: This is a temporary solution during the kernel transition.
> if [ -f /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa ]; then
> echo 1 > /sys/bus/dsa/drivers/crypto/g_comp_wqs_per_iaa || handle_error "did not set g_comp_wqs_per_iaa"
> elif [ -f /sys/bus/dsa/drivers/crypto/g_wqs_per_iaa ]; then
> echo 1 > /sys/bus/dsa/drivers/crypto/g_wqs_per_iaa || handle_error "did not set g_wqs_per_iaa"
> fi
> if [ -f /sys/bus/dsa/drivers/crypto/g_consec_descs_per_gwq ]; then
> echo 1 > /sys/bus/dsa/drivers/crypto/g_consec_descs_per_gwq || handle_error "did not set g_consec_descs_per_gwq"
> fi
> echo ${iaa_crypto_mode} > /sys/bus/dsa/drivers/crypto/sync_mode || handle_error "could not set sync_mode"
>
>
>
> echo "Configuring ${device_num_per_socket} device(s) out of
> $num_iaa_per_socket per socket"
> if [ "${device_num_per_socket}" -le "${num_iaa_per_socket}" ]; then
> echo "Configuring all devices"
> start=${first}
> end=$(( ${step} * ${device_num_per_socket} ))
> else
> echo "ERROR: Not enough devices"
> exit
> fi
>
>
> #
> # enable all iax devices and wqs
> #
> for (( socket = 0; socket < ${sockets}; socket += 1 )); do
> for ((i = ${start}; i < ${end}; i += ${step})); do
>
> echo "Configuring iaa$i on socket ${socket}"
>
> for ((j = 0; j < ${iaa_engines}; j += 1)); do
> cmd="accel-config config-engine iax${i}/engine${i}.${j} --group-id=0"
> [[ $verbose == 1 ]] && echo $cmd; eval $cmd
> done
>
> # Config WQs
> for ((j = 0; j < ${iaa_wqs}; j += 1)); do
> # Config WQ: group 0, size=${wq_size}, priority=10, mode=${mode},
> # type=${wq_type}, name=iaa_crypto${i}${j}, driver=crypto
> cmd="accel-config config-wq iax${i}/wq${i}.${j} -g 0 -s ${wq_size} -p 10 -m ${mode} -y ${wq_type} -n iaa_crypto${i}${j} -d crypto"
> [[ $verbose == 1 ]] && echo $cmd; eval $cmd
> done
>
> # Enable Device and WQs
> cmd="accel-config enable-device iax${i}"
> [[ $verbose == 1 ]] && echo $cmd; eval $cmd
>
> for ((j = 0; j < ${iaa_wqs}; j += 1)); do
> cmd="accel-config enable-wq iax${i}/wq${i}.${j}"
> [[ $verbose == 1 ]] && echo $cmd; eval $cmd
> done
>
> done
> start=$(( start + ${step} * ${num_iaa_per_socket} ))
> end=$(( start + (${step} * ${device_num_per_socket}) ))
> done
>
> # Restore the last compressor
> echo "$last_comp" > ${COMPRESSOR}
>
> # Check if the configuration is correct
> echo "Configured IAA devices:"
> accel-config list | grep iax
>
> #---------------------------------<cut here>----------------------------------
>
>
> I would greatly appreciate code review comments for the iaa_crypto driver
> and mm patches included in this series!
>
> Thanks,
> Kanchana
>
>
>
> Kanchana P Sridhar (22):
> crypto: iaa - Reorganize the iaa_crypto driver code.
> crypto: iaa - New architecture for IAA device WQ comp/decomp usage &
> core mapping.
> crypto: iaa - Simplify, consistency of function parameters, minor
> stats bug fix.
> crypto: iaa - Descriptor allocation timeouts with mitigations.
> crypto: iaa - iaa_wq uses percpu_refs for get/put reference counting.
> crypto: iaa - Simplify the code flow in iaa_compress() and
> iaa_decompress().
> crypto: iaa - Refactor hardware descriptor setup into separate
> procedures.
> crypto: iaa - Simplified, efficient job submissions for non-irq mode.
> crypto: iaa - Deprecate exporting add/remove IAA compression modes.
> crypto: iaa - Expect a single scatterlist for a [de]compress request's
> src/dst.
> crypto: iaa - Rearchitect iaa_crypto to have clean interfaces with
> crypto_acomp
> crypto: acomp - Define a unit_size in struct acomp_req to enable
> batching.
> crypto: iaa - IAA Batching for parallel compressions/decompressions.
> crypto: iaa - Enable async mode and make it the default.
> crypto: iaa - Disable iaa_verify_compress by default.
> crypto: iaa - Submit the two largest source buffers first in
> decompress batching.
> crypto: iaa - Add deflate-iaa-dynamic compression mode.
> crypto: acomp - Add crypto_acomp_batch_size() to get an algorithm's
> batch-size.
> mm: zswap: Per-CPU acomp_ctx resources exist from pool creation to
> deletion.
> mm: zswap: Consistently use IS_ERR_OR_NULL() to check acomp_ctx
> resources.
> mm: zswap: zswap_store() will process a large folio in batches.
> mm: zswap: Batched zswap_compress() with compress batching of large
> folios.
>
> .../driver-api/crypto/iaa/iaa-crypto.rst | 168 +-
> crypto/acompress.c | 14 +
> crypto/testmgr.c | 10 +
> crypto/testmgr.h | 74 +
> drivers/crypto/intel/iaa/Makefile | 4 +-
> drivers/crypto/intel/iaa/iaa_crypto.h | 87 +-
> .../intel/iaa/iaa_crypto_comp_dynamic.c | 22 +
> drivers/crypto/intel/iaa/iaa_crypto_main.c | 2836 ++++++++++++-----
> drivers/crypto/intel/iaa/iaa_crypto_stats.c | 8 +
> drivers/crypto/intel/iaa/iaa_crypto_stats.h | 2 +
> include/crypto/acompress.h | 48 +
> include/crypto/internal/acompress.h | 3 +
> mm/zswap.c | 700 ++--
> 13 files changed, 2905 insertions(+), 1071 deletions(-)
> create mode 100644 drivers/crypto/intel/iaa/iaa_crypto_comp_dynamic.c
>
> --
> 2.27.0