* [RFC PATCH v1 00/13] zswap IAA compress batching
@ 2024-10-18  6:40 Kanchana P Sridhar
  2024-10-18  6:40 ` [RFC PATCH v1 01/13] crypto: acomp - Add a poll() operation to acomp_alg and acomp_req Kanchana P Sridhar
                   ` (13 more replies)
  0 siblings, 14 replies; 36+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18  6:40 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi, viro, brauner, jack, mcgrof,
	kees, joel.granados, bfoster, willy, linux-fsdevel
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar


IAA Compression Batching:
=========================

This RFC patch-series introduces the use of the Intel In-Memory Analytics
Accelerator (IAA) for parallel compression of pages in a folio, and for
compress batching during reclaim of hybrid, any-order batches of folios in
shrink_folio_list().

The patch-series is organized as follows:

 1) iaa_crypto driver enablers for batching: Relevant patches are tagged
    with "crypto:" in the subject:

    a) async poll crypto_acomp interface without interrupts.
    b) crypto testmgr acomp poll support.
    c) Modifying the default sync_mode to "async" and disabling
       verify_compress by default, to make it easier for users to run IAA
       for comparison with software compressors.
    d) Changing the cpu-to-iaa mappings to more evenly balance cores to IAA
       devices.
    e) Addition of a "global_wq" per IAA, which can be used as a global
       resource for the socket. If the user configures 2 WQs per IAA device,
       the driver will distribute compress jobs from all cores on the
       socket to the "global_wqs" of all the IAA devices on that socket, in
       a round-robin manner. This can improve compression throughput for
       workloads that see a lot of swapout activity.

 2) Migrating zswap to use async poll in zswap_compress()/decompress().
 3) A centralized batch compression API that can be used by swap modules.
 4) IAA compress batching within large folio zswap stores.
 5) IAA compress batching of any-order hybrid folios in
    shrink_folio_list(). The newly added "sysctl vm.compress-batchsize"
    parameter can be used to configure the number of folios in [1, 32] to
    be reclaimed using compress batching.

IAA compress batching can be enabled only on platforms that have IAA, by
setting this config variable:

 CONFIG_ZSWAP_STORE_BATCHING_ENABLED="y"
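
Once the kernel is built with this enabled, the reclaim batch size can be
tuned at runtime via the sysctl added in this series; a usage sketch:

 sysctl vm.compress-batchsize=32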
 
The performance testing data with usemem 30 instances shows throughput
gains of up to 40%, elapsed time reduction of up to 22% and sys time
reduction of up to 30% with IAA compression batching.

Our internal validation of IAA compress/decompress batching in highly
contended Sapphire Rapids server setups, with workloads running on 72 cores
for ~25 minutes under stringent memory limit constraints, has shown up to
50% reduction in sys time and 3.5% reduction in workload run time as
compared to software compressors.


System setup for testing:
=========================
Testing was done with mm-unstable as of 10-16-2024, commit 817952b8be34,
with and without this patch-series.
Data was gathered on an Intel Sapphire Rapids server, dual-socket 56 cores
per socket, 4 IAA devices per socket, 503 GiB RAM and 525G SSD disk
partition swap. Core frequency was fixed at 2500MHz.

The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
processes were run, each allocating and writing 10G of memory, and sleeping
for 10 sec before exiting:

usemem --init-time -w -O -s 10 -n 30 10g

Other kernel configuration parameters:

    zswap compressor  : deflate-iaa
    zswap allocator   : zsmalloc
    vm.page-cluster   : 2,4
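
For example, these can be set as follows (a setup sketch using the zswap
module parameters and the page-cluster sysctl):

    echo deflate-iaa > /sys/module/zswap/parameters/compressor
    echo zsmalloc > /sys/module/zswap/parameters/zpool
    sysctl vm.page-cluster=2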

IAA "compression verification" is disabled and the async poll acomp
interface is used in the iaa_crypto driver (the defaults with this
series).
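
Both defaults can be confirmed through the driver attributes used in this
series (expected values shown as comments, assuming this series' defaults):

  cat /sys/bus/dsa/drivers/crypto/sync_mode        # async
  cat /sys/bus/dsa/drivers/crypto/verify_compress  # 0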


Performance testing (usemem30):
===============================

 4K folios: deflate-iaa:
 =======================

 -------------------------------------------------------------------------------
                mm-unstable-10-16-2024  shrink_folio_list()  shrink_folio_list()
                                         batching of folios   batching of folios
 -------------------------------------------------------------------------------
 zswap compressor          deflate-iaa          deflate-iaa          deflate-iaa
 vm.compress-batchsize             n/a                    1                   32
 vm.page-cluster                     2                    2                    2
 -------------------------------------------------------------------------------
 Total throughput            4,470,466            5,770,824            6,363,045
           (KB/s)
 Average throughput            149,015              192,360              212,101
           (KB/s)
 elapsed time                   119.24               100.96                92.99
        (sec)
 sys time (sec)               2,819.29             2,168.08             1,970.79

 -------------------------------------------------------------------------------
 memcg_high                    668,185              646,357              613,421
 memcg_swap_fail                     0                    0                    0
 zswpout                    62,991,796           58,275,673           53,070,201
 zswpin                            431                  415                  396
 pswpout                             0                    0                    0
 pswpin                              0                    0                    0
 thp_swpout                          0                    0                    0
 thp_swpout_fallback                 0                    0                    0
 pgmajfault                      3,137                3,085                3,440
 swap_ra                            99                  100                   95
 swap_ra_hit                        42                   44                   45
 -------------------------------------------------------------------------------


 16k/32k/64k folios: deflate-iaa:
 ================================
 All three large folio sizes 16k/32k/64k were enabled by setting them to
 "always".

 -------------------------------------------------------------------------------
                mm-unstable-  zswap_store()      + shrink_folio_list()
                  10-16-2024    batching of         batching of folios
                                   pages in
                               large folios
 -------------------------------------------------------------------------------
 zswap compr     deflate-iaa     deflate-iaa          deflate-iaa
 vm.compress-            n/a             n/a         4          8             16
 batchsize
 vm.page-                  2               2         2          2              2
  cluster
 -------------------------------------------------------------------------------
 Total throughput   7,182,198   8,448,994    8,584,728    8,729,643    8,775,944
           (KB/s)             
 Avg throughput       239,406     281,633      286,157      290,988      292,531
         (KB/s)               
 elapsed time           85.04       77.84        77.03        75.18        74.98
         (sec)                
 sys time (sec)      1,730.77    1,527.40     1,528.52     1,473.76     1,465.97

 -------------------------------------------------------------------------------
 memcg_high           648,125     694,188      696,004      699,728      724,887
 memcg_swap_fail        1,550       2,540        1,627        1,577        1,517
 zswpout           57,606,876  56,624,450   56,125,082    55,999,42   57,352,204
 zswpin                   421         406          422          400          437
 pswpout                    0           0            0            0            0
 pswpin                     0           0            0            0            0
 thp_swpout                 0           0            0            0            0
 thp_swpout_fallback        0           0            0            0            0
 16kB-mthp_swpout_          0           0            0            0            0
          fallback
 32kB-mthp_swpout_          0           0            0            0            0
          fallback
 64kB-mthp_swpout_      1,550       2,539        1,627        1,577        1,517
          fallback
 pgmajfault             3,102       3,126        3,473        3,454        3,134
 swap_ra                  107         144          109          124          181
 swap_ra_hit               51          88           45           66          107
 ZSWPOUT-16kB               2           3            4            4            3
 ZSWPOUT-32kB               0           2            1            1            0
 ZSWPOUT-64kB       3,598,889   3,536,556    3,506,134    3,498,324    3,582,921
 SWPOUT-16kB                0           0            0            0            0
 SWPOUT-32kB                0           0            0            0            0
 SWPOUT-64kB                0           0            0            0            0
 -------------------------------------------------------------------------------


 2M folios: deflate-iaa:
 =======================

 -------------------------------------------------------------------------------
                   mm-unstable-10-16-2024    zswap_store() batching of pages
                                                      in pmd-mappable folios
 -------------------------------------------------------------------------------
 zswap compressor             deflate-iaa                deflate-iaa
 vm.compress-batchsize                n/a                        n/a
 vm.page-cluster                        2                          2
 -------------------------------------------------------------------------------
 Total throughput               7,444,592                 8,916,349     
           (KB/s)                                                  
 Average throughput               248,153                   297,211     
           (KB/s)                                                  
 elapsed time                       86.29                     73.44     
        (sec)                                                      
 sys time (sec)                  1,833.21                  1,418.58     
                                                                   
 -------------------------------------------------------------------------------
 memcg_high                        81,786                    89,905     
 memcg_swap_fail                       82                       395     
 zswpout                       58,874,092                57,721,884     
 zswpin                               422                       458     
 pswpout                                0                         0     
 pswpin                                 0                         0     
 thp_swpout                             0                         0     
 thp_swpout_fallback                   82                       394     
 pgmajfault                        14,864                    21,544     
 swap_ra                           34,953                    53,751     
 swap_ra_hit                       34,895                    53,660     
 ZSWPOUT-2048kB                   114,815                   112,269     
 SWPOUT-2048kB                          0                         0     
 -------------------------------------------------------------------------------

Since 4K folios account for ~0.4% of all zswapouts when pmd-mappable folios
are enabled for usemem30, we cannot expect much improvement from reclaim
batching.


Performance testing (Kernel compilation):
=========================================

As mentioned earlier, workloads that see a lot of swapout activity can
benefit from configuring 2 WQs per IAA device, with compress jobs from all
same-socket cores being distributed to the wq.1 of all IAAs on the socket,
using the "global_wq" developed in this patch-series.

Although this data includes IAA decompress batching, which will be
submitted as a separate RFC patch-series, I am listing it here to quantify
the benefit of distributing compress jobs among all IAAs. The kernel
compilation test with "allmodconfig" demonstrates this well:


 4K folios: deflate-iaa: kernel compilation to quantify crypto patches
 =====================================================================


 ------------------------------------------------------------------------------
                   IAA shrink_folio_list() compress batching and
                       swapin_readahead() decompress batching

                                      1WQ      2WQ (distribute compress jobs)

                        1 local WQ (wq.0)    1 local WQ (wq.0) +
                                  per IAA    1 global WQ (wq.1) per IAA
                        
 ------------------------------------------------------------------------------
 zswap compressor             deflate-iaa         deflate-iaa
 vm.compress-batchsize                 32                  32
 vm.page-cluster                        4                   4
 ------------------------------------------------------------------------------
 real_sec                          746.77              745.42  
 user_sec                       15,732.66           15,738.85
 sys_sec                         5,384.14            5,247.86
 Max_Res_Set_Size_KB            1,874,432           1,872,640

 ------------------------------------------------------------------------------
 zswpout                      101,648,460         104,882,982
 zswpin                        27,418,319          29,428,515
 pswpout                              213                  22
 pswpin                               207                   6
 pgmajfault                    21,896,616          23,629,768
 swap_ra                        6,054,409           6,385,080
 swap_ra_hit                    3,791,628           3,985,141
 ------------------------------------------------------------------------------

The iaa_crypto wq stats will show almost the same number of compress calls
for wq.1 of all IAA devices. wq.0 will handle decompress calls exclusively.
We see a 2.5% reduction in sys time by distributing compress jobs among all
IAA devices on the socket.

I would greatly appreciate code review comments for the iaa_crypto driver
and mm patches included in this series!

Thanks,
Kanchana



Kanchana P Sridhar (13):
  crypto: acomp - Add a poll() operation to acomp_alg and acomp_req
  crypto: iaa - Add support for irq-less crypto async interface
  crypto: testmgr - Add crypto testmgr acomp poll support.
  mm: zswap: zswap_compress()/decompress() can submit, then poll an
    acomp_req.
  crypto: iaa - Make async mode the default.
  crypto: iaa - Disable iaa_verify_compress by default.
  crypto: iaa - Change cpu-to-iaa mappings to evenly balance cores to
    IAAs.
  crypto: iaa - Distribute compress jobs to all IAA devices on a NUMA
    node.
  mm: zswap: Config variable to enable compress batching in
    zswap_store().
  mm: zswap: Create multiple reqs/buffers in crypto_acomp_ctx if
    platform has IAA.
  mm: swap: Add IAA batch compression API
    swap_crypto_acomp_compress_batch().
  mm: zswap: Compress batching with Intel IAA in zswap_store() of large
    folios.
  mm: vmscan, swap, zswap: Compress batching of folios in
    shrink_folio_list().

 crypto/acompress.c                         |   1 +
 crypto/testmgr.c                           |  70 +-
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 467 +++++++++++--
 include/crypto/acompress.h                 |  18 +
 include/crypto/internal/acompress.h        |   1 +
 include/linux/fs.h                         |   2 +
 include/linux/mm.h                         |   8 +
 include/linux/writeback.h                  |   5 +
 include/linux/zswap.h                      | 106 +++
 kernel/sysctl.c                            |   9 +
 mm/Kconfig                                 |  12 +
 mm/page_io.c                               | 152 +++-
 mm/swap.c                                  |  15 +
 mm/swap.h                                  |  96 +++
 mm/swap_state.c                            | 115 +++
 mm/vmscan.c                                | 154 +++-
 mm/zswap.c                                 | 771 +++++++++++++++++++--
 17 files changed, 1870 insertions(+), 132 deletions(-)


base-commit: 817952b8be34aad40e07f6832fb9d1fc08961550
-- 
2.27.0




* [RFC PATCH v1 01/13] crypto: acomp - Add a poll() operation to acomp_alg and acomp_req
  2024-10-18  6:40 [RFC PATCH v1 00/13] zswap IAA compress batching Kanchana P Sridhar
@ 2024-10-18  6:40 ` Kanchana P Sridhar
  2024-10-18  7:55   ` Herbert Xu
  2024-10-18  6:40 ` [RFC PATCH v1 02/13] crypto: iaa - Add support for irq-less crypto async interface Kanchana P Sridhar
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18  6:40 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi, viro, brauner, jack, mcgrof,
	kees, joel.granados, bfoster, willy, linux-fsdevel
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

For async compress/decompress, provide a way for the caller to poll
for compress/decompress completion, rather than wait for an interrupt
to signal completion.

Callers can submit a compress/decompress request using
crypto_acomp_compress()/crypto_acomp_decompress() and, rather than wait
on a completion, call crypto_acomp_poll() to check for completion.
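
A minimal sketch of the resulting calling pattern (error handling elided;
the -EINPROGRESS/-EAGAIN semantics are as defined by this patch):

	ret = crypto_acomp_compress(req);
	if (ret == -EINPROGRESS) {
		do {
			ret = crypto_acomp_poll(req);
			if (ret && ret != -EAGAIN)
				break;	/* device reported an error */
		} while (ret);
	}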

This is useful for hardware accelerators where the overhead of interrupts
and completion waits is too expensive. The compress/decompress hw
operations typically complete very quickly; in the vast majority of cases,
interrupt handling and waiting for completions simply add unnecessary
latency and cancel the gains of using the hw acceleration.

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 crypto/acompress.c                  |  1 +
 include/crypto/acompress.h          | 18 ++++++++++++++++++
 include/crypto/internal/acompress.h |  1 +
 3 files changed, 20 insertions(+)

diff --git a/crypto/acompress.c b/crypto/acompress.c
index 6fdf0ff9f3c0..00ec7faa2714 100644
--- a/crypto/acompress.c
+++ b/crypto/acompress.c
@@ -71,6 +71,7 @@ static int crypto_acomp_init_tfm(struct crypto_tfm *tfm)
 
 	acomp->compress = alg->compress;
 	acomp->decompress = alg->decompress;
+	acomp->poll = alg->poll;
 	acomp->dst_free = alg->dst_free;
 	acomp->reqsize = alg->reqsize;
 
diff --git a/include/crypto/acompress.h b/include/crypto/acompress.h
index 54937b615239..65b5de30c8b1 100644
--- a/include/crypto/acompress.h
+++ b/include/crypto/acompress.h
@@ -51,6 +51,7 @@ struct acomp_req {
 struct crypto_acomp {
 	int (*compress)(struct acomp_req *req);
 	int (*decompress)(struct acomp_req *req);
+	int (*poll)(struct acomp_req *req);
 	void (*dst_free)(struct scatterlist *dst);
 	unsigned int reqsize;
 	struct crypto_tfm base;
@@ -265,4 +266,21 @@ static inline int crypto_acomp_decompress(struct acomp_req *req)
 	return crypto_acomp_reqtfm(req)->decompress(req);
 }
 
+/**
+ * crypto_acomp_poll() - Invoke asynchronous poll operation
+ *
+ * Function invokes the asynchronous poll operation
+ *
+ * @req:	asynchronous request
+ *
+ * Return: zero on poll completion, -EAGAIN if not complete, or
+ *         error code in case of error
+ */
+static inline int crypto_acomp_poll(struct acomp_req *req)
+{
+	struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
+
+	return tfm->poll(req);
+}
+
 #endif
diff --git a/include/crypto/internal/acompress.h b/include/crypto/internal/acompress.h
index 8831edaafc05..fbf5f6c6eeb6 100644
--- a/include/crypto/internal/acompress.h
+++ b/include/crypto/internal/acompress.h
@@ -37,6 +37,7 @@
 struct acomp_alg {
 	int (*compress)(struct acomp_req *req);
 	int (*decompress)(struct acomp_req *req);
+	int (*poll)(struct acomp_req *req);
 	void (*dst_free)(struct scatterlist *dst);
 	int (*init)(struct crypto_acomp *tfm);
 	void (*exit)(struct crypto_acomp *tfm);
-- 
2.27.0




* [RFC PATCH v1 02/13] crypto: iaa - Add support for irq-less crypto async interface
  2024-10-18  6:40 [RFC PATCH v1 00/13] zswap IAA compress batching Kanchana P Sridhar
  2024-10-18  6:40 ` [RFC PATCH v1 01/13] crypto: acomp - Add a poll() operation to acomp_alg and acomp_req Kanchana P Sridhar
@ 2024-10-18  6:40 ` Kanchana P Sridhar
  2024-10-18  6:40 ` [RFC PATCH v1 03/13] crypto: testmgr - Add crypto testmgr acomp poll support Kanchana P Sridhar
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18  6:40 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi, viro, brauner, jack, mcgrof,
	kees, joel.granados, bfoster, willy, linux-fsdevel
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Add a crypto acomp poll() implementation so that callers can use true
async iaa compress/decompress without interrupts.

To use this mode with zswap, select the 'async' iaa_crypto sync_mode:

  echo async > /sys/bus/dsa/drivers/crypto/sync_mode

This will cause the iaa_crypto driver to register its acomp_alg
implementation with a non-NULL poll() member; callers such as zswap can
check for its presence and use true async mode if found.
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 74 ++++++++++++++++++++++
 1 file changed, 74 insertions(+)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 237f87000070..6a8577ac1330 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -1788,6 +1788,74 @@ static void compression_ctx_init(struct iaa_compression_ctx *ctx)
 	ctx->use_irq = use_irq;
 }
 
+static int iaa_comp_poll(struct acomp_req *req)
+{
+	struct idxd_desc *idxd_desc;
+	struct idxd_device *idxd;
+	struct iaa_wq *iaa_wq;
+	struct pci_dev *pdev;
+	struct device *dev;
+	struct idxd_wq *wq;
+	bool compress_op;
+	int ret;
+
+	idxd_desc = req->base.data;
+	if (!idxd_desc)
+		return -EAGAIN;
+
+	compress_op = (idxd_desc->iax_hw->opcode == IAX_OPCODE_COMPRESS);
+	wq = idxd_desc->wq;
+	iaa_wq = idxd_wq_get_private(wq);
+	idxd = iaa_wq->iaa_device->idxd;
+	pdev = idxd->pdev;
+	dev = &pdev->dev;
+
+	ret = check_completion(dev, idxd_desc->iax_completion, true, true);
+	if (ret == -EAGAIN)
+		return ret;
+	if (ret)
+		goto out;
+
+	req->dlen = idxd_desc->iax_completion->output_size;
+
+	/* Update stats */
+	if (compress_op) {
+		update_total_comp_bytes_out(req->dlen);
+		update_wq_comp_bytes(wq, req->dlen);
+	} else {
+		update_total_decomp_bytes_in(req->slen);
+		update_wq_decomp_bytes(wq, req->slen);
+	}
+
+	if (iaa_verify_compress && compress_op) {
+		struct crypto_tfm *tfm = req->base.tfm;
+		dma_addr_t src_addr, dst_addr;
+		u32 compression_crc;
+
+		compression_crc = idxd_desc->iax_completion->crc;
+
+		dma_sync_sg_for_device(dev, req->dst, 1, DMA_FROM_DEVICE);
+		dma_sync_sg_for_device(dev, req->src, 1, DMA_TO_DEVICE);
+
+		src_addr = sg_dma_address(req->src);
+		dst_addr = sg_dma_address(req->dst);
+
+		ret = iaa_compress_verify(tfm, req, wq, src_addr, req->slen,
+					  dst_addr, &req->dlen, compression_crc);
+	}
+out:
+	/* caller doesn't call crypto_wait_req, so no acomp_request_complete() */
+
+	dma_unmap_sg(dev, req->dst, sg_nents(req->dst), DMA_FROM_DEVICE);
+	dma_unmap_sg(dev, req->src, sg_nents(req->src), DMA_TO_DEVICE);
+
+	idxd_free_desc(idxd_desc->wq, idxd_desc);
+
+	dev_dbg(dev, "%s: returning ret=%d\n", __func__, ret);
+
+	return ret;
+}
+
 static int iaa_comp_init_fixed(struct crypto_acomp *acomp_tfm)
 {
 	struct crypto_tfm *tfm = crypto_acomp_tfm(acomp_tfm);
@@ -1813,6 +1881,7 @@ static struct acomp_alg iaa_acomp_fixed_deflate = {
 	.compress		= iaa_comp_acompress,
 	.decompress		= iaa_comp_adecompress,
 	.dst_free               = dst_free,
+	.poll			= iaa_comp_poll,
 	.base			= {
 		.cra_name		= "deflate",
 		.cra_driver_name	= "deflate-iaa",
@@ -1827,6 +1896,11 @@ static int iaa_register_compression_device(void)
 {
 	int ret;
 
+	if (async_mode && !use_irq)
+		iaa_acomp_fixed_deflate.poll = iaa_comp_poll;
+	else
+		iaa_acomp_fixed_deflate.poll = NULL;
+
 	ret = crypto_register_acomp(&iaa_acomp_fixed_deflate);
 	if (ret) {
 		pr_err("deflate algorithm acomp fixed registration failed (%d)\n", ret);
-- 
2.27.0




* [RFC PATCH v1 03/13] crypto: testmgr - Add crypto testmgr acomp poll support.
  2024-10-18  6:40 [RFC PATCH v1 00/13] zswap IAA compress batching Kanchana P Sridhar
  2024-10-18  6:40 ` [RFC PATCH v1 01/13] crypto: acomp - Add a poll() operation to acomp_alg and acomp_req Kanchana P Sridhar
  2024-10-18  6:40 ` [RFC PATCH v1 02/13] crypto: iaa - Add support for irq-less crypto async interface Kanchana P Sridhar
@ 2024-10-18  6:40 ` Kanchana P Sridhar
  2024-10-18  6:40 ` [RFC PATCH v1 04/13] mm: zswap: zswap_compress()/decompress() can submit, then poll an acomp_req Kanchana P Sridhar
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18  6:40 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi, viro, brauner, jack, mcgrof,
	kees, joel.granados, bfoster, willy, linux-fsdevel
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch enables the newly added acomp poll API to be exercised in the
crypto test_acomp() calls to compress/decompress, if the acomp registers
a poll method.

Signed-off-by: Andre Glover <andre.glover@intel.com>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 crypto/testmgr.c | 70 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 65 insertions(+), 5 deletions(-)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index ee8da628e9da..54f6f59ae501 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -3482,7 +3482,19 @@ static int test_acomp(struct crypto_acomp *tfm,
 		acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
 					   crypto_req_done, &wait);
 
-		ret = crypto_wait_req(crypto_acomp_compress(req), &wait);
+		if (tfm->poll) {
+			ret = crypto_acomp_compress(req);
+			if (ret == -EINPROGRESS) {
+				do {
+					ret = crypto_acomp_poll(req);
+					if (ret && ret != -EAGAIN)
+						break;
+				} while (ret);
+			}
+		} else {
+			ret = crypto_wait_req(crypto_acomp_compress(req), &wait);
+		}
+
 		if (ret) {
 			pr_err("alg: acomp: compression failed on test %d for %s: ret=%d\n",
 			       i + 1, algo, -ret);
@@ -3498,7 +3510,19 @@ static int test_acomp(struct crypto_acomp *tfm,
 		crypto_init_wait(&wait);
 		acomp_request_set_params(req, &src, &dst, ilen, dlen);
 
-		ret = crypto_wait_req(crypto_acomp_decompress(req), &wait);
+		if (tfm->poll) {
+			ret = crypto_acomp_decompress(req);
+			if (ret == -EINPROGRESS) {
+				do {
+					ret = crypto_acomp_poll(req);
+					if (ret && ret != -EAGAIN)
+						break;
+				} while (ret);
+			}
+		} else {
+			ret = crypto_wait_req(crypto_acomp_decompress(req), &wait);
+		}
+
 		if (ret) {
 			pr_err("alg: acomp: compression failed on test %d for %s: ret=%d\n",
 			       i + 1, algo, -ret);
@@ -3531,7 +3555,19 @@ static int test_acomp(struct crypto_acomp *tfm,
 		sg_init_one(&src, input_vec, ilen);
 		acomp_request_set_params(req, &src, NULL, ilen, 0);
 
-		ret = crypto_wait_req(crypto_acomp_compress(req), &wait);
+		if (tfm->poll) {
+			ret = crypto_acomp_compress(req);
+			if (ret == -EINPROGRESS) {
+				do {
+					ret = crypto_acomp_poll(req);
+					if (ret && ret != -EAGAIN)
+						break;
+				} while (ret);
+			}
+		} else {
+			ret = crypto_wait_req(crypto_acomp_compress(req), &wait);
+		}
+
 		if (ret) {
 			pr_err("alg: acomp: compression failed on NULL dst buffer test %d for %s: ret=%d\n",
 			       i + 1, algo, -ret);
@@ -3574,7 +3610,19 @@ static int test_acomp(struct crypto_acomp *tfm,
 		acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
 					   crypto_req_done, &wait);
 
-		ret = crypto_wait_req(crypto_acomp_decompress(req), &wait);
+		if (tfm->poll) {
+			ret = crypto_acomp_decompress(req);
+			if (ret == -EINPROGRESS) {
+				do {
+					ret = crypto_acomp_poll(req);
+					if (ret && ret != -EAGAIN)
+						break;
+				} while (ret);
+			}
+		} else {
+			ret = crypto_wait_req(crypto_acomp_decompress(req), &wait);
+		}
+
 		if (ret) {
 			pr_err("alg: acomp: decompression failed on test %d for %s: ret=%d\n",
 			       i + 1, algo, -ret);
@@ -3606,7 +3654,19 @@ static int test_acomp(struct crypto_acomp *tfm,
 		crypto_init_wait(&wait);
 		acomp_request_set_params(req, &src, NULL, ilen, 0);
 
-		ret = crypto_wait_req(crypto_acomp_decompress(req), &wait);
+		if (tfm->poll) {
+			ret = crypto_acomp_decompress(req);
+			if (ret == -EINPROGRESS) {
+				do {
+					ret = crypto_acomp_poll(req);
+					if (ret && ret != -EAGAIN)
+						break;
+				} while (ret);
+			}
+		} else {
+			ret = crypto_wait_req(crypto_acomp_decompress(req), &wait);
+		}
+
 		if (ret) {
 			pr_err("alg: acomp: decompression failed on NULL dst buffer test %d for %s: ret=%d\n",
 			       i + 1, algo, -ret);
-- 
2.27.0




* [RFC PATCH v1 04/13] mm: zswap: zswap_compress()/decompress() can submit, then poll an acomp_req.
  2024-10-18  6:40 [RFC PATCH v1 00/13] zswap IAA compress batching Kanchana P Sridhar
                   ` (2 preceding siblings ...)
  2024-10-18  6:40 ` [RFC PATCH v1 03/13] crypto: testmgr - Add crypto testmgr acomp poll support Kanchana P Sridhar
@ 2024-10-18  6:40 ` Kanchana P Sridhar
  2024-10-23  0:48   ` Yosry Ahmed
  2024-10-18  6:40 ` [RFC PATCH v1 05/13] crypto: iaa - Make async mode the default Kanchana P Sridhar
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18  6:40 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi, viro, brauner, jack, mcgrof,
	kees, joel.granados, bfoster, willy, linux-fsdevel
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

If the crypto_acomp has a poll interface registered, zswap_compress()
and zswap_decompress() will submit the acomp_req, and then poll() for a
successful completion/error status in a busy-wait loop. This provides an
asynchronous way to manage (potentially multiple) acomp_reqs without the
use of interrupts, as supported by the iaa_crypto driver.

This enables us to implement batch submission of multiple
compression/decompression jobs to the Intel IAA hardware accelerator,
which will process them in parallel; followed by polling the batch's
acomp_reqs for completion status.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 51 +++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 39 insertions(+), 12 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index f6316b66fb23..948c9745ee57 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -910,18 +910,34 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
 
 	/*
-	 * it maybe looks a little bit silly that we send an asynchronous request,
-	 * then wait for its completion synchronously. This makes the process look
-	 * synchronous in fact.
-	 * Theoretically, acomp supports users send multiple acomp requests in one
-	 * acomp instance, then get those requests done simultaneously. but in this
-	 * case, zswap actually does store and load page by page, there is no
-	 * existing method to send the second page before the first page is done
-	 * in one thread doing zwap.
-	 * but in different threads running on different cpu, we have different
-	 * acomp instance, so multiple threads can do (de)compression in parallel.
+	 * If the crypto_acomp provides an asynchronous poll() interface,
+	 * submit the descriptor and poll for a completion status.
+	 *
+	 * It may look a little silly that we send an asynchronous request,
+	 * then wait for its completion in a busy-wait poll loop, or
+	 * synchronously. This makes the process synchronous in fact.
+	 * Theoretically, acomp allows users to send multiple acomp requests
+	 * in one acomp instance, then get those requests done simultaneously.
+	 * But in this case, zswap actually does store and load page by page;
+	 * there is no existing method to send the second page before the
+	 * first page is done in one thread doing zswap.
+	 * But in different threads running on different cpus, we have
+	 * different acomp instances, so multiple threads can do
+	 * (de)compression in parallel.
 	 */
-	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
+	if (acomp_ctx->acomp->poll) {
+		comp_ret = crypto_acomp_compress(acomp_ctx->req);
+		if (comp_ret == -EINPROGRESS) {
+			do {
+				comp_ret = crypto_acomp_poll(acomp_ctx->req);
+				if (comp_ret && comp_ret != -EAGAIN)
+					break;
+			} while (comp_ret);
+		}
+	} else {
+		comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
+	}
+
 	dlen = acomp_ctx->req->dlen;
 	if (comp_ret)
 		goto unlock;
@@ -959,6 +975,7 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	struct scatterlist input, output;
 	struct crypto_acomp_ctx *acomp_ctx;
 	u8 *src;
+	int ret;
 
 	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
 	mutex_lock(&acomp_ctx->mutex);
@@ -984,7 +1001,17 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	sg_init_table(&output, 1);
 	sg_set_folio(&output, folio, PAGE_SIZE, 0);
 	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
-	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
+	if (acomp_ctx->acomp->poll) {
+		ret = crypto_acomp_decompress(acomp_ctx->req);
+		if (ret == -EINPROGRESS) {
+			do {
+				ret = crypto_acomp_poll(acomp_ctx->req);
+				BUG_ON(ret && ret != -EAGAIN);
+			} while (ret);
+		}
+	} else {
+		BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
+	}
 	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
 	mutex_unlock(&acomp_ctx->mutex);
 
-- 
2.27.0




* [RFC PATCH v1 05/13] crypto: iaa - Make async mode the default.
  2024-10-18  6:40 [RFC PATCH v1 00/13] zswap IAA compress batching Kanchana P Sridhar
                   ` (3 preceding siblings ...)
  2024-10-18  6:40 ` [RFC PATCH v1 04/13] mm: zswap: zswap_compress()/decompress() can submit, then poll an acomp_req Kanchana P Sridhar
@ 2024-10-18  6:40 ` Kanchana P Sridhar
  2024-10-18  6:40 ` [RFC PATCH v1 06/13] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18  6:40 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi, viro, brauner, jack, mcgrof,
	kees, joel.granados, bfoster, willy, linux-fsdevel
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch makes it easier for IAA hardware acceleration in the iaa_crypto
driver to be loaded by default in the most efficient/recommended "async"
mode, namely, asynchronous submission of descriptors followed by polling
for job completions. Earlier, "sync" mode was the default.

This way, anyone that wants to use IAA can do so after building the kernel,
and *without* having to go through these steps to use async poll:

  1) disable all the IAA device/wq bindings that happen at boot time
  2) rmmod iaa_crypto
  3) modprobe iaa_crypto
  4) echo async > /sys/bus/dsa/drivers/crypto/sync_mode
  5) re-run initialization of the IAA devices and wqs

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 6a8577ac1330..6c262b1eb09d 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -153,7 +153,7 @@ static DRIVER_ATTR_RW(verify_compress);
  */
 
 /* Use async mode */
-static bool async_mode;
+static bool async_mode = true;
 /* Use interrupts */
 static bool use_irq;
 
-- 
2.27.0




* [RFC PATCH v1 06/13] crypto: iaa - Disable iaa_verify_compress by default.
  2024-10-18  6:40 [RFC PATCH v1 00/13] zswap IAA compress batching Kanchana P Sridhar
                   ` (4 preceding siblings ...)
  2024-10-18  6:40 ` [RFC PATCH v1 05/13] crypto: iaa - Make async mode the default Kanchana P Sridhar
@ 2024-10-18  6:40 ` Kanchana P Sridhar
  2024-10-18  6:40 ` [RFC PATCH v1 07/13] crypto: iaa - Change cpu-to-iaa mappings to evenly balance cores to IAAs Kanchana P Sridhar
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18  6:40 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi, viro, brauner, jack, mcgrof,
	kees, joel.granados, bfoster, willy, linux-fsdevel
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch makes it easier for IAA hardware acceleration in the iaa_crypto
driver to be loaded by default with "iaa_verify_compress" disabled, to
facilitate performance comparisons with software compressors (which also
do not run compress verification by default). Earlier, iaa_crypto compress
verification was enabled by default.

With this patch, if users want to enable compress verification, they can do
so with these steps:

  1) disable all the IAA device/wq bindings that happen at boot time
  2) rmmod iaa_crypto
  3) modprobe iaa_crypto
  4) echo 1 > /sys/bus/dsa/drivers/crypto/verify_compress
  5) re-run initialization of the IAA devices and wqs

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 6c262b1eb09d..8e6859c97970 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -94,7 +94,7 @@ static bool iaa_crypto_enabled;
 static bool iaa_crypto_registered;
 
 /* Verify results of IAA compress or not */
-static bool iaa_verify_compress = true;
+static bool iaa_verify_compress = false;
 
 static ssize_t verify_compress_show(struct device_driver *driver, char *buf)
 {
-- 
2.27.0




* [RFC PATCH v1 07/13] crypto: iaa - Change cpu-to-iaa mappings to evenly balance cores to IAAs.
  2024-10-18  6:40 [RFC PATCH v1 00/13] zswap IAA compress batching Kanchana P Sridhar
                   ` (5 preceding siblings ...)
  2024-10-18  6:40 ` [RFC PATCH v1 06/13] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
@ 2024-10-18  6:40 ` Kanchana P Sridhar
  2024-10-18  6:40 ` [RFC PATCH v1 08/13] crypto: iaa - Distribute compress jobs to all IAA devices on a NUMA node Kanchana P Sridhar
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18  6:40 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi, viro, brauner, jack, mcgrof,
	kees, joel.granados, bfoster, willy, linux-fsdevel
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This change distributes the cpus more evenly among the IAAs in each socket.

 Old algorithm to assign cpus to IAA:
 ------------------------------------
 If "nr_cpus" = nr_logical_cpus (includes hyper-threading), the current
 algorithm determines "nr_cpus_per_node" = nr_cpus / nr_nodes.

 Hence, on a 2-socket Sapphire Rapids server where each socket has 56 cores
 and 4 IAA devices, nr_cpus_per_node = 112.

 Further, cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa
 Hence, cpus_per_iaa = 224/8 = 28.

 The iaa_crypto driver then assigns 28 "logical" node cpus per IAA device
 on that node, that results in this cpu-to-iaa mapping:

 lscpu|grep NUMA
 NUMA node(s):        2
 NUMA node0 CPU(s):   0-55,112-167
 NUMA node1 CPU(s):   56-111,168-223

 NUMA node 0:
 cpu   0-27    28-55  112-139  140-167
 iaa   iax1    iax3   iax5     iax7

 NUMA node 1:
 cpu   56-83  84-111  168-195   196-223
 iaa   iax9   iax11   iax13     iax15

 This appears non-optimal for a few reasons:

 1) The 2 logical threads on a core will get assigned to different IAA
    devices. For e.g.:
      cpu 0:   iax1
      cpu 112: iax5
 2) One of the logical threads on a core is assigned to an IAA that is not
    closest to that core. For e.g. cpu 112.
 3) If numactl is used to start processes sequentially on the logical
    cores, some of the IAA devices on the socket could be over-subscribed,
    while some could be under-utilized.

This patch introduces a scheme to more evenly balance the logical cores to
IAA devices on a socket.

 New algorithm to assign cpus to IAA:
 ------------------------------------
 We introduce a function "cpu_to_iaa()" that takes a logical cpu and
 returns the IAA device closest to it.

 If "nr_cpus" = nr_logical_cpus (includes hyper-threading), the new
 algorithm determines "nr_cpus_per_node" = topology_num_cores_per_package().

 Hence, on a 2-socket Sapphire Rapids server where each socket has 56 cores
 and 4 IAA devices, nr_cpus_per_node = 56.

 Further, cpus_per_iaa = (nr_nodes * nr_cpus_per_node) / nr_iaa
 Hence, cpus_per_iaa = 112/8 = 14.

 The iaa_crypto driver then assigns 14 "logical" node cpus per IAA device
 on that node, that results in this cpu-to-iaa mapping:

 NUMA node 0:
 cpu   0-13,112-125   14-27,126-139  28-41,140-153  42-55,154-167
 iaa   iax1           iax3           iax5           iax7

 NUMA node 1:
 cpu   56-69,168-181  70-83,182-195  84-97,196-209   98-111,210-223
 iaa   iax9           iax11          iax13           iax15

 This resolves the 3 issues with non-optimality of cpu-to-iaa mappings
 pointed out earlier with the existing approach.

Originally-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 84 ++++++++++++++--------
 1 file changed, 54 insertions(+), 30 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index 8e6859c97970..c854a7a1aaa4 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -55,6 +55,46 @@ static struct idxd_wq *wq_table_next_wq(int cpu)
 	return entry->wqs[entry->cur_wq];
 }
 
+/*
+ * Given a cpu, find the closest IAA instance.  The idea is to try to
+ * choose the most appropriate IAA instance for a caller and spread
+ * available workqueues around to clients.
+ */
+static inline int cpu_to_iaa(int cpu)
+{
+	int node, n_cpus = 0, test_cpu, iaa = 0;
+	int nr_iaa_per_node;
+	const struct cpumask *node_cpus;
+
+	if (!nr_nodes)
+		return 0;
+
+	nr_iaa_per_node = nr_iaa / nr_nodes;
+	if (!nr_iaa_per_node)
+		return 0;
+
+	for_each_online_node(node) {
+		node_cpus = cpumask_of_node(node);
+		if (!cpumask_test_cpu(cpu, node_cpus))
+			continue;
+
+		for_each_cpu(test_cpu, node_cpus) {
+			if ((n_cpus % nr_cpus_per_node) == 0)
+				iaa = node * nr_iaa_per_node;
+
+			if (test_cpu == cpu)
+				return iaa;
+
+			n_cpus++;
+
+			if ((n_cpus % cpus_per_iaa) == 0)
+				iaa++;
+		}
+	}
+
+	return -1;
+}
+
 static void wq_table_add(int cpu, struct idxd_wq *wq)
 {
 	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
@@ -895,8 +935,7 @@ static int wq_table_add_wqs(int iaa, int cpu)
  */
 static void rebalance_wq_table(void)
 {
-	const struct cpumask *node_cpus;
-	int node, cpu, iaa = -1;
+	int cpu, iaa;
 
 	if (nr_iaa == 0)
 		return;
@@ -906,37 +945,22 @@ static void rebalance_wq_table(void)
 
 	clear_wq_table();
 
-	if (nr_iaa == 1) {
-		for (cpu = 0; cpu < nr_cpus; cpu++) {
-			if (WARN_ON(wq_table_add_wqs(0, cpu))) {
-				pr_debug("could not add any wqs for iaa 0 to cpu %d!\n", cpu);
-				return;
-			}
-		}
-
-		return;
-	}
-
-	for_each_node_with_cpus(node) {
-		node_cpus = cpumask_of_node(node);
-
-		for (cpu = 0; cpu <  cpumask_weight(node_cpus); cpu++) {
-			int node_cpu = cpumask_nth(cpu, node_cpus);
-
-			if (WARN_ON(node_cpu >= nr_cpu_ids)) {
-				pr_debug("node_cpu %d doesn't exist!\n", node_cpu);
-				return;
-			}
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		iaa = cpu_to_iaa(cpu);
+		pr_debug("rebalance: cpu=%d iaa=%d\n", cpu, iaa);
 
-			if ((cpu % cpus_per_iaa) == 0)
-				iaa++;
+		if (WARN_ON(iaa == -1)) {
+			pr_debug("rebalance (cpu_to_iaa(%d)) failed!\n", cpu);
+			return;
+		}
 
-			if (WARN_ON(wq_table_add_wqs(iaa, node_cpu))) {
-				pr_debug("could not add any wqs for iaa %d to cpu %d!\n", iaa, cpu);
-				return;
-			}
+		if (WARN_ON(wq_table_add_wqs(iaa, cpu))) {
+			pr_debug("could not add any wqs for iaa %d to cpu %d!\n", iaa, cpu);
+			return;
 		}
 	}
+
+	pr_debug("Finished rebalance local wqs.");
 }
 
 static inline int check_completion(struct device *dev,
@@ -2084,7 +2108,7 @@ static int __init iaa_crypto_init_module(void)
 		pr_err("IAA couldn't find any nodes with cpus\n");
 		return -ENODEV;
 	}
-	nr_cpus_per_node = nr_cpus / nr_nodes;
+	nr_cpus_per_node = topology_num_cores_per_package();
 
 	if (crypto_has_comp("deflate-generic", 0, 0))
 		deflate_generic_tfm = crypto_alloc_comp("deflate-generic", 0, 0);
-- 
2.27.0




* [RFC PATCH v1 08/13] crypto: iaa - Distribute compress jobs to all IAA devices on a NUMA node.
  2024-10-18  6:40 [RFC PATCH v1 00/13] zswap IAA compress batching Kanchana P Sridhar
                   ` (6 preceding siblings ...)
  2024-10-18  6:40 ` [RFC PATCH v1 07/13] crypto: iaa - Change cpu-to-iaa mappings to evenly balance cores to IAAs Kanchana P Sridhar
@ 2024-10-18  6:40 ` Kanchana P Sridhar
  2024-10-18  6:40 ` [RFC PATCH v1 09/13] mm: zswap: Config variable to enable compress batching in zswap_store() Kanchana P Sridhar
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18  6:40 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi, viro, brauner, jack, mcgrof,
	kees, joel.granados, bfoster, willy, linux-fsdevel
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This change enables processes running on any logical core on a NUMA node to
use all the IAA devices enabled on that NUMA node for compress jobs. In
other words, compressions originating from any process in a node will be
distributed in round-robin manner to the available IAA devices on the same
socket. The main premise behind this change is to make sure that no
compress engines on any IAA device are left unutilized or under-utilized:
the compress engines on all IAA devices are considered a global resource
for that socket.

This allows the use of all IAA devices present in a given NUMA node for
(batched) compressions originating from zswap/zram, from all cores
on this node.

A new per-cpu "global_wq_table" implements this in the iaa_crypto driver.
We can think of the global WQ per IAA as a WQ to which all cores on
that socket can submit compress jobs.

To avail of this feature, the user must configure 2 WQs per IAA in order to
enable distribution of compress jobs to multiple IAA devices.

Each IAA will have 2 WQs:
 wq.0 (local WQ):
   Used for decompress jobs from cores mapped by the cpu_to_iaa() "even
   balancing of logical cores to IAA devices" algorithm.

 wq.1 (global WQ):
   Used for compress jobs from *all* logical cores on that socket.

The iaa_crypto driver will place all global WQs from all same-socket IAA
devices in the global_wq_table per cpu on that socket. When the driver
receives a compress job, it will lookup the "next" global WQ in the cpu's
global_wq_table to submit the descriptor.

The starting wq in the global_wq_table for each cpu is the global wq
associated with the IAA nearest to it, so that we stagger the starting
global wq for each process. This results in very uniform usage of all IAAs
for compress jobs.

Two new driver attributes are added for this feature:

g_wqs_per_iaa (default 1):

 /sys/bus/dsa/drivers/crypto/g_wqs_per_iaa

 This represents the number of global WQs that can be configured per IAA
 device. The default is 1, and is the recommended setting to enable the use
 of this feature once the user configures 2 WQs per IAA using higher level
 scripts as described in
 Documentation/driver-api/crypto/iaa/iaa-crypto.rst.

g_consec_descs_per_gwq (default 1):

 /sys/bus/dsa/drivers/crypto/g_consec_descs_per_gwq

 This represents the number of consecutive compress jobs that will be
 submitted to the same global WQ (i.e. to the same IAA device) from a given
 core, before moving to the next global WQ. The default is 1, which is also
 the recommended setting to avail of this feature.
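
Both knobs are exposed via sysfs and can be inspected or changed before the
driver is enabled, e.g. (a usage sketch; stores return -EBUSY while
iaa_crypto is enabled):

  cat /sys/bus/dsa/drivers/crypto/g_wqs_per_iaa              # default 1
  echo 1 > /sys/bus/dsa/drivers/crypto/g_consec_descs_per_gwq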

The decompress jobs from any core will be sent to the "local" IAA, namely
the one that the driver assigns with the cpu_to_iaa() mapping algorithm
that evenly balances the assignment of logical cores to IAA devices on a
NUMA node.

On a 2-socket Sapphire Rapids server where each socket has 56 cores and
4 IAA devices, this is how the compress/decompress jobs will be mapped
when the user configures 2 WQs per IAA device (which implies wq.1 will
be added to the global WQ table for each logical core on that NUMA node):

 lscpu|grep NUMA
 NUMA node(s):        2
 NUMA node0 CPU(s):   0-55,112-167
 NUMA node1 CPU(s):   56-111,168-223

 Compress jobs:
 --------------
 NUMA node 0:
 All cpus (0-55,112-167) can send compress jobs to all IAA devices on the
 socket (iax1/iax3/iax5/iax7) in round-robin manner:
 iaa   iax1           iax3           iax5           iax7

 NUMA node 1:
 All cpus (56-111,168-223) can send compress jobs to all IAA devices on the
 socket (iax9/iax11/iax13/iax15) in round-robin manner:
 iaa   iax9           iax11          iax13           iax15

 Decompress jobs:
 ----------------
 NUMA node 0:
 cpu   0-13,112-125   14-27,126-139  28-41,140-153  42-55,154-167
 iaa   iax1           iax3           iax5           iax7

 NUMA node 1:
 cpu   56-69,168-181  70-83,182-195  84-97,196-209   98-111,210-223
 iaa   iax9           iax11          iax13           iax15

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 305 ++++++++++++++++++++-
 1 file changed, 290 insertions(+), 15 deletions(-)

diff --git a/drivers/crypto/intel/iaa/iaa_crypto_main.c b/drivers/crypto/intel/iaa/iaa_crypto_main.c
index c854a7a1aaa4..2d6c517e9d9b 100644
--- a/drivers/crypto/intel/iaa/iaa_crypto_main.c
+++ b/drivers/crypto/intel/iaa/iaa_crypto_main.c
@@ -29,14 +29,23 @@ static unsigned int nr_iaa;
 static unsigned int nr_cpus;
 static unsigned int nr_nodes;
 static unsigned int nr_cpus_per_node;
-
 /* Number of physical cpus sharing each iaa instance */
 static unsigned int cpus_per_iaa;
 
 static struct crypto_comp *deflate_generic_tfm;
 
 /* Per-cpu lookup table for balanced wqs */
-static struct wq_table_entry __percpu *wq_table;
+static struct wq_table_entry __percpu *wq_table = NULL;
+
+/* Per-cpu lookup table for global wqs shared by all cpus. */
+static struct wq_table_entry __percpu *global_wq_table = NULL;
+
+/*
+ * Per-cpu counter of consecutive descriptors allocated to
+ * the same wq in the global_wq_table, so that we know
+ * when to switch to the next wq in the global_wq_table.
+ */
+static int __percpu *num_consec_descs_per_wq = NULL;
 
 static struct idxd_wq *wq_table_next_wq(int cpu)
 {
@@ -104,26 +113,68 @@ static void wq_table_add(int cpu, struct idxd_wq *wq)
 
 	entry->wqs[entry->n_wqs++] = wq;
 
-	pr_debug("%s: added iaa wq %d.%d to idx %d of cpu %d\n", __func__,
-		 entry->wqs[entry->n_wqs - 1]->idxd->id,
-		 entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
+	pr_debug("%s: added iaa local wq %d.%d to idx %d of cpu %d\n", __func__,
+		entry->wqs[entry->n_wqs - 1]->idxd->id,
+		entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
+}
+
+static void global_wq_table_add(int cpu, struct idxd_wq *wq)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(global_wq_table, cpu);
+
+	if (WARN_ON(entry->n_wqs == entry->max_wqs))
+		return;
+
+	entry->wqs[entry->n_wqs++] = wq;
+
+	pr_debug("%s: added iaa global wq %d.%d to idx %d of cpu %d\n", __func__,
+		entry->wqs[entry->n_wqs - 1]->idxd->id,
+		entry->wqs[entry->n_wqs - 1]->id, entry->n_wqs - 1, cpu);
+}
+
+static void global_wq_table_set_start_wq(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(global_wq_table, cpu);
+	int start_wq = (entry->n_wqs / nr_iaa) * cpu_to_iaa(cpu);
+
+	if ((start_wq >= 0) && (start_wq < entry->n_wqs))
+		entry->cur_wq = start_wq;
 }
 
 static void wq_table_free_entry(int cpu)
 {
 	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
 
-	kfree(entry->wqs);
-	memset(entry, 0, sizeof(*entry));
+	if (entry) {
+		kfree(entry->wqs);
+		memset(entry, 0, sizeof(*entry));
+	}
+
+	entry = per_cpu_ptr(global_wq_table, cpu);
+
+	if (entry) {
+		kfree(entry->wqs);
+		memset(entry, 0, sizeof(*entry));
+	}
 }
 
 static void wq_table_clear_entry(int cpu)
 {
 	struct wq_table_entry *entry = per_cpu_ptr(wq_table, cpu);
 
-	entry->n_wqs = 0;
-	entry->cur_wq = 0;
-	memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
+	if (entry) {
+		entry->n_wqs = 0;
+		entry->cur_wq = 0;
+		memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
+	}
+
+	entry = per_cpu_ptr(global_wq_table, cpu);
+
+	if (entry) {
+		entry->n_wqs = 0;
+		entry->cur_wq = 0;
+		memset(entry->wqs, 0, entry->max_wqs * sizeof(struct idxd_wq *));
+	}
 }
 
 LIST_HEAD(iaa_devices);
@@ -163,6 +214,70 @@ static ssize_t verify_compress_store(struct device_driver *driver,
 }
 static DRIVER_ATTR_RW(verify_compress);
 
+/* Number of global wqs per iaa*/
+static int g_wqs_per_iaa = 1;
+
+static ssize_t g_wqs_per_iaa_show(struct device_driver *driver, char *buf)
+{
+	return sprintf(buf, "%d\n", g_wqs_per_iaa);
+}
+
+static ssize_t g_wqs_per_iaa_store(struct device_driver *driver,
+				     const char *buf, size_t count)
+{
+	int ret = -EBUSY;
+
+	mutex_lock(&iaa_devices_lock);
+
+	if (iaa_crypto_enabled)
+		goto out;
+
+	ret = kstrtoint(buf, 10, &g_wqs_per_iaa);
+	if (ret)
+		goto out;
+
+	ret = count;
+out:
+	mutex_unlock(&iaa_devices_lock);
+
+	return ret;
+}
+static DRIVER_ATTR_RW(g_wqs_per_iaa);
+
+/*
+ * Number of consecutive descriptors to allocate from a
+ * given global wq before switching to the next wq in
+ * the global_wq_table.
+ */
+static int g_consec_descs_per_gwq = 1;
+
+static ssize_t g_consec_descs_per_gwq_show(struct device_driver *driver, char *buf)
+{
+	return sprintf(buf, "%d\n", g_consec_descs_per_gwq);
+}
+
+static ssize_t g_consec_descs_per_gwq_store(struct device_driver *driver,
+				     const char *buf, size_t count)
+{
+	int ret = -EBUSY;
+
+	mutex_lock(&iaa_devices_lock);
+
+	if (iaa_crypto_enabled)
+		goto out;
+
+	ret = kstrtoint(buf, 10, &g_consec_descs_per_gwq);
+	if (ret)
+		goto out;
+
+	ret = count;
+out:
+	mutex_unlock(&iaa_devices_lock);
+
+	return ret;
+}
+static DRIVER_ATTR_RW(g_consec_descs_per_gwq);
+
 /*
  * The iaa crypto driver supports three 'sync' methods determining how
  * compressions and decompressions are performed:
@@ -751,7 +866,20 @@ static void free_wq_table(void)
 	for (cpu = 0; cpu < nr_cpus; cpu++)
 		wq_table_free_entry(cpu);
 
-	free_percpu(wq_table);
+	if (wq_table) {
+		free_percpu(wq_table);
+		wq_table = NULL;
+	}
+
+	if (global_wq_table) {
+		free_percpu(global_wq_table);
+		global_wq_table = NULL;
+	}
+
+	if (num_consec_descs_per_wq) {
+		free_percpu(num_consec_descs_per_wq);
+		num_consec_descs_per_wq = NULL;
+	}
 
 	pr_debug("freed wq table\n");
 }
@@ -774,6 +902,38 @@ static int alloc_wq_table(int max_wqs)
 		}
 
 		entry->max_wqs = max_wqs;
+		entry->n_wqs = 0;
+		entry->cur_wq = 0;
+	}
+
+	global_wq_table = alloc_percpu(struct wq_table_entry);
+	if (!global_wq_table) {
+		free_wq_table();
+		return -ENOMEM;
+	}
+
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		entry = per_cpu_ptr(global_wq_table, cpu);
+		entry->wqs = kzalloc(max_wqs * sizeof(struct idxd_wq *), GFP_KERNEL);
+		if (!entry->wqs) {
+			free_wq_table();
+			return -ENOMEM;
+		}
+
+		entry->max_wqs = max_wqs;
+		entry->n_wqs = 0;
+		entry->cur_wq = 0;
+	}
+
+	num_consec_descs_per_wq = alloc_percpu(int);
+	if (!num_consec_descs_per_wq) {
+		free_wq_table();
+		return -ENOMEM;
+	}
+
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+		int *num_consec_descs = per_cpu_ptr(num_consec_descs_per_wq, cpu);
+		*num_consec_descs = 0;
 	}
 
 	pr_debug("initialized wq table\n");
@@ -912,9 +1072,14 @@ static int wq_table_add_wqs(int iaa, int cpu)
 	}
 
 	list_for_each_entry(iaa_wq, &found_device->wqs, list) {
-		wq_table_add(cpu, iaa_wq->wq);
+
+		if (((found_device->n_wq - g_wqs_per_iaa) < 1) ||
+			(n_wqs_added < (found_device->n_wq - g_wqs_per_iaa))) {
+			wq_table_add(cpu, iaa_wq->wq);
+		}
+
 		pr_debug("rebalance: added wq for cpu=%d: iaa wq %d.%d\n",
-			 cpu, iaa_wq->wq->idxd->id, iaa_wq->wq->id);
+			cpu, iaa_wq->wq->idxd->id, iaa_wq->wq->id);
 		n_wqs_added++;
 	}
 
@@ -927,6 +1092,63 @@ static int wq_table_add_wqs(int iaa, int cpu)
 	return ret;
 }
 
+static int global_wq_table_add_wqs(void)
+{
+	struct iaa_device *iaa_device;
+	int ret = 0, n_wqs_added;
+	struct idxd_device *idxd;
+	struct iaa_wq *iaa_wq;
+	struct pci_dev *pdev;
+	struct device *dev;
+	int cpu, node, node_of_cpu = -1;
+
+	for (cpu = 0; cpu < nr_cpus; cpu++) {
+
+#ifdef CONFIG_NUMA
+		node_of_cpu = -1;
+		for_each_online_node(node) {
+			const struct cpumask *node_cpus = cpumask_of_node(node);
+			if (!cpumask_test_cpu(cpu, node_cpus))
+				continue;
+			node_of_cpu = node;
+			break;
+		}
+#endif
+		list_for_each_entry(iaa_device, &iaa_devices, list) {
+			idxd = iaa_device->idxd;
+			pdev = idxd->pdev;
+			dev = &pdev->dev;
+
+#ifdef CONFIG_NUMA
+			if (dev && (node_of_cpu != dev->numa_node))
+				continue;
+#endif
+
+			if (iaa_device->n_wq <= g_wqs_per_iaa)
+				continue;
+
+			n_wqs_added = 0;
+
+			list_for_each_entry(iaa_wq, &iaa_device->wqs, list) {
+
+				if (n_wqs_added < (iaa_device->n_wq - g_wqs_per_iaa)) {
+					n_wqs_added++;
+				} else {
+					global_wq_table_add(cpu, iaa_wq->wq);
+					pr_debug("rebalance: added global wq for cpu=%d: iaa wq %d.%d\n",
+						cpu, iaa_wq->wq->idxd->id, iaa_wq->wq->id);
+				}
+			}
+		}
+
+		global_wq_table_set_start_wq(cpu);
+	}
+
+	return ret;
+}
+
 /*
  * Rebalance the wq table so that given a cpu, it's easy to find the
  * closest IAA instance.  The idea is to try to choose the most
@@ -961,6 +1183,7 @@ static void rebalance_wq_table(void)
 	}
 
 	pr_debug("Finished rebalance local wqs.");
+	global_wq_table_add_wqs();
 }
 
 static inline int check_completion(struct device *dev,
@@ -1509,6 +1732,27 @@ static int iaa_decompress(struct crypto_tfm *tfm, struct acomp_req *req,
 	goto out;
 }
 
+/*
+ * Callers must ensure that this is invoked only if the per-cpu
+ * "global_wq_table" is non-NULL and has at least one wq configured.
+ */
+static struct idxd_wq *global_wq_table_next_wq(int cpu)
+{
+	struct wq_table_entry *entry = per_cpu_ptr(global_wq_table, cpu);
+	int *num_consec_descs = per_cpu_ptr(num_consec_descs_per_wq, cpu);
+
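+	/*
+	 * Round-robin selection: once g_consec_descs_per_gwq descriptors
+	 * have been issued to the current wq, advance to the next wq in
+	 * the per-cpu global_wq_table, wrapping around at n_wqs.
+	 */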
+	if ((*num_consec_descs) == g_consec_descs_per_gwq) {
+		if (++entry->cur_wq >= entry->n_wqs)
+			entry->cur_wq = 0;
+		*num_consec_descs = 0;
+	}
+
+	++(*num_consec_descs);
+
+	return entry->wqs[entry->cur_wq];
+}
+
 static int iaa_comp_acompress(struct acomp_req *req)
 {
 	struct iaa_compression_ctx *compression_ctx;
@@ -1521,6 +1765,7 @@ static int iaa_comp_acompress(struct acomp_req *req)
 	struct idxd_wq *wq;
 	struct device *dev;
 	int order = -1;
+	struct wq_table_entry *entry;
 
 	compression_ctx = crypto_tfm_ctx(tfm);
 
@@ -1535,8 +1780,15 @@ static int iaa_comp_acompress(struct acomp_req *req)
 	}
 
 	cpu = get_cpu();
-	wq = wq_table_next_wq(cpu);
+	entry = per_cpu_ptr(global_wq_table, cpu);
+
+	if (!entry || entry->n_wqs == 0) {
+		wq = wq_table_next_wq(cpu);
+	} else {
+		wq = global_wq_table_next_wq(cpu);
+	}
 	put_cpu();
+
 	if (!wq) {
 		pr_debug("no wq configured for cpu=%d\n", cpu);
 		return -ENODEV;
@@ -2145,13 +2397,32 @@ static int __init iaa_crypto_init_module(void)
 		goto err_sync_attr_create;
 	}
 
+	ret = driver_create_file(&iaa_crypto_driver.drv,
+				&driver_attr_g_wqs_per_iaa);
+	if (ret) {
+		pr_debug("IAA g_wqs_per_iaa attr creation failed\n");
+		goto err_g_wqs_per_iaa_attr_create;
+	}
+
+	ret = driver_create_file(&iaa_crypto_driver.drv,
+				&driver_attr_g_consec_descs_per_gwq);
+	if (ret) {
+		pr_debug("IAA g_consec_descs_per_gwq attr creation failed\n");
+		goto err_g_consec_descs_per_gwq_attr_create;
+	}
+
 	if (iaa_crypto_debugfs_init())
 		pr_warn("debugfs init failed, stats not available\n");
 
 	pr_debug("initialized\n");
 out:
 	return ret;
-
+err_g_consec_descs_per_gwq_attr_create:
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_g_wqs_per_iaa);
+err_g_wqs_per_iaa_attr_create:
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_sync_mode);
 err_sync_attr_create:
 	driver_remove_file(&iaa_crypto_driver.drv,
 			   &driver_attr_verify_compress);
@@ -2175,6 +2446,10 @@ static void __exit iaa_crypto_cleanup_module(void)
 			   &driver_attr_sync_mode);
 	driver_remove_file(&iaa_crypto_driver.drv,
 			   &driver_attr_verify_compress);
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_g_wqs_per_iaa);
+	driver_remove_file(&iaa_crypto_driver.drv,
+			   &driver_attr_g_consec_descs_per_gwq);
 	idxd_driver_unregister(&iaa_crypto_driver);
 	iaa_aecs_cleanup_fixed();
 	crypto_free_comp(deflate_generic_tfm);
-- 
2.27.0



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [RFC PATCH v1 09/13] mm: zswap: Config variable to enable compress batching in zswap_store().
  2024-10-18  6:40 [RFC PATCH v1 00/13] zswap IAA compress batching Kanchana P Sridhar
                   ` (7 preceding siblings ...)
  2024-10-18  6:40 ` [RFC PATCH v1 08/13] crypto: iaa - Distribute compress jobs to all IAA devices on a NUMA node Kanchana P Sridhar
@ 2024-10-18  6:40 ` Kanchana P Sridhar
  2024-10-23  0:49   ` Yosry Ahmed
  2024-10-18  6:40 ` [RFC PATCH v1 10/13] mm: zswap: Create multiple reqs/buffers in crypto_acomp_ctx if platform has IAA Kanchana P Sridhar
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18  6:40 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi, viro, brauner, jack, mcgrof,
	kees, joel.granados, bfoster, willy, linux-fsdevel
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Add a new zswap config variable that controls whether zswap_store() will
compress a batch of pages, for instance, the pages in a large folio:

  CONFIG_ZSWAP_STORE_BATCHING_ENABLED

The existing CONFIG_CRYPTO_DEV_IAA_CRYPTO variable added in commit
ea7a5cbb4369 ("crypto: iaa - Add Intel IAA Compression Accelerator crypto
driver core") is used to detect if the system has the Intel Analytics
Accelerator (IAA), and the iaa_crypto module is available. If so, the
kernel build will prompt for CONFIG_ZSWAP_STORE_BATCHING_ENABLED. Hence,
users have the ability to set CONFIG_ZSWAP_STORE_BATCHING_ENABLED="y" only
on systems that have Intel IAA.

If CONFIG_ZSWAP_STORE_BATCHING_ENABLED is enabled, and IAA is configured
as the zswap compressor, zswap_store() will process the pages in a large
folio in batches, i.e., multiple pages at a time. Pages in a batch will be
compressed in parallel in hardware, then stored. On systems without Intel
IAA and/or if zswap uses software compressors, pages in the batch will be
compressed sequentially and stored.

The patch also implements a zswap API that returns the status of this
config variable.
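
For context, a later patch in this series (patch 12) gates zswap_store()
on this API roughly as follows; this is a sketch of that usage, not code
added by this patch:

    /* Attempt batching only for large folios, and only if the config
     * variable is enabled; otherwise take the sequential store path.
     */
    if (folio_test_large(folio) && zswap_store_batching_enabled()) {
        int error = -1;

        __zswap_store_batch_core(folio_nid(folio), &folio, &error, 1);
        return !error;
    }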

Suggested-by: Ying Huang <ying.huang@intel.com>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 include/linux/zswap.h |  6 ++++++
 mm/Kconfig            | 12 ++++++++++++
 mm/zswap.c            | 14 ++++++++++++++
 3 files changed, 32 insertions(+)

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index d961ead91bf1..74ad2a24b309 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -24,6 +24,7 @@ struct zswap_lruvec_state {
 	atomic_long_t nr_disk_swapins;
 };
 
+bool zswap_store_batching_enabled(void);
 unsigned long zswap_total_pages(void);
 bool zswap_store(struct folio *folio);
 bool zswap_load(struct folio *folio);
@@ -39,6 +40,11 @@ bool zswap_never_enabled(void);
 
 struct zswap_lruvec_state {};
 
+static inline bool zswap_store_batching_enabled(void)
+{
+	return false;
+}
+
 static inline bool zswap_store(struct folio *folio)
 {
 	return false;
diff --git a/mm/Kconfig b/mm/Kconfig
index 33fa51d608dc..26d1a5cee471 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -125,6 +125,18 @@ config ZSWAP_COMPRESSOR_DEFAULT
        default "zstd" if ZSWAP_COMPRESSOR_DEFAULT_ZSTD
        default ""
 
+config ZSWAP_STORE_BATCHING_ENABLED
+	bool "Batching of zswap stores with Intel IAA"
+	depends on ZSWAP && CRYPTO_DEV_IAA_CRYPTO
+	default n
+	help
+	  Enables zswap_store to swap out large folios in batches of 8 pages,
+	  rather than a page at a time, if the system has Intel IAA for
+	  hardware acceleration of compressions. If IAA is configured as the
+	  zswap compressor, this will parallelize batch compression of up to
+	  8 pages in the folio in hardware, thereby improving large folio
+	  compression throughput and reducing swapout latency.
+
 choice
 	prompt "Default allocator"
 	depends on ZSWAP
diff --git a/mm/zswap.c b/mm/zswap.c
index 948c9745ee57..4893302d8c34 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -127,6 +127,15 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
 		CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
 module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
 
+/*
+ * Enable/disable batching of compressions if zswap_store is called with a
+ * large folio. If enabled, and if IAA is the zswap compressor, pages are
+ * compressed in parallel in batches of say, 8 pages.
+ * If not, every page is compressed sequentially.
+ */
+static bool __zswap_store_batching_enabled = IS_ENABLED(
+	CONFIG_ZSWAP_STORE_BATCHING_ENABLED);
+
 bool zswap_is_enabled(void)
 {
 	return zswap_enabled;
@@ -241,6 +250,11 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
 	pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name,		\
 		 zpool_get_type((p)->zpool))
 
+__always_inline bool zswap_store_batching_enabled(void)
+{
+	return __zswap_store_batching_enabled;
+}
+
 /*********************************
 * pool functions
 **********************************/
-- 
2.27.0



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [RFC PATCH v1 10/13] mm: zswap: Create multiple reqs/buffers in crypto_acomp_ctx if platform has IAA.
  2024-10-18  6:40 [RFC PATCH v1 00/13] zswap IAA compress batching Kanchana P Sridhar
                   ` (8 preceding siblings ...)
  2024-10-18  6:40 ` [RFC PATCH v1 09/13] mm: zswap: Config variable to enable compress batching in zswap_store() Kanchana P Sridhar
@ 2024-10-18  6:40 ` Kanchana P Sridhar
  2024-10-23  0:51   ` Yosry Ahmed
  2024-10-18  6:40 ` [RFC PATCH v1 11/13] mm: swap: Add IAA batch compression API swap_crypto_acomp_compress_batch() Kanchana P Sridhar
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18  6:40 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi, viro, brauner, jack, mcgrof,
	kees, joel.granados, bfoster, willy, linux-fsdevel
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Intel IAA hardware acceleration can be used effectively to improve the
zswap_store() performance of large folios by batching multiple pages in a
folio to be compressed in parallel by IAA. Hence, to build compress
batching of zswap large folio stores using IAA, zswap needs to be able
to submit a batch of compress jobs to the hardware for parallel
compression when the iaa_crypto "async" mode is used.

The IAA compress batching paradigm works as follows (a minimal model is
sketched after this list):

 1) Submit N crypto_acomp_compress() jobs using N requests.
 2) Use the iaa_crypto driver async poll() method to check for the jobs
    to complete.
 3) There are no ordering constraints implied by submission, hence we
    could loop through the requests and process any job that has
    completed.
 4) This would repeat until all jobs have completed with success/error
    status.
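
A minimal, self-contained user-space model of this submit-then-poll loop
is sketched below; submit_job() and poll_job() are stand-ins for
crypto_acomp_compress() and crypto_acomp_poll(), and the fixed
"remaining" counts simply simulate jobs completing out of order:

    #include <errno.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define N_JOBS 8

    /* Simulated per-job latency: poll_job() returns -EAGAIN this many
     * times before reporting completion.
     */
    static int remaining[N_JOBS] = { 3, 1, 4, 1, 5, 2, 6, 2 };

    static int submit_job(int i)
    {
        (void)i;
        return -EINPROGRESS;    /* job accepted; completes later */
    }

    static int poll_job(int i)
    {
        return remaining[i]-- > 0 ? -EAGAIN : 0;
    }

    int main(void)
    {
        int i, errors[N_JOBS];
        bool done = false;

        /* Step 1: submit all jobs up front. */
        for (i = 0; i < N_JOBS; i++)
            errors[i] = submit_job(i);

        /* Steps 2-4: sweep the batch until every job has retired,
         * processing completions in any order.
         */
        while (!done) {
            done = true;
            for (i = 0; i < N_JOBS; i++) {
                if (errors[i] != -EINPROGRESS && errors[i] != -EAGAIN)
                    continue;
                errors[i] = poll_job(i);
                if (errors[i] == -EAGAIN)
                    done = false;
            }
        }

        for (i = 0; i < N_JOBS; i++)
            printf("job %d: status %d\n", i, errors[i]);
        return 0;
    }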

To facilitate this, we need to provide for multiple acomp_reqs in
"struct crypto_acomp_ctx", each representing a distinct compress
job. Likewise, there needs to be a distinct destination buffer
corresponding to each acomp_req.

If CONFIG_ZSWAP_STORE_BATCHING_ENABLED is enabled, this patch will set the
SWAP_CRYPTO_SUB_BATCH_SIZE constant to 8UL. This implies each per-cpu
crypto_acomp_ctx associated with the zswap_pool can submit up to 8
acomp_reqs at a time to accomplish parallel compressions.

If IAA is not present and/or CONFIG_ZSWAP_STORE_BATCHING_ENABLED is not
set, SWAP_CRYPTO_SUB_BATCH_SIZE will be set to 1UL.

On an Intel Sapphire Rapids server, each socket has 4 IAA, each of which
has 2 compress engines and 8 decompress engines. Experiments modeling a
contended system with say 72 processes running under a cgroup with a fixed
memory-limit, have shown that there is a significant performance
improvement with dispatching compress jobs from all cores to all the
IAA devices on the socket. Hence, SWAP_CRYPTO_SUB_BATCH_SIZE is set to
8 to maximize compression throughput if IAA is available.

The definition of "struct crypto_acomp_ctx" is modified to make the
req/buffer be arrays of size SWAP_CRYPTO_SUB_BATCH_SIZE. Thus, the
added memory footprint cost of this per-cpu structure for batching is
incurred only for platforms that have Intel IAA.

Suggested-by: Ying Huang <ying.huang@intel.com>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/swap.h  |  11 ++++++
 mm/zswap.c | 104 ++++++++++++++++++++++++++++++++++-------------------
 2 files changed, 78 insertions(+), 37 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index ad2f121de970..566616c971d4 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -8,6 +8,17 @@ struct mempolicy;
 #include <linux/swapops.h> /* for swp_offset */
 #include <linux/blk_types.h> /* for bio_end_io_t */
 
+/*
+ * For IAA compression batching:
+ * Maximum number of IAA acomp compress requests that will be processed
+ * in a sub-batch.
+ */
+#if defined(CONFIG_ZSWAP_STORE_BATCHING_ENABLED)
+#define SWAP_CRYPTO_SUB_BATCH_SIZE 8UL
+#else
+#define SWAP_CRYPTO_SUB_BATCH_SIZE 1UL
+#endif
+
 /* linux/mm/page_io.c */
 int sio_pool_init(void);
 struct swap_iocb;
diff --git a/mm/zswap.c b/mm/zswap.c
index 4893302d8c34..579869d1bdf6 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -152,9 +152,9 @@ bool zswap_never_enabled(void)
 
 struct crypto_acomp_ctx {
 	struct crypto_acomp *acomp;
-	struct acomp_req *req;
+	struct acomp_req *req[SWAP_CRYPTO_SUB_BATCH_SIZE];
+	u8 *buffer[SWAP_CRYPTO_SUB_BATCH_SIZE];
 	struct crypto_wait wait;
-	u8 *buffer;
 	struct mutex mutex;
 	bool is_sleepable;
 };
@@ -832,49 +832,64 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
 	struct crypto_acomp *acomp;
-	struct acomp_req *req;
 	int ret;
+	int i, j;
 
 	mutex_init(&acomp_ctx->mutex);
 
-	acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
-	if (!acomp_ctx->buffer)
-		return -ENOMEM;
-
 	acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
 	if (IS_ERR(acomp)) {
 		pr_err("could not alloc crypto acomp %s : %ld\n",
 				pool->tfm_name, PTR_ERR(acomp));
-		ret = PTR_ERR(acomp);
-		goto acomp_fail;
+		return PTR_ERR(acomp);
 	}
 	acomp_ctx->acomp = acomp;
 	acomp_ctx->is_sleepable = acomp_is_async(acomp);
 
-	req = acomp_request_alloc(acomp_ctx->acomp);
-	if (!req) {
-		pr_err("could not alloc crypto acomp_request %s\n",
-		       pool->tfm_name);
-		ret = -ENOMEM;
-		goto req_fail;
+	for (i = 0; i < SWAP_CRYPTO_SUB_BATCH_SIZE; ++i) {
+		acomp_ctx->buffer[i] = kmalloc_node(PAGE_SIZE * 2,
+						GFP_KERNEL, cpu_to_node(cpu));
+		if (!acomp_ctx->buffer[i]) {
+			for (j = 0; j < i; ++j)
+				kfree(acomp_ctx->buffer[j]);
+			ret = -ENOMEM;
+			goto buf_fail;
+		}
+	}
+
+	for (i = 0; i < SWAP_CRYPTO_SUB_BATCH_SIZE; ++i) {
+		acomp_ctx->req[i] = acomp_request_alloc(acomp_ctx->acomp);
+		if (!acomp_ctx->req[i]) {
+			pr_err("could not alloc crypto acomp_request req[%d] %s\n",
+			       i, pool->tfm_name);
+			for (j = 0; j < i; ++j)
+				acomp_request_free(acomp_ctx->req[j]);
+			ret = -ENOMEM;
+			goto req_fail;
+		}
 	}
-	acomp_ctx->req = req;
 
+	/*
+	 * The crypto_wait is used only in the fully synchronous case, i.e.,
+	 * with scomp or the non-poll mode of acomp; hence there is only one
+	 * "wait" per acomp_ctx, with the callback set on req[0].
+	 */
 	crypto_init_wait(&acomp_ctx->wait);
 	/*
 	 * if the backend of acomp is async zip, crypto_req_done() will wakeup
 	 * crypto_wait_req(); if the backend of acomp is scomp, the callback
 	 * won't be called, crypto_wait_req() will return without blocking.
 	 */
-	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+	acomp_request_set_callback(acomp_ctx->req[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
 				   crypto_req_done, &acomp_ctx->wait);
 
 	return 0;
 
 req_fail:
+	for (i = 0; i < SWAP_CRYPTO_SUB_BATCH_SIZE; ++i)
+		kfree(acomp_ctx->buffer[i]);
+buf_fail:
 	crypto_free_acomp(acomp_ctx->acomp);
-acomp_fail:
-	kfree(acomp_ctx->buffer);
 	return ret;
 }
 
@@ -884,11 +899,17 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
 
 	if (!IS_ERR_OR_NULL(acomp_ctx)) {
-		if (!IS_ERR_OR_NULL(acomp_ctx->req))
-			acomp_request_free(acomp_ctx->req);
+		int i;
+
+		for (i = 0; i < SWAP_CRYPTO_SUB_BATCH_SIZE; ++i)
+			if (!IS_ERR_OR_NULL(acomp_ctx->req[i]))
+				acomp_request_free(acomp_ctx->req[i]);
+
+		for (i = 0; i < SWAP_CRYPTO_SUB_BATCH_SIZE; ++i)
+			kfree(acomp_ctx->buffer[i]);
+
 		if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
 			crypto_free_acomp(acomp_ctx->acomp);
-		kfree(acomp_ctx->buffer);
 	}
 
 	return 0;
@@ -911,7 +932,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 
 	mutex_lock(&acomp_ctx->mutex);
 
-	dst = acomp_ctx->buffer;
+	dst = acomp_ctx->buffer[0];
 	sg_init_table(&input, 1);
 	sg_set_page(&input, page, PAGE_SIZE, 0);
 
@@ -921,7 +942,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	 * giving the dst buffer with enough length to avoid buffer overflow.
 	 */
 	sg_init_one(&output, dst, PAGE_SIZE * 2);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
+	acomp_request_set_params(acomp_ctx->req[0], &input, &output, PAGE_SIZE, dlen);
 
 	/*
 	 * If the crypto_acomp provides an asynchronous poll() interface,
@@ -940,19 +961,20 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	 * parallel.
 	 */
 	if (acomp_ctx->acomp->poll) {
-		comp_ret = crypto_acomp_compress(acomp_ctx->req);
+		comp_ret = crypto_acomp_compress(acomp_ctx->req[0]);
 		if (comp_ret == -EINPROGRESS) {
 			do {
-				comp_ret = crypto_acomp_poll(acomp_ctx->req);
+				comp_ret = crypto_acomp_poll(acomp_ctx->req[0]);
 				if (comp_ret && comp_ret != -EAGAIN)
 					break;
 			} while (comp_ret);
 		}
 	} else {
-		comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
+		comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req[0]),
+					   &acomp_ctx->wait);
 	}
 
-	dlen = acomp_ctx->req->dlen;
+	dlen = acomp_ctx->req[0]->dlen;
 	if (comp_ret)
 		goto unlock;
 
@@ -1006,31 +1028,39 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	 */
 	if ((acomp_ctx->is_sleepable && !zpool_can_sleep_mapped(zpool)) ||
 	    !virt_addr_valid(src)) {
-		memcpy(acomp_ctx->buffer, src, entry->length);
-		src = acomp_ctx->buffer;
+		memcpy(acomp_ctx->buffer[0], src, entry->length);
+		src = acomp_ctx->buffer[0];
 		zpool_unmap_handle(zpool, entry->handle);
 	}
 
 	sg_init_one(&input, src, entry->length);
 	sg_init_table(&output, 1);
 	sg_set_folio(&output, folio, PAGE_SIZE, 0);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
+	acomp_request_set_params(acomp_ctx->req[0], &input, &output,
+				 entry->length, PAGE_SIZE);
 	if (acomp_ctx->acomp->poll) {
-		ret = crypto_acomp_decompress(acomp_ctx->req);
+		ret = crypto_acomp_decompress(acomp_ctx->req[0]);
 		if (ret == -EINPROGRESS) {
 			do {
-				ret = crypto_acomp_poll(acomp_ctx->req);
+				ret = crypto_acomp_poll(acomp_ctx->req[0]);
 				BUG_ON(ret && ret != -EAGAIN);
 			} while (ret);
 		}
 	} else {
-		BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
+		BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req[0]),
+				       &acomp_ctx->wait));
 	}
-	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
-	mutex_unlock(&acomp_ctx->mutex);
+	BUG_ON(acomp_ctx->req[0]->dlen != PAGE_SIZE);
 
-	if (src != acomp_ctx->buffer)
+	if (src != acomp_ctx->buffer[0])
 		zpool_unmap_handle(zpool, entry->handle);
+
+	/*
+	 * It is safer to unlock the mutex after the check for
+	 * "src != acomp_ctx->buffer[0]" so that the value of "src"
+	 * does not change.
+	 */
+	mutex_unlock(&acomp_ctx->mutex);
 }
 
 /*********************************
-- 
2.27.0



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [RFC PATCH v1 11/13] mm: swap: Add IAA batch compression API swap_crypto_acomp_compress_batch().
  2024-10-18  6:40 [RFC PATCH v1 00/13] zswap IAA compress batching Kanchana P Sridhar
                   ` (9 preceding siblings ...)
  2024-10-18  6:40 ` [RFC PATCH v1 10/13] mm: zswap: Create multiple reqs/buffers in crypto_acomp_ctx if platform has IAA Kanchana P Sridhar
@ 2024-10-18  6:40 ` Kanchana P Sridhar
  2024-10-23  0:53   ` Yosry Ahmed
  2024-10-18  6:41 ` [RFC PATCH v1 12/13] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios Kanchana P Sridhar
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18  6:40 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi, viro, brauner, jack, mcgrof,
	kees, joel.granados, bfoster, willy, linux-fsdevel
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Add a new API, swap_crypto_acomp_compress_batch(), that performs batch
compression. A system that has Intel IAA can use this API to submit a
batch of compress jobs for parallel compression in hardware, to improve
performance. On a system without IAA, this API will process each compress
job sequentially.

The purpose of this API is to be invocable from any swap module that needs
to compress large folios, or a batch of pages in the general case. For
instance, zswap would batch compress up to SWAP_CRYPTO_SUB_BATCH_SIZE
(i.e. 8 if the system has IAA) pages in the large folio in parallel to
improve zswap_store() performance.

Towards this eventual goal:

1) The definition of "struct crypto_acomp_ctx" is moved to mm/swap.h
   so that mm modules like swap_state.c and zswap.c can reference it.
2) The swap_crypto_acomp_compress_batch() interface is implemented in
   swap_state.c.

It would be preferable for "struct crypto_acomp_ctx" to be defined in,
and for swap_crypto_acomp_compress_batch() to be exported via
include/linux/swap.h so that modules outside mm (e.g., zram) can
potentially use the API for batch compressions with IAA. I would
appreciate RFC comments on this.
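
As a usage illustration, a hypothetical caller could look roughly like
the sketch below; my_compress_pages() and its arguments are illustrative
only, while the destination buffer sizing and the mutex convention
follow the kernel-doc of the API:

    static void my_compress_pages(struct page *pages[], int nr,
                                  struct crypto_acomp_ctx *acomp_ctx)
    {
        u8 *dsts[SWAP_CRYPTO_SUB_BATCH_SIZE];
        unsigned int dlens[SWAP_CRYPTO_SUB_BATCH_SIZE];
        int errors[SWAP_CRYPTO_SUB_BATCH_SIZE];
        int i;

        /* nr must not exceed SWAP_CRYPTO_SUB_BATCH_SIZE. */
        mutex_lock(&acomp_ctx->mutex);

        for (i = 0; i < nr; i++) {
            /* Each acomp_ctx buffer is pre-allocated, 2 * PAGE_SIZE. */
            dsts[i] = acomp_ctx->buffer[i];
            dlens[i] = PAGE_SIZE;
        }

        swap_crypto_acomp_compress_batch(pages, dsts, dlens, errors,
                                         nr, acomp_ctx);

        for (i = 0; i < nr; i++)
            if (!errors[i])
                pr_debug("page %d compressed to %u bytes\n",
                         i, dlens[i]);

        mutex_unlock(&acomp_ctx->mutex);
    }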

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/swap.h       |  45 +++++++++++++++++++
 mm/swap_state.c | 115 ++++++++++++++++++++++++++++++++++++++++++++++++
 mm/zswap.c      |   9 ----
 3 files changed, 160 insertions(+), 9 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index 566616c971d4..4dcb67e2cc33 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -7,6 +7,7 @@ struct mempolicy;
 #ifdef CONFIG_SWAP
 #include <linux/swapops.h> /* for swp_offset */
 #include <linux/blk_types.h> /* for bio_end_io_t */
+#include <linux/crypto.h>
 
 /*
  * For IAA compression batching:
@@ -19,6 +20,39 @@ struct mempolicy;
 #define SWAP_CRYPTO_SUB_BATCH_SIZE 1UL
 #endif
 
+/* linux/mm/swap_state.c, zswap.c */
+struct crypto_acomp_ctx {
+	struct crypto_acomp *acomp;
+	struct acomp_req *req[SWAP_CRYPTO_SUB_BATCH_SIZE];
+	u8 *buffer[SWAP_CRYPTO_SUB_BATCH_SIZE];
+	struct crypto_wait wait;
+	struct mutex mutex;
+	bool is_sleepable;
+};
+
+/**
+ * This API provides IAA compress batching functionality for use by swap
+ * modules.
+ * The acomp_ctx mutex should be locked/unlocked before/after calling this
+ * procedure.
+ *
+ * @pages: Pages to be compressed.
+ * @dsts: Pre-allocated destination buffers to store results of IAA compression.
+ * @dlens: Will contain the compressed lengths.
+ * @errors: Will contain a 0 if the page was successfully compressed, or a
+ *          non-0 error value to be processed by the calling function.
+ * @nr_pages: The number of pages, up to SWAP_CRYPTO_SUB_BATCH_SIZE,
+ *            to be compressed.
+ * @acomp_ctx: The acomp context for iaa_crypto/other compressor.
+ */
+void swap_crypto_acomp_compress_batch(
+	struct page *pages[],
+	u8 *dsts[],
+	unsigned int dlens[],
+	int errors[],
+	int nr_pages,
+	struct crypto_acomp_ctx *acomp_ctx);
+
 /* linux/mm/page_io.c */
 int sio_pool_init(void);
 struct swap_iocb;
@@ -119,6 +153,17 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
 
 #else /* CONFIG_SWAP */
 struct swap_iocb;
+struct crypto_acomp_ctx {};
+static inline void swap_crypto_acomp_compress_batch(
+	struct page *pages[],
+	u8 *dsts[],
+	unsigned int dlens[],
+	int errors[],
+	int nr_pages,
+	struct crypto_acomp_ctx *acomp_ctx)
+{
+}
+
 static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 {
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4669f29cf555..117c3caa5679 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -23,6 +23,8 @@
 #include <linux/swap_slots.h>
 #include <linux/huge_mm.h>
 #include <linux/shmem_fs.h>
+#include <linux/scatterlist.h>
+#include <crypto/acompress.h>
 #include "internal.h"
 #include "swap.h"
 
@@ -742,6 +744,119 @@ void exit_swap_address_space(unsigned int type)
 	swapper_spaces[type] = NULL;
 }
 
+#ifdef CONFIG_SWAP
+
+/**
+ * This API provides IAA compress batching functionality for use by swap
+ * modules.
+ * The acomp_ctx mutex should be locked/unlocked before/after calling this
+ * procedure.
+ *
+ * @pages: Pages to be compressed.
+ * @dsts: Pre-allocated destination buffers to store results of IAA compression.
+ * @dlens: Will contain the compressed lengths.
+ * @errors: Will contain a 0 if the page was successfully compressed, or a
+ *          non-0 error value to be processed by the calling function.
+ * @nr_pages: The number of pages, up to SWAP_CRYPTO_SUB_BATCH_SIZE,
+ *            to be compressed.
+ * @acomp_ctx: The acomp context for iaa_crypto/other compressor.
+ */
+void swap_crypto_acomp_compress_batch(
+	struct page *pages[],
+	u8 *dsts[],
+	unsigned int dlens[],
+	int errors[],
+	int nr_pages,
+	struct crypto_acomp_ctx *acomp_ctx)
+{
+	struct scatterlist inputs[SWAP_CRYPTO_SUB_BATCH_SIZE];
+	struct scatterlist outputs[SWAP_CRYPTO_SUB_BATCH_SIZE];
+	bool compressions_done = false;
+	int i, j;
+
+	BUG_ON(nr_pages > SWAP_CRYPTO_SUB_BATCH_SIZE);
+
+	/*
+	 * Prepare and submit acomp_reqs to IAA.
+	 * IAA will process these compress jobs in parallel in async mode.
+	 * If the compressor does not support a poll() method, or if IAA is
+	 * used in sync mode, the jobs will be processed sequentially using
+	 * acomp_ctx->req[0] and acomp_ctx->wait.
+	 */
+	for (i = 0; i < nr_pages; ++i) {
+		j = acomp_ctx->acomp->poll ? i : 0;
+		sg_init_table(&inputs[i], 1);
+		sg_set_page(&inputs[i], pages[i], PAGE_SIZE, 0);
+
+		/*
+		 * Each acomp_ctx->buffer[] is of size (PAGE_SIZE * 2).
+		 * Reflect same in sg_list.
+		 */
+		sg_init_one(&outputs[i], dsts[i], PAGE_SIZE * 2);
+		acomp_request_set_params(acomp_ctx->req[j], &inputs[i],
+					 &outputs[i], PAGE_SIZE, dlens[i]);
+
+		/*
+		 * If the crypto_acomp provides an asynchronous poll()
+		 * interface, submit the request to the driver now, and poll for
+		 * a completion status later, after all descriptors have been
+		 * submitted. If the crypto_acomp does not provide a poll()
+		 * interface, submit the request and wait for it to complete,
+		 * i.e., synchronously, before moving on to the next request.
+		 */
+		if (acomp_ctx->acomp->poll) {
+			errors[i] = crypto_acomp_compress(acomp_ctx->req[j]);
+
+			if (errors[i] != -EINPROGRESS)
+				errors[i] = -EINVAL;
+			else
+				errors[i] = -EAGAIN;
+		} else {
+			errors[i] = crypto_wait_req(
+					      crypto_acomp_compress(acomp_ctx->req[j]),
+					      &acomp_ctx->wait);
+			if (!errors[i])
+				dlens[i] = acomp_ctx->req[j]->dlen;
+		}
+	}
+
+	/*
+	 * If not doing async compressions, the batch has been processed at
+	 * this point and we can return.
+	 */
+	if (!acomp_ctx->acomp->poll)
+		return;
+
+	/*
+	 * Poll for and process IAA compress job completions
+	 * in out-of-order manner.
+	 */
+	while (!compressions_done) {
+		compressions_done = true;
+
+		for (i = 0; i < nr_pages; ++i) {
+			/*
+			 * Skip, if the compression has already completed
+			 * successfully or with an error.
+			 */
+			if (errors[i] != -EAGAIN)
+				continue;
+
+			errors[i] = crypto_acomp_poll(acomp_ctx->req[i]);
+
+			if (errors[i]) {
+				if (errors[i] == -EAGAIN)
+					compressions_done = false;
+			} else {
+				dlens[i] = acomp_ctx->req[i]->dlen;
+			}
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(swap_crypto_acomp_compress_batch);
+
+#endif /* CONFIG_SWAP */
+
 static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start,
 			   unsigned long *end)
 {
diff --git a/mm/zswap.c b/mm/zswap.c
index 579869d1bdf6..cab3114321f9 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -150,15 +150,6 @@ bool zswap_never_enabled(void)
 * data structures
 **********************************/
 
-struct crypto_acomp_ctx {
-	struct crypto_acomp *acomp;
-	struct acomp_req *req[SWAP_CRYPTO_SUB_BATCH_SIZE];
-	u8 *buffer[SWAP_CRYPTO_SUB_BATCH_SIZE];
-	struct crypto_wait wait;
-	struct mutex mutex;
-	bool is_sleepable;
-};
-
 /*
  * The lock ordering is zswap_tree.lock -> zswap_pool.lru_lock.
  * The only case where lru_lock is not acquired while holding tree.lock is
-- 
2.27.0



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [RFC PATCH v1 12/13] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios.
  2024-10-18  6:40 [RFC PATCH v1 00/13] zswap IAA compress batching Kanchana P Sridhar
                   ` (10 preceding siblings ...)
  2024-10-18  6:40 ` [RFC PATCH v1 11/13] mm: swap: Add IAA batch compression API swap_crypto_acomp_compress_batch() Kanchana P Sridhar
@ 2024-10-18  6:41 ` Kanchana P Sridhar
  2024-10-18  6:41 ` [RFC PATCH v1 13/13] mm: vmscan, swap, zswap: Compress batching of folios in shrink_folio_list() Kanchana P Sridhar
  2024-10-23  0:56 ` [RFC PATCH v1 00/13] zswap IAA compress batching Yosry Ahmed
  13 siblings, 0 replies; 36+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18  6:41 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi, viro, brauner, jack, mcgrof,
	kees, joel.granados, bfoster, willy, linux-fsdevel
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

If the system has Intel IAA, and if CONFIG_ZSWAP_STORE_BATCHING_ENABLED
is set to "y", zswap_store() will call swap_crypto_acomp_compress_batch()
to batch compress up to SWAP_CRYPTO_SUB_BATCH_SIZE pages in large folios
in parallel using the multiple compress engines available in IAA hardware.

On platforms with multiple IAA devices per socket, compress jobs from all
cores in a socket will be distributed among all IAA devices on the socket
by the iaa_crypto driver.

If zswap_store() is called with a large folio, and if
zswap_store_batching_enabled() returns "true", it will call the
main __zswap_store_batch_core() interface for compress batching. This
interface embodies an extensible compress batching architecture: it can
potentially be called with a batch of any-order folios from
shrink_folio_list(). In other words, although zswap_store()
calls __zswap_store_batch_core() with exactly one large folio in this
patch, we will reuse this API to reclaim a batch of folios in subsequent
patches.

The newly added functions that implement batched stores follow the
general structure of zswap_store() of a large folio. Some amount of
restructuring and optimization is done to minimize failure points
for a batch, fail early and maximize the zswap store pipeline occupancy
with SWAP_CRYPTO_SUB_BATCH_SIZE pages, potentially from multiple
folios. This is intended to maximize reclaim throughput with the IAA
hardware parallel compressions.
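
To make the sub-batch arithmetic concrete (assuming 4KB pages and
SWAP_CRYPTO_SUB_BATCH_SIZE == 8; DIV_ROUND_UP is the standard kernel
macro):

    /* A 64KB folio has 16 pages and is stored as 2 sub-batches of
     * 8 pages each; IAA compresses each sub-batch in parallel. A
     * trailing, partially-filled sub-batch may also pick up pages
     * from the next folio in the batch, keeping the pipeline full.
     */
    long nr_pages = folio_nr_pages(folio);                      /* 16 */
    long nr_sub_batches = DIV_ROUND_UP(nr_pages,
                                       SWAP_CRYPTO_SUB_BATCH_SIZE); /* 2 */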

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 include/linux/zswap.h |  84 ++++++
 mm/zswap.c            | 591 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 671 insertions(+), 4 deletions(-)

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 74ad2a24b309..9bbe330686f6 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -24,6 +24,88 @@ struct zswap_lruvec_state {
 	atomic_long_t nr_disk_swapins;
 };
 
+/*
+ * struct zswap_store_sub_batch_page:
+ *
+ * This represents one "zswap batching element", namely, the
+ * attributes associated with a page in a large folio that will
+ * be compressed and stored in zswap. The term "batch" is reserved
+ * for a conceptual "batch" of folios that can be sent to
+ * zswap_store() by reclaim. The term "sub-batch" is used to describe
+ * a collection of "zswap batching elements", i.e., an array of
+ * "struct zswap_store_sub_batch_page *".
+ *
+ * The zswap compress sub-batch size is specified by
+ * SWAP_CRYPTO_SUB_BATCH_SIZE, currently set as 8UL if the
+ * platform has Intel IAA. This means zswap can store a large folio
+ * by creating sub-batches of up to 8 pages and compressing this
+ * batch using IAA to parallelize the 8 compress jobs in hardware.
+ * For e.g., a 64KB folio can be compressed as 2 sub-batches of
+ * 8 pages each. This can significantly improve the zswap_store()
+ * performance for large folios.
+ *
+ * Although the page itself is represented directly, the structure
+ * adds a "u8 batch_idx" to represent an index for the folio in a
+ * conceptual "batch of folios" that can be passed to zswap_store().
+ * Conceptually, this allows for up to 256 folios that can be passed
+ * to zswap_store(). If this conceptual number of folios sent to
+ * zswap_store() exceeds 256, the "batch_idx" needs to become u16.
+ */
+struct zswap_store_sub_batch_page {
+	u8 batch_idx;
+	swp_entry_t swpentry;
+	struct obj_cgroup *objcg;
+	struct zswap_entry *entry;
+	int error; /* folio error status. */
+};
+
+/*
+ * struct zswap_store_pipeline_state:
+ *
+ * This stores state during IAA compress batching of (conceptually, a batch of)
+ * folios. The term pipelining in this context, refers to breaking down
+ * the batch of folios being reclaimed into sub-batches of
+ * SWAP_CRYPTO_SUB_BATCH_SIZE pages, batch compressing and storing the
+ * sub-batch. This concept could be further evolved to use overlap of CPU
+ * computes with IAA computes. For instance, we could stage the post-compress
+ * computes for sub-batch "N-1" to happen in parallel with IAA batch
+ * compression of sub-batch "N".
+ *
+ * We begin by developing the concept of compress batching. Pipelining with
+ * overlap can be future work.
+ *
+ * @errors: The errors status for the batch of reclaim folios passed in from
+ *          a higher mm layer such as swap_writepage().
+ * @pool: A valid zswap_pool.
+ * @acomp_ctx: The per-cpu pointer to the crypto_acomp_ctx for the @pool.
+ * @sub_batch: This is an array that represents the sub-batch of up to
+ *             SWAP_CRYPTO_SUB_BATCH_SIZE pages that are being stored
+ *             in zswap.
+ * @comp_dsts: The destination buffers for crypto_acomp_compress() for each
+ *             page being compressed.
+ * @comp_dlens: The destination buffers' lengths from crypto_acomp_compress()
+ *              obtained after crypto_acomp_poll() returns completion status,
+ *              for each page being compressed.
+ * @comp_errors: Compression errors for each page being compressed.
+ * @nr_comp_pages: Total number of pages in @sub_batch.
+ *
+ * Note:
+ * The max sub-batch size is SWAP_CRYPTO_SUB_BATCH_SIZE, currently 8UL.
+ * Hence, if SWAP_CRYPTO_SUB_BATCH_SIZE exceeds 256, some of the
+ * u8 members (except @comp_dsts) need to become u16.
+ */
+struct zswap_store_pipeline_state {
+	int *errors;
+	struct zswap_pool *pool;
+	struct crypto_acomp_ctx *acomp_ctx;
+	struct zswap_store_sub_batch_page *sub_batch;
+	struct page **comp_pages;
+	u8 **comp_dsts;
+	unsigned int *comp_dlens;
+	int *comp_errors;
+	u8 nr_comp_pages;
+};
+
 bool zswap_store_batching_enabled(void);
 unsigned long zswap_total_pages(void);
 bool zswap_store(struct folio *folio);
@@ -39,6 +121,8 @@ bool zswap_never_enabled(void);
 #else
 
 struct zswap_lruvec_state {};
+struct zswap_store_sub_batch_page {};
+struct zswap_store_pipeline_state {};
 
 static inline bool zswap_store_batching_enabled(void)
 {
diff --git a/mm/zswap.c b/mm/zswap.c
index cab3114321f9..1c12a7b9f4ff 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -130,7 +130,7 @@ module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
 /*
  * Enable/disable batching of compressions if zswap_store is called with a
  * large folio. If enabled, and if IAA is the zswap compressor, pages are
- * compressed in parallel in batches of say, 8 pages.
+ * compressed in parallel in batches of SWAP_CRYPTO_SUB_BATCH_SIZE pages.
  * If not, every page is compressed sequentially.
  */
 static bool __zswap_store_batching_enabled = IS_ENABLED(
@@ -246,6 +246,12 @@ __always_inline bool zswap_store_batching_enabled(void)
 	return __zswap_store_batching_enabled;
 }
 
+static void __zswap_store_batch_core(
+	int node_id,
+	struct folio **folios,
+	int *errors,
+	unsigned int nr_folios);
+
 /*********************************
 * pool functions
 **********************************/
@@ -906,6 +912,9 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
 	return 0;
 }
 
+/*
+ * The acomp_ctx->mutex must be locked/unlocked in the calling procedure.
+ */
 static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 			   struct zswap_pool *pool)
 {
@@ -921,8 +930,6 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 
 	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
 
-	mutex_lock(&acomp_ctx->mutex);
-
 	dst = acomp_ctx->buffer[0];
 	sg_init_table(&input, 1);
 	sg_set_page(&input, page, PAGE_SIZE, 0);
@@ -992,7 +999,6 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	else if (alloc_ret)
 		zswap_reject_alloc_fail++;
 
-	mutex_unlock(&acomp_ctx->mutex);
 	return comp_ret == 0 && alloc_ret == 0;
 }
 
@@ -1545,10 +1551,17 @@ static ssize_t zswap_store_page(struct page *page,
 	return -EINVAL;
 }
 
+/*
+ * Modified to use the IAA compress batching framework implemented in
+ * __zswap_store_batch_core() if zswap_store_batching_enabled() is true.
+ * The batching code is intended to significantly improve folio store
+ * performance over the sequential code.
+ */
 bool zswap_store(struct folio *folio)
 {
 	long nr_pages = folio_nr_pages(folio);
 	swp_entry_t swp = folio->swap;
+	struct crypto_acomp_ctx *acomp_ctx;
 	struct obj_cgroup *objcg = NULL;
 	struct mem_cgroup *memcg = NULL;
 	struct zswap_pool *pool;
@@ -1556,6 +1569,17 @@ bool zswap_store(struct folio *folio)
 	bool ret = false;
 	long index;
 
+	/*
+	 * Improve large folio zswap_store() latency with IAA compress batching.
+	 */
+	if (folio_test_large(folio) && zswap_store_batching_enabled()) {
+		int error = -1;
+		__zswap_store_batch_core(folio_nid(folio), &folio, &error, 1);
+		if (!error)
+			ret = true;
+		return ret;
+	}
+
 	VM_WARN_ON_ONCE(!folio_test_locked(folio));
 	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
 
@@ -1588,6 +1612,9 @@ bool zswap_store(struct folio *folio)
 		mem_cgroup_put(memcg);
 	}
 
+	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
+	mutex_lock(&acomp_ctx->mutex);
+
 	for (index = 0; index < nr_pages; ++index) {
 		struct page *page = folio_page(folio, index);
 		ssize_t bytes;
@@ -1609,6 +1636,7 @@ bool zswap_store(struct folio *folio)
 	ret = true;
 
 put_pool:
+	mutex_unlock(&acomp_ctx->mutex);
 	zswap_pool_put(pool);
 put_objcg:
 	obj_cgroup_put(objcg);
@@ -1638,6 +1666,561 @@ bool zswap_store(struct folio *folio)
 	return ret;
 }
 
+/*
+ * Note: If SWAP_CRYPTO_SUB_BATCH_SIZE exceeds 256, change the
+ * u8 stack variables in the next several functions to u16.
+ */
+
+/*
+ * Propagate the "sbp" error condition to other batch elements belonging to
+ * the same folio as "sbp".
+ */
+static __always_inline void zswap_store_propagate_errors(
+	struct zswap_store_pipeline_state *zst,
+	u8 error_batch_idx)
+{
+	u8 i;
+
+	if (zst->errors[error_batch_idx])
+		return;
+
+	for (i = 0; i < zst->nr_comp_pages; ++i) {
+		struct zswap_store_sub_batch_page *sbp = &zst->sub_batch[i];
+
+		if (sbp->batch_idx == error_batch_idx) {
+			if (!sbp->error) {
+				if (sbp->entry) {
+					if (!IS_ERR_VALUE(sbp->entry->handle))
+						zpool_free(zst->pool->zpool,
+							   sbp->entry->handle);
+
+					zswap_entry_cache_free(sbp->entry);
+					sbp->entry = NULL;
+				}
+				sbp->error = -EINVAL;
+			}
+		}
+	}
+
+	/*
+	 * Set zswap status for the folio to "error"
+	 * for use in swap_writepage.
+	 */
+	zst->errors[error_batch_idx] = -EINVAL;
+}
+
+static __always_inline void zswap_process_comp_errors(
+	struct zswap_store_pipeline_state *zst)
+{
+	u8 i;
+
+	for (i = 0; i < zst->nr_comp_pages; ++i) {
+		struct zswap_store_sub_batch_page *sbp = &zst->sub_batch[i];
+
+		if (zst->comp_errors[i]) {
+			if (zst->comp_errors[i] == -ENOSPC)
+				zswap_reject_compress_poor++;
+			else
+				zswap_reject_compress_fail++;
+
+			if (!sbp->error)
+				zswap_store_propagate_errors(zst,
+							     sbp->batch_idx);
+		}
+	}
+}
+
+static void zswap_compress_batch(struct zswap_store_pipeline_state *zst)
+{
+	/*
+	 * Compress up to SWAP_CRYPTO_SUB_BATCH_SIZE pages.
+	 * If IAA is the zswap compressor, this compresses the
+	 * pages in parallel, leading to significant performance
+	 * improvements as compared to software compressors.
+	 */
+	swap_crypto_acomp_compress_batch(
+		zst->comp_pages,
+		zst->comp_dsts,
+		zst->comp_dlens,
+		zst->comp_errors,
+		zst->nr_comp_pages,
+		zst->acomp_ctx);
+
+	/*
+	 * Scan the sub-batch for any compression errors,
+	 * and invalidate pages with errors, along with other
+	 * pages belonging to the same folio as the error pages.
+	 */
+	zswap_process_comp_errors(zst);
+}
+
+static void zswap_zpool_store_sub_batch(
+	struct zswap_store_pipeline_state *zst)
+{
+	u8 i;
+
+	for (i = 0; i < zst->nr_comp_pages; ++i) {
+		struct zswap_store_sub_batch_page *sbp = &zst->sub_batch[i];
+		struct zpool *zpool;
+		unsigned long handle;
+		char *buf;
+		gfp_t gfp;
+		int err;
+
+		/* Skip pages that had compress errors. */
+		if (sbp->error)
+			continue;
+
+		zpool = zst->pool->zpool;
+		gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
+		if (zpool_malloc_support_movable(zpool))
+			gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
+		err = zpool_malloc(zpool, zst->comp_dlens[i], gfp, &handle);
+
+		if (err) {
+			if (err == -ENOSPC)
+				zswap_reject_compress_poor++;
+			else
+				zswap_reject_alloc_fail++;
+
+			/*
+			 * An error should be propagated to other pages of the
+			 * same folio in the sub-batch, and zpool resources for
+			 * those pages (in sub-batch order prior to this zpool
+			 * error) should be de-allocated.
+			 */
+			zswap_store_propagate_errors(zst, sbp->batch_idx);
+			continue;
+		}
+
+		buf = zpool_map_handle(zpool, handle, ZPOOL_MM_WO);
+		memcpy(buf, zst->comp_dsts[i], zst->comp_dlens[i]);
+		zpool_unmap_handle(zpool, handle);
+
+		sbp->entry->handle = handle;
+		sbp->entry->length = zst->comp_dlens[i];
+	}
+}
+
+/*
+ * Returns true if the entry was successfully
+ * stored in the xarray, and false otherwise.
+ */
+static bool zswap_store_entry(swp_entry_t page_swpentry,
+			      struct zswap_entry *entry)
+{
+	struct zswap_entry *old = xa_store(swap_zswap_tree(page_swpentry),
+					   swp_offset(page_swpentry),
+					   entry, GFP_KERNEL);
+	if (xa_is_err(old)) {
+		int err = xa_err(old);
+
+		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
+		zswap_reject_alloc_fail++;
+		return false;
+	}
+
+	/*
+	 * We may have had an existing entry that became stale when
+	 * the folio was redirtied and now the new version is being
+	 * swapped out. Get rid of the old.
+	 */
+	if (old)
+		zswap_entry_free(old);
+
+	return true;
+}
+
+static void zswap_batch_compress_post_proc(
+	struct zswap_store_pipeline_state *zst)
+{
+	int nr_objcg_pages = 0, nr_pages = 0;
+	struct obj_cgroup *objcg = NULL;
+	size_t compressed_bytes = 0;
+	u8 i;
+
+	zswap_zpool_store_sub_batch(zst);
+
+	for (i = 0; i < zst->nr_comp_pages; ++i) {
+		struct zswap_store_sub_batch_page *sbp = &zst->sub_batch[i];
+
+		if (sbp->error)
+			continue;
+
+		if (!zswap_store_entry(sbp->swpentry, sbp->entry)) {
+			zswap_store_propagate_errors(zst, sbp->batch_idx);
+			continue;
+		}
+
+		/*
+		 * The entry is successfully compressed and stored in the tree,
+		 * there is no further possibility of failure. Grab refs to the
+		 * pool and objcg. These refs will be dropped by
+		 * zswap_entry_free() when the entry is removed from the tree.
+		 */
+		zswap_pool_get(zst->pool);
+		if (sbp->objcg)
+			obj_cgroup_get(sbp->objcg);
+
+		/*
+		 * We finish initializing the entry while it's already in xarray.
+		 * This is safe because:
+		 *
+		 * 1. Concurrent stores and invalidations are excluded by folio
+		 *    lock.
+		 *
+		 * 2. Writeback is excluded by the entry not being on the LRU yet.
+		 *    The publishing order matters to prevent writeback from seeing
+		 *    an incoherent entry.
+		 */
+		sbp->entry->pool = zst->pool;
+		sbp->entry->swpentry = sbp->swpentry;
+		sbp->entry->objcg = sbp->objcg;
+		sbp->entry->referenced = true;
+		if (sbp->entry->length) {
+			INIT_LIST_HEAD(&sbp->entry->lru);
+			zswap_lru_add(&zswap_list_lru, sbp->entry);
+		}
+
+		if (!objcg && sbp->objcg) {
+			objcg = sbp->objcg;
+		} else if (objcg && sbp->objcg && (objcg != sbp->objcg)) {
+			obj_cgroup_charge_zswap(objcg, compressed_bytes);
+			count_objcg_events(objcg, ZSWPOUT, nr_objcg_pages);
+			compressed_bytes = 0;
+			nr_objcg_pages = 0;
+			objcg = sbp->objcg;
+		}
+
+		if (sbp->objcg) {
+			compressed_bytes += sbp->entry->length;
+			++nr_objcg_pages;
+		}
+
+		++nr_pages;
+	} /* for sub-batch pages. */
+
+	if (objcg) {
+		obj_cgroup_charge_zswap(objcg, compressed_bytes);
+		count_objcg_events(objcg, ZSWPOUT, nr_objcg_pages);
+	}
+
+	atomic_long_add(nr_pages, &zswap_stored_pages);
+	count_vm_events(ZSWPOUT, nr_pages);
+}
+
+static void zswap_store_sub_batch(struct zswap_store_pipeline_state *zst)
+{
+	u8 i;
+
+	for (i = 0; i < zst->nr_comp_pages; ++i) {
+		zst->comp_dsts[i] = zst->acomp_ctx->buffer[i];
+		zst->comp_dlens[i] = PAGE_SIZE;
+	} /* for sub-batch pages. */
+
+	/*
+	 * Batch compress sub-batch "N". If IAA is the compressor, the
+	 * hardware will compress multiple pages in parallel.
+	 */
+	zswap_compress_batch(zst);
+
+	zswap_batch_compress_post_proc(zst);
+}
+
+static void zswap_add_folio_pages_to_sb(
+	struct zswap_store_pipeline_state *zst,
+	struct folio *folio,
+	u8 batch_idx,
+	struct obj_cgroup *objcg,
+	struct zswap_entry *entries[],
+	long start_idx,
+	u8 add_nr_pages)
+{
+	long index;
+
+	for (index = start_idx; index < (start_idx + add_nr_pages); ++index) {
+		u8 i = zst->nr_comp_pages;
+		struct zswap_store_sub_batch_page *sbp = &zst->sub_batch[i];
+		struct page *page = folio_page(folio, index);
+		zst->comp_pages[i] = page;
+		sbp->swpentry = page_swap_entry(page);
+		sbp->batch_idx = batch_idx;
+		sbp->objcg = objcg;
+		sbp->entry = entries[index - start_idx];
+		sbp->error = 0;
+		++zst->nr_comp_pages;
+	}
+}
+
+static __always_inline void zswap_store_reset_sub_batch(
+	struct zswap_store_pipeline_state *zst)
+{
+	zst->nr_comp_pages = 0;
+}
+
+/* Allocate entries for the next sub-batch. */
+static int zswap_alloc_entries(u8 nr_entries,
+			       struct zswap_entry *entries[],
+			       int node_id)
+{
+	u8 i;
+
+	for (i = 0; i < nr_entries; ++i) {
+		entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
+		if (!entries[i]) {
+			u8 j;
+
+			zswap_reject_kmemcache_fail++;
+			for (j = 0; j < i; ++j)
+				zswap_entry_cache_free(entries[j]);
+			return -EINVAL;
+		}
+
+		entries[i]->handle = (unsigned long)ERR_PTR(-EINVAL);
+	}
+
+	return 0;
+}
+
+/*
+ * If the zswap store fails or zswap is disabled, we must invalidate
+ * the possibly stale entries which were previously stored at the
+ * offsets corresponding to each page of the folio. Otherwise,
+ * writeback could overwrite the new data in the swapfile.
+ */
+static void zswap_delete_stored_entries(struct folio *folio)
+{
+	swp_entry_t swp = folio->swap;
+	unsigned type = swp_type(swp);
+	pgoff_t offset = swp_offset(swp);
+	struct zswap_entry *entry;
+	struct xarray *tree;
+	long index;
+
+	for (index = 0; index < folio_nr_pages(folio); ++index) {
+		tree = swap_zswap_tree(swp_entry(type, offset + index));
+		entry = xa_erase(tree, offset + index);
+		if (entry)
+			zswap_entry_free(entry);
+	}
+}
+
+static void zswap_store_process_folio_errors(
+	struct folio **folios,
+	int *errors,
+	unsigned int nr_folios)
+{
+	u8 batch_idx;
+
+	for (batch_idx = 0; batch_idx < nr_folios; ++batch_idx)
+		if (errors[batch_idx])
+			zswap_delete_stored_entries(folios[batch_idx]);
+}
+
+/*
+ * Store a (batch of) any-order large folio(s) in zswap. Each folio will be
+ * broken into sub-batches of SWAP_CRYPTO_SUB_BATCH_SIZE pages, the
+ * sub-batch will be compressed by IAA in parallel, and stored in zpool/xarray.
+ *
+ * This is the main procedure for batching of folios, and batching within
+ * large folios.
+ *
+ * This procedure should only be called if zswap supports batching of stores.
+ * Otherwise, the sequential implementation for storing folios as in the
+ * current zswap_store() should be used.
+ *
+ * The signature of this procedure is meant to allow the calling function,
+ * (for instance, swap_writepage()) to pass an array @folios
+ * (the "reclaim batch") of @nr_folios folios to be stored in zswap.
+ * All folios in the batch must have the same swap type and folio_nid @node_id
+ * (simplifying assumptions only to manage code complexity).
+ *
+ * @errors and @folios have @nr_folios number of entries, with one-one
+ * correspondence (@errors[i] represents the error status of @folios[i],
+ * for i in @nr_folios).
+ * The calling function (for instance, swap_writepage()) should initialize
+ * @errors[i] to a non-0 value.
+ * If zswap successfully stores @folios[i], it will set @errors[i] to 0.
+ * If there is an error in zswap, it will set @errors[i] to -EINVAL.
+ */
+static void __zswap_store_batch_core(
+	int node_id,
+	struct folio **folios,
+	int *errors,
+	unsigned int nr_folios)
+{
+	struct zswap_store_sub_batch_page sub_batch[SWAP_CRYPTO_SUB_BATCH_SIZE];
+	struct page *comp_pages[SWAP_CRYPTO_SUB_BATCH_SIZE];
+	u8 *comp_dsts[SWAP_CRYPTO_SUB_BATCH_SIZE] = { NULL };
+	unsigned int comp_dlens[SWAP_CRYPTO_SUB_BATCH_SIZE];
+	int comp_errors[SWAP_CRYPTO_SUB_BATCH_SIZE];
+	struct crypto_acomp_ctx *acomp_ctx;
+	struct zswap_pool *pool;
+	/*
+	 * For now, let's say a max of 256 large folios can be reclaimed
+	 * at a time, as a batch. If this exceeds 256, change this to u16.
+	 */
+	u8 batch_idx;
+
+	/* Initialize the compress batching pipeline state. */
+	struct zswap_store_pipeline_state zst = {
+		.errors = errors,
+		.pool = NULL,
+		.acomp_ctx = NULL,
+		.sub_batch = sub_batch,
+		.comp_pages = comp_pages,
+		.comp_dsts = comp_dsts,
+		.comp_dlens = comp_dlens,
+		.comp_errors = comp_errors,
+		.nr_comp_pages = 0,
+	};
+
+	pool = zswap_pool_current_get();
+	if (!pool) {
+		if (zswap_check_limits())
+			queue_work(shrink_wq, &zswap_shrink_work);
+		goto check_old;
+	}
+
+	acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
+	mutex_lock(&acomp_ctx->mutex);
+	zst.pool = pool;
+	zst.acomp_ctx = acomp_ctx;
+
+	/*
+	 * Iterate over the folios passed in. Construct sub-batches of up to
+	 * SWAP_CRYPTO_SUB_BATCH_SIZE pages, if necessary, by iterating through
+	 * multiple folios from the input "folios". Process each sub-batch
+	 * with IAA batch compression. Detect errors from batch compression
+	 * and set the impacted folio's error status (this happens in
+	 * zswap_store_propagate_errors()).
+	 */
+	for (batch_idx = 0; batch_idx < nr_folios; ++batch_idx) {
+		struct folio *folio = folios[batch_idx];
+		long folio_start_idx, nr_pages;
+		struct zswap_entry *entries[SWAP_CRYPTO_SUB_BATCH_SIZE];
+		struct obj_cgroup *objcg = NULL;
+		struct mem_cgroup *memcg = NULL;
+
+		BUG_ON(!folio);
+		nr_pages = folio_nr_pages(folio);
+
+		VM_WARN_ON_ONCE(!folio_test_locked(folio));
+		VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
+
+		/*
+		 * If zswap is disabled, we must invalidate the possibly stale entry
+		 * which was previously stored at this offset. Otherwise, writeback
+		 * could overwrite the new data in the swapfile.
+		 */
+		if (!zswap_enabled)
+			continue;
+
+		/* Check cgroup limits */
+		objcg = get_obj_cgroup_from_folio(folio);
+		if (objcg && !obj_cgroup_may_zswap(objcg)) {
+			memcg = get_mem_cgroup_from_objcg(objcg);
+			if (shrink_memcg(memcg)) {
+				mem_cgroup_put(memcg);
+				goto put_objcg;
+			}
+			mem_cgroup_put(memcg);
+		}
+
+		if (zswap_check_limits())
+			goto put_objcg;
+
+		if (objcg) {
+			memcg = get_mem_cgroup_from_objcg(objcg);
+			if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
+				mem_cgroup_put(memcg);
+				goto put_objcg;
+			}
+			mem_cgroup_put(memcg);
+		}
+
+		/*
+		 * By default, set zswap status to "success" for use in
+		 * swap_writepage() when this returns. In case of errors,
+		 * a negative error number will overwrite this when
+		 * zswap_store_propagate_errors() is called.
+		 */
+		errors[batch_idx] = 0;
+
+		folio_start_idx = 0;
+
+		while (nr_pages > 0) {
+			u8 add_nr_pages;
+
+			/*
+			 * If we have accumulated SWAP_CRYPTO_SUB_BATCH_SIZE
+			 * pages, process the sub-batch: it could contain pages
+			 * from multiple folios.
+			 */
+			if (zst.nr_comp_pages == SWAP_CRYPTO_SUB_BATCH_SIZE) {
+				zswap_store_sub_batch(&zst);
+				zswap_store_reset_sub_batch(&zst);
+				/*
+				 * Stop processing this folio if it had
+				 * compress errors.
+				 */
+				if (errors[batch_idx])
+					goto put_objcg;
+			}
+
+			add_nr_pages = min3((long)SWAP_CRYPTO_SUB_BATCH_SIZE -
+					    (long)zst.nr_comp_pages,
+					    nr_pages,
+					    (long)SWAP_CRYPTO_SUB_BATCH_SIZE);
+
+			/*
+			 * Allocate zswap_entries for this sub-batch. If we
+			 * get errors while doing so, we can flag an error
+			 * for the folio, call the shrinker and move on.
+			 */
+			if (zswap_alloc_entries(add_nr_pages,
+						entries, node_id)) {
+				zswap_store_reset_sub_batch(&zst);
+				errors[batch_idx] = -EINVAL;
+				goto put_objcg;
+			}
+
+			zswap_add_folio_pages_to_sb(
+				&zst,
+				folio,
+				batch_idx,
+				objcg,
+				entries,
+				folio_start_idx,
+				add_nr_pages);
+
+			nr_pages -= add_nr_pages;
+			folio_start_idx += add_nr_pages;
+		} /* this folio has pages to be compressed. */
+
+		obj_cgroup_put(objcg);
+		continue;
+
+put_objcg:
+		obj_cgroup_put(objcg);
+		if (zswap_pool_reached_full)
+			queue_work(shrink_wq, &zswap_shrink_work);
+	} /* for batch folios */
+
+	if (!zswap_enabled)
+		goto check_old;
+
+	/*
+	 * Process last sub-batch: it could contain pages from
+	 * multiple folios.
+	 */
+	if (zst.nr_comp_pages)
+		zswap_store_sub_batch(&zst);
+
+	mutex_unlock(&acomp_ctx->mutex);
+	zswap_pool_put(pool);
+check_old:
+	zswap_store_process_folio_errors(folios, errors, nr_folios);
+}
+
 bool zswap_load(struct folio *folio)
 {
 	swp_entry_t swp = folio->swap;
-- 
2.27.0
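
For reference, a minimal caller-side sketch of the errors[] convention
documented in the kerneldoc above (wbc and MAX_NR are illustrative; the
fallback to __swap_writepage() mirrors what
simc_write_in_memory_cache_complete() does in the next patch):

	int i, errors[MAX_NR];		/* MAX_NR: hypothetical batch limit */

	for (i = 0; i < nr_folios; i++)
		errors[i] = -1;		/* "not stored by zswap yet" */

	__zswap_store_batch_core(node_id, folios, errors, nr_folios);

	for (i = 0; i < nr_folios; i++) {
		if (!errors[i])
			folio_unlock(folios[i]);	  /* stored in zswap */
		else
			__swap_writepage(folios[i], wbc); /* fall back to swap */
	}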



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [RFC PATCH v1 13/13] mm: vmscan, swap, zswap: Compress batching of folios in shrink_folio_list().
  2024-10-18  6:40 [RFC PATCH v1 00/13] zswap IAA compress batching Kanchana P Sridhar
                   ` (11 preceding siblings ...)
  2024-10-18  6:41 ` [RFC PATCH v1 12/13] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios Kanchana P Sridhar
@ 2024-10-18  6:41 ` Kanchana P Sridhar
  2024-10-28 14:41   ` Joel Granados
  2024-10-23  0:56 ` [RFC PATCH v1 00/13] zswap IAA compress batching Yosry Ahmed
  13 siblings, 1 reply; 36+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18  6:41 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi, viro, brauner, jack, mcgrof,
	kees, joel.granados, bfoster, willy, linux-fsdevel
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch enables the use of Intel IAA hardware compression acceleration
to reclaim a batch of folios in shrink_folio_list(). This results in
reclaim throughput and workload/sys performance improvements.

The earlier patches on compress batching deployed multiple IAA compress
engines to compress up to SWAP_CRYPTO_SUB_BATCH_SIZE pages within a large
folio being stored in zswap_store(). This patch extends the efficiency
gains demonstrated with IAA "batching within folios" to vmscan "batching
of folios": the batch of folios is fed through the extensible
__zswap_store_batch_core() procedure added earlier, which accepts an array
of folios and also applies batching within each folio.

A plug mechanism is introduced in swap_writepage() to aggregate a batch of
up to vm.compress-batchsize ([1, 32]) folios before processing the plug.
The plug will be processed if any of the following is true:

 1) The plug has vm.compress-batchsize folios. If the system has Intel IAA,
    "sysctl vm.compress-batchsize" can be configured to be in [1, 32]. On
    systems without IAA, or if CONFIG_ZSWAP_STORE_BATCHING_ENABLED is not
    set, "sysctl vm.compress-batchsize" can only be 1.
 2) A folio whose swap type or folio_nid differs from those of the folios
    currently in the plug needs to be added to the plug.
 3) A pmd-mappable folio needs to be swapped out. In this case, the
    existing folios in the plug are processed, and the pmd-mappable folio
    is swapped out in a batch of its own (zswap_store() will batch-compress
    SWAP_CRYPTO_SUB_BATCH_SIZE pages of the pmd-mappable folio if the
    system has IAA). A condensed sketch of these conditions is shown below.
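
A condensed sketch of this flush-or-accumulate decision (using the names
introduced in this patch; simplified, return codes and error handling
omitted):

	/* swap_writepage_in_memory_cache(), in essence: */
	if (simc->nr_folios &&
	    (simc->type != type || simc->node_id != node_id ||
	     folio_test_pmd_mappable(folio) ||
	     simc->nr_folios == READ_ONCE(compress_batchsize))) {
		swap_write_in_memory_cache_unplug(simc, wbc); /* process plug */
		simc->next_batch_folio = folio;	/* folio starts the next batch */
	} else {
		simc->folios[simc->nr_folios] = folio;	/* accumulate */
		simc->errors[simc->nr_folios++] = -1;	/* zswap sets 0 on success */
	}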

From zswap's perspective, it now receives a hybrid batch of any-order
(non-pmd-mappable) folios when the plug is processed via
zswap_store_batch(), which calls __zswap_store_batch_core(). This keeps
the zswap compress batching pipeline fully occupied and maximizes reclaim
throughput.

The shrink_folio_list() interface with swap_writepage() is modified to
work with the plug mechanism. When shrink_folio_list() calls pageout(), it
needs to handle new return codes from pageout(), namely, PAGE_BATCHED and
PAGE_BATCH_SUCCESS:

PAGE_BATCHED:
  The folio is not yet swapped out, so we need to wait for the "imc_plug"
  batch to be processed before running the post-pageout processing in
  shrink_folio_list().

PAGE_BATCH_SUCCESS:
  When the "imc_plug" is processed in swap_writepage(), a newly added
  status "AOP_PAGE_BATCH_SUCCESS" is returned to pageout(), which in turn
  returns PAGE_BATCH_SUCCESS to shrink_folio_list().

Upon receiving PAGE_BATCH_SUCCESS from pageout(), shrink_folio_list() must
serialize and run the post-pageout processing for all the folios in
"imc_plug". To summarize: this patch introduces a plug in reclaim that
aggregates a batch of folios, parallelizes the zswap store of those folios
using IAA hardware acceleration, and then returns to the serialized flow
after the "batch pageout".
The patch attempts to do this with a minimal, necessary amount of code
duplication, plus one added iteration through the "imc_plug" folios in
shrink_folio_list(). I have validated this extensively and have not seen
any issues. I would appreciate suggestions to improve upon this approach.

This functionality is submitted as a single, distinct patch in the RFC
patch-series because all of its changes exist to support
shrink_folio_list() batching and would not make sense split across
separate patches. Besides the functionality itself, I would also
appreciate comments on whether the patch should be organized differently.

Thanks to Ying Huang for suggesting ideas on simplifying the vmscan
interface to the swap_writepage() plug mechanism.

Suggested-by: Ying Huang <ying.huang@intel.com>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 include/linux/fs.h        |   2 +
 include/linux/mm.h        |   8 ++
 include/linux/writeback.h |   5 ++
 include/linux/zswap.h     |  16 ++++
 kernel/sysctl.c           |   9 +++
 mm/page_io.c              | 152 ++++++++++++++++++++++++++++++++++++-
 mm/swap.c                 |  15 ++++
 mm/swap.h                 |  40 ++++++++++
 mm/vmscan.c               | 154 +++++++++++++++++++++++++++++++-------
 mm/zswap.c                |  20 +++++
 10 files changed, 394 insertions(+), 27 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3559446279c1..2868925568a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -303,6 +303,8 @@ struct iattr {
 enum positive_aop_returns {
 	AOP_WRITEPAGE_ACTIVATE	= 0x80000,
 	AOP_TRUNCATED_PAGE	= 0x80001,
+	AOP_PAGE_BATCHED	= 0x80002,
+	AOP_PAGE_BATCH_SUCCESS	= 0x80003,
 };
 
 /*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4c32003c8404..a8035e163793 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -80,6 +80,14 @@ extern void * high_memory;
 extern int page_cluster;
 extern const int page_cluster_max;
 
+/*
+ * Compress batching of any-order folios in the reclaim path with IAA.
+ * The number of folios to batch reclaim can be set through
+ * "sysctl vm.compress-batchsize" which can be a value in [1, 32].
+ */
+extern int compress_batchsize;
+extern const int compress_batchsize_max;
+
 #ifdef CONFIG_SYSCTL
 extern int sysctl_legacy_va_layout;
 #else
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index d6db822e4bb3..41629ea5699d 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -82,6 +82,11 @@ struct writeback_control {
 	/* Target list for splitting a large folio */
 	struct list_head *list;
 
+	/*
+	 * Plug for storing reclaim folios for compress batching.
+	 */
+	struct swap_in_memory_cache_cb *swap_in_memory_cache_plug;
+
 	/* internal fields used by the ->writepages implementation: */
 	struct folio_batch fbatch;
 	pgoff_t index;
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 9bbe330686f6..328a1e09d502 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -11,6 +11,8 @@ extern atomic_long_t zswap_stored_pages;
 
 #ifdef CONFIG_ZSWAP
 
+struct swap_in_memory_cache_cb;
+
 struct zswap_lruvec_state {
 	/*
 	 * Number of swapped in pages from disk, i.e not found in the zswap pool.
@@ -107,6 +109,15 @@ struct zswap_store_pipeline_state {
 };
 
 bool zswap_store_batching_enabled(void);
+void __zswap_store_batch(struct swap_in_memory_cache_cb *simc);
+void __zswap_store_batch_single(struct swap_in_memory_cache_cb *simc);
+static inline void zswap_store_batch(struct swap_in_memory_cache_cb *simc)
+{
+	if (zswap_store_batching_enabled())
+		__zswap_store_batch(simc);
+	else
+		__zswap_store_batch_single(simc);
+}
 unsigned long zswap_total_pages(void);
 bool zswap_store(struct folio *folio);
 bool zswap_load(struct folio *folio);
@@ -123,12 +134,17 @@ bool zswap_never_enabled(void);
 struct zswap_lruvec_state {};
 struct zswap_store_sub_batch_page {};
 struct zswap_store_pipeline_state {};
+struct swap_in_memory_cache_cb;
 
 static inline bool zswap_store_batching_enabled(void)
 {
 	return false;
 }
 
+static inline void zswap_store_batch(struct swap_in_memory_cache_cb *simc)
+{
+}
+
 static inline bool zswap_store(struct folio *folio)
 {
 	return false;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 79e6cb1d5c48..b8d6b599e9ae 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2064,6 +2064,15 @@ static struct ctl_table vm_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= (void *)&page_cluster_max,
 	},
+	{
+		.procname	= "compress-batchsize",
+		.data		= &compress_batchsize,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ONE,
+		.extra2		= (void *)&compress_batchsize_max,
+	},
 	{
 		.procname	= "dirtytime_expire_seconds",
 		.data		= &dirtytime_expire_interval,
diff --git a/mm/page_io.c b/mm/page_io.c
index a28d28b6b3ce..065db25309b8 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -226,6 +226,131 @@ static void swap_zeromap_folio_clear(struct folio *folio)
 	}
 }
 
+/*
+ * For batching of folios in the reclaim path, for zswap batch compression
+ * with Intel IAA.
+ */
+static void simc_write_in_memory_cache_complete(
+	struct swap_in_memory_cache_cb *simc,
+	struct writeback_control *wbc)
+{
+	int i;
+
+	/* All elements of a plug write batch have the same swap type. */
+	struct swap_info_struct *sis = swp_swap_info(simc->folios[0]->swap);
+
+	VM_BUG_ON(!sis);
+
+	for (i = 0; i < simc->nr_folios; ++i) {
+		struct folio *folio = simc->folios[i];
+
+		if (!simc->errors[i]) {
+			count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
+			folio_unlock(folio);
+		} else {
+			__swap_writepage(simc->folios[i], wbc);
+		}
+	}
+}
+
+void swap_write_in_memory_cache_unplug(struct swap_in_memory_cache_cb *simc,
+				       struct writeback_control *wbc)
+{
+	unsigned long pflags;
+
+	psi_memstall_enter(&pflags);
+
+	zswap_store_batch(simc);
+
+	simc_write_in_memory_cache_complete(simc, wbc);
+
+	psi_memstall_leave(&pflags);
+
+	simc->processed = true;
+}
+
+/*
+ * Only called by swap_writepage() if (wbc && wbc->swap_in_memory_cache_plug)
+ * is true, i.e., from the shrink_folio_list()->pageout() path.
+ */
+static bool swap_writepage_in_memory_cache(struct folio *folio,
+					   struct writeback_control *wbc)
+{
+	struct swap_in_memory_cache_cb *simc;
+	unsigned type = swp_type(folio->swap);
+	int node_id = folio_nid(folio);
+	int comp_batch_size = READ_ONCE(compress_batchsize);
+	bool ret = false;
+
+	simc = wbc->swap_in_memory_cache_plug;
+
+	if ((simc->nr_folios > 0) &&
+			((simc->type != type) || (simc->node_id != node_id) ||
+			folio_test_pmd_mappable(folio) ||
+			(simc->nr_folios == comp_batch_size))) {
+		swap_write_in_memory_cache_unplug(simc, wbc);
+		ret = true;
+		simc->next_batch_folio = folio;
+	} else {
+		simc->type = type;
+		simc->node_id = node_id;
+		simc->folios[simc->nr_folios] = folio;
+
+		/*
+		 * If zswap successfully stores a page, it should set
+		 * simc->errors[] to 0.
+		 */
+		simc->errors[simc->nr_folios] = -1;
+		simc->nr_folios++;
+	}
+
+	return ret;
+}
+
+void swap_writepage_in_memory_cache_transition(void *arg)
+{
+	struct swap_in_memory_cache_cb *simc =
+		(struct swap_in_memory_cache_cb *) arg;
+	simc->nr_folios = 0;
+	simc->processed = false;
+
+	if (simc->next_batch_folio) {
+		struct folio *folio = simc->next_batch_folio;
+		simc->folios[simc->nr_folios] = folio;
+		simc->type = swp_type(folio->swap);
+		simc->node_id = folio_nid(folio);
+		simc->next_batch_folio = NULL;
+
+		/*
+		 * If zswap successfully stores a page, it should set
+		 * simc->errors[] to 0.
+		 */
+		simc->errors[simc->nr_folios] = -1;
+		simc->nr_folios++;
+	}
+}
+
+void swap_writepage_in_memory_cache_init(void *arg)
+{
+	struct swap_in_memory_cache_cb *simc =
+		(struct swap_in_memory_cache_cb *) arg;
+
+	simc->nr_folios = 0;
+	simc->processed = false;
+	simc->next_batch_folio = NULL;
+	simc->transition = &swap_writepage_in_memory_cache_transition;
+}
+
+/*
+ * zswap batching of folios with IAA:
+ *
+ * Reclaim batching note for pmd-mappable folios:
+ * Any pmd-mappable folio in the reclaim path will be processed in a batch
+ * comprising only that folio. There will be no mixed batches containing
+ * pmd-mappable folios for batch compression with IAA.
+ * There are no restrictions with other large folios: a reclaim batch
+ * can comprise an any-order mix of non-pmd-mappable folios.
+ */
 /*
  * We may have stale swap cache pages in memory: notice
  * them here and get rid of the unnecessary final write.
@@ -268,7 +393,32 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 		 */
 		swap_zeromap_folio_clear(folio);
 	}
-	if (zswap_store(folio)) {
+
+	/*
+	 * Batching of compressions with IAA: if the reclaim-path pageout()
+	 * has invoked swap_writepage() with a wbc->swap_in_memory_cache_plug,
+	 * add the folio to the plug, or invoke zswap_store_batch() once
+	 * "vm.compress-batchsize" folios have accumulated in the plug.
+	 *
+	 * If swap_writepage has been called from other kernel code without
+	 * a wbc->swap_in_memory_cache_plug, call zswap_store() with the folio
+	 * (i.e. without adding the folio to a plug for batch processing).
+	 */
+	if (wbc && wbc->swap_in_memory_cache_plug) {
+		if (!mem_cgroup_zswap_writeback_enabled(folio_memcg(folio)) &&
+			!zswap_is_enabled() &&
+			folio_memcg(folio) &&
+			!READ_ONCE(folio_memcg(folio)->zswap_writeback)) {
+			folio_mark_dirty(folio);
+			return AOP_WRITEPAGE_ACTIVATE;
+		}
+
+		if (swap_writepage_in_memory_cache(folio, wbc))
+			return AOP_PAGE_BATCH_SUCCESS;
+		else
+			return AOP_PAGE_BATCHED;
+	} else if (zswap_store(folio)) {
 		count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
 		folio_unlock(folio);
 		return 0;
diff --git a/mm/swap.c b/mm/swap.c
index 835bdf324b76..095630d6c35e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -38,6 +38,7 @@
 #include <linux/local_lock.h>
 #include <linux/buffer_head.h>
 
+#include "swap.h"
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -47,6 +48,14 @@
 int page_cluster;
 const int page_cluster_max = 31;
 
+/*
+ * Number of folios in a reclaim batch for pageout.
+ * If zswap is enabled, this is the batch-size for zswap
+ * compress batching of multiple any-order folios.
+ */
+int compress_batchsize;
+const int compress_batchsize_max = SWAP_CRYPTO_MAX_COMP_BATCH_SIZE;
+
 struct cpu_fbatches {
 	/*
 	 * The following folio batches are grouped together because they are protected
@@ -1105,4 +1114,10 @@ void __init swap_setup(void)
 	 * Right now other parts of the system means that we
 	 * _really_ don't want to cluster much more
 	 */
+
+	/*
+	 * Initialize the number of folios in a reclaim batch
+	 * for pageout.
+	 */
+	compress_batchsize = 1;
 }
diff --git a/mm/swap.h b/mm/swap.h
index 4dcb67e2cc33..08c04954304f 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -20,6 +20,13 @@ struct mempolicy;
 #define SWAP_CRYPTO_SUB_BATCH_SIZE 1UL
 #endif
 
+/* Set the vm.compress-batchsize limits. */
+#if defined(CONFIG_ZSWAP_STORE_BATCHING_ENABLED)
+#define SWAP_CRYPTO_MAX_COMP_BATCH_SIZE SWAP_CLUSTER_MAX
+#else
+#define SWAP_CRYPTO_MAX_COMP_BATCH_SIZE 1UL
+#endif
+
 /* linux/mm/swap_state.c, zswap.c */
 struct crypto_acomp_ctx {
 	struct crypto_acomp *acomp;
@@ -53,6 +60,20 @@ void swap_crypto_acomp_compress_batch(
 	int nr_pages,
 	struct crypto_acomp_ctx *acomp_ctx);
 
+/* linux/mm/vmscan.c, linux/mm/page_io.c, linux/mm/zswap.c */
+/* For batching of compressions in reclaim path. */
+struct swap_in_memory_cache_cb {
+	unsigned int type;
+	int node_id;
+	struct folio *folios[SWAP_CLUSTER_MAX];
+	int errors[SWAP_CLUSTER_MAX];
+	unsigned int nr_folios;
+	bool processed;
+	struct folio *next_batch_folio;
+	void (*transition)(void *);
+	void (*init)(void *);
+};
+
 /* linux/mm/page_io.c */
 int sio_pool_init(void);
 struct swap_iocb;
@@ -63,6 +84,10 @@ static inline void swap_read_unplug(struct swap_iocb *plug)
 	if (unlikely(plug))
 		__swap_read_unplug(plug);
 }
+void swap_writepage_in_memory_cache_init(void *arg);
+void swap_writepage_in_memory_cache_transition(void *arg);
+void swap_write_in_memory_cache_unplug(struct swap_in_memory_cache_cb *simc,
+				       struct writeback_control *wbc);
 void swap_write_unplug(struct swap_iocb *sio);
 int swap_writepage(struct page *page, struct writeback_control *wbc);
 void __swap_writepage(struct folio *folio, struct writeback_control *wbc);
@@ -164,6 +189,21 @@ static inline void swap_crypto_acomp_compress_batch(
 {
 }
 
+struct swap_in_memory_cache_cb {};
+static inline void swap_writepage_in_memory_cache_init(void *arg)
+{
+}
+
+static inline void swap_writepage_in_memory_cache_transition(void *arg)
+{
+}
+
+static inline void swap_write_in_memory_cache_unplug(
+	struct swap_in_memory_cache_cb *simc,
+	struct writeback_control *wbc)
+{
+}
+
 static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 {
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fd3908d43b07..145e6cde78cd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -619,6 +619,13 @@ typedef enum {
 	PAGE_ACTIVATE,
 	/* folio has been sent to the disk successfully, folio is unlocked */
 	PAGE_SUCCESS,
+	/*
+	 * reclaim folio batch has been sent to swap successfully,
+	 * folios are unlocked
+	 */
+	PAGE_BATCH_SUCCESS,
+	/* folio has been added to the reclaim batch. */
+	PAGE_BATCHED,
 	/* folio is clean and locked */
 	PAGE_CLEAN,
 } pageout_t;
@@ -628,7 +635,8 @@ typedef enum {
  * Calls ->writepage().
  */
 static pageout_t pageout(struct folio *folio, struct address_space *mapping,
-			 struct swap_iocb **plug, struct list_head *folio_list)
+			 struct swap_iocb **plug, struct list_head *folio_list,
+			 struct swap_in_memory_cache_cb *imc_plug)
 {
 	/*
 	 * If the folio is dirty, only perform writeback if that write
@@ -674,6 +682,7 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping,
 			.range_end = LLONG_MAX,
 			.for_reclaim = 1,
 			.swap_plug = plug,
+			.swap_in_memory_cache_plug = imc_plug,
 		};
 
 		/*
@@ -693,6 +702,23 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping,
 			return PAGE_ACTIVATE;
 		}
 
+		if (res == AOP_PAGE_BATCHED)
+			return PAGE_BATCHED;
+
+		if (res == AOP_PAGE_BATCH_SUCCESS) {
+			int r;
+			for (r = 0; r < imc_plug->nr_folios; ++r) {
+				struct folio *rfolio = imc_plug->folios[r];
+				if (!folio_test_writeback(rfolio)) {
+					/* synchronous write or broken a_ops? */
+					folio_clear_reclaim(rfolio);
+				}
+				trace_mm_vmscan_write_folio(rfolio);
+				node_stat_add_folio(rfolio, NR_VMSCAN_WRITE);
+			}
+			return PAGE_BATCH_SUCCESS;
+		}
+
 		if (!folio_test_writeback(folio)) {
 			/* synchronous write or broken a_ops? */
 			folio_clear_reclaim(folio);
@@ -1035,6 +1061,12 @@ static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask)
 	return !data_race(folio_swap_flags(folio) & SWP_FS_OPS);
 }
 
+static __always_inline bool reclaim_batch_being_processed(
+	struct swap_in_memory_cache_cb *imc_plug)
+{
+	return imc_plug->nr_folios && imc_plug->processed;
+}
+
 /*
  * shrink_folio_list() returns the number of reclaimed pages
  */
@@ -1049,22 +1081,54 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	unsigned int pgactivate = 0;
 	bool do_demote_pass;
 	struct swap_iocb *plug = NULL;
+	struct swap_in_memory_cache_cb imc_plug;
+	bool imc_plug_path = false;
+	struct folio *folio;
+	int r;
 
+	imc_plug.init = &swap_writepage_in_memory_cache_init;
+	imc_plug.init(&imc_plug);
 	folio_batch_init(&free_folios);
 	memset(stat, 0, sizeof(*stat));
 	cond_resched();
 	do_demote_pass = can_demote(pgdat->node_id, sc);
 
 retry:
-	while (!list_empty(folio_list)) {
+	while (!list_empty(folio_list) || (imc_plug.nr_folios && !imc_plug.processed)) {
 		struct address_space *mapping;
-		struct folio *folio;
 		enum folio_references references = FOLIOREF_RECLAIM;
 		bool dirty, writeback;
 		unsigned int nr_pages;
+		imc_plug_path = false;
 
 		cond_resched();
 
+		/* Reclaim path zswap/zram batching using IAA. */
+		if (list_empty(folio_list)) {
+			struct writeback_control wbc = {
+				.sync_mode = WB_SYNC_NONE,
+				.nr_to_write = SWAP_CLUSTER_MAX,
+				.range_start = 0,
+				.range_end = LLONG_MAX,
+				.for_reclaim = 1,
+				.swap_plug = &plug,
+				.swap_in_memory_cache_plug = &imc_plug,
+			};
+
+			swap_write_in_memory_cache_unplug(&imc_plug, &wbc);
+
+			for (r = 0; r < imc_plug.nr_folios; ++r) {
+				struct folio *rfolio = imc_plug.folios[r];
+				if (!folio_test_writeback(rfolio)) {
+					/* synchronous write or broken a_ops? */
+					folio_clear_reclaim(rfolio);
+				}
+				trace_mm_vmscan_write_folio(rfolio);
+				node_stat_add_folio(rfolio, NR_VMSCAN_WRITE);
+			}
+			goto serialize_post_batch_pageout;
+		}
+
 		folio = lru_to_folio(folio_list);
 		list_del(&folio->lru);
 
@@ -1363,7 +1427,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			 * starts and then write it out here.
 			 */
 			try_to_unmap_flush_dirty();
-			switch (pageout(folio, mapping, &plug, folio_list)) {
+			switch (pageout(folio, mapping, &plug, folio_list, &imc_plug)) {
 			case PAGE_KEEP:
 				goto keep_locked;
 			case PAGE_ACTIVATE:
@@ -1377,34 +1441,66 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 					nr_pages = 1;
 				}
 				goto activate_locked;
+			case PAGE_BATCHED:
+				continue;
 			case PAGE_SUCCESS:
-				if (nr_pages > 1 && !folio_test_large(folio)) {
-					sc->nr_scanned -= (nr_pages - 1);
-					nr_pages = 1;
-				}
-				stat->nr_pageout += nr_pages;
-
-				if (folio_test_writeback(folio))
-					goto keep;
-				if (folio_test_dirty(folio))
-					goto keep;
-
-				/*
-				 * A synchronous write - probably a ramdisk.  Go
-				 * ahead and try to reclaim the folio.
-				 */
-				if (!folio_trylock(folio))
-					goto keep;
-				if (folio_test_dirty(folio) ||
-				    folio_test_writeback(folio))
-					goto keep_locked;
-				mapping = folio_mapping(folio);
-				fallthrough;
+				goto post_single_pageout;
+			case PAGE_BATCH_SUCCESS:
+				goto serialize_post_batch_pageout;
 			case PAGE_CLEAN:
+				goto folio_is_clean;
 				; /* try to free the folio below */
 			}
+		} else {
+			goto folio_is_clean;
+		}
+
+serialize_post_batch_pageout:
+		imc_plug_path = reclaim_batch_being_processed(&imc_plug);
+		if (!imc_plug_path) {
+			pr_err("imc_plug: type %u node_id %d \
+				nr_folios %u processed %d next_batch_folio %px",
+				imc_plug.type, imc_plug.node_id,
+				imc_plug.nr_folios, imc_plug.processed,
+				imc_plug.next_batch_folio);
+		}
+		BUG_ON(!imc_plug_path);
+		r = -1;
+
+next_folio_in_batch:
+		while (++r < imc_plug.nr_folios) {
+			folio = imc_plug.folios[r];
+			goto post_single_pageout;
+		} /* while imc_plug folios. */
+
+		imc_plug.transition(&imc_plug);
+		continue;
+
+post_single_pageout:
+		mapping = folio_mapping(folio);
+		nr_pages = folio_nr_pages(folio);
+		if (nr_pages > 1 && !folio_test_large(folio)) {
+			sc->nr_scanned -= (nr_pages - 1);
+			nr_pages = 1;
 		}
+		stat->nr_pageout += nr_pages;
+
+		if (folio_test_writeback(folio))
+			goto keep;
+		if (folio_test_dirty(folio))
+			goto keep;
+
+		/*
+		 * A synchronous write - probably a ramdisk.  Go
+		 * ahead and try to reclaim the folio.
+		 */
+		if (!folio_trylock(folio))
+			goto keep;
+		if (folio_test_dirty(folio) ||
+		    folio_test_writeback(folio))
+			goto keep_locked;
 
+folio_is_clean:
 		/*
 		 * If the folio has buffers, try to free the buffer
 		 * mappings associated with this folio. If we succeed
@@ -1444,6 +1540,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 					 * leave it off the LRU).
 					 */
 					nr_reclaimed += nr_pages;
+					if (imc_plug_path)
+						goto next_folio_in_batch;
 					continue;
 				}
 			}
@@ -1481,6 +1579,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			try_to_unmap_flush();
 			free_unref_folios(&free_folios);
 		}
+		if (imc_plug_path)
+			goto next_folio_in_batch;
 		continue;
 
 activate_locked_split:
@@ -1510,6 +1610,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		list_add(&folio->lru, &ret_folios);
 		VM_BUG_ON_FOLIO(folio_test_lru(folio) ||
 				folio_test_unevictable(folio), folio);
+		if (imc_plug_path)
+			goto next_folio_in_batch;
 	}
 	/* 'folio_list' is always empty here */
 
diff --git a/mm/zswap.c b/mm/zswap.c
index 1c12a7b9f4ff..68ce498ad000 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1666,6 +1666,26 @@ bool zswap_store(struct folio *folio)
 	return ret;
 }
 
+/*
+ * The batch contains at most vm.compress-batchsize folios.
+ * All folios in the batch have the same swap type and folio_nid.
+ */
+void __zswap_store_batch(struct swap_in_memory_cache_cb *simc)
+{
+	__zswap_store_batch_core(simc->node_id, simc->folios,
+				 simc->errors, simc->nr_folios);
+}
+
+void __zswap_store_batch_single(struct swap_in_memory_cache_cb *simc)
+{
+	u8 i;
+
+	for (i = 0; i < simc->nr_folios; ++i) {
+		if (zswap_store(simc->folios[i]))
+			simc->errors[i] = 0;
+	}
+}
+
 /*
  * Note: If SWAP_CRYPTO_SUB_BATCH_SIZE exceeds 256, change the
  * u8 stack variables in the next several functions to u16.
-- 
2.27.0



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v1 01/13] crypto: acomp - Add a poll() operation to acomp_alg and acomp_req
  2024-10-18  6:40 ` [RFC PATCH v1 01/13] crypto: acomp - Add a poll() operation to acomp_alg and acomp_req Kanchana P Sridhar
@ 2024-10-18  7:55   ` Herbert Xu
  2024-10-18 23:01     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 36+ messages in thread
From: Herbert Xu @ 2024-10-18  7:55 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi, zanussi, viro, brauner, jack, mcgrof, kees,
	joel.granados, bfoster, willy, linux-fsdevel, wajdi.k.feghali,
	vinodh.gopal

On Thu, Oct 17, 2024 at 11:40:49PM -0700, Kanchana P Sridhar wrote:
> For async compress/decompress, provide a way for the caller to poll
> for compress/decompress completion, rather than wait for an interrupt
> to signal completion.
> 
> Callers can submit a compress/decompress using crypto_acomp_compress
> and decompress and rather than wait on a completion, call
> crypto_acomp_poll() to check for completion.
> 
> This is useful for hardware accelerators where the overhead of
> interrupts and waiting for completions is too expensive.  Typically
> the compress/decompress hw operations complete very quickly and in the
> vast majority of cases, adding the overhead of interrupt handling and
> waiting for completions simply adds unnecessary delays and cancels the
> gains of using the hw acceleration.
> 
> Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  crypto/acompress.c                  |  1 +
>  include/crypto/acompress.h          | 18 ++++++++++++++++++
>  include/crypto/internal/acompress.h |  1 +
>  3 files changed, 20 insertions(+)

How about just adding a request flag that tells the driver to
make the request synchronous if possible?

Something like

#define CRYPTO_ACOMP_REQ_POLL	0x00000001

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [RFC PATCH v1 01/13] crypto: acomp - Add a poll() operation to acomp_alg and acomp_req
  2024-10-18  7:55   ` Herbert Xu
@ 2024-10-18 23:01     ` Sridhar, Kanchana P
  2024-10-19  0:19       ` Herbert Xu
  0 siblings, 1 reply; 36+ messages in thread
From: Sridhar, Kanchana P @ 2024-10-18 23:01 UTC (permalink / raw)
  To: Herbert Xu
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, linux-crypto, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, viro, brauner, jack, mcgrof, kees,
	joel.granados, bfoster, willy, linux-fsdevel, Feghali, Wajdi K,
	Gopal, Vinodh, Sridhar, Kanchana P

Hi Herbert,

> On Thu, Oct 17, 2024 at 11:40:49PM -0700, Kanchana P Sridhar wrote:
> > For async compress/decompress, provide a way for the caller to poll
> > for compress/decompress completion, rather than wait for an interrupt
> > to signal completion.
> >
> > Callers can submit a compress/decompress using crypto_acomp_compress
> > and decompress and rather than wait on a completion, call
> > crypto_acomp_poll() to check for completion.
> >
> > This is useful for hardware accelerators where the overhead of
> > interrupts and waiting for completions is too expensive.  Typically
> > the compress/decompress hw operations complete very quickly and in the
> > vast majority of cases, adding the overhead of interrupt handling and
> > waiting for completions simply adds unnecessary delays and cancels the
> > gains of using the hw acceleration.
> >
> > Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  crypto/acompress.c                  |  1 +
> >  include/crypto/acompress.h          | 18 ++++++++++++++++++
> >  include/crypto/internal/acompress.h |  1 +
> >  3 files changed, 20 insertions(+)
> 
> How about just adding a request flag that tells the driver to
> make the request synchronous if possible?
> 
> Something like
> 
> #define CRYPTO_ACOMP_REQ_POLL	0x00000001

Thanks for your code review comments. Are you referring to how the
async/poll interface is enabled at the level of, say, zswap (by setting a
flag in the acomp_req), followed by the iaa_crypto driver testing for
the flag, submitting the request, and returning -EINPROGRESS?
Wouldn't we still need a separate API to do the polling?

I am not the expert on this, and would like to request Kristen's inputs
on whether this is feasible.

Thanks,
Kanchana


> 
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v1 01/13] crypto: acomp - Add a poll() operation to acomp_alg and acomp_req
  2024-10-18 23:01     ` Sridhar, Kanchana P
@ 2024-10-19  0:19       ` Herbert Xu
  2024-10-19 19:10         ` Sridhar, Kanchana P
  0 siblings, 1 reply; 36+ messages in thread
From: Herbert Xu @ 2024-10-19  0:19 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, linux-crypto, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, viro, brauner, jack, mcgrof, kees,
	joel.granados, bfoster, willy, linux-fsdevel, Feghali, Wajdi K,
	Gopal, Vinodh

On Fri, Oct 18, 2024 at 11:01:10PM +0000, Sridhar, Kanchana P wrote:
>
> Thanks for your code review comments. Are you referring to how the
> async/poll interface is enabled at the level of, say, zswap (by setting a
> flag in the acomp_req), followed by the iaa_crypto driver testing for
> the flag, submitting the request, and returning -EINPROGRESS?
> Wouldn't we still need a separate API to do the polling?

Correct me if I'm wrong, but I think what you want to do is this:

	crypto_acomp_compress(req)
	crypto_acomp_poll(req)

So instead of adding this interface, where the poll essentially
turns the request synchronous, just move this logic into the driver,
based on a flag bit in req.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
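
For concreteness, the flag-based approach being suggested could look
roughly like this (CRYPTO_ACOMP_REQ_POLL and the iaa_* helpers below are
hypothetical, purely for illustration):

	/* zswap side: ask the driver to make the request synchronous. */
	acomp_request_set_callback(req, CRYPTO_ACOMP_REQ_POLL, NULL, NULL);
	err = crypto_acomp_compress(req);	/* returns the final status */

	/* iaa_crypto side: honor the flag by polling to completion. */
	if (req->base.flags & CRYPTO_ACOMP_REQ_POLL) {
		iaa_submit_desc(desc);			/* hypothetical */
		while (!iaa_desc_complete(desc))	/* hypothetical */
			cpu_relax();
		return iaa_desc_status(desc);		/* hypothetical */
	}
	return -EINPROGRESS;			/* normal async path */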


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [RFC PATCH v1 01/13] crypto: acomp - Add a poll() operation to acomp_alg and acomp_req
  2024-10-19  0:19       ` Herbert Xu
@ 2024-10-19 19:10         ` Sridhar, Kanchana P
  0 siblings, 0 replies; 36+ messages in thread
From: Sridhar, Kanchana P @ 2024-10-19 19:10 UTC (permalink / raw)
  To: Herbert Xu
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, linux-crypto, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, viro, brauner, jack, mcgrof, kees,
	joel.granados, bfoster, willy, linux-fsdevel, Feghali, Wajdi K,
	Gopal, Vinodh, Sridhar, Kanchana P


> On Fri, Oct 18, 2024 at 11:01:10PM +0000, Sridhar, Kanchana P wrote:
> >
> > Thanks for your code review comments. Are you referring to how the
> > async/poll interface is enabled at the level of say zswap (by setting a
> > flag in the acomp_req), followed by the iaa_crypto driver testing for
> > the flag and submitting the request and returning -EINPROGRESS.
> > Wouldn't we still need a separate API to do the polling?
> 
> Correct me if I'm wrong, but I think what you want to do is this:
> 
> 	crypto_acomp_compress(req)
> 	crypto_acomp_poll(req)
> 
> So instead of adding this interface, where the poll essentially
> turns the request synchronous, just move this logic into the driver,
> based on a flag bit in req.

Thanks Herbert, for this suggestion. I understand this better now,
and will work with Kristen for addressing this in v2.

Thanks,
Kanchana

> 
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v1 04/13] mm: zswap: zswap_compress()/decompress() can submit, then poll an acomp_req.
  2024-10-18  6:40 ` [RFC PATCH v1 04/13] mm: zswap: zswap_compress()/decompress() can submit, then poll an acomp_req Kanchana P Sridhar
@ 2024-10-23  0:48   ` Yosry Ahmed
  2024-10-23  2:01     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 36+ messages in thread
From: Yosry Ahmed @ 2024-10-23  0:48 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, ying.huang, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi, zanussi, viro, brauner, jack, mcgrof, kees,
	joel.granados, bfoster, willy, linux-fsdevel, wajdi.k.feghali,
	vinodh.gopal

On Thu, Oct 17, 2024 at 11:41 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> If the crypto_acomp has a poll interface registered, zswap_compress()
> and zswap_decompress() will submit the acomp_req, and then poll() for a
> successful completion/error status in a busy-wait loop. This allows an
> asynchronous way to manage (potentially multiple) acomp_reqs without
> the use of interrupts, which is supported in the iaa_crypto driver.
>
> This enables us to implement batch submission of multiple
> compression/decompression jobs to the Intel IAA hardware accelerator,
> which will process them in parallel; followed by polling the batch's
> acomp_reqs for completion status.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/zswap.c | 51 +++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 39 insertions(+), 12 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index f6316b66fb23..948c9745ee57 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -910,18 +910,34 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>         acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
>
>         /*
> -        * it maybe looks a little bit silly that we send an asynchronous request,
> -        * then wait for its completion synchronously. This makes the process look
> -        * synchronous in fact.
> -        * Theoretically, acomp supports users send multiple acomp requests in one
> -        * acomp instance, then get those requests done simultaneously. but in this
> -        * case, zswap actually does store and load page by page, there is no
> -        * existing method to send the second page before the first page is done
> -        * in one thread doing zwap.
> -        * but in different threads running on different cpu, we have different
> -        * acomp instance, so multiple threads can do (de)compression in parallel.
> +        * If the crypto_acomp provides an asynchronous poll() interface,
> +        * submit the descriptor and poll for a completion status.
> +        *
> +        * It maybe looks a little bit silly that we send an asynchronous
> +        * request, then wait for its completion in a busy-wait poll loop, or,
> +        * synchronously. This makes the process look synchronous in fact.
> +        * Theoretically, acomp supports users send multiple acomp requests in
> +        * one acomp instance, then get those requests done simultaneously.
> +        * But in this case, zswap actually does store and load page by page,
> +        * there is no existing method to send the second page before the
> +        * first page is done in one thread doing zswap.
> +        * But in different threads running on different cpu, we have different
> +        * acomp instance, so multiple threads can do (de)compression in
> +        * parallel.
>          */
> -       comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> +       if (acomp_ctx->acomp->poll) {
> +               comp_ret = crypto_acomp_compress(acomp_ctx->req);
> +               if (comp_ret == -EINPROGRESS) {
> +                       do {
> +                               comp_ret = crypto_acomp_poll(acomp_ctx->req);
> +                               if (comp_ret && comp_ret != -EAGAIN)
> +                                       break;
> +                       } while (comp_ret);
> +               }
> +       } else {
> +               comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> +       }
> +

Is Herbert suggesting that crypto_wait_req(crypto_acomp_compress(..))
essentially do the poll internally for IAA, and hence this change can
be dropped?

>         dlen = acomp_ctx->req->dlen;
>         if (comp_ret)
>                 goto unlock;
> @@ -959,6 +975,7 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>         struct scatterlist input, output;
>         struct crypto_acomp_ctx *acomp_ctx;
>         u8 *src;
> +       int ret;
>
>         acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
>         mutex_lock(&acomp_ctx->mutex);
> @@ -984,7 +1001,17 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>         sg_init_table(&output, 1);
>         sg_set_folio(&output, folio, PAGE_SIZE, 0);
>         acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> -       BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> +       if (acomp_ctx->acomp->poll) {
> +               ret = crypto_acomp_decompress(acomp_ctx->req);
> +               if (ret == -EINPROGRESS) {
> +                       do {
> +                               ret = crypto_acomp_poll(acomp_ctx->req);
> +                               BUG_ON(ret && ret != -EAGAIN);
> +                       } while (ret);
> +               }
> +       } else {
> +               BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> +       }
>         BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
>         mutex_unlock(&acomp_ctx->mutex);
>
> --
> 2.27.0
>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v1 09/13] mm: zswap: Config variable to enable compress batching in zswap_store().
  2024-10-18  6:40 ` [RFC PATCH v1 09/13] mm: zswap: Config variable to enable compress batching in zswap_store() Kanchana P Sridhar
@ 2024-10-23  0:49   ` Yosry Ahmed
  2024-10-23  2:17     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 36+ messages in thread
From: Yosry Ahmed @ 2024-10-23  0:49 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, ying.huang, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi, zanussi, viro, brauner, jack, mcgrof, kees,
	joel.granados, bfoster, willy, linux-fsdevel, wajdi.k.feghali,
	vinodh.gopal

On Thu, Oct 17, 2024 at 11:41 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Add a new zswap config variable that controls whether zswap_store() will
> compress a batch of pages, for instance, the pages in a large folio:
>
>   CONFIG_ZSWAP_STORE_BATCHING_ENABLED
>
> The existing CONFIG_CRYPTO_DEV_IAA_CRYPTO variable added in commit
> ea7a5cbb4369 ("crypto: iaa - Add Intel IAA Compression Accelerator crypto
> driver core") is used to detect if the system has the Intel Analytics
> Accelerator (IAA), and the iaa_crypto module is available. If so, the
> kernel build will prompt for CONFIG_ZSWAP_STORE_BATCHING_ENABLED. Hence,
> users have the ability to set CONFIG_ZSWAP_STORE_BATCHING_ENABLED="y" only
> on systems that have Intel IAA.
>
> If CONFIG_ZSWAP_STORE_BATCHING_ENABLED is enabled, and IAA is configured
> as the zswap compressor, zswap_store() will process the pages in a large
> folio in batches, i.e., multiple pages at a time. Pages in a batch will be
> compressed in parallel in hardware, then stored. On systems without Intel
> IAA and/or if zswap uses software compressors, pages in the batch will be
> compressed sequentially and stored.
>
> The patch also implements a zswap API that returns the status of this
> config variable.

If we are compressing a large folio and batching is an option, is there
ever a case where not batching is the correct thing to do? Why is the
config option needed?

>
> Suggested-by: Ying Huang <ying.huang@intel.com>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  include/linux/zswap.h |  6 ++++++
>  mm/Kconfig            | 12 ++++++++++++
>  mm/zswap.c            | 14 ++++++++++++++
>  3 files changed, 32 insertions(+)
>
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index d961ead91bf1..74ad2a24b309 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -24,6 +24,7 @@ struct zswap_lruvec_state {
>         atomic_long_t nr_disk_swapins;
>  };
>
> +bool zswap_store_batching_enabled(void);
>  unsigned long zswap_total_pages(void);
>  bool zswap_store(struct folio *folio);
>  bool zswap_load(struct folio *folio);
> @@ -39,6 +40,11 @@ bool zswap_never_enabled(void);
>
>  struct zswap_lruvec_state {};
>
> +static inline bool zswap_store_batching_enabled(void)
> +{
> +       return false;
> +}
> +
>  static inline bool zswap_store(struct folio *folio)
>  {
>         return false;
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 33fa51d608dc..26d1a5cee471 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -125,6 +125,18 @@ config ZSWAP_COMPRESSOR_DEFAULT
>         default "zstd" if ZSWAP_COMPRESSOR_DEFAULT_ZSTD
>         default ""
>
> +config ZSWAP_STORE_BATCHING_ENABLED
> +       bool "Batching of zswap stores with Intel IAA"
> +       depends on ZSWAP && CRYPTO_DEV_IAA_CRYPTO
> +       default n
> +       help
> +       Enables zswap_store to swap out large folios in batches of 8 pages,
> +       rather than a page at a time, if the system has Intel IAA for hardware
> +       acceleration of compressions. If IAA is configured as the zswap
> +       compressor, this will parallelize batch compression of up to 8 pages
> +       in the folio in hardware, thereby improving large folio compression
> +       throughput and reducing swapout latency.
> +
>  choice
>         prompt "Default allocator"
>         depends on ZSWAP
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 948c9745ee57..4893302d8c34 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -127,6 +127,15 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
>                 CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
>  module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
>
> +/*
> + * Enable/disable batching of compressions if zswap_store is called with a
> + * large folio. If enabled, and if IAA is the zswap compressor, pages are
> + * compressed in parallel in batches of say, 8 pages.
> + * If not, every page is compressed sequentially.
> + */
> +static bool __zswap_store_batching_enabled = IS_ENABLED(
> +       CONFIG_ZSWAP_STORE_BATCHING_ENABLED);
> +
>  bool zswap_is_enabled(void)
>  {
>         return zswap_enabled;
> @@ -241,6 +250,11 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
>         pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name,         \
>                  zpool_get_type((p)->zpool))
>
> +__always_inline bool zswap_store_batching_enabled(void)
> +{
> +       return __zswap_store_batching_enabled;
> +}
> +
>  /*********************************
>  * pool functions
>  **********************************/
> --
> 2.27.0
>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v1 10/13] mm: zswap: Create multiple reqs/buffers in crypto_acomp_ctx if platform has IAA.
  2024-10-18  6:40 ` [RFC PATCH v1 10/13] mm: zswap: Create multiple reqs/buffers in crypto_acomp_ctx if platform has IAA Kanchana P Sridhar
@ 2024-10-23  0:51   ` Yosry Ahmed
  2024-10-23  2:19     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 36+ messages in thread
From: Yosry Ahmed @ 2024-10-23  0:51 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, ying.huang, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi, zanussi, viro, brauner, jack, mcgrof, kees,
	joel.granados, bfoster, willy, linux-fsdevel, wajdi.k.feghali,
	vinodh.gopal

On Thu, Oct 17, 2024 at 11:41 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Intel IAA hardware acceleration can be used effectively to improve the
> zswap_store() performance of large folios by batching multiple pages in a
> folio to be compressed in parallel by IAA. Hence, to build compress batching
> of zswap large folio stores using IAA, we need to be able to submit a batch
> of compress jobs from zswap to the hardware to compress in parallel if the
> iaa_crypto "async" mode is used.
>
> The IAA compress batching paradigm works as follows:
>
>  1) Submit N crypto_acomp_compress() jobs using N requests.
>  2) Use the iaa_crypto driver async poll() method to check for the jobs
>     to complete.
>  3) There are no ordering constraints implied by submission, hence we
>     could loop through the requests and process any job that has
>     completed.
>  4) This would repeat until all jobs have completed with success/error
>     status.
>
> To facilitate this, we need to provide for multiple acomp_reqs in
> "struct crypto_acomp_ctx", each representing a distinct compress
> job. Likewise, there needs to be a distinct destination buffer
> corresponding to each acomp_req.
>
> If CONFIG_ZSWAP_STORE_BATCHING_ENABLED is enabled, this patch will set the
> SWAP_CRYPTO_SUB_BATCH_SIZE constant to 8UL. This implies each per-cpu
> crypto_acomp_ctx associated with the zswap_pool can submit up to 8
> acomp_reqs at a time to accomplish parallel compressions.
>
> If IAA is not present and/or CONFIG_ZSWAP_STORE_BATCHING_ENABLED is not
> set, SWAP_CRYPTO_SUB_BATCH_SIZE will be set to 1UL.
>
> On an Intel Sapphire Rapids server, each socket has 4 IAA, each of which
> has 2 compress engines and 8 decompress engines. Experiments modeling a
> contended system with say 72 processes running under a cgroup with a fixed
> memory-limit, have shown that there is a significant performance
> improvement with dispatching compress jobs from all cores to all the
> IAA devices on the socket. Hence, SWAP_CRYPTO_SUB_BATCH_SIZE is set to
> 8 to maximize compression throughput if IAA is available.
>
> The definition of "struct crypto_acomp_ctx" is modified to make the
> req/buffer be arrays of size SWAP_CRYPTO_SUB_BATCH_SIZE. Thus, the
> added memory footprint cost of this per-cpu structure for batching is
> incurred only for platforms that have Intel IAA.
>
> Suggested-by: Ying Huang <ying.huang@intel.com>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>

Does this really need to be done in zswap? Why can't zswap submit a
single compression request with the supported number of pages, and
have the driver handle it as it sees fit?
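
One possible shape of such an interface, as a sketch only (whether acomp
would treat a multi-entry source scatterlist as one stream or as N
independent pages is exactly the open design question; the sg_* and
crypto_* calls are existing kernel APIs, everything else is illustrative):

	struct scatterlist input[NR_PAGES], output;	/* NR_PAGES: example */

	/* Describe all source pages in a single request... */
	sg_init_table(input, nr_pages);
	for (i = 0; i < nr_pages; i++)
		sg_set_page(&input[i], pages[i], PAGE_SIZE, 0);
	sg_init_one(&output, dst, nr_pages * PAGE_SIZE * 2);

	acomp_request_set_params(req, input, &output,
				 nr_pages * PAGE_SIZE, dlen);

	/* ...and let the driver split and parallelize internally. */
	err = crypto_wait_req(crypto_acomp_compress(req), &wait);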

> ---
>  mm/swap.h  |  11 ++++++
>  mm/zswap.c | 104 ++++++++++++++++++++++++++++++++++-------------------
>  2 files changed, 78 insertions(+), 37 deletions(-)
>
> diff --git a/mm/swap.h b/mm/swap.h
> index ad2f121de970..566616c971d4 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -8,6 +8,17 @@ struct mempolicy;
>  #include <linux/swapops.h> /* for swp_offset */
>  #include <linux/blk_types.h> /* for bio_end_io_t */
>
> +/*
> + * For IAA compression batching:
> + * Maximum number of IAA acomp compress requests that will be processed
> + * in a sub-batch.
> + */
> +#if defined(CONFIG_ZSWAP_STORE_BATCHING_ENABLED)
> +#define SWAP_CRYPTO_SUB_BATCH_SIZE 8UL
> +#else
> +#define SWAP_CRYPTO_SUB_BATCH_SIZE 1UL
> +#endif
> +
>  /* linux/mm/page_io.c */
>  int sio_pool_init(void);
>  struct swap_iocb;
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 4893302d8c34..579869d1bdf6 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -152,9 +152,9 @@ bool zswap_never_enabled(void)
>
>  struct crypto_acomp_ctx {
>         struct crypto_acomp *acomp;
> -       struct acomp_req *req;
> +       struct acomp_req *req[SWAP_CRYPTO_SUB_BATCH_SIZE];
> +       u8 *buffer[SWAP_CRYPTO_SUB_BATCH_SIZE];
>         struct crypto_wait wait;
> -       u8 *buffer;
>         struct mutex mutex;
>         bool is_sleepable;
>  };
> @@ -832,49 +832,64 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
>         struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
>         struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
>         struct crypto_acomp *acomp;
> -       struct acomp_req *req;
>         int ret;
> +       int i, j;
>
>         mutex_init(&acomp_ctx->mutex);
>
> -       acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
> -       if (!acomp_ctx->buffer)
> -               return -ENOMEM;
> -
>         acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
>         if (IS_ERR(acomp)) {
>                 pr_err("could not alloc crypto acomp %s : %ld\n",
>                                 pool->tfm_name, PTR_ERR(acomp));
> -               ret = PTR_ERR(acomp);
> -               goto acomp_fail;
> +               return PTR_ERR(acomp);
>         }
>         acomp_ctx->acomp = acomp;
>         acomp_ctx->is_sleepable = acomp_is_async(acomp);
>
> -       req = acomp_request_alloc(acomp_ctx->acomp);
> -       if (!req) {
> -               pr_err("could not alloc crypto acomp_request %s\n",
> -                      pool->tfm_name);
> -               ret = -ENOMEM;
> -               goto req_fail;
> +       for (i = 0; i < SWAP_CRYPTO_SUB_BATCH_SIZE; ++i) {
> +               acomp_ctx->buffer[i] = kmalloc_node(PAGE_SIZE * 2,
> +                                               GFP_KERNEL, cpu_to_node(cpu));
> +               if (!acomp_ctx->buffer[i]) {
> +                       for (j = 0; j < i; ++j)
> +                               kfree(acomp_ctx->buffer[j]);
> +                       ret = -ENOMEM;
> +                       goto buf_fail;
> +               }
> +       }
> +
> +       for (i = 0; i < SWAP_CRYPTO_SUB_BATCH_SIZE; ++i) {
> +               acomp_ctx->req[i] = acomp_request_alloc(acomp_ctx->acomp);
> +               if (!acomp_ctx->req[i]) {
> +                       pr_err("could not alloc crypto acomp_request req[%d] %s\n",
> +                              i, pool->tfm_name);
> +                       for (j = 0; j < i; ++j)
> +                               acomp_request_free(acomp_ctx->req[j]);
> +                       ret = -ENOMEM;
> +                       goto req_fail;
> +               }
>         }
> -       acomp_ctx->req = req;
>
> +       /*
> +        * The crypto_wait is used only in fully synchronous, i.e., with scomp
> +        * or non-poll mode of acomp, hence there is only one "wait" per
> +        * acomp_ctx, with callback set to req[0].
> +        */
>         crypto_init_wait(&acomp_ctx->wait);
>         /*
>          * if the backend of acomp is async zip, crypto_req_done() will wakeup
>          * crypto_wait_req(); if the backend of acomp is scomp, the callback
>          * won't be called, crypto_wait_req() will return without blocking.
>          */
> -       acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
> +       acomp_request_set_callback(acomp_ctx->req[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
>                                    crypto_req_done, &acomp_ctx->wait);
>
>         return 0;
>
>  req_fail:
> +       for (i = 0; i < SWAP_CRYPTO_SUB_BATCH_SIZE; ++i)
> +               kfree(acomp_ctx->buffer[i]);
> +buf_fail:
>         crypto_free_acomp(acomp_ctx->acomp);
> -acomp_fail:
> -       kfree(acomp_ctx->buffer);
>         return ret;
>  }
>
> @@ -884,11 +899,17 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
>         struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
>
>         if (!IS_ERR_OR_NULL(acomp_ctx)) {
> -               if (!IS_ERR_OR_NULL(acomp_ctx->req))
> -                       acomp_request_free(acomp_ctx->req);
> +               int i;
> +
> +               for (i = 0; i < SWAP_CRYPTO_SUB_BATCH_SIZE; ++i)
> +                       if (!IS_ERR_OR_NULL(acomp_ctx->req[i]))
> +                               acomp_request_free(acomp_ctx->req[i]);
> +
> +               for (i = 0; i < SWAP_CRYPTO_SUB_BATCH_SIZE; ++i)
> +                       kfree(acomp_ctx->buffer[i]);
> +
>                 if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
>                         crypto_free_acomp(acomp_ctx->acomp);
> -               kfree(acomp_ctx->buffer);
>         }
>
>         return 0;
> @@ -911,7 +932,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>
>         mutex_lock(&acomp_ctx->mutex);
>
> -       dst = acomp_ctx->buffer;
> +       dst = acomp_ctx->buffer[0];
>         sg_init_table(&input, 1);
>         sg_set_page(&input, page, PAGE_SIZE, 0);
>
> @@ -921,7 +942,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>          * giving the dst buffer with enough length to avoid buffer overflow.
>          */
>         sg_init_one(&output, dst, PAGE_SIZE * 2);
> -       acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
> +       acomp_request_set_params(acomp_ctx->req[0], &input, &output, PAGE_SIZE, dlen);
>
>         /*
>          * If the crypto_acomp provides an asynchronous poll() interface,
> @@ -940,19 +961,20 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>          * parallel.
>          */
>         if (acomp_ctx->acomp->poll) {
> -               comp_ret = crypto_acomp_compress(acomp_ctx->req);
> +               comp_ret = crypto_acomp_compress(acomp_ctx->req[0]);
>                 if (comp_ret == -EINPROGRESS) {
>                         do {
> -                               comp_ret = crypto_acomp_poll(acomp_ctx->req);
> +                               comp_ret = crypto_acomp_poll(acomp_ctx->req[0]);
>                                 if (comp_ret && comp_ret != -EAGAIN)
>                                         break;
>                         } while (comp_ret);
>                 }
>         } else {
> -               comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> +               comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req[0]),
> +                                          &acomp_ctx->wait);
>         }
>
> -       dlen = acomp_ctx->req->dlen;
> +       dlen = acomp_ctx->req[0]->dlen;
>         if (comp_ret)
>                 goto unlock;
>
> @@ -1006,31 +1028,39 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>          */
>         if ((acomp_ctx->is_sleepable && !zpool_can_sleep_mapped(zpool)) ||
>             !virt_addr_valid(src)) {
> -               memcpy(acomp_ctx->buffer, src, entry->length);
> -               src = acomp_ctx->buffer;
> +               memcpy(acomp_ctx->buffer[0], src, entry->length);
> +               src = acomp_ctx->buffer[0];
>                 zpool_unmap_handle(zpool, entry->handle);
>         }
>
>         sg_init_one(&input, src, entry->length);
>         sg_init_table(&output, 1);
>         sg_set_folio(&output, folio, PAGE_SIZE, 0);
> -       acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> +       acomp_request_set_params(acomp_ctx->req[0], &input, &output,
> +                                entry->length, PAGE_SIZE);
>         if (acomp_ctx->acomp->poll) {
> -               ret = crypto_acomp_decompress(acomp_ctx->req);
> +               ret = crypto_acomp_decompress(acomp_ctx->req[0]);
>                 if (ret == -EINPROGRESS) {
>                         do {
> -                               ret = crypto_acomp_poll(acomp_ctx->req);
> +                               ret = crypto_acomp_poll(acomp_ctx->req[0]);
>                                 BUG_ON(ret && ret != -EAGAIN);
>                         } while (ret);
>                 }
>         } else {
> -               BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> +               BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req[0]),
> +                                      &acomp_ctx->wait));
>         }
> -       BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> -       mutex_unlock(&acomp_ctx->mutex);
> +       BUG_ON(acomp_ctx->req[0]->dlen != PAGE_SIZE);
>
> -       if (src != acomp_ctx->buffer)
> +       if (src != acomp_ctx->buffer[0])
>                 zpool_unmap_handle(zpool, entry->handle);
> +
> +       /*
> +        * It is safer to unlock the mutex after the check for
> +        * "src != acomp_ctx->buffer[0]" so that the value of "src"
> +        * does not change.
> +        */
> +       mutex_unlock(&acomp_ctx->mutex);
>  }
>
>  /*********************************
> --
> 2.27.0
>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v1 11/13] mm: swap: Add IAA batch compression API swap_crypto_acomp_compress_batch().
  2024-10-18  6:40 ` [RFC PATCH v1 11/13] mm: swap: Add IAA batch compression API swap_crypto_acomp_compress_batch() Kanchana P Sridhar
@ 2024-10-23  0:53   ` Yosry Ahmed
  2024-10-23  2:21     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 36+ messages in thread
From: Yosry Ahmed @ 2024-10-23  0:53 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, ying.huang, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi, zanussi, viro, brauner, jack, mcgrof, kees,
	joel.granados, bfoster, willy, linux-fsdevel, wajdi.k.feghali,
	vinodh.gopal

On Thu, Oct 17, 2024 at 11:41 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Added a new API swap_crypto_acomp_compress_batch() that does batch
> compression. A system that has Intel IAA can use this API to submit a
> batch of compress jobs for parallel compression in the hardware, to improve
> performance. On a system without IAA, this API will process each compress
> job sequentially.
>
> The purpose of this API is to be invocable from any swap module that needs
> to compress large folios, or a batch of pages in the general case. For
> instance, zswap would batch compress up to SWAP_CRYPTO_SUB_BATCH_SIZE
> (i.e. 8 if the system has IAA) pages in the large folio in parallel to
> improve zswap_store() performance.
>
> Towards this eventual goal:
>
> 1) The definition of "struct crypto_acomp_ctx" is moved to mm/swap.h
>    so that mm modules like swap_state.c and zswap.c can reference it.
> 2) The swap_crypto_acomp_compress_batch() interface is implemented in
>    swap_state.c.
>
> It would be preferable for "struct crypto_acomp_ctx" to be defined in,
> and for swap_crypto_acomp_compress_batch() to be exported via
> include/linux/swap.h so that modules outside mm (for e.g. zram) can
> potentially use the API for batch compressions with IAA. I would
> appreciate RFC comments on this.

Same question as the last patch, why does this need to be in the swap
code? Why can't zswap just submit a single request to compress a large
folio or a range of contiguous subpages at once?
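
A minimal sketch of that alternative, assuming the acomp backend could
accept a source length larger than PAGE_SIZE and split the work up
internally ("folio", "nr_pages" and "dst" are illustrative locals here,
not identifiers from this series):

	struct scatterlist input, output;
	unsigned int slen = nr_pages * PAGE_SIZE;
	unsigned int dlen = 2 * slen;	/* worst-case sized dst buffer */

	sg_init_table(&input, 1);
	sg_set_folio(&input, folio, slen, 0);	/* whole range as one sg entry */
	sg_init_one(&output, dst, dlen);
	acomp_request_set_params(acomp_ctx->req[0], &input, &output, slen, dlen);
	/* one submission; the driver fans out the per-page work itself */
	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req[0]),
				   &acomp_ctx->wait);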

>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/swap.h       |  45 +++++++++++++++++++
>  mm/swap_state.c | 115 ++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/zswap.c      |   9 ----
>  3 files changed, 160 insertions(+), 9 deletions(-)
>
> diff --git a/mm/swap.h b/mm/swap.h
> index 566616c971d4..4dcb67e2cc33 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -7,6 +7,7 @@ struct mempolicy;
>  #ifdef CONFIG_SWAP
>  #include <linux/swapops.h> /* for swp_offset */
>  #include <linux/blk_types.h> /* for bio_end_io_t */
> +#include <linux/crypto.h>
>
>  /*
>   * For IAA compression batching:
> @@ -19,6 +20,39 @@ struct mempolicy;
>  #define SWAP_CRYPTO_SUB_BATCH_SIZE 1UL
>  #endif
>
> +/* linux/mm/swap_state.c, zswap.c */
> +struct crypto_acomp_ctx {
> +       struct crypto_acomp *acomp;
> +       struct acomp_req *req[SWAP_CRYPTO_SUB_BATCH_SIZE];
> +       u8 *buffer[SWAP_CRYPTO_SUB_BATCH_SIZE];
> +       struct crypto_wait wait;
> +       struct mutex mutex;
> +       bool is_sleepable;
> +};
> +
> +/**
> + * This API provides IAA compress batching functionality for use by swap
> + * modules.
> + * The acomp_ctx mutex should be locked/unlocked before/after calling this
> + * procedure.
> + *
> + * @pages: Pages to be compressed.
> + * @dsts: Pre-allocated destination buffers to store results of IAA compression.
> + * @dlens: Will contain the compressed lengths.
> + * @errors: Will contain a 0 if the page was successfully compressed, or a
> + *          non-0 error value to be processed by the calling function.
> + * @nr_pages: The number of pages, up to SWAP_CRYPTO_SUB_BATCH_SIZE,
> + *            to be compressed.
> + * @acomp_ctx: The acomp context for iaa_crypto/other compressor.
> + */
> +void swap_crypto_acomp_compress_batch(
> +       struct page *pages[],
> +       u8 *dsts[],
> +       unsigned int dlens[],
> +       int errors[],
> +       int nr_pages,
> +       struct crypto_acomp_ctx *acomp_ctx);
> +
>  /* linux/mm/page_io.c */
>  int sio_pool_init(void);
>  struct swap_iocb;
> @@ -119,6 +153,17 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
>
>  #else /* CONFIG_SWAP */
>  struct swap_iocb;
> +struct crypto_acomp_ctx {};
> +static inline void swap_crypto_acomp_compress_batch(
> +       struct page *pages[],
> +       u8 *dsts[],
> +       unsigned int dlens[],
> +       int errors[],
> +       int nr_pages,
> +       struct crypto_acomp_ctx *acomp_ctx)
> +{
> +}
> +
>  static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
>  {
>  }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 4669f29cf555..117c3caa5679 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -23,6 +23,8 @@
>  #include <linux/swap_slots.h>
>  #include <linux/huge_mm.h>
>  #include <linux/shmem_fs.h>
> +#include <linux/scatterlist.h>
> +#include <crypto/acompress.h>
>  #include "internal.h"
>  #include "swap.h"
>
> @@ -742,6 +744,119 @@ void exit_swap_address_space(unsigned int type)
>         swapper_spaces[type] = NULL;
>  }
>
> +#ifdef CONFIG_SWAP
> +
> +/**
> + * This API provides IAA compress batching functionality for use by swap
> + * modules.
> + * The acomp_ctx mutex should be locked/unlocked before/after calling this
> + * procedure.
> + *
> + * @pages: Pages to be compressed.
> + * @dsts: Pre-allocated destination buffers to store results of IAA compression.
> + * @dlens: Will contain the compressed lengths.
> + * @errors: Will contain a 0 if the page was successfully compressed, or a
> + *          non-0 error value to be processed by the calling function.
> + * @nr_pages: The number of pages, up to SWAP_CRYPTO_SUB_BATCH_SIZE,
> + *            to be compressed.
> + * @acomp_ctx: The acomp context for iaa_crypto/other compressor.
> + */
> +void swap_crypto_acomp_compress_batch(
> +       struct page *pages[],
> +       u8 *dsts[],
> +       unsigned int dlens[],
> +       int errors[],
> +       int nr_pages,
> +       struct crypto_acomp_ctx *acomp_ctx)
> +{
> +       struct scatterlist inputs[SWAP_CRYPTO_SUB_BATCH_SIZE];
> +       struct scatterlist outputs[SWAP_CRYPTO_SUB_BATCH_SIZE];
> +       bool compressions_done = false;
> +       int i, j;
> +
> +       BUG_ON(nr_pages > SWAP_CRYPTO_SUB_BATCH_SIZE);
> +
> +       /*
> +        * Prepare and submit acomp_reqs to IAA.
> +        * IAA will process these compress jobs in parallel in async mode.
> +        * If the compressor does not support a poll() method, or if IAA is
> +        * used in sync mode, the jobs will be processed sequentially using
> +        * acomp_ctx->req[0] and acomp_ctx->wait.
> +        */
> +       for (i = 0; i < nr_pages; ++i) {
> +               j = acomp_ctx->acomp->poll ? i : 0;
> +               sg_init_table(&inputs[i], 1);
> +               sg_set_page(&inputs[i], pages[i], PAGE_SIZE, 0);
> +
> +               /*
> +                * Each acomp_ctx->buffer[] is of size (PAGE_SIZE * 2).
> > +                * Reflect the same size in the sg_list.
> +                */
> +               sg_init_one(&outputs[i], dsts[i], PAGE_SIZE * 2);
> +               acomp_request_set_params(acomp_ctx->req[j], &inputs[i],
> +                                        &outputs[i], PAGE_SIZE, dlens[i]);
> +
> +               /*
> +                * If the crypto_acomp provides an asynchronous poll()
> +                * interface, submit the request to the driver now, and poll for
> +                * a completion status later, after all descriptors have been
> +                * submitted. If the crypto_acomp does not provide a poll()
> +                * interface, submit the request and wait for it to complete,
> +                * i.e., synchronously, before moving on to the next request.
> +                */
> +               if (acomp_ctx->acomp->poll) {
> +                       errors[i] = crypto_acomp_compress(acomp_ctx->req[j]);
> +
> +                       if (errors[i] != -EINPROGRESS)
> +                               errors[i] = -EINVAL;
> +                       else
> +                               errors[i] = -EAGAIN;
> +               } else {
> +                       errors[i] = crypto_wait_req(
> +                                             crypto_acomp_compress(acomp_ctx->req[j]),
> +                                             &acomp_ctx->wait);
> +                       if (!errors[i])
> +                               dlens[i] = acomp_ctx->req[j]->dlen;
> +               }
> +       }
> +
> +       /*
> +        * If not doing async compressions, the batch has been processed at
> +        * this point and we can return.
> +        */
> +       if (!acomp_ctx->acomp->poll)
> +               return;
> +
> +       /*
> +        * Poll for and process IAA compress job completions
> +        * in out-of-order manner.
> +        */
> +       while (!compressions_done) {
> +               compressions_done = true;
> +
> +               for (i = 0; i < nr_pages; ++i) {
> +                       /*
> > +                        * Skip if the compression has already completed
> +                        * successfully or with an error.
> +                        */
> +                       if (errors[i] != -EAGAIN)
> +                               continue;
> +
> +                       errors[i] = crypto_acomp_poll(acomp_ctx->req[i]);
> +
> +                       if (errors[i]) {
> +                               if (errors[i] == -EAGAIN)
> +                                       compressions_done = false;
> +                       } else {
> +                               dlens[i] = acomp_ctx->req[i]->dlen;
> +                       }
> +               }
> +       }
> +}
> +EXPORT_SYMBOL_GPL(swap_crypto_acomp_compress_batch);
> +
> +#endif /* CONFIG_SWAP */
> +
>  static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start,
>                            unsigned long *end)
>  {
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 579869d1bdf6..cab3114321f9 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -150,15 +150,6 @@ bool zswap_never_enabled(void)
>  * data structures
>  **********************************/
>
> -struct crypto_acomp_ctx {
> -       struct crypto_acomp *acomp;
> -       struct acomp_req *req[SWAP_CRYPTO_SUB_BATCH_SIZE];
> -       u8 *buffer[SWAP_CRYPTO_SUB_BATCH_SIZE];
> -       struct crypto_wait wait;
> -       struct mutex mutex;
> -       bool is_sleepable;
> -};
> -
>  /*
>   * The lock ordering is zswap_tree.lock -> zswap_pool.lru_lock.
>   * The only case where lru_lock is not acquired while holding tree.lock is
> --
> 2.27.0
>
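
For reference, a minimal sketch of how a caller might drive this API,
following the kernel-doc above (the acomp_ctx mutex held around the
call, dlens[] seeded with PAGE_SIZE, and the per-CPU acomp_ctx buffers
used as destinations; "pages" and "nr_pages" are illustrative):

	unsigned int dlens[SWAP_CRYPTO_SUB_BATCH_SIZE];
	int errors[SWAP_CRYPTO_SUB_BATCH_SIZE];
	int i, nr = min_t(int, nr_pages, SWAP_CRYPTO_SUB_BATCH_SIZE);

	for (i = 0; i < nr; i++)
		dlens[i] = PAGE_SIZE;

	mutex_lock(&acomp_ctx->mutex);
	swap_crypto_acomp_compress_batch(pages, acomp_ctx->buffer, dlens,
					 errors, nr, acomp_ctx);
	mutex_unlock(&acomp_ctx->mutex);

	for (i = 0; i < nr; i++) {
		if (errors[i]) {
			/* caller policy: retry, fall back or fail this page */
		}
	}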


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v1 00/13] zswap IAA compress batching
  2024-10-18  6:40 [RFC PATCH v1 00/13] zswap IAA compress batching Kanchana P Sridhar
                   ` (12 preceding siblings ...)
  2024-10-18  6:41 ` [RFC PATCH v1 13/13] mm: vmscan, swap, zswap: Compress batching of folios in shrink_folio_list() Kanchana P Sridhar
@ 2024-10-23  0:56 ` Yosry Ahmed
  2024-10-23  2:53   ` Sridhar, Kanchana P
  13 siblings, 1 reply; 36+ messages in thread
From: Yosry Ahmed @ 2024-10-23  0:56 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, ying.huang, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	kristen.c.accardi, zanussi, viro, brauner, jack, mcgrof, kees,
	joel.granados, bfoster, willy, linux-fsdevel, wajdi.k.feghali,
	vinodh.gopal

On Thu, Oct 17, 2024 at 11:41 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
>
> IAA Compression Batching:
> =========================
>
> This RFC patch-series introduces the use of the Intel Analytics Accelerator
> (IAA) for parallel compression of pages in a folio, and for batched reclaim
> of hybrid any-order batches of folios in shrink_folio_list().
>
> The patch-series is organized as follows:
>
>  1) iaa_crypto driver enablers for batching: Relevant patches are tagged
>     with "crypto:" in the subject:
>
>     a) async poll crypto_acomp interface without interrupts.
>     b) crypto testmgr acomp poll support.
>     c) Modifying the default sync_mode to "async" and disabling
>        verify_compress by default, to facilitate users to run IAA easily for
>        comparison with software compressors.
>     d) Changing the cpu-to-iaa mappings to more evenly balance cores to IAA
>        devices.
>     e) Addition of a "global_wq" per IAA, which can be used as a global
>        resource for the socket. If the user configures 2WQs per IAA device,
>        the driver will distribute compress jobs from all cores on the
>        socket to the "global_wqs" of all the IAA devices on that socket, in
>        a round-robin manner. This can be used to improve compression
>        throughput for workloads that see a lot of swapout activity.
>
>  2) Migrating zswap to use async poll in zswap_compress()/decompress().
>  3) A centralized batch compression API that can be used by swap modules.
>  4) IAA compress batching within large folio zswap stores.
>  5) IAA compress batching of any-order hybrid folios in
>     shrink_folio_list(). The newly added "sysctl vm.compress-batchsize"
>     parameter can be used to configure the number of folios in [1, 32] to
>     be reclaimed using compress batching.

I am still digesting this series but I have some high level questions
that I left on some patches. My intuition though is that we should
drop (5) from the initial proposal as it's most controversial.
Batching reclaim of unrelated folios through zswap *might* make sense,
but it needs a broader conversation and it needs justification on its
own merit, without the rest of the series.

>
> IAA compress batching can be enabled only on platforms that have IAA, by
> setting this config variable:
>
>  CONFIG_ZSWAP_STORE_BATCHING_ENABLED="y"
>
> The performance testing data with usemem 30 instances shows throughput
> gains of up to 40%, elapsed time reduction of up to 22% and sys time
> reduction of up to 30% with IAA compression batching.
>
> Our internal validation of IAA compress/decompress batching in highly
> contended Sapphire Rapids server setups with workloads running on 72 cores
> for ~25 minutes under stringent memory limit constraints have shown up to
> 50% reduction in sys time and 3.5% reduction in workload run time as
> compared to software compressors.
>
>
> System setup for testing:
> =========================
> Testing of this patch-series was done with mm-unstable as of 10-16-2024,
> commit 817952b8be34, without and with this patch-series.
> Data was gathered on an Intel Sapphire Rapids server, dual-socket 56 cores
> per socket, 4 IAA devices per socket, 503 GiB RAM and 525G SSD disk
> partition swap. Core frequency was fixed at 2500MHz.
>
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
> processes were run, each allocating and writing 10G of memory, and sleeping
> for 10 sec before exiting:
>
> usemem --init-time -w -O -s 10 -n 30 10g
>
> Other kernel configuration parameters:
>
>     zswap compressor : deflate-iaa
>     zswap allocator   : zsmalloc
>     vm.page-cluster   : 2,4
>
> IAA "compression verification" is disabled and the async poll acomp
> interface is used in the iaa_crypto driver (the defaults with this
> series).
>
>
> Performance testing (usemem30):
> ===============================
>
>  4K folios: deflate-iaa:
>  =======================
>
>  -------------------------------------------------------------------------------
>                 mm-unstable-10-16-2024  shrink_folio_list()  shrink_folio_list()
>                                          batching of folios   batching of folios
>  -------------------------------------------------------------------------------
>  zswap compressor          deflate-iaa          deflate-iaa          deflate-iaa
>  vm.compress-batchsize             n/a                    1                   32
>  vm.page-cluster                     2                    2                    2
>  -------------------------------------------------------------------------------
>  Total throughput            4,470,466            5,770,824            6,363,045
>            (KB/s)
>  Average throughput            149,015              192,360              212,101
>            (KB/s)
>  elapsed time                   119.24               100.96                92.99
>         (sec)
>  sys time (sec)               2,819.29             2,168.08             1,970.79
>
>  -------------------------------------------------------------------------------
>  memcg_high                    668,185              646,357              613,421
>  memcg_swap_fail                     0                    0                    0
>  zswpout                    62,991,796           58,275,673           53,070,201
>  zswpin                            431                  415                  396
>  pswpout                             0                    0                    0
>  pswpin                              0                    0                    0
>  thp_swpout                          0                    0                    0
>  thp_swpout_fallback                 0                    0                    0
>  pgmajfault                      3,137                3,085                3,440
>  swap_ra                            99                  100                   95
>  swap_ra_hit                        42                   44                   45
>  -------------------------------------------------------------------------------
>
>
>  16k/32/64k folios: deflate-iaa:
>  ===============================
>  All three large folio sizes 16k/32/64k were enabled to "always".
>
>  -------------------------------------------------------------------------------
>                 mm-unstable-  zswap_store()      + shrink_folio_list()
>                   10-16-2024    batching of         batching of folios
>                                    pages in
>                                large folios
>  -------------------------------------------------------------------------------
>  zswap compr     deflate-iaa     deflate-iaa          deflate-iaa
>  vm.compress-            n/a             n/a         4          8             16
>  batchsize
>  vm.page-                  2               2         2          2              2
>   cluster
>  -------------------------------------------------------------------------------
>  Total throughput   7,182,198   8,448,994    8,584,728    8,729,643    8,775,944
>            (KB/s)
>  Avg throughput       239,406     281,633      286,157      290,988      292,531
>          (KB/s)
>  elapsed time           85.04       77.84        77.03        75.18        74.98
>          (sec)
>  sys time (sec)      1,730.77    1,527.40     1,528.52     1,473.76     1,465.97
>
>  -------------------------------------------------------------------------------
>  memcg_high           648,125     694,188      696,004      699,728      724,887
>  memcg_swap_fail        1,550       2,540        1,627        1,577        1,517
>  zswpout           57,606,876  56,624,450   56,125,082    55,999,42   57,352,204
>  zswpin                   421         406          422          400          437
>  pswpout                    0           0            0            0            0
>  pswpin                     0           0            0            0            0
>  thp_swpout                 0           0            0            0            0
>  thp_swpout_fallback        0           0            0            0            0
>  16kB-mthp_swpout_          0           0            0            0            0
>           fallback
>  32kB-mthp_swpout_          0           0            0            0            0
>           fallback
>  64kB-mthp_swpout_      1,550       2,539        1,627        1,577        1,517
>           fallback
>  pgmajfault             3,102       3,126        3,473        3,454        3,134
>  swap_ra                  107         144          109          124          181
>  swap_ra_hit               51          88           45           66          107
>  ZSWPOUT-16kB               2           3            4            4            3
>  ZSWPOUT-32kB               0           2            1            1            0
>  ZSWPOUT-64kB       3,598,889   3,536,556    3,506,134    3,498,324    3,582,921
>  SWPOUT-16kB                0           0            0            0            0
>  SWPOUT-32kB                0           0            0            0            0
>  SWPOUT-64kB                0           0            0            0            0
>  -------------------------------------------------------------------------------
>
>
>  2M folios: deflate-iaa:
>  =======================
>
>  -------------------------------------------------------------------------------
>                    mm-unstable-10-16-2024    zswap_store() batching of pages
>                                                       in pmd-mappable folios
>  -------------------------------------------------------------------------------
>  zswap compressor             deflate-iaa                deflate-iaa
>  vm.compress-batchsize                n/a                        n/a
>  vm.page-cluster                        2                          2
>  -------------------------------------------------------------------------------
>  Total throughput               7,444,592                 8,916,349
>            (KB/s)
>  Average throughput               248,153                   297,211
>            (KB/s)
>  elapsed time                       86.29                     73.44
>         (sec)
>  sys time (sec)                  1,833.21                  1,418.58
>
>  -------------------------------------------------------------------------------
>  memcg_high                        81,786                    89,905
>  memcg_swap_fail                       82                       395
>  zswpout                       58,874,092                57,721,884
>  zswpin                               422                       458
>  pswpout                                0                         0
>  pswpin                                 0                         0
>  thp_swpout                             0                         0
>  thp_swpout_fallback                   82                       394
>  pgmajfault                        14,864                    21,544
>  swap_ra                           34,953                    53,751
>  swap_ra_hit                       34,895                    53,660
>  ZSWPOUT-2048kB                   114,815                   112,269
>  SWPOUT-2048kB                          0                         0
>  -------------------------------------------------------------------------------
>
> Since 4K folios account for ~0.4% of all zswapouts when pmd-mappable folios
> are enabled for usemem30, we cannot expect much improvement from reclaim
> batching.
>
>
> Performance testing (Kernel compilation):
> =========================================
>
> As mentioned earlier, for workloads that see a lot of swapout activity, we
> can benefit from configuring 2 WQs per IAA device, with compress jobs from
> all same-socket cores being distributed to the wq.1 of all IAAs on the
> socket, with the "global_wq" developed in this patch-series.
>
> Although this data includes IAA decompress batching, which will be
> submitted as a separate RFC patch-series, I am listing it here to quantify
> the benefit of distributing compress jobs among all IAAs. The kernel
> compilation test with "allmodconfig" is able to quantify this well:
>
>
>  4K folios: deflate-iaa: kernel compilation to quantify crypto patches
>  =====================================================================
>
>
>  ------------------------------------------------------------------------------
>                    IAA shrink_folio_list() compress batching and
>                        swapin_readahead() decompress batching
>
>                                       1WQ      2WQ (distribute compress jobs)
>
>                         1 local WQ (wq.0)    1 local WQ (wq.0) +
>                                   per IAA    1 global WQ (wq.1) per IAA
>
>  ------------------------------------------------------------------------------
>  zswap compressor             deflate-iaa         deflate-iaa
>  vm.compress-batchsize                 32                  32
>  vm.page-cluster                        4                   4
>  ------------------------------------------------------------------------------
>  real_sec                          746.77              745.42
>  user_sec                       15,732.66           15,738.85
>  sys_sec                         5,384.14            5,247.86
>  Max_Res_Set_Size_KB            1,874,432           1,872,640
>
>  ------------------------------------------------------------------------------
>  zswpout                      101,648,460         104,882,982
>  zswpin                        27,418,319          29,428,515
>  pswpout                              213                  22
>  pswpin                               207                   6
>  pgmajfault                    21,896,616          23,629,768
>  swap_ra                        6,054,409           6,385,080
>  swap_ra_hit                    3,791,628           3,985,141
>  ------------------------------------------------------------------------------
>
> The iaa_crypto wq stats will show almost the same number of compress calls
> for wq.1 of all IAA devices. wq.0 will handle decompress calls exclusively.
> We see a latency reduction of 2.5% by distributing compress jobs among all
> IAA devices on the socket.
>
> I would greatly appreciate code review comments for the iaa_crypto driver
> and mm patches included in this series!
>
> Thanks,
> Kanchana
>
>
>
> Kanchana P Sridhar (13):
>   crypto: acomp - Add a poll() operation to acomp_alg and acomp_req
>   crypto: iaa - Add support for irq-less crypto async interface
>   crypto: testmgr - Add crypto testmgr acomp poll support.
>   mm: zswap: zswap_compress()/decompress() can submit, then poll an
>     acomp_req.
>   crypto: iaa - Make async mode the default.
>   crypto: iaa - Disable iaa_verify_compress by default.
>   crypto: iaa - Change cpu-to-iaa mappings to evenly balance cores to
>     IAAs.
>   crypto: iaa - Distribute compress jobs to all IAA devices on a NUMA
>     node.
>   mm: zswap: Config variable to enable compress batching in
>     zswap_store().
>   mm: zswap: Create multiple reqs/buffers in crypto_acomp_ctx if
>     platform has IAA.
>   mm: swap: Add IAA batch compression API
>     swap_crypto_acomp_compress_batch().
>   mm: zswap: Compress batching with Intel IAA in zswap_store() of large
>     folios.
>   mm: vmscan, swap, zswap: Compress batching of folios in
>     shrink_folio_list().
>
>  crypto/acompress.c                         |   1 +
>  crypto/testmgr.c                           |  70 +-
>  drivers/crypto/intel/iaa/iaa_crypto_main.c | 467 +++++++++++--
>  include/crypto/acompress.h                 |  18 +
>  include/crypto/internal/acompress.h        |   1 +
>  include/linux/fs.h                         |   2 +
>  include/linux/mm.h                         |   8 +
>  include/linux/writeback.h                  |   5 +
>  include/linux/zswap.h                      | 106 +++
>  kernel/sysctl.c                            |   9 +
>  mm/Kconfig                                 |  12 +
>  mm/page_io.c                               | 152 +++-
>  mm/swap.c                                  |  15 +
>  mm/swap.h                                  |  96 +++
>  mm/swap_state.c                            | 115 +++
>  mm/vmscan.c                                | 154 +++-
>  mm/zswap.c                                 | 771 +++++++++++++++++++--
>  17 files changed, 1870 insertions(+), 132 deletions(-)
>
>
> base-commit: 817952b8be34aad40e07f6832fb9d1fc08961550
> --
> 2.27.0
>
>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [RFC PATCH v1 04/13] mm: zswap: zswap_compress()/decompress() can submit, then poll an acomp_req.
  2024-10-23  0:48   ` Yosry Ahmed
@ 2024-10-23  2:01     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 36+ messages in thread
From: Sridhar, Kanchana P @ 2024-10-23  2:01 UTC (permalink / raw)
  To: Yosry Ahmed, ebiggers
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, Huang, Ying, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, surenb, Accardi,
	Kristen C, zanussi, viro, brauner, jack, mcgrof, kees,
	joel.granados, bfoster, willy, linux-fsdevel, Feghali, Wajdi K,
	Gopal, Vinodh, Sridhar, Kanchana P

Hi Yosry,

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, October 22, 2024 5:49 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> linux-crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; zanussi@kernel.org; viro@zeniv.linux.org.uk;
> brauner@kernel.org; jack@suse.cz; mcgrof@kernel.org; kees@kernel.org;
> joel.granados@kernel.org; bfoster@redhat.com; willy@infradead.org; linux-
> fsdevel@vger.kernel.org; Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal,
> Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [RFC PATCH v1 04/13] mm: zswap:
> zswap_compress()/decompress() can submit, then poll an acomp_req.
> 
> On Thu, Oct 17, 2024 at 11:41 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > If the crypto_acomp has a poll interface registered, zswap_compress()
> > and zswap_decompress() will submit the acomp_req, and then poll() for a
> > successful completion/error status in a busy-wait loop. This allows an
> > asynchronous way to manage (potentially multiple) acomp_reqs without
> > the use of interrupts, which is supported in the iaa_crypto driver.
> >
> > This enables us to implement batch submission of multiple
> > compression/decompression jobs to the Intel IAA hardware accelerator,
> > which will process them in parallel; followed by polling the batch's
> > acomp_reqs for completion status.
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/zswap.c | 51 +++++++++++++++++++++++++++++++++++++++------------
> >  1 file changed, 39 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index f6316b66fb23..948c9745ee57 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -910,18 +910,34 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> >         acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
> >
> >         /*
> > -        * it maybe looks a little bit silly that we send an asynchronous request,
> > -        * then wait for its completion synchronously. This makes the process look
> > -        * synchronous in fact.
> > -        * Theoretically, acomp supports users send multiple acomp requests in one
> > -        * acomp instance, then get those requests done simultaneously. but in this
> > -        * case, zswap actually does store and load page by page, there is no
> > -        * existing method to send the second page before the first page is done
> > -        * in one thread doing zwap.
> > -        * but in different threads running on different cpu, we have different
> > -        * acomp instance, so multiple threads can do (de)compression in parallel.
> > +        * If the crypto_acomp provides an asynchronous poll() interface,
> > +        * submit the descriptor and poll for a completion status.
> > +        *
> > +        * It maybe looks a little bit silly that we send an asynchronous
> > +        * request, then wait for its completion in a busy-wait poll loop, or,
> > +        * synchronously. This makes the process look synchronous in fact.
> > +        * Theoretically, acomp supports users send multiple acomp requests in
> > +        * one acomp instance, then get those requests done simultaneously.
> > +        * But in this case, zswap actually does store and load page by page,
> > +        * there is no existing method to send the second page before the
> > +        * first page is done in one thread doing zswap.
> > +        * But in different threads running on different cpu, we have different
> > +        * acomp instance, so multiple threads can do (de)compression in
> > +        * parallel.
> >          */
> > -       comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> > +       if (acomp_ctx->acomp->poll) {
> > +               comp_ret = crypto_acomp_compress(acomp_ctx->req);
> > +               if (comp_ret == -EINPROGRESS) {
> > +                       do {
> > +                               comp_ret = crypto_acomp_poll(acomp_ctx->req);
> > +                               if (comp_ret && comp_ret != -EAGAIN)
> > +                                       break;
> > +                       } while (comp_ret);
> > +               }
> > +       } else {
> > +               comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> > +       }
> > +
> 
> Is Herbert suggesting that crypto_wait_req(crypto_acomp_compress(..))
> essentially do the poll internally for IAA, and hence this change can
> be dropped?

Yes, you're right. I plan to submit a v2 shortly with Herbert's suggestion.
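
With that suggestion, the call sites would collapse back to the
synchronous-looking form, with any driver-internal polling hidden behind
crypto_wait_req() (a sketch of that shape, not code from the v2):

	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
				   &acomp_ctx->wait);
	dlen = acomp_ctx->req->dlen;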

Thanks,
Kanchana

> 
> >         dlen = acomp_ctx->req->dlen;
> >         if (comp_ret)
> >                 goto unlock;
> > @@ -959,6 +975,7 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> >         struct scatterlist input, output;
> >         struct crypto_acomp_ctx *acomp_ctx;
> >         u8 *src;
> > +       int ret;
> >
> >         acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> >         mutex_lock(&acomp_ctx->mutex);
> > @@ -984,7 +1001,17 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> >         sg_init_table(&output, 1);
> >         sg_set_folio(&output, folio, PAGE_SIZE, 0);
> >         acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> > -       BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> > +       if (acomp_ctx->acomp->poll) {
> > +               ret = crypto_acomp_decompress(acomp_ctx->req);
> > +               if (ret == -EINPROGRESS) {
> > +                       do {
> > +                               ret = crypto_acomp_poll(acomp_ctx->req);
> > +                               BUG_ON(ret && ret != -EAGAIN);
> > +                       } while (ret);
> > +               }
> > +       } else {
> > +               BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> > +       }
> >         BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> >         mutex_unlock(&acomp_ctx->mutex);
> >
> > --
> > 2.27.0
> >

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [RFC PATCH v1 09/13] mm: zswap: Config variable to enable compress batching in zswap_store().
  2024-10-23  0:49   ` Yosry Ahmed
@ 2024-10-23  2:17     ` Sridhar, Kanchana P
  2024-10-23  2:58       ` Herbert Xu
  2024-10-23 18:12       ` Yosry Ahmed
  0 siblings, 2 replies; 36+ messages in thread
From: Sridhar, Kanchana P @ 2024-10-23  2:17 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, Huang, Ying, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, viro, brauner, jack, mcgrof, kees,
	joel.granados, bfoster, willy, linux-fsdevel, Feghali, Wajdi K,
	Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, October 22, 2024 5:50 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> linux-crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; zanussi@kernel.org; viro@zeniv.linux.org.uk;
> brauner@kernel.org; jack@suse.cz; mcgrof@kernel.org; kees@kernel.org;
> joel.granados@kernel.org; bfoster@redhat.com; willy@infradead.org; linux-
> fsdevel@vger.kernel.org; Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal,
> Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [RFC PATCH v1 09/13] mm: zswap: Config variable to enable
> compress batching in zswap_store().
> 
> On Thu, Oct 17, 2024 at 11:41 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Add a new zswap config variable that controls whether zswap_store() will
> > compress a batch of pages, for instance, the pages in a large folio:
> >
> >   CONFIG_ZSWAP_STORE_BATCHING_ENABLED
> >
> > The existing CONFIG_CRYPTO_DEV_IAA_CRYPTO variable added in commit
> > ea7a5cbb4369 ("crypto: iaa - Add Intel IAA Compression Accelerator crypto
> > driver core") is used to detect if the system has the Intel Analytics
> > Accelerator (IAA), and the iaa_crypto module is available. If so, the
> > kernel build will prompt for CONFIG_ZSWAP_STORE_BATCHING_ENABLED. Hence,
> > users have the ability to set CONFIG_ZSWAP_STORE_BATCHING_ENABLED="y" only
> > on systems that have Intel IAA.
> >
> > If CONFIG_ZSWAP_STORE_BATCHING_ENABLED is enabled, and IAA is configured
> > as the zswap compressor, zswap_store() will process the pages in a large
> > folio in batches, i.e., multiple pages at a time. Pages in a batch will be
> > compressed in parallel in hardware, then stored. On systems without Intel
> > IAA and/or if zswap uses software compressors, pages in the batch will be
> > compressed sequentially and stored.
> >
> > The patch also implements a zswap API that returns the status of this
> > config variable.
> 
> If we are compressing a large folio and batching is an option, is it
> ever correct *not* to batch? Why is the config option needed?

Thanks Yosry, for the code review comments! This is a good point. The main
consideration here was not to impact software compressors run on non-Intel
platforms, and only incur the memory footprint cost of multiple
acomp_req/buffers in "struct crypto_acomp_ctx" if there is IAA to reduce
latency with parallel compressions.

If the memory footprint cost is acceptable, there is no reason not to do
batching, even if compressions are sequential. We could amortize the cost
of the cgroup charging/objcg/stats updates.
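
For instance, a sketch of that amortization ("total_compressed_len" and
"nr_stored" are hypothetical batch totals, not identifiers from this
series; the helpers are the existing per-page ones in mm/zswap.c):

	/* once per successfully-stored batch, instead of once per page: */
	if (objcg)
		obj_cgroup_charge_zswap(objcg, total_compressed_len);
	count_vm_events(ZSWPOUT, nr_stored);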

Thanks,
Kanchana

> 
> >
> > Suggested-by: Ying Huang <ying.huang@intel.com>
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  include/linux/zswap.h |  6 ++++++
> >  mm/Kconfig            | 12 ++++++++++++
> >  mm/zswap.c            | 14 ++++++++++++++
> >  3 files changed, 32 insertions(+)
> >
> > diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> > index d961ead91bf1..74ad2a24b309 100644
> > --- a/include/linux/zswap.h
> > +++ b/include/linux/zswap.h
> > @@ -24,6 +24,7 @@ struct zswap_lruvec_state {
> >         atomic_long_t nr_disk_swapins;
> >  };
> >
> > +bool zswap_store_batching_enabled(void);
> >  unsigned long zswap_total_pages(void);
> >  bool zswap_store(struct folio *folio);
> >  bool zswap_load(struct folio *folio);
> > @@ -39,6 +40,11 @@ bool zswap_never_enabled(void);
> >
> >  struct zswap_lruvec_state {};
> >
> > +static inline bool zswap_store_batching_enabled(void)
> > +{
> > +       return false;
> > +}
> > +
> >  static inline bool zswap_store(struct folio *folio)
> >  {
> >         return false;
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 33fa51d608dc..26d1a5cee471 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -125,6 +125,18 @@ config ZSWAP_COMPRESSOR_DEFAULT
> >         default "zstd" if ZSWAP_COMPRESSOR_DEFAULT_ZSTD
> >         default ""
> >
> > +config ZSWAP_STORE_BATCHING_ENABLED
> > +       bool "Batching of zswap stores with Intel IAA"
> > +       depends on ZSWAP && CRYPTO_DEV_IAA_CRYPTO
> > +       default n
> > +       help
> > +       Enables zswap_store to swapout large folios in batches of 8 pages,
> > +       rather than a page at a time, if the system has Intel IAA for hardware
> > +       acceleration of compressions. If IAA is configured as the zswap
> > +       compressor, this will parallelize batch compression of up to 8 pages
> > +       in the folio in hardware, thereby improving large folio compression
> > +       throughput and reducing swapout latency.
> > +
> >  choice
> >         prompt "Default allocator"
> >         depends on ZSWAP
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 948c9745ee57..4893302d8c34 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -127,6 +127,15 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
> >                 CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
> >  module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
> >
> > +/*
> > + * Enable/disable batching of compressions if zswap_store is called with a
> > + * large folio. If enabled, and if IAA is the zswap compressor, pages are
> > + * compressed in parallel in batches of say, 8 pages.
> > + * If not, every page is compressed sequentially.
> > + */
> > +static bool __zswap_store_batching_enabled = IS_ENABLED(
> > +       CONFIG_ZSWAP_STORE_BATCHING_ENABLED);
> > +
> >  bool zswap_is_enabled(void)
> >  {
> >         return zswap_enabled;
> > @@ -241,6 +250,11 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
> >         pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name,         \
> >                  zpool_get_type((p)->zpool))
> >
> > +__always_inline bool zswap_store_batching_enabled(void)
> > +{
> > +       return __zswap_store_batching_enabled;
> > +}
> > +
> >  /*********************************
> >  * pool functions
> >  **********************************/
> > --
> > 2.27.0
> >

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [RFC PATCH v1 10/13] mm: zswap: Create multiple reqs/buffers in crypto_acomp_ctx if platform has IAA.
  2024-10-23  0:51   ` Yosry Ahmed
@ 2024-10-23  2:19     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 36+ messages in thread
From: Sridhar, Kanchana P @ 2024-10-23  2:19 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, Huang, Ying, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, viro, brauner, jack, mcgrof, kees,
	joel.granados, bfoster, willy, linux-fsdevel, Feghali, Wajdi K,
	Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, October 22, 2024 5:52 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> linux-crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; zanussi@kernel.org; viro@zeniv.linux.org.uk;
> brauner@kernel.org; jack@suse.cz; mcgrof@kernel.org; kees@kernel.org;
> joel.granados@kernel.org; bfoster@redhat.com; willy@infradead.org; linux-
> fsdevel@vger.kernel.org; Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal,
> Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [RFC PATCH v1 10/13] mm: zswap: Create multiple reqs/buffers
> in crypto_acomp_ctx if platform has IAA.
> 
> On Thu, Oct 17, 2024 at 11:41 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Intel IAA hardware acceleration can be used effectively to improve the
> > zswap_store() performance of large folios by batching multiple pages in a
> > folio to be compressed in parallel by IAA. Hence, to build compress batching
> > of zswap large folio stores using IAA, we need to be able to submit a batch
> > of compress jobs from zswap to the hardware to compress in parallel if the
> > iaa_crypto "async" mode is used.
> >
> > The IAA compress batching paradigm works as follows:
> >
> >  1) Submit N crypto_acomp_compress() jobs using N requests.
> >  2) Use the iaa_crypto driver async poll() method to check for the jobs
> >     to complete.
> >  3) There are no ordering constraints implied by submission, hence we
> >     could loop through the requests and process any job that has
> >     completed.
> >  4) This would repeat until all jobs have completed with success/error
> >     status.
> >
> > To facilitate this, we need to provide for multiple acomp_reqs in
> > "struct crypto_acomp_ctx", each representing a distinct compress
> > job. Likewise, there needs to be a distinct destination buffer
> > corresponding to each acomp_req.
> >
> > If CONFIG_ZSWAP_STORE_BATCHING_ENABLED is enabled, this patch will set the
> > SWAP_CRYPTO_SUB_BATCH_SIZE constant to 8UL. This implies each per-cpu
> > crypto_acomp_ctx associated with the zswap_pool can submit up to 8
> > acomp_reqs at a time to accomplish parallel compressions.
> >
> > If IAA is not present and/or CONFIG_ZSWAP_STORE_BATCHING_ENABLED is not
> > set, SWAP_CRYPTO_SUB_BATCH_SIZE will be set to 1UL.
> >
> > On an Intel Sapphire Rapids server, each socket has 4 IAA, each of which
> > has 2 compress engines and 8 decompress engines. Experiments modeling a
> > contended system with say 72 processes running under a cgroup with a fixed
> > memory-limit, have shown that there is a significant performance
> > improvement with dispatching compress jobs from all cores to all the
> > IAA devices on the socket. Hence, SWAP_CRYPTO_SUB_BATCH_SIZE is set to
> > 8 to maximize compression throughput if IAA is available.
> >
> > The definition of "struct crypto_acomp_ctx" is modified to make the
> > req/buffer be arrays of size SWAP_CRYPTO_SUB_BATCH_SIZE. Thus, the
> > added memory footprint cost of this per-cpu structure for batching is
> > incurred only for platforms that have Intel IAA.
> >
> > Suggested-by: Ying Huang <ying.huang@intel.com>
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> 
> Does this really need to be done in zswap? Why can't zswap submit a
> single compression request with the supported number of pages, and
> have the driver handle it as it sees fit?

For sure, this approach would work well too.
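
Roughly, the caller would then issue one crypto_acomp_compress() on a
multi-page request and the driver would fan it out into per-page
descriptors; a hypothetical driver-side sketch (none of these helper
names exist in this series):

	/* inside the acomp backend, for illustration only */
	for (i = 0; i < nr_pages; i++)
		submit_page_desc(wq, req, i);	/* one hw descriptor per page */
	poll_all_page_descs(wq, req);		/* completions may be out of order */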

Thanks,
Kanchana

> 
> > ---
> >  mm/swap.h  |  11 ++++++
> >  mm/zswap.c | 104 ++++++++++++++++++++++++++++++++++-------------------
> >  2 files changed, 78 insertions(+), 37 deletions(-)
> >
> > diff --git a/mm/swap.h b/mm/swap.h
> > index ad2f121de970..566616c971d4 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -8,6 +8,17 @@ struct mempolicy;
> >  #include <linux/swapops.h> /* for swp_offset */
> >  #include <linux/blk_types.h> /* for bio_end_io_t */
> >
> > +/*
> > + * For IAA compression batching:
> > + * Maximum number of IAA acomp compress requests that will be
> processed
> > + * in a sub-batch.
> > + */
> > +#if defined(CONFIG_ZSWAP_STORE_BATCHING_ENABLED)
> > +#define SWAP_CRYPTO_SUB_BATCH_SIZE 8UL
> > +#else
> > +#define SWAP_CRYPTO_SUB_BATCH_SIZE 1UL
> > +#endif
> > +
> >  /* linux/mm/page_io.c */
> >  int sio_pool_init(void);
> >  struct swap_iocb;
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 4893302d8c34..579869d1bdf6 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -152,9 +152,9 @@ bool zswap_never_enabled(void)
> >
> >  struct crypto_acomp_ctx {
> >         struct crypto_acomp *acomp;
> > -       struct acomp_req *req;
> > +       struct acomp_req *req[SWAP_CRYPTO_SUB_BATCH_SIZE];
> > +       u8 *buffer[SWAP_CRYPTO_SUB_BATCH_SIZE];
> >         struct crypto_wait wait;
> > -       u8 *buffer;
> >         struct mutex mutex;
> >         bool is_sleepable;
> >  };
> > @@ -832,49 +832,64 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> >         struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> >         struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> >         struct crypto_acomp *acomp;
> > -       struct acomp_req *req;
> >         int ret;
> > +       int i, j;
> >
> >         mutex_init(&acomp_ctx->mutex);
> >
> > -       acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
> > -       if (!acomp_ctx->buffer)
> > -               return -ENOMEM;
> > -
> >         acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
> >         if (IS_ERR(acomp)) {
> >                 pr_err("could not alloc crypto acomp %s : %ld\n",
> >                                 pool->tfm_name, PTR_ERR(acomp));
> > -               ret = PTR_ERR(acomp);
> > -               goto acomp_fail;
> > +               return PTR_ERR(acomp);
> >         }
> >         acomp_ctx->acomp = acomp;
> >         acomp_ctx->is_sleepable = acomp_is_async(acomp);
> >
> > -       req = acomp_request_alloc(acomp_ctx->acomp);
> > -       if (!req) {
> > -               pr_err("could not alloc crypto acomp_request %s\n",
> > -                      pool->tfm_name);
> > -               ret = -ENOMEM;
> > -               goto req_fail;
> > +       for (i = 0; i < SWAP_CRYPTO_SUB_BATCH_SIZE; ++i) {
> > +               acomp_ctx->buffer[i] = kmalloc_node(PAGE_SIZE * 2,
> > +                                               GFP_KERNEL, cpu_to_node(cpu));
> > +               if (!acomp_ctx->buffer[i]) {
> > +                       for (j = 0; j < i; ++j)
> > +                               kfree(acomp_ctx->buffer[j]);
> > +                       ret = -ENOMEM;
> > +                       goto buf_fail;
> > +               }
> > +       }
> > +
> > +       for (i = 0; i < SWAP_CRYPTO_SUB_BATCH_SIZE; ++i) {
> > +               acomp_ctx->req[i] = acomp_request_alloc(acomp_ctx->acomp);
> > +               if (!acomp_ctx->req[i]) {
> > +                       pr_err("could not alloc crypto acomp_request req[%d] %s\n",
> > +                              i, pool->tfm_name);
> > +                       for (j = 0; j < i; ++j)
> > +                               acomp_request_free(acomp_ctx->req[j]);
> > +                       ret = -ENOMEM;
> > +                       goto req_fail;
> > +               }
> >         }
> > -       acomp_ctx->req = req;
> >
> > +       /*
> > +        * The crypto_wait is used only in the fully synchronous case, i.e.,
> > +        * with scomp or the non-poll mode of acomp; hence there is only one
> > +        * "wait" per acomp_ctx, with the callback set on req[0].
> > +        */
> >         crypto_init_wait(&acomp_ctx->wait);
> >         /*
> >          * if the backend of acomp is async zip, crypto_req_done() will wakeup
> >          * crypto_wait_req(); if the backend of acomp is scomp, the callback
> >          * won't be called, crypto_wait_req() will return without blocking.
> >          */
> > -       acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
> > +       acomp_request_set_callback(acomp_ctx->req[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
> >                                    crypto_req_done, &acomp_ctx->wait);
> >
> >         return 0;
> >
> >  req_fail:
> > +       for (i = 0; i < SWAP_CRYPTO_SUB_BATCH_SIZE; ++i)
> > +               kfree(acomp_ctx->buffer[i]);
> > +buf_fail:
> >         crypto_free_acomp(acomp_ctx->acomp);
> > -acomp_fail:
> > -       kfree(acomp_ctx->buffer);
> >         return ret;
> >  }
> >
> > @@ -884,11 +899,17 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
> >         struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> >
> >         if (!IS_ERR_OR_NULL(acomp_ctx)) {
> > -               if (!IS_ERR_OR_NULL(acomp_ctx->req))
> > -                       acomp_request_free(acomp_ctx->req);
> > +               int i;
> > +
> > +               for (i = 0; i < SWAP_CRYPTO_SUB_BATCH_SIZE; ++i)
> > +                       if (!IS_ERR_OR_NULL(acomp_ctx->req[i]))
> > +                               acomp_request_free(acomp_ctx->req[i]);
> > +
> > +               for (i = 0; i < SWAP_CRYPTO_SUB_BATCH_SIZE; ++i)
> > +                       kfree(acomp_ctx->buffer[i]);
> > +
> >                 if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> >                         crypto_free_acomp(acomp_ctx->acomp);
> > -               kfree(acomp_ctx->buffer);
> >         }
> >
> >         return 0;
> > @@ -911,7 +932,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> >
> >         mutex_lock(&acomp_ctx->mutex);
> >
> > -       dst = acomp_ctx->buffer;
> > +       dst = acomp_ctx->buffer[0];
> >         sg_init_table(&input, 1);
> >         sg_set_page(&input, page, PAGE_SIZE, 0);
> >
> > @@ -921,7 +942,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> >          * giving the dst buffer with enough length to avoid buffer overflow.
> >          */
> >         sg_init_one(&output, dst, PAGE_SIZE * 2);
> > -       acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
> > +       acomp_request_set_params(acomp_ctx->req[0], &input, &output, PAGE_SIZE, dlen);
> >
> >         /*
> >          * If the crypto_acomp provides an asynchronous poll() interface,
> > @@ -940,19 +961,20 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> >          * parallel.
> >          */
> >         if (acomp_ctx->acomp->poll) {
> > -               comp_ret = crypto_acomp_compress(acomp_ctx->req);
> > +               comp_ret = crypto_acomp_compress(acomp_ctx->req[0]);
> >                 if (comp_ret == -EINPROGRESS) {
> >                         do {
> > -                               comp_ret = crypto_acomp_poll(acomp_ctx->req);
> > +                               comp_ret = crypto_acomp_poll(acomp_ctx->req[0]);
> >                                 if (comp_ret && comp_ret != -EAGAIN)
> >                                         break;
> >                         } while (comp_ret);
> >                 }
> >         } else {
> > -               comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> > +               comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req[0]),
> > +                                          &acomp_ctx->wait);
> >         }
> >
> > -       dlen = acomp_ctx->req->dlen;
> > +       dlen = acomp_ctx->req[0]->dlen;
> >         if (comp_ret)
> >                 goto unlock;
> >
> > @@ -1006,31 +1028,39 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> >          */
> >         if ((acomp_ctx->is_sleepable && !zpool_can_sleep_mapped(zpool)) ||
> >             !virt_addr_valid(src)) {
> > -               memcpy(acomp_ctx->buffer, src, entry->length);
> > -               src = acomp_ctx->buffer;
> > +               memcpy(acomp_ctx->buffer[0], src, entry->length);
> > +               src = acomp_ctx->buffer[0];
> >                 zpool_unmap_handle(zpool, entry->handle);
> >         }
> >
> >         sg_init_one(&input, src, entry->length);
> >         sg_init_table(&output, 1);
> >         sg_set_folio(&output, folio, PAGE_SIZE, 0);
> > -       acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> > +       acomp_request_set_params(acomp_ctx->req[0], &input, &output,
> > +                                entry->length, PAGE_SIZE);
> >         if (acomp_ctx->acomp->poll) {
> > -               ret = crypto_acomp_decompress(acomp_ctx->req);
> > +               ret = crypto_acomp_decompress(acomp_ctx->req[0]);
> >                 if (ret == -EINPROGRESS) {
> >                         do {
> > -                               ret = crypto_acomp_poll(acomp_ctx->req);
> > +                               ret = crypto_acomp_poll(acomp_ctx->req[0]);
> >                                 BUG_ON(ret && ret != -EAGAIN);
> >                         } while (ret);
> >                 }
> >         } else {
> > -               BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> > +               BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req[0]),
> > +                                      &acomp_ctx->wait));
> >         }
> > -       BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> > -       mutex_unlock(&acomp_ctx->mutex);
> > +       BUG_ON(acomp_ctx->req[0]->dlen != PAGE_SIZE);
> >
> > -       if (src != acomp_ctx->buffer)
> > +       if (src != acomp_ctx->buffer[0])
> >                 zpool_unmap_handle(zpool, entry->handle);
> > +
> > +       /*
> > +        * It is safer to unlock the mutex after the check for
> > +        * "src != acomp_ctx->buffer[0]" so that the value of "src"
> > +        * does not change.
> > +        */
> > +       mutex_unlock(&acomp_ctx->mutex);
> >  }
> >
> >  /*********************************
> > --
> > 2.27.0
> >

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [RFC PATCH v1 11/13] mm: swap: Add IAA batch compression API swap_crypto_acomp_compress_batch().
  2024-10-23  0:53   ` Yosry Ahmed
@ 2024-10-23  2:21     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 36+ messages in thread
From: Sridhar, Kanchana P @ 2024-10-23  2:21 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, Huang, Ying, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, viro, brauner, jack, mcgrof, kees,
	joel.granados, bfoster, willy, linux-fsdevel, Feghali, Wajdi K,
	Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, October 22, 2024 5:53 PM
> Subject: Re: [RFC PATCH v1 11/13] mm: swap: Add IAA batch compression API
> swap_crypto_acomp_compress_batch().
> 
> On Thu, Oct 17, 2024 at 11:41 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Added a new API swap_crypto_acomp_compress_batch() that does batch
> > compression. A system that has Intel IAA can avail of this API to submit a
> > batch of compress jobs for parallel compression in the hardware, to improve
> > performance. On a system without IAA, this API will process each compress
> > job sequentially.
> >
> > The purpose of this API is to be invocable from any swap module that needs
> > to compress large folios, or a batch of pages in the general case. For
> > instance, zswap would batch compress up to SWAP_CRYPTO_SUB_BATCH_SIZE
> > (i.e. 8 if the system has IAA) pages in the large folio in parallel to
> > improve zswap_store() performance.
> >
> > Towards this eventual goal:
> >
> > 1) The definition of "struct crypto_acomp_ctx" is moved to mm/swap.h
> >    so that mm modules like swap_state.c and zswap.c can reference it.
> > 2) The swap_crypto_acomp_compress_batch() interface is implemented in
> >    swap_state.c.
> >
> > It would be preferable for "struct crypto_acomp_ctx" to be defined in,
> > and for swap_crypto_acomp_compress_batch() to be exported via
> > include/linux/swap.h so that modules outside mm (e.g., zram) can
> > potentially use the API for batch compressions with IAA. I would
> > appreciate RFC comments on this.
> 
> Same question as the last patch, why does this need to be in the swap
> code? Why can't zswap just submit a single request to compress a large
> folio or a range of contiguous subpages at once?

Sure, this would also work well.
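
Just to sketch what that could look like (a rough sketch; it assumes the
algorithm/driver can consume a multi-page source scatterlist in a single
request, and that the "dst" buffer is sized for the worst case):

	struct scatterlist input, output;
	unsigned int dlen = 2 * folio_size(folio);
	int err;

	sg_init_table(&input, 1);
	/* One request spanning every page of the large folio. */
	sg_set_folio(&input, folio, folio_size(folio), 0);
	sg_init_one(&output, dst, 2 * folio_size(folio));
	acomp_request_set_params(acomp_ctx->req[0], &input, &output,
				 folio_size(folio), dlen);
	err = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req[0]),
			      &acomp_ctx->wait);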

Thanks,
Kanchana

> 
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/swap.h       |  45 +++++++++++++++++++
> >  mm/swap_state.c | 115 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  mm/zswap.c      |   9 ----
> >  3 files changed, 160 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/swap.h b/mm/swap.h
> > index 566616c971d4..4dcb67e2cc33 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -7,6 +7,7 @@ struct mempolicy;
> >  #ifdef CONFIG_SWAP
> >  #include <linux/swapops.h> /* for swp_offset */
> >  #include <linux/blk_types.h> /* for bio_end_io_t */
> > +#include <linux/crypto.h>
> >
> >  /*
> >   * For IAA compression batching:
> > @@ -19,6 +20,39 @@ struct mempolicy;
> >  #define SWAP_CRYPTO_SUB_BATCH_SIZE 1UL
> >  #endif
> >
> > +/* linux/mm/swap_state.c, zswap.c */
> > +struct crypto_acomp_ctx {
> > +       struct crypto_acomp *acomp;
> > +       struct acomp_req *req[SWAP_CRYPTO_SUB_BATCH_SIZE];
> > +       u8 *buffer[SWAP_CRYPTO_SUB_BATCH_SIZE];
> > +       struct crypto_wait wait;
> > +       struct mutex mutex;
> > +       bool is_sleepable;
> > +};
> > +
> > +/**
> > + * This API provides IAA compress batching functionality for use by swap
> > + * modules.
> > + * The acomp_ctx mutex should be locked/unlocked before/after calling this
> > + * procedure.
> > + *
> > + * @pages: Pages to be compressed.
> > + * @dsts: Pre-allocated destination buffers to store results of IAA compression.
> > + * @dlens: Will contain the compressed lengths.
> > + * @errors: Will contain a 0 if the page was successfully compressed, or a
> > + *          non-0 error value to be processed by the calling function.
> > + * @nr_pages: The number of pages, up to SWAP_CRYPTO_SUB_BATCH_SIZE,
> > + *            to be compressed.
> > + * @acomp_ctx: The acomp context for iaa_crypto/other compressor.
> > + */
> > +void swap_crypto_acomp_compress_batch(
> > +       struct page *pages[],
> > +       u8 *dsts[],
> > +       unsigned int dlens[],
> > +       int errors[],
> > +       int nr_pages,
> > +       struct crypto_acomp_ctx *acomp_ctx);
> > +
> >  /* linux/mm/page_io.c */
> >  int sio_pool_init(void);
> >  struct swap_iocb;
> > @@ -119,6 +153,17 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
> >
> >  #else /* CONFIG_SWAP */
> >  struct swap_iocb;
> > +struct crypto_acomp_ctx {};
> > +static inline void swap_crypto_acomp_compress_batch(
> > +       struct page *pages[],
> > +       u8 *dsts[],
> > +       unsigned int dlens[],
> > +       int errors[],
> > +       int nr_pages,
> > +       struct crypto_acomp_ctx *acomp_ctx)
> > +{
> > +}
> > +
> >  static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
> >  {
> >  }
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 4669f29cf555..117c3caa5679 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -23,6 +23,8 @@
> >  #include <linux/swap_slots.h>
> >  #include <linux/huge_mm.h>
> >  #include <linux/shmem_fs.h>
> > +#include <linux/scatterlist.h>
> > +#include <crypto/acompress.h>
> >  #include "internal.h"
> >  #include "swap.h"
> >
> > @@ -742,6 +744,119 @@ void exit_swap_address_space(unsigned int type)
> >         swapper_spaces[type] = NULL;
> >  }
> >
> > +#ifdef CONFIG_SWAP
> > +
> > +/**
> > + * This API provides IAA compress batching functionality for use by swap
> > + * modules.
> > + * The acomp_ctx mutex should be locked/unlocked before/after calling this
> > + * procedure.
> > + *
> > + * @pages: Pages to be compressed.
> > + * @dsts: Pre-allocated destination buffers to store results of IAA compression.
> > + * @dlens: Will contain the compressed lengths.
> > + * @errors: Will contain a 0 if the page was successfully compressed, or a
> > + *          non-0 error value to be processed by the calling function.
> > + * @nr_pages: The number of pages, up to SWAP_CRYPTO_SUB_BATCH_SIZE,
> > + *            to be compressed.
> > + * @acomp_ctx: The acomp context for iaa_crypto/other compressor.
> > + */
> > +void swap_crypto_acomp_compress_batch(
> > +       struct page *pages[],
> > +       u8 *dsts[],
> > +       unsigned int dlens[],
> > +       int errors[],
> > +       int nr_pages,
> > +       struct crypto_acomp_ctx *acomp_ctx)
> > +{
> > +       struct scatterlist inputs[SWAP_CRYPTO_SUB_BATCH_SIZE];
> > +       struct scatterlist outputs[SWAP_CRYPTO_SUB_BATCH_SIZE];
> > +       bool compressions_done = false;
> > +       int i, j;
> > +
> > +       BUG_ON(nr_pages > SWAP_CRYPTO_SUB_BATCH_SIZE);
> > +
> > +       /*
> > +        * Prepare and submit acomp_reqs to IAA.
> > +        * IAA will process these compress jobs in parallel in async mode.
> > +        * If the compressor does not support a poll() method, or if IAA is
> > +        * used in sync mode, the jobs will be processed sequentially using
> > +        * acomp_ctx->req[0] and acomp_ctx->wait.
> > +        */
> > +       for (i = 0; i < nr_pages; ++i) {
> > +               j = acomp_ctx->acomp->poll ? i : 0;
> > +               sg_init_table(&inputs[i], 1);
> > +               sg_set_page(&inputs[i], pages[i], PAGE_SIZE, 0);
> > +
> > +               /*
> > +                * Each acomp_ctx->buffer[] is of size (PAGE_SIZE * 2).
> > +                * Reflect same in sg_list.
> > +                */
> > +               sg_init_one(&outputs[i], dsts[i], PAGE_SIZE * 2);
> > +               acomp_request_set_params(acomp_ctx->req[j], &inputs[i],
> > +                                        &outputs[i], PAGE_SIZE, dlens[i]);
> > +
> > +               /*
> > +                * If the crypto_acomp provides an asynchronous poll()
> > +                * interface, submit the request to the driver now, and poll for
> > +                * a completion status later, after all descriptors have been
> > +                * submitted. If the crypto_acomp does not provide a poll()
> > +                * interface, submit the request and wait for it to complete,
> > +                * i.e., synchronously, before moving on to the next request.
> > +                */
> > +               if (acomp_ctx->acomp->poll) {
> > +                       errors[i] = crypto_acomp_compress(acomp_ctx->req[j]);
> > +
> > +                       if (errors[i] != -EINPROGRESS)
> > +                               errors[i] = -EINVAL;
> > +                       else
> > +                               errors[i] = -EAGAIN;
> > +               } else {
> > +                       errors[i] = crypto_wait_req(
> > +                                             crypto_acomp_compress(acomp_ctx->req[j]),
> > +                                             &acomp_ctx->wait);
> > +                       if (!errors[i])
> > +                               dlens[i] = acomp_ctx->req[j]->dlen;
> > +               }
> > +       }
> > +
> > +       /*
> > +        * If not doing async compressions, the batch has been processed at
> > +        * this point and we can return.
> > +        */
> > +       if (!acomp_ctx->acomp->poll)
> > +               return;
> > +
> > +       /*
> > +        * Poll for and process IAA compress job completions
> > +        * in out-of-order manner.
> > +        */
> > +       while (!compressions_done) {
> > +               compressions_done = true;
> > +
> > +               for (i = 0; i < nr_pages; ++i) {
> > +                       /*
> > +                        * Skip, if the compression has already completed
> > +                        * successfully or with an error.
> > +                        */
> > +                       if (errors[i] != -EAGAIN)
> > +                               continue;
> > +
> > +                       errors[i] = crypto_acomp_poll(acomp_ctx->req[i]);
> > +
> > +                       if (errors[i]) {
> > +                               if (errors[i] == -EAGAIN)
> > +                                       compressions_done = false;
> > +                       } else {
> > +                               dlens[i] = acomp_ctx->req[i]->dlen;
> > +                       }
> > +               }
> > +       }
> > +}
> > +EXPORT_SYMBOL_GPL(swap_crypto_acomp_compress_batch);
> > +
> > +#endif /* CONFIG_SWAP */
> > +
> >  static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start,
> >                            unsigned long *end)
> >  {
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 579869d1bdf6..cab3114321f9 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -150,15 +150,6 @@ bool zswap_never_enabled(void)
> >  * data structures
> >  **********************************/
> >
> > -struct crypto_acomp_ctx {
> > -       struct crypto_acomp *acomp;
> > -       struct acomp_req *req[SWAP_CRYPTO_SUB_BATCH_SIZE];
> > -       u8 *buffer[SWAP_CRYPTO_SUB_BATCH_SIZE];
> > -       struct crypto_wait wait;
> > -       struct mutex mutex;
> > -       bool is_sleepable;
> > -};
> > -
> >  /*
> >   * The lock ordering is zswap_tree.lock -> zswap_pool.lru_lock.
> >   * The only case where lru_lock is not acquired while holding tree.lock is
> > --
> > 2.27.0
> >

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [RFC PATCH v1 00/13] zswap IAA compress batching
  2024-10-23  0:56 ` [RFC PATCH v1 00/13] zswap IAA compress batching Yosry Ahmed
@ 2024-10-23  2:53   ` Sridhar, Kanchana P
  2024-10-23 18:15     ` Yosry Ahmed
  0 siblings, 1 reply; 36+ messages in thread
From: Sridhar, Kanchana P @ 2024-10-23  2:53 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, Huang, Ying, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, viro, brauner, jack, mcgrof, kees,
	joel.granados, bfoster, willy, linux-fsdevel, Feghali, Wajdi K,
	Gopal, Vinodh, Sridhar, Kanchana P

Hi Yosry,

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, October 22, 2024 5:57 PM
> Subject: Re: [RFC PATCH v1 00/13] zswap IAA compress batching
> 
> On Thu, Oct 17, 2024 at 11:41 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> >
> > IAA Compression Batching:
> > =========================
> >
> > This RFC patch-series introduces the use of the Intel Analytics Accelerator
> > (IAA) for parallel compression of pages in a folio, and for batched reclaim
> > of hybrid any-order batches of folios in shrink_folio_list().
> >
> > The patch-series is organized as follows:
> >
> >  1) iaa_crypto driver enablers for batching: Relevant patches are tagged
> >     with "crypto:" in the subject:
> >
> >     a) async poll crypto_acomp interface without interrupts.
> >     b) crypto testmgr acomp poll support.
> >     c) Modifying the default sync_mode to "async" and disabling
> >        verify_compress by default, to facilitate users to run IAA easily for
> >        comparison with software compressors.
> >     d) Changing the cpu-to-iaa mappings to more evenly balance cores to IAA
> >        devices.
> >     e) Addition of a "global_wq" per IAA, which can be used as a global
> >        resource for the socket. If the user configures 2WQs per IAA device,
> >        the driver will distribute compress jobs from all cores on the
> >        socket to the "global_wqs" of all the IAA devices on that socket, in
> >        a round-robin manner. This can be used to improve compression
> >        throughput for workloads that see a lot of swapout activity.
> >
> >  2) Migrating zswap to use async poll in zswap_compress()/decompress().
> >  3) A centralized batch compression API that can be used by swap modules.
> >  4) IAA compress batching within large folio zswap stores.
> >  5) IAA compress batching of any-order hybrid folios in
> >     shrink_folio_list(). The newly added "sysctl vm.compress-batchsize"
> >     parameter can be used to configure the number of folios in [1, 32] to
> >     be reclaimed using compress batching.
> 
> I am still digesting this series but I have some high level questions
> that I left on some patches. My intuition though is that we should
> drop (5) from the initial proposal as it's most controversial.
> Batching reclaim of unrelated folios through zswap *might* make sense,
> but it needs a broader conversation and it needs justification on its
> own merit, without the rest of the series.

Thanks for these suggestions!  Sure, I can drop (5) from the initial patch-set.
Agree also, this needs a broader discussion.

I believe the 4K folios usemem30 data in this patchset demonstrates the
benefits of batching reclaim and justifies it on its own merit. I added
data on batching reclaim with kernel compilation as part of the 4K folios
experiments in the IAA decompression batching patch-series [1].
I am listing it here as well, and will make sure to add this data in
subsequent revs.

--------------------------------------------------------------------------
 Kernel compilation in tmpfs/allmodconfig, 2G max memory:
 
 No large folios          mm-unstable-10-16-2024       shrink_folio_list()        
                                                       batching of folios     
 --------------------------------------------------------------------------
 zswap compressor         zstd       deflate-iaa       deflate-iaa   
 vm.compress-batchsize     n/a               n/a                32   
 vm.page-cluster             3                 3                 3   
 --------------------------------------------------------------------------
 real_sec               783.87            761.69            747.32   
 user_sec            15,750.07         15,716.69         15,728.39   
 sys_sec              6,522.32          5,725.28          5,399.44   
 Max_RSS_KB          1,872,640         1,870,848         1,874,432   
                                                                            
 zswpout            82,364,991        97,739,600       102,780,612   
 zswpin             21,303,393        27,684,166        29,016,252   
 pswpout                    13               222               213   
 pswpin                     12               209               202   
 pgmajfault         17,114,339        22,421,211        23,378,161   
 swap_ra             4,596,035         5,840,082         6,231,646   
 swap_ra_hit         2,903,249         3,682,444         3,940,420   
 --------------------------------------------------------------------------

The performance improvements seen do depend on compression batching in
the swap modules (zswap). The implementation in patch 12 of the compress
batching series sets up this zswap compression pipeline that takes an array of
folios and processes them in batches of 8 pages compressed in parallel in hardware.
That being said, we do see latency improvements even with reclaim batching
combined with zswap compress batching with zstd/lzo-rle/etc. I haven't done a
lot of analysis of this, but I am guessing fewer calls from the swap layer
(swap_writepage()) into zswap could have something to do with this. If we believe
that batching can be the right thing to do even for the software compressors,
I can gather batching data with zstd for v2.
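
For reference, a caller of the batch API from patch 11 would look roughly
like this (a simplified sketch; "pages" and "nr_pages" come from the caller,
and the zpool stores that follow compression are elided):

	unsigned int dlens[SWAP_CRYPTO_SUB_BATCH_SIZE];
	int errors[SWAP_CRYPTO_SUB_BATCH_SIZE];
	int i;

	for (i = 0; i < nr_pages; ++i)
		dlens[i] = PAGE_SIZE;

	mutex_lock(&acomp_ctx->mutex);
	/* Submits up to 8 compress jobs; with IAA in async poll mode they
	 * run in parallel in hardware, otherwise sequentially.
	 */
	swap_crypto_acomp_compress_batch(pages, acomp_ctx->buffer, dlens,
					 errors, nr_pages, acomp_ctx);
	mutex_unlock(&acomp_ctx->mutex);

	for (i = 0; i < nr_pages; ++i) {
		if (errors[i])
			continue;	/* fall back for this page */
		/* store dlens[i] compressed bytes from acomp_ctx->buffer[i] */
	}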


[1] https://patchwork.kernel.org/project/linux-mm/cover/20241018064805.336490-1-kanchana.p.sridhar@intel.com/

Thanks,
Kanchana

> 
> >
> > IAA compress batching can be enabled only on platforms that have IAA, by
> > setting this config variable:
> >
> >  CONFIG_ZSWAP_STORE_BATCHING_ENABLED="y"
> >
> > The performance testing data with usemem 30 instances shows throughput
> > gains of up to 40%, elapsed time reduction of up to 22% and sys time
> > reduction of up to 30% with IAA compression batching.
> >
> > Our internal validation of IAA compress/decompress batching in highly
> > > contended Sapphire Rapids server setups with workloads running on 72 cores
> > for ~25 minutes under stringent memory limit constraints have shown up to
> > 50% reduction in sys time and 3.5% reduction in workload run time as
> > compared to software compressors.
> >
> >
> > System setup for testing:
> > =========================
> > Testing of this patch-series was done with mm-unstable as of 10-16-2024,
> > commit 817952b8be34, without and with this patch-series.
> > Data was gathered on an Intel Sapphire Rapids server, dual-socket 56 cores
> > per socket, 4 IAA devices per socket, 503 GiB RAM and 525G SSD disk
> > partition swap. Core frequency was fixed at 2500MHz.
> >
> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
> > processes were run, each allocating and writing 10G of memory, and sleeping
> > for 10 sec before exiting:
> >
> > usemem --init-time -w -O -s 10 -n 30 10g
> >
> > Other kernel configuration parameters:
> >
> >     zswap compressor : deflate-iaa
> >     zswap allocator   : zsmalloc
> >     vm.page-cluster   : 2,4
> >
> > IAA "compression verification" is disabled and the async poll acomp
> > interface is used in the iaa_crypto driver (the defaults with this
> > series).
> >
> >
> > Performance testing (usemem30):
> > ===============================
> >
> >  4K folios: deflate-iaa:
> >  =======================
> >
> >  -------------------------------------------------------------------------------
> >                 mm-unstable-10-16-2024  shrink_folio_list()  shrink_folio_list()
> >                                          batching of folios   batching of folios
> >  -------------------------------------------------------------------------------
> >  zswap compressor          deflate-iaa          deflate-iaa          deflate-iaa
> >  vm.compress-batchsize             n/a                    1                   32
> >  vm.page-cluster                     2                    2                    2
> >  -------------------------------------------------------------------------------
> >  Total throughput            4,470,466            5,770,824            6,363,045
> >            (KB/s)
> >  Average throughput            149,015              192,360              212,101
> >            (KB/s)
> >  elapsed time                   119.24               100.96                92.99
> >         (sec)
> >  sys time (sec)               2,819.29             2,168.08             1,970.79
> >
> >  -------------------------------------------------------------------------------
> >  memcg_high                    668,185              646,357              613,421
> >  memcg_swap_fail                     0                    0                    0
> >  zswpout                    62,991,796           58,275,673           53,070,201
> >  zswpin                            431                  415                  396
> >  pswpout                             0                    0                    0
> >  pswpin                              0                    0                    0
> >  thp_swpout                          0                    0                    0
> >  thp_swpout_fallback                 0                    0                    0
> >  pgmajfault                      3,137                3,085                3,440
> >  swap_ra                            99                  100                   95
> >  swap_ra_hit                        42                   44                   45
> >  -------------------------------------------------------------------------------
> >
> >
> >  16k/32/64k folios: deflate-iaa:
> >  ===============================
> >  All three large folio sizes 16k/32/64k were enabled to "always".
> >
> >  -------------------------------------------------------------------------------
> >                 mm-unstable-  zswap_store()      + shrink_folio_list()
> >                   10-16-2024    batching of         batching of folios
> >                                    pages in
> >                                large folios
> >  -------------------------------------------------------------------------------
> >  zswap compr     deflate-iaa     deflate-iaa          deflate-iaa
> >  vm.compress-            n/a             n/a         4          8             16
> >  batchsize
> >  vm.page-                  2               2         2          2              2
> >   cluster
> >  -------------------------------------------------------------------------------
> >  Total throughput   7,182,198   8,448,994    8,584,728    8,729,643    8,775,944
> >            (KB/s)
> >  Avg throughput       239,406     281,633      286,157      290,988      292,531
> >          (KB/s)
> >  elapsed time           85.04       77.84        77.03        75.18        74.98
> >          (sec)
> >  sys time (sec)      1,730.77    1,527.40     1,528.52     1,473.76     1,465.97
> >
> >  -------------------------------------------------------------------------------
> >  memcg_high           648,125     694,188      696,004      699,728      724,887
> >  memcg_swap_fail        1,550       2,540        1,627        1,577        1,517
> >  zswpout           57,606,876  56,624,450   56,125,082    55,999,42   57,352,204
> >  zswpin                   421         406          422          400          437
> >  pswpout                    0           0            0            0            0
> >  pswpin                     0           0            0            0            0
> >  thp_swpout                 0           0            0            0            0
> >  thp_swpout_fallback        0           0            0            0            0
> >  16kB-mthp_swpout_          0           0            0            0            0
> >           fallback
> >  32kB-mthp_swpout_          0           0            0            0            0
> >           fallback
> >  64kB-mthp_swpout_      1,550       2,539        1,627        1,577        1,517
> >           fallback
> >  pgmajfault             3,102       3,126        3,473        3,454        3,134
> >  swap_ra                  107         144          109          124          181
> >  swap_ra_hit               51          88           45           66          107
> >  ZSWPOUT-16kB               2           3            4            4            3
> >  ZSWPOUT-32kB               0           2            1            1            0
> >  ZSWPOUT-64kB       3,598,889   3,536,556    3,506,134    3,498,324    3,582,921
> >  SWPOUT-16kB                0           0            0            0            0
> >  SWPOUT-32kB                0           0            0            0            0
> >  SWPOUT-64kB                0           0            0            0            0
> >  -------------------------------------------------------------------------------
> >
> >
> >  2M folios: deflate-iaa:
> >  =======================
> >
> >  -------------------------------------------------------------------------------
> >                    mm-unstable-10-16-2024    zswap_store() batching of pages
> >                                                       in pmd-mappable folios
> >  -------------------------------------------------------------------------------
> >  zswap compressor             deflate-iaa                deflate-iaa
> >  vm.compress-batchsize                n/a                        n/a
> >  vm.page-cluster                        2                          2
> >  -------------------------------------------------------------------------------
> >  Total throughput               7,444,592                 8,916,349
> >            (KB/s)
> >  Average throughput               248,153                   297,211
> >            (KB/s)
> >  elapsed time                       86.29                     73.44
> >         (sec)
> >  sys time (sec)                  1,833.21                  1,418.58
> >
> >  -------------------------------------------------------------------------------
> >  memcg_high                        81,786                    89,905
> >  memcg_swap_fail                       82                       395
> >  zswpout                       58,874,092                57,721,884
> >  zswpin                               422                       458
> >  pswpout                                0                         0
> >  pswpin                                 0                         0
> >  thp_swpout                             0                         0
> >  thp_swpout_fallback                   82                       394
> >  pgmajfault                        14,864                    21,544
> >  swap_ra                           34,953                    53,751
> >  swap_ra_hit                       34,895                    53,660
> >  ZSWPOUT-2048kB                   114,815                   112,269
> >  SWPOUT-2048kB                          0                         0
> >  -------------------------------------------------------------------------------
> >
> > Since 4K folios account for ~0.4% of all zswapouts when pmd-mappable folios
> > are enabled for usemem30, we cannot expect much improvement from reclaim
> > batching.
> >
> >
> > Performance testing (Kernel compilation):
> > =========================================
> >
> > As mentioned earlier, for workloads that see a lot of swapout activity, we
> > can benefit from configuring 2 WQs per IAA device, with compress jobs from
> > all same-socket cores being distributed to the wq.1 of all IAAs on the
> > socket, with the "global_wq" developed in this patch-series.
> >
> > Although this data includes IAA decompress batching, which will be
> > submitted as a separate RFC patch-series, I am listing it here to quantify
> > the benefit of distributing compress jobs among all IAAs. The kernel
> > compilation test with "allmodconfig" is able to quantify this well:
> >
> >
> >  4K folios: deflate-iaa: kernel compilation to quantify crypto patches
> >  =====================================================================
> >
> >
> >  ------------------------------------------------------------------------------
> >                    IAA shrink_folio_list() compress batching and
> >                        swapin_readahead() decompress batching
> >
> >                                       1WQ      2WQ (distribute compress jobs)
> >
> >                         1 local WQ (wq.0)    1 local WQ (wq.0) +
> >                                   per IAA    1 global WQ (wq.1) per IAA
> >
> >  ------------------------------------------------------------------------------
> >  zswap compressor             deflate-iaa         deflate-iaa
> >  vm.compress-batchsize                 32                  32
> >  vm.page-cluster                        4                   4
> >  ------------------------------------------------------------------------------
> >  real_sec                          746.77              745.42
> >  user_sec                       15,732.66           15,738.85
> >  sys_sec                         5,384.14            5,247.86
> >  Max_Res_Set_Size_KB            1,874,432           1,872,640
> >
> >  ------------------------------------------------------------------------------
> >  zswpout                      101,648,460         104,882,982
> >  zswpin                        27,418,319          29,428,515
> >  pswpout                              213                  22
> >  pswpin                               207                   6
> >  pgmajfault                    21,896,616          23,629,768
> >  swap_ra                        6,054,409           6,385,080
> >  swap_ra_hit                    3,791,628           3,985,141
> >  ------------------------------------------------------------------------------
> >
> > The iaa_crypto wq stats will show almost the same number of compress calls
> > for wq.1 of all IAA devices. wq.0 will handle decompress calls exclusively.
> > We see a latency reduction of 2.5% by distributing compress jobs among all
> > IAA devices on the socket.
> >
> > I would greatly appreciate code review comments for the iaa_crypto driver
> > and mm patches included in this series!
> >
> > Thanks,
> > Kanchana
> >
> >
> >
> > Kanchana P Sridhar (13):
> >   crypto: acomp - Add a poll() operation to acomp_alg and acomp_req
> >   crypto: iaa - Add support for irq-less crypto async interface
> >   crypto: testmgr - Add crypto testmgr acomp poll support.
> >   mm: zswap: zswap_compress()/decompress() can submit, then poll an
> >     acomp_req.
> >   crypto: iaa - Make async mode the default.
> >   crypto: iaa - Disable iaa_verify_compress by default.
> >   crypto: iaa - Change cpu-to-iaa mappings to evenly balance cores to
> >     IAAs.
> >   crypto: iaa - Distribute compress jobs to all IAA devices on a NUMA
> >     node.
> >   mm: zswap: Config variable to enable compress batching in
> >     zswap_store().
> >   mm: zswap: Create multiple reqs/buffers in crypto_acomp_ctx if
> >     platform has IAA.
> >   mm: swap: Add IAA batch compression API
> >     swap_crypto_acomp_compress_batch().
> >   mm: zswap: Compress batching with Intel IAA in zswap_store() of large
> >     folios.
> >   mm: vmscan, swap, zswap: Compress batching of folios in
> >     shrink_folio_list().
> >
> >  crypto/acompress.c                         |   1 +
> >  crypto/testmgr.c                           |  70 +-
> >  drivers/crypto/intel/iaa/iaa_crypto_main.c | 467 +++++++++++--
> >  include/crypto/acompress.h                 |  18 +
> >  include/crypto/internal/acompress.h        |   1 +
> >  include/linux/fs.h                         |   2 +
> >  include/linux/mm.h                         |   8 +
> >  include/linux/writeback.h                  |   5 +
> >  include/linux/zswap.h                      | 106 +++
> >  kernel/sysctl.c                            |   9 +
> >  mm/Kconfig                                 |  12 +
> >  mm/page_io.c                               | 152 +++-
> >  mm/swap.c                                  |  15 +
> >  mm/swap.h                                  |  96 +++
> >  mm/swap_state.c                            | 115 +++
> >  mm/vmscan.c                                | 154 +++-
> >  mm/zswap.c                                 | 771 +++++++++++++++++++--
> >  17 files changed, 1870 insertions(+), 132 deletions(-)
> >
> >
> > base-commit: 817952b8be34aad40e07f6832fb9d1fc08961550
> > --
> > 2.27.0
> >
> >

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v1 09/13] mm: zswap: Config variable to enable compress batching in zswap_store().
  2024-10-23  2:17     ` Sridhar, Kanchana P
@ 2024-10-23  2:58       ` Herbert Xu
  2024-10-23  3:06         ` Sridhar, Kanchana P
  2024-10-23 18:12       ` Yosry Ahmed
  1 sibling, 1 reply; 36+ messages in thread
From: Herbert Xu @ 2024-10-23  2:58 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: Yosry Ahmed, linux-kernel, linux-mm, hannes, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, linux-crypto, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, viro, brauner, jack, mcgrof, kees,
	joel.granados, bfoster, willy, linux-fsdevel, Feghali, Wajdi K,
	Gopal, Vinodh

On Wed, Oct 23, 2024 at 02:17:06AM +0000, Sridhar, Kanchana P wrote:
>
> Thanks Yosry, for the code review comments! This is a good point. The main
> consideration here was not to impact software compressors run on non-Intel
> platforms, and only incur the memory footprint cost of multiple
> acomp_req/buffers in "struct crypto_acomp_ctx" if there is IAA to reduce
> latency with parallel compressions.

I'm working on a batching mechanism for the crypto_ahash interface,
where the requests are simply chained together and then submitted.

The same mechanism should work for crypto_acomp as well:

+       for (i = 0; i < num_mb; ++i) {
+               if (testmgr_alloc_buf(data[i].xbuf))
+                       goto out;
+
+               crypto_init_wait(&data[i].wait);
+
+               data[i].req = ahash_request_alloc(tfm, GFP_KERNEL);
+               if (!data[i].req) {
+                       pr_err("alg: hash: Failed to allocate request for %s\n",
+                              algo);
+                       goto out;
+               }
+
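+               /*
+                * Request 0 is the chain head and carries the completion
+                * callback/wait; requests 1..num_mb-1 are chained onto it.
+                */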
+               if (i)
+                       ahash_request_chain(data[i].req, data[0].req);
+               else
+                       ahash_reqchain_init(data[i].req, 0, crypto_req_done,
+                                           &data[i].wait);
+
+               sg_init_table(data[i].sg, XBUFSIZE);
+               for (j = 0; j < XBUFSIZE; j++) {
+                       sg_set_buf(data[i].sg + j, data[i].xbuf[j], PAGE_SIZE);
+                       memset(data[i].xbuf[j], 0xff, PAGE_SIZE);
+               }
+       }

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [RFC PATCH v1 09/13] mm: zswap: Config variable to enable compress batching in zswap_store().
  2024-10-23  2:58       ` Herbert Xu
@ 2024-10-23  3:06         ` Sridhar, Kanchana P
  0 siblings, 0 replies; 36+ messages in thread
From: Sridhar, Kanchana P @ 2024-10-23  3:06 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Yosry Ahmed, linux-kernel, linux-mm, hannes, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, linux-crypto, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, viro, brauner, jack, mcgrof, kees,
	joel.granados, bfoster, willy, linux-fsdevel, Feghali, Wajdi K,
	Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Tuesday, October 22, 2024 7:58 PM
> Subject: Re: [RFC PATCH v1 09/13] mm: zswap: Config variable to enable
> compress batching in zswap_store().
> 
> On Wed, Oct 23, 2024 at 02:17:06AM +0000, Sridhar, Kanchana P wrote:
> >
> > Thanks Yosry, for the code review comments! This is a good point. The main
> > consideration here was not to impact software compressors run on non-Intel
> > platforms, and only incur the memory footprint cost of multiple
> > acomp_req/buffers in "struct crypto_acomp_ctx" if there is IAA to reduce
> > latency with parallel compressions.
> 
> I'm working on a batching mechanism for crypto_ahash interface,
> where the requests are simply chained together and then submitted.
> 
> The same mechanism should work for crypto_acomp as well:
> 
> +       for (i = 0; i < num_mb; ++i) {
> +               if (testmgr_alloc_buf(data[i].xbuf))
> +                       goto out;
> +
> +               crypto_init_wait(&data[i].wait);
> +
> +               data[i].req = ahash_request_alloc(tfm, GFP_KERNEL);
> +               if (!data[i].req) {
> +                       pr_err("alg: hash: Failed to allocate request for %s\n",
> +                              algo);
> +                       goto out;
> +               }
> +
> +               if (i)
> +                       ahash_request_chain(data[i].req, data[0].req);
> +               else
> +                       ahash_reqchain_init(data[i].req, 0, crypto_req_done,
> +                                           &data[i].wait);
> +
> +               sg_init_table(data[i].sg, XBUFSIZE);
> +               for (j = 0; j < XBUFSIZE; j++) {
> +                       sg_set_buf(data[i].sg + j, data[i].xbuf[j], PAGE_SIZE);
> +                       memset(data[i].xbuf[j], 0xff, PAGE_SIZE);
> +               }
> +       }

Thanks Herbert, for letting us know! We will look forward to these
changes when they are ready, and look into incorporating them into the
iaa_crypto driver.

Thanks,
Kanchana

> 
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v1 09/13] mm: zswap: Config variable to enable compress batching in zswap_store().
  2024-10-23  2:17     ` Sridhar, Kanchana P
  2024-10-23  2:58       ` Herbert Xu
@ 2024-10-23 18:12       ` Yosry Ahmed
  2024-10-23 20:32         ` Sridhar, Kanchana P
  1 sibling, 1 reply; 36+ messages in thread
From: Yosry Ahmed @ 2024-10-23 18:12 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, Huang, Ying, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, viro, brauner, jack, mcgrof, kees,
	joel.granados, bfoster, willy, linux-fsdevel, Feghali, Wajdi K,
	Gopal, Vinodh

On Tue, Oct 22, 2024 at 7:17 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@google.com>
> > Sent: Tuesday, October 22, 2024 5:50 PM
> > Subject: Re: [RFC PATCH v1 09/13] mm: zswap: Config variable to enable
> > compress batching in zswap_store().
> >
> > On Thu, Oct 17, 2024 at 11:41 PM Kanchana P Sridhar
> > <kanchana.p.sridhar@intel.com> wrote:
> > >
> > > Add a new zswap config variable that controls whether zswap_store() will
> > > compress a batch of pages, for instance, the pages in a large folio:
> > >
> > >   CONFIG_ZSWAP_STORE_BATCHING_ENABLED
> > >
> > > The existing CONFIG_CRYPTO_DEV_IAA_CRYPTO variable added in commit
> > > ea7a5cbb4369 ("crypto: iaa - Add Intel IAA Compression Accelerator crypto
> > > driver core") is used to detect if the system has the Intel Analytics
> > > Accelerator (IAA), and the iaa_crypto module is available. If so, the
> > > kernel build will prompt for CONFIG_ZSWAP_STORE_BATCHING_ENABLED. Hence,
> > > users have the ability to set CONFIG_ZSWAP_STORE_BATCHING_ENABLED="y" only
> > > on systems that have Intel IAA.
> > >
> > > If CONFIG_ZSWAP_STORE_BATCHING_ENABLED is enabled, and IAA is configured
> > > as the zswap compressor, zswap_store() will process the pages in a large
> > > folio in batches, i.e., multiple pages at a time. Pages in a batch will be
> > > compressed in parallel in hardware, then stored. On systems without Intel
> > > IAA and/or if zswap uses software compressors, pages in the batch will be
> > > compressed sequentially and stored.
> > >
> > > The patch also implements a zswap API that returns the status of this
> > > config variable.
> >
> > If we are compressing a large folio and batching is an option, is not
> > batching ever the correct thing to do? Why is the config option
> > needed?
>
> Thanks Yosry, for the code review comments! This is a good point. The main
> consideration here was not to impact software compressors run on non-Intel
> platforms, and only incur the memory footprint cost of multiple
> acomp_req/buffers in "struct crypto_acomp_ctx" if there is IAA to reduce
> latency with parallel compressions.
>
> If the memory footprint cost is acceptable, there is no reason not to do
> batching, even if compressions are sequential. We could amortize the cost
> of the cgroup charging/objcg/stats updates.

Hmm yeah based on the next patch it seems like we allocate 7 extra
buffers, each sized 2 * PAGE_SIZE, percpu. That's 56KB percpu (with 4K
page size), which is non-trivial.

Making it a config option seems to be inconvenient though. Users have
to sign up for the memory overhead even if some of them won't use IAA
batching, or disable batching altogether. I would assume this would
be especially annoying for distros, but also for anyone who wants to
experiment with IAA batching.

The first thing that comes to mind is making this a boot option. But I
think we can make it even more convenient and support enabling it at
runtime. We just need to allocate the additional buffers the first
time batching is enabled. This shouldn't be too complicated: we have
an array of buffers on each CPU but we only allocate the first one
initially (unless batching is enabled at boot). When batching is
enabled, we can allocate the remaining buffers.

The only shortcoming of this approach is that if we enable batching
then disable it, we can't free the buffers without significant
complexity, but I think that should be fine. I don't see this being a
common pattern.
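
Something like this, roughly (a sketch using the field names from this
series; zswap_pool_alloc_batch_resources() is a hypothetical helper called
when batching is first enabled, and it assumes req[0]/buffer[0] were
already allocated in zswap_cpu_comp_prepare()):

static int zswap_pool_alloc_batch_resources(struct zswap_pool *pool)
{
	int cpu, i;

	for_each_possible_cpu(cpu) {
		struct crypto_acomp_ctx *acomp_ctx =
			per_cpu_ptr(pool->acomp_ctx, cpu);

		mutex_lock(&acomp_ctx->mutex);
		for (i = 1; i < SWAP_CRYPTO_SUB_BATCH_SIZE; ++i) {
			if (acomp_ctx->buffer[i])
				continue;	/* already allocated */
			acomp_ctx->buffer[i] = kmalloc_node(PAGE_SIZE * 2,
					GFP_KERNEL, cpu_to_node(cpu));
			acomp_ctx->req[i] = acomp_request_alloc(acomp_ctx->acomp);
			if (!acomp_ctx->buffer[i] || !acomp_ctx->req[i]) {
				/* leave batching disabled on failure */
				mutex_unlock(&acomp_ctx->mutex);
				return -ENOMEM;
			}
		}
		mutex_unlock(&acomp_ctx->mutex);
	}
	return 0;
}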

WDYT?



>
> Thanks,
> Kanchana
>
> >
> > >
> > > Suggested-by: Ying Huang <ying.huang@intel.com>
> > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > > ---
> > >  include/linux/zswap.h |  6 ++++++
> > >  mm/Kconfig            | 12 ++++++++++++
> > >  mm/zswap.c            | 14 ++++++++++++++
> > >  3 files changed, 32 insertions(+)
> > >
> > > diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> > > index d961ead91bf1..74ad2a24b309 100644
> > > --- a/include/linux/zswap.h
> > > +++ b/include/linux/zswap.h
> > > @@ -24,6 +24,7 @@ struct zswap_lruvec_state {
> > >         atomic_long_t nr_disk_swapins;
> > >  };
> > >
> > > +bool zswap_store_batching_enabled(void);
> > >  unsigned long zswap_total_pages(void);
> > >  bool zswap_store(struct folio *folio);
> > >  bool zswap_load(struct folio *folio);
> > > @@ -39,6 +40,11 @@ bool zswap_never_enabled(void);
> > >
> > >  struct zswap_lruvec_state {};
> > >
> > > +static inline bool zswap_store_batching_enabled(void)
> > > +{
> > > +       return false;
> > > +}
> > > +
> > >  static inline bool zswap_store(struct folio *folio)
> > >  {
> > >         return false;
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > index 33fa51d608dc..26d1a5cee471 100644
> > > --- a/mm/Kconfig
> > > +++ b/mm/Kconfig
> > > @@ -125,6 +125,18 @@ config ZSWAP_COMPRESSOR_DEFAULT
> > >         default "zstd" if ZSWAP_COMPRESSOR_DEFAULT_ZSTD
> > >         default ""
> > >
> > > +config ZSWAP_STORE_BATCHING_ENABLED
> > > +       bool "Batching of zswap stores with Intel IAA"
> > > +       depends on ZSWAP && CRYPTO_DEV_IAA_CRYPTO
> > > +       default n
> > > +       help
> > > +       Enables zswap_store to swapout large folios in batches of 8 pages,
> > > +       rather than a page at a time, if the system has Intel IAA for hardware
> > > +       acceleration of compressions. If IAA is configured as the zswap
> > > +       compressor, this will parallelize batch compression of up to 8 pages
> > > +       in the folio in hardware, thereby improving large folio compression
> > > +       throughput and reducing swapout latency.
> > > +
> > >  choice
> > >         prompt "Default allocator"
> > >         depends on ZSWAP
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index 948c9745ee57..4893302d8c34 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -127,6 +127,15 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
> > >                 CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
> > >  module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
> > >
> > > +/*
> > > + * Enable/disable batching of compressions if zswap_store is called with a
> > > + * large folio. If enabled, and if IAA is the zswap compressor, pages are
> > > + * compressed in parallel in batches of say, 8 pages.
> > > + * If not, every page is compressed sequentially.
> > > + */
> > > +static bool __zswap_store_batching_enabled = IS_ENABLED(
> > > +       CONFIG_ZSWAP_STORE_BATCHING_ENABLED);
> > > +
> > >  bool zswap_is_enabled(void)
> > >  {
> > >         return zswap_enabled;
> > > @@ -241,6 +250,11 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
> > >         pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name,         \
> > >                  zpool_get_type((p)->zpool))
> > >
> > > +__always_inline bool zswap_store_batching_enabled(void)
> > > +{
> > > +       return __zswap_store_batching_enabled;
> > > +}
> > > +
> > >  /*********************************
> > >  * pool functions
> > >  **********************************/
> > > --
> > > 2.27.0
> > >


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v1 00/13] zswap IAA compress batching
  2024-10-23  2:53   ` Sridhar, Kanchana P
@ 2024-10-23 18:15     ` Yosry Ahmed
  2024-10-23 20:34       ` Sridhar, Kanchana P
  0 siblings, 1 reply; 36+ messages in thread
From: Yosry Ahmed @ 2024-10-23 18:15 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, Huang, Ying, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, viro, brauner, jack, mcgrof, kees,
	joel.granados, bfoster, willy, linux-fsdevel, Feghali, Wajdi K,
	Gopal, Vinodh

On Tue, Oct 22, 2024 at 7:53 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi Yosry,
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@google.com>
> > Sent: Tuesday, October 22, 2024 5:57 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; ryan.roberts@arm.com; Huang, Ying
> > <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> > linux-crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > <kristen.c.accardi@intel.com>; zanussi@kernel.org; viro@zeniv.linux.org.uk;
> > brauner@kernel.org; jack@suse.cz; mcgrof@kernel.org; kees@kernel.org;
> > joel.granados@kernel.org; bfoster@redhat.com; willy@infradead.org; linux-
> > fsdevel@vger.kernel.org; Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal,
> > Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [RFC PATCH v1 00/13] zswap IAA compress batching
> >
> > On Thu, Oct 17, 2024 at 11:41 PM Kanchana P Sridhar
> > <kanchana.p.sridhar@intel.com> wrote:
> > >
> > >
> > > IAA Compression Batching:
> > > =========================
> > >
> > > This RFC patch-series introduces the use of the Intel Analytics Accelerator
> > > (IAA) for parallel compression of pages in a folio, and for batched reclaim
> > > of hybrid any-order batches of folios in shrink_folio_list().
> > >
> > > The patch-series is organized as follows:
> > >
> > >  1) iaa_crypto driver enablers for batching: Relevant patches are tagged
> > >     with "crypto:" in the subject:
> > >
> > >     a) async poll crypto_acomp interface without interrupts.
> > >     b) crypto testmgr acomp poll support.
> > >     c) Modifying the default sync_mode to "async" and disabling
> > >        verify_compress by default, to facilitate users to run IAA easily for
> > >        comparison with software compressors.
> > >     d) Changing the cpu-to-iaa mappings to more evenly balance cores to IAA
> > >        devices.
> > >     e) Addition of a "global_wq" per IAA, which can be used as a global
> > >        resource for the socket. If the user configures 2WQs per IAA device,
> > >        the driver will distribute compress jobs from all cores on the
> > >        socket to the "global_wqs" of all the IAA devices on that socket, in
> > >        a round-robin manner. This can be used to improve compression
> > >        throughput for workloads that see a lot of swapout activity.
> > >
> > >  2) Migrating zswap to use async poll in zswap_compress()/decompress().
> > >  3) A centralized batch compression API that can be used by swap modules.
> > >  4) IAA compress batching within large folio zswap stores.
> > >  5) IAA compress batching of any-order hybrid folios in
> > >     shrink_folio_list(). The newly added "sysctl vm.compress-batchsize"
> > >     parameter can be used to configure the number of folios in [1, 32] to
> > >     be reclaimed using compress batching.
> >
> > I am still digesting this series but I have some high level questions
> > that I left on some patches. My intuition though is that we should
> > drop (5) from the initial proposal as it's most controversial.
> > Batching reclaim of unrelated folios through zswap *might* make sense,
> > but it needs a broader conversation and it needs justification on its
> > own merit, without the rest of the series.
>
> Thanks for these suggestions!  Sure, I can drop (5) from the initial patch-set.
> Agreed, this also needs a broader discussion.
>
> I believe the 4K folios usemem30 data in this patchset does bring across
> the batching reclaim benefits to provide justification on its own merit. I added
> the data on batching reclaim with kernel compilation as part of the 4K folios
> experiments in the IAA decompression batching patch-series [1].
> Listing it here as well. I will make sure to add this data in subsequent revs.
>
> --------------------------------------------------------------------------
>  Kernel compilation in tmpfs/allmodconfig, 2G max memory:
>
>  No large folios          mm-unstable-10-16-2024       shrink_folio_list()
>                                                        batching of folios
>  --------------------------------------------------------------------------
>  zswap compressor         zstd       deflate-iaa       deflate-iaa
>  vm.compress-batchsize     n/a               n/a                32
>  vm.page-cluster             3                 3                 3
>  --------------------------------------------------------------------------
>  real_sec               783.87            761.69            747.32
>  user_sec            15,750.07         15,716.69         15,728.39
>  sys_sec              6,522.32          5,725.28          5,399.44
>  Max_RSS_KB          1,872,640         1,870,848         1,874,432
>
>  zswpout            82,364,991        97,739,600       102,780,612
>  zswpin             21,303,393        27,684,166        29,016,252
>  pswpout                    13               222               213
>  pswpin                     12               209               202
>  pgmajfault         17,114,339        22,421,211        23,378,161
>  swap_ra             4,596,035         5,840,082         6,231,646
>  swap_ra_hit         2,903,249         3,682,444         3,940,420
>  --------------------------------------------------------------------------
>
> The performance improvements seen do depend on compression batching in
> the swap modules (zswap). The implementation in patch 12 in the compress
> batching series sets up this zswap compression pipeline that takes an array of
> folios and processes them in batches of 8 pages compressed in parallel in hardware.
> That being said, we do see latency improvements even with reclaim batching
> combined with zswap compress batching with zstd/lzo-rle/etc. I haven't done a
> lot of analysis of this, but I am guessing fewer calls from the swap layer
> (swap_writepage()) into zswap could have something to do with this. If we believe
> that batching can be the right thing to do even for the software compressors,
> I can gather batching data with zstd for v2.
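
(As a rough sketch, the pipeline described above could look like the
following -- store_folios_in_batches() and compress_batch() are illustrative
placeholders, not actual code from patch 12:)

	/*
	 * Sketch only: walk an array of folios and feed their pages to the
	 * compressor in batches of up to 8, so that IAA can compress each
	 * full batch in parallel. compress_batch() stands in for the real
	 * submit-then-poll logic, and error handling is omitted.
	 */
	void compress_batch(struct page **pages, unsigned int nr); /* placeholder */

	static void store_folios_in_batches(struct folio **folios,
					    unsigned int nr_folios)
	{
		struct page *batch[8];
		unsigned int i, nr = 0;
		long j;

		for (i = 0; i < nr_folios; i++) {
			for (j = 0; j < folio_nr_pages(folios[i]); j++) {
				batch[nr++] = folio_page(folios[i], j);
				if (nr == ARRAY_SIZE(batch)) {
					compress_batch(batch, nr);
					nr = 0;
				}
			}
		}
		if (nr)		/* compress any leftover partial batch */
			compress_batch(batch, nr);
	}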

Thanks for sharing the data. What I meant is, I think we should focus
on supporting large folio compression batching for this series, and
only present figures for this support to avoid confusion.

Once this lands, we can discuss support for batching the compression
of different unrelated folios separately, as it spans areas beyond
just zswap and will need broader discussion.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [RFC PATCH v1 09/13] mm: zswap: Config variable to enable compress batching in zswap_store().
  2024-10-23 18:12       ` Yosry Ahmed
@ 2024-10-23 20:32         ` Sridhar, Kanchana P
  0 siblings, 0 replies; 36+ messages in thread
From: Sridhar, Kanchana P @ 2024-10-23 20:32 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, Huang, Ying, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, viro, brauner, jack, mcgrof, kees,
	joel.granados, bfoster, willy, linux-fsdevel, Feghali, Wajdi K,
	Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, October 23, 2024 11:12 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> linux-crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; zanussi@kernel.org; viro@zeniv.linux.org.uk;
> brauner@kernel.org; jack@suse.cz; mcgrof@kernel.org; kees@kernel.org;
> joel.granados@kernel.org; bfoster@redhat.com; willy@infradead.org; linux-
> fsdevel@vger.kernel.org; Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal,
> Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [RFC PATCH v1 09/13] mm: zswap: Config variable to enable
> compress batching in zswap_store().
> 
> On Tue, Oct 22, 2024 at 7:17 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosryahmed@google.com>
> > > Sent: Tuesday, October 22, 2024 5:50 PM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > hannes@cmpxchg.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev;
> > > usamaarif642@gmail.com; ryan.roberts@arm.com; Huang, Ying
> > > <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org;
> > > linux-crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > > <kristen.c.accardi@intel.com>; zanussi@kernel.org;
> viro@zeniv.linux.org.uk;
> > > brauner@kernel.org; jack@suse.cz; mcgrof@kernel.org;
> kees@kernel.org;
> > > joel.granados@kernel.org; bfoster@redhat.com; willy@infradead.org;
> linux-
> > > fsdevel@vger.kernel.org; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal,
> > > Vinodh <vinodh.gopal@intel.com>
> > > Subject: Re: [RFC PATCH v1 09/13] mm: zswap: Config variable to enable
> > > compress batching in zswap_store().
> > >
> > > On Thu, Oct 17, 2024 at 11:41 PM Kanchana P Sridhar
> > > <kanchana.p.sridhar@intel.com> wrote:
> > > >
> > > > Add a new zswap config variable that controls whether zswap_store() will
> > > > compress a batch of pages, for instance, the pages in a large folio:
> > > >
> > > >   CONFIG_ZSWAP_STORE_BATCHING_ENABLED
> > > >
> > > > The existing CONFIG_CRYPTO_DEV_IAA_CRYPTO variable added in commit
> > > > ea7a5cbb4369 ("crypto: iaa - Add Intel IAA Compression Accelerator crypto
> > > > driver core") is used to detect if the system has the Intel Analytics
> > > > Accelerator (IAA), and the iaa_crypto module is available. If so, the
> > > > kernel build will prompt for CONFIG_ZSWAP_STORE_BATCHING_ENABLED. Hence,
> > > > users have the ability to set CONFIG_ZSWAP_STORE_BATCHING_ENABLED="y" only
> > > > on systems that have Intel IAA.
> > > >
> > > > If CONFIG_ZSWAP_STORE_BATCHING_ENABLED is enabled, and IAA is configured
> > > > as the zswap compressor, zswap_store() will process the pages in a large
> > > > folio in batches, i.e., multiple pages at a time. Pages in a batch will be
> > > > compressed in parallel in hardware, then stored. On systems without Intel
> > > > IAA and/or if zswap uses software compressors, pages in the batch will be
> > > > compressed sequentially and stored.
> > > >
> > > > The patch also implements a zswap API that returns the status of this
> > > > config variable.
> > >
> > > If we are compressing a large folio and batching is an option, is not
> > > batching ever the correct thing to do? Why is the config option
> > > needed?
> >
> > Thanks Yosry, for the code review comments! This is a good point. The main
> > consideration here was not to impact software compressors run on non-Intel
> > platforms, and only incur the memory footprint cost of multiple
> > acomp_req/buffers in "struct crypto_acomp_ctx" if there is IAA to reduce
> > latency with parallel compressions.
> >
> > If the memory footprint cost is acceptable, there is no reason not to do
> > batching, even if compressions are sequential. We could amortize the cost
> > of the cgroup charging/objcg/stats updates.
> 
> Hmm yeah based on the next patch it seems like we allocate 7 extra
> buffers, each sized 2 * PAGE_SIZE, percpu. That's 56KB percpu (with 4K
> page size), which is non-trivial.
> 
> Making it a config option seems to be inconvenient though. Users have
> to sign up for the memory overhead if some of them won't use IAA
> batching, or disable batching altogether. I would assume this would
> be especially annoying for distros, but also for anyone who wants to
> experiment with IAA batching.
> 
> The first thing that comes to mind is making this a boot option. But I
> think we can make it even more convenient and support enabling it at
> runtime. We just need to allocate the additional buffers the first
> time batching is enabled. This shouldn't be too complicated, we have
> an array of buffers on each CPU but we only allocate the first one
> initially (unless batching is enabled at boot). When batching is
> enabled, we can allocate the remaining buffers.
> 
> The only shortcoming of this approach is that if we enable batching
> then disable it, we can't free the buffers without significant
> complexity, but I think that should be fine. I don't see this being a
> common pattern.
> 
> WDYT?
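
(For concreteness, a minimal sketch of the lazy allocation idea -- the names
zswap_alloc_batch_buffers() and ZSWAP_MAX_BATCH_SIZE are illustrative, not
code from this series:)

	/*
	 * Sketch only: allocate the remaining per-CPU destination buffers
	 * the first time batching is enabled at runtime. Buffer 0 is
	 * allocated at pool creation as today; buffers 1..N-1 stay NULL
	 * until batching is first enabled, and once allocated they are
	 * never freed -- disabling batching simply stops using them.
	 */
	static int zswap_alloc_batch_buffers(struct crypto_acomp_ctx *ctx, int nid)
	{
		int i;

		for (i = 1; i < ZSWAP_MAX_BATCH_SIZE; i++) {
			if (ctx->buffers[i])
				continue;
			ctx->buffers[i] = kmalloc_node(2 * PAGE_SIZE, GFP_KERNEL, nid);
			if (!ctx->buffers[i])
				return -ENOMEM;
		}
		return 0;
	}

Once allocated, the extra buffers cost 7 * 2 * 4 KB = 56 KB per CPU (roughly
6 MB on a 112-core system), which is why deferring the allocation to first
use is attractive.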

Thanks for these suggestions, Yosry. Sure, let me give this a try, and share
updates.

Thanks,
Kanchana

> 
> 
> 
> >
> > Thanks,
> > Kanchana
> >
> > >
> > > >
> > > > Suggested-by: Ying Huang <ying.huang@intel.com>
> > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > > > ---
> > > >  include/linux/zswap.h |  6 ++++++
> > > >  mm/Kconfig            | 12 ++++++++++++
> > > >  mm/zswap.c            | 14 ++++++++++++++
> > > >  3 files changed, 32 insertions(+)
> > > >
> > > > diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> > > > index d961ead91bf1..74ad2a24b309 100644
> > > > --- a/include/linux/zswap.h
> > > > +++ b/include/linux/zswap.h
> > > > @@ -24,6 +24,7 @@ struct zswap_lruvec_state {
> > > >         atomic_long_t nr_disk_swapins;
> > > >  };
> > > >
> > > > +bool zswap_store_batching_enabled(void);
> > > >  unsigned long zswap_total_pages(void);
> > > >  bool zswap_store(struct folio *folio);
> > > >  bool zswap_load(struct folio *folio);
> > > > @@ -39,6 +40,11 @@ bool zswap_never_enabled(void);
> > > >
> > > >  struct zswap_lruvec_state {};
> > > >
> > > > +static inline bool zswap_store_batching_enabled(void)
> > > > +{
> > > > +       return false;
> > > > +}
> > > > +
> > > >  static inline bool zswap_store(struct folio *folio)
> > > >  {
> > > >         return false;
> > > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > > index 33fa51d608dc..26d1a5cee471 100644
> > > > --- a/mm/Kconfig
> > > > +++ b/mm/Kconfig
> > > > @@ -125,6 +125,18 @@ config ZSWAP_COMPRESSOR_DEFAULT
> > > >         default "zstd" if ZSWAP_COMPRESSOR_DEFAULT_ZSTD
> > > >         default ""
> > > >
> > > > +config ZSWAP_STORE_BATCHING_ENABLED
> > > > +       bool "Batching of zswap stores with Intel IAA"
> > > > +       depends on ZSWAP && CRYPTO_DEV_IAA_CRYPTO
> > > > +       default n
> > > > +       help
> > > > +       Enables zswap_store to swap out large folios in batches of 8 pages,
> > > > +       rather than a page at a time, if the system has Intel IAA for hardware
> > > > +       acceleration of compressions. If IAA is configured as the zswap
> > > > +       compressor, this will parallelize batch compression of up to 8 pages
> > > > +       in the folio in hardware, thereby improving large folio compression
> > > > +       throughput and reducing swapout latency.
> > > > +
> > > >  choice
> > > >         prompt "Default allocator"
> > > >         depends on ZSWAP
> > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > index 948c9745ee57..4893302d8c34 100644
> > > > --- a/mm/zswap.c
> > > > +++ b/mm/zswap.c
> > > > @@ -127,6 +127,15 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
> > > >                 CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
> > > >  module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
> > > >
> > > > +/*
> > > > + * Enable/disable batching of compressions if zswap_store is called with a
> > > > + * large folio. If enabled, and if IAA is the zswap compressor, pages are
> > > > + * compressed in parallel in batches of, say, 8 pages.
> > > > + * If not, every page is compressed sequentially.
> > > > + */
> > > > +static bool __zswap_store_batching_enabled = IS_ENABLED(
> > > > +       CONFIG_ZSWAP_STORE_BATCHING_ENABLED);
> > > > +
> > > >  bool zswap_is_enabled(void)
> > > >  {
> > > >         return zswap_enabled;
> > > > @@ -241,6 +250,11 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
> > > >         pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name,         \
> > > >                  zpool_get_type((p)->zpool))
> > > >
> > > > +__always_inline bool zswap_store_batching_enabled(void)
> > > > +{
> > > > +       return __zswap_store_batching_enabled;
> > > > +}
> > > > +
> > > >  /*********************************
> > > >  * pool functions
> > > >  **********************************/
> > > > --
> > > > 2.27.0
> > > >

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [RFC PATCH v1 00/13] zswap IAA compress batching
  2024-10-23 18:15     ` Yosry Ahmed
@ 2024-10-23 20:34       ` Sridhar, Kanchana P
  0 siblings, 0 replies; 36+ messages in thread
From: Sridhar, Kanchana P @ 2024-10-23 20:34 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, ryan.roberts, Huang, Ying, 21cnbao, akpm,
	linux-crypto, herbert, davem, clabbe, ardb, ebiggers, surenb,
	Accardi, Kristen C, zanussi, viro, brauner, jack, mcgrof, kees,
	joel.granados, bfoster, willy, linux-fsdevel, Feghali, Wajdi K,
	Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, October 23, 2024 11:16 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> linux-crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; zanussi@kernel.org; viro@zeniv.linux.org.uk;
> brauner@kernel.org; jack@suse.cz; mcgrof@kernel.org; kees@kernel.org;
> joel.granados@kernel.org; bfoster@redhat.com; willy@infradead.org; linux-
> fsdevel@vger.kernel.org; Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal,
> Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [RFC PATCH v1 00/13] zswap IAA compress batching
> 
> On Tue, Oct 22, 2024 at 7:53 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi Yosry,
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosryahmed@google.com>
> > > Sent: Tuesday, October 22, 2024 5:57 PM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > hannes@cmpxchg.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev;
> > > usamaarif642@gmail.com; ryan.roberts@arm.com; Huang, Ying
> > > <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org;
> > > linux-crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > > <kristen.c.accardi@intel.com>; zanussi@kernel.org;
> viro@zeniv.linux.org.uk;
> > > brauner@kernel.org; jack@suse.cz; mcgrof@kernel.org;
> kees@kernel.org;
> > > joel.granados@kernel.org; bfoster@redhat.com; willy@infradead.org;
> linux-
> > > fsdevel@vger.kernel.org; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal,
> > > Vinodh <vinodh.gopal@intel.com>
> > > Subject: Re: [RFC PATCH v1 00/13] zswap IAA compress batching
> > >
> > > On Thu, Oct 17, 2024 at 11:41 PM Kanchana P Sridhar
> > > <kanchana.p.sridhar@intel.com> wrote:
> > > >
> > > >
> > > > IAA Compression Batching:
> > > > =========================
> > > >
> > > > This RFC patch-series introduces the use of the Intel Analytics Accelerator
> > > > (IAA) for parallel compression of pages in a folio, and for batched reclaim
> > > > of hybrid any-order batches of folios in shrink_folio_list().
> > > >
> > > > The patch-series is organized as follows:
> > > >
> > > >  1) iaa_crypto driver enablers for batching: Relevant patches are tagged
> > > >     with "crypto:" in the subject:
> > > >
> > > >     a) async poll crypto_acomp interface without interrupts.
> > > >     b) crypto testmgr acomp poll support.
> > > >     c) Modifying the default sync_mode to "async" and disabling
> > > >        verify_compress by default, to facilitate users to run IAA easily for
> > > >        comparison with software compressors.
> > > >     d) Changing the cpu-to-iaa mappings to more evenly balance cores to
> > > >        IAA devices.
> > > >     e) Addition of a "global_wq" per IAA, which can be used as a global
> > > >        resource for the socket. If the user configures 2WQs per IAA device,
> > > >        the driver will distribute compress jobs from all cores on the
> > > >        socket to the "global_wqs" of all the IAA devices on that socket, in
> > > >        a round-robin manner. This can be used to improve compression
> > > >        throughput for workloads that see a lot of swapout activity.
> > > >
> > > >  2) Migrating zswap to use async poll in zswap_compress()/decompress().
> > > >  3) A centralized batch compression API that can be used by swap modules.
> > > >  4) IAA compress batching within large folio zswap stores.
> > > >  5) IAA compress batching of any-order hybrid folios in
> > > >     shrink_folio_list(). The newly added "sysctl vm.compress-batchsize"
> > > >     parameter can be used to configure the number of folios in [1, 32] to
> > > >     be reclaimed using compress batching.
> > >
> > > I am still digesting this series but I have some high level questions
> > > that I left on some patches. My intuition though is that we should
> > > drop (5) from the initial proposal as it's most controversial.
> > > Batching reclaim of unrelated folios through zswap *might* make sense,
> > > but it needs a broader conversation and it needs justification on its
> > > own merit, without the rest of the series.
> >
> > Thanks for these suggestions!  Sure, I can drop (5) from the initial patch-set.
> > Agreed, this also needs a broader discussion.
> >
> > I believe the 4K folios usemem30 data in this patchset does bring across
> > the batching reclaim benefits to provide justification on its own merit.
> > I added the data on batching reclaim with kernel compilation as part of
> > the 4K folios experiments in the IAA decompression batching patch-series
> > [1]. Listing it here as well. I will make sure to add this data in
> > subsequent revs.
> >
> > --------------------------------------------------------------------------
> >  Kernel compilation in tmpfs/allmodconfig, 2G max memory:
> >
> >  No large folios          mm-unstable-10-16-2024       shrink_folio_list()
> >                                                        batching of folios
> >  --------------------------------------------------------------------------
> >  zswap compressor         zstd       deflate-iaa       deflate-iaa
> >  vm.compress-batchsize     n/a               n/a                32
> >  vm.page-cluster             3                 3                 3
> >  --------------------------------------------------------------------------
> >  real_sec               783.87            761.69            747.32
> >  user_sec            15,750.07         15,716.69         15,728.39
> >  sys_sec              6,522.32          5,725.28          5,399.44
> >  Max_RSS_KB          1,872,640         1,870,848         1,874,432
> >
> >  zswpout            82,364,991        97,739,600       102,780,612
> >  zswpin             21,303,393        27,684,166        29,016,252
> >  pswpout                    13               222               213
> >  pswpin                     12               209               202
> >  pgmajfault         17,114,339        22,421,211        23,378,161
> >  swap_ra             4,596,035         5,840,082         6,231,646
> >  swap_ra_hit         2,903,249         3,682,444         3,940,420
> >  --------------------------------------------------------------------------
> >
> > The performance improvements seen do depend on compression batching in
> > the swap modules (zswap). The implementation in patch 12 in the compress
> > batching series sets up this zswap compression pipeline that takes an
> > array of folios and processes them in batches of 8 pages compressed in
> > parallel in hardware. That being said, we do see latency improvements
> > even with reclaim batching combined with zswap compress batching with
> > zstd/lzo-rle/etc. I haven't done a lot of analysis of this, but I am
> > guessing fewer calls from the swap layer (swap_writepage()) into zswap
> > could have something to do with this. If we believe that batching can
> > be the right thing to do even for the software compressors, I can
> > gather batching data with zstd for v2.
> 
> Thanks for sharing the data. What I meant is, I think we should focus
> on supporting large folio compression batching for this series, and
> only present figures for this support to avoid confusion.
> 
> Once this lands, we can discuss support for batching the compression
> of different unrelated folios separately, as it spans areas beyond
> just zswap and will need broader discussion.

Absolutely, this makes sense, thanks Yosry! I will address this in v2.

Thanks,
Kanchana


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v1 13/13] mm: vmscan, swap, zswap: Compress batching of folios in shrink_folio_list().
  2024-10-18  6:41 ` [RFC PATCH v1 13/13] mm: vmscan, swap, zswap: Compress batching of folios in shrink_folio_list() Kanchana P Sridhar
@ 2024-10-28 14:41   ` Joel Granados
  2024-10-28 18:53     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 36+ messages in thread
From: Joel Granados @ 2024-10-28 14:41 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, kristen.c.accardi, zanussi, viro, brauner, jack, mcgrof,
	kees, bfoster, willy, linux-fsdevel, wajdi.k.feghali,
	vinodh.gopal

On Thu, Oct 17, 2024 at 11:41:01PM -0700, Kanchana P Sridhar wrote:
> This patch enables the use of Intel IAA hardware compression acceleration
> to reclaim a batch of folios in shrink_folio_list(). This results in
> reclaim throughput and workload/sys performance improvements.
> 
> The earlier patches on compress batching deployed multiple IAA compress
> engines for compressing up to SWAP_CRYPTO_SUB_BATCH_SIZE pages within a
> large folio that is being stored in zswap_store(). This patch further
> propagates the efficiency improvements demonstrated with IAA "batching
> within folios" to vmscan "batching of folios", which will also use
> batching within folios via the extensible architecture of the
> __zswap_store_batch_core() procedure added earlier, which accepts an
> array of folios.

...

> +static inline void zswap_store_batch(struct swap_in_memory_cache_cb *simc)
> +{
> +}
> +
>  static inline bool zswap_store(struct folio *folio)
>  {
>  	return false;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 79e6cb1d5c48..b8d6b599e9ae 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -2064,6 +2064,15 @@ static struct ctl_table vm_table[] = {
>  		.extra1		= SYSCTL_ZERO,
>  		.extra2		= (void *)&page_cluster_max,
>  	},
> +	{
> +		.procname	= "compress-batchsize",
> +		.data		= &compress_batchsize,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
Why not use proc_douintvec_minmax? These are the reasons I think you
should use that (please correct me if I misread your patch):

1. Your range is [1,32] -> so no negative values
2. You are using the value to compare with an unsigned int
   (simc->nr_folios) in your `struct swap_in_memory_cache_cb`. So
   instead of going from int to uint, you should just do uint all
   around. No?
3. Using proc_douintvec_minmax will automatically error out on negative
   input without even considering your range, so there is less code
   executed at the end (see the sketch below).
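
For illustration, the resulting entry might look like this (a sketch
assuming compress_batchsize and its bounds become unsigned int; not the
actual patch):

	static unsigned int compress_batchsize = 1;
	static const unsigned int compress_batchsize_min = 1;
	static const unsigned int compress_batchsize_max = 32;

	{
		.procname	= "compress-batchsize",
		.data		= &compress_batchsize,
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_douintvec_minmax,
		.extra1		= (void *)&compress_batchsize_min,
		.extra2		= (void *)&compress_batchsize_max,
	},

With this, a write like "sysctl vm.compress-batchsize=-1" is rejected by
proc_douintvec_minmax before the [1, 32] range is even consulted.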

> +		.extra1		= SYSCTL_ONE,
> +		.extra2		= (void *)&compress_batchsize_max,
> +	},
>  	{
>  		.procname	= "dirtytime_expire_seconds",
>  		.data		= &dirtytime_expire_interval,
> diff --git a/mm/page_io.c b/mm/page_io.c
> index a28d28b6b3ce..065db25309b8 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -226,6 +226,131 @@ static void swap_zeromap_folio_clear(struct folio *folio)
>  	}
>  }

...

Best

-- 

Joel Granados


^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [RFC PATCH v1 13/13] mm: vmscan, swap, zswap: Compress batching of folios in shrink_folio_list().
  2024-10-28 14:41   ` Joel Granados
@ 2024-10-28 18:53     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 36+ messages in thread
From: Sridhar, Kanchana P @ 2024-10-28 18:53 UTC (permalink / raw)
  To: Joel Granados
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, linux-crypto, herbert, davem, clabbe, ardb, ebiggers,
	surenb, Accardi, Kristen C, zanussi, viro, brauner, jack, mcgrof,
	kees, bfoster, willy, linux-fsdevel, Feghali, Wajdi K, Gopal,
	Vinodh, Sridhar, Kanchana P

Hi Joel,

> -----Original Message-----
> From: Joel Granados <joel.granados@kernel.org>
> Sent: Monday, October 28, 2024 7:42 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> 21cnbao@gmail.com; akpm@linux-foundation.org; linux-
> crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; zanussi@kernel.org; viro@zeniv.linux.org.uk;
> brauner@kernel.org; jack@suse.cz; mcgrof@kernel.org; kees@kernel.org;
> bfoster@redhat.com; willy@infradead.org; linux-fsdevel@vger.kernel.org;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [RFC PATCH v1 13/13] mm: vmscan, swap, zswap: Compress
> batching of folios in shrink_folio_list().
> 
> On Thu, Oct 17, 2024 at 11:41:01PM -0700, Kanchana P Sridhar wrote:
> > This patch enables the use of Intel IAA hardware compression acceleration
> > to reclaim a batch of folios in shrink_folio_list(). This results in
> > reclaim throughput and workload/sys performance improvements.
> >
> > The earlier patches on compress batching deployed multiple IAA compress
> > engines for compressing up to SWAP_CRYPTO_SUB_BATCH_SIZE pages within a
> > large folio that is being stored in zswap_store(). This patch further
> > propagates the efficiency improvements demonstrated with IAA "batching
> > within folios" to vmscan "batching of folios", which will also use
> > batching within folios via the extensible architecture of the
> > __zswap_store_batch_core() procedure added earlier, which accepts an
> > array of folios.
> 
> ...
> 
> > +static inline void zswap_store_batch(struct swap_in_memory_cache_cb *simc)
> > +{
> > +}
> > +
> >  static inline bool zswap_store(struct folio *folio)
> >  {
> >  	return false;
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index 79e6cb1d5c48..b8d6b599e9ae 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -2064,6 +2064,15 @@ static struct ctl_table vm_table[] = {
> >  		.extra1		= SYSCTL_ZERO,
> >  		.extra2		= (void *)&page_cluster_max,
> >  	},
> > +	{
> > +		.procname	= "compress-batchsize",
> > +		.data		= &compress_batchsize,
> > +		.maxlen		= sizeof(int),
> > +		.mode		= 0644,
> > +		.proc_handler	= proc_dointvec_minmax,
> Why not use proc_douintvec_minmax? These are the reasons I think you
> should use that (please correct me if I misread your patch):
> 
> 1. Your range is [1,32] -> so no negative values
> 2. You are using the value to compare with an unsigned int
>    (simc->nr_folios) in your `struct swap_in_memory_cache_cb`. So
>    instead of going from int to uint, you should just do uint all
>    around. No?
> 3. Using proc_douintvec_minmax will automatically error out on negative
>    input without even considering your range, so there is less code
>    executed at the end.

Thanks for your code review comments! Sure, what you suggest makes
sense. Based on Yosry's suggestions, I plan to separate out the
batching reclaim shrink_folio_list() changes into a separate series, and
focus on just the zswap modifications to support large folio compression
batching in the initial series. I will make sure to incorporate your comments
in the shrink_folio_list() batching reclaim series.

Thanks,
Kanchana

> 
> > +		.extra1		= SYSCTL_ONE,
> > +		.extra2		= (void *)&compress_batchsize_max,
> > +	},
> >  	{
> >  		.procname	= "dirtytime_expire_seconds",
> >  		.data		= &dirtytime_expire_interval,
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index a28d28b6b3ce..065db25309b8 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -226,6 +226,131 @@ static void swap_zeromap_folio_clear(struct folio
> *folio)
> >  	}
> >  }
> 
> ...
> 
> Best
> 
> --
> 
> Joel Granados


^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2024-10-28 18:53 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-18  6:40 [RFC PATCH v1 00/13] zswap IAA compress batching Kanchana P Sridhar
2024-10-18  6:40 ` [RFC PATCH v1 01/13] crypto: acomp - Add a poll() operation to acomp_alg and acomp_req Kanchana P Sridhar
2024-10-18  7:55   ` Herbert Xu
2024-10-18 23:01     ` Sridhar, Kanchana P
2024-10-19  0:19       ` Herbert Xu
2024-10-19 19:10         ` Sridhar, Kanchana P
2024-10-18  6:40 ` [RFC PATCH v1 02/13] crypto: iaa - Add support for irq-less crypto async interface Kanchana P Sridhar
2024-10-18  6:40 ` [RFC PATCH v1 03/13] crypto: testmgr - Add crypto testmgr acomp poll support Kanchana P Sridhar
2024-10-18  6:40 ` [RFC PATCH v1 04/13] mm: zswap: zswap_compress()/decompress() can submit, then poll an acomp_req Kanchana P Sridhar
2024-10-23  0:48   ` Yosry Ahmed
2024-10-23  2:01     ` Sridhar, Kanchana P
2024-10-18  6:40 ` [RFC PATCH v1 05/13] crypto: iaa - Make async mode the default Kanchana P Sridhar
2024-10-18  6:40 ` [RFC PATCH v1 06/13] crypto: iaa - Disable iaa_verify_compress by default Kanchana P Sridhar
2024-10-18  6:40 ` [RFC PATCH v1 07/13] crypto: iaa - Change cpu-to-iaa mappings to evenly balance cores to IAAs Kanchana P Sridhar
2024-10-18  6:40 ` [RFC PATCH v1 08/13] crypto: iaa - Distribute compress jobs to all IAA devices on a NUMA node Kanchana P Sridhar
2024-10-18  6:40 ` [RFC PATCH v1 09/13] mm: zswap: Config variable to enable compress batching in zswap_store() Kanchana P Sridhar
2024-10-23  0:49   ` Yosry Ahmed
2024-10-23  2:17     ` Sridhar, Kanchana P
2024-10-23  2:58       ` Herbert Xu
2024-10-23  3:06         ` Sridhar, Kanchana P
2024-10-23 18:12       ` Yosry Ahmed
2024-10-23 20:32         ` Sridhar, Kanchana P
2024-10-18  6:40 ` [RFC PATCH v1 10/13] mm: zswap: Create multiple reqs/buffers in crypto_acomp_ctx if platform has IAA Kanchana P Sridhar
2024-10-23  0:51   ` Yosry Ahmed
2024-10-23  2:19     ` Sridhar, Kanchana P
2024-10-18  6:40 ` [RFC PATCH v1 11/13] mm: swap: Add IAA batch compression API swap_crypto_acomp_compress_batch() Kanchana P Sridhar
2024-10-23  0:53   ` Yosry Ahmed
2024-10-23  2:21     ` Sridhar, Kanchana P
2024-10-18  6:41 ` [RFC PATCH v1 12/13] mm: zswap: Compress batching with Intel IAA in zswap_store() of large folios Kanchana P Sridhar
2024-10-18  6:41 ` [RFC PATCH v1 13/13] mm: vmscan, swap, zswap: Compress batching of folios in shrink_folio_list() Kanchana P Sridhar
2024-10-28 14:41   ` Joel Granados
2024-10-28 18:53     ` Sridhar, Kanchana P
2024-10-23  0:56 ` [RFC PATCH v1 00/13] zswap IAA compress batching Yosry Ahmed
2024-10-23  2:53   ` Sridhar, Kanchana P
2024-10-23 18:15     ` Yosry Ahmed
2024-10-23 20:34       ` Sridhar, Kanchana P

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox