linux-mm.kvack.org archive mirror
* [RFC PATCH v1 0/7] zswap IAA decompress batching
@ 2024-10-18  6:47 Kanchana P Sridhar
  2024-10-18  6:47 ` [RFC PATCH v1 1/7] mm: zswap: Config variable to enable zswap loads with " Kanchana P Sridhar
                   ` (6 more replies)
  0 siblings, 7 replies; 15+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18  6:47 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, hughd, willy, bfoster, dchinner, chrisl, david
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar


IAA Decompression Batching:
===========================

This patch-series applies over [1], the IAA compress batching patch-series.

[1] https://patchwork.kernel.org/project/linux-mm/list/?series=900537

This RFC patch-series introduces the use of the Intel Analytics Accelerator
(IAA) for parallel decompression of 4K folios prefetched by
swapin_readahead(). We have developed zswap batched loading of these
prefetched folios, which uses IAA to decompress them in parallel.

swapin_readahead() provides a natural batching interface because it adjusts
the readahead window based on how useful prior prefetches have been. Hence,
it allows the page-cluster to be set based on workload characteristics. For
prefetch-friendly workloads, this can form the basis for reading ahead up to
32 folios with zswap load batching, significantly reducing swap-in latency,
major page-faults and sys time; thereby improving workload performance.
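
For example, on a prefetch-friendly workload the readahead window can be
opened up via the existing vm.page-cluster sysctl (the value 5 below is only
an illustration; the right value is workload-dependent):

 echo 5 > /proc/sys/vm/page-cluster

With page-cluster set to 5, swapin_readahead() can read ahead up to
2^5 = 32 pages when prior prefetches have been useful.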

The patch-series builds upon the IAA compress batching patch-series [1],
and is organized as follows:

 1) A centralized batch decompression API that can be used by swap modules.
 2) "struct folio_batch" modifications, e.g., PAGEVEC_SIZE is increased to
    2^5. 
 3) Addition of "zswap_batch" and "non_zswap_batch" folio_batches in
    swap_read_folio() to serve the purposes of a plug.
 4) swap_read_zswap_batch_unplug() API in page_io.c to process a read
    batch of entries found in zswap.
 5) zswap API to add a swap entry to a load batch, init/reinit the batch,
    process the batch using the batch decompression API. 
 6) Modifications to the swapin_readahead() functions,
    swap_vma_readahead() and swap_cluster_readahead() to:
    a) Call swap_read_folio() to add prefetch swap entries to "zswap_batch"
       and "non_zswap_batch" folio_batches.
    b) Process the two readahead folio batches: "non_zswap_batch" folios
       will be read sequentially; "zswap_batch" folios will be batch
       decompressed with IAA (see the sketch after this list).
 7) Modifications to do_swap_page() to invoke swapin_readahead() from both
    the single-mapped SWP_SYNCHRONOUS_IO and the shared/non-SWP_SYNCHRONOUS_IO
    branches. In the former path, we call swapin_readahead() only in the
    !zswap_never_enabled() case.
    a) This causes folios to be read into the swapcache in both paths. This
       design choice was motivated by stability: it handles races in which,
       say, process 1 is faulting in a single-mapped folio while process 2
       is simultaneously prefetching it as a "readahead" folio.
    b) If the single-mapped folio was successfully read and the race did
       not occur, checks are added to free the swapcache entry for the
       folio before do_swap_page() returns.
 8) Finally, for IAA batching, we reduce SWAP_BATCH to 16 and modify the
    swap slots cache thresholds to alleviate lock contention on the
    swap_info_struct lock due to reduced swap page-fault latencies.
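
Putting items 3) through 6) together, the following is a minimal sketch of
how a readahead procedure is expected to drive the batching interfaces added
by this series. It is not the exact code in the patches: "folio" stands for
each prefetched folio allocated into the swapcache inside the readahead
loop, and error handling is omitted.

  struct zswap_decomp_batch zswap_batch;
  struct folio_batch non_zswap_batch;
  struct swap_iocb *splug = NULL;
  unsigned int i;

  zswap_load_batch_init(&zswap_batch);
  folio_batch_init(&non_zswap_batch);

  /*
   * Readahead loop: each prefetched folio is handed to swap_read_folio(),
   * which either reads it right away or adds it to one of the two batches.
   */
  swap_read_folio(folio, &splug, &zswap_batch, &non_zswap_batch);

  /* After the loop: read the folios not found in zswap sequentially. */
  for (i = 0; i < folio_batch_count(&non_zswap_batch); i++)
      swap_read_folio(non_zswap_batch.folios[i], &splug, NULL, NULL);

  /* Batch-decompress the folios found in zswap (in parallel with IAA). */
  swap_read_zswap_batch_unplug(&zswap_batch, &splug);
  swap_read_unplug(splug);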

IAA decompress batching can be enabled only on platforms that have IAA, by
setting this config variable:

 CONFIG_ZSWAP_LOAD_BATCHING_ENABLED="y"

A new swap parameter "singlemapped_ra_enabled" (false by default) is added
for use on platforms that have IAA. If zswap_load_batching_enabled() is
true, this is intended to give the user the option to run experiments with
IAA and with software compressors for zswap.

These are the recommended settings for "singlemapped_ra_enabled", which
takes effect only in the do_swap_page() single-mapped SWP_SYNCHRONOUS_IO
path:

 For IAA:
   echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled

 For software compressors:
   echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled

If "singlemapped_ra_enabled" is set to false, swapin_readahead() will skip
prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO" do_swap_page()
path.

IAA decompress batching performance testing was done using the kernel
compilation test "allmodconfig" run in tmpfs, which demonstrates a
significant amount of readahead activity. vm-scalability usemem is not
ideal for decompress batching because there is very little readahead
activity even with page-cluster of 5 (swap_ra is < 150 with 4k/16k/32k/64k
folios).

The kernel compilation experiments demonstrate significant latency
reductions with decompress batching: up to 4% lower elapsed time and 14%
lower sys time than mm-unstable/zstd. When combined with compress batching,
we see a reduction of 5% in elapsed time and 20% in sys time as compared to
mm-unstable commit 817952b8be34 with zstd.

Our internal validation of IAA compress/decompress batching in highly
contended Sapphire Rapids server setups, with workloads running on 72 cores
for ~25 minutes under stringent memory limit constraints, has shown up to
50% reduction in sys time and 3.5% reduction in workload run time as
compared to software compressors.


System setup for testing:
=========================
Testing of this patch-series was done with mm-unstable as of 10-16-2024,
commit 817952b8be34, without and with this patch-series ("this
patch-series" includes [1]). Data was gathered on an Intel Sapphire Rapids
server, dual-socket 56 cores per socket, 4 IAA devices per socket, 503 GiB
RAM and 525G SSD disk partition swap. Core frequency was fixed at 2500MHz.

The kernel compilation test was run in tmpfs, using "allmodconfig", so that
significant swapout and readahead activity can be observed to quantify
decompress batching.

Other kernel configuration parameters:

    zswap compressor  : deflate-iaa
    zswap allocator   : zsmalloc
    vm.page-cluster   : 3,4

IAA "compression verification" is disabled and the async poll acomp
interface is used in the iaa_crypto driver (the defaults with this
series).
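
For reference, these two driver settings map to iaa_crypto sysfs attributes
(see Documentation/driver-api/crypto/iaa/iaa-crypto.rst). They are the
defaults with this series, so no action is needed; the exact attribute paths
below are stated as an assumption for reproducibility, not taken from this
patch-series:

 # compression verification disabled (default with this series):
 echo 0 > /sys/bus/dsa/drivers/crypto/verify_compress

 # async (poll-based, non-irq) acomp interface:
 echo async > /sys/bus/dsa/drivers/crypto/sync_mode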


Performance testing (Kernel compilation):
=========================================

As discussed in the compress batching patch-series [1], for workloads that
see a lot of swapout activity, we can benefit from configuring 2 WQs per IAA
device, with compress jobs from all same-socket cores being distributed to
wq.1 of all IAAs on the socket, using the "global_wq" developed in [1].

Although this data includes IAA decompress batching from this patch-series,
I am listing it here to quantify the benefit of distributing compress jobs
among all IAAs. The kernel compilation test with "allmodconfig" is able to
quantify this well:


 4K folios: deflate-iaa: kernel compilation
 ==========================================

 ------------------------------------------------------------------------------
                   mm-unstable-10-16-2024       zswap_load_batch with
                                              IAA decompress batching
 ------------------------------------------------------------------------------
 zswap compressor                    zstd         deflate-iaa
 vm.compress-batchsize                n/a                   1
 vm.page-cluster                        3                   3
 ------------------------------------------------------------------------------
 real_sec                          783.87              752.99
 user_sec                       15,750.07           15,746.37
 sys_sec                         6,522.32            5,638.16
 Max_Res_Set_Size_KB            1,872,640           1,872,640

 ------------------------------------------------------------------------------
 zswpout                       82,364,991         105,190,461
 zswpin                        21,303,393          29,684,653
 pswpout                               13                   1
 pswpin                                12                   1
 pgmajfault                    17,114,339          24,034,146
 swap_ra                        4,596,035           6,219,484
 swap_ra_hit                    2,903,249           3,876,195
 ------------------------------------------------------------------------------


 Progression of kernel compilation latency improvements with
 compress/decompress batching:
 ============================================================

 -------------------------------------------------------------------------------
               mm-unstable-10-16-2024   shrink_folio_       zswap_load_batch
                                        list()              w/ IAA decompress
                                        batching            batching 
                                        of folios           
 -------------------------------------------------------------------------------
 zswap compr       zstd   deflate-iaa   deflate-iaa    deflate-iaa   deflate-iaa
 vm.compress-       n/a           n/a            32              1            32
 batchsize
 vm.page-             3             3             3              3             3
  cluster
 -------------------------------------------------------------------------------
 real_sec        783.87        761.69        747.32         752.99        749.25
 user_sec     15,750.07     15,716.69     15,728.39      15,746.37     15,741.71
 sys_sec       6,522.32      5,725.28      5,399.44       5,638.16      5,482.12
 Max_RSS_KB   1,872,640     1,870,848     1,874,432      1,872,640     1,872,640
                                                                                       
 zswpout     82,364,991    97,739,600   102,780,612    105,190,461   106,729,372
 zswpin      21,303,393    27,684,166    29,016,252     29,684,653    30,717,819
 pswpout             13           222           213              1            12
 pswpin              12           209           202              1             8
 pgmajfault  17,114,339    22,421,211    23,378,161     24,034,146    24,852,985
 swap_ra      4,596,035     5,840,082     6,231,646      6,219,484     6,504,878
 swap_ra_hit  2,903,249     3,682,444     3,940,420      3,876,195     4,092,852
 -------------------------------------------------------------------------------


The following data extends this progression, with vm.page-cluster of 4:


 IAA decompress batching combined with distributing compress jobs to all
 same-socket IAA devices: 
 =======================================================================

 ------------------------------------------------------------------------------
                   IAA shrink_folio_list() compress batching and
                       swapin_readahead() decompress batching

                                      1WQ      2WQ (distribute compress jobs)

                        1 local WQ (wq.0)    1 local WQ (wq.0) +
                                  per IAA    1 global WQ (wq.1) per IAA
                        
 ------------------------------------------------------------------------------
 zswap compressor             deflate-iaa         deflate-iaa
 vm.compress-batchsize                 32                  32
 vm.page-cluster                        4                   4
 ------------------------------------------------------------------------------
 real_sec                          746.77              745.42  
 user_sec                       15,732.66           15,738.85
 sys_sec                         5,384.14            5,247.86
 Max_Res_Set_Size_KB            1,874,432           1,872,640

 ------------------------------------------------------------------------------
 zswpout                      101,648,460         104,882,982
 zswpin                        27,418,319          29,428,515
 pswpout                              213                  22
 pswpin                               207                   6
 pgmajfault                    21,896,616          23,629,768
 swap_ra                        6,054,409           6,385,080
 swap_ra_hit                    3,791,628           3,985,141
 ------------------------------------------------------------------------------


I would greatly appreciate code review comments for this RFC series!

[1] https://patchwork.kernel.org/project/linux-mm/list/?series=900537


Thanks,
Kanchana



Kanchana P Sridhar (7):
  mm: zswap: Config variable to enable zswap loads with decompress
    batching.
  mm: swap: Add IAA batch decompression API
    swap_crypto_acomp_decompress_batch().
  pagevec: struct folio_batch changes for decompress batching interface.
  mm: swap: swap_read_folio() can add a folio to a folio_batch if it is
    in zswap.
  mm: swap, zswap: zswap folio_batch processing with IAA decompression
    batching.
  mm: do_swap_page() calls swapin_readahead() zswap load batching
    interface.
  mm: For IAA batching, reduce SWAP_BATCH and modify swap slot cache
    thresholds.

 include/linux/pagevec.h    |  13 +-
 include/linux/swap.h       |   7 +
 include/linux/swap_slots.h |   7 +
 include/linux/zswap.h      |  65 +++++++++
 mm/Kconfig                 |  13 ++
 mm/memory.c                | 187 +++++++++++++++++++------
 mm/page_io.c               |  61 ++++++++-
 mm/shmem.c                 |   2 +-
 mm/swap.h                  | 102 ++++++++++++--
 mm/swap_state.c            | 272 ++++++++++++++++++++++++++++++++++---
 mm/swapfile.c              |   2 +-
 mm/zswap.c                 | 272 +++++++++++++++++++++++++++++++++++++
 12 files changed, 927 insertions(+), 76 deletions(-)

-- 
2.27.0




* [RFC PATCH v1 1/7] mm: zswap: Config variable to enable zswap loads with decompress batching.
  2024-10-18  6:47 [RFC PATCH v1 0/7] zswap IAA decompress batching Kanchana P Sridhar
@ 2024-10-18  6:47 ` Kanchana P Sridhar
  2024-10-18  6:48 ` [RFC PATCH v1 2/7] mm: swap: Add IAA batch decompression API swap_crypto_acomp_decompress_batch() Kanchana P Sridhar
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18  6:47 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, hughd, willy, bfoster, dchinner, chrisl, david
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Add a new zswap config variable that controls whether zswap load will
decompress a batch of 4K folios, for instance, the folios prefetched
during swapin_readahead():

  CONFIG_ZSWAP_LOAD_BATCHING_ENABLED

The existing CONFIG_CRYPTO_DEV_IAA_CRYPTO variable added in commit
ea7a5cbb4369 ("crypto: iaa - Add Intel IAA Compression Accelerator crypto
driver core") is used to detect if the system has the Intel Analytics
Accelerator (IAA), and the iaa_crypto module is available. If so, the
kernel build will prompt for CONFIG_ZSWAP_LOAD_BATCHING_ENABLED. Hence,
users have the ability to set CONFIG_ZSWAP_LOAD_BATCHING_ENABLED="y" only
on systems that have Intel IAA.

If CONFIG_ZSWAP_LOAD_BATCHING_ENABLED is enabled, and IAA is configured
as the zswap compressor, the vm.page-cluster is used to prefetch up to
32 4K folios using swapin_readahead(). The readahead folios present in
zswap are then loaded as a batch using IAA decompression batching.
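
For example (a usage sketch; the config symbol is added by this patch, while
the zswap compressor module parameter is an existing kernel interface), an
IAA-capable system could be set up as:

  # kernel .config
  CONFIG_ZSWAP_LOAD_BATCHING_ENABLED=y

  # select the IAA compressor registered by the iaa_crypto driver
  echo deflate-iaa > /sys/module/zswap/parameters/compressor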

The patch also implements a zswap API that returns the status of this
config variable.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 include/linux/zswap.h |  8 ++++++++
 mm/Kconfig            | 13 +++++++++++++
 mm/zswap.c            | 12 ++++++++++++
 3 files changed, 33 insertions(+)

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 328a1e09d502..294d13efbfb1 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -118,6 +118,9 @@ static inline void zswap_store_batch(struct swap_in_memory_cache_cb *simc)
 	else
 		__zswap_store_batch_single(simc);
 }
+
+bool zswap_load_batching_enabled(void);
+
 unsigned long zswap_total_pages(void);
 bool zswap_store(struct folio *folio);
 bool zswap_load(struct folio *folio);
@@ -145,6 +148,11 @@ static inline void zswap_store_batch(struct swap_in_memory_cache_cb *simc)
 {
 }
 
+static inline bool zswap_load_batching_enabled(void)
+{
+	return false;
+}
+
 static inline bool zswap_store(struct folio *folio)
 {
 	return false;
diff --git a/mm/Kconfig b/mm/Kconfig
index 26d1a5cee471..98e46a3cf0e3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -137,6 +137,19 @@ config ZSWAP_STORE_BATCHING_ENABLED
 	in the folio in	hardware, thereby improving large folio compression
 	throughput and reducing swapout latency.
 
+config ZSWAP_LOAD_BATCHING_ENABLED
+	bool "Batching of zswap loads of 4K folios with Intel IAA"
+	depends on ZSWAP && CRYPTO_DEV_IAA_CRYPTO
+	default n
+	help
+	Enables zswap_load to swap in multiple 4K folios in batches of 8,
+	rather than a folio at a time, if the system has Intel IAA for hardware
+	acceleration of decompressions. swapin_readahead will be used to
+	prefetch a batch of folios to be swapped in along with the faulting
+	folio. If IAA is the zswap compressor, this will parallelize batch
+	decompression of up to 8 folios in hardware, thereby reducing swapin
+	and do_swap_page latency.
+
 choice
 	prompt "Default allocator"
 	depends on ZSWAP
diff --git a/mm/zswap.c b/mm/zswap.c
index 68ce498ad000..fe7bc2a6672e 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -136,6 +136,13 @@ module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
 static bool __zswap_store_batching_enabled = IS_ENABLED(
 	CONFIG_ZSWAP_STORE_BATCHING_ENABLED);
 
+/*
+ * Enable/disable batching of decompressions of multiple 4K folios, if
+ * the system has Intel IAA.
+ */
+static bool __zswap_load_batching_enabled = IS_ENABLED(
+	CONFIG_ZSWAP_LOAD_BATCHING_ENABLED);
+
 bool zswap_is_enabled(void)
 {
 	return zswap_enabled;
@@ -246,6 +253,11 @@ __always_inline bool zswap_store_batching_enabled(void)
 	return __zswap_store_batching_enabled;
 }
 
+__always_inline bool zswap_load_batching_enabled(void)
+{
+	return __zswap_load_batching_enabled;
+}
+
 static void __zswap_store_batch_core(
 	int node_id,
 	struct folio **folios,
-- 
2.27.0




* [RFC PATCH v1 2/7] mm: swap: Add IAA batch decompression API swap_crypto_acomp_decompress_batch().
  2024-10-18  6:47 [RFC PATCH v1 0/7] zswap IAA decompress batching Kanchana P Sridhar
  2024-10-18  6:47 ` [RFC PATCH v1 1/7] mm: zswap: Config variable to enable zswap loads with " Kanchana P Sridhar
@ 2024-10-18  6:48 ` Kanchana P Sridhar
  2024-10-18  6:48 ` [RFC PATCH v1 3/7] pagevec: struct folio_batch changes for decompress batching interface Kanchana P Sridhar
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18  6:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, hughd, willy, bfoster, dchinner, chrisl, david
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Add a new API swap_crypto_acomp_decompress_batch() that does batch
decompression. A system that has Intel IAA can use this API to submit a
batch of decompress jobs for parallel decompression in hardware, to improve
performance. On a system without IAA, this API will process each decompress
job sequentially.

The purpose of this API is to be invocable from any swap module that needs
to decompress multiple 4K folios, or a batch of pages in the general case.
For instance, zswap would decompress up to (1UL << SWAP_RA_ORDER_CEILING)
folios in batches of SWAP_CRYPTO_SUB_BATCH_SIZE (i.e. 8 if the system has
IAA) pages prefetched by swapin_readahead(), which would improve readahead
performance.
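
As a usage illustration, here is a minimal sketch of a caller. It assumes
the compressed source buffers have already been copied into srcs[], that
acomp_ctx is the caller's struct crypto_acomp_ctx, and that
handle_decompress_error() is a hypothetical caller-side fallback, not part
of this series:

	u8 *srcs[SWAP_CRYPTO_SUB_BATCH_SIZE];
	struct page *pages[SWAP_CRYPTO_SUB_BATCH_SIZE];
	unsigned int slens[SWAP_CRYPTO_SUB_BATCH_SIZE];
	int errors[SWAP_CRYPTO_SUB_BATCH_SIZE];
	int i, nr_pages = SWAP_CRYPTO_SUB_BATCH_SIZE;

	/* ... fill srcs[], pages[] and slens[] for nr_pages entries ... */

	/* The API requires the caller to hold the acomp_ctx mutex. */
	mutex_lock(&acomp_ctx->mutex);
	swap_crypto_acomp_decompress_batch(srcs, pages, slens, errors,
					   nr_pages, acomp_ctx);
	mutex_unlock(&acomp_ctx->mutex);

	for (i = 0; i < nr_pages; i++)
		if (errors[i])
			/* hypothetical caller-side fallback */
			handle_decompress_error(pages[i], errors[i]);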

Towards this eventual goal, the swap_crypto_acomp_decompress_batch()
interface is implemented in swap_state.c and exported via mm/swap.h. It
would be preferable for swap_crypto_acomp_decompress_batch() to be exported
via include/linux/swap.h so that modules outside mm (for e.g. zram) can
potentially use the API for batch decompressions with IAA, since the
swapin_readahead() batching interface is common to all swap modules.
I would appreciate RFC comments on this.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/swap.h       |  42 +++++++++++++++++--
 mm/swap_state.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 147 insertions(+), 4 deletions(-)

diff --git a/mm/swap.h b/mm/swap.h
index 08c04954304f..0bb386b5fdee 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -10,11 +10,12 @@ struct mempolicy;
 #include <linux/crypto.h>
 
 /*
- * For IAA compression batching:
- * Maximum number of IAA acomp compress requests that will be processed
- * in a sub-batch.
+ * For IAA compression/decompression batching:
+ * Maximum number of IAA acomp compress/decompress requests that will be
+ * processed in a sub-batch.
  */
-#if defined(CONFIG_ZSWAP_STORE_BATCHING_ENABLED)
+#if defined(CONFIG_ZSWAP_STORE_BATCHING_ENABLED) || \
+	defined(CONFIG_ZSWAP_LOAD_BATCHING_ENABLED)
 #define SWAP_CRYPTO_SUB_BATCH_SIZE 8UL
 #else
 #define SWAP_CRYPTO_SUB_BATCH_SIZE 1UL
@@ -60,6 +61,29 @@ void swap_crypto_acomp_compress_batch(
 	int nr_pages,
 	struct crypto_acomp_ctx *acomp_ctx);
 
+/**
+ * This API provides IAA decompress batching functionality for use by swap
+ * modules.
+ * The acomp_ctx mutex should be locked/unlocked before/after calling this
+ * procedure.
+ *
+ * @srcs: The src buffers to be decompressed.
+ * @pages: The pages to store the buffers decompressed by IAA.
+ * @slens: src buffers' compressed lengths.
+ * @errors: Will contain a 0 if the page was successfully decompressed, or a
+ *          non-0 error value to be processed by the calling function.
+ * @nr_pages: The number of pages, up to SWAP_CRYPTO_SUB_BATCH_SIZE,
+ *            to be decompressed.
+ * @acomp_ctx: The acomp context for iaa_crypto/other compressor.
+ */
+void swap_crypto_acomp_decompress_batch(
+	u8 *srcs[],
+	struct page *pages[],
+	unsigned int slens[],
+	int errors[],
+	int nr_pages,
+	struct crypto_acomp_ctx *acomp_ctx);
+
 /* linux/mm/vmscan.c, linux/mm/page_io.c, linux/mm/zswap.c */
 /* For batching of compressions in reclaim path. */
 struct swap_in_memory_cache_cb {
@@ -204,6 +228,16 @@ static inline void swap_write_in_memory_cache_unplug(
 {
 }
 
+static inline void swap_crypto_acomp_decompress_batch(
+	u8 *srcs[],
+	struct page *pages[],
+	unsigned int slens[],
+	int errors[],
+	int nr_pages,
+	struct crypto_acomp_ctx *acomp_ctx)
+{
+}
+
 static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 {
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 117c3caa5679..3cebbff40804 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -855,6 +855,115 @@ void swap_crypto_acomp_compress_batch(
 }
 EXPORT_SYMBOL_GPL(swap_crypto_acomp_compress_batch);
 
+/**
+ * This API provides IAA decompress batching functionality for use by swap
+ * modules.
+ * The acomp_ctx mutex should be locked/unlocked before/after calling this
+ * procedure.
+ *
+ * @srcs: The src buffers to be decompressed.
+ * @pages: The pages to store the buffers decompressed by IAA.
+ * @slens: src buffers' compressed lengths.
+ * @errors: Will contain a 0 if the page was successfully decompressed, or a
+ *          non-0 error value to be processed by the calling function.
+ * @nr_pages: The number of pages, up to SWAP_CRYPTO_SUB_BATCH_SIZE,
+ *            to be decompressed.
+ * @acomp_ctx: The acomp context for iaa_crypto/other compressor.
+ */
+void swap_crypto_acomp_decompress_batch(
+	u8 *srcs[],
+	struct page *pages[],
+	unsigned int slens[],
+	int errors[],
+	int nr_pages,
+	struct crypto_acomp_ctx *acomp_ctx)
+{
+	struct scatterlist inputs[SWAP_CRYPTO_SUB_BATCH_SIZE];
+	struct scatterlist outputs[SWAP_CRYPTO_SUB_BATCH_SIZE];
+	unsigned int dlens[SWAP_CRYPTO_SUB_BATCH_SIZE];
+	bool decompressions_done = false;
+	int i, j;
+
+	BUG_ON(nr_pages > SWAP_CRYPTO_SUB_BATCH_SIZE);
+
+	/*
+	 * Prepare and submit acomp_reqs to IAA.
+	 * IAA will process these decompress jobs in parallel in async mode.
+	 * If the compressor does not support a poll() method, or if IAA is
+	 * used in sync mode, the jobs will be processed sequentially using
+	 * acomp_ctx->req[0] and acomp_ctx->wait.
+	 */
+	for (i = 0; i < nr_pages; ++i) {
+		j = acomp_ctx->acomp->poll ? i : 0;
+
+		dlens[i] = PAGE_SIZE;
+		sg_init_one(&inputs[i], srcs[i], slens[i]);
+		sg_init_table(&outputs[i], 1);
+		sg_set_page(&outputs[i], pages[i], PAGE_SIZE, 0);
+		acomp_request_set_params(acomp_ctx->req[j], &inputs[i],
+					&outputs[i], slens[i], dlens[i]);
+		/*
+		 * If the crypto_acomp provides an asynchronous poll()
+		 * interface, submit the request to the driver now, and poll for
+		 * a completion status later, after all descriptors have been
+		 * submitted. If the crypto_acomp does not provide a poll()
+		 * interface, submit the request and wait for it to complete,
+		 * i.e., synchronously, before moving on to the next request.
+		 */
+		if (acomp_ctx->acomp->poll) {
+			errors[i] = crypto_acomp_decompress(acomp_ctx->req[j]);
+
+			if (errors[i] != -EINPROGRESS)
+				errors[i] = -EINVAL;
+			else
+				errors[i] = -EAGAIN;
+		} else {
+			errors[i] = crypto_wait_req(
+					crypto_acomp_decompress(acomp_ctx->req[j]),
+					&acomp_ctx->wait);
+			if (!errors[i]) {
+				dlens[i] = acomp_ctx->req[j]->dlen;
+				BUG_ON(dlens[i] != PAGE_SIZE);
+			}
+		}
+	}
+
+	/*
+	 * If not doing async decompressions, the batch has been processed at
+	 * this point and we can return.
+	 */
+	if (!acomp_ctx->acomp->poll)
+		return;
+
+	/*
+	 * Poll for and process IAA decompress job completions
+	 * in out-of-order manner.
+	 */
+	while (!decompressions_done) {
+		decompressions_done = true;
+
+		for (i = 0; i < nr_pages; ++i) {
+			/*
+			 * Skip, if the decompression has already completed
+			 * successfully or with an error.
+			 */
+			if (errors[i] != -EAGAIN)
+				continue;
+
+			errors[i] = crypto_acomp_poll(acomp_ctx->req[i]);
+
+			if (errors[i]) {
+				if (errors[i] == -EAGAIN)
+					decompressions_done = false;
+			} else {
+				dlens[i] = acomp_ctx->req[i]->dlen;
+				BUG_ON(dlens[i] != PAGE_SIZE);
+			}
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(swap_crypto_acomp_decompress_batch);
+
 #endif /* CONFIG_SWAP */
 
 static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start,
-- 
2.27.0




* [RFC PATCH v1 3/7] pagevec: struct folio_batch changes for decompress batching interface.
  2024-10-18  6:47 [RFC PATCH v1 0/7] zswap IAA decompress batching Kanchana P Sridhar
  2024-10-18  6:47 ` [RFC PATCH v1 1/7] mm: zswap: Config variable to enable zswap loads with " Kanchana P Sridhar
  2024-10-18  6:48 ` [RFC PATCH v1 2/7] mm: swap: Add IAA batch decompression API swap_crypto_acomp_decompress_batch() Kanchana P Sridhar
@ 2024-10-18  6:48 ` Kanchana P Sridhar
  2024-10-18  6:48 ` [RFC PATCH v1 4/7] mm: swap: swap_read_folio() can add a folio to a folio_batch if it is in zswap Kanchana P Sridhar
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18  6:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, hughd, willy, bfoster, dchinner, chrisl, david
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Made these changes to "struct folio_batch" for use in the
swapin_readahead() based zswap load batching interface for parallel
decompressions with IAA:

 1) Moved SWAP_RA_ORDER_CEILING definition to pagevec.h.
 2) Increased PAGEVEC_SIZE to (1UL << SWAP_RA_ORDER_CEILING),
    because vm.page-cluster=5 requires capacity for 32 folios.
 3) Made folio_batch_add() more fail-safe (see the usage sketch below).
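
A short caller-side sketch of the fail-safe behavior; process_full_batch()
is a hypothetical drain step, shown only to illustrate the existing
folio_batch convention of draining whenever 0 is returned:

	/*
	 * folio_batch_add() returns the space left in the batch. With this
	 * change, adding to an already-full batch no longer writes past the
	 * array; it becomes a no-op that returns 0.
	 */
	if (!folio_batch_add(fbatch, folio)) {
		process_full_batch(fbatch);	/* hypothetical: drain the batch */
		folio_batch_reinit(fbatch);
	}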

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 include/linux/pagevec.h | 13 ++++++++++---
 mm/swap_state.c         |  2 --
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 5d3a0cccc6bf..c9bab240fb6e 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -11,8 +11,14 @@
 
 #include <linux/types.h>
 
-/* 31 pointers + header align the folio_batch structure to a power of two */
-#define PAGEVEC_SIZE	31
+/*
+ * With page-cluster of 5, space for 31 folio pointers is insufficient.
+ * Increase PAGEVEC_SIZE to meet the requirements of folio_batch usage in
+ * the swap read decompress batching interface that is based on
+ * swapin_readahead().
+ */
+#define SWAP_RA_ORDER_CEILING	5
+#define PAGEVEC_SIZE	(1UL << SWAP_RA_ORDER_CEILING)
 
 struct folio;
 
@@ -74,7 +80,8 @@ static inline unsigned int folio_batch_space(struct folio_batch *fbatch)
 static inline unsigned folio_batch_add(struct folio_batch *fbatch,
 		struct folio *folio)
 {
-	fbatch->folios[fbatch->nr++] = folio;
+	if (folio_batch_space(fbatch) > 0)
+		fbatch->folios[fbatch->nr++] = folio;
 	return folio_batch_space(fbatch);
 }
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3cebbff40804..0673593d363c 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -44,8 +44,6 @@ struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly;
 static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly;
 static bool enable_vma_readahead __read_mostly = true;
 
-#define SWAP_RA_ORDER_CEILING	5
-
 #define SWAP_RA_WIN_SHIFT	(PAGE_SHIFT / 2)
 #define SWAP_RA_HITS_MASK	((1UL << SWAP_RA_WIN_SHIFT) - 1)
 #define SWAP_RA_HITS_MAX	SWAP_RA_HITS_MASK
-- 
2.27.0




* [RFC PATCH v1 4/7] mm: swap: swap_read_folio() can add a folio to a folio_batch if it is in zswap.
  2024-10-18  6:47 [RFC PATCH v1 0/7] zswap IAA decompress batching Kanchana P Sridhar
                   ` (2 preceding siblings ...)
  2024-10-18  6:48 ` [RFC PATCH v1 3/7] pagevec: struct folio_batch changes for decompress batching interface Kanchana P Sridhar
@ 2024-10-18  6:48 ` Kanchana P Sridhar
  2024-10-18  6:48 ` [RFC PATCH v1 5/7] mm: swap, zswap: zswap folio_batch processing with IAA decompression batching Kanchana P Sridhar
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18  6:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, hughd, willy, bfoster, dchinner, chrisl, david
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch modifies swap_read_folio() to check if the swap entry is present
in zswap, and if so, it will be added to a "zswap_batch" folio_batch, if
the caller (e.g. swapin_readahead()) has passed in a valid "zswap_batch".

If the swap entry is found in zswap, it will be added at the next available
index in a sub-batch. The sub-batches are part of "struct zswap_decomp_batch",
which progressively constructs arrays of up to SWAP_CRYPTO_SUB_BATCH_SIZE
zswap entries/xarrays/pages/source-lengths, ready for batch decompression
with IAA. The function that does this, zswap_add_load_batch(), will return
true to swap_read_folio(). If the entry is not found in zswap, it will
return false.
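
To make the indexing concrete: with vm.page-cluster of 5 (PAGEVEC_SIZE = 32)
and SWAP_CRYPTO_SUB_BATCH_SIZE = 8 on an IAA system, a zswap_decomp_batch
holds MAX_NR_ZSWAP_LOAD_SUB_BATCHES = DIV_ROUND_UP(32, 8) = 4 sub-batches,
and a folio added at, say, batch index 19 is placed as follows ("slot" is
only an illustrative name for the index the patch derives from nr_decomp):

	batch_idx = folio_batch_count(&zd_batch->fbatch) - 1;	/* 19 */
	sb        = batch_idx / SWAP_CRYPTO_SUB_BATCH_SIZE;	/* 19 / 8 = sub-batch 2 */
	slot      = zd_batch->nr_decomp[sb]++;			/* next free decompress slot */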

If the swap entry was not found in zswap, and if
zswap_load_batching_enabled() and a valid "non_zswap_batch" folio_batch is
passed to swap_read_folio(), the folio will be added to the
"non_zswap_batch" batch.

Finally, the code falls through to the usual/existing swap_read_folio()
flow.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 include/linux/zswap.h | 35 +++++++++++++++++
 mm/memory.c           |  2 +-
 mm/page_io.c          | 26 ++++++++++++-
 mm/swap.h             | 31 ++++++++++++++-
 mm/swap_state.c       | 10 ++---
 mm/zswap.c            | 89 +++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 183 insertions(+), 10 deletions(-)

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 294d13efbfb1..1d6de281f243 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -12,6 +12,8 @@ extern atomic_long_t zswap_stored_pages;
 #ifdef CONFIG_ZSWAP
 
 struct swap_in_memory_cache_cb;
+struct zswap_decomp_batch;
+struct zswap_entry;
 
 struct zswap_lruvec_state {
 	/*
@@ -120,6 +122,19 @@ static inline void zswap_store_batch(struct swap_in_memory_cache_cb *simc)
 }
 
 bool zswap_load_batching_enabled(void);
+void zswap_load_batch_init(struct zswap_decomp_batch *zd_batch);
+void zswap_load_batch_reinit(struct zswap_decomp_batch *zd_batch);
+bool __zswap_add_load_batch(struct zswap_decomp_batch *zd_batch,
+			    struct folio *folio);
+static inline bool zswap_add_load_batch(
+	struct zswap_decomp_batch *zd_batch,
+	struct folio *folio)
+{
+	if (zswap_load_batching_enabled())
+		return __zswap_add_load_batch(zd_batch, folio);
+
+	return false;
+}
 
 unsigned long zswap_total_pages(void);
 bool zswap_store(struct folio *folio);
@@ -138,6 +153,8 @@ struct zswap_lruvec_state {};
 struct zswap_store_sub_batch_page {};
 struct zswap_store_pipeline_state {};
 struct swap_in_memory_cache_cb;
+struct zswap_decomp_batch;
+struct zswap_entry;
 
 static inline bool zswap_store_batching_enabled(void)
 {
@@ -153,6 +170,24 @@ static inline bool zswap_load_batching_enabled(void)
 	return false;
 }
 
+static inline void zswap_load_batch_init(
+	struct zswap_decomp_batch *zd_batch)
+{
+}
+
+static inline void zswap_load_batch_reinit(
+	struct zswap_decomp_batch *zd_batch)
+{
+}
+
+static inline bool zswap_add_load_batch(
+	struct zswap_decomp_batch *zd_batch,
+	struct folio *folio)
+{
+	return false;
+}
+
 static inline bool zswap_store(struct folio *folio)
 {
 	return false;
diff --git a/mm/memory.c b/mm/memory.c
index 0f614523b9f4..b5745b9ffdf7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4322,7 +4322,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
 				/* To provide entry to swap_read_folio() */
 				folio->swap = entry;
-				swap_read_folio(folio, NULL);
+				swap_read_folio(folio, NULL, NULL, NULL);
 				folio->private = NULL;
 			}
 		} else {
diff --git a/mm/page_io.c b/mm/page_io.c
index 065db25309b8..9750302d193b 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -744,11 +744,17 @@ static void swap_read_folio_bdev_async(struct folio *folio,
 	submit_bio(bio);
 }
 
-void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
+/*
+ * Returns true if the folio was read, and false if the folio was added to
+ * the zswap_decomp_batch for batched decompression.
+ */
+bool swap_read_folio(struct folio *folio, struct swap_iocb **plug,
+		     struct zswap_decomp_batch *zswap_batch,
+		     struct folio_batch *non_zswap_batch)
 {
 	struct swap_info_struct *sis = swp_swap_info(folio->swap);
 	bool synchronous = sis->flags & SWP_SYNCHRONOUS_IO;
-	bool workingset = folio_test_workingset(folio);
+	bool workingset;
 	unsigned long pflags;
 	bool in_thrashing;
 
@@ -756,11 +762,26 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_uptodate(folio), folio);
 
+	/*
+	 * If entry is found in zswap xarray, and zswap load batching
+	 * is enabled, this is a candidate for zswap batch decompression.
+	 */
+	if (zswap_batch && zswap_add_load_batch(zswap_batch, folio))
+		return false;
+
+	if (zswap_load_batching_enabled() && non_zswap_batch) {
+		BUG_ON(!folio_batch_space(non_zswap_batch));
+		folio_batch_add(non_zswap_batch, folio);
+		return false;
+	}
+
 	/*
 	 * Count submission time as memory stall and delay. When the device
 	 * is congested, or the submitting cgroup IO-throttled, submission
 	 * can be a significant part of overall IO time.
 	 */
+	workingset = folio_test_workingset(folio);
+
 	if (workingset) {
 		delayacct_thrashing_start(&in_thrashing);
 		psi_memstall_enter(&pflags);
@@ -792,6 +813,7 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 		psi_memstall_leave(&pflags);
 	}
 	delayacct_swapin_end();
+	return true;
 }
 
 void __swap_read_unplug(struct swap_iocb *sio)
diff --git a/mm/swap.h b/mm/swap.h
index 0bb386b5fdee..310f99007fe6 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -84,6 +84,27 @@ void swap_crypto_acomp_decompress_batch(
 	int nr_pages,
 	struct crypto_acomp_ctx *acomp_ctx);
 
+#if defined(CONFIG_ZSWAP_LOAD_BATCHING_ENABLED)
+#define MAX_NR_ZSWAP_LOAD_SUB_BATCHES DIV_ROUND_UP(PAGEVEC_SIZE, \
+					SWAP_CRYPTO_SUB_BATCH_SIZE)
+#else
+#define MAX_NR_ZSWAP_LOAD_SUB_BATCHES 1UL
+#endif /* CONFIG_ZSWAP_LOAD_BATCHING_ENABLED */
+
+/*
+ * Note: If PAGEVEC_SIZE or SWAP_CRYPTO_SUB_BATCH_SIZE
+ * exceeds 256, change the u8 to u16.
+ */
+struct zswap_decomp_batch {
+	struct folio_batch fbatch;
+	bool swapcache[PAGEVEC_SIZE];
+	struct xarray *trees[MAX_NR_ZSWAP_LOAD_SUB_BATCHES][SWAP_CRYPTO_SUB_BATCH_SIZE];
+	struct zswap_entry *entries[MAX_NR_ZSWAP_LOAD_SUB_BATCHES][SWAP_CRYPTO_SUB_BATCH_SIZE];
+	struct page *pages[MAX_NR_ZSWAP_LOAD_SUB_BATCHES][SWAP_CRYPTO_SUB_BATCH_SIZE];
+	unsigned int slens[MAX_NR_ZSWAP_LOAD_SUB_BATCHES][SWAP_CRYPTO_SUB_BATCH_SIZE];
+	u8 nr_decomp[MAX_NR_ZSWAP_LOAD_SUB_BATCHES];
+};
+
 /* linux/mm/vmscan.c, linux/mm/page_io.c, linux/mm/zswap.c */
 /* For batching of compressions in reclaim path. */
 struct swap_in_memory_cache_cb {
@@ -101,7 +122,9 @@ struct swap_in_memory_cache_cb {
 /* linux/mm/page_io.c */
 int sio_pool_init(void);
 struct swap_iocb;
-void swap_read_folio(struct folio *folio, struct swap_iocb **plug);
+bool swap_read_folio(struct folio *folio, struct swap_iocb **plug,
+		     struct zswap_decomp_batch *zswap_batch,
+		     struct folio_batch *non_zswap_batch);
 void __swap_read_unplug(struct swap_iocb *plug);
 static inline void swap_read_unplug(struct swap_iocb *plug)
 {
@@ -238,8 +261,12 @@ static inline void swap_crypto_acomp_decompress_batch(
 {
 }
 
-static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
+struct zswap_decomp_batch {};
+static inline bool swap_read_folio(struct folio *folio, struct swap_iocb **plug,
+				   struct zswap_decomp_batch *zswap_batch,
+				   struct folio_batch *non_zswap_batch)
 {
+	return false;
 }
 static inline void swap_write_unplug(struct swap_iocb *sio)
 {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 0673593d363c..0aa938e4c34d 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -570,7 +570,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	mpol_cond_put(mpol);
 
 	if (page_allocated)
-		swap_read_folio(folio, plug);
+		swap_read_folio(folio, plug, NULL, NULL);
 	return folio;
 }
 
@@ -687,7 +687,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 		if (!folio)
 			continue;
 		if (page_allocated) {
-			swap_read_folio(folio, &splug);
+			swap_read_folio(folio, &splug, NULL, NULL);
 			if (offset != entry_offset) {
 				folio_set_readahead(folio);
 				count_vm_event(SWAP_RA);
@@ -703,7 +703,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
 					&page_allocated, false);
 	if (unlikely(page_allocated))
-		swap_read_folio(folio, NULL);
+		swap_read_folio(folio, NULL, NULL, NULL);
 	return folio;
 }
 
@@ -1057,7 +1057,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 		if (!folio)
 			continue;
 		if (page_allocated) {
-			swap_read_folio(folio, &splug);
+			swap_read_folio(folio, &splug, NULL, NULL);
 			if (addr != vmf->address) {
 				folio_set_readahead(folio);
 				count_vm_event(SWAP_RA);
@@ -1075,7 +1075,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 	folio = __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx,
 					&page_allocated, false);
 	if (unlikely(page_allocated))
-		swap_read_folio(folio, NULL);
+		swap_read_folio(folio, NULL, NULL, NULL);
 	return folio;
 }
 
diff --git a/mm/zswap.c b/mm/zswap.c
index fe7bc2a6672e..1d293f95d525 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -2312,6 +2312,95 @@ bool zswap_load(struct folio *folio)
 	return true;
 }
 
+/* Code for zswap load batch with batch decompress. */
+
+__always_inline void zswap_load_batch_init(struct zswap_decomp_batch *zd_batch)
+{
+	unsigned int sb;
+
+	folio_batch_init(&zd_batch->fbatch);
+
+	for (sb = 0; sb < MAX_NR_ZSWAP_LOAD_SUB_BATCHES; ++sb)
+		zd_batch->nr_decomp[sb] = 0;
+}
+
+__always_inline void zswap_load_batch_reinit(struct zswap_decomp_batch *zd_batch)
+{
+	unsigned int sb;
+
+	folio_batch_reinit(&zd_batch->fbatch);
+
+	for (sb = 0; sb < MAX_NR_ZSWAP_LOAD_SUB_BATCHES; ++sb)
+		zd_batch->nr_decomp[sb] = 0;
+}
+
+/*
+ * All folios in zd_batch are allocated into the swapcache
+ * in swapin_readahead(), before being added to the zd_batch
+ * for batch decompression.
+ */
+bool __zswap_add_load_batch(struct zswap_decomp_batch *zd_batch,
+			    struct folio *folio)
+{
+	swp_entry_t swp = folio->swap;
+	pgoff_t offset = swp_offset(swp);
+	bool swapcache = folio_test_swapcache(folio);
+	struct xarray *tree = swap_zswap_tree(swp);
+	struct zswap_entry *entry;
+	unsigned int batch_idx, sb;
+
+	VM_WARN_ON_ONCE(!folio_test_locked(folio));
+
+	if (zswap_never_enabled())
+		return false;
+
+	/*
+	 * Large folios should not be swapped in while zswap is being used, as
+	 * they are not properly handled. Zswap does not properly load large
+	 * folios, and a large folio may only be partially in zswap.
+	 *
+	 * Returning false here will cause the large folio to be added to
+	 * the "non_zswap_batch" in swap_read_folio(), which will eventually
+	 * call zswap_load() if the folio is not in the zeromap. Finally,
+	 * zswap_load() will return true without marking the folio uptodate
+	 * so that an IO error is emitted (e.g. do_swap_page() will sigbus).
+	 */
+	if (WARN_ON_ONCE(folio_test_large(folio)))
+		return false;
+
+	/*
+	 * When reading into the swapcache, invalidate our entry. The
+	 * swapcache can be the authoritative owner of the page and
+	 * its mappings, and the pressure that results from having two
+	 * in-memory copies outweighs any benefits of caching the
+	 * compression work.
+	 */
+	if (swapcache)
+		entry = xa_erase(tree, offset);
+	else
+		entry = xa_load(tree, offset);
+
+	if (!entry)
+		return false;
+
+	BUG_ON(!folio_batch_space(&zd_batch->fbatch));
+	folio_batch_add(&zd_batch->fbatch, folio);
+
+	batch_idx = folio_batch_count(&zd_batch->fbatch) - 1;
+	zd_batch->swapcache[batch_idx] = swapcache;
+	sb = batch_idx / SWAP_CRYPTO_SUB_BATCH_SIZE;
+
+	if (entry->length) {
+		zd_batch->trees[sb][zd_batch->nr_decomp[sb]] = tree;
+		zd_batch->entries[sb][zd_batch->nr_decomp[sb]] = entry;
+		zd_batch->pages[sb][zd_batch->nr_decomp[sb]] = &folio->page;
+		zd_batch->slens[sb][zd_batch->nr_decomp[sb]] = entry->length;
+		zd_batch->nr_decomp[sb]++;
+	}
+
+	return true;
+}
+
 void zswap_invalidate(swp_entry_t swp)
 {
 	pgoff_t offset = swp_offset(swp);
-- 
2.27.0




* [RFC PATCH v1 5/7] mm: swap, zswap: zswap folio_batch processing with IAA decompression batching.
  2024-10-18  6:47 [RFC PATCH v1 0/7] zswap IAA decompress batching Kanchana P Sridhar
                   ` (3 preceding siblings ...)
  2024-10-18  6:48 ` [RFC PATCH v1 4/7] mm: swap: swap_read_folio() can add a folio to a folio_batch if it is in zswap Kanchana P Sridhar
@ 2024-10-18  6:48 ` Kanchana P Sridhar
  2024-10-18  6:48 ` [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface Kanchana P Sridhar
  2024-10-18  6:48 ` [RFC PATCH v1 7/7] mm: For IAA batching, reduce SWAP_BATCH and modify swap slot cache thresholds Kanchana P Sridhar
  6 siblings, 0 replies; 15+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18  6:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, hughd, willy, bfoster, dchinner, chrisl, david
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch provides the functionality that processes a "zswap_batch" in
which swap_read_folio() had previously stored swap entries found in zswap,
for batched loading.

The newly added zswap_finish_load_batch() API implements the main zswap
load batching functionality. This makes use of the sub-batches of
zswap_entry/xarray/page/source-length readily available from
zswap_add_load_batch(). These sub-batch arrays are processed one at a time,
until the entire zswap folio_batch has been loaded. The existing
zswap_load() functionality of deleting zswap_entries for folios found in
the swapcache is preserved.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 include/linux/zswap.h |  22 ++++++
 mm/page_io.c          |  35 +++++++++
 mm/swap.h             |  17 +++++
 mm/zswap.c            | 171 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 245 insertions(+)

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 1d6de281f243..a0792c2b300a 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -110,6 +110,15 @@ struct zswap_store_pipeline_state {
 	u8 nr_comp_pages;
 };
 
+/* Note: If SWAP_CRYPTO_SUB_BATCH_SIZE exceeds 256, change the u8 to u16. */
+struct zswap_load_sub_batch_state {
+	struct xarray **trees;
+	struct zswap_entry **entries;
+	struct page **pages;
+	unsigned int *slens;
+	u8 nr_decomp;
+};
+
 bool zswap_store_batching_enabled(void);
 void __zswap_store_batch(struct swap_in_memory_cache_cb *simc);
 void __zswap_store_batch_single(struct swap_in_memory_cache_cb *simc);
@@ -136,6 +145,14 @@ static inline bool zswap_add_load_batch(
 	return false;
 }
 
+void __zswap_finish_load_batch(struct zswap_decomp_batch *zd_batch);
+static inline void zswap_finish_load_batch(
+	struct zswap_decomp_batch *zd_batch)
+{
+	if (zswap_load_batching_enabled())
+		__zswap_finish_load_batch(zd_batch);
+}
+
 unsigned long zswap_total_pages(void);
 bool zswap_store(struct folio *folio);
 bool zswap_load(struct folio *folio);
@@ -188,6 +205,11 @@ static inline bool zswap_add_load_batch(
 	return false;
 }
 
+static inline void zswap_finish_load_batch(
+	struct zswap_decomp_batch *zd_batch)
+{
+}
+
 static inline bool zswap_store(struct folio *folio)
 {
 	return false;
diff --git a/mm/page_io.c b/mm/page_io.c
index 9750302d193b..aa83221318ef 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -816,6 +816,41 @@ bool swap_read_folio(struct folio *folio, struct swap_iocb **plug,
 	return true;
 }
 
+static void __swap_post_process_zswap_load_batch(
+	struct zswap_decomp_batch *zswap_batch)
+{
+	u8 i;
+
+	for (i = 0; i < folio_batch_count(&zswap_batch->fbatch); ++i) {
+		struct folio *folio = zswap_batch->fbatch.folios[i];
+		folio_unlock(folio);
+	}
+}
+
+/*
+ * The swapin_readahead batching interface makes sure that the
+ * input zswap_batch consists of folios belonging to the same swap
+ * device type.
+ */
+void __swap_read_zswap_batch_unplug(struct zswap_decomp_batch *zswap_batch,
+				    struct swap_iocb **splug)
+{
+	unsigned long pflags;
+
+	if (!folio_batch_count(&zswap_batch->fbatch))
+		return;
+
+	psi_memstall_enter(&pflags);
+	delayacct_swapin_start();
+
+	/* Load the zswap batch. */
+	zswap_finish_load_batch(zswap_batch);
+	__swap_post_process_zswap_load_batch(zswap_batch);
+
+	psi_memstall_leave(&pflags);
+	delayacct_swapin_end();
+}
+
 void __swap_read_unplug(struct swap_iocb *sio)
 {
 	struct iov_iter from;
diff --git a/mm/swap.h b/mm/swap.h
index 310f99007fe6..2b82c8ed765c 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -125,6 +125,16 @@ struct swap_iocb;
 bool swap_read_folio(struct folio *folio, struct swap_iocb **plug,
 		     struct zswap_decomp_batch *zswap_batch,
 		     struct folio_batch *non_zswap_batch);
+void __swap_read_zswap_batch_unplug(
+	struct zswap_decomp_batch *zswap_batch,
+	struct swap_iocb **splug);
+static inline void swap_read_zswap_batch_unplug(
+	struct zswap_decomp_batch *zswap_batch,
+	struct swap_iocb **splug)
+{
+	if (likely(zswap_batch))
+		__swap_read_zswap_batch_unplug(zswap_batch, splug);
+}
 void __swap_read_unplug(struct swap_iocb *plug);
 static inline void swap_read_unplug(struct swap_iocb *plug)
 {
@@ -268,6 +278,13 @@ static inline bool swap_read_folio(struct folio *folio, struct swap_iocb **plug,
 {
 	return false;
 }
+
+static inline void swap_read_zswap_batch_unplug(
+	struct zswap_decomp_batch *zswap_batch,
+	struct swap_iocb **splug)
+{
+}
+
 static inline void swap_write_unplug(struct swap_iocb *sio)
 {
 }
diff --git a/mm/zswap.c b/mm/zswap.c
index 1d293f95d525..39bf7d8810e9 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -35,6 +35,7 @@
 #include <linux/pagemap.h>
 #include <linux/workqueue.h>
 #include <linux/list_lru.h>
+#include <linux/delayacct.h>
 
 #include "swap.h"
 #include "internal.h"
@@ -2401,6 +2402,176 @@ bool __zswap_add_load_batch(struct zswap_decomp_batch *zd_batch,
 	return true;
 }
 
+static __always_inline void zswap_load_sub_batch_init(
+	struct zswap_decomp_batch *zd_batch,
+	unsigned int sb,
+	struct zswap_load_sub_batch_state *zls)
+{
+	zls->trees = zd_batch->trees[sb];
+	zls->entries = zd_batch->entries[sb];
+	zls->pages = zd_batch->pages[sb];
+	zls->slens = zd_batch->slens[sb];
+	zls->nr_decomp = zd_batch->nr_decomp[sb];
+}
+
+static void zswap_load_map_sources(
+	struct zswap_load_sub_batch_state *zls,
+	u8 *srcs[])
+{
+	u8 i;
+
+	for (i = 0; i < zls->nr_decomp; ++i) {
+		struct zswap_entry *entry = zls->entries[i];
+		struct zpool *zpool = entry->pool->zpool;
+		u8 *buf = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
+		memcpy(srcs[i], buf, entry->length);
+		zpool_unmap_handle(zpool, entry->handle);
+	}
+}
+
+static void zswap_decompress_batch(
+	struct zswap_load_sub_batch_state *zls,
+	u8 *srcs[],
+	int decomp_errors[])
+{
+	struct crypto_acomp_ctx *acomp_ctx;
+
+	acomp_ctx = raw_cpu_ptr(zls->entries[0]->pool->acomp_ctx);
+
+	swap_crypto_acomp_decompress_batch(
+		srcs,
+		zls->pages,
+		zls->slens,
+		decomp_errors,
+		zls->nr_decomp,
+		acomp_ctx);
+}
+
+static void zswap_load_batch_updates(
+	struct zswap_decomp_batch *zd_batch,
+	unsigned int sb,
+	struct zswap_load_sub_batch_state *zls,
+	int decomp_errors[])
+{
+	unsigned int j;
+	u8 i;
+
+	for (i = 0; i < zls->nr_decomp; ++i) {
+		j = (sb * SWAP_CRYPTO_SUB_BATCH_SIZE) + i;
+		struct folio *folio = zd_batch->fbatch.folios[j];
+		struct zswap_entry *entry = zls->entries[i];
+
+		BUG_ON(decomp_errors[i]);
+		count_vm_event(ZSWPIN);
+		if (entry->objcg)
+			count_objcg_events(entry->objcg, ZSWPIN, 1);
+
+		if (zd_batch->swapcache[j]) {
+			zswap_entry_free(entry);
+			folio_mark_dirty(folio);
+		}
+
+		folio_mark_uptodate(folio);
+	}
+}
+
+static void zswap_load_decomp_batch(
+	struct zswap_decomp_batch *zd_batch,
+	unsigned int sb,
+	struct zswap_load_sub_batch_state *zls)
+{
+	int decomp_errors[SWAP_CRYPTO_SUB_BATCH_SIZE];
+	struct crypto_acomp_ctx *acomp_ctx;
+
+	acomp_ctx = raw_cpu_ptr(zls->entries[0]->pool->acomp_ctx);
+	mutex_lock(&acomp_ctx->mutex);
+
+	zswap_load_map_sources(zls, acomp_ctx->buffer);
+
+	zswap_decompress_batch(zls, acomp_ctx->buffer, decomp_errors);
+
+	mutex_unlock(&acomp_ctx->mutex);
+
+	zswap_load_batch_updates(zd_batch, sb, zls, decomp_errors);
+}
+
+static void zswap_load_start_accounting(
+	struct zswap_decomp_batch *zd_batch,
+	unsigned int sb,
+	struct zswap_load_sub_batch_state *zls,
+	bool workingset[],
+	bool in_thrashing[])
+{
+	unsigned int j;
+	u8 i;
+
+	for (i = 0; i < zls->nr_decomp; ++i) {
+		j = (sb * SWAP_CRYPTO_SUB_BATCH_SIZE) + i;
+		struct folio *folio = zd_batch->fbatch.folios[j];
+		workingset[i] = folio_test_workingset(folio);
+		if (workingset[i])
+			delayacct_thrashing_start(&in_thrashing[i]);
+	}
+}
+
+static void zswap_load_end_accounting(
+	struct zswap_decomp_batch *zd_batch,
+	struct zswap_load_sub_batch_state *zls,
+	bool workingset[],
+	bool in_thrashing[])
+{
+	u8 i;
+
+	for (i = 0; i < zls->nr_decomp; ++i)
+		if (workingset[i])
+			delayacct_thrashing_end(&in_thrashing[i]);
+}
+
+/*
+ * All entries in a zd_batch belong to the same swap device.
+ */
+void __zswap_finish_load_batch(struct zswap_decomp_batch *zd_batch)
+{
+	struct zswap_load_sub_batch_state zls;
+	unsigned int nr_folios = folio_batch_count(&zd_batch->fbatch);
+	unsigned int nr_sb = DIV_ROUND_UP(nr_folios, SWAP_CRYPTO_SUB_BATCH_SIZE);
+	unsigned int sb;
+
+	/*
+	 * Process the zd_batch in sub-batches of
+	 * SWAP_CRYPTO_SUB_BATCH_SIZE.
+	 */
+	for (sb = 0; sb < nr_sb; ++sb) {
+		bool workingset[SWAP_CRYPTO_SUB_BATCH_SIZE];
+		bool in_thrashing[SWAP_CRYPTO_SUB_BATCH_SIZE];
+
+		zswap_load_sub_batch_init(zd_batch, sb, &zls);
+
+		zswap_load_start_accounting(zd_batch, sb, &zls,
+					    workingset, in_thrashing);
+
+		/* Decompress the batch. */
+		if (zls.nr_decomp)
+			zswap_load_decomp_batch(zd_batch, sb, &zls);
+
+		/*
+		 * Should we free zswap_entries, as in zswap_load():
+		 * With the new swapin_readahead batching interface,
+		 * all prefetch entries are read into the swapcache.
+		 * Freeing the zswap entries here causes segfaults,
+		 * most probably because a page-fault occurred while
+		 * the buffer was being decompressed.
+		 * Allowing the regular folio_free_swap() sequence
+		 * in do_swap_page() appears to keep things stable
+		 * without duplicated zswap-swapcache memory, as far
+		 * as I can tell from my testing.
+		 */
+
+		zswap_load_end_accounting(zd_batch, &zls,
+					  workingset, in_thrashing);
+	}
+}
+
 void zswap_invalidate(swp_entry_t swp)
 {
 	pgoff_t offset = swp_offset(swp);
-- 
2.27.0




* [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface.
  2024-10-18  6:47 [RFC PATCH v1 0/7] zswap IAA decompress batching Kanchana P Sridhar
                   ` (4 preceding siblings ...)
  2024-10-18  6:48 ` [RFC PATCH v1 5/7] mm: swap, zswap: zswap folio_batch processing with IAA decompression batching Kanchana P Sridhar
@ 2024-10-18  6:48 ` Kanchana P Sridhar
  2024-10-18  7:26   ` David Hildenbrand
  2024-10-18  6:48 ` [RFC PATCH v1 7/7] mm: For IAA batching, reduce SWAP_BATCH and modify swap slot cache thresholds Kanchana P Sridhar
  6 siblings, 1 reply; 15+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18  6:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, hughd, willy, bfoster, dchinner, chrisl, david
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This patch invokes the swapin_readahead() based batching interface to
prefetch a batch of 4K folios for zswap load with batch decompressions
in parallel using IAA hardware. swapin_readahead() prefetches folios based
on vm.page-cluster and the usefulness of prior prefetches to the
workload. As folios are created in the swapcache and the readahead code
calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch", the
respective folio_batches get populated with the folios to be read.

Finally, the swapin_readahead() procedures will call the newly added
process_ra_batch_of_same_type() which:

 1) Reads all the non_zswap_batch folios sequentially by calling
    swap_read_folio().
 2) Calls swap_read_zswap_batch_unplug() with the zswap_batch which calls
    zswap_finish_load_batch() that finally decompresses each
     SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. up to 8 pages in a prefetch
    batch of say, 32 folios) in parallel with IAA.

Within do_swap_page(), we try to benefit from batch decompressions in both
these scenarios:

 1) single-mapped, SWP_SYNCHRONOUS_IO:
      We call swapin_readahead() with "single_mapped_path = true". This is
      done only in the !zswap_never_enabled() case.
 2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
      We call swapin_readahead() with "single_mapped_path = false".

This will place folios in the swapcache: a design choice that handles cases
where a folio that is "single-mapped" in process 1 could simultaneously be
prefetched in process 2, and that keeps highly contended server scenarios
stable.
There are checks added at the end of do_swap_page(), after the folio has
been successfully loaded, to detect if the single-mapped swapcache folio is
still single-mapped, and if so, folio_free_swap() is called on the folio.

Within the swapin_readahead() functions, if single_mapped_path is true and
either the platform does not have IAA, or the platform has IAA but the
user selects a software compressor for zswap (details of the sysfs knob
follow), readahead/batching is skipped and the folio is loaded using
zswap_load().

A new swap parameter, "singlemapped_ra_enabled" (false by default), is added
for platforms that have IAA and for which zswap_load_batching_enabled() is
true, to give the user the option to run experiments with IAA and with
software compressors for zswap (the swap device is SWP_SYNCHRONOUS_IO):

For IAA:
 echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled

For software compressors:
 echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled

If "singlemapped_ra_enabled" is set to false, swapin_readahead() will skip
prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO" do_swap_page()
path.
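
For reference, a minimal IAA experiment setup could then look like the
following (a sketch; the "deflate-iaa" compressor name comes from the IAA
crypto driver and is assumed to be available on the system):

 # Use the IAA compressor for zswap
 echo deflate-iaa > /sys/module/zswap/parameters/compressor
 # Readahead window of up to 2^5 = 32 pages
 echo 5 > /proc/sys/vm/page-cluster
 # Enable readahead in the single-mapped SWP_SYNCHRONOUS_IO path
 echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled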

Thanks Ying Huang for the really helpful brainstorming discussions on the
swap_read_folio() plug design.

Suggested-by: Ying Huang <ying.huang@intel.com>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/memory.c     | 187 +++++++++++++++++++++++++++++++++++++-----------
 mm/shmem.c      |   2 +-
 mm/swap.h       |  12 ++--
 mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++----
 mm/swapfile.c   |   2 +-
 5 files changed, 299 insertions(+), 61 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b5745b9ffdf7..9655b85fc243 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3924,6 +3924,42 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	return 0;
 }
 
+/*
+ * swapin readahead based batching interface for zswap batched loads using IAA:
+ *
+ * Should only be called for and if the faulting swap entry in do_swap_page
+ * is single-mapped and SWP_SYNCHRONOUS_IO.
+ *
+ * Detect if the folio is in the swapcache, is still mapped to only this
+ * process, and further, there are no additional references to this folio
+ * (e.g., if another process simultaneously read ahead this swap entry
+ * while this process was handling the page-fault, and got a pointer to the
+ * folio allocated by this process in the swapcache), besides the references
+ * that were obtained within __read_swap_cache_async() by this process that is
+ * faulting in this single-mapped swap entry.
+ */
+static inline bool should_free_singlemap_swapcache(swp_entry_t entry,
+						   struct folio *folio)
+{
+	if (!folio_test_swapcache(folio))
+		return false;
+
+	if (__swap_count(entry) != 0)
+		return false;
+
+	/*
+	 * The folio ref count for a single-mapped folio that was allocated
+	 * in __read_swap_cache_async(), can be a maximum of 3. These are the
+	 * in __read_swap_cache_async() can be a maximum of 3. These are the
+	 * folio_alloc_mpol(), add_to_swap_cache(), folio_add_lru().
+	 */
+
+	if (folio_ref_count(folio) <= 3)
+		return true;
+
+	return false;
+}
+
 static inline bool should_try_to_free_swap(struct folio *folio,
 					   struct vm_area_struct *vma,
 					   unsigned int fault_flags)
@@ -4215,6 +4251,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	swp_entry_t entry;
 	pte_t pte;
 	vm_fault_t ret = 0;
+	bool single_mapped_swapcache = false;
 	void *shadow = NULL;
 	int nr_pages;
 	unsigned long page_idx;
@@ -4283,51 +4320,90 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (!folio) {
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
 		    __swap_count(entry) == 1) {
-			/* skip swapcache */
-			folio = alloc_swap_folio(vmf);
-			if (folio) {
-				__folio_set_locked(folio);
-				__folio_set_swapbacked(folio);
-
-				nr_pages = folio_nr_pages(folio);
-				if (folio_test_large(folio))
-					entry.val = ALIGN_DOWN(entry.val, nr_pages);
-				/*
-				 * Prevent parallel swapin from proceeding with
-				 * the cache flag. Otherwise, another thread
-				 * may finish swapin first, free the entry, and
-				 * swapout reusing the same entry. It's
-				 * undetectable as pte_same() returns true due
-				 * to entry reuse.
-				 */
-				if (swapcache_prepare(entry, nr_pages)) {
+			if (zswap_never_enabled()) {
+				/* skip swapcache */
+				folio = alloc_swap_folio(vmf);
+				if (folio) {
+					__folio_set_locked(folio);
+					__folio_set_swapbacked(folio);
+
+					nr_pages = folio_nr_pages(folio);
+					if (folio_test_large(folio))
+						entry.val = ALIGN_DOWN(entry.val, nr_pages);
 					/*
-					 * Relax a bit to prevent rapid
-					 * repeated page faults.
+					 * Prevent parallel swapin from proceeding with
+					 * the cache flag. Otherwise, another thread
+					 * may finish swapin first, free the entry, and
+					 * swapout reusing the same entry. It's
+					 * undetectable as pte_same() returns true due
+					 * to entry reuse.
 					 */
-					add_wait_queue(&swapcache_wq, &wait);
-					schedule_timeout_uninterruptible(1);
-					remove_wait_queue(&swapcache_wq, &wait);
-					goto out_page;
+					if (swapcache_prepare(entry, nr_pages)) {
+						/*
+						 * Relax a bit to prevent rapid
+						 * repeated page faults.
+						 */
+						add_wait_queue(&swapcache_wq, &wait);
+						schedule_timeout_uninterruptible(1);
+						remove_wait_queue(&swapcache_wq, &wait);
+						goto out_page;
+					}
+					need_clear_cache = true;
+
+					mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
+
+					shadow = get_shadow_from_swap_cache(entry);
+					if (shadow)
+						workingset_refault(folio, shadow);
+
+					folio_add_lru(folio);
+
+					/* To provide entry to swap_read_folio() */
+					folio->swap = entry;
+					swap_read_folio(folio, NULL, NULL, NULL);
+					folio->private = NULL;
+				}
+			} else {
+				/*
+				 * zswap is enabled or was enabled at some point.
+				 * Don't skip swapcache.
+				 *
+				 * swapin readahead based batching interface
+				 * for zswap batched loads using IAA:
+				 *
+				 * Readahead is invoked in this path only if
+				 * the sys swap "singlemapped_ra_enabled" swap
+				 * parameter is set to true. By default,
+				 * "singlemapped_ra_enabled" is set to false,
+				 * the recommended setting for software compressors.
+				 * For IAA, if "singlemapped_ra_enabled" is set
+				 * to true, readahead will be deployed in this path
+				 * as well.
+				 *
+				 * For single-mapped pages, the batching interface
+				 * calls __read_swap_cache_async() to allocate and
+				 * place the faulting page in the swapcache. This is
+				 * to handle a scenario where the faulting page in
+				 * this process happens to simultaneously be a
+				 * readahead page in another process. By placing the
+				 * single-mapped faulting page in the swapcache,
+				 * we avoid race conditions and duplicate page
+				 * allocations under these scenarios.
+				 */
+				folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
+							 vmf, true);
+				if (!folio) {
+					ret = VM_FAULT_OOM;
+					goto out;
 				}
-				need_clear_cache = true;
-
-				mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
-
-				shadow = get_shadow_from_swap_cache(entry);
-				if (shadow)
-					workingset_refault(folio, shadow);
-
-				folio_add_lru(folio);
 
-				/* To provide entry to swap_read_folio() */
-				folio->swap = entry;
-				swap_read_folio(folio, NULL, NULL, NULL);
-				folio->private = NULL;
-			}
+				single_mapped_swapcache = true;
+				nr_pages = folio_nr_pages(folio);
+				swapcache = folio;
+			} /* swapin with zswap support. */
 		} else {
 			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-						vmf);
+						 vmf, false);
 			swapcache = folio;
 		}
 
@@ -4528,8 +4604,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 * yet.
 	 */
 	swap_free_nr(entry, nr_pages);
-	if (should_try_to_free_swap(folio, vma, vmf->flags))
+	if (should_try_to_free_swap(folio, vma, vmf->flags)) {
 		folio_free_swap(folio);
+		single_mapped_swapcache = false;
+	}
 
 	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
 	add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
@@ -4619,6 +4697,30 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		if (waitqueue_active(&swapcache_wq))
 			wake_up(&swapcache_wq);
 	}
+
+	/*
+	 * swapin readahead based batching interface
+	 * for zswap batched loads using IAA:
+	 *
+	 * Don't skip swapcache strategy for single-mapped
+	 * pages: As described above, we place the
+	 * single-mapped faulting page in the swapcache,
+	 * to avoid race conditions and duplicate page
+	 * allocations between process 1 handling a
+	 * page-fault for a single-mapped page, while
+	 * simultaneously, the same swap entry is a
+	 * readahead prefetch page in another process 2.
+	 *
+	 * One side-effect of this, is that if the race did
+	 * not occur, we need to clean up the swapcache
+	 * entry and free the zswap entry for the faulting
+	 * page, iff it is still single-mapped and is
+	 * exclusive to this process.
+	 */
+	if (single_mapped_swapcache &&
+		data_race(should_free_singlemap_swapcache(entry, folio)))
+		folio_free_swap(folio);
+
 	if (si)
 		put_swap_device(si);
 	return ret;
@@ -4638,6 +4740,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		if (waitqueue_active(&swapcache_wq))
 			wake_up(&swapcache_wq);
 	}
+
+	if (single_mapped_swapcache &&
+		data_race(should_free_singlemap_swapcache(entry, folio)))
+		folio_free_swap(folio);
+
 	if (si)
 		put_swap_device(si);
 	return ret;
diff --git a/mm/shmem.c b/mm/shmem.c
index 66eae800ffab..e4549c04f316 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1624,7 +1624,7 @@ static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
 	struct folio *folio;
 
 	mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
-	folio = swap_cluster_readahead(swap, gfp, mpol, ilx);
+	folio = swap_cluster_readahead(swap, gfp, mpol, ilx, false);
 	mpol_cond_put(mpol);
 
 	return folio;
diff --git a/mm/swap.h b/mm/swap.h
index 2b82c8ed765c..2861bd8f5a96 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -199,9 +199,11 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags,
 		struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
 		bool skip_if_exists);
 struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
-		struct mempolicy *mpol, pgoff_t ilx);
+		struct mempolicy *mpol, pgoff_t ilx,
+		bool single_mapped_path);
 struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
-		struct vm_fault *vmf);
+		struct vm_fault *vmf,
+		bool single_mapped_path);
 
 static inline unsigned int folio_swap_flags(struct folio *folio)
 {
@@ -304,13 +306,15 @@ static inline void show_swap_cache_info(void)
 }
 
 static inline struct folio *swap_cluster_readahead(swp_entry_t entry,
-			gfp_t gfp_mask, struct mempolicy *mpol, pgoff_t ilx)
+			gfp_t gfp_mask, struct mempolicy *mpol, pgoff_t ilx,
+			bool single_mapped_path)
 {
 	return NULL;
 }
 
 static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
-			struct vm_fault *vmf)
+			struct vm_fault *vmf,
+			bool single_mapped_path)
 {
 	return NULL;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 0aa938e4c34d..66ea8f7f724c 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -44,6 +44,12 @@ struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly;
 static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly;
 static bool enable_vma_readahead __read_mostly = true;
 
+/*
+ * Enable readahead in single-mapped do_swap_page() path.
+ * Set to "true" for IAA.
+ */
+static bool enable_singlemapped_readahead __read_mostly = false;
+
 #define SWAP_RA_WIN_SHIFT	(PAGE_SHIFT / 2)
 #define SWAP_RA_HITS_MASK	((1UL << SWAP_RA_WIN_SHIFT) - 1)
 #define SWAP_RA_HITS_MAX	SWAP_RA_HITS_MASK
@@ -340,6 +346,11 @@ static inline bool swap_use_vma_readahead(void)
 	return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
 }
 
+static inline bool swap_use_singlemapped_readahead(void)
+{
+	return READ_ONCE(enable_singlemapped_readahead);
+}
+
 /*
  * Lookup a swap entry in the swap cache. A found folio will be returned
  * unlocked and with its refcount incremented - we rely on the kernel
@@ -635,12 +646,49 @@ static unsigned long swapin_nr_pages(unsigned long offset)
 	return pages;
 }
 
+static void process_ra_batch_of_same_type(
+	struct zswap_decomp_batch *zswap_batch,
+	struct folio_batch *non_zswap_batch,
+	swp_entry_t targ_entry,
+	struct swap_iocb **splug)
+{
+	unsigned int i;
+
+	for (i = 0; i < folio_batch_count(non_zswap_batch); ++i) {
+		struct folio *folio = non_zswap_batch->folios[i];
+		swap_read_folio(folio, splug, NULL, NULL);
+		if (folio->swap.val != targ_entry.val) {
+			folio_set_readahead(folio);
+			count_vm_event(SWAP_RA);
+		}
+		folio_put(folio);
+	}
+
+	swap_read_zswap_batch_unplug(zswap_batch, splug);
+
+	for (i = 0; i < folio_batch_count(&zswap_batch->fbatch); ++i) {
+		struct folio *folio = zswap_batch->fbatch.folios[i];
+		if (folio->swap.val != targ_entry.val) {
+			folio_set_readahead(folio);
+			count_vm_event(SWAP_RA);
+		}
+		folio_put(folio);
+	}
+
+	folio_batch_reinit(non_zswap_batch);
+
+	zswap_load_batch_reinit(zswap_batch);
+}
+
 /**
  * swap_cluster_readahead - swap in pages in hope we need them soon
  * @entry: swap entry of this memory
  * @gfp_mask: memory allocation flags
  * @mpol: NUMA memory allocation policy to be applied
  * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
+ * @single_mapped_path: Called from do_swap_page() single-mapped path.
+ * Only readahead if the sys "singlemapped_ra_enabled" swap parameter
+ * is set to true.
  *
  * Returns the struct folio for entry and addr, after queueing swapin.
  *
@@ -654,7 +702,8 @@ static unsigned long swapin_nr_pages(unsigned long offset)
  * are fairly likely to have been swapped out from the same node.
  */
 struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
-				    struct mempolicy *mpol, pgoff_t ilx)
+				     struct mempolicy *mpol, pgoff_t ilx,
+				     bool single_mapped_path)
 {
 	struct folio *folio;
 	unsigned long entry_offset = swp_offset(entry);
@@ -664,12 +713,22 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	struct swap_info_struct *si = swp_swap_info(entry);
 	struct blk_plug plug;
 	struct swap_iocb *splug = NULL;
+	struct zswap_decomp_batch zswap_batch;
+	struct folio_batch non_zswap_batch;
 	bool page_allocated;
 
+	if (single_mapped_path &&
+		(!swap_use_singlemapped_readahead() ||
+		 !zswap_load_batching_enabled()))
+		goto skip;
+
 	mask = swapin_nr_pages(offset) - 1;
 	if (!mask)
 		goto skip;
 
+	zswap_load_batch_init(&zswap_batch);
+	folio_batch_init(&non_zswap_batch);
+
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
 	end_offset = offset | mask;
@@ -678,6 +737,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	if (end_offset >= si->max)
 		end_offset = si->max - 1;
 
+	/* Note that all swap entries readahead are of the same swap type. */
 	blk_start_plug(&plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
@@ -687,14 +747,22 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 		if (!folio)
 			continue;
 		if (page_allocated) {
-			swap_read_folio(folio, &splug, NULL, NULL);
-			if (offset != entry_offset) {
-				folio_set_readahead(folio);
-				count_vm_event(SWAP_RA);
+			if (swap_read_folio(folio, &splug,
+					    &zswap_batch, &non_zswap_batch)) {
+				if (offset != entry_offset) {
+					folio_set_readahead(folio);
+					count_vm_event(SWAP_RA);
+				}
+				folio_put(folio);
 			}
+		} else {
+			folio_put(folio);
 		}
-		folio_put(folio);
 	}
+
+	process_ra_batch_of_same_type(&zswap_batch, &non_zswap_batch,
+				      entry, &splug);
+
 	blk_finish_plug(&plug);
 	swap_read_unplug(splug);
 	lru_add_drain();	/* Push any new pages onto the LRU now */
@@ -1009,6 +1077,9 @@ static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start,
  * @mpol: NUMA memory allocation policy to be applied
  * @targ_ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
  * @vmf: fault information
+ * @single_mapped_path: Called from do_swap_page() single-mapped path.
+ * Only readahead if the sys "singlemapped_ra_enabled" swap parameter
+ * is set to true.
  *
  * Returns the struct folio for entry and addr, after queueing swapin.
  *
@@ -1019,10 +1090,14 @@ static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start,
  *
  */
 static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
-		struct mempolicy *mpol, pgoff_t targ_ilx, struct vm_fault *vmf)
+		struct mempolicy *mpol, pgoff_t targ_ilx, struct vm_fault *vmf,
+		bool single_mapped_path)
 {
 	struct blk_plug plug;
 	struct swap_iocb *splug = NULL;
+	struct zswap_decomp_batch zswap_batch;
+	struct folio_batch non_zswap_batch;
+	int type = -1, prev_type = -1;
 	struct folio *folio;
 	pte_t *pte = NULL, pentry;
 	int win;
@@ -1031,10 +1106,18 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 	pgoff_t ilx;
 	bool page_allocated;
 
+	if (single_mapped_path &&
+		(!swap_use_singlemapped_readahead() ||
+		 !zswap_load_batching_enabled()))
+		goto skip;
+
 	win = swap_vma_ra_win(vmf, &start, &end);
 	if (win == 1)
 		goto skip;
 
+	zswap_load_batch_init(&zswap_batch);
+	folio_batch_init(&non_zswap_batch);
+
 	ilx = targ_ilx - PFN_DOWN(vmf->address - start);
 
 	blk_start_plug(&plug);
@@ -1057,16 +1140,38 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 		if (!folio)
 			continue;
 		if (page_allocated) {
-			swap_read_folio(folio, &splug, NULL, NULL);
-			if (addr != vmf->address) {
-				folio_set_readahead(folio);
-				count_vm_event(SWAP_RA);
+			type = swp_type(entry);
+
+			/*
+			 * Process this sub-batch before switching to
+			 * another swap device type.
+			 */
+			if ((prev_type >= 0) && (type != prev_type))
+				process_ra_batch_of_same_type(&zswap_batch,
+							      &non_zswap_batch,
+							      targ_entry,
+							      &splug);
+
+			if (swap_read_folio(folio, &splug,
+					    &zswap_batch, &non_zswap_batch)) {
+				if (addr != vmf->address) {
+					folio_set_readahead(folio);
+					count_vm_event(SWAP_RA);
+				}
+				folio_put(folio);
 			}
+
+			prev_type = type;
+		} else {
+			folio_put(folio);
 		}
-		folio_put(folio);
 	}
 	if (pte)
 		pte_unmap(pte);
+
+	process_ra_batch_of_same_type(&zswap_batch, &non_zswap_batch,
+				      targ_entry, &splug);
+
 	blk_finish_plug(&plug);
 	swap_read_unplug(splug);
 	lru_add_drain();
@@ -1092,7 +1197,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
  * or vma-based(ie, virtual address based on faulty address) readahead.
  */
 struct folio *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
-				struct vm_fault *vmf)
+				struct vm_fault *vmf, bool single_mapped_path)
 {
 	struct mempolicy *mpol;
 	pgoff_t ilx;
@@ -1100,8 +1205,10 @@ struct folio *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 
 	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
 	folio = swap_use_vma_readahead() ?
-		swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
-		swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
+		swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf,
+				   single_mapped_path) :
+		swap_cluster_readahead(entry, gfp_mask, mpol, ilx,
+				       single_mapped_path);
 	mpol_cond_put(mpol);
 
 	return folio;
@@ -1126,10 +1233,30 @@ static ssize_t vma_ra_enabled_store(struct kobject *kobj,
 
 	return count;
 }
+static ssize_t singlemapped_ra_enabled_show(struct kobject *kobj,
+				     struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%s\n",
+			  enable_singlemapped_readahead ? "true" : "false");
+}
+static ssize_t singlemapped_ra_enabled_store(struct kobject *kobj,
+				      struct kobj_attribute *attr,
+				      const char *buf, size_t count)
+{
+	ssize_t ret;
+
+	ret = kstrtobool(buf, &enable_singlemapped_readahead);
+	if (ret)
+		return ret;
+
+	return count;
+}
 static struct kobj_attribute vma_ra_enabled_attr = __ATTR_RW(vma_ra_enabled);
+static struct kobj_attribute singlemapped_ra_enabled_attr = __ATTR_RW(singlemapped_ra_enabled);
 
 static struct attribute *swap_attrs[] = {
 	&vma_ra_enabled_attr.attr,
+	&singlemapped_ra_enabled_attr.attr,
 	NULL,
 };
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b0915f3fab31..10367eaee1ff 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2197,7 +2197,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			};
 
 			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-						&vmf);
+						&vmf, false);
 		}
 		if (!folio) {
 			swp_count = READ_ONCE(si->swap_map[offset]);
-- 
2.27.0



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC PATCH v1 7/7] mm: For IAA batching, reduce SWAP_BATCH and modify swap slot cache thresholds.
  2024-10-18  6:47 [RFC PATCH v1 0/7] zswap IAA decompress batching Kanchana P Sridhar
                   ` (5 preceding siblings ...)
  2024-10-18  6:48 ` [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface Kanchana P Sridhar
@ 2024-10-18  6:48 ` Kanchana P Sridhar
  6 siblings, 0 replies; 15+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18  6:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
	akpm, hughd, willy, bfoster, dchinner, chrisl, david
  Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

When IAA is used for compress batching and decompress batching of folios,
we significantly reduce the swapout/swapin path latencies, and hence the
swap page-fault latencies. This means swap entries will need to be freed
more often, and swap slots will have to be released more often.

The existing SWAP_BATCH and SWAP_SLOTS_CACHE_SIZE value of 64 can cause
contention on the swap_info_struct lock in swapcache_free_entries(), and
CPU hard lockups can result in highly contended server scenarios.

To prevent this, SWAP_BATCH and SWAP_SLOTS_CACHE_SIZE have been reduced
to 16 if IAA is used for compress/decompress batching. The
swap_slots_cache activate/deactivate thresholds have been modified
accordingly, so that we don't compromise performance for stability.
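
For concreteness, the resulting thresholds work out as follows (a worked
example based on the values in this patch; the activation checks that
consume these constants live in mm/swap_slots.c):

  Default (no batching):       SWAP_BATCH = SWAP_SLOTS_CACHE_SIZE = 64
    THRESHOLD_ACTIVATE_SWAP_SLOTS_CACHE   =  5 * 64 = 320
    THRESHOLD_DEACTIVATE_SWAP_SLOTS_CACHE =  2 * 64 = 128

  With IAA batching configs:   SWAP_BATCH = SWAP_SLOTS_CACHE_SIZE = 16
    THRESHOLD_ACTIVATE_SWAP_SLOTS_CACHE   = 40 * 16 = 640
    THRESHOLD_DEACTIVATE_SWAP_SLOTS_CACHE = 16 * 16 = 256

In other words, each slot cache refill/free batch taken under the
swap_info_struct lock shrinks to a quarter of the default, while the
activate/deactivate watermarks roughly double.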

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 include/linux/swap.h       | 7 +++++++
 include/linux/swap_slots.h | 7 +++++++
 2 files changed, 14 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index ca533b478c21..3987faa0a2ff 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -13,6 +13,7 @@
 #include <linux/pagemap.h>
 #include <linux/atomic.h>
 #include <linux/page-flags.h>
+#include <linux/pagevec.h>
 #include <uapi/linux/mempolicy.h>
 #include <asm/page.h>
 
@@ -32,7 +33,13 @@ struct pagevec;
 #define SWAP_FLAGS_VALID	(SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \
 				 SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE | \
 				 SWAP_FLAG_DISCARD_PAGES)
+
+#if defined(CONFIG_ZSWAP_STORE_BATCHING_ENABLED) || \
+	defined(CONFIG_ZSWAP_LOAD_BATCHING_ENABLED)
+#define SWAP_BATCH 16
+#else
 #define SWAP_BATCH 64
+#endif
 
 static inline int current_is_kswapd(void)
 {
diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
index 15adfb8c813a..1b6e4e2798bd 100644
--- a/include/linux/swap_slots.h
+++ b/include/linux/swap_slots.h
@@ -7,8 +7,15 @@
 #include <linux/mutex.h>
 
 #define SWAP_SLOTS_CACHE_SIZE			SWAP_BATCH
+
+#if defined(CONFIG_ZSWAP_STORE_BATCHING_ENABLED) || \
+	defined(CONFIG_ZSWAP_LOAD_BATCHING_ENABLED)
+#define THRESHOLD_ACTIVATE_SWAP_SLOTS_CACHE	(40*SWAP_SLOTS_CACHE_SIZE)
+#define THRESHOLD_DEACTIVATE_SWAP_SLOTS_CACHE	(16*SWAP_SLOTS_CACHE_SIZE)
+#else
 #define THRESHOLD_ACTIVATE_SWAP_SLOTS_CACHE	(5*SWAP_SLOTS_CACHE_SIZE)
 #define THRESHOLD_DEACTIVATE_SWAP_SLOTS_CACHE	(2*SWAP_SLOTS_CACHE_SIZE)
+#endif
 
 struct swap_slots_cache {
 	bool		lock_initialized;
-- 
2.27.0



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface.
  2024-10-18  6:48 ` [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface Kanchana P Sridhar
@ 2024-10-18  7:26   ` David Hildenbrand
  2024-10-18 11:04     ` Usama Arif
  2024-10-18 18:09     ` Sridhar, Kanchana P
  0 siblings, 2 replies; 15+ messages in thread
From: David Hildenbrand @ 2024-10-18  7:26 UTC (permalink / raw)
  To: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, yosryahmed,
	nphamcs, chengming.zhou, usamaarif642, ryan.roberts, ying.huang,
	21cnbao, akpm, hughd, willy, bfoster, dchinner, chrisl
  Cc: wajdi.k.feghali, vinodh.gopal

On 18.10.24 08:48, Kanchana P Sridhar wrote:
> This patch invokes the swapin_readahead() based batching interface to
> prefetch a batch of 4K folios for zswap load with batch decompressions
> in parallel using IAA hardware. swapin_readahead() prefetches folios based
> on vm.page-cluster and the usefulness of prior prefetches to the
> workload. As folios are created in the swapcache and the readahead code
> calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch", the
> respective folio_batches get populated with the folios to be read.
> 
> Finally, the swapin_readahead() procedures will call the newly added
> process_ra_batch_of_same_type() which:
> 
>   1) Reads all the non_zswap_batch folios sequentially by calling
>      swap_read_folio().
>   2) Calls swap_read_zswap_batch_unplug() with the zswap_batch which calls
>      zswap_finish_load_batch() that finally decompresses each
>      SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. upto 8 pages in a prefetch
>      batch of say, 32 folios) in parallel with IAA.
> 
> Within do_swap_page(), we try to benefit from batch decompressions in both
> these scenarios:
> 
>   1) single-mapped, SWP_SYNCHRONOUS_IO:
>        We call swapin_readahead() with "single_mapped_path = true". This is
>        done only in the !zswap_never_enabled() case.
>   2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
>        We call swapin_readahead() with "single_mapped_path = false".
> 
> This will place folios in the swapcache: a design choice that handles cases
> where a folio that is "single-mapped" in process 1 could be prefetched in
> process 2; and handles highly contended server scenarios with stability.
> There are checks added at the end of do_swap_page(), after the folio has
> been successfully loaded, to detect if the single-mapped swapcache folio is
> still single-mapped, and if so, folio_free_swap() is called on the folio.
> 
> Within the swapin_readahead() functions, if single_mapped_path is true, and
> either the platform does not have IAA, or, if the platform has IAA and the
> user selects a software compressor for zswap (details of sysfs knob
> follow), readahead/batching are skipped and the folio is loaded using
> zswap_load().
> 
> A new swap parameter "singlemapped_ra_enabled" (false by default) is added
> for platforms that have IAA, zswap_load_batching_enabled() is true, and we
> want to give the user the option to run experiments with IAA and with
> software compressors for zswap (swap device is SWP_SYNCHRONOUS_IO):
> 
> For IAA:
>   echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
> 
> For software compressors:
>   echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
> 
> If "singlemapped_ra_enabled" is set to false, swapin_readahead() will skip
> prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO" do_swap_page()
> path.
> 
> Thanks Ying Huang for the really helpful brainstorming discussions on the
> swap_read_folio() plug design.
> 
> Suggested-by: Ying Huang <ying.huang@intel.com>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>   mm/memory.c     | 187 +++++++++++++++++++++++++++++++++++++-----------
>   mm/shmem.c      |   2 +-
>   mm/swap.h       |  12 ++--
>   mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++----
>   mm/swapfile.c   |   2 +-
>   5 files changed, 299 insertions(+), 61 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index b5745b9ffdf7..9655b85fc243 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3924,6 +3924,42 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
>   	return 0;
>   }
>   
> +/*
> + * swapin readahead based batching interface for zswap batched loads using IAA:
> + *
> + * Should only be called for and if the faulting swap entry in do_swap_page
> + * is single-mapped and SWP_SYNCHRONOUS_IO.
> + *
> + * Detect if the folio is in the swapcache, is still mapped to only this
> + * process, and further, there are no additional references to this folio
> + * (for e.g. if another process simultaneously readahead this swap entry
> + * while this process was handling the page-fault, and got a pointer to the
> + * folio allocated by this process in the swapcache), besides the references
> + * that were obtained within __read_swap_cache_async() by this process that is
> + * faulting in this single-mapped swap entry.
> + */

How is this supposed to work for large folios?

> +static inline bool should_free_singlemap_swapcache(swp_entry_t entry,
> +						   struct folio *folio)
> +{
> +	if (!folio_test_swapcache(folio))
> +		return false;
> +
> +	if (__swap_count(entry) != 0)
> +		return false;
> +
> +	/*
> +	 * The folio ref count for a single-mapped folio that was allocated
> +	 * in __read_swap_cache_async(), can be a maximum of 3. These are the
> +	 * incrementors of the folio ref count in __read_swap_cache_async():
> +	 * folio_alloc_mpol(), add_to_swap_cache(), folio_add_lru().
> +	 */
> +
> +	if (folio_ref_count(folio) <= 3)
> +		return true;
> +
> +	return false;
> +}
> +
>   static inline bool should_try_to_free_swap(struct folio *folio,
>   					   struct vm_area_struct *vma,
>   					   unsigned int fault_flags)
> @@ -4215,6 +4251,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   	swp_entry_t entry;
>   	pte_t pte;
>   	vm_fault_t ret = 0;
> +	bool single_mapped_swapcache = false;
>   	void *shadow = NULL;
>   	int nr_pages;
>   	unsigned long page_idx;
> @@ -4283,51 +4320,90 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   	if (!folio) {
>   		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>   		    __swap_count(entry) == 1) {
> -			/* skip swapcache */
> -			folio = alloc_swap_folio(vmf);
> -			if (folio) {
> -				__folio_set_locked(folio);
> -				__folio_set_swapbacked(folio);
> -
> -				nr_pages = folio_nr_pages(folio);
> -				if (folio_test_large(folio))
> -					entry.val = ALIGN_DOWN(entry.val, nr_pages);
> -				/*
> -				 * Prevent parallel swapin from proceeding with
> -				 * the cache flag. Otherwise, another thread
> -				 * may finish swapin first, free the entry, and
> -				 * swapout reusing the same entry. It's
> -				 * undetectable as pte_same() returns true due
> -				 * to entry reuse.
> -				 */
> -				if (swapcache_prepare(entry, nr_pages)) {
> +			if (zswap_never_enabled()) {
> +				/* skip swapcache */
> +				folio = alloc_swap_folio(vmf);
> +				if (folio) {
> +					__folio_set_locked(folio);
> +					__folio_set_swapbacked(folio);
> +
> +					nr_pages = folio_nr_pages(folio);
> +					if (folio_test_large(folio))
> +						entry.val = ALIGN_DOWN(entry.val, nr_pages);
>   					/*
> -					 * Relax a bit to prevent rapid
> -					 * repeated page faults.
> +					 * Prevent parallel swapin from proceeding with
> +					 * the cache flag. Otherwise, another thread
> +					 * may finish swapin first, free the entry, and
> +					 * swapout reusing the same entry. It's
> +					 * undetectable as pte_same() returns true due
> +					 * to entry reuse.
>   					 */
> -					add_wait_queue(&swapcache_wq, &wait);
> -					schedule_timeout_uninterruptible(1);
> -					remove_wait_queue(&swapcache_wq, &wait);
> -					goto out_page;
> +					if (swapcache_prepare(entry, nr_pages)) {
> +						/*
> +						 * Relax a bit to prevent rapid
> +						 * repeated page faults.
> +						 */
> +						add_wait_queue(&swapcache_wq, &wait);
> +						schedule_timeout_uninterruptible(1);
> +						remove_wait_queue(&swapcache_wq, &wait);
> +						goto out_page;
> +					}
> +					need_clear_cache = true;
> +
> +					mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
> +
> +					shadow = get_shadow_from_swap_cache(entry);
> +					if (shadow)
> +						workingset_refault(folio, shadow);
> +
> +					folio_add_lru(folio);
> +
> +					/* To provide entry to swap_read_folio() */
> +					folio->swap = entry;
> +					swap_read_folio(folio, NULL, NULL, NULL);
> +					folio->private = NULL;
> +				}
> +			} else {
> +				/*
> +				 * zswap is enabled or was enabled at some point.
> +				 * Don't skip swapcache.
> +				 *
> +				 * swapin readahead based batching interface
> +				 * for zswap batched loads using IAA:
> +				 *
> +				 * Readahead is invoked in this path only if
> +				 * the sys swap "singlemapped_ra_enabled" swap
> +				 * parameter is set to true. By default,
> +				 * "singlemapped_ra_enabled" is set to false,
> +				 * the recommended setting for software compressors.
> +				 * For IAA, if "singlemapped_ra_enabled" is set
> +				 * to true, readahead will be deployed in this path
> +				 * as well.
> +				 *
> +				 * For single-mapped pages, the batching interface
> +				 * calls __read_swap_cache_async() to allocate and
> +				 * place the faulting page in the swapcache. This is
> +				 * to handle a scenario where the faulting page in
> +				 * this process happens to simultaneously be a
> +				 * readahead page in another process. By placing the
> +				 * single-mapped faulting page in the swapcache,
> +				 * we avoid race conditions and duplicate page
> +				 * allocations under these scenarios.
> +				 */
> +				folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> +							 vmf, true);
> +				if (!folio) {
> +					ret = VM_FAULT_OOM;
> +					goto out;
>   				}
> -				need_clear_cache = true;
> -
> -				mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
> -
> -				shadow = get_shadow_from_swap_cache(entry);
> -				if (shadow)
> -					workingset_refault(folio, shadow);
> -
> -				folio_add_lru(folio);
>   
> -				/* To provide entry to swap_read_folio() */
> -				folio->swap = entry;
> -				swap_read_folio(folio, NULL, NULL, NULL);
> -				folio->private = NULL;
> -			}
> +				single_mapped_swapcache = true;
> +				nr_pages = folio_nr_pages(folio);
> +				swapcache = folio;
> +			} /* swapin with zswap support. */
>   		} else {
>   			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> -						vmf);
> +						 vmf, false);
>   			swapcache = folio;

I'm sorry, but making this function ever more complicated and ugly is 
not going to fly. The zswap special casing is quite ugly here as well.

Is there a way forward that we can make this code actually readable and 
avoid zswap special casing?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface.
  2024-10-18  7:26   ` David Hildenbrand
@ 2024-10-18 11:04     ` Usama Arif
  2024-10-18 17:21       ` Nhat Pham
  2024-10-18 18:09     ` Sridhar, Kanchana P
  1 sibling, 1 reply; 15+ messages in thread
From: Usama Arif @ 2024-10-18 11:04 UTC (permalink / raw)
  To: David Hildenbrand, Kanchana P Sridhar, linux-kernel, linux-mm,
	hannes, yosryahmed, nphamcs, chengming.zhou, ryan.roberts,
	ying.huang, 21cnbao, akpm, hughd, willy, bfoster, dchinner,
	chrisl
  Cc: wajdi.k.feghali, vinodh.gopal



On 18/10/2024 08:26, David Hildenbrand wrote:
> On 18.10.24 08:48, Kanchana P Sridhar wrote:
>> This patch invokes the swapin_readahead() based batching interface to
>> prefetch a batch of 4K folios for zswap load with batch decompressions
>> in parallel using IAA hardware. swapin_readahead() prefetches folios based
>> on vm.page-cluster and the usefulness of prior prefetches to the
>> workload. As folios are created in the swapcache and the readahead code
>> calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch", the
>> respective folio_batches get populated with the folios to be read.
>>
>> Finally, the swapin_readahead() procedures will call the newly added
>> process_ra_batch_of_same_type() which:
>>
>>   1) Reads all the non_zswap_batch folios sequentially by calling
>>      swap_read_folio().
>>   2) Calls swap_read_zswap_batch_unplug() with the zswap_batch which calls
>>      zswap_finish_load_batch() that finally decompresses each
>>      SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. upto 8 pages in a prefetch
>>      batch of say, 32 folios) in parallel with IAA.
>>
>> Within do_swap_page(), we try to benefit from batch decompressions in both
>> these scenarios:
>>
>>   1) single-mapped, SWP_SYNCHRONOUS_IO:
>>        We call swapin_readahead() with "single_mapped_path = true". This is
>>        done only in the !zswap_never_enabled() case.
>>   2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
>>        We call swapin_readahead() with "single_mapped_path = false".
>>
>> This will place folios in the swapcache: a design choice that handles cases
>> where a folio that is "single-mapped" in process 1 could be prefetched in
>> process 2; and handles highly contended server scenarios with stability.
>> There are checks added at the end of do_swap_page(), after the folio has
>> been successfully loaded, to detect if the single-mapped swapcache folio is
>> still single-mapped, and if so, folio_free_swap() is called on the folio.
>>
>> Within the swapin_readahead() functions, if single_mapped_path is true, and
>> either the platform does not have IAA, or, if the platform has IAA and the
>> user selects a software compressor for zswap (details of sysfs knob
>> follow), readahead/batching are skipped and the folio is loaded using
>> zswap_load().
>>
>> A new swap parameter "singlemapped_ra_enabled" (false by default) is added
>> for platforms that have IAA, zswap_load_batching_enabled() is true, and we
>> want to give the user the option to run experiments with IAA and with
>> software compressors for zswap (swap device is SWP_SYNCHRONOUS_IO):
>>
>> For IAA:
>>   echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
>>
>> For software compressors:
>>   echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
>>
>> If "singlemapped_ra_enabled" is set to false, swapin_readahead() will skip
>> prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO" do_swap_page()
>> path.
>>
>> Thanks Ying Huang for the really helpful brainstorming discussions on the
>> swap_read_folio() plug design.
>>
>> Suggested-by: Ying Huang <ying.huang@intel.com>
>> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
>> ---
>>   mm/memory.c     | 187 +++++++++++++++++++++++++++++++++++++-----------
>>   mm/shmem.c      |   2 +-
>>   mm/swap.h       |  12 ++--
>>   mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++----
>>   mm/swapfile.c   |   2 +-
>>   5 files changed, 299 insertions(+), 61 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index b5745b9ffdf7..9655b85fc243 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3924,6 +3924,42 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
>>       return 0;
>>   }
>>   +/*
>> + * swapin readahead based batching interface for zswap batched loads using IAA:
>> + *
>> + * Should only be called for and if the faulting swap entry in do_swap_page
>> + * is single-mapped and SWP_SYNCHRONOUS_IO.
>> + *
>> + * Detect if the folio is in the swapcache, is still mapped to only this
>> + * process, and further, there are no additional references to this folio
>> + * (for e.g. if another process simultaneously readahead this swap entry
>> + * while this process was handling the page-fault, and got a pointer to the
>> + * folio allocated by this process in the swapcache), besides the references
>> + * that were obtained within __read_swap_cache_async() by this process that is
>> + * faulting in this single-mapped swap entry.
>> + */
> 
> How is this supposed to work for large folios?
> 

Hi,

I was looking at zswapin large folio support and have posted a RFC in [1].
I got bogged down with some prod stuff, so wasn't able to send it earlier.

It looks quite different from, and I think simpler than, this series, so it
might be a good comparison.

[1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@gmail.com/

Thanks,
Usama


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface.
  2024-10-18 11:04     ` Usama Arif
@ 2024-10-18 17:21       ` Nhat Pham
  2024-10-18 21:59         ` Sridhar, Kanchana P
  0 siblings, 1 reply; 15+ messages in thread
From: Nhat Pham @ 2024-10-18 17:21 UTC (permalink / raw)
  To: Usama Arif
  Cc: David Hildenbrand, Kanchana P Sridhar, linux-kernel, linux-mm,
	hannes, yosryahmed, chengming.zhou, ryan.roberts, ying.huang,
	21cnbao, akpm, hughd, willy, bfoster, dchinner, chrisl,
	wajdi.k.feghali, vinodh.gopal

On Fri, Oct 18, 2024 at 4:04 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
> On 18/10/2024 08:26, David Hildenbrand wrote:
> > On 18.10.24 08:48, Kanchana P Sridhar wrote:
> >> This patch invokes the swapin_readahead() based batching interface to
> >> prefetch a batch of 4K folios for zswap load with batch decompressions
> >> in parallel using IAA hardware. swapin_readahead() prefetches folios based
> >> on vm.page-cluster and the usefulness of prior prefetches to the
> >> workload. As folios are created in the swapcache and the readahead code
> >> calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch", the
> >> respective folio_batches get populated with the folios to be read.
> >>
> >> Finally, the swapin_readahead() procedures will call the newly added
> >> process_ra_batch_of_same_type() which:
> >>
> >>   1) Reads all the non_zswap_batch folios sequentially by calling
> >>      swap_read_folio().
> >>   2) Calls swap_read_zswap_batch_unplug() with the zswap_batch which calls
> >>      zswap_finish_load_batch() that finally decompresses each
> >>      SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. upto 8 pages in a prefetch
> >>      batch of say, 32 folios) in parallel with IAA.
> >>
> >> Within do_swap_page(), we try to benefit from batch decompressions in both
> >> these scenarios:
> >>
> >>   1) single-mapped, SWP_SYNCHRONOUS_IO:
> >>        We call swapin_readahead() with "single_mapped_path = true". This is
> >>        done only in the !zswap_never_enabled() case.
> >>   2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
> >>        We call swapin_readahead() with "single_mapped_path = false".
> >>
> >> This will place folios in the swapcache: a design choice that handles cases
> >> where a folio that is "single-mapped" in process 1 could be prefetched in
> >> process 2; and handles highly contended server scenarios with stability.
> >> There are checks added at the end of do_swap_page(), after the folio has
> >> been successfully loaded, to detect if the single-mapped swapcache folio is
> >> still single-mapped, and if so, folio_free_swap() is called on the folio.
> >>
> >> Within the swapin_readahead() functions, if single_mapped_path is true, and
> >> either the platform does not have IAA, or, if the platform has IAA and the
> >> user selects a software compressor for zswap (details of sysfs knob
> >> follow), readahead/batching are skipped and the folio is loaded using
> >> zswap_load().
> >>
> >> A new swap parameter "singlemapped_ra_enabled" (false by default) is added
> >> for platforms that have IAA, zswap_load_batching_enabled() is true, and we
> >> want to give the user the option to run experiments with IAA and with
> >> software compressors for zswap (swap device is SWP_SYNCHRONOUS_IO):
> >>
> >> For IAA:
> >>   echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
> >>
> >> For software compressors:
> >>   echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
> >>
> >> If "singlemapped_ra_enabled" is set to false, swapin_readahead() will skip
> >> prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO" do_swap_page()
> >> path.
> >>
> >> Thanks Ying Huang for the really helpful brainstorming discussions on the
> >> swap_read_folio() plug design.
> >>
> >> Suggested-by: Ying Huang <ying.huang@intel.com>
> >> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> >> ---
> >>   mm/memory.c     | 187 +++++++++++++++++++++++++++++++++++++-----------
> >>   mm/shmem.c      |   2 +-
> >>   mm/swap.h       |  12 ++--
> >>   mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++----
> >>   mm/swapfile.c   |   2 +-
> >>   5 files changed, 299 insertions(+), 61 deletions(-)
> >>
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index b5745b9ffdf7..9655b85fc243 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -3924,6 +3924,42 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> >>       return 0;
> >>   }
> >>   +/*
> >> + * swapin readahead based batching interface for zswap batched loads using IAA:
> >> + *
> >> + * Should only be called for and if the faulting swap entry in do_swap_page
> >> + * is single-mapped and SWP_SYNCHRONOUS_IO.
> >> + *
> >> + * Detect if the folio is in the swapcache, is still mapped to only this
> >> + * process, and further, there are no additional references to this folio
> >> + * (for e.g. if another process simultaneously readahead this swap entry
> >> + * while this process was handling the page-fault, and got a pointer to the
> >> + * folio allocated by this process in the swapcache), besides the references
> >> + * that were obtained within __read_swap_cache_async() by this process that is
> >> + * faulting in this single-mapped swap entry.
> >> + */
> >
> > How is this supposed to work for large folios?
> >
>
> Hi,
>
> I was looking at zswapin large folio support and have posted a RFC in [1].
> I got bogged down with some prod stuff, so wasn't able to send it earlier.
>
> It looks quite different, and I think simpler from this series, so might be
> a good comparison.
>
> [1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@gmail.com/
>
> Thanks,
> Usama

I agree.

I think the lower hanging fruit here is to build upon Usama's patch.
Kanchana, do you think we can just use the new batch decompressing
infrastructure, and apply it to Usama's large folio zswap loading?

I'm not denying the readahead idea outright, but that seems much more
complicated. There are questions regarding the benefits of
readahead-ing when apply to zswap in the first place - IIUC, zram
circumvents that logic in several cases, and zswap shares many
characteristics with zram (fast, synchronous compression devices).

So let's reap the low hanging fruits first, get the wins as well as
stress test the new infrastructure. Then we can discuss the readahead
idea later?


^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface.
  2024-10-18  7:26   ` David Hildenbrand
  2024-10-18 11:04     ` Usama Arif
@ 2024-10-18 18:09     ` Sridhar, Kanchana P
  1 sibling, 0 replies; 15+ messages in thread
From: Sridhar, Kanchana P @ 2024-10-18 18:09 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel, linux-mm, hannes, yosryahmed,
	nphamcs, chengming.zhou, usamaarif642, ryan.roberts, Huang, Ying,
	21cnbao, akpm, hughd, willy, bfoster, dchinner, chrisl, Sridhar,
	Kanchana P
  Cc: Feghali, Wajdi K, Gopal, Vinodh

Hi David,

> -----Original Message-----
> From: David Hildenbrand <david@redhat.com>
> Sent: Friday, October 18, 2024 12:27 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> 21cnbao@gmail.com; akpm@linux-foundation.org; hughd@google.com;
> willy@infradead.org; bfoster@redhat.com; dchinner@redhat.com;
> chrisl@kernel.org
> Cc: Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls
> swapin_readahead() zswap load batching interface.
> 
> On 18.10.24 08:48, Kanchana P Sridhar wrote:
> > This patch invokes the swapin_readahead() based batching interface to
> > prefetch a batch of 4K folios for zswap load with batch decompressions
> > in parallel using IAA hardware. swapin_readahead() prefetches folios based
> > on vm.page-cluster and the usefulness of prior prefetches to the
> > workload. As folios are created in the swapcache and the readahead code
> > calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch",
> the
> > respective folio_batches get populated with the folios to be read.
> >
> > Finally, the swapin_readahead() procedures will call the newly added
> > process_ra_batch_of_same_type() which:
> >
> >   1) Reads all the non_zswap_batch folios sequentially by calling
> >      swap_read_folio().
> >   2) Calls swap_read_zswap_batch_unplug() with the zswap_batch which
> calls
> >      zswap_finish_load_batch() that finally decompresses each
> >      SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. upto 8 pages in a
> prefetch
> >      batch of say, 32 folios) in parallel with IAA.
> >
> > Within do_swap_page(), we try to benefit from batch decompressions in
> both
> > these scenarios:
> >
> >   1) single-mapped, SWP_SYNCHRONOUS_IO:
> >        We call swapin_readahead() with "single_mapped_path = true". This is
> >        done only in the !zswap_never_enabled() case.
> >   2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
> >        We call swapin_readahead() with "single_mapped_path = false".
> >
> > This will place folios in the swapcache: a design choice that handles cases
> > where a folio that is "single-mapped" in process 1 could be prefetched in
> > process 2; and handles highly contended server scenarios with stability.
> > There are checks added at the end of do_swap_page(), after the folio has
> > been successfully loaded, to detect if the single-mapped swapcache folio is
> > still single-mapped, and if so, folio_free_swap() is called on the folio.
> >
> > Within the swapin_readahead() functions, if single_mapped_path is true,
> and
> > either the platform does not have IAA, or, if the platform has IAA and the
> > user selects a software compressor for zswap (details of sysfs knob
> > follow), readahead/batching are skipped and the folio is loaded using
> > zswap_load().
> >
> > A new swap parameter "singlemapped_ra_enabled" (false by default) is
> added
> > for platforms that have IAA, zswap_load_batching_enabled() is true, and we
> > want to give the user the option to run experiments with IAA and with
> > software compressors for zswap (swap device is SWP_SYNCHRONOUS_IO):
> >
> > For IAA:
> >   echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
> >
> > For software compressors:
> >   echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
> >
> > If "singlemapped_ra_enabled" is set to false, swapin_readahead() will skip
> > prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO"
> do_swap_page()
> > path.
> >
> > Thanks Ying Huang for the really helpful brainstorming discussions on the
> > swap_read_folio() plug design.
> >
> > Suggested-by: Ying Huang <ying.huang@intel.com>
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >   mm/memory.c     | 187 +++++++++++++++++++++++++++++++++++++------
> -----
> >   mm/shmem.c      |   2 +-
> >   mm/swap.h       |  12 ++--
> >   mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++----
> >   mm/swapfile.c   |   2 +-
> >   5 files changed, 299 insertions(+), 61 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index b5745b9ffdf7..9655b85fc243 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3924,6 +3924,42 @@ static vm_fault_t
> remove_device_exclusive_entry(struct vm_fault *vmf)
> >   	return 0;
> >   }
> >
> > +/*
> > + * swapin readahead based batching interface for zswap batched loads
> using IAA:
> > + *
> > + * Should only be called for and if the faulting swap entry in do_swap_page
> > + * is single-mapped and SWP_SYNCHRONOUS_IO.
> > + *
> > + * Detect if the folio is in the swapcache, is still mapped to only this
> > + * process, and further, there are no additional references to this folio
> > + * (for e.g. if another process simultaneously readahead this swap entry
> > + * while this process was handling the page-fault, and got a pointer to the
> > + * folio allocated by this process in the swapcache), besides the references
> > + * that were obtained within __read_swap_cache_async() by this process
> that is
> > + * faulting in this single-mapped swap entry.
> > + */
> 
> How is this supposed to work for large folios?

Thanks for your code review comments. The main idea behind this
patch-series is to work with the existing kernel page-fault granularity of 4K
folios, which swapin_readahead() builds upon to prefetch other "useful"
4K folios. The intent is not to make modifications at page-fault time
to opportunistically synthesize large folios for swapin.

As we know, __read_swap_cache_async() allocates an order-0 folio, which
explains the implementation of should_free_singlemap_swapcache() in this
patch. IOW, this is not supposed to work for large folios based on the existing
page-fault behavior and without making any modifications to that.

> 
> > +static inline bool should_free_singlemap_swapcache(swp_entry_t entry,
> > +						   struct folio *folio)
> > +{
> > +	if (!folio_test_swapcache(folio))
> > +		return false;
> > +
> > +	if (__swap_count(entry) != 0)
> > +		return false;
> > +
> > +	/*
> > +	 * The folio ref count for a single-mapped folio that was allocated
> > +	 * in __read_swap_cache_async(), can be a maximum of 3. These are
> the
> > +	 * incrementors of the folio ref count in __read_swap_cache_async():
> > +	 * folio_alloc_mpol(), add_to_swap_cache(), folio_add_lru().
> > +	 */
> > +
> > +	if (folio_ref_count(folio) <= 3)
> > +		return true;
> > +
> > +	return false;
> > +}
> > +
> >   static inline bool should_try_to_free_swap(struct folio *folio,
> >   					   struct vm_area_struct *vma,
> >   					   unsigned int fault_flags)
> > @@ -4215,6 +4251,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >   	swp_entry_t entry;
> >   	pte_t pte;
> >   	vm_fault_t ret = 0;
> > +	bool single_mapped_swapcache = false;
> >   	void *shadow = NULL;
> >   	int nr_pages;
> >   	unsigned long page_idx;
> > @@ -4283,51 +4320,90 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >   	if (!folio) {
> >   		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >   		    __swap_count(entry) == 1) {
> > -			/* skip swapcache */
> > -			folio = alloc_swap_folio(vmf);
> > -			if (folio) {
> > -				__folio_set_locked(folio);
> > -				__folio_set_swapbacked(folio);
> > -
> > -				nr_pages = folio_nr_pages(folio);
> > -				if (folio_test_large(folio))
> > -					entry.val = ALIGN_DOWN(entry.val, nr_pages);
> > -				/*
> > -				 * Prevent parallel swapin from proceeding with
> > -				 * the cache flag. Otherwise, another thread
> > -				 * may finish swapin first, free the entry, and
> > -				 * swapout reusing the same entry. It's
> > -				 * undetectable as pte_same() returns true due
> > -				 * to entry reuse.
> > -				 */
> > -				if (swapcache_prepare(entry, nr_pages)) {
> > +			if (zswap_never_enabled()) {
> > +				/* skip swapcache */
> > +				folio = alloc_swap_folio(vmf);
> > +				if (folio) {
> > +					__folio_set_locked(folio);
> > +					__folio_set_swapbacked(folio);
> > +
> > +					nr_pages = folio_nr_pages(folio);
> > +					if (folio_test_large(folio))
> > +						entry.val = ALIGN_DOWN(entry.val, nr_pages);
> >   					/*
> > -					 * Relax a bit to prevent rapid
> > -					 * repeated page faults.
> > +					 * Prevent parallel swapin from proceeding with
> > +					 * the cache flag. Otherwise, another thread
> > +					 * may finish swapin first, free the entry, and
> > +					 * swapout reusing the same entry. It's
> > +					 * undetectable as pte_same() returns true due
> > +					 * to entry reuse.
> >   					 */
> > -					add_wait_queue(&swapcache_wq, &wait);
> > -					schedule_timeout_uninterruptible(1);
> > -					remove_wait_queue(&swapcache_wq, &wait);
> > -					goto out_page;
> > +					if (swapcache_prepare(entry, nr_pages)) {
> > +						/*
> > +						 * Relax a bit to prevent rapid
> > +						 * repeated page faults.
> > +						 */
> > +						add_wait_queue(&swapcache_wq, &wait);
> > +						schedule_timeout_uninterruptible(1);
> > +						remove_wait_queue(&swapcache_wq, &wait);
> > +						goto out_page;
> > +					}
> > +					need_clear_cache = true;
> > +
> > +					mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
> > +
> > +					shadow = get_shadow_from_swap_cache(entry);
> > +					if (shadow)
> > +						workingset_refault(folio, shadow);
> > +
> > +					folio_add_lru(folio);
> > +
> > +					/* To provide entry to swap_read_folio() */
> > +					folio->swap = entry;
> > +					swap_read_folio(folio, NULL, NULL, NULL);
> > +					folio->private = NULL;
> > +				}
> > +			} else {
> > +				/*
> > +				 * zswap is enabled or was enabled at some point.
> > +				 * Don't skip swapcache.
> > +				 *
> > +				 * swapin readahead based batching interface
> > +				 * for zswap batched loads using IAA:
> > +				 *
> > +				 * Readahead is invoked in this path only if
> > +				 * the sys swap "singlemapped_ra_enabled" swap
> > +				 * parameter is set to true. By default,
> > +				 * "singlemapped_ra_enabled" is set to false,
> > +				 * the recommended setting for software compressors.
> > +				 * For IAA, if "singlemapped_ra_enabled" is set
> > +				 * to true, readahead will be deployed in this path
> > +				 * as well.
> > +				 *
> > +				 * For single-mapped pages, the batching interface
> > +				 * calls __read_swap_cache_async() to allocate and
> > +				 * place the faulting page in the swapcache. This is
> > +				 * to handle a scenario where the faulting page in
> > +				 * this process happens to simultaneously be a
> > +				 * readahead page in another process. By placing the
> > +				 * single-mapped faulting page in the swapcache,
> > +				 * we avoid race conditions and duplicate page
> > +				 * allocations under these scenarios.
> > +				 */
> > +				folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > +							 vmf, true);
> > +				if (!folio) {
> > +					ret = VM_FAULT_OOM;
> > +					goto out;
> >   				}
> > -				need_clear_cache = true;
> > -
> > -				mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
> > -
> > -				shadow = get_shadow_from_swap_cache(entry);
> > -				if (shadow)
> > -					workingset_refault(folio, shadow);
> > -
> > -				folio_add_lru(folio);
> >
> > -				/* To provide entry to swap_read_folio() */
> > -				folio->swap = entry;
> > -				swap_read_folio(folio, NULL, NULL, NULL);
> > -				folio->private = NULL;
> > -			}
> > +				single_mapped_swapcache = true;
> > +				nr_pages = folio_nr_pages(folio);
> > +				swapcache = folio;
> > +			} /* swapin with zswap support. */
> >   		} else {
> >   			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > -						vmf);
> > +						 vmf, false);
> >   			swapcache = folio;
> 
> I'm sorry, but making this function ever more complicated and ugly is
> not going to fly. The zswap special casing is quite ugly here as well.
> 
> Is there a way forward that we can make this code actually readable and
> avoid zswap special casing?

Yes, I realize this is now quite cluttered. I need to think some more
about how to make this more readable, and would appreciate any
suggestions on how to structure it better.
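
One direction I am considering is to pull the entire single-mapped
SWP_SYNCHRONOUS_IO swapin out of do_swap_page() into a helper, so the
zswap special casing lives in exactly one place. A rough, untested
sketch (the helper names below are made up; the skip-swapcache branch
would just be today's code moved verbatim into its own function):

	/*
	 * Sketch only: resolve a single-mapped SWP_SYNCHRONOUS_IO fault.
	 * Returns the faulting folio or NULL. Sets *swapcachep when the
	 * folio was placed in the swapcache (zswap is or was enabled), and
	 * *need_clear_cache when the skip-swapcache path reserved the
	 * entry with swapcache_prepare().
	 */
	static struct folio *swapin_single_mapped(struct vm_fault *vmf,
						  swp_entry_t entry,
						  struct folio **swapcachep,
						  bool *need_clear_cache)
	{
		if (zswap_never_enabled()) {
			/* existing skip-swapcache code, moved as-is */
			return swapin_skip_swapcache(vmf, entry,
						     need_clear_cache);
		}

		/* zswap is or was enabled: go through the swapcache. */
		*swapcachep = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
					       vmf, true);
		return *swapcachep;
	}

do_swap_page() would then only contain:

	if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
	    __swap_count(entry) == 1)
		folio = swapin_single_mapped(vmf, entry, &swapcache,
					     &need_clear_cache);
	else
		folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
					 vmf, false);

and single_mapped_swapcache would simply follow from swapcache being set
in that branch. This should also shrink the diff in do_swap_page()
considerably.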

Thanks,
Kanchana

> 
> --
> Cheers,
> 
> David / dhildenb


^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface.
  2024-10-18 17:21       ` Nhat Pham
@ 2024-10-18 21:59         ` Sridhar, Kanchana P
  2024-10-20 16:50           ` Usama Arif
  0 siblings, 1 reply; 15+ messages in thread
From: Sridhar, Kanchana P @ 2024-10-18 21:59 UTC (permalink / raw)
  To: Nhat Pham, Usama Arif
  Cc: David Hildenbrand, linux-kernel, linux-mm, hannes, yosryahmed,
	chengming.zhou, ryan.roberts, Huang, Ying, 21cnbao, akpm, hughd,
	willy, bfoster, dchinner, chrisl, Feghali, Wajdi K, Gopal,
	Vinodh, Sridhar, Kanchana P

Hi Usama, Nhat,

> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Friday, October 18, 2024 10:21 AM
> To: Usama Arif <usamaarif642@gmail.com>
> Cc: David Hildenbrand <david@redhat.com>; Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com>; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; hannes@cmpxchg.org; yosryahmed@google.com;
> chengming.zhou@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> hughd@google.com; willy@infradead.org; bfoster@redhat.com;
> dchinner@redhat.com; chrisl@kernel.org; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls
> swapin_readahead() zswap load batching interface.
> 
> On Fri, Oct 18, 2024 at 4:04 AM Usama Arif <usamaarif642@gmail.com>
> wrote:
> >
> >
> > On 18/10/2024 08:26, David Hildenbrand wrote:
> > > On 18.10.24 08:48, Kanchana P Sridhar wrote:
> > >> This patch invokes the swapin_readahead() based batching interface to
> > >> prefetch a batch of 4K folios for zswap load with batch decompressions
> > >> in parallel using IAA hardware. swapin_readahead() prefetches folios
> based
> > >> on vm.page-cluster and the usefulness of prior prefetches to the
> > >> workload. As folios are created in the swapcache and the readahead
> code
> > >> calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch",
> the
> > >> respective folio_batches get populated with the folios to be read.
> > >>
> > >> Finally, the swapin_readahead() procedures will call the newly added
> > >> process_ra_batch_of_same_type() which:
> > >>
> > >>   1) Reads all the non_zswap_batch folios sequentially by calling
> > >>      swap_read_folio().
> > >>   2) Calls swap_read_zswap_batch_unplug() with the zswap_batch which
> calls
> > >>      zswap_finish_load_batch() that finally decompresses each
> > >>      SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. upto 8 pages in a
> prefetch
> > >>      batch of say, 32 folios) in parallel with IAA.
> > >>
> > >> Within do_swap_page(), we try to benefit from batch decompressions in
> both
> > >> these scenarios:
> > >>
> > >>   1) single-mapped, SWP_SYNCHRONOUS_IO:
> > >>        We call swapin_readahead() with "single_mapped_path = true". This
> is
> > >>        done only in the !zswap_never_enabled() case.
> > >>   2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
> > >>        We call swapin_readahead() with "single_mapped_path = false".
> > >>
> > >> This will place folios in the swapcache: a design choice that handles
> cases
> > >> where a folio that is "single-mapped" in process 1 could be prefetched in
> > >> process 2; and handles highly contended server scenarios with stability.
> > >> There are checks added at the end of do_swap_page(), after the folio has
> > >> been successfully loaded, to detect if the single-mapped swapcache folio
> is
> > >> still single-mapped, and if so, folio_free_swap() is called on the folio.
> > >>
> > >> Within the swapin_readahead() functions, if single_mapped_path is true,
> and
> > >> either the platform does not have IAA, or, if the platform has IAA and the
> > >> user selects a software compressor for zswap (details of sysfs knob
> > >> follow), readahead/batching are skipped and the folio is loaded using
> > >> zswap_load().
> > >>
> > >> A new swap parameter "singlemapped_ra_enabled" (false by default) is
> added
> > >> for platforms that have IAA, zswap_load_batching_enabled() is true, and
> we
> > >> want to give the user the option to run experiments with IAA and with
> > >> software compressors for zswap (swap device is
> SWP_SYNCHRONOUS_IO):
> > >>
> > >> For IAA:
> > >>   echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
> > >>
> > >> For software compressors:
> > >>   echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
> > >>
> > >> If "singlemapped_ra_enabled" is set to false, swapin_readahead() will
> skip
> > >> prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO"
> do_swap_page()
> > >> path.
> > >>
> > >> Thanks Ying Huang for the really helpful brainstorming discussions on the
> > >> swap_read_folio() plug design.
> > >>
> > >> Suggested-by: Ying Huang <ying.huang@intel.com>
> > >> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > >> ---
> > >>   mm/memory.c     | 187 +++++++++++++++++++++++++++++++++++++--
> ---------
> > >>   mm/shmem.c      |   2 +-
> > >>   mm/swap.h       |  12 ++--
> > >>   mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++--
> --
> > >>   mm/swapfile.c   |   2 +-
> > >>   5 files changed, 299 insertions(+), 61 deletions(-)
> > >>
> > >> diff --git a/mm/memory.c b/mm/memory.c
> > >> index b5745b9ffdf7..9655b85fc243 100644
> > >> --- a/mm/memory.c
> > >> +++ b/mm/memory.c
> > >> @@ -3924,6 +3924,42 @@ static vm_fault_t
> remove_device_exclusive_entry(struct vm_fault *vmf)
> > >>       return 0;
> > >>   }
> > >>   +/*
> > >> + * swapin readahead based batching interface for zswap batched loads
> using IAA:
> > >> + *
> > >> + * Should only be called for and if the faulting swap entry in
> do_swap_page
> > >> + * is single-mapped and SWP_SYNCHRONOUS_IO.
> > >> + *
> > >> + * Detect if the folio is in the swapcache, is still mapped to only this
> > >> + * process, and further, there are no additional references to this folio
> > >> + * (for e.g. if another process simultaneously readahead this swap entry
> > >> + * while this process was handling the page-fault, and got a pointer to
> the
> > >> + * folio allocated by this process in the swapcache), besides the
> references
> > >> + * that were obtained within __read_swap_cache_async() by this
> process that is
> > >> + * faulting in this single-mapped swap entry.
> > >> + */
> > >
> > > How is this supposed to work for large folios?
> > >
> >
> > Hi,
> >
> > I was looking at zswapin large folio support and have posted a RFC in [1].
> > I got bogged down with some prod stuff, so wasn't able to send it earlier.
> >
> > It looks quite different, and I think simpler from this series, so might be
> > a good comparison.
> >
> > [1] https://lore.kernel.org/all/20241018105026.2521366-1-
> usamaarif642@gmail.com/
> >
> > Thanks,
> > Usama
> 
> I agree.
> 
> I think the lower hanging fruit here is to build upon Usama's patch.
> Kanchana, do you think we can just use the new batch decompressing
> infrastructure, and apply it to Usama's large folio zswap loading?
> 
> I'm not denying the readahead idea outright, but that seems much more
> complicated. There are questions regarding the benefits of
> readahead-ing when apply to zswap in the first place - IIUC, zram
> circumvents that logic in several cases, and zswap shares many
> characteristics with zram (fast, synchronous compression devices).
> 
> So let's reap the low hanging fruits first, get the wins as well as
> stress test the new infrastructure. Then we can discuss the readahead
> idea later?

Thanks Usama for publishing the zswap large folios swapin series, and
thanks Nhat for your suggestions.  Sure, I can look into integrating the
new batch decompressing infrastructure with Usama's large folio zswap
loading.

However, I think we need to get clarity on a bigger question: does it
make sense to swap in large folios at all? Some important considerations
would be:

1) What are the tradeoffs in memory footprint of swapping in a large
   folio?
2) If we decide to let the user determine this via, say, an option that
   sets the swapin granularity (e.g. no more than 32k at a time), how
   does this constrain compression and zpool storage granularity?

Ultimately, I feel the bigger question is the memory utilization cost of
large folio swapin. The swapin_readahead() based approach uses the
workload's prefetch-usefulness characteristics to decide how many 4K
folios to bring in, and recovers efficiency through strategies like
parallel decompression, so as to strike a balance between memory
utilization and latency.
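
As a concrete example of that balance: with vm.page-cluster=5, a fully
useful readahead window brings in 32 4K folios, and the zswap_batch path
decompresses them as sub-batches of up to 8 pages in parallel on IAA
rather than as 32 serial software decompressions; if the prefetches stop
being useful, the readahead window shrinks and so does the extra memory
pulled in.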

Usama, I downloaded your patch-series and tried to understand this
better, and wanted to share the data.

I ran the kernel compilation "allmodconfig" workload with zstd,
vm.page-cluster=0, and 16k/32k/64k large folios set to "always":

16k/32k/64k folios: kernel compilation with zstd:
 =================================================

 ------------------------------------------------------------------------------
                        mm-unstable-10-16-2024    + zswap large folios swapin
                                                                       series
 ------------------------------------------------------------------------------
 zswap compressor                         zstd                           zstd
 vm.page-cluster                             0                              0
 ------------------------------------------------------------------------------
 real_sec                               772.53                         870.61
 user_sec                            15,780.29                      15,836.71
 sys_sec                              5,353.20                       6,185.02
 Max_Res_Set_Size_KB                 1,873,348                      1,873,004
                                                                             
 ------------------------------------------------------------------------------
 memcg_high                                  0                              0
 memcg_swap_fail                             0                              0
 zswpout                            93,811,916                    111,663,872
 zswpin                             27,150,029                     54,730,678
 pswpout                                    64                             59
 pswpin                                     78                             53
 thp_swpout                                  0                              0
 thp_swpout_fallback                         0                              0
 16kB-mthp_swpout_fallback                   0                              0
 32kB-mthp_swpout_fallback                   0                              0
 64kB-mthp_swpout_fallback               5,470                              0
 pgmajfault                         29,019,256                     16,615,820
 swap_ra                                     0                              0
 swap_ra_hit                             3,004                          3,614
 ZSWPOUT-16kB                        1,324,160                      2,252,747
 ZSWPOUT-32kB                          730,534                      1,356,640
 ZSWPOUT-64kB                        3,039,760                      3,955,034
 ZSWPIN-16kB                                                        1,496,916
 ZSWPIN-32kB                                                        1,131,176
 ZSWPIN-64kB                                                        1,866,884
 SWPOUT-16kB                                 0                              0
 SWPOUT-32kB                                 0                              0
 SWPOUT-64kB                                 4                              3
 ------------------------------------------------------------------------------

There is considerably higher swap activity as a result of swapping in
large folios: zswpin roughly doubles (27.2M -> 54.7M), zswpout rises by
~19%, and even though major faults drop by ~43%, sys time goes up ~15%
and elapsed time ~13%.

I would appreciate thoughts on the usefulness of swapping in large
folios, given the considerations outlined earlier and other factors.

Thanks,
Kanchana

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface.
  2024-10-18 21:59         ` Sridhar, Kanchana P
@ 2024-10-20 16:50           ` Usama Arif
  2024-10-20 20:12             ` Sridhar, Kanchana P
  0 siblings, 1 reply; 15+ messages in thread
From: Usama Arif @ 2024-10-20 16:50 UTC (permalink / raw)
  To: Sridhar, Kanchana P, Nhat Pham
  Cc: David Hildenbrand, linux-kernel, linux-mm, hannes, yosryahmed,
	chengming.zhou, ryan.roberts, Huang, Ying, 21cnbao, akpm, hughd,
	willy, bfoster, dchinner, chrisl, Feghali, Wajdi K, Gopal,
	Vinodh



On 18/10/2024 22:59, Sridhar, Kanchana P wrote:
> Hi Usama, Nhat,
> 
>> -----Original Message-----
>> From: Nhat Pham <nphamcs@gmail.com>
>> Sent: Friday, October 18, 2024 10:21 AM
>> To: Usama Arif <usamaarif642@gmail.com>
>> Cc: David Hildenbrand <david@redhat.com>; Sridhar, Kanchana P
>> <kanchana.p.sridhar@intel.com>; linux-kernel@vger.kernel.org; linux-
>> mm@kvack.org; hannes@cmpxchg.org; yosryahmed@google.com;
>> chengming.zhou@linux.dev; ryan.roberts@arm.com; Huang, Ying
>> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
>> hughd@google.com; willy@infradead.org; bfoster@redhat.com;
>> dchinner@redhat.com; chrisl@kernel.org; Feghali, Wajdi K
>> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
>> Subject: Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls
>> swapin_readahead() zswap load batching interface.
>>
>> On Fri, Oct 18, 2024 at 4:04 AM Usama Arif <usamaarif642@gmail.com>
>> wrote:
>>>
>>>
>>> On 18/10/2024 08:26, David Hildenbrand wrote:
>>>> On 18.10.24 08:48, Kanchana P Sridhar wrote:
>>>>> This patch invokes the swapin_readahead() based batching interface to
>>>>> prefetch a batch of 4K folios for zswap load with batch decompressions
>>>>> in parallel using IAA hardware. swapin_readahead() prefetches folios
>> based
>>>>> on vm.page-cluster and the usefulness of prior prefetches to the
>>>>> workload. As folios are created in the swapcache and the readahead
>> code
>>>>> calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch",
>> the
>>>>> respective folio_batches get populated with the folios to be read.
>>>>>
>>>>> Finally, the swapin_readahead() procedures will call the newly added
>>>>> process_ra_batch_of_same_type() which:
>>>>>
>>>>>   1) Reads all the non_zswap_batch folios sequentially by calling
>>>>>      swap_read_folio().
>>>>>   2) Calls swap_read_zswap_batch_unplug() with the zswap_batch which
>> calls
>>>>>      zswap_finish_load_batch() that finally decompresses each
>>>>>      SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. upto 8 pages in a
>> prefetch
>>>>>      batch of say, 32 folios) in parallel with IAA.
>>>>>
>>>>> Within do_swap_page(), we try to benefit from batch decompressions in
>> both
>>>>> these scenarios:
>>>>>
>>>>>   1) single-mapped, SWP_SYNCHRONOUS_IO:
>>>>>        We call swapin_readahead() with "single_mapped_path = true". This
>> is
>>>>>        done only in the !zswap_never_enabled() case.
>>>>>   2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
>>>>>        We call swapin_readahead() with "single_mapped_path = false".
>>>>>
>>>>> This will place folios in the swapcache: a design choice that handles
>> cases
>>>>> where a folio that is "single-mapped" in process 1 could be prefetched in
>>>>> process 2; and handles highly contended server scenarios with stability.
>>>>> There are checks added at the end of do_swap_page(), after the folio has
>>>>> been successfully loaded, to detect if the single-mapped swapcache folio
>> is
>>>>> still single-mapped, and if so, folio_free_swap() is called on the folio.
>>>>>
>>>>> Within the swapin_readahead() functions, if single_mapped_path is true,
>> and
>>>>> either the platform does not have IAA, or, if the platform has IAA and the
>>>>> user selects a software compressor for zswap (details of sysfs knob
>>>>> follow), readahead/batching are skipped and the folio is loaded using
>>>>> zswap_load().
>>>>>
>>>>> A new swap parameter "singlemapped_ra_enabled" (false by default) is
>> added
>>>>> for platforms that have IAA, zswap_load_batching_enabled() is true, and
>> we
>>>>> want to give the user the option to run experiments with IAA and with
>>>>> software compressors for zswap (swap device is
>> SWP_SYNCHRONOUS_IO):
>>>>>
>>>>> For IAA:
>>>>>   echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
>>>>>
>>>>> For software compressors:
>>>>>   echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
>>>>>
>>>>> If "singlemapped_ra_enabled" is set to false, swapin_readahead() will
>> skip
>>>>> prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO"
>> do_swap_page()
>>>>> path.
>>>>>
>>>>> Thanks Ying Huang for the really helpful brainstorming discussions on the
>>>>> swap_read_folio() plug design.
>>>>>
>>>>> Suggested-by: Ying Huang <ying.huang@intel.com>
>>>>> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
>>>>> ---
>>>>>   mm/memory.c     | 187 +++++++++++++++++++++++++++++++++++++--
>> ---------
>>>>>   mm/shmem.c      |   2 +-
>>>>>   mm/swap.h       |  12 ++--
>>>>>   mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++--
>> --
>>>>>   mm/swapfile.c   |   2 +-
>>>>>   5 files changed, 299 insertions(+), 61 deletions(-)
>>>>>
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index b5745b9ffdf7..9655b85fc243 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -3924,6 +3924,42 @@ static vm_fault_t
>> remove_device_exclusive_entry(struct vm_fault *vmf)
>>>>>       return 0;
>>>>>   }
>>>>>   +/*
>>>>> + * swapin readahead based batching interface for zswap batched loads
>> using IAA:
>>>>> + *
>>>>> + * Should only be called for and if the faulting swap entry in
>> do_swap_page
>>>>> + * is single-mapped and SWP_SYNCHRONOUS_IO.
>>>>> + *
>>>>> + * Detect if the folio is in the swapcache, is still mapped to only this
>>>>> + * process, and further, there are no additional references to this folio
>>>>> + * (for e.g. if another process simultaneously readahead this swap entry
>>>>> + * while this process was handling the page-fault, and got a pointer to
>> the
>>>>> + * folio allocated by this process in the swapcache), besides the
>> references
>>>>> + * that were obtained within __read_swap_cache_async() by this
>> process that is
>>>>> + * faulting in this single-mapped swap entry.
>>>>> + */
>>>>
>>>> How is this supposed to work for large folios?
>>>>
>>>
>>> Hi,
>>>
>>> I was looking at zswapin large folio support and have posted a RFC in [1].
>>> I got bogged down with some prod stuff, so wasn't able to send it earlier.
>>>
>>> It looks quite different, and I think simpler from this series, so might be
>>> a good comparison.
>>>
>>> [1] https://lore.kernel.org/all/20241018105026.2521366-1-
>> usamaarif642@gmail.com/
>>>
>>> Thanks,
>>> Usama
>>
>> I agree.
>>
>> I think the lower hanging fruit here is to build upon Usama's patch.
>> Kanchana, do you think we can just use the new batch decompressing
>> infrastructure, and apply it to Usama's large folio zswap loading?
>>
>> I'm not denying the readahead idea outright, but that seems much more
>> complicated. There are questions regarding the benefits of
>> readahead-ing when apply to zswap in the first place - IIUC, zram
>> circumvents that logic in several cases, and zswap shares many
>> characteristics with zram (fast, synchronous compression devices).
>>
>> So let's reap the low hanging fruits first, get the wins as well as
>> stress test the new infrastructure. Then we can discuss the readahead
>> idea later?
> 
> Thanks Usama for publishing the zswap large folios swapin series, and
> thanks Nhat for your suggestions.  Sure, I can look into integrating the
> new batch decompressing infrastructure with Usama's large folio zswap
> loading.
> 
> However, I think we need to get clarity on a bigger question: does it
> make sense to swapin large folios? Some important considerations
> would be:
> 
> 1) What are the tradeoffs in memory footprint cost of swapping in a
>     large folio?

I would say the pros and cons of swapping in a large folio are the same as
the pros and cons of large folios in general.
As mentioned in my cover letter and in the series that introduced large
folios, you get fewer page faults, batched PTE and rmap manipulation, a
shorter LRU list, and TLB coalescing (on arm64 and AMD), at the cost of
higher memory usage and fragmentation.

The other thing is that the series I wrote is hopefully just a start.
As Barry showed for zram in
https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/
there is a significant improvement in CPU utilization and compression
ratios when compressing at larger granularity. Hopefully we can try to
do something similar for zswap; I'm not sure how that would look yet,
as I haven't started looking at it.
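
If we do explore that for zswap, the simplest first experiment is
probably to hand the whole (physically contiguous) large folio to the
compressor as a single request instead of compressing page by page.
Rough, untested sketch against the existing crypto_acomp usage in zswap
(it assumes zswap's per-CPU struct crypto_acomp_ctx and leaves out error
handling and the zpool sizing questions):

	static int zswap_compress_whole_folio(struct folio *folio,
					      struct crypto_acomp_ctx *acomp_ctx,
					      void *dst, unsigned int *dlen)
	{
		struct scatterlist input, output;
		int ret;

		/* One sg entry covering all folio_size(folio) bytes. */
		sg_init_table(&input, 1);
		sg_set_page(&input, folio_page(folio, 0), folio_size(folio), 0);
		sg_init_one(&output, dst, *dlen);

		acomp_request_set_params(acomp_ctx->req, &input, &output,
					 folio_size(folio), *dlen);
		ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
				      &acomp_ctx->wait);
		*dlen = acomp_ctx->req->dlen;
		return ret;
	}

The interesting part is what to do with the result: one zpool object per
folio changes the entry layout and what a 4K-only fault would have to
decompress, which is basically the granularity question you raise in
point 2) below.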

> 2) If we decide to let the user determine this by say, an option that
>      determines the swapin granularity (e.g. no more than 32k at a time),
>      how does this constrain compression and zpool storage granularity?
> 
Right now, whether or not large folios are swapped in is determined by
/sys/kernel/mm/transparent_hugepage/hugepages-*kB/enabled.
I assume that when someone sets some of these to "always", they know
that their workload works best with those page sizes, so they would want
folios to be swapped in and used at that size as well?

There might be some merit in adding something like
/sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled,
as you might start thrashing swap if you are, for example, swapping
in 1M folios and there isn't enough memory for them, which causes
you to swap out another folio in its place.
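
If we went that route, the swapin side could just mask the allowable
mTHP orders with whatever the admin opted into. Hypothetical sketch
(neither "swapin_enabled" nor the "huge_swapin_orders" mask exists
today, and the sysfs plumbing is omitted):

	/* Hypothetical mask built from hugepages-*kB/swapin_enabled. */
	extern unsigned long huge_swapin_orders;

	static unsigned long swapin_allowed_orders(struct vm_fault *vmf)
	{
		struct vm_area_struct *vma = vmf->vma;
		unsigned long orders;

		orders = thp_vma_allowable_orders(vma, vma->vm_flags,
						  TVA_IN_PF | TVA_ENFORCE_SYSFS,
						  BIT(PMD_ORDER) - 1);
		orders = thp_vma_suitable_orders(vma, vmf->address, orders);

		/* Only swap in at sizes the admin explicitly enabled. */
		return orders & huge_swapin_orders;
	}

That way order-0 swapin stays the default and large folio swapin is
strictly opt-in per size.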

> Ultimately, I feel the bigger question is about memory utilization cost
> of large folio swapin. The swapin_readahead() based approach tries to
> use the prefetch-usefulness characteristics of the workload to improve
> the efficiency of multiple 4k folios by using strategies like parallel
> decompression, to strike some balance in memory utilization vs.
> efficiency.
> 
> Usama, I downloaded your patch-series and tried to understand this
> better, and wanted to share the data.
> 
> I ran the kernel compilation "allmodconfig" with zstd, page-cluster=0,
> and 16k/32k/64k large folios enabled to "always":
> 
> 16k/32k/64k folios: kernel compilation with zstd:
>  =================================================
> 
>  ------------------------------------------------------------------------------
>                         mm-unstable-10-16-2024    + zswap large folios swapin
>                                                                        series
>  ------------------------------------------------------------------------------
>  zswap compressor                         zstd                           zstd
>  vm.page-cluster                             0                              0
>  ------------------------------------------------------------------------------
>  real_sec                               772.53                         870.61
>  user_sec                            15,780.29                      15,836.71
>  sys_sec                              5,353.20                       6,185.02
>  Max_Res_Set_Size_KB                 1,873,348                      1,873,004
>                                                                              
>  ------------------------------------------------------------------------------
>  memcg_high                                  0                              0
>  memcg_swap_fail                             0                              0
>  zswpout                            93,811,916                    111,663,872
>  zswpin                             27,150,029                     54,730,678
>  pswpout                                    64                             59
>  pswpin                                     78                             53
>  thp_swpout                                  0                              0
>  thp_swpout_fallback                         0                              0
>  16kB-mthp_swpout_fallback                   0                              0
>  32kB-mthp_swpout_fallback                   0                              0
>  64kB-mthp_swpout_fallback               5,470                              0
>  pgmajfault                         29,019,256                     16,615,820
>  swap_ra                                     0                              0
>  swap_ra_hit                             3,004                          3,614
>  ZSWPOUT-16kB                        1,324,160                      2,252,747
>  ZSWPOUT-32kB                          730,534                      1,356,640
>  ZSWPOUT-64kB                        3,039,760                      3,955,034
>  ZSWPIN-16kB                                                        1,496,916
>  ZSWPIN-32kB                                                        1,131,176
>  ZSWPIN-64kB                                                        1,866,884
>  SWPOUT-16kB                                 0                              0
>  SWPOUT-32kB                                 0                              0
>  SWPOUT-64kB                                 4                              3
>  ------------------------------------------------------------------------------
> 
> It does appear like there is considerably higher swapout and swapin
> activity as a result of swapping in large folios, which does end up
> impacting performance.

Thanks for having a look!
I had only tested with the zswapin latency microbenchmark included in my
cover letter.
In general I expected zswap activity to go up, as you are more likely to
experience memory pressure when swapping in large folios, but in return
you get fewer page faults and the advantage of lower TLB pressure on AMD
and arm64.

Thanks for the test, those numbers look quite extreme! I think there is a
lot of swap thrashing.
I am assuming you are testing on an Intel machine, where you don't get the
large-folio advantage of fewer TLB misses. I will try to get an AMD machine,
which has TLB coalescing, or an ARM server with CONT_PTE, to see if the
numbers get better.

It might be better for large folio zswapin to be considered along with
larger-granularity compression, to get all the benefits of large folios
(and hopefully better numbers). I think that was the approach taken for
zram as well.

> 
> I would appreciate thoughts on understanding the usefulness of
> swapping in large folios, with the considerations outlined earlier/other
> factors.
> 
> Thanks,
> Kanchana



^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface.
  2024-10-20 16:50           ` Usama Arif
@ 2024-10-20 20:12             ` Sridhar, Kanchana P
  0 siblings, 0 replies; 15+ messages in thread
From: Sridhar, Kanchana P @ 2024-10-20 20:12 UTC (permalink / raw)
  To: Usama Arif, Nhat Pham
  Cc: David Hildenbrand, linux-kernel, linux-mm, hannes, yosryahmed,
	chengming.zhou, ryan.roberts, Huang, Ying, 21cnbao, akpm, hughd,
	willy, bfoster, dchinner, chrisl, Feghali, Wajdi K, Gopal,
	Vinodh, Sridhar, Kanchana P

Hi Usama,

> -----Original Message-----
> From: Usama Arif <usamaarif642@gmail.com>
> Sent: Sunday, October 20, 2024 9:50 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; Nhat Pham
> <nphamcs@gmail.com>
> Cc: David Hildenbrand <david@redhat.com>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; hannes@cmpxchg.org; yosryahmed@google.com;
> chengming.zhou@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> hughd@google.com; willy@infradead.org; bfoster@redhat.com;
> dchinner@redhat.com; chrisl@kernel.org; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls
> swapin_readahead() zswap load batching interface.
> 
> 
> 
> On 18/10/2024 22:59, Sridhar, Kanchana P wrote:
> > Hi Usama, Nhat,
> >
> >> -----Original Message-----
> >> From: Nhat Pham <nphamcs@gmail.com>
> >> Sent: Friday, October 18, 2024 10:21 AM
> >> To: Usama Arif <usamaarif642@gmail.com>
> >> Cc: David Hildenbrand <david@redhat.com>; Sridhar, Kanchana P
> >> <kanchana.p.sridhar@intel.com>; linux-kernel@vger.kernel.org; linux-
> >> mm@kvack.org; hannes@cmpxchg.org; yosryahmed@google.com;
> >> chengming.zhou@linux.dev; ryan.roberts@arm.com; Huang, Ying
> >> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org;
> >> hughd@google.com; willy@infradead.org; bfoster@redhat.com;
> >> dchinner@redhat.com; chrisl@kernel.org; Feghali, Wajdi K
> >> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> >> Subject: Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls
> >> swapin_readahead() zswap load batching interface.
> >>
> >> On Fri, Oct 18, 2024 at 4:04 AM Usama Arif <usamaarif642@gmail.com>
> >> wrote:
> >>>
> >>>
> >>> On 18/10/2024 08:26, David Hildenbrand wrote:
> >>>> On 18.10.24 08:48, Kanchana P Sridhar wrote:
> >>>>> This patch invokes the swapin_readahead() based batching interface to
> >>>>> prefetch a batch of 4K folios for zswap load with batch decompressions
> >>>>> in parallel using IAA hardware. swapin_readahead() prefetches folios
> >> based
> >>>>> on vm.page-cluster and the usefulness of prior prefetches to the
> >>>>> workload. As folios are created in the swapcache and the readahead
> >> code
> >>>>> calls swap_read_folio() with a "zswap_batch" and a
> "non_zswap_batch",
> >> the
> >>>>> respective folio_batches get populated with the folios to be read.
> >>>>>
> >>>>> Finally, the swapin_readahead() procedures will call the newly added
> >>>>> process_ra_batch_of_same_type() which:
> >>>>>
> >>>>>   1) Reads all the non_zswap_batch folios sequentially by calling
> >>>>>      swap_read_folio().
> >>>>>   2) Calls swap_read_zswap_batch_unplug() with the zswap_batch
> which
> >> calls
> >>>>>      zswap_finish_load_batch() that finally decompresses each
> >>>>>      SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. upto 8 pages in a
> >> prefetch
> >>>>>      batch of say, 32 folios) in parallel with IAA.
> >>>>>
> >>>>> Within do_swap_page(), we try to benefit from batch decompressions
> in
> >> both
> >>>>> these scenarios:
> >>>>>
> >>>>>   1) single-mapped, SWP_SYNCHRONOUS_IO:
> >>>>>        We call swapin_readahead() with "single_mapped_path = true".
> This
> >> is
> >>>>>        done only in the !zswap_never_enabled() case.
> >>>>>   2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
> >>>>>        We call swapin_readahead() with "single_mapped_path = false".
> >>>>>
> >>>>> This will place folios in the swapcache: a design choice that handles
> >> cases
> >>>>> where a folio that is "single-mapped" in process 1 could be prefetched
> in
> >>>>> process 2; and handles highly contended server scenarios with
> stability.
> >>>>> There are checks added at the end of do_swap_page(), after the folio
> has
> >>>>> been successfully loaded, to detect if the single-mapped swapcache
> folio
> >> is
> >>>>> still single-mapped, and if so, folio_free_swap() is called on the folio.
> >>>>>
> >>>>> Within the swapin_readahead() functions, if single_mapped_path is
> true,
> >> and
> >>>>> either the platform does not have IAA, or, if the platform has IAA and
> the
> >>>>> user selects a software compressor for zswap (details of sysfs knob
> >>>>> follow), readahead/batching are skipped and the folio is loaded using
> >>>>> zswap_load().
> >>>>>
> >>>>> A new swap parameter "singlemapped_ra_enabled" (false by default)
> is
> >> added
> >>>>> for platforms that have IAA, zswap_load_batching_enabled() is true,
> and
> >> we
> >>>>> want to give the user the option to run experiments with IAA and with
> >>>>> software compressors for zswap (swap device is
> >> SWP_SYNCHRONOUS_IO):
> >>>>>
> >>>>> For IAA:
> >>>>>   echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
> >>>>>
> >>>>> For software compressors:
> >>>>>   echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
> >>>>>
> >>>>> If "singlemapped_ra_enabled" is set to false, swapin_readahead() will
> >> skip
> >>>>> prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO"
> >> do_swap_page()
> >>>>> path.
> >>>>>
> >>>>> Thanks Ying Huang for the really helpful brainstorming discussions on
> the
> >>>>> swap_read_folio() plug design.
> >>>>>
> >>>>> Suggested-by: Ying Huang <ying.huang@intel.com>
> >>>>> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> >>>>> ---
> >>>>>   mm/memory.c     | 187
> +++++++++++++++++++++++++++++++++++++--
> >> ---------
> >>>>>   mm/shmem.c      |   2 +-
> >>>>>   mm/swap.h       |  12 ++--
> >>>>>   mm/swap_state.c | 157
> ++++++++++++++++++++++++++++++++++++--
> >> --
> >>>>>   mm/swapfile.c   |   2 +-
> >>>>>   5 files changed, 299 insertions(+), 61 deletions(-)
> >>>>>
> >>>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>>> index b5745b9ffdf7..9655b85fc243 100644
> >>>>> --- a/mm/memory.c
> >>>>> +++ b/mm/memory.c
> >>>>> @@ -3924,6 +3924,42 @@ static vm_fault_t
> >> remove_device_exclusive_entry(struct vm_fault *vmf)
> >>>>>       return 0;
> >>>>>   }
> >>>>>   +/*
> >>>>> + * swapin readahead based batching interface for zswap batched
> loads
> >> using IAA:
> >>>>> + *
> >>>>> + * Should only be called for and if the faulting swap entry in
> >> do_swap_page
> >>>>> + * is single-mapped and SWP_SYNCHRONOUS_IO.
> >>>>> + *
> >>>>> + * Detect if the folio is in the swapcache, is still mapped to only this
> >>>>> + * process, and further, there are no additional references to this folio
> >>>>> + * (for e.g. if another process simultaneously readahead this swap
> entry
> >>>>> + * while this process was handling the page-fault, and got a pointer to
> >> the
> >>>>> + * folio allocated by this process in the swapcache), besides the
> >> references
> >>>>> + * that were obtained within __read_swap_cache_async() by this
> >> process that is
> >>>>> + * faulting in this single-mapped swap entry.
> >>>>> + */
> >>>>
> >>>> How is this supposed to work for large folios?
> >>>>
> >>>
> >>> Hi,
> >>>
> >>> I was looking at zswapin large folio support and have posted a RFC in [1].
> >>> I got bogged down with some prod stuff, so wasn't able to send it earlier.
> >>>
> >>> It looks quite different, and I think simpler from this series, so might be
> >>> a good comparison.
> >>>
> >>> [1] https://lore.kernel.org/all/20241018105026.2521366-1-
> >> usamaarif642@gmail.com/
> >>>
> >>> Thanks,
> >>> Usama
> >>
> >> I agree.
> >>
> >> I think the lower hanging fruit here is to build upon Usama's patch.
> >> Kanchana, do you think we can just use the new batch decompressing
> >> infrastructure, and apply it to Usama's large folio zswap loading?
> >>
> >> I'm not denying the readahead idea outright, but that seems much more
> >> complicated. There are questions regarding the benefits of
> >> readahead-ing when apply to zswap in the first place - IIUC, zram
> >> circumvents that logic in several cases, and zswap shares many
> >> characteristics with zram (fast, synchronous compression devices).
> >>
> >> So let's reap the low hanging fruits first, get the wins as well as
> >> stress test the new infrastructure. Then we can discuss the readahead
> >> idea later?
> >
> > Thanks Usama for publishing the zswap large folios swapin series, and
> > thanks Nhat for your suggestions.  Sure, I can look into integrating the
> > new batch decompressing infrastructure with Usama's large folio zswap
> > loading.
> >
> > However, I think we need to get clarity on a bigger question: does it
> > make sense to swapin large folios? Some important considerations
> > would be:
> >
> > 1) What are the tradeoffs in memory footprint cost of swapping in a
> >     large folio?
> 
> I would say the pros and cons of swapping in a large folio are the same as
> the pros and cons of large folios in general.
> As mentioned in my cover letter and the series that introduced large folios
> you get fewer page faults, batched PTE and rmap manipulation, reduced lru
> list,
> TLB coalescing (for arm64 and AMD) at the cost of higher memory usage and
> fragmentation.
> 
> The other thing is that the series I wrote is hopefully just a start.
> As shown by Barry in the case of zram in
> https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/
> there is a significant improvement in CPU utilization and compression
> ratios when compressing at large granularity. Hopefully we can
> try and do something similar for zswap. Not sure how that would look
> for zswap as I haven't started looking at that yet.

Thanks a lot for sharing your thoughts on this! Yes, this makes sense.
It was helpful to hear your perspective on larger compression granularity
with large folios, since we have been trying to answer similar questions
ourselves. We look forward to collaborating with you on getting this
working for zswap!

> 
> > 2) If we decide to let the user determine this by say, an option that
> >      determines the swapin granularity (e.g. no more than 32k at a time),
> >      how does this constrain compression and zpool storage granularity?
> >
> Right now whether or not zswapin happens is determined using
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/enabled
> I assume when the someone sets some of these to always, they know that
> their workload works best with those page sizes, so they would want folios
> to be swapped in and used at that size as well?

Sure, this rationale makes sense.

> 
> There might be some merit in adding something like
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled,
> as you might start thrashing swap if you are for e.g. swapping
> in 1M folios and there isn't enough memory for it, which causes
> you to swapout another folio in its place.

Agreed.

> 
> > Ultimately, I feel the bigger question is about memory utilization cost
> > of large folio swapin. The swapin_readahead() based approach tries to
> > use the prefetch-usefulness characteristics of the workload to improve
> > the efficiency of multiple 4k folios by using strategies like parallel
> > decompression, to strike some balance in memory utilization vs.
> > efficiency.
> >
> > Usama, I downloaded your patch-series and tried to understand this
> > better, and wanted to share the data.
> >
> > I ran the kernel compilation "allmodconfig" with zstd, page-cluster=0,
> > and 16k/32k/64k large folios enabled to "always":
> >
> > 16k/32k/64k folios: kernel compilation with zstd:
> >  =================================================
> >
> >  ------------------------------------------------------------------------------
> >                         mm-unstable-10-16-2024    + zswap large folios swapin
> >                                                                        series
> >  ------------------------------------------------------------------------------
> >  zswap compressor                         zstd                           zstd
> >  vm.page-cluster                             0                              0
> >  ------------------------------------------------------------------------------
> >  real_sec                               772.53                         870.61
> >  user_sec                            15,780.29                      15,836.71
> >  sys_sec                              5,353.20                       6,185.02
> >  Max_Res_Set_Size_KB                 1,873,348                      1,873,004
> >
> >  ------------------------------------------------------------------------------
> >  memcg_high                                  0                              0
> >  memcg_swap_fail                             0                              0
> >  zswpout                            93,811,916                    111,663,872
> >  zswpin                             27,150,029                     54,730,678
> >  pswpout                                    64                             59
> >  pswpin                                     78                             53
> >  thp_swpout                                  0                              0
> >  thp_swpout_fallback                         0                              0
> >  16kB-mthp_swpout_fallback                   0                              0
> >  32kB-mthp_swpout_fallback                   0                              0
> >  64kB-mthp_swpout_fallback               5,470                              0
> >  pgmajfault                         29,019,256                     16,615,820
> >  swap_ra                                     0                              0
> >  swap_ra_hit                             3,004                          3,614
> >  ZSWPOUT-16kB                        1,324,160                      2,252,747
> >  ZSWPOUT-32kB                          730,534                      1,356,640
> >  ZSWPOUT-64kB                        3,039,760                      3,955,034
> >  ZSWPIN-16kB                                                        1,496,916
> >  ZSWPIN-32kB                                                        1,131,176
> >  ZSWPIN-64kB                                                        1,866,884
> >  SWPOUT-16kB                                 0                              0
> >  SWPOUT-32kB                                 0                              0
> >  SWPOUT-64kB                                 4                              3
> >  ------------------------------------------------------------------------------
> >
> > It does appear like there is considerably higher swapout and swapin
> > activity as a result of swapping in large folios, which does end up
> > impacting performance.
> 
> Thanks for having a look!
> I had only tested with the microbenchmark for time taken to zswapin that I
> included in
> my coverletter.
> In general I expected zswap activity to go up as you are more likely to
> experience
> memory pressure when swapping in large folios, but in return get lower
> pagefaults
> and the advantage of lower TLB pressure in AMD and arm64.
> 
> Thanks for the test, those number look quite extreme! I think there is a lot of
> swap
> thrashing.
> I am assuming you are testing on an intel machine, where you don't get the
> advantage
> of lower TLB misses of large folios, I will try and get an AMD machine which
> has
> TLB coalescing or an ARM server with CONT_PTE to see if the numbers get
> better.

You're right, these numbers were gathered on an Intel Sapphire Rapids server.
Thanks for offering to verify the kernel compilation/allmodconfig behavior
on AMD/ARM systems.

> 
> Maybe it might be better for large folio zswapin to be considered along with
> larger granuality compression to get all the benefits of large folios (and
> hopefully better numbers.) I think that was the approach taken for zram as
> well.

Yes, I agree with you on this as well. We are planning to run experiments
with the by_n patch-series [1], Barry's earlier zsmalloc multi-page
patch-series [2], and your zswapin large folio series. It would be great
to compare notes as we work out the overall workload behavior and the
memory vs. latency trade-offs of larger compression granularity.

[1] https://lore.kernel.org/all/cover.1714581792.git.andre.glover@linux.intel.com/
[2] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/

Thanks again,
Kanchana

> 
> >
> > I would appreciate thoughts on understanding the usefulness of
> > swapping in large folios, with the considerations outlined earlier/other
> > factors.
> >
> > Thanks,
> > Kanchana


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2024-10-20 20:12 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-18  6:47 [RFC PATCH v1 0/7] zswap IAA decompress batching Kanchana P Sridhar
2024-10-18  6:47 ` [RFC PATCH v1 1/7] mm: zswap: Config variable to enable zswap loads with " Kanchana P Sridhar
2024-10-18  6:48 ` [RFC PATCH v1 2/7] mm: swap: Add IAA batch decompression API swap_crypto_acomp_decompress_batch() Kanchana P Sridhar
2024-10-18  6:48 ` [RFC PATCH v1 3/7] pagevec: struct folio_batch changes for decompress batching interface Kanchana P Sridhar
2024-10-18  6:48 ` [RFC PATCH v1 4/7] mm: swap: swap_read_folio() can add a folio to a folio_batch if it is in zswap Kanchana P Sridhar
2024-10-18  6:48 ` [RFC PATCH v1 5/7] mm: swap, zswap: zswap folio_batch processing with IAA decompression batching Kanchana P Sridhar
2024-10-18  6:48 ` [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface Kanchana P Sridhar
2024-10-18  7:26   ` David Hildenbrand
2024-10-18 11:04     ` Usama Arif
2024-10-18 17:21       ` Nhat Pham
2024-10-18 21:59         ` Sridhar, Kanchana P
2024-10-20 16:50           ` Usama Arif
2024-10-20 20:12             ` Sridhar, Kanchana P
2024-10-18 18:09     ` Sridhar, Kanchana P
2024-10-18  6:48 ` [RFC PATCH v1 7/7] mm: For IAA batching, reduce SWAP_BATCH and modify swap slot cache thresholds Kanchana P Sridhar
