* [RFC PATCH v1 0/7] zswap IAA decompress batching
@ 2024-10-18 6:47 Kanchana P Sridhar
2024-10-18 6:47 ` [RFC PATCH v1 1/7] mm: zswap: Config variable to enable zswap loads with " Kanchana P Sridhar
` (6 more replies)
0 siblings, 7 replies; 15+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18 6:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
akpm, hughd, willy, bfoster, dchinner, chrisl, david
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
IAA Decompression Batching:
===========================
This patch-series applies over [1], the IAA compress batching patch-series.
[1] https://patchwork.kernel.org/project/linux-mm/list/?series=900537
This RFC patch-series introduces the use of the Intel Analytics Accelerator
(IAA) for parallel decompression of 4K folios prefetched by
swapin_readahead(). We have developed batched zswap loading of these
prefetched folios, which uses IAA to decompress them in parallel.
swapin_readahead() provides a natural batching interface because it adapts
the readahead window to the usefulness of prior prefetches, and it allows
the page-cluster to be set based on workload characteristics. For
prefetch-friendly workloads, this forms the basis for reading ahead up to
32 folios with zswap load batching, significantly reducing swapin latency,
major page-faults and sys time, thereby improving workload performance.
The patch-series builds upon the IAA compress batching patch-series [1],
and is organized as follows:
1) A centralized batch decompression API that can be used by swap modules.
2) "struct folio_batch" modifications, e.g., PAGEVEC_SIZE is increased to
2^5.
3) Addition of "zswap_batch" and "non_zswap_batch" folio_batches in
swap_read_folio() to serve the purposes of a plug.
4) swap_read_zswap_batch_unplug() API in page_io.c to process a read
batch of entries found in zswap.
5) zswap API to add a swap entry to a load batch, init/reinit the batch,
process the batch using the batch decompression API.
6) Modifications to the swapin_readahead() functions,
swap_vma_readahead() and swap_cluster_readahead() to:
a) Call swap_read_folio() to add prefetch swap entries to "zswap_batch"
and "non_zswap_batch" folio_batches.
b) Process the two readahead folio batches: "non_zswap_batch" folios
will be read sequentially; "zswap_batch" folios will be batch
decompressed with IAA (a simplified sketch of this flow follows
this list).
7) Modifications to do_swap_page() to invoke swapin_readahead() from both
the single-mapped SWP_SYNCHRONOUS_IO and the shared/non-SWP_SYNCHRONOUS_IO
branches. In the former path, we call swapin_readahead() only in the
!zswap_never_enabled() case.
a) This causes folios to be read into the swapcache in both paths. This
design choice was motivated by stability: it handles race conditions
such as process 1 faulting in a single-mapped folio while process 2
simultaneously prefetches it as a "readahead" folio.
b) If the single-mapped folio was successfully read and the race did
not occur, checks are added to free the swapcache entry for the
folio before do_swap_page() returns.
8) Finally, for IAA batching, we reduce SWAP_BATCH to 16 and modify the
swap slots cache thresholds to alleviate lock contention on the
swap_info_struct lock due to reduced swap page-fault latencies.
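For orientation, here is a minimal sketch of the intended readahead batching
flow (illustrative only, not the actual patch code; error handling is
omitted, and alloc_folio_in_swapcache() is a hypothetical placeholder for
the __read_swap_cache_async() based allocation done by the readahead code):

/*
 * Simplified sketch of the readahead batching flow (illustrative only).
 * alloc_folio_in_swapcache() is a hypothetical placeholder for the
 * __read_swap_cache_async() based allocation in swapin_readahead().
 */
static void sketch_readahead_flow(swp_entry_t entries[], int nr,
                                  struct swap_iocb **splug)
{
        struct zswap_decomp_batch zswap_batch;
        struct folio_batch non_zswap_batch;
        int i;

        zswap_load_batch_init(&zswap_batch);
        folio_batch_init(&non_zswap_batch);

        for (i = 0; i < nr; i++) {
                struct folio *folio = alloc_folio_in_swapcache(entries[i]);

                if (!folio)
                        continue;
                /* Queues the folio in one of the two batches. */
                swap_read_folio(folio, splug, &zswap_batch, &non_zswap_batch);
        }

        /* Folios not found in zswap are read sequentially. */
        for (i = 0; i < folio_batch_count(&non_zswap_batch); i++)
                swap_read_folio(non_zswap_batch.folios[i], splug, NULL, NULL);

        /* Folios found in zswap are batch-decompressed with IAA. */
        swap_read_zswap_batch_unplug(&zswap_batch, splug);
}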
IAA decompress batching can be enabled only on platforms that have IAA, by
setting this config variable:
CONFIG_ZSWAP_LOAD_BATCHING_ENABLED="y"
A new swap parameter "singlemapped_ra_enabled" (false by default) is added
for use on platforms that have IAA. If zswap_load_batching_enabled() is
true, this is intended to give the user the option to run experiments with
IAA and with software compressors for zswap.
These are the recommended settings for "singlemapped_ra_enabled", which
takes effect only in the do_swap_page() single-mapped SWP_SYNCHRONOUS_IO
path:
For IAA:
echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
For software compressors:
echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
If "singlemapped_ra_enabled" is set to false, swapin_readahead() will skip
prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO" do_swap_page()
path.
IAA decompress batching performance testing was done using the kernel
compilation test "allmodconfig" run in tmpfs, which demonstrates a
significant amount of readahead activity. vm-scalability usemem is not
ideal for decompress batching because there is very little readahead
activity even with page-cluster of 5 (swap_ra is < 150 with 4k/16k/32k/64k
folios).
The kernel compilation experiments with decompress batching demonstrate
significant latency reductions: up to 4% lower elapsed time and 14% lower
sys time than mm-unstable/zstd. When combined with compress batching, we
see a 5% reduction in elapsed time and a 20% reduction in sys time as
compared to mm-unstable commit 817952b8be34 with zstd.
Our internal validation of IAA compress/decompress batching in highly
contended Sapphire Rapids server setups, with workloads running on 72 cores
for ~25 minutes under stringent memory limit constraints, has shown up to
50% reduction in sys time and 3.5% reduction in workload run time as
compared to software compressors.
System setup for testing:
=========================
Testing of this patch-series was done with mm-unstable as of 10-16-2024,
commit 817952b8be34, without and with this patch-series ("this
patch-series" includes [1]). Data was gathered on an Intel Sapphire Rapids
server, dual-socket 56 cores per socket, 4 IAA devices per socket, 503 GiB
RAM and 525G SSD disk partition swap. Core frequency was fixed at 2500MHz.
The kernel compilation test was run in tmpfs using "allmodconfig", so that
significant swapout and readahead activity can be observed to quantify
decompress batching.
Other kernel configuration parameters:
zswap compressor : deflate-iaa
zswap allocator : zsmalloc
vm.page-cluster : 3,4
IAA "compression verification" is disabled and the async poll acomp
interface is used in the iaa_crypto driver (the defaults with this
series).
Performance testing (Kernel compilation):
=========================================
For workloads that see a lot of swapout activity, we can benefit from
configuring 2 WQs per IAA device, with compress jobs from all same-socket
cores being distributed to the wq.1 of all IAAs on the socket, using the
"global_wq" developed in [1].
Although this data includes IAA decompress batching, I am listing it here
to quantify the benefit of distributing compress jobs among all IAAs. The
kernel compilation test with "allmodconfig" is able to quantify this well:
4K folios: deflate-iaa: kernel compilation
==========================================
------------------------------------------------------------------------------
                mm-unstable-10-16-2024      zswap_load_batch with
                                            IAA decompress batching
------------------------------------------------------------------------------
zswap compressor                  zstd                deflate-iaa
vm.compress-batchsize              n/a                          1
vm.page-cluster                      3                          3
------------------------------------------------------------------------------
real_sec                        783.87                     752.99
user_sec                     15,750.07                  15,746.37
sys_sec                       6,522.32                   5,638.16
Max_Res_Set_Size_KB          1,872,640                  1,872,640
------------------------------------------------------------------------------
zswpout                     82,364,991                105,190,461
zswpin                      21,303,393                 29,684,653
pswpout                             13                          1
pswpin                              12                          1
pgmajfault                  17,114,339                 24,034,146
swap_ra                      4,596,035                  6,219,484
swap_ra_hit                  2,903,249                  3,876,195
------------------------------------------------------------------------------
Progression of kernel compilation latency improvements with
compress/decompress batching:
============================================================
-------------------------------------------------------------------------------
             mm-unstable-10-16-2024      shrink_folio_  zswap_load_batch
                                         list()         w/ IAA decompress
                                         batching       batching
                                         of folios
-------------------------------------------------------------------------------
zswap compr          zstd  deflate-iaa   deflate-iaa   deflate-iaa  deflate-iaa
vm.compress-          n/a          n/a            32             1           32
  batchsize
vm.page-                3            3             3             3            3
  cluster
-------------------------------------------------------------------------------
real_sec           783.87       761.69        747.32        752.99       749.25
user_sec        15,750.07    15,716.69     15,728.39     15,746.37    15,741.71
sys_sec          6,522.32     5,725.28      5,399.44      5,638.16     5,482.12
Max_RSS_KB      1,872,640    1,870,848     1,874,432     1,872,640    1,872,640
zswpout        82,364,991   97,739,600   102,780,612   105,190,461  106,729,372
zswpin         21,303,393   27,684,166    29,016,252    29,684,653   30,717,819
pswpout                13          222           213             1           12
pswpin                 12          209           202             1            8
pgmajfault     17,114,339   22,421,211    23,378,161    24,034,146   24,852,985
swap_ra         4,596,035    5,840,082     6,231,646     6,219,484    6,504,878
swap_ra_hit     2,903,249    3,682,444     3,940,420     3,876,195    4,092,852
-------------------------------------------------------------------------------
The last 2 columns of the latency reduction progression are as follows:
IAA decompress batching combined with distributing compress jobs to all
same-socket IAA devices:
=======================================================================
------------------------------------------------------------------------------
                      IAA shrink_folio_list() compress batching and
                          swapin_readahead() decompress batching

                      1WQ                   2WQ (distribute compress jobs)
                      1 local WQ (wq.0)     1 local WQ (wq.0) +
                      per IAA               1 global WQ (wq.1) per IAA
------------------------------------------------------------------------------
zswap compressor             deflate-iaa              deflate-iaa
vm.compress-batchsize                 32                       32
vm.page-cluster                        4                        4
------------------------------------------------------------------------------
real_sec                          746.77                   745.42
user_sec                       15,732.66                15,738.85
sys_sec                         5,384.14                 5,247.86
Max_Res_Set_Size_KB            1,874,432                1,872,640
------------------------------------------------------------------------------
zswpout                      101,648,460              104,882,982
zswpin                        27,418,319               29,428,515
pswpout                              213                       22
pswpin                               207                        6
pgmajfault                    21,896,616               23,629,768
swap_ra                        6,054,409                6,385,080
swap_ra_hit                    3,791,628                3,985,141
------------------------------------------------------------------------------
I would greatly appreciate code review comments for this RFC series!
[1] https://patchwork.kernel.org/project/linux-mm/list/?series=900537
Thanks,
Kanchana
Kanchana P Sridhar (7):
mm: zswap: Config variable to enable zswap loads with decompress
batching.
mm: swap: Add IAA batch decompression API
swap_crypto_acomp_decompress_batch().
pagevec: struct folio_batch changes for decompress batching interface.
mm: swap: swap_read_folio() can add a folio to a folio_batch if it is
in zswap.
mm: swap, zswap: zswap folio_batch processing with IAA decompression
batching.
mm: do_swap_page() calls swapin_readahead() zswap load batching
interface.
mm: For IAA batching, reduce SWAP_BATCH and modify swap slot cache
thresholds.
include/linux/pagevec.h | 13 +-
include/linux/swap.h | 7 +
include/linux/swap_slots.h | 7 +
include/linux/zswap.h | 65 +++++++++
mm/Kconfig | 13 ++
mm/memory.c | 187 +++++++++++++++++++------
mm/page_io.c | 61 ++++++++-
mm/shmem.c | 2 +-
mm/swap.h | 102 ++++++++++++--
mm/swap_state.c | 272 ++++++++++++++++++++++++++++++++++---
mm/swapfile.c | 2 +-
mm/zswap.c | 272 +++++++++++++++++++++++++++++++++++++
12 files changed, 927 insertions(+), 76 deletions(-)
--
2.27.0
* [RFC PATCH v1 1/7] mm: zswap: Config variable to enable zswap loads with decompress batching.
2024-10-18 6:47 [RFC PATCH v1 0/7] zswap IAA decompress batching Kanchana P Sridhar
@ 2024-10-18 6:47 ` Kanchana P Sridhar
2024-10-18 6:48 ` [RFC PATCH v1 2/7] mm: swap: Add IAA batch decompression API swap_crypto_acomp_decompress_batch() Kanchana P Sridhar
` (5 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18 6:47 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
akpm, hughd, willy, bfoster, dchinner, chrisl, david
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
Add a new zswap config variable that controls whether zswap load will
decompress a batch of 4K folios, for instance, the folios prefetched
during swapin_readahead():
CONFIG_ZSWAP_LOAD_BATCHING_ENABLED
The existing CONFIG_CRYPTO_DEV_IAA_CRYPTO variable added in commit
ea7a5cbb4369 ("crypto: iaa - Add Intel IAA Compression Accelerator crypto
driver core") is used to detect if the system has the Intel Analytics
Accelerator (IAA), and the iaa_crypto module is available. If so, the
kernel build will prompt for CONFIG_ZSWAP_LOAD_BATCHING_ENABLED. Hence,
users have the ability to set CONFIG_ZSWAP_LOAD_BATCHING_ENABLED="y" only
on systems that have Intel IAA.
If CONFIG_ZSWAP_LOAD_BATCHING_ENABLED is enabled, and IAA is configured
as the zswap compressor, the vm.page-cluster is used to prefetch up to
32 4K folios using swapin_readahead(). The readahead folios present in
zswap are then loaded as a batch using IAA decompression batching.
The patch also implements a zswap API that returns the status of this
config variable.
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
include/linux/zswap.h | 8 ++++++++
mm/Kconfig | 13 +++++++++++++
mm/zswap.c | 12 ++++++++++++
3 files changed, 33 insertions(+)
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 328a1e09d502..294d13efbfb1 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -118,6 +118,9 @@ static inline void zswap_store_batch(struct swap_in_memory_cache_cb *simc)
else
__zswap_store_batch_single(simc);
}
+
+bool zswap_load_batching_enabled(void);
+
unsigned long zswap_total_pages(void);
bool zswap_store(struct folio *folio);
bool zswap_load(struct folio *folio);
@@ -145,6 +148,11 @@ static inline void zswap_store_batch(struct swap_in_memory_cache_cb *simc)
{
}
+static inline bool zswap_load_batching_enabled(void)
+{
+ return false;
+}
+
static inline bool zswap_store(struct folio *folio)
{
return false;
diff --git a/mm/Kconfig b/mm/Kconfig
index 26d1a5cee471..98e46a3cf0e3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -137,6 +137,19 @@ config ZSWAP_STORE_BATCHING_ENABLED
in the folio in hardware, thereby improving large folio compression
throughput and reducing swapout latency.
+config ZSWAP_LOAD_BATCHING_ENABLED
+ bool "Batching of zswap loads of 4K folios with Intel IAA"
+ depends on ZSWAP && CRYPTO_DEV_IAA_CRYPTO
+ default n
+ help
+ Enables zswap_load to swap in multiple 4K folios in batches of 8,
+ rather than a folio at a time, if the system has Intel IAA for hardware
+ acceleration of decompressions. swapin_readahead will be used to
+ prefetch a batch of folios to be swapped in along with the faulting
+ folio. If IAA is the zswap compressor, this will parallelize batch
+ decompression of up to 8 folios in hardware, thereby reducing swapin
+ and do_swap_page latency.
+
choice
prompt "Default allocator"
depends on ZSWAP
diff --git a/mm/zswap.c b/mm/zswap.c
index 68ce498ad000..fe7bc2a6672e 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -136,6 +136,13 @@ module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
static bool __zswap_store_batching_enabled = IS_ENABLED(
CONFIG_ZSWAP_STORE_BATCHING_ENABLED);
+/*
+ * Enable/disable batching of decompressions of multiple 4K folios, if
+ * the system has Intel IAA.
+ */
+static bool __zswap_load_batching_enabled = IS_ENABLED(
+ CONFIG_ZSWAP_LOAD_BATCHING_ENABLED);
+
bool zswap_is_enabled(void)
{
return zswap_enabled;
@@ -246,6 +253,11 @@ __always_inline bool zswap_store_batching_enabled(void)
return __zswap_store_batching_enabled;
}
+__always_inline bool zswap_load_batching_enabled(void)
+{
+ return __zswap_load_batching_enabled;
+}
+
static void __zswap_store_batch_core(
int node_id,
struct folio **folios,
--
2.27.0
* [RFC PATCH v1 2/7] mm: swap: Add IAA batch decompression API swap_crypto_acomp_decompress_batch().
2024-10-18 6:47 [RFC PATCH v1 0/7] zswap IAA decompress batching Kanchana P Sridhar
2024-10-18 6:47 ` [RFC PATCH v1 1/7] mm: zswap: Config variable to enable zswap loads with " Kanchana P Sridhar
@ 2024-10-18 6:48 ` Kanchana P Sridhar
2024-10-18 6:48 ` [RFC PATCH v1 3/7] pagevec: struct folio_batch changes for decompress batching interface Kanchana P Sridhar
` (4 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18 6:48 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
akpm, hughd, willy, bfoster, dchinner, chrisl, david
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
Add a new API, swap_crypto_acomp_decompress_batch(), that does batch
decompression. A system that has Intel IAA can use this API to submit a
batch of decompress jobs for parallel decompression in hardware, to improve
performance. On a system without IAA, this API will process each decompress
job sequentially.
The purpose of this API is to be invocable from any swap module that needs
to decompress multiple 4K folios, or a batch of pages in the general case.
For instance, zswap would decompress up to (1UL << SWAP_RA_ORDER_CEILING)
folios prefetched by swapin_readahead() in sub-batches of
SWAP_CRYPTO_SUB_BATCH_SIZE (i.e., 8 if the system has IAA) pages, which
would improve readahead performance.
Towards this eventual goal, the swap_crypto_acomp_decompress_batch()
interface is implemented in swap_state.c and exported via mm/swap.h. It
would be preferable for swap_crypto_acomp_decompress_batch() to be exported
via include/linux/swap.h so that modules outside mm (e.g., zram) can
potentially use the API for batch decompressions with IAA, since the
swapin_readahead() batching interface is common to all swap modules.
I would appreciate RFC comments on this.
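To make the calling convention concrete, here is a minimal usage sketch
(illustrative only, not part of this patch; the caller is assumed to have
filled srcs[]/slens[] from its own compressed pool, and to follow the
acomp_ctx mutex rule documented in the kerneldoc below):

/*
 * Usage sketch only (not part of this patch): decompress "nr" compressed
 * buffers into "pages". srcs[] and slens[] are assumed to have been
 * populated by the caller, e.g. by copying out of the zpool, and
 * acomp_ctx is the per-CPU crypto_acomp_ctx of the compressor in use.
 */
static void example_decompress_batch(u8 *srcs[], unsigned int slens[],
                                     struct page *pages[], int nr,
                                     struct crypto_acomp_ctx *acomp_ctx)
{
        int errors[SWAP_CRYPTO_SUB_BATCH_SIZE];
        int i;

        /* The API requires the caller to hold the acomp_ctx mutex. */
        mutex_lock(&acomp_ctx->mutex);
        swap_crypto_acomp_decompress_batch(srcs, pages, slens, errors,
                                           nr, acomp_ctx);
        mutex_unlock(&acomp_ctx->mutex);

        for (i = 0; i < nr; i++)
                if (errors[i])
                        pr_err("batch decompress error on page %d: %d\n",
                               i, errors[i]);
}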
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
mm/swap.h | 42 +++++++++++++++++--
mm/swap_state.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 147 insertions(+), 4 deletions(-)
diff --git a/mm/swap.h b/mm/swap.h
index 08c04954304f..0bb386b5fdee 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -10,11 +10,12 @@ struct mempolicy;
#include <linux/crypto.h>
/*
- * For IAA compression batching:
- * Maximum number of IAA acomp compress requests that will be processed
- * in a sub-batch.
+ * For IAA compression/decompression batching:
+ * Maximum number of IAA acomp compress/decompress requests that will be
+ * processed in a sub-batch.
*/
-#if defined(CONFIG_ZSWAP_STORE_BATCHING_ENABLED)
+#if defined(CONFIG_ZSWAP_STORE_BATCHING_ENABLED) || \
+ defined(CONFIG_ZSWAP_LOAD_BATCHING_ENABLED)
#define SWAP_CRYPTO_SUB_BATCH_SIZE 8UL
#else
#define SWAP_CRYPTO_SUB_BATCH_SIZE 1UL
@@ -60,6 +61,29 @@ void swap_crypto_acomp_compress_batch(
int nr_pages,
struct crypto_acomp_ctx *acomp_ctx);
+/**
+ * This API provides IAA decompress batching functionality for use by swap
+ * modules.
+ * The acomp_ctx mutex should be locked/unlocked before/after calling this
+ * procedure.
+ *
+ * @srcs: The src buffers to be decompressed.
+ * @pages: The pages to store the buffers decompressed by IAA.
+ * @slens: src buffers' compressed lengths.
+ * @errors: Will contain a 0 if the page was successfully decompressed, or a
+ * non-0 error value to be processed by the calling function.
+ * @nr_pages: The number of pages, up to SWAP_CRYPTO_SUB_BATCH_SIZE,
+ * to be decompressed.
+ * @acomp_ctx: The acomp context for iaa_crypto/other compressor.
+ */
+void swap_crypto_acomp_decompress_batch(
+ u8 *srcs[],
+ struct page *pages[],
+ unsigned int slens[],
+ int errors[],
+ int nr_pages,
+ struct crypto_acomp_ctx *acomp_ctx);
+
/* linux/mm/vmscan.c, linux/mm/page_io.c, linux/mm/zswap.c */
/* For batching of compressions in reclaim path. */
struct swap_in_memory_cache_cb {
@@ -204,6 +228,16 @@ static inline void swap_write_in_memory_cache_unplug(
{
}
+static inline void swap_crypto_acomp_decompress_batch(
+ u8 *srcs[],
+ struct page *pages[],
+ unsigned int slens[],
+ int errors[],
+ int nr_pages,
+ struct crypto_acomp_ctx *acomp_ctx)
+{
+}
+
static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
{
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 117c3caa5679..3cebbff40804 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -855,6 +855,115 @@ void swap_crypto_acomp_compress_batch(
}
EXPORT_SYMBOL_GPL(swap_crypto_acomp_compress_batch);
+/**
+ * This API provides IAA decompress batching functionality for use by swap
+ * modules.
+ * The acomp_ctx mutex should be locked/unlocked before/after calling this
+ * procedure.
+ *
+ * @srcs: The src buffers to be decompressed.
+ * @pages: The pages to store the buffers decompressed by IAA.
+ * @slens: src buffers' compressed lengths.
+ * @errors: Will contain a 0 if the page was successfully decompressed, or a
+ * non-0 error value to be processed by the calling function.
+ * @nr_pages: The number of pages, up to SWAP_CRYPTO_SUB_BATCH_SIZE,
+ * to be decompressed.
+ * @acomp_ctx: The acomp context for iaa_crypto/other compressor.
+ */
+void swap_crypto_acomp_decompress_batch(
+ u8 *srcs[],
+ struct page *pages[],
+ unsigned int slens[],
+ int errors[],
+ int nr_pages,
+ struct crypto_acomp_ctx *acomp_ctx)
+{
+ struct scatterlist inputs[SWAP_CRYPTO_SUB_BATCH_SIZE];
+ struct scatterlist outputs[SWAP_CRYPTO_SUB_BATCH_SIZE];
+ unsigned int dlens[SWAP_CRYPTO_SUB_BATCH_SIZE];
+ bool decompressions_done = false;
+ int i, j;
+
+ BUG_ON(nr_pages > SWAP_CRYPTO_SUB_BATCH_SIZE);
+
+ /*
+ * Prepare and submit acomp_reqs to IAA.
+ * IAA will process these decompress jobs in parallel in async mode.
+ * If the compressor does not support a poll() method, or if IAA is
+ * used in sync mode, the jobs will be processed sequentially using
+ * acomp_ctx->req[0] and acomp_ctx->wait.
+ */
+ for (i = 0; i < nr_pages; ++i) {
+ j = acomp_ctx->acomp->poll ? i : 0;
+
+ dlens[i] = PAGE_SIZE;
+ sg_init_one(&inputs[i], srcs[i], slens[i]);
+ sg_init_table(&outputs[i], 1);
+ sg_set_page(&outputs[i], pages[i], PAGE_SIZE, 0);
+ acomp_request_set_params(acomp_ctx->req[j], &inputs[i],
+ &outputs[i], slens[i], dlens[i]);
+ /*
+ * If the crypto_acomp provides an asynchronous poll()
+ * interface, submit the request to the driver now, and poll for
+ * a completion status later, after all descriptors have been
+ * submitted. If the crypto_acomp does not provide a poll()
+ * interface, submit the request and wait for it to complete,
+ * i.e., synchronously, before moving on to the next request.
+ */
+ if (acomp_ctx->acomp->poll) {
+ errors[i] = crypto_acomp_decompress(acomp_ctx->req[j]);
+
+ if (errors[i] != -EINPROGRESS)
+ errors[i] = -EINVAL;
+ else
+ errors[i] = -EAGAIN;
+ } else {
+ errors[i] = crypto_wait_req(
+ crypto_acomp_decompress(acomp_ctx->req[j]),
+ &acomp_ctx->wait);
+ if (!errors[i]) {
+ dlens[i] = acomp_ctx->req[j]->dlen;
+ BUG_ON(dlens[i] != PAGE_SIZE);
+ }
+ }
+ }
+
+ /*
+ * If not doing async decompressions, the batch has been processed at
+ * this point and we can return.
+ */
+ if (!acomp_ctx->acomp->poll)
+ return;
+
+ /*
+ * Poll for and process IAA decompress job completions
+ * in out-of-order manner.
+ */
+ while (!decompressions_done) {
+ decompressions_done = true;
+
+ for (i = 0; i < nr_pages; ++i) {
+ /*
+ * Skip, if the decompression has already completed
+ * successfully or with an error.
+ */
+ if (errors[i] != -EAGAIN)
+ continue;
+
+ errors[i] = crypto_acomp_poll(acomp_ctx->req[i]);
+
+ if (errors[i]) {
+ if (errors[i] == -EAGAIN)
+ decompressions_done = false;
+ } else {
+ dlens[i] = acomp_ctx->req[i]->dlen;
+ BUG_ON(dlens[i] != PAGE_SIZE);
+ }
+ }
+ }
+}
+EXPORT_SYMBOL_GPL(swap_crypto_acomp_decompress_batch);
+
#endif /* CONFIG_SWAP */
static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start,
--
2.27.0
* [RFC PATCH v1 3/7] pagevec: struct folio_batch changes for decompress batching interface.
2024-10-18 6:47 [RFC PATCH v1 0/7] zswap IAA decompress batching Kanchana P Sridhar
2024-10-18 6:47 ` [RFC PATCH v1 1/7] mm: zswap: Config variable to enable zswap loads with " Kanchana P Sridhar
2024-10-18 6:48 ` [RFC PATCH v1 2/7] mm: swap: Add IAA batch decompression API swap_crypto_acomp_decompress_batch() Kanchana P Sridhar
@ 2024-10-18 6:48 ` Kanchana P Sridhar
2024-10-18 6:48 ` [RFC PATCH v1 4/7] mm: swap: swap_read_folio() can add a folio to a folio_batch if it is in zswap Kanchana P Sridhar
` (3 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18 6:48 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
akpm, hughd, willy, bfoster, dchinner, chrisl, david
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
Made these changes to "struct folio_batch" for use in the
swapin_readahead() based zswap load batching interface for parallel
decompressions with IAA:
1) Moved SWAP_RA_ORDER_CEILING definition to pagevec.h.
2) Increased PAGEVEC_SIZE to (1UL << SWAP_RA_ORDER_CEILING),
because vm.page-cluster=5 requires capacity for 32 folios.
3) Made folio_batch_add() more fail-safe.
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
include/linux/pagevec.h | 13 ++++++++++---
mm/swap_state.c | 2 --
2 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 5d3a0cccc6bf..c9bab240fb6e 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -11,8 +11,14 @@
#include <linux/types.h>
-/* 31 pointers + header align the folio_batch structure to a power of two */
-#define PAGEVEC_SIZE 31
+/*
+ * For page-cluster of 5, I noticed that space for 31 pointers was
+ * insufficient. Increasing this to meet the requirements for folio_batch
+ * usage in the swap read decompress batching interface that is based on
+ * swapin_readahead().
+ */
+#define SWAP_RA_ORDER_CEILING 5
+#define PAGEVEC_SIZE (1UL << SWAP_RA_ORDER_CEILING)
struct folio;
@@ -74,7 +80,8 @@ static inline unsigned int folio_batch_space(struct folio_batch *fbatch)
static inline unsigned folio_batch_add(struct folio_batch *fbatch,
struct folio *folio)
{
- fbatch->folios[fbatch->nr++] = folio;
+ if (folio_batch_space(fbatch) > 0)
+ fbatch->folios[fbatch->nr++] = folio;
return folio_batch_space(fbatch);
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3cebbff40804..0673593d363c 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -44,8 +44,6 @@ struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly;
static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly;
static bool enable_vma_readahead __read_mostly = true;
-#define SWAP_RA_ORDER_CEILING 5
-
#define SWAP_RA_WIN_SHIFT (PAGE_SHIFT / 2)
#define SWAP_RA_HITS_MASK ((1UL << SWAP_RA_WIN_SHIFT) - 1)
#define SWAP_RA_HITS_MAX SWAP_RA_HITS_MASK
--
2.27.0
* [RFC PATCH v1 4/7] mm: swap: swap_read_folio() can add a folio to a folio_batch if it is in zswap.
2024-10-18 6:47 [RFC PATCH v1 0/7] zswap IAA decompress batching Kanchana P Sridhar
` (2 preceding siblings ...)
2024-10-18 6:48 ` [RFC PATCH v1 3/7] pagevec: struct folio_batch changes for decompress batching interface Kanchana P Sridhar
@ 2024-10-18 6:48 ` Kanchana P Sridhar
2024-10-18 6:48 ` [RFC PATCH v1 5/7] mm: swap, zswap: zswap folio_batch processing with IAA decompression batching Kanchana P Sridhar
` (2 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18 6:48 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
akpm, hughd, willy, bfoster, dchinner, chrisl, david
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch modifies swap_read_folio() to check if the swap entry is present
in zswap; if so, the folio is added to a "zswap_batch" folio_batch, provided
the caller (e.g. swapin_readahead()) has passed in a valid "zswap_batch".
If the swap entry is found in zswap, it is added at the next available
index in a sub-batch. This sub-batch is part of "struct zswap_decomp_batch",
which progressively constructs arrays of SWAP_CRYPTO_SUB_BATCH_SIZE zswap
entries/xarrays/pages/source-lengths, ready for batch decompression by IAA.
The function that does this, zswap_add_load_batch(), returns true to
swap_read_folio(). If the entry is not found in zswap, it returns false.
If the swap entry was not found in zswap, and if
zswap_load_batching_enabled() and a valid "non_zswap_batch" folio_batch is
passed to swap_read_folio(), the folio will be added to the
"non_zswap_batch" batch.
Finally, the code falls through to the usual/existing swap_read_folio()
flow.
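For clarity, the sub-batch indexing described above works roughly as follows;
this is a condensed sketch of what __zswap_add_load_batch() (in the mm/zswap.c
hunk below) does once an entry has been found in the zswap xarray:

/*
 * Condensed sketch of the sub-batch bookkeeping in __zswap_add_load_batch()
 * (see the mm/zswap.c hunk below for the real code): the Nth folio added to
 * the zswap_decomp_batch lands in sub-batch N / SWAP_CRYPTO_SUB_BATCH_SIZE,
 * at the next free slot tracked by nr_decomp[].
 */
static void sketch_record_decomp_slot(struct zswap_decomp_batch *zd_batch,
                                      struct xarray *tree,
                                      struct zswap_entry *entry,
                                      struct folio *folio)
{
        unsigned int batch_idx = folio_batch_count(&zd_batch->fbatch) - 1;
        unsigned int sb = batch_idx / SWAP_CRYPTO_SUB_BATCH_SIZE;
        u8 slot = zd_batch->nr_decomp[sb];

        zd_batch->trees[sb][slot] = tree;
        zd_batch->entries[sb][slot] = entry;
        zd_batch->pages[sb][slot] = &folio->page;
        zd_batch->slens[sb][slot] = entry->length;
        zd_batch->nr_decomp[sb]++;
}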
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
include/linux/zswap.h | 35 +++++++++++++++++
mm/memory.c | 2 +-
mm/page_io.c | 26 ++++++++++++-
mm/swap.h | 31 ++++++++++++++-
mm/swap_state.c | 10 ++---
mm/zswap.c | 89 +++++++++++++++++++++++++++++++++++++++++++
6 files changed, 183 insertions(+), 10 deletions(-)
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 294d13efbfb1..1d6de281f243 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -12,6 +12,8 @@ extern atomic_long_t zswap_stored_pages;
#ifdef CONFIG_ZSWAP
struct swap_in_memory_cache_cb;
+struct zswap_decomp_batch;
+struct zswap_entry;
struct zswap_lruvec_state {
/*
@@ -120,6 +122,19 @@ static inline void zswap_store_batch(struct swap_in_memory_cache_cb *simc)
}
bool zswap_load_batching_enabled(void);
+void zswap_load_batch_init(struct zswap_decomp_batch *zd_batch);
+void zswap_load_batch_reinit(struct zswap_decomp_batch *zd_batch);
+bool __zswap_add_load_batch(struct zswap_decomp_batch *zd_batch,
+ struct folio *folio);
+static inline bool zswap_add_load_batch(
+ struct zswap_decomp_batch *zd_batch,
+ struct folio *folio)
+{
+ if (zswap_load_batching_enabled())
+ return __zswap_add_load_batch(zd_batch, folio);
+
+ return false;
+}
unsigned long zswap_total_pages(void);
bool zswap_store(struct folio *folio);
@@ -138,6 +153,8 @@ struct zswap_lruvec_state {};
struct zswap_store_sub_batch_page {};
struct zswap_store_pipeline_state {};
struct swap_in_memory_cache_cb;
+struct zswap_decomp_batch;
+struct zswap_entry;
static inline bool zswap_store_batching_enabled(void)
{
@@ -153,6 +170,24 @@ static inline bool zswap_load_batching_enabled(void)
return false;
}
+static inline void zswap_load_batch_init(
+ struct zswap_decomp_batch *zd_batch)
+{
+}
+
+static inline void zswap_load_batch_reinit(
+ struct zswap_decomp_batch *zd_batch)
+{
+}
+
+static inline bool zswap_add_load_batch(
+ struct folio *folio,
+ struct zswap_entry *entry,
+ struct zswap_decomp_batch *zd_batch)
+{
+ return false;
+}
+
static inline bool zswap_store(struct folio *folio)
{
return false;
diff --git a/mm/memory.c b/mm/memory.c
index 0f614523b9f4..b5745b9ffdf7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4322,7 +4322,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
/* To provide entry to swap_read_folio() */
folio->swap = entry;
- swap_read_folio(folio, NULL);
+ swap_read_folio(folio, NULL, NULL, NULL);
folio->private = NULL;
}
} else {
diff --git a/mm/page_io.c b/mm/page_io.c
index 065db25309b8..9750302d193b 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -744,11 +744,17 @@ static void swap_read_folio_bdev_async(struct folio *folio,
submit_bio(bio);
}
-void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
+/*
+ * Returns true if the folio was read, and false if the folio was added to
+ * the zswap_decomp_batch for batched decompression.
+ */
+bool swap_read_folio(struct folio *folio, struct swap_iocb **plug,
+ struct zswap_decomp_batch *zswap_batch,
+ struct folio_batch *non_zswap_batch)
{
struct swap_info_struct *sis = swp_swap_info(folio->swap);
bool synchronous = sis->flags & SWP_SYNCHRONOUS_IO;
- bool workingset = folio_test_workingset(folio);
+ bool workingset;
unsigned long pflags;
bool in_thrashing;
@@ -756,11 +762,26 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(folio_test_uptodate(folio), folio);
+ /*
+ * If entry is found in zswap xarray, and zswap load batching
+ * is enabled, this is a candidate for zswap batch decompression.
+ */
+ if (zswap_batch && zswap_add_load_batch(zswap_batch, folio))
+ return false;
+
+ if (zswap_load_batching_enabled() && non_zswap_batch) {
+ BUG_ON(!folio_batch_space(non_zswap_batch));
+ folio_batch_add(non_zswap_batch, folio);
+ return false;
+ }
+
/*
* Count submission time as memory stall and delay. When the device
* is congested, or the submitting cgroup IO-throttled, submission
* can be a significant part of overall IO time.
*/
+ workingset = folio_test_workingset(folio);
+
if (workingset) {
delayacct_thrashing_start(&in_thrashing);
psi_memstall_enter(&pflags);
@@ -792,6 +813,7 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
psi_memstall_leave(&pflags);
}
delayacct_swapin_end();
+ return true;
}
void __swap_read_unplug(struct swap_iocb *sio)
diff --git a/mm/swap.h b/mm/swap.h
index 0bb386b5fdee..310f99007fe6 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -84,6 +84,27 @@ void swap_crypto_acomp_decompress_batch(
int nr_pages,
struct crypto_acomp_ctx *acomp_ctx);
+#if defined(CONFIG_ZSWAP_LOAD_BATCHING_ENABLED)
+#define MAX_NR_ZSWAP_LOAD_SUB_BATCHES DIV_ROUND_UP(PAGEVEC_SIZE, \
+ SWAP_CRYPTO_SUB_BATCH_SIZE)
+#else
+#define MAX_NR_ZSWAP_LOAD_SUB_BATCHES 1UL
+#endif /* CONFIG_ZSWAP_LOAD_BATCHING_ENABLED */
+
+/*
+ * Note: If PAGEVEC_SIZE or SWAP_CRYPTO_SUB_BATCH_SIZE
+ * exceeds 256, change the u8 to u16.
+ */
+struct zswap_decomp_batch {
+ struct folio_batch fbatch;
+ bool swapcache[PAGEVEC_SIZE];
+ struct xarray *trees[MAX_NR_ZSWAP_LOAD_SUB_BATCHES][SWAP_CRYPTO_SUB_BATCH_SIZE];
+ struct zswap_entry *entries[MAX_NR_ZSWAP_LOAD_SUB_BATCHES][SWAP_CRYPTO_SUB_BATCH_SIZE];
+ struct page *pages[MAX_NR_ZSWAP_LOAD_SUB_BATCHES][SWAP_CRYPTO_SUB_BATCH_SIZE];
+ unsigned int slens[MAX_NR_ZSWAP_LOAD_SUB_BATCHES][SWAP_CRYPTO_SUB_BATCH_SIZE];
+ u8 nr_decomp[MAX_NR_ZSWAP_LOAD_SUB_BATCHES];
+};
+
/* linux/mm/vmscan.c, linux/mm/page_io.c, linux/mm/zswap.c */
/* For batching of compressions in reclaim path. */
struct swap_in_memory_cache_cb {
@@ -101,7 +122,9 @@ struct swap_in_memory_cache_cb {
/* linux/mm/page_io.c */
int sio_pool_init(void);
struct swap_iocb;
-void swap_read_folio(struct folio *folio, struct swap_iocb **plug);
+bool swap_read_folio(struct folio *folio, struct swap_iocb **plug,
+ struct zswap_decomp_batch *zswap_batch,
+ struct folio_batch *non_zswap_batch);
void __swap_read_unplug(struct swap_iocb *plug);
static inline void swap_read_unplug(struct swap_iocb *plug)
{
@@ -238,8 +261,12 @@ static inline void swap_crypto_acomp_decompress_batch(
{
}
-static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
+struct zswap_decomp_batch {};
+static inline bool swap_read_folio(struct folio *folio, struct swap_iocb **plug,
+ struct zswap_decomp_batch *zswap_batch,
+ struct folio_batch *non_zswap_batch)
{
+ return false;
}
static inline void swap_write_unplug(struct swap_iocb *sio)
{
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 0673593d363c..0aa938e4c34d 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -570,7 +570,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
mpol_cond_put(mpol);
if (page_allocated)
- swap_read_folio(folio, plug);
+ swap_read_folio(folio, plug, NULL, NULL);
return folio;
}
@@ -687,7 +687,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
if (!folio)
continue;
if (page_allocated) {
- swap_read_folio(folio, &splug);
+ swap_read_folio(folio, &splug, NULL, NULL);
if (offset != entry_offset) {
folio_set_readahead(folio);
count_vm_event(SWAP_RA);
@@ -703,7 +703,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
&page_allocated, false);
if (unlikely(page_allocated))
- swap_read_folio(folio, NULL);
+ swap_read_folio(folio, NULL, NULL, NULL);
return folio;
}
@@ -1057,7 +1057,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
if (!folio)
continue;
if (page_allocated) {
- swap_read_folio(folio, &splug);
+ swap_read_folio(folio, &splug, NULL, NULL);
if (addr != vmf->address) {
folio_set_readahead(folio);
count_vm_event(SWAP_RA);
@@ -1075,7 +1075,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
folio = __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx,
&page_allocated, false);
if (unlikely(page_allocated))
- swap_read_folio(folio, NULL);
+ swap_read_folio(folio, NULL, NULL, NULL);
return folio;
}
diff --git a/mm/zswap.c b/mm/zswap.c
index fe7bc2a6672e..1d293f95d525 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -2312,6 +2312,95 @@ bool zswap_load(struct folio *folio)
return true;
}
+/* Code for zswap load batch with batch decompress. */
+
+__always_inline void zswap_load_batch_init(struct zswap_decomp_batch *zd_batch)
+{
+ unsigned int sb;
+
+ folio_batch_init(&zd_batch->fbatch);
+
+ for (sb = 0; sb < MAX_NR_ZSWAP_LOAD_SUB_BATCHES; ++sb)
+ zd_batch->nr_decomp[sb] = 0;
+}
+
+__always_inline void zswap_load_batch_reinit(struct zswap_decomp_batch *zd_batch)
+{
+ unsigned int sb;
+
+ folio_batch_reinit(&zd_batch->fbatch);
+
+ for (sb = 0; sb < MAX_NR_ZSWAP_LOAD_SUB_BATCHES; ++sb)
+ zd_batch->nr_decomp[sb] = 0;
+}
+
+/*
+ * All folios in zd_batch are allocated into the swapcache
+ * in swapin_readahead(), before being added to the zd_batch
+ * for batch decompression.
+ */
+bool __zswap_add_load_batch(struct zswap_decomp_batch *zd_batch,
+ struct folio *folio)
+{
+ swp_entry_t swp = folio->swap;
+ pgoff_t offset = swp_offset(swp);
+ bool swapcache = folio_test_swapcache(folio);
+ struct xarray *tree = swap_zswap_tree(swp);
+ struct zswap_entry *entry;
+ unsigned int batch_idx, sb;
+
+ VM_WARN_ON_ONCE(!folio_test_locked(folio));
+
+ if (zswap_never_enabled())
+ return false;
+
+ /*
+ * Large folios should not be swapped in while zswap is being used, as
+ * they are not properly handled. Zswap does not properly load large
+ * folios, and a large folio may only be partially in zswap.
+ *
+ * Returning false here will cause the large folio to be added to
+ * the "non_zswap_batch" in swap_read_folio(), which will eventually
+ * call zswap_load() if the folio is not in the zeromap. Finally,
+ * zswap_load() will return true without marking the folio uptodate
+ * so that an IO error is emitted (e.g. do_swap_page() will sigbus).
+ */
+ if (WARN_ON_ONCE(folio_test_large(folio)))
+ return false;
+
+ /*
+ * When reading into the swapcache, invalidate our entry. The
+ * swapcache can be the authoritative owner of the page and
+ * its mappings, and the pressure that results from having two
+ * in-memory copies outweighs any benefits of caching the
+ * compression work.
+ */
+ if (swapcache)
+ entry = xa_erase(tree, offset);
+ else
+ entry = xa_load(tree, offset);
+
+ if (!entry)
+ return false;
+
+ BUG_ON(!folio_batch_space(&zd_batch->fbatch));
+ folio_batch_add(&zd_batch->fbatch, folio);
+
+ batch_idx = folio_batch_count(&zd_batch->fbatch) - 1;
+ zd_batch->swapcache[batch_idx] = swapcache;
+ sb = batch_idx / SWAP_CRYPTO_SUB_BATCH_SIZE;
+
+ if (entry->length) {
+ zd_batch->trees[sb][zd_batch->nr_decomp[sb]] = tree;
+ zd_batch->entries[sb][zd_batch->nr_decomp[sb]] = entry;
+ zd_batch->pages[sb][zd_batch->nr_decomp[sb]] = &folio->page;
+ zd_batch->slens[sb][zd_batch->nr_decomp[sb]] = entry->length;
+ zd_batch->nr_decomp[sb]++;
+ }
+
+ return true;
+}
+
void zswap_invalidate(swp_entry_t swp)
{
pgoff_t offset = swp_offset(swp);
--
2.27.0
* [RFC PATCH v1 5/7] mm: swap, zswap: zswap folio_batch processing with IAA decompression batching.
2024-10-18 6:47 [RFC PATCH v1 0/7] zswap IAA decompress batching Kanchana P Sridhar
` (3 preceding siblings ...)
2024-10-18 6:48 ` [RFC PATCH v1 4/7] mm: swap: swap_read_folio() can add a folio to a folio_batch if it is in zswap Kanchana P Sridhar
@ 2024-10-18 6:48 ` Kanchana P Sridhar
2024-10-18 6:48 ` [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface Kanchana P Sridhar
2024-10-18 6:48 ` [RFC PATCH v1 7/7] mm: For IAA batching, reduce SWAP_BATCH and modify swap slot cache thresholds Kanchana P Sridhar
6 siblings, 0 replies; 15+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18 6:48 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
akpm, hughd, willy, bfoster, dchinner, chrisl, david
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch provides the functionality that processes a "zswap_batch" in
which swap_read_folio() had previously stored swap entries found in zswap,
for batched loading.
The newly added zswap_finish_load_batch() API implements the main zswap
load batching functionality. This makes use of the sub-batches of
zswap_entry/xarray/page/source-length readily available from
zswap_add_load_batch(). These sub-batch arrays are processed one at a time
until the entire zswap folio_batch has been loaded. The existing
zswap_load() behavior of deleting zswap_entries for folios found in the
swapcache is preserved.
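At a high level, the batch is consumed one sub-batch at a time; the following
is a condensed sketch of the flow implemented by __zswap_finish_load_batch()
below (accounting and error handling omitted, loop body summarized as
comments):

/*
 * Condensed sketch of __zswap_finish_load_batch() (the real implementation
 * below also handles delayacct/psi accounting and per-folio bookkeeping).
 */
static void sketch_finish_load_batch(struct zswap_decomp_batch *zd_batch)
{
        unsigned int nr = folio_batch_count(&zd_batch->fbatch);
        unsigned int nr_sb = DIV_ROUND_UP(nr, SWAP_CRYPTO_SUB_BATCH_SIZE);
        unsigned int sb;

        for (sb = 0; sb < nr_sb; sb++) {
                /*
                 * 1) Copy the compressed sources for this sub-batch out of
                 *    the zpool into the acomp_ctx buffers.
                 * 2) Submit up to SWAP_CRYPTO_SUB_BATCH_SIZE decompress jobs
                 *    via swap_crypto_acomp_decompress_batch(); IAA processes
                 *    them in parallel.
                 * 3) Mark the folios uptodate, count ZSWPIN events, and free
                 *    the zswap entries of folios that are in the swapcache.
                 */
        }
}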
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
include/linux/zswap.h | 22 ++++++
mm/page_io.c | 35 +++++++++
mm/swap.h | 17 +++++
mm/zswap.c | 171 ++++++++++++++++++++++++++++++++++++++++++
4 files changed, 245 insertions(+)
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 1d6de281f243..a0792c2b300a 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -110,6 +110,15 @@ struct zswap_store_pipeline_state {
u8 nr_comp_pages;
};
+/* Note: If SWAP_CRYPTO_SUB_BATCH_SIZE exceeds 256, change the u8 to u16. */
+struct zswap_load_sub_batch_state {
+ struct xarray **trees;
+ struct zswap_entry **entries;
+ struct page **pages;
+ unsigned int *slens;
+ u8 nr_decomp;
+};
+
bool zswap_store_batching_enabled(void);
void __zswap_store_batch(struct swap_in_memory_cache_cb *simc);
void __zswap_store_batch_single(struct swap_in_memory_cache_cb *simc);
@@ -136,6 +145,14 @@ static inline bool zswap_add_load_batch(
return false;
}
+void __zswap_finish_load_batch(struct zswap_decomp_batch *zd_batch);
+static inline void zswap_finish_load_batch(
+ struct zswap_decomp_batch *zd_batch)
+{
+ if (zswap_load_batching_enabled())
+ __zswap_finish_load_batch(zd_batch);
+}
+
unsigned long zswap_total_pages(void);
bool zswap_store(struct folio *folio);
bool zswap_load(struct folio *folio);
@@ -188,6 +205,11 @@ static inline bool zswap_add_load_batch(
return false;
}
+static inline void zswap_finish_load_batch(
+ struct zswap_decomp_batch *zd_batch)
+{
+}
+
static inline bool zswap_store(struct folio *folio)
{
return false;
diff --git a/mm/page_io.c b/mm/page_io.c
index 9750302d193b..aa83221318ef 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -816,6 +816,41 @@ bool swap_read_folio(struct folio *folio, struct swap_iocb **plug,
return true;
}
+static void __swap_post_process_zswap_load_batch(
+ struct zswap_decomp_batch *zswap_batch)
+{
+ u8 i;
+
+ for (i = 0; i < folio_batch_count(&zswap_batch->fbatch); ++i) {
+ struct folio *folio = zswap_batch->fbatch.folios[i];
+ folio_unlock(folio);
+ }
+}
+
+/*
+ * The swapin_readahead batching interface makes sure that the
+ * input zswap_batch consists of folios belonging to the same swap
+ * device type.
+ */
+void __swap_read_zswap_batch_unplug(struct zswap_decomp_batch *zswap_batch,
+ struct swap_iocb **splug)
+{
+ unsigned long pflags;
+
+ if (!folio_batch_count(&zswap_batch->fbatch))
+ return;
+
+ psi_memstall_enter(&pflags);
+ delayacct_swapin_start();
+
+ /* Load the zswap batch. */
+ zswap_finish_load_batch(zswap_batch);
+ __swap_post_process_zswap_load_batch(zswap_batch);
+
+ psi_memstall_leave(&pflags);
+ delayacct_swapin_end();
+}
+
void __swap_read_unplug(struct swap_iocb *sio)
{
struct iov_iter from;
diff --git a/mm/swap.h b/mm/swap.h
index 310f99007fe6..2b82c8ed765c 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -125,6 +125,16 @@ struct swap_iocb;
bool swap_read_folio(struct folio *folio, struct swap_iocb **plug,
struct zswap_decomp_batch *zswap_batch,
struct folio_batch *non_zswap_batch);
+void __swap_read_zswap_batch_unplug(
+ struct zswap_decomp_batch *zswap_batch,
+ struct swap_iocb **splug);
+static inline void swap_read_zswap_batch_unplug(
+ struct zswap_decomp_batch *zswap_batch,
+ struct swap_iocb **splug)
+{
+ if (likely(zswap_batch))
+ __swap_read_zswap_batch_unplug(zswap_batch, splug);
+}
void __swap_read_unplug(struct swap_iocb *plug);
static inline void swap_read_unplug(struct swap_iocb *plug)
{
@@ -268,6 +278,13 @@ static inline bool swap_read_folio(struct folio *folio, struct swap_iocb **plug,
{
return false;
}
+
+static inline void swap_read_zswap_batch_unplug(
+ struct zswap_decomp_batch *zswap_batch,
+ struct swap_iocb **splug)
+{
+}
+
static inline void swap_write_unplug(struct swap_iocb *sio)
{
}
diff --git a/mm/zswap.c b/mm/zswap.c
index 1d293f95d525..39bf7d8810e9 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -35,6 +35,7 @@
#include <linux/pagemap.h>
#include <linux/workqueue.h>
#include <linux/list_lru.h>
+#include <linux/delayacct.h>
#include "swap.h"
#include "internal.h"
@@ -2401,6 +2402,176 @@ bool __zswap_add_load_batch(struct zswap_decomp_batch *zd_batch,
return true;
}
+static __always_inline void zswap_load_sub_batch_init(
+ struct zswap_decomp_batch *zd_batch,
+ unsigned int sb,
+ struct zswap_load_sub_batch_state *zls)
+{
+ zls->trees = zd_batch->trees[sb];
+ zls->entries = zd_batch->entries[sb];
+ zls->pages = zd_batch->pages[sb];
+ zls->slens = zd_batch->slens[sb];
+ zls->nr_decomp = zd_batch->nr_decomp[sb];
+}
+
+static void zswap_load_map_sources(
+ struct zswap_load_sub_batch_state *zls,
+ u8 *srcs[])
+{
+ u8 i;
+
+ for (i = 0; i < zls->nr_decomp; ++i) {
+ struct zswap_entry *entry = zls->entries[i];
+ struct zpool *zpool = entry->pool->zpool;
+ u8 *buf = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
+ memcpy(srcs[i], buf, entry->length);
+ zpool_unmap_handle(zpool, entry->handle);
+ }
+}
+
+static void zswap_decompress_batch(
+ struct zswap_load_sub_batch_state *zls,
+ u8 *srcs[],
+ int decomp_errors[])
+{
+ struct crypto_acomp_ctx *acomp_ctx;
+
+ acomp_ctx = raw_cpu_ptr(zls->entries[0]->pool->acomp_ctx);
+
+ swap_crypto_acomp_decompress_batch(
+ srcs,
+ zls->pages,
+ zls->slens,
+ decomp_errors,
+ zls->nr_decomp,
+ acomp_ctx);
+}
+
+static void zswap_load_batch_updates(
+ struct zswap_decomp_batch *zd_batch,
+ unsigned int sb,
+ struct zswap_load_sub_batch_state *zls,
+ int decomp_errors[])
+{
+ unsigned int j;
+ u8 i;
+
+ for (i = 0; i < zls->nr_decomp; ++i) {
+ j = (sb * SWAP_CRYPTO_SUB_BATCH_SIZE) + i;
+ struct folio *folio = zd_batch->fbatch.folios[j];
+ struct zswap_entry *entry = zls->entries[i];
+
+ BUG_ON(decomp_errors[i]);
+ count_vm_event(ZSWPIN);
+ if (entry->objcg)
+ count_objcg_events(entry->objcg, ZSWPIN, 1);
+
+ if (zd_batch->swapcache[j]) {
+ zswap_entry_free(entry);
+ folio_mark_dirty(folio);
+ }
+
+ folio_mark_uptodate(folio);
+ }
+}
+
+static void zswap_load_decomp_batch(
+ struct zswap_decomp_batch *zd_batch,
+ unsigned int sb,
+ struct zswap_load_sub_batch_state *zls)
+{
+ int decomp_errors[SWAP_CRYPTO_SUB_BATCH_SIZE];
+ struct crypto_acomp_ctx *acomp_ctx;
+
+ acomp_ctx = raw_cpu_ptr(zls->entries[0]->pool->acomp_ctx);
+ mutex_lock(&acomp_ctx->mutex);
+
+ zswap_load_map_sources(zls, acomp_ctx->buffer);
+
+ zswap_decompress_batch(zls, acomp_ctx->buffer, decomp_errors);
+
+ mutex_unlock(&acomp_ctx->mutex);
+
+ zswap_load_batch_updates(zd_batch, sb, zls, decomp_errors);
+}
+
+static void zswap_load_start_accounting(
+ struct zswap_decomp_batch *zd_batch,
+ unsigned int sb,
+ struct zswap_load_sub_batch_state *zls,
+ bool workingset[],
+ bool in_thrashing[])
+{
+ unsigned int j;
+ u8 i;
+
+ for (i = 0; i < zls->nr_decomp; ++i) {
+ j = (sb * SWAP_CRYPTO_SUB_BATCH_SIZE) + i;
+ struct folio *folio = zd_batch->fbatch.folios[j];
+ workingset[i] = folio_test_workingset(folio);
+ if (workingset[i])
+ delayacct_thrashing_start(&in_thrashing[i]);
+ }
+}
+
+static void zswap_load_end_accounting(
+ struct zswap_decomp_batch *zd_batch,
+ struct zswap_load_sub_batch_state *zls,
+ bool workingset[],
+ bool in_thrashing[])
+{
+ u8 i;
+
+ for (i = 0; i < zls->nr_decomp; ++i)
+ if (workingset[i])
+ delayacct_thrashing_end(&in_thrashing[i]);
+}
+
+/*
+ * All entries in a zd_batch belong to the same swap device.
+ */
+void __zswap_finish_load_batch(struct zswap_decomp_batch *zd_batch)
+{
+ struct zswap_load_sub_batch_state zls;
+ unsigned int nr_folios = folio_batch_count(&zd_batch->fbatch);
+ unsigned int nr_sb = DIV_ROUND_UP(nr_folios, SWAP_CRYPTO_SUB_BATCH_SIZE);
+ unsigned int sb;
+
+ /*
+ * Process the zd_batch in sub-batches of
+ * SWAP_CRYPTO_SUB_BATCH_SIZE.
+ */
+ for (sb = 0; sb < nr_sb; ++sb) {
+ bool workingset[SWAP_CRYPTO_SUB_BATCH_SIZE];
+ bool in_thrashing[SWAP_CRYPTO_SUB_BATCH_SIZE];
+
+ zswap_load_sub_batch_init(zd_batch, sb, &zls);
+
+ zswap_load_start_accounting(zd_batch, sb, &zls,
+ workingset, in_thrashing);
+
+ /* Decompress the batch. */
+ if (zls.nr_decomp)
+ zswap_load_decomp_batch(zd_batch, sb, &zls);
+
+ /*
+ * Should we free zswap_entries, as in zswap_load():
+ * With the new swapin_readahead batching interface,
+ * all prefetch entries are read into the swapcache.
+ * Freeing the zswap entries here causes segfaults,
+ * most probably because a page-fault occurred while
+ * the buffer was being decompressed.
+ * Allowing the regular folio_free_swap() sequence
+ * in do_swap_page() appears to keep things stable
+ * without duplicated zswap-swapcache memory, as far
+ * as I can tell from my testing.
+ */
+
+ zswap_load_end_accounting(zd_batch, &zls,
+ workingset, in_thrashing);
+ }
+}
+
void zswap_invalidate(swp_entry_t swp)
{
pgoff_t offset = swp_offset(swp);
--
2.27.0
* [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface.
2024-10-18 6:47 [RFC PATCH v1 0/7] zswap IAA decompress batching Kanchana P Sridhar
` (4 preceding siblings ...)
2024-10-18 6:48 ` [RFC PATCH v1 5/7] mm: swap, zswap: zswap folio_batch processing with IAA decompression batching Kanchana P Sridhar
@ 2024-10-18 6:48 ` Kanchana P Sridhar
2024-10-18 7:26 ` David Hildenbrand
2024-10-18 6:48 ` [RFC PATCH v1 7/7] mm: For IAA batching, reduce SWAP_BATCH and modify swap slot cache thresholds Kanchana P Sridhar
6 siblings, 1 reply; 15+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18 6:48 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
akpm, hughd, willy, bfoster, dchinner, chrisl, david
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This patch invokes the swapin_readahead() based batching interface to
prefetch a batch of 4K folios for zswap load with batch decompressions
in parallel using IAA hardware. swapin_readahead() prefetches folios based
on vm.page-cluster and the usefulness of prior prefetches to the
workload. As folios are created in the swapcache and the readahead code
calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch", the
respective folio_batches get populated with the folios to be read.
Finally, the swapin_readahead() procedures will call the newly added
process_ra_batch_of_same_type(), which (sketched after the list below):
1) Reads all the non_zswap_batch folios sequentially by calling
swap_read_folio().
2) Calls swap_read_zswap_batch_unplug() with the zswap_batch, which calls
zswap_finish_load_batch(); this finally decompresses each
SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. up to 8 pages in a prefetch
batch of, say, 32 folios) in parallel with IAA.
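A condensed sketch of the intended process_ra_batch_of_same_type() logic
(the actual implementation is in the mm/swap_state.c hunk of this patch;
simplified here for illustration):

/*
 * Simplified sketch of process_ra_batch_of_same_type(); the real version
 * in mm/swap_state.c also handles readahead statistics and plug cleanup.
 */
static void sketch_process_ra_batch(struct zswap_decomp_batch *zswap_batch,
                                    struct folio_batch *non_zswap_batch,
                                    struct swap_iocb **splug)
{
        int i;

        /* 1) Read the folios not found in zswap, one at a time. */
        for (i = 0; i < folio_batch_count(non_zswap_batch); i++)
                swap_read_folio(non_zswap_batch->folios[i], splug, NULL, NULL);

        /* 2) Batch-decompress the folios found in zswap with IAA. */
        swap_read_zswap_batch_unplug(zswap_batch, splug);
}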
Within do_swap_page(), we try to benefit from batch decompressions in both
these scenarios:
1) single-mapped, SWP_SYNCHRONOUS_IO:
We call swapin_readahead() with "single_mapped_path = true". This is
done only in the !zswap_never_enabled() case.
2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
We call swapin_readahead() with "single_mapped_path = false".
This will place folios in the swapcache: a design choice that handles cases
where a folio that is "single-mapped" in process 1 could be prefetched in
process 2; and handles highly contended server scenarios with stability.
There are checks added at the end of do_swap_page(), after the folio has
been successfully loaded, to detect if the single-mapped swapcache folio is
still single-mapped, and if so, folio_free_swap() is called on the folio.
Within the swapin_readahead() functions, if single_mapped_path is true, and
either the platform does not have IAA, or the platform has IAA but the user
has selected a software compressor for zswap (details of the sysfs knob
follow), readahead/batching are skipped and the folio is loaded using
zswap_load().
A new swap parameter, "singlemapped_ra_enabled" (false by default), is added
for platforms that have IAA and zswap_load_batching_enabled() is true, to
give the user the option to run experiments with IAA and with software
compressors for zswap (when the swap device is SWP_SYNCHRONOUS_IO):
For IAA:
echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
For software compressors:
echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
If "singlemapped_ra_enabled" is set to false, swapin_readahead() will skip
prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO" do_swap_page()
path.
Thanks Ying Huang for the really helpful brainstorming discussions on the
swap_read_folio() plug design.
Suggested-by: Ying Huang <ying.huang@intel.com>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
mm/memory.c | 187 +++++++++++++++++++++++++++++++++++++-----------
mm/shmem.c | 2 +-
mm/swap.h | 12 ++--
mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++----
mm/swapfile.c | 2 +-
5 files changed, 299 insertions(+), 61 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index b5745b9ffdf7..9655b85fc243 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3924,6 +3924,42 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
return 0;
}
+/*
+ * swapin readahead based batching interface for zswap batched loads using IAA:
+ *
+ * Should only be called if the faulting swap entry in do_swap_page
+ * is single-mapped and SWP_SYNCHRONOUS_IO.
+ *
+ * Detect if the folio is in the swapcache, is still mapped to only this
+ * process, and further, there are no additional references to this folio
+ * (for e.g. if another process simultaneously readahead this swap entry
+ * while this process was handling the page-fault, and got a pointer to the
+ * folio allocated by this process in the swapcache), besides the references
+ * that were obtained within __read_swap_cache_async() by this process that is
+ * faulting in this single-mapped swap entry.
+ */
+static inline bool should_free_singlemap_swapcache(swp_entry_t entry,
+ struct folio *folio)
+{
+ if (!folio_test_swapcache(folio))
+ return false;
+
+ if (__swap_count(entry) != 0)
+ return false;
+
+ /*
+ * The folio ref count for a single-mapped folio that was allocated
+ * in __read_swap_cache_async(), can be a maximum of 3. These are the
+ * incrementors of the folio ref count in __read_swap_cache_async():
+ * folio_alloc_mpol(), add_to_swap_cache(), folio_add_lru().
+ */
+
+ if (folio_ref_count(folio) <= 3)
+ return true;
+
+ return false;
+}
+
static inline bool should_try_to_free_swap(struct folio *folio,
struct vm_area_struct *vma,
unsigned int fault_flags)
@@ -4215,6 +4251,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
swp_entry_t entry;
pte_t pte;
vm_fault_t ret = 0;
+ bool single_mapped_swapcache = false;
void *shadow = NULL;
int nr_pages;
unsigned long page_idx;
@@ -4283,51 +4320,90 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (!folio) {
if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
__swap_count(entry) == 1) {
- /* skip swapcache */
- folio = alloc_swap_folio(vmf);
- if (folio) {
- __folio_set_locked(folio);
- __folio_set_swapbacked(folio);
-
- nr_pages = folio_nr_pages(folio);
- if (folio_test_large(folio))
- entry.val = ALIGN_DOWN(entry.val, nr_pages);
- /*
- * Prevent parallel swapin from proceeding with
- * the cache flag. Otherwise, another thread
- * may finish swapin first, free the entry, and
- * swapout reusing the same entry. It's
- * undetectable as pte_same() returns true due
- * to entry reuse.
- */
- if (swapcache_prepare(entry, nr_pages)) {
+ if (zswap_never_enabled()) {
+ /* skip swapcache */
+ folio = alloc_swap_folio(vmf);
+ if (folio) {
+ __folio_set_locked(folio);
+ __folio_set_swapbacked(folio);
+
+ nr_pages = folio_nr_pages(folio);
+ if (folio_test_large(folio))
+ entry.val = ALIGN_DOWN(entry.val, nr_pages);
/*
- * Relax a bit to prevent rapid
- * repeated page faults.
+ * Prevent parallel swapin from proceeding with
+ * the cache flag. Otherwise, another thread
+ * may finish swapin first, free the entry, and
+ * swapout reusing the same entry. It's
+ * undetectable as pte_same() returns true due
+ * to entry reuse.
*/
- add_wait_queue(&swapcache_wq, &wait);
- schedule_timeout_uninterruptible(1);
- remove_wait_queue(&swapcache_wq, &wait);
- goto out_page;
+ if (swapcache_prepare(entry, nr_pages)) {
+ /*
+ * Relax a bit to prevent rapid
+ * repeated page faults.
+ */
+ add_wait_queue(&swapcache_wq, &wait);
+ schedule_timeout_uninterruptible(1);
+ remove_wait_queue(&swapcache_wq, &wait);
+ goto out_page;
+ }
+ need_clear_cache = true;
+
+ mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
+
+ shadow = get_shadow_from_swap_cache(entry);
+ if (shadow)
+ workingset_refault(folio, shadow);
+
+ folio_add_lru(folio);
+
+ /* To provide entry to swap_read_folio() */
+ folio->swap = entry;
+ swap_read_folio(folio, NULL, NULL, NULL);
+ folio->private = NULL;
+ }
+ } else {
+ /*
+ * zswap is enabled or was enabled at some point.
+ * Don't skip swapcache.
+ *
+ * swapin readahead based batching interface
+ * for zswap batched loads using IAA:
+ *
+ * Readahead is invoked in this path only if
+ * the sys swap "singlemapped_ra_enabled" swap
+ * parameter is set to true. By default,
+ * "singlemapped_ra_enabled" is set to false,
+ * the recommended setting for software compressors.
+ * For IAA, if "singlemapped_ra_enabled" is set
+ * to true, readahead will be deployed in this path
+ * as well.
+ *
+ * For single-mapped pages, the batching interface
+ * calls __read_swap_cache_async() to allocate and
+ * place the faulting page in the swapcache. This is
+ * to handle a scenario where the faulting page in
+ * this process happens to simultaneously be a
+ * readahead page in another process. By placing the
+ * single-mapped faulting page in the swapcache,
+ * we avoid race conditions and duplicate page
+ * allocations under these scenarios.
+ */
+ folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
+ vmf, true);
+ if (!folio) {
+ ret = VM_FAULT_OOM;
+ goto out;
}
- need_clear_cache = true;
-
- mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
-
- shadow = get_shadow_from_swap_cache(entry);
- if (shadow)
- workingset_refault(folio, shadow);
-
- folio_add_lru(folio);
- /* To provide entry to swap_read_folio() */
- folio->swap = entry;
- swap_read_folio(folio, NULL, NULL, NULL);
- folio->private = NULL;
- }
+ single_mapped_swapcache = true;
+ nr_pages = folio_nr_pages(folio);
+ swapcache = folio;
+ } /* swapin with zswap support. */
} else {
folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
- vmf);
+ vmf, false);
swapcache = folio;
}
@@ -4528,8 +4604,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* yet.
*/
swap_free_nr(entry, nr_pages);
- if (should_try_to_free_swap(folio, vma, vmf->flags))
+ if (should_try_to_free_swap(folio, vma, vmf->flags)) {
folio_free_swap(folio);
+ single_mapped_swapcache = false;
+ }
add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
@@ -4619,6 +4697,30 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (waitqueue_active(&swapcache_wq))
wake_up(&swapcache_wq);
}
+
+ /*
+ * swapin readahead based batching interface
+ * for zswap batched loads using IAA:
+ *
+ * Don't skip swapcache strategy for single-mapped
+ * pages: As described above, we place the
+ * single-mapped faulting page in the swapcache,
+ * to avoid race conditions and duplicate page
+ * allocations between process 1 handling a
+ * page-fault for a single-mapped page, while
+ * simultaneously, the same swap entry is a
+ * readahead prefetch page in another process 2.
+ *
+ * One side-effect of this, is that if the race did
+ * not occur, we need to clean up the swapcache
+ * entry and free the zswap entry for the faulting
+ * page, iff it is still single-mapped and is
+ * exclusive to this process.
+ */
+ if (single_mapped_swapcache &&
+ data_race(should_free_singlemap_swapcache(entry, folio)))
+ folio_free_swap(folio);
+
if (si)
put_swap_device(si);
return ret;
@@ -4638,6 +4740,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (waitqueue_active(&swapcache_wq))
wake_up(&swapcache_wq);
}
+
+ if (single_mapped_swapcache &&
+ data_race(should_free_singlemap_swapcache(entry, folio)))
+ folio_free_swap(folio);
+
if (si)
put_swap_device(si);
return ret;
diff --git a/mm/shmem.c b/mm/shmem.c
index 66eae800ffab..e4549c04f316 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1624,7 +1624,7 @@ static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
struct folio *folio;
mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
- folio = swap_cluster_readahead(swap, gfp, mpol, ilx);
+ folio = swap_cluster_readahead(swap, gfp, mpol, ilx, false);
mpol_cond_put(mpol);
return folio;
diff --git a/mm/swap.h b/mm/swap.h
index 2b82c8ed765c..2861bd8f5a96 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -199,9 +199,11 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags,
struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
bool skip_if_exists);
struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
- struct mempolicy *mpol, pgoff_t ilx);
+ struct mempolicy *mpol, pgoff_t ilx,
+ bool single_mapped_path);
struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
- struct vm_fault *vmf);
+ struct vm_fault *vmf,
+ bool single_mapped_path);
static inline unsigned int folio_swap_flags(struct folio *folio)
{
@@ -304,13 +306,15 @@ static inline void show_swap_cache_info(void)
}
static inline struct folio *swap_cluster_readahead(swp_entry_t entry,
- gfp_t gfp_mask, struct mempolicy *mpol, pgoff_t ilx)
+ gfp_t gfp_mask, struct mempolicy *mpol, pgoff_t ilx,
+ bool single_mapped_path)
{
return NULL;
}
static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
- struct vm_fault *vmf)
+ struct vm_fault *vmf,
+ bool single_mapped_path)
{
return NULL;
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 0aa938e4c34d..66ea8f7f724c 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -44,6 +44,12 @@ struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly;
static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly;
static bool enable_vma_readahead __read_mostly = true;
+/*
+ * Enable readahead in single-mapped do_swap_page() path.
+ * Set to "true" for IAA.
+ */
+static bool enable_singlemapped_readahead __read_mostly = false;
+
#define SWAP_RA_WIN_SHIFT (PAGE_SHIFT / 2)
#define SWAP_RA_HITS_MASK ((1UL << SWAP_RA_WIN_SHIFT) - 1)
#define SWAP_RA_HITS_MAX SWAP_RA_HITS_MASK
@@ -340,6 +346,11 @@ static inline bool swap_use_vma_readahead(void)
return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
}
+static inline bool swap_use_singlemapped_readahead(void)
+{
+ return READ_ONCE(enable_singlemapped_readahead);
+}
+
/*
* Lookup a swap entry in the swap cache. A found folio will be returned
* unlocked and with its refcount incremented - we rely on the kernel
@@ -635,12 +646,49 @@ static unsigned long swapin_nr_pages(unsigned long offset)
return pages;
}
+static void process_ra_batch_of_same_type(
+ struct zswap_decomp_batch *zswap_batch,
+ struct folio_batch *non_zswap_batch,
+ swp_entry_t targ_entry,
+ struct swap_iocb **splug)
+{
+ unsigned int i;
+
+ for (i = 0; i < folio_batch_count(non_zswap_batch); ++i) {
+ struct folio *folio = non_zswap_batch->folios[i];
+ swap_read_folio(folio, splug, NULL, NULL);
+ if (folio->swap.val != targ_entry.val) {
+ folio_set_readahead(folio);
+ count_vm_event(SWAP_RA);
+ }
+ folio_put(folio);
+ }
+
+ swap_read_zswap_batch_unplug(zswap_batch, splug);
+
+ for (i = 0; i < folio_batch_count(&zswap_batch->fbatch); ++i) {
+ struct folio *folio = zswap_batch->fbatch.folios[i];
+ if (folio->swap.val != targ_entry.val) {
+ folio_set_readahead(folio);
+ count_vm_event(SWAP_RA);
+ }
+ folio_put(folio);
+ }
+
+ folio_batch_reinit(non_zswap_batch);
+
+ zswap_load_batch_reinit(zswap_batch);
+}
+
/**
* swap_cluster_readahead - swap in pages in hope we need them soon
* @entry: swap entry of this memory
* @gfp_mask: memory allocation flags
* @mpol: NUMA memory allocation policy to be applied
* @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
+ * @single_mapped_path: Called from do_swap_page() single-mapped path.
+ * Only readahead if the sys "singlemapped_ra_enabled" swap parameter
+ * is set to true.
*
* Returns the struct folio for entry and addr, after queueing swapin.
*
@@ -654,7 +702,8 @@ static unsigned long swapin_nr_pages(unsigned long offset)
* are fairly likely to have been swapped out from the same node.
*/
struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
- struct mempolicy *mpol, pgoff_t ilx)
+ struct mempolicy *mpol, pgoff_t ilx,
+ bool single_mapped_path)
{
struct folio *folio;
unsigned long entry_offset = swp_offset(entry);
@@ -664,12 +713,22 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
struct swap_info_struct *si = swp_swap_info(entry);
struct blk_plug plug;
struct swap_iocb *splug = NULL;
+ struct zswap_decomp_batch zswap_batch;
+ struct folio_batch non_zswap_batch;
bool page_allocated;
+ if (single_mapped_path &&
+ (!swap_use_singlemapped_readahead() ||
+ !zswap_load_batching_enabled()))
+ goto skip;
+
mask = swapin_nr_pages(offset) - 1;
if (!mask)
goto skip;
+ zswap_load_batch_init(&zswap_batch);
+ folio_batch_init(&non_zswap_batch);
+
/* Read a page_cluster sized and aligned cluster around offset. */
start_offset = offset & ~mask;
end_offset = offset | mask;
@@ -678,6 +737,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
if (end_offset >= si->max)
end_offset = si->max - 1;
+ /* Note that all swap entries readahead are of the same swap type. */
blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
/* Ok, do the async read-ahead now */
@@ -687,14 +747,22 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
if (!folio)
continue;
if (page_allocated) {
- swap_read_folio(folio, &splug, NULL, NULL);
- if (offset != entry_offset) {
- folio_set_readahead(folio);
- count_vm_event(SWAP_RA);
+ if (swap_read_folio(folio, &splug,
+ &zswap_batch, &non_zswap_batch)) {
+ if (offset != entry_offset) {
+ folio_set_readahead(folio);
+ count_vm_event(SWAP_RA);
+ }
+ folio_put(folio);
}
+ } else {
+ folio_put(folio);
}
- folio_put(folio);
}
+
+ process_ra_batch_of_same_type(&zswap_batch, &non_zswap_batch,
+ entry, &splug);
+
blk_finish_plug(&plug);
swap_read_unplug(splug);
lru_add_drain(); /* Push any new pages onto the LRU now */
@@ -1009,6 +1077,9 @@ static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start,
* @mpol: NUMA memory allocation policy to be applied
* @targ_ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
* @vmf: fault information
+ * @single_mapped_path: Called from do_swap_page() single-mapped path.
+ * Only readahead if the sys "singlemapped_ra_enabled" swap parameter
+ * is set to true.
*
* Returns the struct folio for entry and addr, after queueing swapin.
*
@@ -1019,10 +1090,14 @@ static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start,
*
*/
static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
- struct mempolicy *mpol, pgoff_t targ_ilx, struct vm_fault *vmf)
+ struct mempolicy *mpol, pgoff_t targ_ilx, struct vm_fault *vmf,
+ bool single_mapped_path)
{
struct blk_plug plug;
struct swap_iocb *splug = NULL;
+ struct zswap_decomp_batch zswap_batch;
+ struct folio_batch non_zswap_batch;
+ int type = -1, prev_type = -1;
struct folio *folio;
pte_t *pte = NULL, pentry;
int win;
@@ -1031,10 +1106,18 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
pgoff_t ilx;
bool page_allocated;
+ if (single_mapped_path &&
+ (!swap_use_singlemapped_readahead() ||
+ !zswap_load_batching_enabled()))
+ goto skip;
+
win = swap_vma_ra_win(vmf, &start, &end);
if (win == 1)
goto skip;
+ zswap_load_batch_init(&zswap_batch);
+ folio_batch_init(&non_zswap_batch);
+
ilx = targ_ilx - PFN_DOWN(vmf->address - start);
blk_start_plug(&plug);
@@ -1057,16 +1140,38 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
if (!folio)
continue;
if (page_allocated) {
- swap_read_folio(folio, &splug, NULL, NULL);
- if (addr != vmf->address) {
- folio_set_readahead(folio);
- count_vm_event(SWAP_RA);
+ type = swp_type(entry);
+
+ /*
+ * Process this sub-batch before switching to
+ * another swap device type.
+ */
+ if ((prev_type >= 0) && (type != prev_type))
+ process_ra_batch_of_same_type(&zswap_batch,
+ &non_zswap_batch,
+ targ_entry,
+ &splug);
+
+ if (swap_read_folio(folio, &splug,
+ &zswap_batch, &non_zswap_batch)) {
+ if (addr != vmf->address) {
+ folio_set_readahead(folio);
+ count_vm_event(SWAP_RA);
+ }
+ folio_put(folio);
}
+
+ prev_type = type;
+ } else {
+ folio_put(folio);
}
- folio_put(folio);
}
if (pte)
pte_unmap(pte);
+
+ process_ra_batch_of_same_type(&zswap_batch, &non_zswap_batch,
+ targ_entry, &splug);
+
blk_finish_plug(&plug);
swap_read_unplug(splug);
lru_add_drain();
@@ -1092,7 +1197,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
* or vma-based(ie, virtual address based on faulty address) readahead.
*/
struct folio *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
- struct vm_fault *vmf)
+ struct vm_fault *vmf, bool single_mapped_path)
{
struct mempolicy *mpol;
pgoff_t ilx;
@@ -1100,8 +1205,10 @@ struct folio *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
folio = swap_use_vma_readahead() ?
- swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
- swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
+ swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf,
+ single_mapped_path) :
+ swap_cluster_readahead(entry, gfp_mask, mpol, ilx,
+ single_mapped_path);
mpol_cond_put(mpol);
return folio;
@@ -1126,10 +1233,30 @@ static ssize_t vma_ra_enabled_store(struct kobject *kobj,
return count;
}
+static ssize_t singlemapped_ra_enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return sysfs_emit(buf, "%s\n",
+ enable_singlemapped_readahead ? "true" : "false");
+}
+static ssize_t singlemapped_ra_enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ ssize_t ret;
+
+ ret = kstrtobool(buf, &enable_singlemapped_readahead);
+ if (ret)
+ return ret;
+
+ return count;
+}
static struct kobj_attribute vma_ra_enabled_attr = __ATTR_RW(vma_ra_enabled);
+static struct kobj_attribute singlemapped_ra_enabled_attr = __ATTR_RW(singlemapped_ra_enabled);
static struct attribute *swap_attrs[] = {
&vma_ra_enabled_attr.attr,
+ &singlemapped_ra_enabled_attr.attr,
NULL,
};
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b0915f3fab31..10367eaee1ff 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2197,7 +2197,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
};
folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
- &vmf);
+ &vmf, false);
}
if (!folio) {
swp_count = READ_ONCE(si->swap_map[offset]);
--
2.27.0
^ permalink raw reply [flat|nested] 15+ messages in thread
* [RFC PATCH v1 7/7] mm: For IAA batching, reduce SWAP_BATCH and modify swap slot cache thresholds.
2024-10-18 6:47 [RFC PATCH v1 0/7] zswap IAA decompress batching Kanchana P Sridhar
` (5 preceding siblings ...)
2024-10-18 6:48 ` [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface Kanchana P Sridhar
@ 2024-10-18 6:48 ` Kanchana P Sridhar
6 siblings, 0 replies; 15+ messages in thread
From: Kanchana P Sridhar @ 2024-10-18 6:48 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, ryan.roberts, ying.huang, 21cnbao,
akpm, hughd, willy, bfoster, dchinner, chrisl, david
Cc: wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
When IAA is used for compress batching and decompress batching of folios,
the swapout/swapin path latencies drop significantly, and with them the
latency of swap page-faults. This means swap entries will need to be freed,
and swap slots released, at a much higher rate.
The existing SWAP_BATCH and SWAP_SLOTS_CACHE_SIZE value of 64 can then
cause contention on the swap_info_struct lock in swapcache_free_entries(),
and CPU hard lockups can result in highly contended server scenarios.
To prevent this, SWAP_BATCH and SWAP_SLOTS_CACHE_SIZE have been reduced to 16
when IAA is used for compress/decompress batching. The swap_slots_cache
activate/deactivate thresholds have been modified accordingly, so that
stability is gained without compromising performance.
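For reference, taken together with the hunks below (and recalling that
SWAP_SLOTS_CACHE_SIZE == SWAP_BATCH), the thresholds work out as:

    Default:            SWAP_BATCH = 64, activate = 5*64  = 320, deactivate = 2*64  = 128
    With IAA batching:  SWAP_BATCH = 16, activate = 40*16 = 640, deactivate = 16*16 = 256

In other words, the per-refill batch shrinks from 64 to 16 slots, while the
absolute activate/deactivate points are raised, which is how the series aims
to keep the slot cache's performance benefit while improving stability.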
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
include/linux/swap.h | 7 +++++++
include/linux/swap_slots.h | 7 +++++++
2 files changed, 14 insertions(+)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ca533b478c21..3987faa0a2ff 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -13,6 +13,7 @@
#include <linux/pagemap.h>
#include <linux/atomic.h>
#include <linux/page-flags.h>
+#include <linux/pagevec.h>
#include <uapi/linux/mempolicy.h>
#include <asm/page.h>
@@ -32,7 +33,13 @@ struct pagevec;
#define SWAP_FLAGS_VALID (SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \
SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE | \
SWAP_FLAG_DISCARD_PAGES)
+
+#if defined(CONFIG_ZSWAP_STORE_BATCHING_ENABLED) || \
+ defined(CONFIG_ZSWAP_LOAD_BATCHING_ENABLED)
+#define SWAP_BATCH 16
+#else
#define SWAP_BATCH 64
+#endif
static inline int current_is_kswapd(void)
{
diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
index 15adfb8c813a..1b6e4e2798bd 100644
--- a/include/linux/swap_slots.h
+++ b/include/linux/swap_slots.h
@@ -7,8 +7,15 @@
#include <linux/mutex.h>
#define SWAP_SLOTS_CACHE_SIZE SWAP_BATCH
+
+#if defined(CONFIG_ZSWAP_STORE_BATCHING_ENABLED) || \
+ defined(CONFIG_ZSWAP_LOAD_BATCHING_ENABLED)
+#define THRESHOLD_ACTIVATE_SWAP_SLOTS_CACHE (40*SWAP_SLOTS_CACHE_SIZE)
+#define THRESHOLD_DEACTIVATE_SWAP_SLOTS_CACHE (16*SWAP_SLOTS_CACHE_SIZE)
+#else
#define THRESHOLD_ACTIVATE_SWAP_SLOTS_CACHE (5*SWAP_SLOTS_CACHE_SIZE)
#define THRESHOLD_DEACTIVATE_SWAP_SLOTS_CACHE (2*SWAP_SLOTS_CACHE_SIZE)
+#endif
struct swap_slots_cache {
bool lock_initialized;
--
2.27.0
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface.
2024-10-18 6:48 ` [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface Kanchana P Sridhar
@ 2024-10-18 7:26 ` David Hildenbrand
2024-10-18 11:04 ` Usama Arif
2024-10-18 18:09 ` Sridhar, Kanchana P
0 siblings, 2 replies; 15+ messages in thread
From: David Hildenbrand @ 2024-10-18 7:26 UTC (permalink / raw)
To: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, yosryahmed,
nphamcs, chengming.zhou, usamaarif642, ryan.roberts, ying.huang,
21cnbao, akpm, hughd, willy, bfoster, dchinner, chrisl
Cc: wajdi.k.feghali, vinodh.gopal
On 18.10.24 08:48, Kanchana P Sridhar wrote:
> This patch invokes the swapin_readahead() based batching interface to
> prefetch a batch of 4K folios for zswap load with batch decompressions
> in parallel using IAA hardware. swapin_readahead() prefetches folios based
> on vm.page-cluster and the usefulness of prior prefetches to the
> workload. As folios are created in the swapcache and the readahead code
> calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch", the
> respective folio_batches get populated with the folios to be read.
>
> Finally, the swapin_readahead() procedures will call the newly added
> process_ra_batch_of_same_type() which:
>
> 1) Reads all the non_zswap_batch folios sequentially by calling
> swap_read_folio().
> 2) Calls swap_read_zswap_batch_unplug() with the zswap_batch which calls
> zswap_finish_load_batch() that finally decompresses each
> SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. upto 8 pages in a prefetch
> batch of say, 32 folios) in parallel with IAA.
>
> Within do_swap_page(), we try to benefit from batch decompressions in both
> these scenarios:
>
> 1) single-mapped, SWP_SYNCHRONOUS_IO:
> We call swapin_readahead() with "single_mapped_path = true". This is
> done only in the !zswap_never_enabled() case.
> 2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
> We call swapin_readahead() with "single_mapped_path = false".
>
> This will place folios in the swapcache: a design choice that handles cases
> where a folio that is "single-mapped" in process 1 could be prefetched in
> process 2; and handles highly contended server scenarios with stability.
> There are checks added at the end of do_swap_page(), after the folio has
> been successfully loaded, to detect if the single-mapped swapcache folio is
> still single-mapped, and if so, folio_free_swap() is called on the folio.
>
> Within the swapin_readahead() functions, if single_mapped_path is true, and
> either the platform does not have IAA, or, if the platform has IAA and the
> user selects a software compressor for zswap (details of sysfs knob
> follow), readahead/batching are skipped and the folio is loaded using
> zswap_load().
>
> A new swap parameter "singlemapped_ra_enabled" (false by default) is added
> for platforms that have IAA, zswap_load_batching_enabled() is true, and we
> want to give the user the option to run experiments with IAA and with
> software compressors for zswap (swap device is SWP_SYNCHRONOUS_IO):
>
> For IAA:
> echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
>
> For software compressors:
> echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
>
> If "singlemapped_ra_enabled" is set to false, swapin_readahead() will skip
> prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO" do_swap_page()
> path.
>
> Thanks Ying Huang for the really helpful brainstorming discussions on the
> swap_read_folio() plug design.
>
> Suggested-by: Ying Huang <ying.huang@intel.com>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
> mm/memory.c | 187 +++++++++++++++++++++++++++++++++++++-----------
> mm/shmem.c | 2 +-
> mm/swap.h | 12 ++--
> mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++----
> mm/swapfile.c | 2 +-
> 5 files changed, 299 insertions(+), 61 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index b5745b9ffdf7..9655b85fc243 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3924,6 +3924,42 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> return 0;
> }
>
> +/*
> + * swapin readahead based batching interface for zswap batched loads using IAA:
> + *
> + * Should only be called for and if the faulting swap entry in do_swap_page
> + * is single-mapped and SWP_SYNCHRONOUS_IO.
> + *
> + * Detect if the folio is in the swapcache, is still mapped to only this
> + * process, and further, there are no additional references to this folio
> + * (for e.g. if another process simultaneously readahead this swap entry
> + * while this process was handling the page-fault, and got a pointer to the
> + * folio allocated by this process in the swapcache), besides the references
> + * that were obtained within __read_swap_cache_async() by this process that is
> + * faulting in this single-mapped swap entry.
> + */
How is this supposed to work for large folios?
> +static inline bool should_free_singlemap_swapcache(swp_entry_t entry,
> + struct folio *folio)
> +{
> + if (!folio_test_swapcache(folio))
> + return false;
> +
> + if (__swap_count(entry) != 0)
> + return false;
> +
> + /*
> + * The folio ref count for a single-mapped folio that was allocated
> + * in __read_swap_cache_async(), can be a maximum of 3. These are the
> + * incrementors of the folio ref count in __read_swap_cache_async():
> + * folio_alloc_mpol(), add_to_swap_cache(), folio_add_lru().
> + */
> +
> + if (folio_ref_count(folio) <= 3)
> + return true;
> +
> + return false;
> +}
> +
> static inline bool should_try_to_free_swap(struct folio *folio,
> struct vm_area_struct *vma,
> unsigned int fault_flags)
> @@ -4215,6 +4251,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> swp_entry_t entry;
> pte_t pte;
> vm_fault_t ret = 0;
> + bool single_mapped_swapcache = false;
> void *shadow = NULL;
> int nr_pages;
> unsigned long page_idx;
> @@ -4283,51 +4320,90 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (!folio) {
> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> __swap_count(entry) == 1) {
> - /* skip swapcache */
> - folio = alloc_swap_folio(vmf);
> - if (folio) {
> - __folio_set_locked(folio);
> - __folio_set_swapbacked(folio);
> -
> - nr_pages = folio_nr_pages(folio);
> - if (folio_test_large(folio))
> - entry.val = ALIGN_DOWN(entry.val, nr_pages);
> - /*
> - * Prevent parallel swapin from proceeding with
> - * the cache flag. Otherwise, another thread
> - * may finish swapin first, free the entry, and
> - * swapout reusing the same entry. It's
> - * undetectable as pte_same() returns true due
> - * to entry reuse.
> - */
> - if (swapcache_prepare(entry, nr_pages)) {
> + if (zswap_never_enabled()) {
> + /* skip swapcache */
> + folio = alloc_swap_folio(vmf);
> + if (folio) {
> + __folio_set_locked(folio);
> + __folio_set_swapbacked(folio);
> +
> + nr_pages = folio_nr_pages(folio);
> + if (folio_test_large(folio))
> + entry.val = ALIGN_DOWN(entry.val, nr_pages);
> /*
> - * Relax a bit to prevent rapid
> - * repeated page faults.
> + * Prevent parallel swapin from proceeding with
> + * the cache flag. Otherwise, another thread
> + * may finish swapin first, free the entry, and
> + * swapout reusing the same entry. It's
> + * undetectable as pte_same() returns true due
> + * to entry reuse.
> */
> - add_wait_queue(&swapcache_wq, &wait);
> - schedule_timeout_uninterruptible(1);
> - remove_wait_queue(&swapcache_wq, &wait);
> - goto out_page;
> + if (swapcache_prepare(entry, nr_pages)) {
> + /*
> + * Relax a bit to prevent rapid
> + * repeated page faults.
> + */
> + add_wait_queue(&swapcache_wq, &wait);
> + schedule_timeout_uninterruptible(1);
> + remove_wait_queue(&swapcache_wq, &wait);
> + goto out_page;
> + }
> + need_clear_cache = true;
> +
> + mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
> +
> + shadow = get_shadow_from_swap_cache(entry);
> + if (shadow)
> + workingset_refault(folio, shadow);
> +
> + folio_add_lru(folio);
> +
> + /* To provide entry to swap_read_folio() */
> + folio->swap = entry;
> + swap_read_folio(folio, NULL, NULL, NULL);
> + folio->private = NULL;
> + }
> + } else {
> + /*
> + * zswap is enabled or was enabled at some point.
> + * Don't skip swapcache.
> + *
> + * swapin readahead based batching interface
> + * for zswap batched loads using IAA:
> + *
> + * Readahead is invoked in this path only if
> + * the sys swap "singlemapped_ra_enabled" swap
> + * parameter is set to true. By default,
> + * "singlemapped_ra_enabled" is set to false,
> + * the recommended setting for software compressors.
> + * For IAA, if "singlemapped_ra_enabled" is set
> + * to true, readahead will be deployed in this path
> + * as well.
> + *
> + * For single-mapped pages, the batching interface
> + * calls __read_swap_cache_async() to allocate and
> + * place the faulting page in the swapcache. This is
> + * to handle a scenario where the faulting page in
> + * this process happens to simultaneously be a
> + * readahead page in another process. By placing the
> + * single-mapped faulting page in the swapcache,
> + * we avoid race conditions and duplicate page
> + * allocations under these scenarios.
> + */
> + folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> + vmf, true);
> + if (!folio) {
> + ret = VM_FAULT_OOM;
> + goto out;
> }
> - need_clear_cache = true;
> -
> - mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
> -
> - shadow = get_shadow_from_swap_cache(entry);
> - if (shadow)
> - workingset_refault(folio, shadow);
> -
> - folio_add_lru(folio);
>
> - /* To provide entry to swap_read_folio() */
> - folio->swap = entry;
> - swap_read_folio(folio, NULL, NULL, NULL);
> - folio->private = NULL;
> - }
> + single_mapped_swapcache = true;
> + nr_pages = folio_nr_pages(folio);
> + swapcache = folio;
> + } /* swapin with zswap support. */
> } else {
> folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> - vmf);
> + vmf, false);
> swapcache = folio;
I'm sorry, but making this function ever more complicated and ugly is
not going to fly. The zswap special casing is quite ugly here as well.
Is there a way forward that we can make this code actually readable and
avoid zswap special casing?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface.
2024-10-18 7:26 ` David Hildenbrand
@ 2024-10-18 11:04 ` Usama Arif
2024-10-18 17:21 ` Nhat Pham
2024-10-18 18:09 ` Sridhar, Kanchana P
1 sibling, 1 reply; 15+ messages in thread
From: Usama Arif @ 2024-10-18 11:04 UTC (permalink / raw)
To: David Hildenbrand, Kanchana P Sridhar, linux-kernel, linux-mm,
hannes, yosryahmed, nphamcs, chengming.zhou, ryan.roberts,
ying.huang, 21cnbao, akpm, hughd, willy, bfoster, dchinner,
chrisl
Cc: wajdi.k.feghali, vinodh.gopal
On 18/10/2024 08:26, David Hildenbrand wrote:
> On 18.10.24 08:48, Kanchana P Sridhar wrote:
>> This patch invokes the swapin_readahead() based batching interface to
>> prefetch a batch of 4K folios for zswap load with batch decompressions
>> in parallel using IAA hardware. swapin_readahead() prefetches folios based
>> on vm.page-cluster and the usefulness of prior prefetches to the
>> workload. As folios are created in the swapcache and the readahead code
>> calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch", the
>> respective folio_batches get populated with the folios to be read.
>>
>> Finally, the swapin_readahead() procedures will call the newly added
>> process_ra_batch_of_same_type() which:
>>
>> 1) Reads all the non_zswap_batch folios sequentially by calling
>> swap_read_folio().
>> 2) Calls swap_read_zswap_batch_unplug() with the zswap_batch which calls
>> zswap_finish_load_batch() that finally decompresses each
>> SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. upto 8 pages in a prefetch
>> batch of say, 32 folios) in parallel with IAA.
>>
>> Within do_swap_page(), we try to benefit from batch decompressions in both
>> these scenarios:
>>
>> 1) single-mapped, SWP_SYNCHRONOUS_IO:
>> We call swapin_readahead() with "single_mapped_path = true". This is
>> done only in the !zswap_never_enabled() case.
>> 2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
>> We call swapin_readahead() with "single_mapped_path = false".
>>
>> This will place folios in the swapcache: a design choice that handles cases
>> where a folio that is "single-mapped" in process 1 could be prefetched in
>> process 2; and handles highly contended server scenarios with stability.
>> There are checks added at the end of do_swap_page(), after the folio has
>> been successfully loaded, to detect if the single-mapped swapcache folio is
>> still single-mapped, and if so, folio_free_swap() is called on the folio.
>>
>> Within the swapin_readahead() functions, if single_mapped_path is true, and
>> either the platform does not have IAA, or, if the platform has IAA and the
>> user selects a software compressor for zswap (details of sysfs knob
>> follow), readahead/batching are skipped and the folio is loaded using
>> zswap_load().
>>
>> A new swap parameter "singlemapped_ra_enabled" (false by default) is added
>> for platforms that have IAA, zswap_load_batching_enabled() is true, and we
>> want to give the user the option to run experiments with IAA and with
>> software compressors for zswap (swap device is SWP_SYNCHRONOUS_IO):
>>
>> For IAA:
>> echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
>>
>> For software compressors:
>> echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
>>
>> If "singlemapped_ra_enabled" is set to false, swapin_readahead() will skip
>> prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO" do_swap_page()
>> path.
>>
>> Thanks Ying Huang for the really helpful brainstorming discussions on the
>> swap_read_folio() plug design.
>>
>> Suggested-by: Ying Huang <ying.huang@intel.com>
>> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
>> ---
>> mm/memory.c | 187 +++++++++++++++++++++++++++++++++++++-----------
>> mm/shmem.c | 2 +-
>> mm/swap.h | 12 ++--
>> mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++----
>> mm/swapfile.c | 2 +-
>> 5 files changed, 299 insertions(+), 61 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index b5745b9ffdf7..9655b85fc243 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3924,6 +3924,42 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
>> return 0;
>> }
>> +/*
>> + * swapin readahead based batching interface for zswap batched loads using IAA:
>> + *
>> + * Should only be called for and if the faulting swap entry in do_swap_page
>> + * is single-mapped and SWP_SYNCHRONOUS_IO.
>> + *
>> + * Detect if the folio is in the swapcache, is still mapped to only this
>> + * process, and further, there are no additional references to this folio
>> + * (for e.g. if another process simultaneously readahead this swap entry
>> + * while this process was handling the page-fault, and got a pointer to the
>> + * folio allocated by this process in the swapcache), besides the references
>> + * that were obtained within __read_swap_cache_async() by this process that is
>> + * faulting in this single-mapped swap entry.
>> + */
>
> How is this supposed to work for large folios?
>
Hi,
I was looking at zswapin large folio support and have posted an RFC in [1].
I got bogged down with some prod stuff, so wasn't able to send it earlier.
It looks quite different, and I think simpler than this series, so it might be
a good comparison.
[1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@gmail.com/
Thanks,
Usama
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface.
2024-10-18 11:04 ` Usama Arif
@ 2024-10-18 17:21 ` Nhat Pham
2024-10-18 21:59 ` Sridhar, Kanchana P
0 siblings, 1 reply; 15+ messages in thread
From: Nhat Pham @ 2024-10-18 17:21 UTC (permalink / raw)
To: Usama Arif
Cc: David Hildenbrand, Kanchana P Sridhar, linux-kernel, linux-mm,
hannes, yosryahmed, chengming.zhou, ryan.roberts, ying.huang,
21cnbao, akpm, hughd, willy, bfoster, dchinner, chrisl,
wajdi.k.feghali, vinodh.gopal
On Fri, Oct 18, 2024 at 4:04 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
> On 18/10/2024 08:26, David Hildenbrand wrote:
> > On 18.10.24 08:48, Kanchana P Sridhar wrote:
> >> This patch invokes the swapin_readahead() based batching interface to
> >> prefetch a batch of 4K folios for zswap load with batch decompressions
> >> in parallel using IAA hardware. swapin_readahead() prefetches folios based
> >> on vm.page-cluster and the usefulness of prior prefetches to the
> >> workload. As folios are created in the swapcache and the readahead code
> >> calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch", the
> >> respective folio_batches get populated with the folios to be read.
> >>
> >> Finally, the swapin_readahead() procedures will call the newly added
> >> process_ra_batch_of_same_type() which:
> >>
> >> 1) Reads all the non_zswap_batch folios sequentially by calling
> >> swap_read_folio().
> >> 2) Calls swap_read_zswap_batch_unplug() with the zswap_batch which calls
> >> zswap_finish_load_batch() that finally decompresses each
> >> SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. upto 8 pages in a prefetch
> >> batch of say, 32 folios) in parallel with IAA.
> >>
> >> Within do_swap_page(), we try to benefit from batch decompressions in both
> >> these scenarios:
> >>
> >> 1) single-mapped, SWP_SYNCHRONOUS_IO:
> >> We call swapin_readahead() with "single_mapped_path = true". This is
> >> done only in the !zswap_never_enabled() case.
> >> 2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
> >> We call swapin_readahead() with "single_mapped_path = false".
> >>
> >> This will place folios in the swapcache: a design choice that handles cases
> >> where a folio that is "single-mapped" in process 1 could be prefetched in
> >> process 2; and handles highly contended server scenarios with stability.
> >> There are checks added at the end of do_swap_page(), after the folio has
> >> been successfully loaded, to detect if the single-mapped swapcache folio is
> >> still single-mapped, and if so, folio_free_swap() is called on the folio.
> >>
> >> Within the swapin_readahead() functions, if single_mapped_path is true, and
> >> either the platform does not have IAA, or, if the platform has IAA and the
> >> user selects a software compressor for zswap (details of sysfs knob
> >> follow), readahead/batching are skipped and the folio is loaded using
> >> zswap_load().
> >>
> >> A new swap parameter "singlemapped_ra_enabled" (false by default) is added
> >> for platforms that have IAA, zswap_load_batching_enabled() is true, and we
> >> want to give the user the option to run experiments with IAA and with
> >> software compressors for zswap (swap device is SWP_SYNCHRONOUS_IO):
> >>
> >> For IAA:
> >> echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
> >>
> >> For software compressors:
> >> echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
> >>
> >> If "singlemapped_ra_enabled" is set to false, swapin_readahead() will skip
> >> prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO" do_swap_page()
> >> path.
> >>
> >> Thanks Ying Huang for the really helpful brainstorming discussions on the
> >> swap_read_folio() plug design.
> >>
> >> Suggested-by: Ying Huang <ying.huang@intel.com>
> >> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> >> ---
> >> mm/memory.c | 187 +++++++++++++++++++++++++++++++++++++-----------
> >> mm/shmem.c | 2 +-
> >> mm/swap.h | 12 ++--
> >> mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++----
> >> mm/swapfile.c | 2 +-
> >> 5 files changed, 299 insertions(+), 61 deletions(-)
> >>
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index b5745b9ffdf7..9655b85fc243 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -3924,6 +3924,42 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> >> return 0;
> >> }
> >> +/*
> >> + * swapin readahead based batching interface for zswap batched loads using IAA:
> >> + *
> >> + * Should only be called for and if the faulting swap entry in do_swap_page
> >> + * is single-mapped and SWP_SYNCHRONOUS_IO.
> >> + *
> >> + * Detect if the folio is in the swapcache, is still mapped to only this
> >> + * process, and further, there are no additional references to this folio
> >> + * (for e.g. if another process simultaneously readahead this swap entry
> >> + * while this process was handling the page-fault, and got a pointer to the
> >> + * folio allocated by this process in the swapcache), besides the references
> >> + * that were obtained within __read_swap_cache_async() by this process that is
> >> + * faulting in this single-mapped swap entry.
> >> + */
> >
> > How is this supposed to work for large folios?
> >
>
> Hi,
>
> I was looking at zswapin large folio support and have posted an RFC in [1].
> I got bogged down with some prod stuff, so wasn't able to send it earlier.
>
> It looks quite different, and I think simpler than this series, so it might be
> a good comparison.
>
> [1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@gmail.com/
>
> Thanks,
> Usama
I agree.
I think the lower hanging fruit here is to build upon Usama's patch.
Kanchana, do you think we can just use the new batch decompressing
infrastructure, and apply it to Usama's large folio zswap loading?
I'm not denying the readahead idea outright, but that seems much more
complicated. There are questions regarding the benefits of
readahead when applied to zswap in the first place - IIUC, zram
circumvents that logic in several cases, and zswap shares many
characteristics with zram (fast, synchronous compression devices).
So let's reap the low hanging fruits first, get the wins as well as
stress test the new infrastructure. Then we can discuss the readahead
idea later?
^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface.
2024-10-18 7:26 ` David Hildenbrand
2024-10-18 11:04 ` Usama Arif
@ 2024-10-18 18:09 ` Sridhar, Kanchana P
1 sibling, 0 replies; 15+ messages in thread
From: Sridhar, Kanchana P @ 2024-10-18 18:09 UTC (permalink / raw)
To: David Hildenbrand, linux-kernel, linux-mm, hannes, yosryahmed,
nphamcs, chengming.zhou, usamaarif642, ryan.roberts, Huang, Ying,
21cnbao, akpm, hughd, willy, bfoster, dchinner, chrisl, Sridhar,
Kanchana P
Cc: Feghali, Wajdi K, Gopal, Vinodh
Hi David,
> -----Original Message-----
> From: David Hildenbrand <david@redhat.com>
> Sent: Friday, October 18, 2024 12:27 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> 21cnbao@gmail.com; akpm@linux-foundation.org; hughd@google.com;
> willy@infradead.org; bfoster@redhat.com; dchinner@redhat.com;
> chrisl@kernel.org
> Cc: Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls
> swapin_readahead() zswap load batching interface.
>
> On 18.10.24 08:48, Kanchana P Sridhar wrote:
> > This patch invokes the swapin_readahead() based batching interface to
> > prefetch a batch of 4K folios for zswap load with batch decompressions
> > in parallel using IAA hardware. swapin_readahead() prefetches folios based
> > on vm.page-cluster and the usefulness of prior prefetches to the
> > workload. As folios are created in the swapcache and the readahead code
> > calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch",
> the
> > respective folio_batches get populated with the folios to be read.
> >
> > Finally, the swapin_readahead() procedures will call the newly added
> > process_ra_batch_of_same_type() which:
> >
> > 1) Reads all the non_zswap_batch folios sequentially by calling
> > swap_read_folio().
> > 2) Calls swap_read_zswap_batch_unplug() with the zswap_batch which
> calls
> > zswap_finish_load_batch() that finally decompresses each
> > SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. upto 8 pages in a
> prefetch
> > batch of say, 32 folios) in parallel with IAA.
> >
> > Within do_swap_page(), we try to benefit from batch decompressions in
> both
> > these scenarios:
> >
> > 1) single-mapped, SWP_SYNCHRONOUS_IO:
> > We call swapin_readahead() with "single_mapped_path = true". This is
> > done only in the !zswap_never_enabled() case.
> > 2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
> > We call swapin_readahead() with "single_mapped_path = false".
> >
> > This will place folios in the swapcache: a design choice that handles cases
> > where a folio that is "single-mapped" in process 1 could be prefetched in
> > process 2; and handles highly contended server scenarios with stability.
> > There are checks added at the end of do_swap_page(), after the folio has
> > been successfully loaded, to detect if the single-mapped swapcache folio is
> > still single-mapped, and if so, folio_free_swap() is called on the folio.
> >
> > Within the swapin_readahead() functions, if single_mapped_path is true,
> and
> > either the platform does not have IAA, or, if the platform has IAA and the
> > user selects a software compressor for zswap (details of sysfs knob
> > follow), readahead/batching are skipped and the folio is loaded using
> > zswap_load().
> >
> > A new swap parameter "singlemapped_ra_enabled" (false by default) is
> added
> > for platforms that have IAA, zswap_load_batching_enabled() is true, and we
> > want to give the user the option to run experiments with IAA and with
> > software compressors for zswap (swap device is SWP_SYNCHRONOUS_IO):
> >
> > For IAA:
> > echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
> >
> > For software compressors:
> > echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
> >
> > If "singlemapped_ra_enabled" is set to false, swapin_readahead() will skip
> > prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO"
> do_swap_page()
> > path.
> >
> > Thanks Ying Huang for the really helpful brainstorming discussions on the
> > swap_read_folio() plug design.
> >
> > Suggested-by: Ying Huang <ying.huang@intel.com>
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> > mm/memory.c | 187 +++++++++++++++++++++++++++++++++++++------
> -----
> > mm/shmem.c | 2 +-
> > mm/swap.h | 12 ++--
> > mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++----
> > mm/swapfile.c | 2 +-
> > 5 files changed, 299 insertions(+), 61 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index b5745b9ffdf7..9655b85fc243 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3924,6 +3924,42 @@ static vm_fault_t
> remove_device_exclusive_entry(struct vm_fault *vmf)
> > return 0;
> > }
> >
> > +/*
> > + * swapin readahead based batching interface for zswap batched loads
> using IAA:
> > + *
> > + * Should only be called for and if the faulting swap entry in do_swap_page
> > + * is single-mapped and SWP_SYNCHRONOUS_IO.
> > + *
> > + * Detect if the folio is in the swapcache, is still mapped to only this
> > + * process, and further, there are no additional references to this folio
> > + * (for e.g. if another process simultaneously readahead this swap entry
> > + * while this process was handling the page-fault, and got a pointer to the
> > + * folio allocated by this process in the swapcache), besides the references
> > + * that were obtained within __read_swap_cache_async() by this process
> that is
> > + * faulting in this single-mapped swap entry.
> > + */
>
> How is this supposed to work for large folios?
Thanks for your code review comments. The main idea behind this
patch-series is to work with the existing kernel page-fault granularity of 4K
folios, which swapin_readahead() builds upon to prefetch other "useful"
4K folios. The intent is to not make modifications at page-fault time
to opportunistically synthesize large folios for swapin.
As we know, __read_swap_cache_async() allocates an order-0 folio, which
explains the implementation of should_free_singlemap_swapcache() in this
patch. IOW, this is not intended to work for large folios: it relies on the
existing page-fault behavior without modifying it.
>
> > +static inline bool should_free_singlemap_swapcache(swp_entry_t entry,
> > + struct folio *folio)
> > +{
> > + if (!folio_test_swapcache(folio))
> > + return false;
> > +
> > + if (__swap_count(entry) != 0)
> > + return false;
> > +
> > + /*
> > + * The folio ref count for a single-mapped folio that was allocated
> > + * in __read_swap_cache_async(), can be a maximum of 3. These are
> the
> > + * incrementors of the folio ref count in __read_swap_cache_async():
> > + * folio_alloc_mpol(), add_to_swap_cache(), folio_add_lru().
> > + */
> > +
> > + if (folio_ref_count(folio) <= 3)
> > + return true;
> > +
> > + return false;
> > +}
> > +
> > static inline bool should_try_to_free_swap(struct folio *folio,
> > struct vm_area_struct *vma,
> > unsigned int fault_flags)
> > @@ -4215,6 +4251,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > swp_entry_t entry;
> > pte_t pte;
> > vm_fault_t ret = 0;
> > + bool single_mapped_swapcache = false;
> > void *shadow = NULL;
> > int nr_pages;
> > unsigned long page_idx;
> > @@ -4283,51 +4320,90 @@ vm_fault_t do_swap_page(struct vm_fault
> *vmf)
> > if (!folio) {
> > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > __swap_count(entry) == 1) {
> > - /* skip swapcache */
> > - folio = alloc_swap_folio(vmf);
> > - if (folio) {
> > - __folio_set_locked(folio);
> > - __folio_set_swapbacked(folio);
> > -
> > - nr_pages = folio_nr_pages(folio);
> > - if (folio_test_large(folio))
> > - entry.val = ALIGN_DOWN(entry.val,
> nr_pages);
> > - /*
> > - * Prevent parallel swapin from proceeding
> with
> > - * the cache flag. Otherwise, another thread
> > - * may finish swapin first, free the entry, and
> > - * swapout reusing the same entry. It's
> > - * undetectable as pte_same() returns true
> due
> > - * to entry reuse.
> > - */
> > - if (swapcache_prepare(entry, nr_pages)) {
> > + if (zswap_never_enabled()) {
> > + /* skip swapcache */
> > + folio = alloc_swap_folio(vmf);
> > + if (folio) {
> > + __folio_set_locked(folio);
> > + __folio_set_swapbacked(folio);
> > +
> > + nr_pages = folio_nr_pages(folio);
> > + if (folio_test_large(folio))
> > + entry.val =
> ALIGN_DOWN(entry.val, nr_pages);
> > /*
> > - * Relax a bit to prevent rapid
> > - * repeated page faults.
> > + * Prevent parallel swapin from
> proceeding with
> > + * the cache flag. Otherwise, another
> thread
> > + * may finish swapin first, free the
> entry, and
> > + * swapout reusing the same entry.
> It's
> > + * undetectable as pte_same()
> returns true due
> > + * to entry reuse.
> > */
> > - add_wait_queue(&swapcache_wq,
> &wait);
> > -
> schedule_timeout_uninterruptible(1);
> > -
> remove_wait_queue(&swapcache_wq, &wait);
> > - goto out_page;
> > + if (swapcache_prepare(entry,
> nr_pages)) {
> > + /*
> > + * Relax a bit to prevent rapid
> > + * repeated page faults.
> > + */
> > +
> add_wait_queue(&swapcache_wq, &wait);
> > +
> schedule_timeout_uninterruptible(1);
> > +
> remove_wait_queue(&swapcache_wq, &wait);
> > + goto out_page;
> > + }
> > + need_clear_cache = true;
> > +
> > +
> mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
> > +
> > + shadow =
> get_shadow_from_swap_cache(entry);
> > + if (shadow)
> > + workingset_refault(folio,
> shadow);
> > +
> > + folio_add_lru(folio);
> > +
> > + /* To provide entry to
> swap_read_folio() */
> > + folio->swap = entry;
> > + swap_read_folio(folio, NULL, NULL,
> NULL);
> > + folio->private = NULL;
> > + }
> > + } else {
> > + /*
> > + * zswap is enabled or was enabled at some
> point.
> > + * Don't skip swapcache.
> > + *
> > + * swapin readahead based batching
> interface
> > + * for zswap batched loads using IAA:
> > + *
> > + * Readahead is invoked in this path only if
> > + * the sys swap "singlemapped_ra_enabled"
> swap
> > + * parameter is set to true. By default,
> > + * "singlemapped_ra_enabled" is set to false,
> > + * the recommended setting for software
> compressors.
> > + * For IAA, if "singlemapped_ra_enabled" is
> set
> > + * to true, readahead will be deployed in this
> path
> > + * as well.
> > + *
> > + * For single-mapped pages, the batching
> interface
> > + * calls __read_swap_cache_async() to
> allocate and
> > + * place the faulting page in the swapcache.
> This is
> > + * to handle a scenario where the faulting
> page in
> > + * this process happens to simultaneously be
> a
> > + * readahead page in another process. By
> placing the
> > + * single-mapped faulting page in the
> swapcache,
> > + * we avoid race conditions and duplicate
> page
> > + * allocations under these scenarios.
> > + */
> > + folio = swapin_readahead(entry,
> GFP_HIGHUSER_MOVABLE,
> > + vmf, true);
> > + if (!folio) {
> > + ret = VM_FAULT_OOM;
> > + goto out;
> > }
> > - need_clear_cache = true;
> > -
> > - mem_cgroup_swapin_uncharge_swap(entry,
> nr_pages);
> > -
> > - shadow =
> get_shadow_from_swap_cache(entry);
> > - if (shadow)
> > - workingset_refault(folio, shadow);
> > -
> > - folio_add_lru(folio);
> >
> > - /* To provide entry to swap_read_folio() */
> > - folio->swap = entry;
> > - swap_read_folio(folio, NULL, NULL, NULL);
> > - folio->private = NULL;
> > - }
> > + single_mapped_swapcache = true;
> > + nr_pages = folio_nr_pages(folio);
> > + swapcache = folio;
> > + } /* swapin with zswap support. */
> > } else {
> > folio = swapin_readahead(entry,
> GFP_HIGHUSER_MOVABLE,
> > - vmf);
> > + vmf, false);
> > swapcache = folio;
>
> I'm sorry, but making this function ever more complicated and ugly is
> not going to fly. The zswap special casing is quite ugly here as well.
>
> Is there a way forward that we can make this code actually readable and
> avoid zswap special casing?
Yes, I realize this is now quite cluttered. I need to think some more about
how to make this more readable, and would appreciate suggestions
towards this.
Thanks,
Kanchana
>
> --
> Cheers,
>
> David / dhildenb
^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface.
2024-10-18 17:21 ` Nhat Pham
@ 2024-10-18 21:59 ` Sridhar, Kanchana P
2024-10-20 16:50 ` Usama Arif
0 siblings, 1 reply; 15+ messages in thread
From: Sridhar, Kanchana P @ 2024-10-18 21:59 UTC (permalink / raw)
To: Nhat Pham, Usama Arif
Cc: David Hildenbrand, linux-kernel, linux-mm, hannes, yosryahmed,
chengming.zhou, ryan.roberts, Huang, Ying, 21cnbao, akpm, hughd,
willy, bfoster, dchinner, chrisl, Feghali, Wajdi K, Gopal,
Vinodh, Sridhar, Kanchana P
Hi Usama, Nhat,
> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Friday, October 18, 2024 10:21 AM
> To: Usama Arif <usamaarif642@gmail.com>
> Cc: David Hildenbrand <david@redhat.com>; Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com>; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; hannes@cmpxchg.org; yosryahmed@google.com;
> chengming.zhou@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> hughd@google.com; willy@infradead.org; bfoster@redhat.com;
> dchinner@redhat.com; chrisl@kernel.org; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls
> swapin_readahead() zswap load batching interface.
>
> On Fri, Oct 18, 2024 at 4:04 AM Usama Arif <usamaarif642@gmail.com>
> wrote:
> >
> >
> > On 18/10/2024 08:26, David Hildenbrand wrote:
> > > On 18.10.24 08:48, Kanchana P Sridhar wrote:
> > >> This patch invokes the swapin_readahead() based batching interface to
> > >> prefetch a batch of 4K folios for zswap load with batch decompressions
> > >> in parallel using IAA hardware. swapin_readahead() prefetches folios
> based
> > >> on vm.page-cluster and the usefulness of prior prefetches to the
> > >> workload. As folios are created in the swapcache and the readahead
> code
> > >> calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch",
> the
> > >> respective folio_batches get populated with the folios to be read.
> > >>
> > >> Finally, the swapin_readahead() procedures will call the newly added
> > >> process_ra_batch_of_same_type() which:
> > >>
> > >> 1) Reads all the non_zswap_batch folios sequentially by calling
> > >> swap_read_folio().
> > >> 2) Calls swap_read_zswap_batch_unplug() with the zswap_batch which
> calls
> > >> zswap_finish_load_batch() that finally decompresses each
> > >> SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. upto 8 pages in a
> prefetch
> > >> batch of say, 32 folios) in parallel with IAA.
> > >>
> > >> Within do_swap_page(), we try to benefit from batch decompressions in
> both
> > >> these scenarios:
> > >>
> > >> 1) single-mapped, SWP_SYNCHRONOUS_IO:
> > >> We call swapin_readahead() with "single_mapped_path = true". This
> is
> > >> done only in the !zswap_never_enabled() case.
> > >> 2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
> > >> We call swapin_readahead() with "single_mapped_path = false".
> > >>
> > >> This will place folios in the swapcache: a design choice that handles
> cases
> > >> where a folio that is "single-mapped" in process 1 could be prefetched in
> > >> process 2; and handles highly contended server scenarios with stability.
> > >> There are checks added at the end of do_swap_page(), after the folio has
> > >> been successfully loaded, to detect if the single-mapped swapcache folio
> is
> > >> still single-mapped, and if so, folio_free_swap() is called on the folio.
> > >>
> > >> Within the swapin_readahead() functions, if single_mapped_path is true,
> and
> > >> either the platform does not have IAA, or, if the platform has IAA and the
> > >> user selects a software compressor for zswap (details of sysfs knob
> > >> follow), readahead/batching are skipped and the folio is loaded using
> > >> zswap_load().
> > >>
> > >> A new swap parameter "singlemapped_ra_enabled" (false by default) is
> added
> > >> for platforms that have IAA, zswap_load_batching_enabled() is true, and
> we
> > >> want to give the user the option to run experiments with IAA and with
> > >> software compressors for zswap (swap device is
> SWP_SYNCHRONOUS_IO):
> > >>
> > >> For IAA:
> > >> echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
> > >>
> > >> For software compressors:
> > >> echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
> > >>
> > >> If "singlemapped_ra_enabled" is set to false, swapin_readahead() will
> skip
> > >> prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO"
> do_swap_page()
> > >> path.
> > >>
> > >> Thanks Ying Huang for the really helpful brainstorming discussions on the
> > >> swap_read_folio() plug design.
> > >>
> > >> Suggested-by: Ying Huang <ying.huang@intel.com>
> > >> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > >> ---
> > >> mm/memory.c | 187 +++++++++++++++++++++++++++++++++++++--
> ---------
> > >> mm/shmem.c | 2 +-
> > >> mm/swap.h | 12 ++--
> > >> mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++--
> --
> > >> mm/swapfile.c | 2 +-
> > >> 5 files changed, 299 insertions(+), 61 deletions(-)
> > >>
> > >> diff --git a/mm/memory.c b/mm/memory.c
> > >> index b5745b9ffdf7..9655b85fc243 100644
> > >> --- a/mm/memory.c
> > >> +++ b/mm/memory.c
> > >> @@ -3924,6 +3924,42 @@ static vm_fault_t
> remove_device_exclusive_entry(struct vm_fault *vmf)
> > >> return 0;
> > >> }
> > >> +/*
> > >> + * swapin readahead based batching interface for zswap batched loads
> using IAA:
> > >> + *
> > >> + * Should only be called for and if the faulting swap entry in
> do_swap_page
> > >> + * is single-mapped and SWP_SYNCHRONOUS_IO.
> > >> + *
> > >> + * Detect if the folio is in the swapcache, is still mapped to only this
> > >> + * process, and further, there are no additional references to this folio
> > >> + * (for e.g. if another process simultaneously readahead this swap entry
> > >> + * while this process was handling the page-fault, and got a pointer to
> the
> > >> + * folio allocated by this process in the swapcache), besides the
> references
> > >> + * that were obtained within __read_swap_cache_async() by this
> process that is
> > >> + * faulting in this single-mapped swap entry.
> > >> + */
> > >
> > > How is this supposed to work for large folios?
> > >
> >
> > Hi,
> >
> > I was looking at zswapin large folio support and have posted a RFC in [1].
> > I got bogged down with some prod stuff, so wasn't able to send it earlier.
> >
> > It looks quite different, and I think simpler from this series, so might be
> > a good comparison.
> >
> > [1] https://lore.kernel.org/all/20241018105026.2521366-1-
> usamaarif642@gmail.com/
> >
> > Thanks,
> > Usama
>
> I agree.
>
> I think the lower hanging fruit here is to build upon Usama's patch.
> Kanchana, do you think we can just use the new batch decompressing
> infrastructure, and apply it to Usama's large folio zswap loading?
>
> I'm not denying the readahead idea outright, but that seems much more
> complicated. There are questions regarding the benefits of
> readahead-ing when apply to zswap in the first place - IIUC, zram
> circumvents that logic in several cases, and zswap shares many
> characteristics with zram (fast, synchronous compression devices).
>
> So let's reap the low hanging fruits first, get the wins as well as
> stress test the new infrastructure. Then we can discuss the readahead
> idea later?
Thanks Usama for publishing the zswap large folios swapin series, and
thanks Nhat for your suggestions. Sure, I can look into integrating the
new batch decompressing infrastructure with Usama's large folio zswap
loading.
However, I think we need to get clarity on a bigger question: does it
make sense to swap in large folios? Some important considerations
would be:
1) What are the tradeoffs in the memory footprint cost of swapping in a
   large folio?
2) If we decide to let the user determine this by, say, an option that
   sets the swapin granularity (e.g. no more than 32k at a time),
   how does this constrain compression and zpool storage granularity?
Ultimately, I feel the bigger question is about the memory utilization
cost of large folio swapin. The swapin_readahead() based approach tries
to use the prefetch-usefulness characteristics of the workload to improve
the efficiency of loading multiple 4K folios, using strategies like
parallel decompression, and thereby strike a balance between memory
utilization and efficiency.
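As a rough illustration of what I mean by adapting to the workload: the
readahead window is bounded by vm.page-cluster (up to 2^page-cluster pages
per batch), so, assuming the usual sysctl interface, something like this
controls how many 4K folios a prefetch batch can contain:

    # 2^5 = up to 32 pages per readahead batch (prefetch-friendly workloads)
    sysctl -w vm.page-cluster=5
    # 2^0 = 1 page, i.e. readahead is effectively disabled
    sysctl -w vm.page-cluster=0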
Usama, I downloaded your patch-series and tried to understand this
better, and wanted to share the data.
I ran the kernel compilation "allmodconfig" with zstd, page-cluster=0,
and 16k/32k/64k large folios set to "always" (a sketch of this
configuration follows the table below):
16k/32k/64k folios: kernel compilation with zstd:
=================================================
------------------------------------------------------------------------------
                           mm-unstable-10-16-2024   + zswap large folios
                                                      swapin series
------------------------------------------------------------------------------
zswap compressor                             zstd                  zstd
vm.page-cluster                                 0                     0
------------------------------------------------------------------------------
real_sec                                   772.53                870.61
user_sec                                15,780.29             15,836.71
sys_sec                                  5,353.20              6,185.02
Max_Res_Set_Size_KB                     1,873,348             1,873,004
------------------------------------------------------------------------------
memcg_high                                      0                     0
memcg_swap_fail                                 0                     0
zswpout                                93,811,916           111,663,872
zswpin                                 27,150,029            54,730,678
pswpout                                        64                    59
pswpin                                         78                    53
thp_swpout                                      0                     0
thp_swpout_fallback                             0                     0
16kB-mthp_swpout_fallback                       0                     0
32kB-mthp_swpout_fallback                       0                     0
64kB-mthp_swpout_fallback                   5,470                     0
pgmajfault                             29,019,256            16,615,820
swap_ra                                         0                     0
swap_ra_hit                                 3,004                 3,614
ZSWPOUT-16kB                            1,324,160             2,252,747
ZSWPOUT-32kB                              730,534             1,356,640
ZSWPOUT-64kB                            3,039,760             3,955,034
ZSWPIN-16kB                                                    1,496,916
ZSWPIN-32kB                                                    1,131,176
ZSWPIN-64kB                                                    1,866,884
SWPOUT-16kB                                     0                     0
SWPOUT-32kB                                     0                     0
SWPOUT-64kB                                     4                     3
------------------------------------------------------------------------------
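For completeness, here is roughly how the runs were configured; this is
only a sketch assuming the usual sysctl/sysfs paths:

    # zswap with zstd, swap readahead disabled
    echo zstd > /sys/module/zswap/parameters/compressor
    sysctl -w vm.page-cluster=0
    # allow 16k/32k/64k mTHP allocations
    for sz in 16 32 64; do
        echo always > /sys/kernel/mm/transparent_hugepage/hugepages-${sz}kB/enabled
    done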
It does appear that there is considerably higher swapout and swapin
activity as a result of swapping in large folios, which ends up
impacting performance.
I would appreciate thoughts on understanding the usefulness of
swapping in large folios, given the considerations outlined earlier and
other factors.
Thanks,
Kanchana
* Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface.
2024-10-18 21:59 ` Sridhar, Kanchana P
@ 2024-10-20 16:50 ` Usama Arif
2024-10-20 20:12 ` Sridhar, Kanchana P
0 siblings, 1 reply; 15+ messages in thread
From: Usama Arif @ 2024-10-20 16:50 UTC (permalink / raw)
To: Sridhar, Kanchana P, Nhat Pham
Cc: David Hildenbrand, linux-kernel, linux-mm, hannes, yosryahmed,
chengming.zhou, ryan.roberts, Huang, Ying, 21cnbao, akpm, hughd,
willy, bfoster, dchinner, chrisl, Feghali, Wajdi K, Gopal,
Vinodh
On 18/10/2024 22:59, Sridhar, Kanchana P wrote:
> Hi Usama, Nhat,
>
>> -----Original Message-----
>> From: Nhat Pham <nphamcs@gmail.com>
>> Sent: Friday, October 18, 2024 10:21 AM
>> To: Usama Arif <usamaarif642@gmail.com>
>> Cc: David Hildenbrand <david@redhat.com>; Sridhar, Kanchana P
>> <kanchana.p.sridhar@intel.com>; linux-kernel@vger.kernel.org; linux-
>> mm@kvack.org; hannes@cmpxchg.org; yosryahmed@google.com;
>> chengming.zhou@linux.dev; ryan.roberts@arm.com; Huang, Ying
>> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
>> hughd@google.com; willy@infradead.org; bfoster@redhat.com;
>> dchinner@redhat.com; chrisl@kernel.org; Feghali, Wajdi K
>> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
>> Subject: Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls
>> swapin_readahead() zswap load batching interface.
>>
>> On Fri, Oct 18, 2024 at 4:04 AM Usama Arif <usamaarif642@gmail.com>
>> wrote:
>>>
>>>
>>> On 18/10/2024 08:26, David Hildenbrand wrote:
>>>> On 18.10.24 08:48, Kanchana P Sridhar wrote:
>>>>> This patch invokes the swapin_readahead() based batching interface to
>>>>> prefetch a batch of 4K folios for zswap load with batch decompressions
>>>>> in parallel using IAA hardware. swapin_readahead() prefetches folios
>> based
>>>>> on vm.page-cluster and the usefulness of prior prefetches to the
>>>>> workload. As folios are created in the swapcache and the readahead
>> code
>>>>> calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch",
>> the
>>>>> respective folio_batches get populated with the folios to be read.
>>>>>
>>>>> Finally, the swapin_readahead() procedures will call the newly added
>>>>> process_ra_batch_of_same_type() which:
>>>>>
>>>>> 1) Reads all the non_zswap_batch folios sequentially by calling
>>>>> swap_read_folio().
>>>>> 2) Calls swap_read_zswap_batch_unplug() with the zswap_batch which
>> calls
>>>>> zswap_finish_load_batch() that finally decompresses each
>>>>> SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. upto 8 pages in a
>> prefetch
>>>>> batch of say, 32 folios) in parallel with IAA.
>>>>>
>>>>> Within do_swap_page(), we try to benefit from batch decompressions in
>> both
>>>>> these scenarios:
>>>>>
>>>>> 1) single-mapped, SWP_SYNCHRONOUS_IO:
>>>>> We call swapin_readahead() with "single_mapped_path = true". This
>> is
>>>>> done only in the !zswap_never_enabled() case.
>>>>> 2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
>>>>> We call swapin_readahead() with "single_mapped_path = false".
>>>>>
>>>>> This will place folios in the swapcache: a design choice that handles
>> cases
>>>>> where a folio that is "single-mapped" in process 1 could be prefetched in
>>>>> process 2; and handles highly contended server scenarios with stability.
>>>>> There are checks added at the end of do_swap_page(), after the folio has
>>>>> been successfully loaded, to detect if the single-mapped swapcache folio
>> is
>>>>> still single-mapped, and if so, folio_free_swap() is called on the folio.
>>>>>
>>>>> Within the swapin_readahead() functions, if single_mapped_path is true,
>> and
>>>>> either the platform does not have IAA, or, if the platform has IAA and the
>>>>> user selects a software compressor for zswap (details of sysfs knob
>>>>> follow), readahead/batching are skipped and the folio is loaded using
>>>>> zswap_load().
>>>>>
>>>>> A new swap parameter "singlemapped_ra_enabled" (false by default) is
>> added
>>>>> for platforms that have IAA, zswap_load_batching_enabled() is true, and
>> we
>>>>> want to give the user the option to run experiments with IAA and with
>>>>> software compressors for zswap (swap device is
>> SWP_SYNCHRONOUS_IO):
>>>>>
>>>>> For IAA:
>>>>> echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
>>>>>
>>>>> For software compressors:
>>>>> echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
>>>>>
>>>>> If "singlemapped_ra_enabled" is set to false, swapin_readahead() will
>> skip
>>>>> prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO"
>> do_swap_page()
>>>>> path.
>>>>>
>>>>> Thanks Ying Huang for the really helpful brainstorming discussions on the
>>>>> swap_read_folio() plug design.
>>>>>
>>>>> Suggested-by: Ying Huang <ying.huang@intel.com>
>>>>> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
>>>>> ---
>>>>> mm/memory.c | 187 +++++++++++++++++++++++++++++++++++++--
>> ---------
>>>>> mm/shmem.c | 2 +-
>>>>> mm/swap.h | 12 ++--
>>>>> mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++--
>> --
>>>>> mm/swapfile.c | 2 +-
>>>>> 5 files changed, 299 insertions(+), 61 deletions(-)
>>>>>
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index b5745b9ffdf7..9655b85fc243 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -3924,6 +3924,42 @@ static vm_fault_t
>> remove_device_exclusive_entry(struct vm_fault *vmf)
>>>>> return 0;
>>>>> }
>>>>> +/*
>>>>> + * swapin readahead based batching interface for zswap batched loads
>> using IAA:
>>>>> + *
>>>>> + * Should only be called for and if the faulting swap entry in
>> do_swap_page
>>>>> + * is single-mapped and SWP_SYNCHRONOUS_IO.
>>>>> + *
>>>>> + * Detect if the folio is in the swapcache, is still mapped to only this
>>>>> + * process, and further, there are no additional references to this folio
>>>>> + * (for e.g. if another process simultaneously readahead this swap entry
>>>>> + * while this process was handling the page-fault, and got a pointer to
>> the
>>>>> + * folio allocated by this process in the swapcache), besides the
>> references
>>>>> + * that were obtained within __read_swap_cache_async() by this
>> process that is
>>>>> + * faulting in this single-mapped swap entry.
>>>>> + */
>>>>
>>>> How is this supposed to work for large folios?
>>>>
>>>
>>> Hi,
>>>
>>> I was looking at zswapin large folio support and have posted a RFC in [1].
>>> I got bogged down with some prod stuff, so wasn't able to send it earlier.
>>>
>>> It looks quite different, and I think simpler from this series, so might be
>>> a good comparison.
>>>
>>> [1] https://lore.kernel.org/all/20241018105026.2521366-1-
>> usamaarif642@gmail.com/
>>>
>>> Thanks,
>>> Usama
>>
>> I agree.
>>
>> I think the lower hanging fruit here is to build upon Usama's patch.
>> Kanchana, do you think we can just use the new batch decompressing
>> infrastructure, and apply it to Usama's large folio zswap loading?
>>
>> I'm not denying the readahead idea outright, but that seems much more
>> complicated. There are questions regarding the benefits of
>> readahead-ing when apply to zswap in the first place - IIUC, zram
>> circumvents that logic in several cases, and zswap shares many
>> characteristics with zram (fast, synchronous compression devices).
>>
>> So let's reap the low hanging fruits first, get the wins as well as
>> stress test the new infrastructure. Then we can discuss the readahead
>> idea later?
>
> Thanks Usama for publishing the zswap large folios swapin series, and
> thanks Nhat for your suggestions. Sure, I can look into integrating the
> new batch decompressing infrastructure with Usama's large folio zswap
> loading.
>
> However, I think we need to get clarity on a bigger question: does it
> make sense to swapin large folios? Some important considerations
> would be:
>
> 1) What are the tradeoffs in memory footprint cost of swapping in a
> large folio?
I would say the pros and cons of swapping in a large folio are the same as
the pros and cons of large folios in general.
As mentioned in my cover letter and in the series that introduced large
folios, you get fewer page faults, batched PTE and rmap manipulation, a
shorter LRU list, and TLB coalescing (on arm64 and AMD), at the cost of
higher memory usage and fragmentation.
The other thing is that the series I wrote is hopefully just a start.
As shown by Barry in the case of zram in
https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/
there is a significant improvement in CPU utilization and compression
ratios when compressing at a larger granularity. Hopefully we can try to
do something similar for zswap; I am not sure yet how that would look
for zswap, as I haven't started looking at it.
> 2) If we decide to let the user determine this by say, an option that
> determines the swapin granularity (e.g. no more than 32k at a time),
> how does this constrain compression and zpool storage granularity?
>
Right now, whether or not large folio zswapin happens is determined using
/sys/kernel/mm/transparent_hugepage/hugepages-*kB/enabled.
I assume that when someone sets some of these to "always", they know that
their workload works best with those page sizes, so they would want folios
to be swapped in and used at that size as well?
There might be some merit in adding something like
/sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled,
as you might start thrashing swap if you are, for example, swapping
in 1M folios and there isn't enough memory for them, which causes
you to swap out other folios in their place.
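Roughly what I have in mind (the swapin_enabled knob below is hypothetical,
it does not exist today):

    # existing per-size control for allocating 64k mTHPs
    echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
    # hypothetical: keep 64k allocations but fall back to 4K folios at swapin
    # echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/swapin_enabled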
> Ultimately, I feel the bigger question is about memory utilization cost
> of large folio swapin. The swapin_readahead() based approach tries to
> use the prefetch-usefulness characteristics of the workload to improve
> the efficiency of multiple 4k folios by using strategies like parallel
> decompression, to strike some balance in memory utilization vs.
> efficiency.
>
> Usama, I downloaded your patch-series and tried to understand this
> better, and wanted to share the data.
>
> I ran the kernel compilation "allmodconfig" with zstd, page-cluster=0,
> and 16k/32k/64k large folios enabled to "always":
>
> 16k/32k/64k folios: kernel compilation with zstd:
> =================================================
>
> ------------------------------------------------------------------------------
> mm-unstable-10-16-2024 + zswap large folios swapin
> series
> ------------------------------------------------------------------------------
> zswap compressor zstd zstd
> vm.page-cluster 0 0
> ------------------------------------------------------------------------------
> real_sec 772.53 870.61
> user_sec 15,780.29 15,836.71
> sys_sec 5,353.20 6,185.02
> Max_Res_Set_Size_KB 1,873,348 1,873,004
>
> ------------------------------------------------------------------------------
> memcg_high 0 0
> memcg_swap_fail 0 0
> zswpout 93,811,916 111,663,872
> zswpin 27,150,029 54,730,678
> pswpout 64 59
> pswpin 78 53
> thp_swpout 0 0
> thp_swpout_fallback 0 0
> 16kB-mthp_swpout_fallback 0 0
> 32kB-mthp_swpout_fallback 0 0
> 64kB-mthp_swpout_fallback 5,470 0
> pgmajfault 29,019,256 16,615,820
> swap_ra 0 0
> swap_ra_hit 3,004 3,614
> ZSWPOUT-16kB 1,324,160 2,252,747
> ZSWPOUT-32kB 730,534 1,356,640
> ZSWPOUT-64kB 3,039,760 3,955,034
> ZSWPIN-16kB 1,496,916
> ZSWPIN-32kB 1,131,176
> ZSWPIN-64kB 1,866,884
> SWPOUT-16kB 0 0
> SWPOUT-32kB 0 0
> SWPOUT-64kB 4 3
> ------------------------------------------------------------------------------
>
> It does appear like there is considerably higher swapout and swapin
> activity as a result of swapping in large folios, which does end up
> impacting performance.
Thanks for having a look!
I had only tested with the microbenchmark for the time taken to zswapin
that I included in my cover letter.
In general I expected zswap activity to go up, as you are more likely to
experience memory pressure when swapping in large folios, but in return
you get fewer page faults and the advantage of lower TLB pressure on AMD
and arm64.
Thanks for the test; those numbers look quite extreme! I think there is a
lot of swap thrashing.
I am assuming you are testing on an Intel machine, where you don't get the
advantage of fewer TLB misses from large folios. I will try to get an AMD
machine, which has TLB coalescing, or an ARM server with CONT_PTE, to see
if the numbers get better.
Maybe it might be better for large folio zswapin to be considered along
with larger granularity compression, to get all the benefits of large
folios (and hopefully better numbers). I think that was the approach taken
for zram as well.
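For what it's worth, a rough way to quantify the thrashing on a run like
this (assuming the standard /proc/vmstat counter names) is to snapshot the
swap counters before and after the run and diff them:

    grep -E 'pgmajfault|pswp(in|out)|zswp(in|out)|swap_ra' /proc/vmstat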
>
> I would appreciate thoughts on understanding the usefulness of
> swapping in large folios, with the considerations outlined earlier/other
> factors.
>
> Thanks,
> Kanchana
* RE: [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface.
2024-10-20 16:50 ` Usama Arif
@ 2024-10-20 20:12 ` Sridhar, Kanchana P
0 siblings, 0 replies; 15+ messages in thread
From: Sridhar, Kanchana P @ 2024-10-20 20:12 UTC (permalink / raw)
To: Usama Arif, Nhat Pham
Cc: David Hildenbrand, linux-kernel, linux-mm, hannes, yosryahmed,
chengming.zhou, ryan.roberts, Huang, Ying, 21cnbao, akpm, hughd,
willy, bfoster, dchinner, chrisl, Feghali, Wajdi K, Gopal,
Vinodh, Sridhar, Kanchana P
Hi Usama,
> -----Original Message-----
> From: Usama Arif <usamaarif642@gmail.com>
> Sent: Sunday, October 20, 2024 9:50 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; Nhat Pham
> <nphamcs@gmail.com>
> Cc: David Hildenbrand <david@redhat.com>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; hannes@cmpxchg.org; yosryahmed@google.com;
> chengming.zhou@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> hughd@google.com; willy@infradead.org; bfoster@redhat.com;
> dchinner@redhat.com; chrisl@kernel.org; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls
> swapin_readahead() zswap load batching interface.
>
>
>
> On 18/10/2024 22:59, Sridhar, Kanchana P wrote:
> > Hi Usama, Nhat,
> >
> >> -----Original Message-----
> >> From: Nhat Pham <nphamcs@gmail.com>
> >> Sent: Friday, October 18, 2024 10:21 AM
> >> To: Usama Arif <usamaarif642@gmail.com>
> >> Cc: David Hildenbrand <david@redhat.com>; Sridhar, Kanchana P
> >> <kanchana.p.sridhar@intel.com>; linux-kernel@vger.kernel.org; linux-
> >> mm@kvack.org; hannes@cmpxchg.org; yosryahmed@google.com;
> >> chengming.zhou@linux.dev; ryan.roberts@arm.com; Huang, Ying
> >> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org;
> >> hughd@google.com; willy@infradead.org; bfoster@redhat.com;
> >> dchinner@redhat.com; chrisl@kernel.org; Feghali, Wajdi K
> >> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> >> Subject: Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls
> >> swapin_readahead() zswap load batching interface.
> >>
> >> On Fri, Oct 18, 2024 at 4:04 AM Usama Arif <usamaarif642@gmail.com>
> >> wrote:
> >>>
> >>>
> >>> On 18/10/2024 08:26, David Hildenbrand wrote:
> >>>> On 18.10.24 08:48, Kanchana P Sridhar wrote:
> >>>>> This patch invokes the swapin_readahead() based batching interface to
> >>>>> prefetch a batch of 4K folios for zswap load with batch decompressions
> >>>>> in parallel using IAA hardware. swapin_readahead() prefetches folios
> >> based
> >>>>> on vm.page-cluster and the usefulness of prior prefetches to the
> >>>>> workload. As folios are created in the swapcache and the readahead
> >> code
> >>>>> calls swap_read_folio() with a "zswap_batch" and a
> "non_zswap_batch",
> >> the
> >>>>> respective folio_batches get populated with the folios to be read.
> >>>>>
> >>>>> Finally, the swapin_readahead() procedures will call the newly added
> >>>>> process_ra_batch_of_same_type() which:
> >>>>>
> >>>>> 1) Reads all the non_zswap_batch folios sequentially by calling
> >>>>> swap_read_folio().
> >>>>> 2) Calls swap_read_zswap_batch_unplug() with the zswap_batch
> which
> >> calls
> >>>>> zswap_finish_load_batch() that finally decompresses each
> >>>>> SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. upto 8 pages in a
> >> prefetch
> >>>>> batch of say, 32 folios) in parallel with IAA.
> >>>>>
> >>>>> Within do_swap_page(), we try to benefit from batch decompressions
> in
> >> both
> >>>>> these scenarios:
> >>>>>
> >>>>> 1) single-mapped, SWP_SYNCHRONOUS_IO:
> >>>>> We call swapin_readahead() with "single_mapped_path = true".
> This
> >> is
> >>>>> done only in the !zswap_never_enabled() case.
> >>>>> 2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
> >>>>> We call swapin_readahead() with "single_mapped_path = false".
> >>>>>
> >>>>> This will place folios in the swapcache: a design choice that handles
> >> cases
> >>>>> where a folio that is "single-mapped" in process 1 could be prefetched
> in
> >>>>> process 2; and handles highly contended server scenarios with
> stability.
> >>>>> There are checks added at the end of do_swap_page(), after the folio
> has
> >>>>> been successfully loaded, to detect if the single-mapped swapcache
> folio
> >> is
> >>>>> still single-mapped, and if so, folio_free_swap() is called on the folio.
> >>>>>
> >>>>> Within the swapin_readahead() functions, if single_mapped_path is
> true,
> >> and
> >>>>> either the platform does not have IAA, or, if the platform has IAA and
> the
> >>>>> user selects a software compressor for zswap (details of sysfs knob
> >>>>> follow), readahead/batching are skipped and the folio is loaded using
> >>>>> zswap_load().
> >>>>>
> >>>>> A new swap parameter "singlemapped_ra_enabled" (false by default)
> is
> >> added
> >>>>> for platforms that have IAA, zswap_load_batching_enabled() is true,
> and
> >> we
> >>>>> want to give the user the option to run experiments with IAA and with
> >>>>> software compressors for zswap (swap device is
> >> SWP_SYNCHRONOUS_IO):
> >>>>>
> >>>>> For IAA:
> >>>>> echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
> >>>>>
> >>>>> For software compressors:
> >>>>> echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
> >>>>>
> >>>>> If "singlemapped_ra_enabled" is set to false, swapin_readahead() will
> >> skip
> >>>>> prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO"
> >> do_swap_page()
> >>>>> path.
> >>>>>
> >>>>> Thanks Ying Huang for the really helpful brainstorming discussions on
> the
> >>>>> swap_read_folio() plug design.
> >>>>>
> >>>>> Suggested-by: Ying Huang <ying.huang@intel.com>
> >>>>> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> >>>>> ---
> >>>>> mm/memory.c | 187
> +++++++++++++++++++++++++++++++++++++--
> >> ---------
> >>>>> mm/shmem.c | 2 +-
> >>>>> mm/swap.h | 12 ++--
> >>>>> mm/swap_state.c | 157
> ++++++++++++++++++++++++++++++++++++--
> >> --
> >>>>> mm/swapfile.c | 2 +-
> >>>>> 5 files changed, 299 insertions(+), 61 deletions(-)
> >>>>>
> >>>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>>> index b5745b9ffdf7..9655b85fc243 100644
> >>>>> --- a/mm/memory.c
> >>>>> +++ b/mm/memory.c
> >>>>> @@ -3924,6 +3924,42 @@ static vm_fault_t
> >> remove_device_exclusive_entry(struct vm_fault *vmf)
> >>>>> return 0;
> >>>>> }
> >>>>> +/*
> >>>>> + * swapin readahead based batching interface for zswap batched
> loads
> >> using IAA:
> >>>>> + *
> >>>>> + * Should only be called for and if the faulting swap entry in
> >> do_swap_page
> >>>>> + * is single-mapped and SWP_SYNCHRONOUS_IO.
> >>>>> + *
> >>>>> + * Detect if the folio is in the swapcache, is still mapped to only this
> >>>>> + * process, and further, there are no additional references to this folio
> >>>>> + * (for e.g. if another process simultaneously readahead this swap
> entry
> >>>>> + * while this process was handling the page-fault, and got a pointer to
> >> the
> >>>>> + * folio allocated by this process in the swapcache), besides the
> >> references
> >>>>> + * that were obtained within __read_swap_cache_async() by this
> >> process that is
> >>>>> + * faulting in this single-mapped swap entry.
> >>>>> + */
> >>>>
> >>>> How is this supposed to work for large folios?
> >>>>
> >>>
> >>> Hi,
> >>>
> >>> I was looking at zswapin large folio support and have posted a RFC in [1].
> >>> I got bogged down with some prod stuff, so wasn't able to send it earlier.
> >>>
> >>> It looks quite different, and I think simpler from this series, so might be
> >>> a good comparison.
> >>>
> >>> [1] https://lore.kernel.org/all/20241018105026.2521366-1-
> >> usamaarif642@gmail.com/
> >>>
> >>> Thanks,
> >>> Usama
> >>
> >> I agree.
> >>
> >> I think the lower hanging fruit here is to build upon Usama's patch.
> >> Kanchana, do you think we can just use the new batch decompressing
> >> infrastructure, and apply it to Usama's large folio zswap loading?
> >>
> >> I'm not denying the readahead idea outright, but that seems much more
> >> complicated. There are questions regarding the benefits of
> >> readahead-ing when apply to zswap in the first place - IIUC, zram
> >> circumvents that logic in several cases, and zswap shares many
> >> characteristics with zram (fast, synchronous compression devices).
> >>
> >> So let's reap the low hanging fruits first, get the wins as well as
> >> stress test the new infrastructure. Then we can discuss the readahead
> >> idea later?
> >
> > Thanks Usama for publishing the zswap large folios swapin series, and
> > thanks Nhat for your suggestions. Sure, I can look into integrating the
> > new batch decompressing infrastructure with Usama's large folio zswap
> > loading.
> >
> > However, I think we need to get clarity on a bigger question: does it
> > make sense to swapin large folios? Some important considerations
> > would be:
> >
> > 1) What are the tradeoffs in memory footprint cost of swapping in a
> > large folio?
>
> I would say the pros and cons of swapping in a large folio are the same as
> the pros and cons of large folios in general.
> As mentioned in my cover letter and the series that introduced large folios
> you get fewer page faults, batched PTE and rmap manipulation, reduced lru
> list,
> TLB coalescing (for arm64 and AMD) at the cost of higher memory usage and
> fragmentation.
>
> The other thing is that the series I wrote is hopefully just a start.
> As shown by Barry in the case of zram in
> https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/
> there is a significant improvement in CPU utilization and compression
> ratios when compressing at large granularity. Hopefully we can
> try and do something similar for zswap. Not sure how that would look
> for zswap as I haven't started looking at that yet.
Thanks a lot for sharing your thoughts on this! Yes, this makes sense.
It was helpful to get your perspective on larger compression granularity
with large folios, since we have been trying to answer similar questions
ourselves. We look forward to collaborating with you on getting this
working for zswap!
>
> > 2) If we decide to let the user determine this by say, an option that
> > determines the swapin granularity (e.g. no more than 32k at a time),
> > how does this constrain compression and zpool storage granularity?
> >
> Right now whether or not zswapin happens is determined using
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/enabled
> I assume when the someone sets some of these to always, they know that
> their workload works best with those page sizes, so they would want folios
> to be swapped in and used at that size as well?
Sure, this rationale makes sense.
>
> There might be some merit in adding something like
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled,
> as you might start thrashing swap if you are for e.g. swapping
> in 1M folios and there isn't enough memory for it, which causes
> you to swapout another folio in its place.
Agreed.
>
> > Ultimately, I feel the bigger question is about memory utilization cost
> > of large folio swapin. The swapin_readahead() based approach tries to
> > use the prefetch-usefulness characteristics of the workload to improve
> > the efficiency of multiple 4k folios by using strategies like parallel
> > decompression, to strike some balance in memory utilization vs.
> > efficiency.
> >
> > Usama, I downloaded your patch-series and tried to understand this
> > better, and wanted to share the data.
> >
> > I ran the kernel compilation "allmodconfig" with zstd, page-cluster=0,
> > and 16k/32k/64k large folios enabled to "always":
> >
> > 16k/32k/64k folios: kernel compilation with zstd:
> > =================================================
> >
> > ------------------------------------------------------------------------------
> > mm-unstable-10-16-2024 + zswap large folios swapin
> > series
> > ------------------------------------------------------------------------------
> > zswap compressor zstd zstd
> > vm.page-cluster 0 0
> > ------------------------------------------------------------------------------
> > real_sec 772.53 870.61
> > user_sec 15,780.29 15,836.71
> > sys_sec 5,353.20 6,185.02
> > Max_Res_Set_Size_KB 1,873,348 1,873,004
> >
> > ------------------------------------------------------------------------------
> > memcg_high 0 0
> > memcg_swap_fail 0 0
> > zswpout 93,811,916 111,663,872
> > zswpin 27,150,029 54,730,678
> > pswpout 64 59
> > pswpin 78 53
> > thp_swpout 0 0
> > thp_swpout_fallback 0 0
> > 16kB-mthp_swpout_fallback 0 0
> > 32kB-mthp_swpout_fallback 0 0
> > 64kB-mthp_swpout_fallback 5,470 0
> > pgmajfault 29,019,256 16,615,820
> > swap_ra 0 0
> > swap_ra_hit 3,004 3,614
> > ZSWPOUT-16kB 1,324,160 2,252,747
> > ZSWPOUT-32kB 730,534 1,356,640
> > ZSWPOUT-64kB 3,039,760 3,955,034
> > ZSWPIN-16kB 1,496,916
> > ZSWPIN-32kB 1,131,176
> > ZSWPIN-64kB 1,866,884
> > SWPOUT-16kB 0 0
> > SWPOUT-32kB 0 0
> > SWPOUT-64kB 4 3
> > ------------------------------------------------------------------------------
> >
> > It does appear like there is considerably higher swapout and swapin
> > activity as a result of swapping in large folios, which does end up
> > impacting performance.
>
> Thanks for having a look!
> I had only tested with the microbenchmark for time taken to zswapin that I
> included in
> my coverletter.
> In general I expected zswap activity to go up as you are more likely to
> experience
> memory pressure when swapping in large folios, but in return get lower
> pagefaults
> and the advantage of lower TLB pressure in AMD and arm64.
>
> Thanks for the test, those number look quite extreme! I think there is a lot of
> swap
> thrashing.
> I am assuming you are testing on an intel machine, where you don't get the
> advantage
> of lower TLB misses of large folios, I will try and get an AMD machine which
> has
> TLB coalescing or an ARM server with CONT_PTE to see if the numbers get
> better.
You're right, these numbers were gathered on an Intel Sapphire Rapids
server. It would be great to see the kernel compilation/allmodconfig
behavior on the AMD/ARM systems; thanks for offering to check.
>
> Maybe it might be better for large folio zswapin to be considered along with
> larger granuality compression to get all the benefits of large folios (and
> hopefully better numbers.) I think that was the approach taken for zram as
> well.
Yes, I agree with you on this as well. We are planning to run experiments
with the by_n patch-series [1], Barry's zsmalloc multi-page patch-series [2]
posted earlier, and your zswapin large folio series. It would be great to
compare notes as we come to understand the overall workload behavior and
the trade-offs in memory vs. latency benefits of larger compression
granularity.
[1] https://lore.kernel.org/all/cover.1714581792.git.andre.glover@linux.intel.com/
[2] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
Thanks again,
Kanchana
>
> >
> > I would appreciate thoughts on understanding the usefulness of
> > swapping in large folios, with the considerations outlined earlier/other
> > factors.
> >
> > Thanks,
> > Kanchana
end of thread, newest: 2024-10-20 20:12 UTC
Thread overview: 15+ messages
2024-10-18 6:47 [RFC PATCH v1 0/7] zswap IAA decompress batching Kanchana P Sridhar
2024-10-18 6:47 ` [RFC PATCH v1 1/7] mm: zswap: Config variable to enable zswap loads with " Kanchana P Sridhar
2024-10-18 6:48 ` [RFC PATCH v1 2/7] mm: swap: Add IAA batch decompression API swap_crypto_acomp_decompress_batch() Kanchana P Sridhar
2024-10-18 6:48 ` [RFC PATCH v1 3/7] pagevec: struct folio_batch changes for decompress batching interface Kanchana P Sridhar
2024-10-18 6:48 ` [RFC PATCH v1 4/7] mm: swap: swap_read_folio() can add a folio to a folio_batch if it is in zswap Kanchana P Sridhar
2024-10-18 6:48 ` [RFC PATCH v1 5/7] mm: swap, zswap: zswap folio_batch processing with IAA decompression batching Kanchana P Sridhar
2024-10-18 6:48 ` [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface Kanchana P Sridhar
2024-10-18 7:26 ` David Hildenbrand
2024-10-18 11:04 ` Usama Arif
2024-10-18 17:21 ` Nhat Pham
2024-10-18 21:59 ` Sridhar, Kanchana P
2024-10-20 16:50 ` Usama Arif
2024-10-20 20:12 ` Sridhar, Kanchana P
2024-10-18 18:09 ` Sridhar, Kanchana P
2024-10-18 6:48 ` [RFC PATCH v1 7/7] mm: For IAA batching, reduce SWAP_BATCH and modify swap slot cache thresholds Kanchana P Sridhar