* [PATCH v2 0/4] mm: ZSWAP swap-out of mTHP folios
@ 2024-08-16  5:48 Kanchana P Sridhar
  2024-08-16  5:48 ` [PATCH v2 1/4] mm: zswap: zswap_is_folio_same_filled() takes an index in the folio Kanchana P Sridhar
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Kanchana P Sridhar @ 2024-08-16  5:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	ryan.roberts, ying.huang, 21cnbao, akpm
  Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Hi All,

This patch-series enables zswap_store() to accept and store mTHP
folios. The most significant contribution in this series is from the 
earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
migrated to v6.11-rc3 in patch 2/4 of this series.

[1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
     https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u

Additionally, there is an attempt to modularize some of the functionality
in zswap_store(), to make it more amenable to supporting any-order
mTHPs.

For instance, the same-filled check now takes an index into the folio to
locate the specific page on which to run the check. Likewise, a new
function "zswap_store_entry" is added to store a zswap_entry in the
xarray.

For accounting purposes, the patch-series adds per-order mTHP sysfs
"zswpout" counters that get incremented upon successful zswap_store of
an mTHP folio:

/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
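
For example, once this series is applied, the per-order counters can be
read with something like (an illustrative command, not part of this
posting):

    grep . /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout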

This patch-series is a precursor to ZSWAP compress batching of mTHP
swap-out, and decompress batching of swap-ins based on swapin_readahead(),
using Intel IAA hardware acceleration. We would like to submit these in
subsequent RFC patch-series, along with performance improvement data.

Thanks to Ying Huang for pre-posting review feedback and suggestions!

Changes since RFC v1:
=====================

1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
   Thanks Barry!
2) Addressed some of the code review comments that Nhat Pham provided in
   Ryan's initial RFC [1]:
   - Added a comment about the cgroup zswap limit checks occurring once per
     folio at the beginning of zswap_store().
     Nhat, Ryan, please do let me know if the comments convey the summary
     from the RFC discussion. Thanks!
   - Posted data on running the cgroup suite's zswap kselftest.
3) Rebased to v6.11-rc3.
4) Gathered performance data with usemem and the rebased patch-series.

Performance Testing:
====================
Testing of this patch-series was done with the v6.11-rc3 mainline, without
and with this patch-series, on an Intel Sapphire Rapids server,
dual-socket 56 cores per socket, 4 IAA devices per socket.

The system has 503 GiB RAM, 176 GiB swap/ZSWAP with ZRAM as the backing
swap device. Core frequency was fixed at 2500MHz.

The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 40G. Following a similar methodology as in Ryan Roberts'
"Swap-out mTHP without splitting" series [2], 70 usemem processes were
run, each allocating and writing 1G of memory:

    usemem --init-time -w -O -n 70 1g
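
A cgroup-v2 setup along the following lines matches the memory.high
constraint described above (an illustrative sketch; the cgroup name is
an assumption, not taken from the original runs):

    mkdir /sys/fs/cgroup/test-zswap-mthp
    echo 40G > /sys/fs/cgroup/test-zswap-mthp/memory.high
    echo $$ > /sys/fs/cgroup/test-zswap-mthp/cgroup.procs
    usemem --init-time -w -O -n 70 1g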

Other kernel configuration parameters:

    ZSWAP Compressor  : LZ4, DEFLATE-IAA
    ZSWAP Allocator   : ZSMALLOC
    ZRAM Compressor   : LZO-RLE
    SWAP page-cluster : 2
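
These correspond roughly to the following knobs (an illustrative sketch;
the zram device name and the setup order are assumptions, not commands
from the original runs):

    # ZSWAP compressor and allocator
    echo lz4      > /sys/module/zswap/parameters/compressor    # or deflate-iaa
    echo zsmalloc > /sys/module/zswap/parameters/zpool
    echo Y        > /sys/module/zswap/parameters/enabled

    # ZRAM backing swap device (set comp_algorithm before disksize)
    echo lzo-rle > /sys/block/zram0/comp_algorithm
    echo 176G    > /sys/block/zram0/disksize
    mkswap /dev/zram0 && swapon /dev/zram0

    # Swap readahead
    echo 2 > /proc/sys/vm/page-cluster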

In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
IAA "compression verification" is enabled. Hence, each IAA compression
is decompressed internally by the "iaa_crypto" driver, the CRCs returned
by the hardware are compared, and errors are reported in case of
mismatches. Thus, "deflate-iaa" helps ensure better data integrity as
compared to the software compressors.

Throughput reported by usemem and perf sys time for running the test
are as follows:

 64KB mTHP:
 ==========
  ------------------------------------------------------------------
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
 |                    |                   |       KB/s |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | ZRAM lzo-rle      |    118,928 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |     82,665 |       -30% |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |    176,210 |        48% |
 |------------------------------------------------------------------|
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
 |                    |                   |        sec |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | ZRAM lzo-rle      |   1,032.20 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |   1,854.51 |       -80% |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |     582.71 |        44% |
  ------------------------------------------------------------------

  -----------------------------------------------------------------------
 | VMSTATS, mTHP ZSWAP stats,   |  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
 | mTHP ZRAM stats:             |   mainline |       Store |       Store |
 |                              |            |         lz4 | deflate-iaa |
 |-----------------------------------------------------------------------|
 | pswpin                       |         16 |           0 |           0 |
 | pswpout                      |  7,770,720 |           0 |           0 |
 | zswpin                       |        547 |         695 |         579 |
 | zswpout                      |      1,394 |  15,462,778 |   7,284,554 |
 |-----------------------------------------------------------------------|
 | thp_swpout                   |          0 |           0 |           0 |
 | thp_swpout_fallback          |          0 |           0 |           0 |
 | pgmajfault                   |      3,786 |       3,541 |       3,367 |
 |-----------------------------------------------------------------------|
 | hugepages-64kB/stats/zswpout |            |     966,328 |     455,196 |
 |-----------------------------------------------------------------------|
 | hugepages-64kB/stats/swpout  |    485,670 |           0 |           0 |
  -----------------------------------------------------------------------


 2MB PMD-THP/2048K mTHP:
 =======================
  ------------------------------------------------------------------
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
 |                    |                   |       KB/s |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | ZRAM lzo-rle      |    177,340 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |     84,030 |       -53% |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |    185,691 |         5% |
 |------------------------------------------------------------------|
 |                    |                   |            |            |
 |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
 |                    |                   |        sec |            |
 |--------------------|-------------------|------------|------------|
 |v6.11-rc3 mainline  | ZRAM lzo-rle      |     876.29 |   Baseline |
 |zswap-mTHP-Store    | ZSWAP lz4         |   1,740.55 |       -99% |
 |zswap-mTHP-Store    | ZSWAP deflate-iaa |     650.33 |        26% |
  ------------------------------------------------------------------

  ------------------------------------------------------------------------- 
 | VMSTATS, mTHP ZSWAP stats,     |  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
 | mTHP ZRAM stats:               |   mainline |       Store |       Store |
 |                                |            |         lz4 | deflate-iaa |
 |-------------------------------------------------------------------------|
 | pswpin                         |          0 |           0 |           0 |
 | pswpout                        |  8,628,224 |           0 |           0 |
 | zswpin                         |        678 |      22,733 |       1,641 |
 | zswpout                        |      1,481 |  14,828,597 |   9,404,937 |
 |-------------------------------------------------------------------------|
 | thp_swpout                     |     16,852 |           0 |           0 |
 | thp_swpout_fallback            |          0 |           0 |           0 |
 | pgmajfault                     |      3,467 |      25,550 |       4,800 |
 |-------------------------------------------------------------------------|
 | hugepages-2048kB/stats/zswpout |            |      28,924 |      18,366 |
 |-------------------------------------------------------------------------|
 | hugepages-2048kB/stats/swpout  |     16,852 |           0 |           0 |
  -------------------------------------------------------------------------

As expected, in the "Before" experiment there is relatively less swapout
activity, because ZRAM utilization is not accounted in the cgroup.

With the introduction of mTHP zswap_store, the "After" data reflects the
higher swapout activity, and the consequent throughput/sys time
degradation, when LZ4 is used as the zswap compressor. However, we observe
considerable throughput and sys time improvement in the "After" data when
DEFLATE-IAA is the zswap compressor. This observation holds for both the
64K mTHP and 2MB THP experiments. IAA's higher compression ratio and lower
compression latency lead to fewer swap-outs and major page-faults, which
in turn result in better throughput and sys time.

Our goal is to improve ZSWAP mTHP store performance using batching. With
Intel IAA compress/decompress batching used in ZSWAP (to be submitted as
additional RFC series), we are able to demonstrate significant
performance improvements and memory savings with IAA as compared to
software compressors.

cgroup zswap kselftest:
=======================

"Before":
=========
  Test run with v6.11-rc3 and no code changes:
    mTHP 64K set to 'always'
    zswap compressor set to 'lz4'
    page-cluster = 3
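
  An equivalent configuration can be set roughly as follows (an
  illustrative sketch using the standard sysfs/procfs knobs, not
  commands from the original test log):

    echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
    echo lz4    > /sys/module/zswap/parameters/compressor
    echo 3      > /proc/sys/vm/page-cluster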

  zswap shrinker_enabled = N:
  ---------------------------
  ok 1 test_zswap_usage
  ok 2 test_swapin_nozswap
  # at least 24MB should be brought back from zswap
  not ok 3 test_zswapin
  # zswpwb_after is 0 while wb is enabled
  not ok 4 test_zswap_writeback_enabled
  # Failed to reclaim all of the requested memory
  not ok 5 test_zswap_writeback_disabled
  ok 6 # SKIP test_no_kmem_bypass
  ok 7 test_no_invasive_cgroup_shrink

  zswap shrinker_enabled = Y:
  ---------------------------
  ok 1 test_zswap_usage
  ok 2 test_swapin_nozswap
  # at least 24MB should be brought back from zswap
  not ok 3 test_zswapin
  # zswpwb_after is 0 while wb is enabled
  not ok 4 test_zswap_writeback_enabled
  # Failed to reclaim all of the requested memory
  not ok 5 test_zswap_writeback_disabled
  ok 6 # SKIP test_no_kmem_bypass
  not ok 7 test_no_invasive_cgroup_shrink

"After":
========
  Test run with this patch-series and v6.11-rc3:
    mTHP 64K set to 'always'
    zswap compressor set to 'deflate-iaa'
    page-cluster = 3

  zswap shrinker_enabled = N:
  ---------------------------
  ok 1 test_zswap_usage
  ok 2 test_swapin_nozswap
  ok 3 test_zswapin
  ok 4 test_zswap_writeback_enabled
  ok 5 test_zswap_writeback_disabled
  ok 6 # SKIP test_no_kmem_bypass
  ok 7 test_no_invasive_cgroup_shrink
  
  zswap shrinker_enabled = Y:
  ---------------------------
  ok 1 test_zswap_usage
  ok 2 test_swapin_nozswap
  # at least 24MB should be brought back from zswap
  not ok 3 test_zswapin
  ok 4 test_zswap_writeback_enabled
  ok 5 test_zswap_writeback_disabled
  ok 6 # SKIP test_no_kmem_bypass
  not ok 7 test_no_invasive_cgroup_shrink

I haven't taken an in-depth look into the cgroup zswap tests, but it
looks like the results with the patch-series are no worse than without,
and in some cases better (not exactly sure why, this needs more
analysis).

I would greatly appreciate your code review comments and suggestions!

Thanks,
Kanchana

[2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/


Kanchana P Sridhar (4):
  mm: zswap: zswap_is_folio_same_filled() takes an index in the folio.
  mm: zswap: zswap_store() extended to handle mTHP folios.
  mm: Add MTHP_STAT_ZSWPOUT to sysfs per-order mthp stats.
  mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP stats.

 include/linux/huge_mm.h |   1 +
 mm/huge_memory.c        |   2 +
 mm/page_io.c            |   7 ++
 mm/zswap.c              | 238 +++++++++++++++++++++++++++++-----------
 4 files changed, 184 insertions(+), 64 deletions(-)

-- 
2.27.0




* [PATCH v2 1/4] mm: zswap: zswap_is_folio_same_filled() takes an index in the folio.
  2024-08-16  5:48 [PATCH v2 0/4] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
@ 2024-08-16  5:48 ` Kanchana P Sridhar
  2024-08-16  5:48 ` [PATCH v2 2/4] mm: zswap: zswap_store() extended to handle mTHP folios Kanchana P Sridhar
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Kanchana P Sridhar @ 2024-08-16  5:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	ryan.roberts, ying.huang, 21cnbao, akpm
  Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This change is being made so that zswap_store() can process mTHP folios.

Modify zswap_is_folio_same_filled() to work for any-order folios by
accepting an additional "index" parameter that identifies the page
within the folio on which to run the same-filled check.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index adeaf9c97fde..6c5c656ec282 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1358,14 +1358,14 @@ static void shrink_worker(struct work_struct *w)
 /*********************************
 * same-filled functions
 **********************************/
-static bool zswap_is_folio_same_filled(struct folio *folio, unsigned long *value)
+static bool zswap_is_folio_same_filled(struct folio *folio, long index, unsigned long *value)
 {
 	unsigned long *data;
 	unsigned long val;
 	unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1;
 	bool ret = false;
 
-	data = kmap_local_folio(folio, 0);
+	data = kmap_local_folio(folio, index * PAGE_SIZE);
 	val = data[0];
 
 	if (val != data[last_pos])
@@ -1435,7 +1435,7 @@ bool zswap_store(struct folio *folio)
 		goto reject;
 	}
 
-	if (zswap_is_folio_same_filled(folio, &value)) {
+	if (zswap_is_folio_same_filled(folio, 0, &value)) {
 		entry->length = 0;
 		entry->value = value;
 		atomic_inc(&zswap_same_filled_pages);
-- 
2.27.0




* [PATCH v2 2/4] mm: zswap: zswap_store() extended to handle mTHP folios.
  2024-08-16  5:48 [PATCH v2 0/4] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
  2024-08-16  5:48 ` [PATCH v2 1/4] mm: zswap: zswap_is_folio_same_filled() takes an index in the folio Kanchana P Sridhar
@ 2024-08-16  5:48 ` Kanchana P Sridhar
  2024-08-16  5:48 ` [PATCH v2 3/4] mm: Add MTHP_STAT_ZSWPOUT to sysfs per-order mthp stats Kanchana P Sridhar
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Kanchana P Sridhar @ 2024-08-16  5:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	ryan.roberts, ying.huang, 21cnbao, akpm
  Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

zswap_store() will now process and store mTHP and PMD-size THP folios.

This change reuses and adapts the functionality in Ryan Roberts' RFC
patch [1]:

  "[RFC,v1] mm: zswap: Store large folios without splitting"

  [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u

This patch provides a sequential implementation of storing an mTHP in
zswap_store() by iterating through each page in the folio to compress
and store it in the zswap zpool.

Towards this goal, zswap_compress() is modified to take a page instead
of a folio as input.

Each page's swap offset is stored as a separate zswap entry.

If an error is encountered during the store of any page in the mTHP,
all previous pages/entries stored will be invalidated. Thus, an mTHP
is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.

This forms the basis for batching of pages during zswap store of large
folios, by compressing batches of up to, say, 8 pages of an mTHP in
parallel in hardware, using the Intel In-Memory Analytics Accelerator
(Intel IAA).

Also, addressed some of the RFC comments from the discussion in [1].

Co-developed-by: Ryan Roberts
Signed-off-by:
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 234 +++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 172 insertions(+), 62 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 6c5c656ec282..7a712be2f3cb 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -884,7 +884,7 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
 	return 0;
 }
 
-static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
+static bool zswap_compress(struct page *page, struct zswap_entry *entry)
 {
 	struct crypto_acomp_ctx *acomp_ctx;
 	struct scatterlist input, output;
@@ -902,7 +902,7 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
 
 	dst = acomp_ctx->buffer;
 	sg_init_table(&input, 1);
-	sg_set_folio(&input, folio, PAGE_SIZE, 0);
+	sg_set_page(&input, page, PAGE_SIZE, 0);
 
 	/*
 	 * We need PAGE_SIZE * 2 here since there maybe over-compression case,
@@ -1394,36 +1394,83 @@ static void zswap_fill_folio(struct folio *folio, unsigned long value)
 /*********************************
 * main API
 **********************************/
-bool zswap_store(struct folio *folio)
+
+/*
+ * Returns true if the entry was successfully
+ * stored in the xarray, and false otherwise.
+ */
+static bool zswap_store_entry(struct xarray *tree,
+			      struct zswap_entry *entry)
 {
-	swp_entry_t swp = folio->swap;
-	pgoff_t offset = swp_offset(swp);
-	struct xarray *tree = swap_zswap_tree(swp);
-	struct zswap_entry *entry, *old;
-	struct obj_cgroup *objcg = NULL;
-	struct mem_cgroup *memcg = NULL;
-	unsigned long value;
+	struct zswap_entry *old;
+	pgoff_t offset = swp_offset(entry->swpentry);
 
-	VM_WARN_ON_ONCE(!folio_test_locked(folio));
-	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
+	old = xa_store(tree, offset, entry, GFP_KERNEL);
 
-	/* Large folios aren't supported */
-	if (folio_test_large(folio))
+	if (xa_is_err(old)) {
+		int err = xa_err(old);
+
+		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
+		zswap_reject_alloc_fail++;
 		return false;
+	}
 
-	if (!zswap_enabled)
-		goto check_old;
+	/*
+	 * We may have had an existing entry that became stale when
+	 * the folio was redirtied and now the new version is being
+	 * swapped out. Get rid of the old.
+	 */
+	if (old)
+		zswap_entry_free(old);
 
-	/* Check cgroup limits */
-	objcg = get_obj_cgroup_from_folio(folio);
-	if (objcg && !obj_cgroup_may_zswap(objcg)) {
-		memcg = get_mem_cgroup_from_objcg(objcg);
-		if (shrink_memcg(memcg)) {
-			mem_cgroup_put(memcg);
-			goto reject;
-		}
-		mem_cgroup_put(memcg);
+	return true;
+}
+
+/*
+ * If the zswap store fails or zswap is disabled, we must invalidate the
+ * possibly stale entries which were previously stored at the offsets
+ * corresponding to each page of the folio. Otherwise, writeback could
+ * overwrite the new data in the swapfile.
+ *
+ * This is called after the store of the i-th offset in a large folio has
+ * failed. All zswap entries in the folio must be deleted. This helps make
+ * sure that a swapped-out mTHP is either entirely stored in zswap, or
+ * entirely not stored in zswap.
+ *
+ * This is also called if zswap_store() is invoked, but zswap is not enabled.
+ * All offsets for the folio are deleted from zswap in this case.
+ */
+static void zswap_delete_stored_offsets(struct xarray *tree,
+					pgoff_t offset,
+					long nr_pages)
+{
+	struct zswap_entry *entry;
+	long i;
+
+	for (i = 0; i < nr_pages; ++i) {
+		entry = xa_erase(tree, offset + i);
+		if (entry)
+			zswap_entry_free(entry);
 	}
+}
+
+/*
+ * Stores the page at specified "index" in a folio.
+ */
+static bool zswap_store_page(struct folio *folio, long index,
+			     struct obj_cgroup *objcg,
+			     struct zswap_pool *pool)
+{
+	swp_entry_t swp = folio->swap;
+	int type = swp_type(swp);
+	pgoff_t offset = swp_offset(swp) + index;
+	struct page *page = folio_page(folio, index);
+	struct xarray *tree = swap_zswap_tree(swp);
+	struct zswap_entry *entry;
+	unsigned long value;
+
+	if (objcg)
+		obj_cgroup_get(objcg);
 
 	if (zswap_check_limits())
 		goto reject;
@@ -1435,7 +1482,7 @@ bool zswap_store(struct folio *folio)
 		goto reject;
 	}
 
-	if (zswap_is_folio_same_filled(folio, 0, &value)) {
+	if (zswap_is_folio_same_filled(folio, index, &value)) {
 		entry->length = 0;
 		entry->value = value;
 		atomic_inc(&zswap_same_filled_pages);
@@ -1443,42 +1490,20 @@ bool zswap_store(struct folio *folio)
 	}
 
 	/* if entry is successfully added, it keeps the reference */
-	entry->pool = zswap_pool_current_get();
-	if (!entry->pool)
+	if (!zswap_pool_get(pool))
 		goto freepage;
 
-	if (objcg) {
-		memcg = get_mem_cgroup_from_objcg(objcg);
-		if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
-			mem_cgroup_put(memcg);
-			goto put_pool;
-		}
-		mem_cgroup_put(memcg);
-	}
+	entry->pool = pool;
 
-	if (!zswap_compress(folio, entry))
+	if (!zswap_compress(page, entry))
 		goto put_pool;
 
 store_entry:
-	entry->swpentry = swp;
+	entry->swpentry = swp_entry(type, offset);
 	entry->objcg = objcg;
 
-	old = xa_store(tree, offset, entry, GFP_KERNEL);
-	if (xa_is_err(old)) {
-		int err = xa_err(old);
-
-		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
-		zswap_reject_alloc_fail++;
+	if (!zswap_store_entry(tree, entry))
 		goto store_failed;
-	}
-
-	/*
-	 * We may have had an existing entry that became stale when
-	 * the folio was redirtied and now the new version is being
-	 * swapped out. Get rid of the old.
-	 */
-	if (old)
-		zswap_entry_free(old);
 
 	if (objcg) {
 		obj_cgroup_charge_zswap(objcg, entry->length);
@@ -1512,7 +1537,7 @@ bool zswap_store(struct folio *folio)
 	else {
 		zpool_free(entry->pool->zpool, entry->handle);
 put_pool:
-		zswap_pool_put(entry->pool);
+		zswap_pool_put(pool);
 	}
 freepage:
 	zswap_entry_cache_free(entry);
@@ -1520,16 +1545,101 @@ bool zswap_store(struct folio *folio)
 	obj_cgroup_put(objcg);
 	if (zswap_pool_reached_full)
 		queue_work(shrink_wq, &zswap_shrink_work);
-check_old:
+
+	return false;
+}
+
+/*
+ * Modified to store mTHP folios. Each page in the mTHP will be compressed
+ * and stored sequentially.
+ */
+bool zswap_store(struct folio *folio)
+{
+	long nr_pages = folio_nr_pages(folio);
+	swp_entry_t swp = folio->swap;
+	pgoff_t offset = swp_offset(swp);
+	struct xarray *tree = swap_zswap_tree(swp);
+	struct obj_cgroup *objcg = NULL;
+	struct mem_cgroup *memcg = NULL;
+	struct zswap_pool *pool;
+	bool ret = false;
+	long index;
+
+	VM_WARN_ON_ONCE(!folio_test_locked(folio));
+	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
+
+	if (!zswap_enabled)
+		goto reject;
+
 	/*
-	 * If the zswap store fails or zswap is disabled, we must invalidate the
-	 * possibly stale entry which was previously stored at this offset.
-	 * Otherwise, writeback could overwrite the new data in the swapfile.
+	 * Check cgroup limits:
+	 *
+	 * The cgroup zswap limit check is done once at the beginning of an
+	 * mTHP store, and not within zswap_store_page() for each page
+	 * in the mTHP. We do however check the zswap pool limits at the
+	 * start of zswap_store_page(). What this means is, the cgroup
+	 * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
+	 * However, the per-store-page zswap pool limits check should
+	 * hopefully trigger the cgroup aware and zswap LRU aware global
+	 * reclaim implemented in the shrinker. If this assumption holds,
+	 * the cgroup exceeding the zswap limits could potentially be
+	 * resolved before the next zswap_store, and if it is not, the next
+	 * zswap_store would fail the cgroup zswap limit check at the start.
 	 */
-	entry = xa_erase(tree, offset);
-	if (entry)
-		zswap_entry_free(entry);
-	return false;
+	objcg = get_obj_cgroup_from_folio(folio);
+	if (objcg && !obj_cgroup_may_zswap(objcg)) {
+		memcg = get_mem_cgroup_from_objcg(objcg);
+		if (shrink_memcg(memcg)) {
+			mem_cgroup_put(memcg);
+			goto put_objcg;
+		}
+		mem_cgroup_put(memcg);
+	}
+
+	if (zswap_check_limits())
+		goto put_objcg;
+
+	pool = zswap_pool_current_get();
+	if (!pool)
+		goto put_objcg;
+
+	if (objcg) {
+		memcg = get_mem_cgroup_from_objcg(objcg);
+		if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
+			mem_cgroup_put(memcg);
+			goto put_pool;
+		}
+		mem_cgroup_put(memcg);
+	}
+
+	/*
+	 * Store each page of the folio as a separate entry. If we fail to store
+	 * a page, unwind by removing all the previous pages we stored.
+	 */
+	for (index = 0; index < nr_pages; ++index) {
+		if (!zswap_store_page(folio, index, objcg, pool))
+			goto put_pool;
+	}
+
+	ret = true;
+
+put_pool:
+	zswap_pool_put(pool);
+put_objcg:
+	obj_cgroup_put(objcg);
+	if (zswap_pool_reached_full)
+		queue_work(shrink_wq, &zswap_shrink_work);
+reject:
+	/*
+	 * If the zswap store fails or zswap is disabled, we must invalidate
+	 * the possibly stale entries which were previously stored at the
+	 * offsets corresponding to each page of the folio. Otherwise,
+	 * writeback could overwrite the new data in the swapfile.
+	 */
+	if (!ret)
+		zswap_delete_stored_offsets(tree, offset, nr_pages);
+
+	return ret;
 }
 
 bool zswap_load(struct folio *folio)
-- 
2.27.0




* [PATCH v2 3/4] mm: Add MTHP_STAT_ZSWPOUT to sysfs per-order mthp stats.
  2024-08-16  5:48 [PATCH v2 0/4] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
  2024-08-16  5:48 ` [PATCH v2 1/4] mm: zswap: zswap_is_folio_same_filled() takes an index in the folio Kanchana P Sridhar
  2024-08-16  5:48 ` [PATCH v2 2/4] mm: zswap: zswap_store() extended to handle mTHP folios Kanchana P Sridhar
@ 2024-08-16  5:48 ` Kanchana P Sridhar
  2024-08-16  5:48 ` [PATCH v2 4/4] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP stats Kanchana P Sridhar
  2024-08-16  9:02 ` [PATCH v2 0/4] mm: ZSWAP swap-out of mTHP folios Huang, Ying
  4 siblings, 0 replies; 9+ messages in thread
From: Kanchana P Sridhar @ 2024-08-16  5:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	ryan.roberts, ying.huang, 21cnbao, akpm
  Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Add a new MTHP_STAT_ZSWPOUT entry to the sysfs mTHP stats so that
per-order mTHP folio ZSWAP stores can be accounted.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 include/linux/huge_mm.h | 1 +
 mm/huge_memory.c        | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e25d9ebfdf89..44609d84f2dd 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -273,6 +273,7 @@ enum mthp_stat_item {
 	MTHP_STAT_ANON_FAULT_ALLOC,
 	MTHP_STAT_ANON_FAULT_FALLBACK,
 	MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
+	MTHP_STAT_ZSWPOUT,
 	MTHP_STAT_SWPOUT,
 	MTHP_STAT_SWPOUT_FALLBACK,
 	MTHP_STAT_SHMEM_ALLOC,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f4be468e06a4..7e97b6ed6ff1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -574,6 +574,7 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
 DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
 DEFINE_MTHP_STAT_ATTR(swpout, MTHP_STAT_SWPOUT);
 DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK);
 DEFINE_MTHP_STAT_ATTR(shmem_alloc, MTHP_STAT_SHMEM_ALLOC);
@@ -587,6 +588,7 @@ static struct attribute *stats_attrs[] = {
 	&anon_fault_alloc_attr.attr,
 	&anon_fault_fallback_attr.attr,
 	&anon_fault_fallback_charge_attr.attr,
+	&zswpout_attr.attr,
 	&swpout_attr.attr,
 	&swpout_fallback_attr.attr,
 	&shmem_alloc_attr.attr,
-- 
2.27.0




* [PATCH v2 4/4] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP stats.
  2024-08-16  5:48 [PATCH v2 0/4] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
                   ` (2 preceding siblings ...)
  2024-08-16  5:48 ` [PATCH v2 3/4] mm: Add MTHP_STAT_ZSWPOUT to sysfs per-order mthp stats Kanchana P Sridhar
@ 2024-08-16  5:48 ` Kanchana P Sridhar
  2024-08-16 12:44   ` kernel test robot
  2024-08-16 17:03   ` kernel test robot
  2024-08-16  9:02 ` [PATCH v2 0/4] mm: ZSWAP swap-out of mTHP folios Huang, Ying
  4 siblings, 2 replies; 9+ messages in thread
From: Kanchana P Sridhar @ 2024-08-16  5:48 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	ryan.roberts, ying.huang, 21cnbao, akpm
  Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

If zswap_store() successfully stores an mTHP, it will be
counted under the per-order sysfs "zswpout" stats:

/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout

Other block dev/fs mTHP swap-out events will be counted under
the existing sysfs "swpout" stats:

/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/page_io.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/page_io.c b/mm/page_io.c
index ff8c99ee3af7..debd04fbdfd0 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -172,6 +172,12 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
 	goto out;
 }
 
+static inline void count_mthp_zswpout_vm_event(struct folio *folio)
+{
+	if (IS_ENABLED(CONFIG_THP_SWAP))
+		count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
+}
+
 /*
  * We may have stale swap cache pages in memory: notice
  * them here and get rid of the unnecessary final write.
@@ -196,6 +202,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 		return ret;
 	}
 	if (zswap_store(folio)) {
+		count_mthp_zswpout_vm_event(folio);
 		folio_unlock(folio);
 		return 0;
 	}
-- 
2.27.0




* Re: [PATCH v2 0/4] mm: ZSWAP swap-out of mTHP folios
  2024-08-16  5:48 [PATCH v2 0/4] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
                   ` (3 preceding siblings ...)
  2024-08-16  5:48 ` [PATCH v2 4/4] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP stats Kanchana P Sridhar
@ 2024-08-16  9:02 ` Huang, Ying
  2024-08-16 17:50   ` Sridhar, Kanchana P
  4 siblings, 1 reply; 9+ messages in thread
From: Huang, Ying @ 2024-08-16  9:02 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	ryan.roberts, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali,
	vinodh.gopal

Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes:

> Hi All,
>
> This patch-series enables zswap_store() to accept and store mTHP
> folios. The most significant contribution in this series is from the 
> earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> migrated to v6.11-rc3 in patch 2/4 of this series.
>
> [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
>      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Additionally, there is an attempt to modularize some of the functionality
> in zswap_store(), to make it more amenable to supporting any-order
> mTHPs.
>
> For instance, the determination of whether a folio is same-filled is
> based on mapping an index into the folio to derive the page. Likewise,
> there is a function "zswap_store_entry" added to store a zswap_entry in
> the xarray.
>
> For accounting purposes, the patch-series adds per-order mTHP sysfs
> "zswpout" counters that get incremented upon successful zswap_store of
> an mTHP folio:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
>
> This patch-series is a precursor to ZSWAP compress batching of mTHP
> swap-out and decompress batching of swap-ins based on swapin_readahead(),
> using Intel IAA hardware acceleration, which we would like to submit in
> subsequent RFC patch-series, with performance improvement data.
>
> Thanks to Ying Huang for pre-posting review feedback and suggestions!
>
> Changes since RFC v1:
> =====================
>
> 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
>    Thanks Barry!
> 2) Addressed some of the code review comments that Nhat Pham provided in
>    Ryan's initial RFC [1]:
>    - Added a comment about the cgroup zswap limit checks occuring once per
>      folio at the beginning of zswap_store().
>      Nhat, Ryan, please do let me know if the comments convey the summary
>      from the RFC discussion. Thanks!
>    - Posted data on running the cgroup suite's zswap kselftest.
> 3) Rebased to v6.11-rc3.
> 4) Gathered performance data with usemem and the rebased patch-series.
>
> Performance Testing:
> ====================
> Testing of this patch-series was done with the v6.11-rc3 mainline, without
> and with this patch-series, on an Intel Sapphire Rapids server,
> dual-socket 56 cores per socket, 4 IAA devices per socket.
>
> The system has 503 GiB RAM, 176 GiB swap/ZSWAP with ZRAM as the backing
> swap device. Core frequency was fixed at 2500MHz.

I don't think that this is a reasonable test configuration, there's no
benefit to use ZSWAP+ZRAM.  We should use a normal SSD as backing swap
device.

> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 40G. Following a similar methodology as in Ryan Roberts'
> "Swap-out mTHP without splitting" series [2], 70 usemem processes were
> run, each allocating and writing 1G of memory:
>
>     usemem --init-time -w -O -n 70 1g
>
> Other kernel configuration parameters:
>
>     ZSWAP Compressor  : LZ4, DEFLATE-IAA
>     ZSWAP Allocator   : ZSMALLOC
>     ZRAM Compressor   : LZO-RLE
>     SWAP page-cluster : 2
>
> In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> IAA "compression verification" is enabled. Hence each IAA compression
> will be decompressed internally by the "iaa_crypto" driver, the crc-s
> returned by the hardware will be compared and errors reported in case of
> mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> compared to the software compressors.
>
> Throughput reported by usemem and perf sys time for running the test
> are as follows:
>
>  64KB mTHP:
>  ==========
>   ------------------------------------------------------------------
>  |                    |                   |            |            |
>  |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
>  |                    |                   |       KB/s |            |
>  |--------------------|-------------------|------------|------------|
>  |v6.11-rc3 mainline  | ZRAM lzo-rle      |    118,928 |   Baseline |
>  |zswap-mTHP-Store    | ZSWAP lz4         |     82,665 |       -30% |

Because the test configuration isn't reasonable, the performance drop
isn't meaningful either.  We should compare zswap+SSD w/o mTHP zswap
against zswap+SSD w/ mTHP zswap.  I think that there should be a
performance improvement in that case.

>  |zswap-mTHP-Store    | ZSWAP deflate-iaa |    176,210 |        48% |
>  |------------------------------------------------------------------|
>  |                    |                   |            |            |
>  |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
>  |                    |                   |        sec |            |
>  |--------------------|-------------------|------------|------------|
>  |v6.11-rc3 mainline  | ZRAM lzo-rle      |   1,032.20 |   Baseline |
>  |zswap-mTHP=Store    | ZSWAP lz4         |   1,854.51 |       -80% |
>  |zswap-mTHP-Store    | ZSWAP deflate-iaa |     582.71 |        44% |
>   ------------------------------------------------------------------
>
>   -----------------------------------------------------------------------
>  | VMSTATS, mTHP ZSWAP stats,   |  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
>  | mTHP ZRAM stats:             |   mainline |       Store |       Store |
>  |                              |            |         lz4 | deflate-iaa |
>  |-----------------------------------------------------------------------|
>  | pswpin                       |         16 |           0 |           0 |
>  | pswpout                      |  7,770,720 |           0 |           0 |
>  | zswpin                       |        547 |         695 |         579 |
>  | zswpout                      |      1,394 |  15,462,778 |   7,284,554 |
>  |-----------------------------------------------------------------------|
>  | thp_swpout                   |          0 |           0 |           0 |
>  | thp_swpout_fallback          |          0 |           0 |           0 |
>  | pgmajfault                   |      3,786 |       3,541 |       3,367 |
>  |-----------------------------------------------------------------------|
>  | hugepages-64kB/stats/zswpout |            |     966,328 |     455,196 |
>  |-----------------------------------------------------------------------|
>  | hugepages-64kB/stats/swpout  |    485,670 |           0 |           0 |
>   -----------------------------------------------------------------------
>
>
>  2MB PMD-THP/2048K mTHP:
>  =======================
>   ------------------------------------------------------------------
>  |                    |                   |            |            |
>  |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
>  |                    |                   |       KB/s |            |
>  |--------------------|-------------------|------------|------------|
>  |v6.11-rc3 mainline  | ZRAM lzo-rle      |    177,340 |   Baseline |
>  |zswap-mTHP-Store    | ZSWAP lz4         |     84,030 |       -53% |
>  |zswap-mTHP-Store    | ZSWAP deflate-iaa |    185,691 |         5% |
>  |------------------------------------------------------------------|
>  |                    |                   |            |            |
>  |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
>  |                    |                   |        sec |            |
>  |--------------------|-------------------|------------|------------|
>  |v6.11-rc3 mainline  | ZRAM lzo-rle      |     876.29 |   Baseline |
>  |zswap-mTHP-Store    | ZSWAP lz4         |   1,740.55 |       -99% |
>  |zswap-mTHP-Store    | ZSWAP deflate-iaa |     650.33 |        26% |
>   ------------------------------------------------------------------
>
>   ------------------------------------------------------------------------- 
>  | VMSTATS, mTHP ZSWAP stats,     |  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
>  | mTHP ZRAM stats:               |   mainline |       Store |       Store |
>  |                                |            |         lz4 | deflate-iaa |
>  |-------------------------------------------------------------------------|
>  | pswpin                         |          0 |           0 |           0 |
>  | pswpout                        |  8,628,224 |           0 |           0 |
>  | zswpin                         |        678 |      22,733 |       1,641 |
>  | zswpout                        |      1,481 |  14,828,597 |   9,404,937 |
>  |-------------------------------------------------------------------------|
>  | thp_swpout                     |     16,852 |           0 |           0 |
>  | thp_swpout_fallback            |          0 |           0 |           0 |
>  | pgmajfault                     |      3,467 |      25,550 |       4,800 |
>  |-------------------------------------------------------------------------|
>  | hugepages-2048kB/stats/zswpout |            |      28,924 |      18,366 |
>  |-------------------------------------------------------------------------|
>  | hugepages-2048kB/stats/swpout  |     16,852 |           0 |           0 |
>   -------------------------------------------------------------------------
>
> As expected, in the "Before" experiment, there are relatively fewer
> swapouts because ZRAM utilization is not accounted in the cgroup.
>
> With the introduction of zswap_store mTHP, the "After" data reflects the
> higher swapout activity, and consequent throughput/sys time degradation
> when LZ4 is used as the zswap compressor. However, we observe considerable
> throughput and sys time improvement in the "After" data when DEFLATE-IAA
> is the zswap compressor. This observation holds for 64K mTHP and 2MB THP
> experiments. IAA's higher compression ratio and better compress latency
> can be attributed to fewer swap-outs and major page-faults, that result
> in better throughput and sys time.
>
> Our goal is to improve ZSWAP mTHP store performance using batching. With
> Intel IAA compress/decompress batching used in ZSWAP (to be submitted as
> additional RFC series), we are able to demonstrate significant
> performance improvements and memory savings with IAA as compared to
> software compressors.
>
> cgroup zswap kselftest:
> =======================
>
> "Before":
> =========
>   Test run with v6.11-rc3 and no code changes:
>     mTHP 64K set to 'always'
>     zswap compressor set to 'lz4'
>     page-cluster = 3
>
>   zswap shrinker_enabled = N:
>   ---------------------------
>   ok 1 test_zswap_usage
>   ok 2 test_swapin_nozswap
>   # at least 24MB should be brought back from zswap
>   not ok 3 test_zswapin
>   # zswpwb_after is 0 while wb is enablednot ok 4 test_zswap_writeback_enabled
>   # Failed to reclaim all of the requested memory
>   not ok 5 test_zswap_writeback_disabled
>   ok 6 # SKIP test_no_kmem_bypass
>   ok 7 test_no_invasive_cgroup_shrink
>
>   zswap shrinker_enabled = Y:
>   ---------------------------
>   ok 1 test_zswap_usage
>   ok 2 test_swapin_nozswap
>   # at least 24MB should be brought back from zswap
>   not ok 3 test_zswapin
>   # zswpwb_after is 0 while wb is enablednot ok 4 test_zswap_writeback_enabled
>   # Failed to reclaim all of the requested memory
>   not ok 5 test_zswap_writeback_disabled
>   ok 6 # SKIP test_no_kmem_bypass
>   not ok 7 test_no_invasive_cgroup_shrink
>
> "After":
> ========
>   Test run with this patch-series and v6.11-rc3:
>     mTHP 64K set to 'always'
>     zswap compressor set to 'deflate-iaa'
>     page-cluster = 3
>
>   zswap shrinker_enabled = N:
>   ---------------------------
>   ok 1 test_zswap_usage
>   ok 2 test_swapin_nozswap
>   ok 3 test_zswapin
>   ok 4 test_zswap_writeback_enabled
>   ok 5 test_zswap_writeback_disabled
>   ok 6 # SKIP test_no_kmem_bypass
>   ok 7 test_no_invasive_cgroup_shrink
>   
>   zswap shrinker_enabled = Y:
>   ---------------------------
>   ok 1 test_zswap_usage
>   ok 2 test_swapin_nozswap
>   # at least 24MB should be brought back from zswap
>   not ok 3 test_zswapin
>   ok 4 test_zswap_writeback_enabled
>   ok 5 test_zswap_writeback_disabled
>   ok 6 # SKIP test_no_kmem_bypass
>   not ok 7 test_no_invasive_cgroup_shrink
>
> I haven't taken an in-depth look into the cgroup zswap tests, but it
> looks like the results with the patch-series are no worse than without,
> and in some cases better (not exactly sure why, this needs more
> analysis).
>
> I would greatly appreciate your code review comments and suggestions!
>
> Thanks,
> Kanchana
>
> [2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
>
>
> Kanchana P Sridhar (4):
>   mm: zswap: zswap_is_folio_same_filled() takes an index in the folio.
>   mm: zswap: zswap_store() extended to handle mTHP folios.
>   mm: Add MTHP_STAT_ZSWPOUT to sysfs per-order mthp stats.
>   mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP stats.
>
>  include/linux/huge_mm.h |   1 +
>  mm/huge_memory.c        |   2 +
>  mm/page_io.c            |   7 ++
>  mm/zswap.c              | 238 +++++++++++++++++++++++++++++-----------
>  4 files changed, 184 insertions(+), 64 deletions(-)

--
Best Regards,
Huang, Ying



* Re: [PATCH v2 4/4] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP stats.
  2024-08-16  5:48 ` [PATCH v2 4/4] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP stats Kanchana P Sridhar
@ 2024-08-16 12:44   ` kernel test robot
  2024-08-16 17:03   ` kernel test robot
  1 sibling, 0 replies; 9+ messages in thread
From: kernel test robot @ 2024-08-16 12:44 UTC (permalink / raw)
  To: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, yosryahmed,
	nphamcs, ryan.roberts, ying.huang, 21cnbao, akpm
  Cc: oe-kbuild-all, nanhai.zou, wajdi.k.feghali, vinodh.gopal,
	kanchana.p.sridhar

Hi Kanchana,

kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on v6.11-rc3]
[cannot apply to akpm-mm/mm-everything next-20240816]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Kanchana-P-Sridhar/mm-zswap-zswap_is_folio_same_filled-takes-an-index-in-the-folio/20240816-134948
base:   linus/master
patch link:    https://lore.kernel.org/r/20240816054805.5201-5-kanchana.p.sridhar%40intel.com
patch subject: [PATCH v2 4/4] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP stats.
config: i386-buildonly-randconfig-005-20240816 (https://download.01.org/0day-ci/archive/20240816/202408162056.MQnreTgK-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240816/202408162056.MQnreTgK-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202408162056.MQnreTgK-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/page_io.c: In function 'count_mthp_zswpout_vm_event':
>> mm/page_io.c:178:17: error: implicit declaration of function 'count_mthp_stat' [-Werror=implicit-function-declaration]
     178 |                 count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
         |                 ^~~~~~~~~~~~~~~
>> mm/page_io.c:178:53: error: 'MTHP_STAT_ZSWPOUT' undeclared (first use in this function)
     178 |                 count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
         |                                                     ^~~~~~~~~~~~~~~~~
   mm/page_io.c:178:53: note: each undeclared identifier is reported only once for each function it appears in
   cc1: some warnings being treated as errors


vim +/count_mthp_stat +178 mm/page_io.c

   174	
   175	static inline void count_mthp_zswpout_vm_event(struct folio *folio)
   176	{
   177		if (IS_ENABLED(CONFIG_THP_SWAP))
 > 178			count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
   179	}
   180	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



* Re: [PATCH v2 4/4] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP stats.
  2024-08-16  5:48 ` [PATCH v2 4/4] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP stats Kanchana P Sridhar
  2024-08-16 12:44   ` kernel test robot
@ 2024-08-16 17:03   ` kernel test robot
  1 sibling, 0 replies; 9+ messages in thread
From: kernel test robot @ 2024-08-16 17:03 UTC (permalink / raw)
  To: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, yosryahmed,
	nphamcs, ryan.roberts, ying.huang, 21cnbao, akpm
  Cc: llvm, oe-kbuild-all, nanhai.zou, wajdi.k.feghali, vinodh.gopal,
	kanchana.p.sridhar

Hi Kanchana,

kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on v6.11-rc3]
[cannot apply to akpm-mm/mm-everything next-20240816]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Kanchana-P-Sridhar/mm-zswap-zswap_is_folio_same_filled-takes-an-index-in-the-folio/20240816-134948
base:   linus/master
patch link:    https://lore.kernel.org/r/20240816054805.5201-5-kanchana.p.sridhar%40intel.com
patch subject: [PATCH v2 4/4] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP stats.
config: arm-randconfig-002-20240816 (https://download.01.org/0day-ci/archive/20240817/202408170059.sq8QoVWB-lkp@intel.com/config)
compiler: clang version 20.0.0git (https://github.com/llvm/llvm-project 26670e7fa4f032a019d23d56c6a02926e854e8af)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240817/202408170059.sq8QoVWB-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202408170059.sq8QoVWB-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from mm/page_io.c:14:
   In file included from include/linux/mm.h:2228:
   include/linux/vmstat.h:514:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     514 |         return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
         |                               ~~~~~~~~~~~ ^ ~~~
>> mm/page_io.c:178:3: error: call to undeclared function 'count_mthp_stat'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
     178 |                 count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
         |                 ^
>> mm/page_io.c:178:39: error: use of undeclared identifier 'MTHP_STAT_ZSWPOUT'
     178 |                 count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
         |                                                     ^
   1 warning and 2 errors generated.


vim +/count_mthp_stat +178 mm/page_io.c

   174	
   175	static inline void count_mthp_zswpout_vm_event(struct folio *folio)
   176	{
   177		if (IS_ENABLED(CONFIG_THP_SWAP))
 > 178			count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
   179	}
   180	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



* RE: [PATCH v2 0/4] mm: ZSWAP swap-out of mTHP folios
  2024-08-16  9:02 ` [PATCH v2 0/4] mm: ZSWAP swap-out of mTHP folios Huang, Ying
@ 2024-08-16 17:50   ` Sridhar, Kanchana P
  0 siblings, 0 replies; 9+ messages in thread
From: Sridhar, Kanchana P @ 2024-08-16 17:50 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	ryan.roberts, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K,
	Gopal, Vinodh, Sridhar, Kanchana P

Hi Ying,

> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Friday, August 16, 2024 2:03 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> ryan.roberts@arm.com; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v2 0/4] mm: ZSWAP swap-out of mTHP folios
> 
> Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes:
> 
> > Hi All,
> >
> > This patch-series enables zswap_store() to accept and store mTHP
> > folios. The most significant contribution in this series is from the
> > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> > migrated to v6.11-rc3 in patch 2/4 of this series.
> >
> > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> >      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-
> ryan.roberts@arm.com/T/#u
> >
> > Additionally, there is an attempt to modularize some of the functionality
> > in zswap_store(), to make it more amenable to supporting any-order
> > mTHPs.
> >
> > For instance, the determination of whether a folio is same-filled is
> > based on mapping an index into the folio to derive the page. Likewise,
> > there is a function "zswap_store_entry" added to store a zswap_entry in
> > the xarray.
> >
> > For accounting purposes, the patch-series adds per-order mTHP sysfs
> > "zswpout" counters that get incremented upon successful zswap_store of
> > an mTHP folio:
> >
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> >
> > This patch-series is a precursor to ZSWAP compress batching of mTHP
> > swap-out and decompress batching of swap-ins based on
> swapin_readahead(),
> > using Intel IAA hardware acceleration, which we would like to submit in
> > subsequent RFC patch-series, with performance improvement data.
> >
> > Thanks to Ying Huang for pre-posting review feedback and suggestions!
> >
> > Changes since RFC v1:
> > =====================
> >
> > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> >    Thanks Barry!
> > 2) Addressed some of the code review comments that Nhat Pham provided
> in
> >    Ryan's initial RFC [1]:
> >    - Added a comment about the cgroup zswap limit checks occuring once
> per
> >      folio at the beginning of zswap_store().
> >      Nhat, Ryan, please do let me know if the comments convey the summary
> >      from the RFC discussion. Thanks!
> >    - Posted data on running the cgroup suite's zswap kselftest.
> > 3) Rebased to v6.11-rc3.
> > 4) Gathered performance data with usemem and the rebased patch-series.
> >
> > Performance Testing:
> > ====================
> > Testing of this patch-series was done with the v6.11-rc3 mainline, without
> > and with this patch-series, on an Intel Sapphire Rapids server,
> > dual-socket 56 cores per socket, 4 IAA devices per socket.
> >
> > The system has 503 GiB RAM, 176 GiB swap/ZSWAP with ZRAM as the
> backing
> > swap device. Core frequency was fixed at 2500MHz.
> 
> I don't think that this is a reasonable test configuration, there's no
> benefit to use ZSWAP+ZRAM.  We should use a normal SSD as backing swap
> device.

Thanks for this suggestion. Sure, I will gather data using SSD instead of ZRAM
as the backing swap device.

> 
> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > was fixed at 40G. Following a similar methodology as in Ryan Roberts'
> > "Swap-out mTHP without splitting" series [2], 70 usemem processes were
> > run, each allocating and writing 1G of memory:
> >
> >     usemem --init-time -w -O -n 70 1g
> >
> > Other kernel configuration parameters:
> >
> >     ZSWAP Compressor  : LZ4, DEFLATE-IAA
> >     ZSWAP Allocator   : ZSMALLOC
> >     ZRAM Compressor   : LZO-RLE
> >     SWAP page-cluster : 2
> >
> > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> > IAA "compression verification" is enabled. Hence each IAA compression
> > will be decompressed internally by the "iaa_crypto" driver, the crc-s
> > returned by the hardware will be compared and errors reported in case of
> > mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> > compared to the software compressors.
> >
> > Throughput reported by usemem and perf sys time for running the test
> > are as follows:
> >
> >  64KB mTHP:
> >  ==========
> >   ------------------------------------------------------------------
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
> >  |                    |                   |       KB/s |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | ZRAM lzo-rle      |    118,928 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |     82,665 |       -30% |
> 
> Because the test configuration isn't reasonable, the performance drop
> isn't meaningful either.  We should compare zswap+SSD w/o mTHP zswap
> against zswap+SSD w/ mTHP zswap.  I think that there should be a
> performance improvement in that case.

Sure, I will gather and post the data with these two configurations.

Thanks,
Kanchana

> 
> >  |zswap-mTHP-Store    | ZSWAP deflate-iaa |    176,210 |        48% |
> >  |------------------------------------------------------------------|
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
> >  |                    |                   |        sec |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | ZRAM lzo-rle      |   1,032.20 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |   1,854.51 |       -80% |
> >  |zswap-mTHP-Store    | ZSWAP deflate-iaa |     582.71 |        44% |
> >   ------------------------------------------------------------------
> >
> >   -----------------------------------------------------------------------
> >  | VMSTATS, mTHP ZSWAP stats,   |  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
> >  | mTHP ZRAM stats:             |   mainline |       Store |       Store |
> >  |                              |            |         lz4 | deflate-iaa |
> >  |-----------------------------------------------------------------------|
> >  | pswpin                       |         16 |           0 |           0 |
> >  | pswpout                      |  7,770,720 |           0 |           0 |
> >  | zswpin                       |        547 |         695 |         579 |
> >  | zswpout                      |      1,394 |  15,462,778 |   7,284,554 |
> >  |-----------------------------------------------------------------------|
> >  | thp_swpout                   |          0 |           0 |           0 |
> >  | thp_swpout_fallback          |          0 |           0 |           0 |
> >  | pgmajfault                   |      3,786 |       3,541 |       3,367 |
> >  |-----------------------------------------------------------------------|
> >  | hugepages-64kB/stats/zswpout |            |     966,328 |     455,196 |
> >  |-----------------------------------------------------------------------|
> >  | hugepages-64kB/stats/swpout  |    485,670 |           0 |           0 |
> >   -----------------------------------------------------------------------
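
For reference, a simple hedged way to capture the counters shown above
before and after a run (deltas taken externally; the per-order zswpout stat
is the one added by this series, the rest are standard vmstat/sysfs fields):

    grep -E '^(pswpin|pswpout|zswpin|zswpout|thp_swpout|thp_swpout_fallback|pgmajfault) ' /proc/vmstat
    cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/zswpout
    cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout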
> >
> >
> >  2MB PMD-THP/2048K mTHP:
> >  =======================
> >   ------------------------------------------------------------------
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     | Throughput | Improvement|
> >  |                    |                   |       KB/s |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | ZRAM lzo-rle      |    177,340 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |     84,030 |       -53% |
> >  |zswap-mTHP-Store    | ZSWAP deflate-iaa |    185,691 |         5% |
> >  |------------------------------------------------------------------|
> >  |                    |                   |            |            |
> >  |Kernel              | mTHP SWAP-OUT     |   Sys time | Improvement|
> >  |                    |                   |        sec |            |
> >  |--------------------|-------------------|------------|------------|
> >  |v6.11-rc3 mainline  | ZRAM lzo-rle      |     876.29 |   Baseline |
> >  |zswap-mTHP-Store    | ZSWAP lz4         |   1,740.55 |       -99% |
> >  |zswap-mTHP-Store    | ZSWAP deflate-iaa |     650.33 |        26% |
> >   ------------------------------------------------------------------
> >
> >   -------------------------------------------------------------------------
> >  | VMSTATS, mTHP ZSWAP stats,     |  v6.11-rc3 |  zswap-mTHP |  zswap-mTHP |
> >  | mTHP ZRAM stats:               |   mainline |       Store |       Store |
> >  |                                |            |         lz4 | deflate-iaa |
> >  |-------------------------------------------------------------------------|
> >  | pswpin                         |          0 |           0 |           0 |
> >  | pswpout                        |  8,628,224 |           0 |           0 |
> >  | zswpin                         |        678 |      22,733 |       1,641 |
> >  | zswpout                        |      1,481 |  14,828,597 |   9,404,937 |
> >  |-------------------------------------------------------------------------|
> >  | thp_swpout                     |     16,852 |           0 |           0 |
> >  | thp_swpout_fallback            |          0 |           0 |           0 |
> >  | pgmajfault                     |      3,467 |      25,550 |       4,800 |
> >  |-------------------------------------------------------------------------|
> >  | hugepages-2048kB/stats/zswpout |            |      28,924 |      18,366 |
> >  |-------------------------------------------------------------------------|
> >  | hugepages-2048kB/stats/swpout  |     16,852 |           0 |           0 |
> >   -------------------------------------------------------------------------
> >
> > As expected, in the "Before" experiment, there are relatively fewer
> > swapouts because ZRAM utilization is not accounted in the cgroup.
> >
> > With the introduction of zswap_store mTHP, the "After" data reflects the
> > higher swapout activity, and consequent throughput/sys time degradation
> > when LZ4 is used as the zswap compressor. However, we observe considerable
> > throughput and sys time improvement in the "After" data when DEFLATE-IAA
> > is the zswap compressor. This observation holds for both the 64K mTHP and
> > 2MB THP experiments. IAA's higher compression ratio and better compression
> > latency lead to fewer swap-outs and major page-faults, which result in
> > better throughput and sys time.
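
A rough, hedged way to gauge the effective zswap compression ratio during
such a run (debugfs attribute names as of v6.11, assuming debugfs is mounted
at /sys/kernel/debug and 4 KiB pages):

    stored=$(cat /sys/kernel/debug/zswap/stored_pages)
    pool=$(cat /sys/kernel/debug/zswap/pool_total_size)
    awk -v s="$stored" -v p="$pool" 'BEGIN { printf "ratio: %.2f\n", s * 4096 / p }'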
> >
> > Our goal is to improve ZSWAP mTHP store performance using batching. With
> > Intel IAA compress/decompress batching used in ZSWAP (to be submitted as
> > additional RFC series), we are able to demonstrate significant
> > performance improvements and memory savings with IAA as compared to
> > software compressors.
> >
> > cgroup zswap kselftest:
> > =======================
> >
> > "Before":
> > =========
> >   Test run with v6.11-rc3 and no code changes:
> >     mTHP 64K set to 'always'
> >     zswap compressor set to 'lz4'
> >     page-cluster = 3
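
A hedged sketch of reproducing this setup before running the cgroup zswap
selftest (standard sysfs/procfs paths; kselftest location as in the kernel
source tree):

    echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
    echo lz4    > /sys/module/zswap/parameters/compressor
    echo 3      > /proc/sys/vm/page-cluster
    echo N      > /sys/module/zswap/parameters/shrinker_enabled
    make -C tools/testing/selftests/cgroup
    ./tools/testing/selftests/cgroup/test_zswap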
> >
> >   zswap shrinker_enabled = N:
> >   ---------------------------
> >   ok 1 test_zswap_usage
> >   ok 2 test_swapin_nozswap
> >   # at least 24MB should be brought back from zswap
> >   not ok 3 test_zswapin
> >   # zswpwb_after is 0 while wb is enabled
> >   not ok 4 test_zswap_writeback_enabled
> >   # Failed to reclaim all of the requested memory
> >   not ok 5 test_zswap_writeback_disabled
> >   ok 6 # SKIP test_no_kmem_bypass
> >   ok 7 test_no_invasive_cgroup_shrink
> >
> >   zswap shrinker_enabled = Y:
> >   ---------------------------
> >   ok 1 test_zswap_usage
> >   ok 2 test_swapin_nozswap
> >   # at least 24MB should be brought back from zswap
> >   not ok 3 test_zswapin
> >   # zswpwb_after is 0 while wb is enabled
> >   not ok 4 test_zswap_writeback_enabled
> >   # Failed to reclaim all of the requested memory
> >   not ok 5 test_zswap_writeback_disabled
> >   ok 6 # SKIP test_no_kmem_bypass
> >   not ok 7 test_no_invasive_cgroup_shrink
> >
> > "After":
> > ========
> >   Test run with this patch-series and v6.11-rc3:
> >     mTHP 64K set to 'always'
> >     zswap compressor set to 'deflate-iaa'
> >     page-cluster = 3
> >
> >   zswap shrinker_enabled = N:
> >   ---------------------------
> >   ok 1 test_zswap_usage
> >   ok 2 test_swapin_nozswap
> >   ok 3 test_zswapin
> >   ok 4 test_zswap_writeback_enabled
> >   ok 5 test_zswap_writeback_disabled
> >   ok 6 # SKIP test_no_kmem_bypass
> >   ok 7 test_no_invasive_cgroup_shrink
> >
> >   zswap shrinker_enabled = Y:
> >   ---------------------------
> >   ok 1 test_zswap_usage
> >   ok 2 test_swapin_nozswap
> >   # at least 24MB should be brought back from zswap
> >   not ok 3 test_zswapin
> >   ok 4 test_zswap_writeback_enabled
> >   ok 5 test_zswap_writeback_disabled
> >   ok 6 # SKIP test_no_kmem_bypass
> >   not ok 7 test_no_invasive_cgroup_shrink
> >
> > I haven't taken an in-depth look into the cgroup zswap tests, but it
> > looks like the results with the patch-series are no worse than without,
> > and in some cases better (I'm not exactly sure why; this needs more
> > analysis).
> >
> > I would greatly appreciate your code review comments and suggestions!
> >
> > Thanks,
> > Kanchana
> >
> > [2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
> >
> >
> > Kanchana P Sridhar (4):
> >   mm: zswap: zswap_is_folio_same_filled() takes an index in the folio.
> >   mm: zswap: zswap_store() extended to handle mTHP folios.
> >   mm: Add MTHP_STAT_ZSWPOUT to sysfs per-order mthp stats.
> >   mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP stats.
> >
> >  include/linux/huge_mm.h |   1 +
> >  mm/huge_memory.c        |   2 +
> >  mm/page_io.c            |   7 ++
> >  mm/zswap.c              | 238 +++++++++++++++++++++++++++++-----------
> >  4 files changed, 184 insertions(+), 64 deletions(-)
> 
> --
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2024-08-16 17:50 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-16  5:48 [PATCH v2 0/4] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
2024-08-16  5:48 ` [PATCH v2 1/4] mm: zswap: zswap_is_folio_same_filled() takes an index in the folio Kanchana P Sridhar
2024-08-16  5:48 ` [PATCH v2 2/4] mm: zswap: zswap_store() extended to handle mTHP folios Kanchana P Sridhar
2024-08-16  5:48 ` [PATCH v2 3/4] mm: Add MTHP_STAT_ZSWPOUT to sysfs per-order mthp stats Kanchana P Sridhar
2024-08-16  5:48 ` [PATCH v2 4/4] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP stats Kanchana P Sridhar
2024-08-16 12:44   ` kernel test robot
2024-08-16 17:03   ` kernel test robot
2024-08-16  9:02 ` [PATCH v2 0/4] mm: ZSWAP swap-out of mTHP folios Huang, Ying
2024-08-16 17:50   ` Sridhar, Kanchana P
