* [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
From: Kanchana P Sridhar @ 2024-08-28 9:35 UTC
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
ryan.roberts, ying.huang, 21cnbao, akpm
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
Hi All,
This patch-series enables zswap_store() to accept and store mTHP
folios. The most significant contribution in this series is from the
earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
migrated to v6.11-rc3 in patch 2/3 of this series.
[1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
Additionally, there is an attempt to modularize some of the functionality
in zswap_store(), to make it more amenable to supporting any-order
mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
delete all offsets corresponding to a higher order folio stored in zswap.
For accounting purposes, the patch-series adds per-order mTHP sysfs
"zswpout" counters that get incremented upon successful zswap_store of
an mTHP folio:
/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
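As an illustration of how the per-order counter above could be consumed, here is a minimal user-space sketch. The helper names mthp_zswpout_path() and read_mthp_zswpout() are hypothetical and not part of this series; only the sysfs path layout is taken from the text above, and the file is assumed to hold a single unsigned decimal value.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the per-size "zswpout" counter path described above.
 * size_kb is the mTHP size in kB (e.g. 64 or 2048). Returns the number
 * of characters written, or -1 if the buffer is too small.
 */
int mthp_zswpout_path(char *buf, size_t len, unsigned int size_kb)
{
	int n = snprintf(buf, len,
			 "/sys/kernel/mm/transparent_hugepage/hugepages-%ukB/stats/zswpout",
			 size_kb);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Read the counter for one mTHP size. Returns 0 and fills *val on
 * success; returns -1 if the file is missing or unreadable (for
 * example on a kernel without this series).
 */
int read_mthp_zswpout(unsigned int size_kb, unsigned long long *val)
{
	char path[128];
	FILE *f;
	int ok;

	assert(val != NULL);
	if (mthp_zswpout_path(path, sizeof(path), size_kb) < 0)
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;

	/* Assumption: the file contains one unsigned decimal number. */
	ok = (fscanf(f, "%llu", val) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

Calling read_mthp_zswpout(64, &val) before and after a test run and taking the difference would give the number of 64kB folios stored in zswap during that run.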
This patch-series is a precursor to ZSWAP compress batching of mTHP
swap-out and decompress batching of swap-ins based on swapin_readahead(),
using Intel IAA hardware acceleration, which we would like to submit in
subsequent RFC patch-series, with performance improvement data.
Thanks to Ying Huang for pre-posting review feedback and suggestions!
Changes since v4:
=================
1) Published before/after data with zstd, as suggested by Nhat (Thanks
Nhat for the data reviews!).
2) Rebased to mm-unstable from 8/27/2024,
commit b659edec079c90012cf8d05624e312d1062b8b87.
3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
robot; as per Nhat's and Michal's suggestion to not require a separate
patch to fix the build errors (thanks both!).
4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
suggested by Yosry (Thanks Yosry!).
5) Squashed the commits that define new mthp zswpout stat counters, and
invoke count_mthp_stat() after successful zswap_store()s; into a single
commit. Thanks Yosry for this suggestion!
Changes since v3:
=================
1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
Thanks to Barry for suggesting aligning with Ryan Roberts' latest
changes to count_mthp_stat() so that it's always defined, even when THP
is disabled. Barry, I have also made one other change in page_io.c
where count_mthp_stat() is called by count_swpout_vm_event(). I would
appreciate it if you can review this. Thanks!
Hopefully this should resolve the kernel robot build errors.
Changes since v2:
=================
1) Gathered usemem data using SSD as the backing swap device for zswap,
as suggested by Ying Huang. Ying, I would appreciate it if you can
review the latest data. Thanks!
2) Generated the base commit info in the patches to attempt to address
the kernel test robot build errors.
3) No code changes to the individual patches themselves.
Changes since RFC v1:
=====================
1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
Thanks Barry!
2) Addressed some of the code review comments that Nhat Pham provided in
Ryan's initial RFC [1]:
- Added a comment about the cgroup zswap limit checks occurring once per
folio at the beginning of zswap_store().
Nhat, Ryan, please do let me know if the comments convey the summary
from the RFC discussion. Thanks!
- Posted data on running the cgroup suite's zswap kselftest.
3) Rebased to v6.11-rc3.
4) Gathered performance data with usemem and the rebased patch-series.
Performance Testing:
====================
Testing of this patch-series was done with the v6.11-rc3 mainline, without
and with this patch-series, on an Intel Sapphire Rapids server,
dual-socket 56 cores per socket, 4 IAA devices per socket.
The system has 503 GiB RAM, with 176 GiB ZRAM (35% of available RAM) as the
backing swap device for ZSWAP. zstd is configured as the ZRAM compressor.
Core frequency was fixed at 2500 MHz.
The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 40G. There is no swap limit set for the cgroup. Following a
similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
series [2], 70 usemem processes were run, each allocating and writing 1G of
memory:
usemem --init-time -w -O -n 70 1g
The vm/sysfs mTHP stats included with the performance data provide details
on the swapout activity to ZSWAP/swap.
Other kernel configuration parameters:
ZSWAP Compressors : zstd, deflate-iaa
ZSWAP Allocator : zsmalloc
SWAP page-cluster : 2
In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
IAA "compression verification" is enabled. Hence each IAA compression
is decompressed internally by the "iaa_crypto" driver, the CRCs
returned by the hardware are compared, and errors are reported in case
of mismatches. Thus "deflate-iaa" helps ensure better data integrity as
compared to the software compressors.
Throughput is derived by averaging the individual 70 processes' throughputs
reported by usemem. sys time is measured with perf. All data points are
averaged across 3 runs.
64KB mTHP (cgroup memory.high set to 40G):
==========================================
------------------------------------------------------------------------------
                    v6.11-rc3 mainline          zswap-mTHP         Change wrt
                              Baseline                               Baseline
------------------------------------------------------------------------------
ZSWAP compressor        zstd  deflate-      zstd  deflate-     zstd  deflate-
                                   iaa                 iaa                iaa
------------------------------------------------------------------------------
Throughput (KB/s)    161,496   156,343   140,363   151,938     -13%       -3%
sys time (sec)        771.68    802.08    954.85    735.47     -24%        8%
memcg_high           111,223   110,889   138,651   133,884
memcg_swap_high              0         0         0         0
memcg_swap_fail              0         0         0         0
pswpin                      16        16         0         0
pswpout            7,471,472 7,527,963         0         0
zswpin                     635       605       624       639
zswpout                  1,509     1,478 9,453,761 9,385,910
thp_swpout                   0         0         0         0
thp_swpout_                  0         0         0         0
 fallback
pgmajfault               3,616     3,430     4,633     3,611
ZSWPOUT-64kB               n/a       n/a   590,768   586,521
SWPOUT-64kB            466,967   470,498         0         0
------------------------------------------------------------------------------
2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
=======================================================
------------------------------------------------------------------------------
                    v6.11-rc3 mainline          zswap-mTHP         Change wrt
                              Baseline                               Baseline
------------------------------------------------------------------------------
ZSWAP compressor        zstd  deflate-      zstd  deflate-     zstd  deflate-
                                   iaa                 iaa                iaa
------------------------------------------------------------------------------
Throughput (KB/s)    192,164   194,643   165,005   174,536     -14%      -10%
sys time (sec)        823.55    830.42    801.72    676.65       3%       19%
memcg_high            16,054    15,936    14,951    16,096
memcg_swap_high              0         0         0         0
memcg_swap_fail              0         0         0         0
pswpin                       0         0         0         0
pswpout            8,629,248 8,628,907         0         0
zswpin                     560       645     5,333       781
zswpout                  1,416     1,503 8,546,895 9,355,760
thp_swpout              16,854    16,853         0         0
thp_swpout_                  0         0         0         0
 fallback
pgmajfault               3,341     3,574     8,139     3,582
ZSWPOUT-2048kB             n/a       n/a    16,684    18,270
SWPOUT-2048kB           16,854    16,853         0         0
------------------------------------------------------------------------------
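For readers checking the "Change wrt Baseline" columns: the sign convention is not spelled out in the tables, but the published numbers are consistent with "positive means better", i.e. throughput is compared as (patched - baseline) / baseline and sys time as (baseline - patched) / baseline, rounded to a whole percent. The sketch below encodes that inferred convention; the function names are made up for illustration.

```c
#include <assert.h>

/* Round to the nearest whole percent without needing libm. */
long round_pct(double x)
{
	return (long)(x >= 0 ? x + 0.5 : x - 0.5);
}

/* Higher is better (throughput): a positive result is an improvement. */
long change_higher_is_better(double baseline, double patched)
{
	assert(baseline > 0);
	return round_pct((patched - baseline) / baseline * 100.0);
}

/* Lower is better (sys time): a positive result is an improvement. */
long change_lower_is_better(double baseline, double patched)
{
	assert(baseline > 0);
	return round_pct((baseline - patched) / baseline * 100.0);
}
```

With the 64KB zstd column, change_higher_is_better(161496, 140363) gives -13 and change_lower_is_better(771.68, 954.85) gives -24, matching the table.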
In the "Before" scenario, when zswap does not store mTHP, only allocations
count towards the cgroup memory limit. However, in the "After" scenario,
with the introduction of zswap_store() mTHP, both the allocations and
the zswap compressed pool usage from all 70 processes are counted towards
the memory limit. As a result, we see higher swapout activity in the
"After" data. Hence, more time is spent doing reclaim as the zswap cgroup
charge leads to more frequent memory.high breaches.
This causes degradation in throughput and sys time with zswap mTHP, more so
in case of zstd than deflate-iaa. Compress latency could play a part in
this - when there is more swapout activity happening, a slower compressor
would cause allocations to stall for any/all of the 70 processes.
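To make the accounting argument above concrete, here is a rough steady-state model, not a measurement. It assumes a 70G working set under a 40G limit as in the test, and a purely hypothetical compression ratio; swapped_out_gib() is an illustrative helper, not kernel code. When the compressed copy is charged to the cgroup, resident memory R must satisfy R + (W - R) / ratio = limit, so more uncompressed data has to be swapped out than in the uncharged case.

```c
#include <assert.h>

/*
 * Uncompressed memory (GiB) that must sit in swap at steady state for a
 * working set of `workset` GiB under a `limit` GiB cgroup limit.
 * ratio is a hypothetical compression ratio (uncompressed / compressed).
 * charged != 0 means the compressed copy also counts against the limit.
 */
double swapped_out_gib(double workset, double limit, double ratio, int charged)
{
	double resident;

	assert(ratio > 1.0 && workset > limit);
	if (!charged)
		return workset - limit;	/* resident memory fills the limit */

	/* Solve resident + (workset - resident) / ratio == limit. */
	resident = (limit - workset / ratio) / (1.0 - 1.0 / ratio);
	if (resident < 0)
		resident = 0;	/* does not fit even when fully compressed */
	return workset - resident;
}
```

With an assumed ratio of 3, the uncharged case needs 30G swapped out while the charged case needs about 45G, which is the direction of the zswpout/pswpout difference seen in the tables. The real ratio for this workload is not stated in this posting.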
In my opinion, even though the test setup does not provide an accurate
way for a direct before/after comparison (because of zswap usage being
counted in cgroup, hence towards the memory.high), it still seems
reasonable for zswap_store to support (m)THP, so that further performance
improvements can be implemented.
One of the ideas that has shown promise in our experiments is to improve
ZSWAP mTHP store performance using batching. With IAA compress/decompress
batching used in ZSWAP, we are able to demonstrate significant
performance improvements and memory savings with IAA in scalability
experiments, as compared to software compressors. We hope to submit
this work as subsequent RFCs.
I would greatly appreciate your code review comments and suggestions!
Thanks,
Kanchana
[2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
Kanchana P Sridhar (3):
mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
mm: zswap: zswap_store() extended to handle mTHP folios.
mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout
stats.
include/linux/huge_mm.h | 1 +
include/linux/memcontrol.h | 4 +
mm/huge_memory.c | 3 +
mm/page_io.c | 3 +-
mm/zswap.c | 231 +++++++++++++++++++++++++++----------
5 files changed, 180 insertions(+), 62 deletions(-)
base-commit: b659edec079c90012cf8d05624e312d1062b8b87
--
2.27.0
* [PATCH v5 1/3] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
From: Kanchana P Sridhar @ 2024-08-28 9:35 UTC
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
ryan.roberts, ying.huang, 21cnbao, akpm
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This resolves an issue with obj_cgroup_get() not being defined if
CONFIG_MEMCG is not defined.
Before this patch, we would see build errors if obj_cgroup_get() is
called from code that is agnostic of CONFIG_MEMCG.
The zswap_store() changes for mTHP in subsequent commits will require
the use of obj_cgroup_get() in zswap code that falls into this category.
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
include/linux/memcontrol.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fe05fdb92779..f693d254ab2a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1281,6 +1281,10 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
 	return NULL;
 }
 
+static inline void obj_cgroup_get(struct obj_cgroup *objcg)
+{
+}
+
 static inline void obj_cgroup_put(struct obj_cgroup *objcg)
 {
 }
--
2.27.0
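The pattern used by the patch above (an empty static inline stub when the config option is off, so that callers need no #ifdef of their own) can be sketched in isolation as follows. CONFIG_WIDGET_REFS, struct widget and the widget_*() helpers are hypothetical stand-ins chosen for this example; they do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

struct widget {
	int refs;
};

/* Hypothetical option; left undefined here to exercise the stub path. */
/* #define CONFIG_WIDGET_REFS 1 */

#ifdef CONFIG_WIDGET_REFS
static inline void widget_get(struct widget *w) { w->refs++; }
static inline void widget_put(struct widget *w) { w->refs--; }
#else
/* No-op stubs: callers compile and link either way. */
static inline void widget_get(struct widget *w) { (void)w; }
static inline void widget_put(struct widget *w) { (void)w; }
#endif

/*
 * A caller that is agnostic of CONFIG_WIDGET_REFS, in the same way the
 * zswap code is agnostic of CONFIG_MEMCG. Returns the reference count
 * seen after a balanced get/put pair, or 0 for a NULL widget.
 */
int use_widget(struct widget *w)
{
	if (w)
		widget_get(w);
	/* ... work with w ... */
	if (w)
		widget_put(w);
	return w ? w->refs : 0;
}
```

Without the widget_get() stub, use_widget() would fail to build whenever the option is disabled, which mirrors the build error this patch resolves.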
* [PATCH v5 2/3] mm: zswap: zswap_store() extended to handle mTHP folios.
From: Kanchana P Sridhar @ 2024-08-28 9:35 UTC
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
ryan.roberts, ying.huang, 21cnbao, akpm
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
zswap_store() will now process and store mTHP and PMD-size THP folios.
This change reuses and adapts the functionality in Ryan Roberts' RFC
patch [1]:
"[RFC,v1] mm: zswap: Store large folios without splitting"
[1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
This patch provides a sequential implementation of storing an mTHP in
zswap_store() by iterating through each page in the folio to compress
and store it in the zswap zpool.
Towards this goal, zswap_compress() is modified to take a page instead
of a folio as input.
Each page's swap offset is stored as a separate zswap entry.
If an error is encountered during the store of any page in the mTHP,
all previous pages/entries stored will be invalidated. Thus, an mTHP
is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
This forms the basis for building batching of pages during zswap store
of large folios, by compressing batches of up to say, 8 pages in an
mTHP in parallel in hardware, with the Intel In-Memory Analytics
Accelerator (Intel IAA).
Also, addressed some of the RFC comments from the discussion in [1].
Made a minor edit in the comments for "struct zswap_entry" to delete
the comments related to "value" since same-filled page handling has
been removed from zswap.
Co-developed-by: Ryan Roberts
Signed-off-by:
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
mm/zswap.c | 231 +++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 170 insertions(+), 61 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index 449914ea9919..d6f012ca67d8 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -190,7 +190,6 @@ static struct shrinker *zswap_shrinker;
  *              section for context.
  * pool - the zswap_pool the entry's data is in
  * handle - zpool allocation handle that stores the compressed page data
- * value - value of the same-value filled pages which have same content
  * objcg - the obj_cgroup that the compressed memory is charged to
  * lru - handle to the pool's lru used to evict pages.
  */
@@ -876,7 +875,7 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
 	return 0;
 }
 
-static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
+static bool zswap_compress(struct page *page, struct zswap_entry *entry)
 {
 	struct crypto_acomp_ctx *acomp_ctx;
 	struct scatterlist input, output;
@@ -894,7 +893,7 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
 
 	dst = acomp_ctx->buffer;
 	sg_init_table(&input, 1);
-	sg_set_folio(&input, folio, PAGE_SIZE, 0);
+	sg_set_page(&input, page, PAGE_SIZE, 0);
 
 	/*
 	 * We need PAGE_SIZE * 2 here since there maybe over-compression case,
@@ -1404,35 +1403,82 @@ static void shrink_worker(struct work_struct *w)
 /*********************************
 * main API
 **********************************/
-bool zswap_store(struct folio *folio)
+
+/*
+ * Returns true if the entry was successfully
+ * stored in the xarray, and false otherwise.
+ */
+static bool zswap_store_entry(struct xarray *tree,
+			      struct zswap_entry *entry)
 {
-	swp_entry_t swp = folio->swap;
-	pgoff_t offset = swp_offset(swp);
-	struct xarray *tree = swap_zswap_tree(swp);
-	struct zswap_entry *entry, *old;
-	struct obj_cgroup *objcg = NULL;
-	struct mem_cgroup *memcg = NULL;
+	struct zswap_entry *old;
+	pgoff_t offset = swp_offset(entry->swpentry);
 
-	VM_WARN_ON_ONCE(!folio_test_locked(folio));
-	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
+	old = xa_store(tree, offset, entry, GFP_KERNEL);
 
-	/* Large folios aren't supported */
-	if (folio_test_large(folio))
+	if (xa_is_err(old)) {
+		int err = xa_err(old);
+
+		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
+		zswap_reject_alloc_fail++;
 		return false;
+	}
 
-	if (!zswap_enabled)
-		goto check_old;
+	/*
+	 * We may have had an existing entry that became stale when
+	 * the folio was redirtied and now the new version is being
+	 * swapped out. Get rid of the old.
+	 */
+	if (old)
+		zswap_entry_free(old);
 
-	/* Check cgroup limits */
-	objcg = get_obj_cgroup_from_folio(folio);
-	if (objcg && !obj_cgroup_may_zswap(objcg)) {
-		memcg = get_mem_cgroup_from_objcg(objcg);
-		if (shrink_memcg(memcg)) {
-			mem_cgroup_put(memcg);
-			goto reject;
-		}
-		mem_cgroup_put(memcg);
+	return true;
+}
+
+/*
+ * If the zswap store fails or zswap is disabled, we must invalidate the
+ * possibly stale entries which were previously stored at the offsets
+ * corresponding to each page of the folio. Otherwise, writeback could
+ * overwrite the new data in the swapfile.
+ *
+ * This is called after the store of the i-th offset in a large folio has
+ * failed. All zswap entries in the folio must be deleted. This helps make
+ * sure that a swapped-out mTHP is either entirely stored in zswap, or
+ * entirely not stored in zswap.
+ *
+ * This is also called if zswap_store() is invoked, but zswap is not enabled.
+ * All offsets for the folio are deleted from zswap in this case.
+ */
+static void zswap_delete_stored_offsets(struct xarray *tree,
+					pgoff_t offset,
+					long nr_pages)
+{
+	struct zswap_entry *entry;
+	long i;
+
+	for (i = 0; i < nr_pages; ++i) {
+		entry = xa_erase(tree, offset + i);
+		if (entry)
+			zswap_entry_free(entry);
 	}
+}
+
+/*
+ * Stores the page at specified "index" in a folio.
+ */
+static bool zswap_store_page(struct folio *folio, long index,
+			     struct obj_cgroup *objcg,
+			     struct zswap_pool *pool)
+{
+	swp_entry_t swp = folio->swap;
+	int type = swp_type(swp);
+	pgoff_t offset = swp_offset(swp) + index;
+	struct page *page = folio_page(folio, index);
+	struct xarray *tree = swap_zswap_tree(swp);
+	struct zswap_entry *entry;
+
+	if (objcg)
+		obj_cgroup_get(objcg);
 
 	if (zswap_check_limits())
 		goto reject;
@@ -1445,42 +1491,20 @@ bool zswap_store(struct folio *folio)
 	}
 
 	/* if entry is successfully added, it keeps the reference */
-	entry->pool = zswap_pool_current_get();
-	if (!entry->pool)
+	if (!zswap_pool_get(pool))
 		goto freepage;
 
-	if (objcg) {
-		memcg = get_mem_cgroup_from_objcg(objcg);
-		if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
-			mem_cgroup_put(memcg);
-			goto put_pool;
-		}
-		mem_cgroup_put(memcg);
-	}
+	entry->pool = pool;
 
-	if (!zswap_compress(folio, entry))
+	if (!zswap_compress(page, entry))
 		goto put_pool;
 
-	entry->swpentry = swp;
+	entry->swpentry = swp_entry(type, offset);
 	entry->objcg = objcg;
 	entry->referenced = true;
 
-	old = xa_store(tree, offset, entry, GFP_KERNEL);
-	if (xa_is_err(old)) {
-		int err = xa_err(old);
-
-		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
-		zswap_reject_alloc_fail++;
+	if (!zswap_store_entry(tree, entry))
 		goto store_failed;
-	}
-
-	/*
-	 * We may have had an existing entry that became stale when
-	 * the folio was redirtied and now the new version is being
-	 * swapped out. Get rid of the old.
-	 */
-	if (old)
-		zswap_entry_free(old);
 
 	if (objcg) {
 		obj_cgroup_charge_zswap(objcg, entry->length);
@@ -1511,23 +1535,108 @@ bool zswap_store(struct folio *folio)
 store_failed:
 	zpool_free(entry->pool->zpool, entry->handle);
 put_pool:
-	zswap_pool_put(entry->pool);
+	zswap_pool_put(pool);
 freepage:
 	zswap_entry_cache_free(entry);
 reject:
 	obj_cgroup_put(objcg);
 	if (zswap_pool_reached_full)
 		queue_work(shrink_wq, &zswap_shrink_work);
-check_old:
+
+	return false;
+}
+
+/*
+ * Modified to store mTHP folios. Each page in the mTHP will be compressed
+ * and stored sequentially.
+ */
+bool zswap_store(struct folio *folio)
+{
+	long nr_pages = folio_nr_pages(folio);
+	swp_entry_t swp = folio->swap;
+	pgoff_t offset = swp_offset(swp);
+	struct xarray *tree = swap_zswap_tree(swp);
+	struct obj_cgroup *objcg = NULL;
+	struct mem_cgroup *memcg = NULL;
+	struct zswap_pool *pool;
+	bool ret = false;
+	long index;
+
+	VM_WARN_ON_ONCE(!folio_test_locked(folio));
+	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
+
+	if (!zswap_enabled)
+		goto reject;
+
 	/*
-	 * If the zswap store fails or zswap is disabled, we must invalidate the
-	 * possibly stale entry which was previously stored at this offset.
-	 * Otherwise, writeback could overwrite the new data in the swapfile.
+	 * Check cgroup limits:
+	 *
+	 * The cgroup zswap limit check is done once at the beginning of an
+	 * mTHP store, and not within zswap_store_page() for each page
+	 * in the mTHP. We do however check the zswap pool limits at the
+	 * start of zswap_store_page(). What this means is, the cgroup
+	 * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
+	 * However, the per-store-page zswap pool limits check should
+	 * hopefully trigger the cgroup aware and zswap LRU aware global
+	 * reclaim implemented in the shrinker. If this assumption holds,
+	 * the cgroup exceeding the zswap limits could potentially be
+	 * resolved before the next zswap_store, and if it is not, the next
+	 * zswap_store would fail the cgroup zswap limit check at the start.
 	 */
-	entry = xa_erase(tree, offset);
-	if (entry)
-		zswap_entry_free(entry);
-	return false;
+	objcg = get_obj_cgroup_from_folio(folio);
+	if (objcg && !obj_cgroup_may_zswap(objcg)) {
+		memcg = get_mem_cgroup_from_objcg(objcg);
+		if (shrink_memcg(memcg)) {
+			mem_cgroup_put(memcg);
+			goto put_objcg;
+		}
+		mem_cgroup_put(memcg);
+	}
+
+	if (zswap_check_limits())
+		goto put_objcg;
+
+	pool = zswap_pool_current_get();
+	if (!pool)
+		goto put_objcg;
+
+	if (objcg) {
+		memcg = get_mem_cgroup_from_objcg(objcg);
+		if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
+			mem_cgroup_put(memcg);
+			goto put_pool;
+		}
+		mem_cgroup_put(memcg);
+	}
+
+	/*
+	 * Store each page of the folio as a separate entry. If we fail to store
+	 * a page, unwind by removing all the previous pages we stored.
+	 */
+	for (index = 0; index < nr_pages; ++index) {
+		if (!zswap_store_page(folio, index, objcg, pool))
+			goto put_pool;
+	}
+
+	ret = true;
+
+put_pool:
+	zswap_pool_put(pool);
+put_objcg:
+	obj_cgroup_put(objcg);
+	if (zswap_pool_reached_full)
+		queue_work(shrink_wq, &zswap_shrink_work);
+reject:
+	/*
+	 * If the zswap store fails or zswap is disabled, we must invalidate
+	 * the possibly stale entries which were previously stored at the
+	 * offsets corresponding to each page of the folio. Otherwise,
+	 * writeback could overwrite the new data in the swapfile.
+	 */
+	if (!ret)
+		zswap_delete_stored_offsets(tree, offset, nr_pages);
+
+	return ret;
 }
 
 bool zswap_load(struct folio *folio)
--
2.27.0
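The all-or-nothing behaviour described in the commit message of this patch can be modelled in user space as below. This is only a sketch of the control flow under simplifying assumptions: a plain array stands in for the xarray, store_page() is a hypothetical stand-in for zswap_store_page() that fails on request, and no compression, pool or cgroup accounting is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SLOTS 64

/* Stand-in for the xarray: slot[i] != 0 means offset i holds an entry. */
struct store {
	int slot[NR_SLOTS];
};

/* Hypothetical per-page store; fails when index == fail_at. */
static bool store_page(struct store *s, size_t offset, long index, long fail_at)
{
	if (index == fail_at)
		return false;
	s->slot[offset] = 1;
	return true;
}

/* Drop every entry in [offset, offset + nr_pages), new or stale. */
static void delete_stored_offsets(struct store *s, size_t offset, long nr_pages)
{
	for (long i = 0; i < nr_pages; ++i)
		s->slot[offset + i] = 0;
}

/*
 * Store nr_pages pages starting at offset; fail_at < 0 means no failure.
 * On any failure the whole range is cleared, so the folio ends up either
 * entirely stored or entirely absent.
 */
bool store_folio(struct store *s, size_t offset, long nr_pages, long fail_at)
{
	assert(nr_pages > 0 && offset + (size_t)nr_pages <= NR_SLOTS);

	for (long index = 0; index < nr_pages; ++index) {
		if (!store_page(s, offset + index, index, fail_at)) {
			delete_stored_offsets(s, offset, nr_pages);
			return false;
		}
	}
	return true;
}

/* Count occupied slots in a range; used to check the invariant. */
long count_stored(const struct store *s, size_t offset, long nr_pages)
{
	long n = 0;

	for (long i = 0; i < nr_pages; ++i)
		n += s->slot[offset + i] != 0;
	return n;
}
```

Note that the unwind clears the full range rather than only the pages stored so far, so stale entries from an earlier successful store of the same offsets are removed as well, as in zswap_delete_stored_offsets().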
* [PATCH v5 3/3] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout stats.
From: Kanchana P Sridhar @ 2024-08-28 9:35 UTC
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
ryan.roberts, ying.huang, 21cnbao, akpm
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
Add a new MTHP_STAT_ZSWPOUT entry to the sysfs mTHP stats so that
per-order mTHP folio ZSWAP stores can be accounted.
If zswap_store() successfully swaps out an mTHP, it will be counted under
the per-order sysfs "zswpout" stats:
/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
Other block dev/fs mTHP swap-out events will be counted under
the existing sysfs "swpout" stats:
/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout
Based on changes made in commit 61e751c01466ffef5dc72cb64349454a691c6bfe
("mm: cleanup count_mthp_stat() definition"), this patch also moves
the call to count_mthp_stat() in count_swpout_vm_event() to be outside
the "ifdef CONFIG_TRANSPARENT_HUGEPAGE".
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
include/linux/huge_mm.h | 1 +
mm/huge_memory.c | 3 +++
mm/page_io.c | 3 ++-
3 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4902e2f7e896..8b8045d4a351 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -118,6 +118,7 @@ enum mthp_stat_item {
 	MTHP_STAT_ANON_FAULT_ALLOC,
 	MTHP_STAT_ANON_FAULT_FALLBACK,
 	MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
+	MTHP_STAT_ZSWPOUT,
 	MTHP_STAT_SWPOUT,
 	MTHP_STAT_SWPOUT_FALLBACK,
 	MTHP_STAT_SHMEM_ALLOC,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a81eab98d6b8..45b26c8b3d8a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -587,6 +587,7 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
 DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
 DEFINE_MTHP_STAT_ATTR(swpout, MTHP_STAT_SWPOUT);
 DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK);
 #ifdef CONFIG_SHMEM
@@ -605,6 +606,7 @@ static struct attribute *anon_stats_attrs[] = {
 	&anon_fault_fallback_attr.attr,
 	&anon_fault_fallback_charge_attr.attr,
 #ifndef CONFIG_SHMEM
+	&zswpout_attr.attr,
 	&swpout_attr.attr,
 	&swpout_fallback_attr.attr,
 #endif
@@ -637,6 +639,7 @@ static struct attribute_group file_stats_attr_grp = {
 
 static struct attribute *any_stats_attrs[] = {
 #ifdef CONFIG_SHMEM
+	&zswpout_attr.attr,
 	&swpout_attr.attr,
 	&swpout_fallback_attr.attr,
 #endif
diff --git a/mm/page_io.c b/mm/page_io.c
index b6f1519d63b0..26106e745d73 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -289,6 +289,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 		swap_zeromap_folio_clear(folio);
 	}
 	if (zswap_store(folio)) {
+		count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
 		folio_unlock(folio);
 		return 0;
 	}
@@ -308,8 +309,8 @@ static inline void count_swpout_vm_event(struct folio *folio)
 		count_memcg_folio_events(folio, THP_SWPOUT, 1);
 		count_vm_event(THP_SWPOUT);
 	}
-	count_mthp_stat(folio_order(folio), MTHP_STAT_SWPOUT);
 #endif
+	count_mthp_stat(folio_order(folio), MTHP_STAT_SWPOUT);
 	count_vm_events(PSWPOUT, folio_nr_pages(folio));
 }
 
--
2.27.0
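As a rough illustration of the per-order accounting this patch adds, the sketch below keeps a small [order][item] counter table in user space and maps a folio order to the "hugepages-<size>kB" name it would appear under. It assumes 4 KiB base pages; count_stat(), account_swapout() and MAX_ORDER_SKETCH are names invented for this example, and the kernel's real implementation of count_mthp_stat() is not reproduced here.

```c
#include <assert.h>

#define PAGE_SIZE_KB		4	/* assumption: 4 KiB base pages */
#define MAX_ORDER_SKETCH	10	/* hypothetical bound for this sketch */

enum stat_item { STAT_ZSWPOUT, STAT_SWPOUT, NR_STAT_ITEMS };

static unsigned long stats[MAX_ORDER_SKETCH][NR_STAT_ITEMS];

/* Size in kB of a folio of the given order: 4 kB << order. */
unsigned int order_to_kb(unsigned int order)
{
	return PAGE_SIZE_KB << order;
}

/* User-space analogue of counting one event for a folio order. */
void count_stat(unsigned int order, enum stat_item item)
{
	assert(order < MAX_ORDER_SKETCH && item < NR_STAT_ITEMS);
	stats[order][item]++;
}

/* A successful zswap store bumps zswpout; anything else bumps swpout. */
void account_swapout(unsigned int order, int zswap_stored)
{
	count_stat(order, zswap_stored ? STAT_ZSWPOUT : STAT_SWPOUT);
}

unsigned long read_stat(unsigned int order, enum stat_item item)
{
	assert(order < MAX_ORDER_SKETCH && item < NR_STAT_ITEMS);
	return stats[order][item];
}
```

Under the 4 KiB assumption, order 4 corresponds to hugepages-64kB and order 9 to hugepages-2048kB, the two sizes used in the cover letter's tables. Each folio counts once per event, regardless of how many pages it contains.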
* Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
From: Nhat Pham @ 2024-08-28 15:55 UTC
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, yosryahmed, ryan.roberts,
ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali,
vinodh.gopal
On Wed, Aug 28, 2024 at 2:35 AM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi All,
>
> This patch-series enables zswap_store() to accept and store mTHP
> folios. The most significant contribution in this series is from the
> earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> migrated to v6.11-rc3 in patch 2/3 of this series.
>
> [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Additionally, there is an attempt to modularize some of the functionality
> in zswap_store(), to make it more amenable to supporting any-order
> mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
> in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
> delete all offsets corresponding to a higher order folio stored in zswap.
>
> For accounting purposes, the patch-series adds per-order mTHP sysfs
> "zswpout" counters that get incremented upon successful zswap_store of
> an mTHP folio:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
>
> This patch-series is a precursor to ZSWAP compress batching of mTHP
> swap-out and decompress batching of swap-ins based on swapin_readahead(),
> using Intel IAA hardware acceleration, which we would like to submit in
> subsequent RFC patch-series, with performance improvement data.
>
> Thanks to Ying Huang for pre-posting review feedback and suggestions!
>
> Changes since v4:
> =================
> 1) Published before/after data with zstd, as suggested by Nhat (Thanks
> Nhat for the data reviews!).
> 2) Rebased to mm-unstable from 8/27/2024,
> commit b659edec079c90012cf8d05624e312d1062b8b87.
> 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
> CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
> robot; as per Nhat's and Michal's suggestion to not require a separate
> patch to fix the build errors (thanks both!).
> 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
> suggested by Yosry (Thanks Yosry!).
> 5) Squashed the commits that define new mthp zswpout stat counters, and
> invoke count_mthp_stat() after successful zswap_store()s; into a single
> commit. Thanks Yosry for this suggestion!
>
> Changes since v3:
> =================
> 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
> Thanks to Barry for suggesting aligning with Ryan Roberts' latest
> changes to count_mthp_stat() so that it's always defined, even when THP
> is disabled. Barry, I have also made one other change in page_io.c
> where count_mthp_stat() is called by count_swpout_vm_event(). I would
> appreciate it if you can review this. Thanks!
> Hopefully this should resolve the kernel robot build errors.
>
> Changes since v2:
> =================
> 1) Gathered usemem data using SSD as the backing swap device for zswap,
> as suggested by Ying Huang. Ying, I would appreciate it if you can
> review the latest data. Thanks!
> 2) Generated the base commit info in the patches to attempt to address
> the kernel test robot build errors.
> 3) No code changes to the individual patches themselves.
>
> Changes since RFC v1:
> =====================
>
> 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> Thanks Barry!
> 2) Addressed some of the code review comments that Nhat Pham provided in
> Ryan's initial RFC [1]:
> - Added a comment about the cgroup zswap limit checks occurring once per
> folio at the beginning of zswap_store().
> Nhat, Ryan, please do let me know if the comments convey the summary
> from the RFC discussion. Thanks!
> - Posted data on running the cgroup suite's zswap kselftest.
> 3) Rebased to v6.11-rc3.
> 4) Gathered performance data with usemem and the rebased patch-series.
>
> Performance Testing:
> ====================
> Testing of this patch-series was done with the v6.11-rc3 mainline, without
> and with this patch-series, on an Intel Sapphire Rapids server,
> dual-socket 56 cores per socket, 4 IAA devices per socket.
>
> The system has 503 GiB RAM, with 176GiB ZRAM (35% of available RAM) as the
> backing swap device for ZSWAP. zstd is configured as the ZRAM compressor.
> Core frequency was fixed at 2500MHz.
>
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 40G. There is no swap limit set for the cgroup. Following a
I thought it was 60G. Why are we reducing it to 40G here? Just curious :)
> similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
> series [2], 70 usemem processes were run, each allocating and writing 1G of
> memory:
>
> usemem --init-time -w -O -n 70 1g
>
> The vm/sysfs mTHP stats included with the performance data provide details
> on the swapout activity to ZSWAP/swap.
>
> Other kernel configuration parameters:
>
> ZSWAP Compressors : zstd, deflate-iaa
> ZSWAP Allocator : zsmalloc
> SWAP page-cluster : 2
>
> In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> IAA "compression verification" is enabled. Hence each IAA compression
> will be decompressed internally by the "iaa_crypto" driver, the crc-s
> returned by the hardware will be compared and errors reported in case of
> mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> compared to the software compressors.
>
> Throughput is derived by averaging the individual 70 processes' throughputs
> reported by usemem. sys time is measured with perf. All data points are
> averaged across 3 runs.
>
> 64KB mTHP (cgroup memory.high set to 40G):
> ==========================================
>
> ------------------------------------------------------------------------------
>                       v6.11-rc3 mainline            zswap-mTHP      Change wrt
>                                 Baseline                              Baseline
> ------------------------------------------------------------------------------
> ZSWAP compressor         zstd   deflate-       zstd   deflate-   zstd deflate-
>                                      iaa                    iaa             iaa
> ------------------------------------------------------------------------------
> Throughput (KB/s)     161,496    156,343    140,363    151,938   -13%      -3%
> sys time (sec)         771.68     802.08     954.85     735.47   -24%       8%
> memcg_high            111,223    110,889    138,651    133,884
> memcg_swap_high             0          0          0          0
> memcg_swap_fail             0          0          0          0
> pswpin                     16         16          0          0
> pswpout             7,471,472  7,527,963          0          0
> zswpin                    635        605        624        639
> zswpout                 1,509      1,478  9,453,761  9,385,910
> thp_swpout                  0          0          0          0
> thp_swpout_                 0          0          0          0
>  fallback
> pgmajfault              3,616      3,430      4,633      3,611
> ZSWPOUT-64kB              n/a        n/a    590,768    586,521
> SWPOUT-64kB           466,967    470,498          0          0
> ------------------------------------------------------------------------------
>
> 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> =======================================================
>
> ------------------------------------------------------------------------------
>                       v6.11-rc3 mainline            zswap-mTHP      Change wrt
>                                 Baseline                              Baseline
> ------------------------------------------------------------------------------
> ZSWAP compressor         zstd   deflate-       zstd   deflate-   zstd deflate-
>                                      iaa                    iaa             iaa
> ------------------------------------------------------------------------------
> Throughput (KB/s)     192,164    194,643    165,005    174,536   -14%     -10%
> sys time (sec)         823.55     830.42     801.72     676.65     3%      19%
> memcg_high             16,054     15,936     14,951     16,096
> memcg_swap_high             0          0          0          0
> memcg_swap_fail             0          0          0          0
> pswpin                      0          0          0          0
> pswpout             8,629,248  8,628,907          0          0
> zswpin                    560        645      5,333        781
> zswpout                 1,416      1,503  8,546,895  9,355,760
> thp_swpout             16,854     16,853          0          0
> thp_swpout_                 0          0          0          0
>  fallback
> pgmajfault              3,341      3,574      8,139      3,582
> ZSWPOUT-2048kB            n/a        n/a     16,684     18,270
> SWPOUT-2048kB          16,854     16,853          0          0
> ------------------------------------------------------------------------------
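For anyone cross-checking the "Change wrt Baseline" columns: they appear to be
signed so that a positive value means the patched kernel did better (higher
throughput, lower sys time). That convention is inferred from the numbers, it
is not stated in the cover letter. A minimal Python sketch, with a hypothetical
helper name, that reproduces the columns under that assumption:

```python
def change_wrt_baseline(baseline, patched, higher_is_better):
    """Percent change, signed so that a positive value is an improvement.

    Assumed convention (inferred from the tables, not stated by the author):
    throughput improves when it goes up, sys time improves when it goes down.
    """
    if higher_is_better:
        delta = patched - baseline      # e.g. throughput in KB/s
    else:
        delta = baseline - patched      # e.g. sys time in seconds
    return round(100.0 * delta / baseline)


# (baseline, patched) pairs copied from the 64KB mTHP table.
rows_64k = {
    "throughput zstd":        (161_496, 140_363, True),
    "throughput deflate-iaa": (156_343, 151_938, True),
    "sys time zstd":          (771.68, 954.85, False),
    "sys time deflate-iaa":   (802.08, 735.47, False),
}

for name, (base, new, higher_is_better) in rows_64k.items():
    print(f"{name:24s} {change_wrt_baseline(base, new, higher_is_better):+d}%")
```

The same helper applied to the 2MB table gives the 3% and 19% sys time entries.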
OK these numbers are much more positive now. Some observations:
1. The pswpout and zswpout cells are much more sane now. I still think
we have issues with the way zswap cgroup charging interacts with our
reclaim dynamics, but my theory is that these issues only manifest in
more extreme conditions - high concurrency + fast reclaim path ==
memory.high limit constantly violated, leading to the vicious cycle of
overreclaim? zstd has a much better compression ratio than lz4, so
that probably lowers the violation amount per iteration, which
compounds over time and drastically reduces the overreclaiming issue.
We probably should still investigate and fix it though.
2. That said, there are still regressions with respect to the mTHP
case. But it is outperforming in big THP now! This is strange.
3. I also noticed that your pswpin and zswpin rows are all 0 or really
small. Is this why we are not seeing much gains with zswap? I mean, if
you are not going to use these pages, offloading them to swap is
better by definition... I wonder if lowering the memory limit even
further would show positive numbers? Or
>
> In the "Before" scenario, when zswap does not store mTHP, only allocations
> count towards the cgroup memory limit. However, in the "After" scenario,
> with the introduction of zswap_store() mTHP, both, allocations as well as
> the zswap compressed pool usage from all 70 processes are counted towards
> the memory limit. As a result, we see higher swapout activity in the
> "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
> charge leads to more frequent memory.high breaches.
>
> This causes degradation in throughput and sys time with zswap mTHP, more so
> in case of zstd than deflate-iaa. Compress latency could play a part in
> this - when there is more swapout activity happening, a slower compressor
> would cause allocations to stall for any/all of the 70 processes.
>
> In my opinion, even though the test set up does not provide an accurate
> way for a direct before/after comparison (because of zswap usage being
> counted in cgroup, hence towards the memory.high), it still seems
> reasonable for zswap_store to support (m)THP, so that further performance
> improvements can be implemented.
Can we add a knob/config to enable/disable this? Just in case we are
regressing software compressor users for the sake of hardware
compressor users. Especially when the former are the majority of the
users, and the latter requires more investment :)
>
> One of the ideas that has shown promise in our experiments is to improve
> ZSWAP mTHP store performance using batching. With IAA compress/decompress
> batching used in ZSWAP, we are able to demonstrate significant
> performance improvements and memory savings with IAA in scalability
> experiments, as compared to software compressors. We hope to submit
> this work as subsequent RFCs.
>
> I would greatly appreciate your code review comments and suggestions!
>
> Thanks,
> Kanchana
Thanks for the hard work, Kanchana!
>
> [2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
>
>
> Kanchana P Sridhar (3):
> mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
> mm: zswap: zswap_store() extended to handle mTHP folios.
> mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout
> stats.
>
> include/linux/huge_mm.h | 1 +
> include/linux/memcontrol.h | 4 +
> mm/huge_memory.c | 3 +
> mm/page_io.c | 3 +-
> mm/zswap.c | 231 +++++++++++++++++++++++++++----------
> 5 files changed, 180 insertions(+), 62 deletions(-)
>
>
> base-commit: b659edec079c90012cf8d05624e312d1062b8b87
> --
> 2.27.0
>
* Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
2024-08-28 15:55 ` [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios Nhat Pham
@ 2024-08-28 17:23 ` Nhat Pham
2024-08-28 19:30 ` Sridhar, Kanchana P
2024-08-28 19:24 ` Sridhar, Kanchana P
1 sibling, 1 reply; 23+ messages in thread
From: Nhat Pham @ 2024-08-28 17:23 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, yosryahmed, ryan.roberts,
ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali,
vinodh.gopal
On Wed, Aug 28, 2024 at 8:55 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> better by definition... I wonder if lowering the memory limit even
> further would show positive numbers? Or
... perhaps with a workload that has less cold data? or using the
zswap shrinker to off load some of these cold objects to swap?
Food for thought :)
* RE: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
2024-08-28 15:55 ` [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios Nhat Pham
2024-08-28 17:23 ` Nhat Pham
@ 2024-08-28 19:24 ` Sridhar, Kanchana P
1 sibling, 0 replies; 23+ messages in thread
From: Sridhar, Kanchana P @ 2024-08-28 19:24 UTC (permalink / raw)
To: Nhat Pham
Cc: linux-kernel, linux-mm, hannes, yosryahmed, ryan.roberts, Huang,
Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal,
Vinodh, Sridhar, Kanchana P
Hi Nhat,
> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Wednesday, August 28, 2024 8:55 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
>
> On Wed, Aug 28, 2024 at 2:35 AM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi All,
> >
> > This patch-series enables zswap_store() to accept and store mTHP
> > folios. The most significant contribution in this series is from the
> > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> > migrated to v6.11-rc3 in patch 2/4 of this series.
> >
> > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> > https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> >
> > Additionally, there is an attempt to modularize some of the functionality
> > in zswap_store(), to make it more amenable to supporting any-order
> > mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
> > in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
> > delete all offsets corresponding to a higher order folio stored in zswap.
> >
> > For accounting purposes, the patch-series adds per-order mTHP sysfs
> > "zswpout" counters that get incremented upon successful zswap_store of
> > an mTHP folio:
> >
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> >
> > This patch-series is a precursor to ZSWAP compress batching of mTHP
> > swap-out and decompress batching of swap-ins based on swapin_readahead(),
> > using Intel IAA hardware acceleration, which we would like to submit in
> > subsequent RFC patch-series, with performance improvement data.
> >
> > Thanks to Ying Huang for pre-posting review feedback and suggestions!
> >
> > Changes since v4:
> > =================
> > 1) Published before/after data with zstd, as suggested by Nhat (Thanks
> > Nhat for the data reviews!).
> > 2) Rebased to mm-unstable from 8/27/2024,
> > commit b659edec079c90012cf8d05624e312d1062b8b87.
> > 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
> > CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
> > robot; as per Nhat's and Michal's suggestion to not require a separate
> > patch to fix the build errors (thanks both!).
> > 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
> > suggested by Yosry (Thanks Yosry!).
> > 5) Squashed the commits that define new mthp zswpout stat counters, and
> > invoke count_mthp_stat() after successful zswap_store()s; into a single
> > commit. Thanks Yosry for this suggestion!
> >
> > Changes since v3:
> > =================
> > 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
> > Thanks to Barry for suggesting aligning with Ryan Roberts' latest
> > changes to count_mthp_stat() so that it's always defined, even when THP
> > is disabled. Barry, I have also made one other change in page_io.c
> > where count_mthp_stat() is called by count_swpout_vm_event(). I would
> > appreciate it if you can review this. Thanks!
> > Hopefully this should resolve the kernel robot build errors.
> >
> > Changes since v2:
> > =================
> > 1) Gathered usemem data using SSD as the backing swap device for zswap,
> > as suggested by Ying Huang. Ying, I would appreciate it if you can
> > review the latest data. Thanks!
> > 2) Generated the base commit info in the patches to attempt to address
> > the kernel test robot build errors.
> > 3) No code changes to the individual patches themselves.
> >
> > Changes since RFC v1:
> > =====================
> >
> > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> > Thanks Barry!
> > 2) Addressed some of the code review comments that Nhat Pham provided in
> > Ryan's initial RFC [1]:
> > - Added a comment about the cgroup zswap limit checks occurring once per
> > folio at the beginning of zswap_store().
> > Nhat, Ryan, please do let me know if the comments convey the summary
> > from the RFC discussion. Thanks!
> > - Posted data on running the cgroup suite's zswap kselftest.
> > 3) Rebased to v6.11-rc3.
> > 4) Gathered performance data with usemem and the rebased patch-series.
> >
> > Performance Testing:
> > ====================
> > Testing of this patch-series was done with the v6.11-rc3 mainline, without
> > and with this patch-series, on an Intel Sapphire Rapids server,
> > dual-socket 56 cores per socket, 4 IAA devices per socket.
> >
> > The system has 503 GiB RAM, with 176GiB ZRAM (35% of available RAM) as the
> > backing swap device for ZSWAP. zstd is configured as the ZRAM compressor.
> > Core frequency was fixed at 2500MHz.
> >
> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > was fixed at 40G. There is no swap limit set for the cgroup. Following a
>
> I thought it was 60G. Why are we reducing it to 40G here? Just curious :)
That's correct, Nhat. This goes back to the original 40G memory.high setup
that Ryan reported using in [2].
[2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
Since I am back to using the 176GiB ZRAM as the backing swap device for ZSWAP,
I could use the more stringent 40G limit.
For the v4 experiments with the 4G SSD swap, I had to raise the memory limit so
that the experiments remained viable and still generated swap-out activity:
64K mTHP experiments: cgroup memory fixed at 60G
2M THP experiments : cgroup memory fixed at 55G
>
> > similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
> > series [2], 70 usemem processes were run, each allocating and writing 1G of
> > memory:
> >
> > usemem --init-time -w -O -n 70 1g
> >
> > The vm/sysfs mTHP stats included with the performance data provide details
> > on the swapout activity to ZSWAP/swap.
> >
> > Other kernel configuration parameters:
> >
> > ZSWAP Compressors : zstd, deflate-iaa
> > ZSWAP Allocator : zsmalloc
> > SWAP page-cluster : 2
> >
> > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> > IAA "compression verification" is enabled. Hence each IAA compression
> > will be decompressed internally by the "iaa_crypto" driver, the crc-s
> > returned by the hardware will be compared and errors reported in case of
> > mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> > compared to the software compressors.
> >
> > Throughput is derived by averaging the individual 70 processes' throughputs
> > reported by usemem. sys time is measured with perf. All data points are
> > averaged across 3 runs.
> >
> > 64KB mTHP (cgroup memory.high set to 40G):
> > ==========================================
> >
> > ------------------------------------------------------------------------------
> >                       v6.11-rc3 mainline            zswap-mTHP      Change wrt
> >                                 Baseline                              Baseline
> > ------------------------------------------------------------------------------
> > ZSWAP compressor         zstd   deflate-       zstd   deflate-   zstd deflate-
> >                                      iaa                    iaa             iaa
> > ------------------------------------------------------------------------------
> > Throughput (KB/s)     161,496    156,343    140,363    151,938   -13%      -3%
> > sys time (sec)         771.68     802.08     954.85     735.47   -24%       8%
> > memcg_high            111,223    110,889    138,651    133,884
> > memcg_swap_high             0          0          0          0
> > memcg_swap_fail             0          0          0          0
> > pswpin                     16         16          0          0
> > pswpout             7,471,472  7,527,963          0          0
> > zswpin                    635        605        624        639
> > zswpout                 1,509      1,478  9,453,761  9,385,910
> > thp_swpout                  0          0          0          0
> > thp_swpout_                 0          0          0          0
> >  fallback
> > pgmajfault              3,616      3,430      4,633      3,611
> > ZSWPOUT-64kB              n/a        n/a    590,768    586,521
> > SWPOUT-64kB           466,967    470,498          0          0
> > ------------------------------------------------------------------------------
> >
> > 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> > =======================================================
> >
> > ------------------------------------------------------------------------------
> >                       v6.11-rc3 mainline            zswap-mTHP      Change wrt
> >                                 Baseline                              Baseline
> > ------------------------------------------------------------------------------
> > ZSWAP compressor         zstd   deflate-       zstd   deflate-   zstd deflate-
> >                                      iaa                    iaa             iaa
> > ------------------------------------------------------------------------------
> > Throughput (KB/s)     192,164    194,643    165,005    174,536   -14%     -10%
> > sys time (sec)         823.55     830.42     801.72     676.65     3%      19%
> > memcg_high             16,054     15,936     14,951     16,096
> > memcg_swap_high             0          0          0          0
> > memcg_swap_fail             0          0          0          0
> > pswpin                      0          0          0          0
> > pswpout             8,629,248  8,628,907          0          0
> > zswpin                    560        645      5,333        781
> > zswpout                 1,416      1,503  8,546,895  9,355,760
> > thp_swpout             16,854     16,853          0          0
> > thp_swpout_                 0          0          0          0
> >  fallback
> > pgmajfault              3,341      3,574      8,139      3,582
> > ZSWPOUT-2048kB            n/a        n/a     16,684     18,270
> > SWPOUT-2048kB          16,854     16,853          0          0
> > ------------------------------------------------------------------------------
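A quick consistency check between the per-order counters and the vmstat page
counters in the two tables above. The per-order sysfs counters count folios,
while zswpout/pswpout count base pages, so multiplying by the folio size in
pages should land close to the page counters. This sketch assumes 4KiB base
pages; the helper name is made up for illustration, and the remainders are
only presumably order-0 stores (the data is also averaged across 3 runs).

```python
BASE_PAGE_KB = 4  # assumption: 4KiB base pages


def folios_to_pages(folio_count, folio_kb):
    """Convert a per-order folio count into a count of base pages."""
    return folio_count * (folio_kb // BASE_PAGE_KB)


# zswap-mTHP, zstd columns: ZSWPOUT-<size>kB versus zswpout.
pages_64k = folios_to_pages(590_768, 64)
pages_2m = folios_to_pages(16_684, 2048)
print("64K:", pages_64k, "of", 9_453_761, "remainder", 9_453_761 - pages_64k)
print("2M: ", pages_2m, "of", 8_546_895, "remainder", 8_546_895 - pages_2m)

# Baseline, zstd columns: SWPOUT-<size>kB versus pswpout.
print("64K baseline:", folios_to_pages(466_967, 64), "vs pswpout", 7_471_472)
print("2M baseline: ", folios_to_pages(16_854, 2048), "vs pswpout", 8_629_248)
```

For the baseline zstd columns the two counters agree exactly, and for the
patched kernel the per-order counter accounts for all but a few thousand of
the zswpout pages.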
>
> OK these numbers are much more positive now. Some observations:
>
> 1. The pswpout and zswpout cells are much more sane now. I still think
> we have issues with the way zswap cgroup charging interacts with our
> reclaim dynamics, but my theory is that these issues only manifest in
> more extreme conditions - high concurrency + fast reclaim path ==
> memory.high limit constantly violated, leading to the vicious cycle of
> overreclaim? zstd has a much better compression ratio than lz4, so
> that probably lowers the violation amount per iteration, which
> compounds over time and drastically reduces the overreclaiming issue.
> We probably should still investigate and fix it though.
I agree with this analysis and summary!
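To put rough numbers on the compression-ratio argument, here is a toy model
(an assumption for illustration, not the kernel's reclaim logic): if swapping
out N bytes to zswap also charges N/ratio bytes of compressed pool back to the
cgroup, the net reduction in charge is N * (1 - 1/ratio). The ratios below are
hypothetical, not measured values for zstd or lz4.

```python
def bytes_to_swap_out(excess, ratio=None):
    """Bytes that must be swapped out to bring the cgroup charge down by
    `excess` bytes.

    ratio=None models a backing device whose storage is not charged to the
    cgroup; otherwise `ratio` is the (hypothetical) compression ratio of a
    compressed pool that is charged to the same cgroup.
    """
    if ratio is None:
        return excess
    if ratio <= 1.0:
        raise ValueError("no net reduction in charge when ratio <= 1")
    return excess / (1.0 - 1.0 / ratio)


excess = 1 << 30  # 1GiB over memory.high, an arbitrary example value
for ratio in (None, 2.0, 4.0):
    need = bytes_to_swap_out(excess, ratio)
    print(f"ratio={ratio}: swap out {need / excess:.2f}x the excess")
```

In this model a lower ratio means more swap-out work per breach, which is in
line with the theory that a better-compressing algorithm reduces the
overreclaim.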
>
> 2. That said, there are still regressions with respect to the mTHP
> case. But it is outperforming in big THP now! This is strange.
Yes, although it is possible that the kernel optimizations for PMD-size THP
are helping in this case.
>
> 3. I also noticed that your pswpin and zswpin rows are all 0 or really
> small. Is this why we are not seeing much gains with zswap? I mean, if
> you are not going to use these pages, offloading them to swap is
> better by definition... I wonder if lowering the memory limit even
> further would show positive numbers? Or
Great observation. I suppose this is partly due to the nature of the workload,
which (as noted in my latest reply to Yosry's comments on v4) writes to each
8-byte chunk once, and that's it. Also, when the workload exits, the zswap
zpool size is 0 in the 64K mTHP case. Combined with the very few swapins, this
suggests that the swapped-out folios were mostly part of the working set: they
were not faulted back in (hence "cold" memory), but were ultimately released
when the workload exited.
In the case of 2M THP, however, the kernel seems to have reclaimed truly
cold memory, since the zswap zpool size is 3,134,619,648 (3.1G) after the workload exits.
>
> >
> > In the "Before" scenario, when zswap does not store mTHP, only allocations
> > count towards the cgroup memory limit. However, in the "After" scenario,
> > with the introduction of zswap_store() mTHP, both, allocations as well as
> > the zswap compressed pool usage from all 70 processes are counted towards
> > the memory limit. As a result, we see higher swapout activity in the
> > "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
> > charge leads to more frequent memory.high breaches.
> >
> > This causes degradation in throughput and sys time with zswap mTHP, more so
> > in case of zstd than deflate-iaa. Compress latency could play a part in
> > this - when there is more swapout activity happening, a slower compressor
> > would cause allocations to stall for any/all of the 70 processes.
> >
> > In my opinion, even though the test set up does not provide an accurate
> > way for a direct before/after comparison (because of zswap usage being
> > counted in cgroup, hence towards the memory.high), it still seems
> > reasonable for zswap_store to support (m)THP, so that further performance
> > improvements can be implemented.
>
> Can we add a knob/config to enable/disable this? Just in case we are
> regressing software compressor users for the sake of hardware
> compressor users. Especially when the former are the majority of the
> users, and the latter requires more investment :)
Sure. I am thinking of adding a config option, say CONFIG_THP_ZSWAP_STORE,
that is off by default. If that sounds OK to you, I will submit a v6 with
this change.
>
> >
> > One of the ideas that has shown promise in our experiments is to improve
> > ZSWAP mTHP store performance using batching. With IAA compress/decompress
> > batching used in ZSWAP, we are able to demonstrate significant
> > performance improvements and memory savings with IAA in scalability
> > experiments, as compared to software compressors. We hope to submit
> > this work as subsequent RFCs.
> >
> > I would greatly appreciate your code review comments and suggestions!
> >
> > Thanks,
> > Kanchana
>
> Thanks for the hard work, Kanchana!
Thanks Nhat :) I really appreciate your reviews, comments and analysis!
Thanks,
Kanchana
>
> >
> > [2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
> >
> >
> > Kanchana P Sridhar (3):
> > mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
> > mm: zswap: zswap_store() extended to handle mTHP folios.
> > mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout
> > stats.
> >
> > include/linux/huge_mm.h | 1 +
> > include/linux/memcontrol.h | 4 +
> > mm/huge_memory.c | 3 +
> > mm/page_io.c | 3 +-
> > mm/zswap.c | 231 +++++++++++++++++++++++++++----------
> > 5 files changed, 180 insertions(+), 62 deletions(-)
> >
> >
> > base-commit: b659edec079c90012cf8d05624e312d1062b8b87
> > --
> > 2.27.0
> >
* RE: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
2024-08-28 17:23 ` Nhat Pham
@ 2024-08-28 19:30 ` Sridhar, Kanchana P
0 siblings, 0 replies; 23+ messages in thread
From: Sridhar, Kanchana P @ 2024-08-28 19:30 UTC (permalink / raw)
To: Nhat Pham
Cc: linux-kernel, linux-mm, hannes, yosryahmed, ryan.roberts, Huang,
Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal,
Vinodh, Sridhar, Kanchana P
> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Wednesday, August 28, 2024 10:24 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
>
> On Wed, Aug 28, 2024 at 8:55 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > better by definition... I wonder if lowering the memory limit even
> > further would show positive numbers? Or
>
> ... perhaps with a workload that has less cold data? or using the
> zswap shrinker to off load some of these cold objects to swap?
>
> Food for thought :)
This makes sense. Since this workload makes a one-time read/write access
to each 8-byte chunk in the mmap-ed region, it would be a good use case
to try with the zswap shrinker enabled.
I can run some experiments and share the results.
Thanks,
Kanchana
* Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
2024-08-28 9:35 [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
` (3 preceding siblings ...)
2024-08-28 15:55 ` [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios Nhat Pham
@ 2024-08-28 21:35 ` Nhat Pham
2024-08-29 0:06 ` Sridhar, Kanchana P
2024-08-29 3:59 ` Sridhar, Kanchana P
2024-08-28 22:37 ` Yosry Ahmed
5 siblings, 2 replies; 23+ messages in thread
From: Nhat Pham @ 2024-08-28 21:35 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, yosryahmed, ryan.roberts,
ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali,
vinodh.gopal
On Wed, Aug 28, 2024 at 2:35 AM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi All,
>
> This patch-series enables zswap_store() to accept and store mTHP
> folios. The most significant contribution in this series is from the
> earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> migrated to v6.11-rc3 in patch 2/4 of this series.
>
> [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Additionally, there is an attempt to modularize some of the functionality
> in zswap_store(), to make it more amenable to supporting any-order
> mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
> in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
> delete all offsets corresponding to a higher order folio stored in zswap.
>
Will this conflict with the mTHP swap work, especially mTHP swap-in and
zswap writeback?
My understanding is that, from zswap's perspective, the large folio is
broken apart into independent subpages, correct? What happens when we
have a partially written-back mTHP (i.e., some subpages are still in
zswap, whereas others have been written back to swap)? Would this
automatically prevent mTHP swapin?
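To make the scenario in the question concrete, here is a toy userspace model
in Python. It only mirrors what the cover letter describes (one entry per
subpage keyed by swap offset, and deletion of already-stored offsets if a
later subpage fails). The class and method names are hypothetical, zlib stands
in for the zswap compressor, and it says nothing about how the swap-in path in
the actual patches handles a partially written-back folio.

```python
import zlib


class ToyZswap:
    """Toy model: a dict keyed by swap offset stands in for the xarray."""

    def __init__(self):
        self.tree = {}

    def store_folio(self, first_offset, subpages, compress=zlib.compress):
        """Store every subpage, or none of them.

        `compress` returns bytes on success or None on failure. On failure,
        the offsets stored so far for this folio are deleted again.
        """
        stored = []
        for i, page in enumerate(subpages):
            data = compress(page)
            if data is None:
                for off in stored:          # roll back the partial store
                    del self.tree[off]
                return False
            self.tree[first_offset + i] = data
            stored.append(first_offset + i)
        return True

    def writeback(self, offset):
        """Model writeback of one subpage: it leaves zswap for backing swap."""
        self.tree.pop(offset, None)

    def fully_in_zswap(self, first_offset, nr_pages):
        """True only if every subpage of the range still has a zswap entry."""
        return all(first_offset + i in self.tree for i in range(nr_pages))


zs = ToyZswap()
folio = [bytes([i]) * 4096 for i in range(16)]   # a 64K folio as 16 subpages
zs.store_folio(0, folio)
zs.writeback(3)                                   # one subpage written back
print(zs.fully_in_zswap(0, 16))                   # the range is now mixed
```

Once any single offset has been written back, the range is split between
zswap and the backing device, which is the state the question is about.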
* Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
2024-08-28 9:35 [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
` (4 preceding siblings ...)
2024-08-28 21:35 ` Nhat Pham
@ 2024-08-28 22:37 ` Yosry Ahmed
2024-08-29 0:20 ` Sridhar, Kanchana P
2024-08-29 23:33 ` Nhat Pham
5 siblings, 2 replies; 23+ messages in thread
From: Yosry Ahmed @ 2024-08-28 22:37 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, nphamcs, ryan.roberts,
ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali,
vinodh.gopal
On Wed, Aug 28, 2024 at 2:35 AM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi All,
>
> This patch-series enables zswap_store() to accept and store mTHP
> folios. The most significant contribution in this series is from the
> earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> migrated to v6.11-rc3 in patch 2/4 of this series.
>
> [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Additionally, there is an attempt to modularize some of the functionality
> in zswap_store(), to make it more amenable to supporting any-order
> mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
> in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
> delete all offsets corresponding to a higher order folio stored in zswap.
>
> For accounting purposes, the patch-series adds per-order mTHP sysfs
> "zswpout" counters that get incremented upon successful zswap_store of
> an mTHP folio:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
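A small Python sketch for collecting these counters across all orders. The
sysfs layout is the one given above; the function name and the `root`
parameter (useful for testing against a fake directory tree) are assumptions
for illustration.

```python
import glob
import os
import re


def read_mthp_zswpout(root="/sys/kernel/mm/transparent_hugepage"):
    """Return {folio size in kB: zswpout count} for every hugepages-*kB dir."""
    stats = {}
    pattern = os.path.join(root, "hugepages-*kB", "stats", "zswpout")
    for path in glob.glob(pattern):
        match = re.search(r"hugepages-(\d+)kB", path)
        if match is None:
            continue
        with open(path) as f:
            stats[int(match.group(1))] = int(f.read().strip())
    return dict(sorted(stats.items()))


if __name__ == "__main__":
    for size_kb, count in read_mthp_zswpout().items():
        print(f"hugepages-{size_kb}kB zswpout: {count}")
```

On a kernel without this patch-series the zswpout files do not exist, so the
function simply returns an empty dict.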
>
> This patch-series is a precursor to ZSWAP compress batching of mTHP
> swap-out and decompress batching of swap-ins based on swapin_readahead(),
> using Intel IAA hardware acceleration, which we would like to submit in
> subsequent RFC patch-series, with performance improvement data.
>
> Thanks to Ying Huang for pre-posting review feedback and suggestions!
>
> Changes since v4:
> =================
> 1) Published before/after data with zstd, as suggested by Nhat (Thanks
> Nhat for the data reviews!).
> 2) Rebased to mm-unstable from 8/27/2024,
> commit b659edec079c90012cf8d05624e312d1062b8b87.
> 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
> CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
> robot; as per Nhat's and Michal's suggestion to not require a separate
> patch to fix the build errors (thanks both!).
> 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
> suggested by Yosry (Thanks Yosry!).
> 5) Squashed the commits that define new mthp zswpout stat counters, and
> invoke count_mthp_stat() after successful zswap_store()s; into a single
> commit. Thanks Yosry for this suggestion!
>
> Changes since v3:
> =================
> 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
> Thanks to Barry for suggesting aligning with Ryan Roberts' latest
> changes to count_mthp_stat() so that it's always defined, even when THP
> is disabled. Barry, I have also made one other change in page_io.c
> where count_mthp_stat() is called by count_swpout_vm_event(). I would
> appreciate it if you can review this. Thanks!
> Hopefully this should resolve the kernel robot build errors.
>
> Changes since v2:
> =================
> 1) Gathered usemem data using SSD as the backing swap device for zswap,
> as suggested by Ying Huang. Ying, I would appreciate it if you can
> review the latest data. Thanks!
> 2) Generated the base commit info in the patches to attempt to address
> the kernel test robot build errors.
> 3) No code changes to the individual patches themselves.
>
> Changes since RFC v1:
> =====================
>
> 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> Thanks Barry!
> 2) Addressed some of the code review comments that Nhat Pham provided in
> Ryan's initial RFC [1]:
> - Added a comment about the cgroup zswap limit checks occurring once per
> folio at the beginning of zswap_store().
> Nhat, Ryan, please do let me know if the comments convey the summary
> from the RFC discussion. Thanks!
> - Posted data on running the cgroup suite's zswap kselftest.
> 3) Rebased to v6.11-rc3.
> 4) Gathered performance data with usemem and the rebased patch-series.
>
> Performance Testing:
> ====================
> Testing of this patch-series was done with the v6.11-rc3 mainline, without
> and with this patch-series, on an Intel Sapphire Rapids server,
> dual-socket 56 cores per socket, 4 IAA devices per socket.
>
> The system has 503 GiB RAM, with 176GiB ZRAM (35% of available RAM) as the
> backing swap device for ZSWAP. zstd is configured as the ZRAM compressor.
> Core frequency was fixed at 2500MHz.
>
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 40G. There is no swap limit set for the cgroup. Following a
> similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
> series [2], 70 usemem processes were run, each allocating and writing 1G of
> memory:
>
> usemem --init-time -w -O -n 70 1g
>
> The vm/sysfs mTHP stats included with the performance data provide details
> on the swapout activity to ZSWAP/swap.
>
> Other kernel configuration parameters:
>
> ZSWAP Compressors : zstd, deflate-iaa
> ZSWAP Allocator : zsmalloc
> SWAP page-cluster : 2
>
> In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> IAA "compression verification" is enabled. Hence each IAA compression
> will be decompressed internally by the "iaa_crypto" driver, the crc-s
> returned by the hardware will be compared and errors reported in case of
> mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> compared to the software compressors.
>
> Throughput is derived by averaging the individual 70 processes' throughputs
> reported by usemem. sys time is measured with perf. All data points are
> averaged across 3 runs.
>
> 64KB mTHP (cgroup memory.high set to 40G):
> ==========================================
>
> ------------------------------------------------------------------------------
>                       v6.11-rc3 mainline            zswap-mTHP      Change wrt
>                                 Baseline                              Baseline
> ------------------------------------------------------------------------------
> ZSWAP compressor         zstd   deflate-       zstd   deflate-   zstd deflate-
>                                      iaa                    iaa             iaa
> ------------------------------------------------------------------------------
> Throughput (KB/s)     161,496    156,343    140,363    151,938   -13%      -3%
> sys time (sec)         771.68     802.08     954.85     735.47   -24%       8%
> memcg_high            111,223    110,889    138,651    133,884
> memcg_swap_high             0          0          0          0
> memcg_swap_fail             0          0          0          0
> pswpin                     16         16          0          0
> pswpout             7,471,472  7,527,963          0          0
> zswpin                    635        605        624        639
> zswpout                 1,509      1,478  9,453,761  9,385,910
> thp_swpout                  0          0          0          0
> thp_swpout_                 0          0          0          0
>  fallback
> pgmajfault              3,616      3,430      4,633      3,611
> ZSWPOUT-64kB              n/a        n/a    590,768    586,521
> SWPOUT-64kB           466,967    470,498          0          0
> ------------------------------------------------------------------------------
>
> 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> =======================================================
>
> ------------------------------------------------------------------------------
> v6.11-rc3 mainline zswap-mTHP Change wrt
> Baseline Baseline
> ------------------------------------------------------------------------------
> ZSWAP compressor zstd deflate- zstd deflate- zstd deflate-
> iaa iaa iaa
> ------------------------------------------------------------------------------
> Throughput (KB/s) 192,164 194,643 165,005 174,536 -14% -10%
> sys time (sec) 823.55 830.42 801.72 676.65 3% 19%
> memcg_high 16,054 15,936 14,951 16,096
> memcg_swap_high 0 0 0 0
> memcg_swap_fail 0 0 0 0
> pswpin 0 0 0 0
> pswpout 8,629,248 8,628,907 0 0
> zswpin 560 645 5,333 781
> zswpout 1,416 1,503 8,546,895 9,355,760
> thp_swpout 16,854 16,853 0 0
> thp_swpout_ 0 0 0 0
> fallback
> pgmajfault 3,341 3,574 8,139 3,582
> ZSWPOUT-2048kB n/a n/a 16,684 18,270
> SWPOUT-2048kB 16,854 16,853 0 0
> ------------------------------------------------------------------------------
>
> In the "Before" scenario, when zswap does not store mTHP, only allocations
> count towards the cgroup memory limit. However, in the "After" scenario,
> with the introduction of zswap_store() mTHP, both, allocations as well as
> the zswap compressed pool usage from all 70 processes are counted towards
> the memory limit. As a result, we see higher swapout activity in the
> "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
> charge leads to more frequent memory.high breaches.
>
> This causes degradation in throughput and sys time with zswap mTHP, more so
> in case of zstd than deflate-iaa. Compress latency could play a part in
> this - when there is more swapout activity happening, a slower compressor
> would cause allocations to stall for any/all of the 70 processes.
>
> In my opinion, even though the test set up does not provide an accurate
> way for a direct before/after comparison (because of zswap usage being
> counted in cgroup, hence towards the memory.high), it still seems
> reasonable for zswap_store to support (m)THP, so that further performance
> improvements can be implemented.
Are you saying that in the "Before" data we end up skipping zswap
completely because of using mTHPs?
Does it make more sense to turn off CONFIG_THP_SWAP in the "Before" data
to force the mTHPs to be split and for the data to be stored in zswap?
This would be a fairer Before/After comparison where the memory
goes to zswap in both cases, but "Before" has to be split because of
zswap's lack of support for mTHP. I assume most setups relying on
zswap will be turning CONFIG_THP_SWAP off today anyway, but maybe not.
Nhat, is this something you can share?
* RE: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
2024-08-28 21:35 ` Nhat Pham
@ 2024-08-29 0:06 ` Sridhar, Kanchana P
2024-08-29 17:10 ` Nhat Pham
2024-08-29 3:59 ` Sridhar, Kanchana P
1 sibling, 1 reply; 23+ messages in thread
From: Sridhar, Kanchana P @ 2024-08-29 0:06 UTC (permalink / raw)
To: Nhat Pham
Cc: linux-kernel, linux-mm, hannes, yosryahmed, ryan.roberts, Huang,
Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal,
Vinodh, Sridhar, Kanchana P
> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Wednesday, August 28, 2024 2:35 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
>
> On Wed, Aug 28, 2024 at 2:35 AM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi All,
> >
> > This patch-series enables zswap_store() to accept and store mTHP
> > folios. The most significant contribution in this series is from the
> > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> > migrated to v6.11-rc3 in patch 2/4 of this series.
> >
> > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> > https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> >
> > Additionally, there is an attempt to modularize some of the functionality
> > in zswap_store(), to make it more amenable to supporting any-order
> > mTHPs. For instance, the function zswap_store_entry() stores a
> zswap_entry
> > in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
> > delete all offsets corresponding to a higher order folio stored in zswap.
> >
>
> Will this have any conflict with mTHP swap work? Especially with mTHP
> swap-in and zswap writeback.
>
> My understanding is from zswap's perspective, the large folio is
> broken apart into independent subpages, correct? What happens when we
> have partially written back mTHP (i.e some subpages are in zswap
> still, whereas others are written back to swap). Would this
> automatically prevent mTHP swapin?
That is a good point. To begin with, this patch-series makes the default
behavior of mTHP swapout/storage and swapin for ZSWAP on par with ZRAM.
From zswap's perspective, imo this is a significant step towards realizing
cold memory storage with mTHP folios. However, it is only a starting point
that makes the behavior uniform across zswap and zram.

Initially, workloads would see a one-time benefit from reclaim being able
to swap out mTHP folios to zswap without splitting them. If the mTHPs were
cold memory, we would have obtained the memory savings (with zswap) with
lower latency. However, if the mTHPs were part of "not so cold" memory, this
would result in a one-way conversion of the mTHP to 4K folios. Depending on
the workload and its access patterns, we could see either individual 4K
folios being swapped in, or larger chunks, possibly the entire original
mTHP, needing to be swapped in.

It should be noted that this is a trade-off between performance and cold
memory preservation, and that trade-off needs to drive the mTHP reclaim,
storage, swapin and writeback policy. Different workloads could require
different policies. Even though this patch-series is only a starting point,
it is functionally correct: it is equivalent to zram-mTHP, and compatible
with the rest of mm and swap as far as mTHP is concerned.

Another important functionality/data consistency decision I made in this
patch-series is the error handling during zswap_store() of an mTHP: in case
of any error, all swap offsets for the mTHP are deleted from the zswap
xarray/zpool, since we know that the mTHP will now have to be stored in the
backing swap device. IOW, an mTHP is either entirely stored in zswap, or
entirely not stored in zswap.
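
The all-or-nothing rule can be sketched in user space as follows. This is
only a toy Python model, not the kernel code: store_folio(), the dict that
stands in for the zswap xarray, and the compress callback are hypothetical
names chosen for illustration.

```python
import zlib


def store_folio(tree, first_offset, pages, compress):
    """Store every page of a folio in `tree`, or none of them.

    `tree` is a dict keyed by swap offset (a stand-in for the xarray).
    Returns True if all pages were stored. If compressing any page fails,
    the offsets already inserted for this folio are removed again and
    False is returned, so the caller can fall back to the swap device.
    """
    stored = []
    for i, page in enumerate(pages):
        offset = first_offset + i
        try:
            tree[offset] = compress(page)
        except ValueError:
            # Roll back the partial store: no offset of this folio survives.
            for done in stored:
                del tree[done]
            return False
        stored.append(offset)
    return True


# One 64K folio modelled as sixteen zero-filled 4K pages.
demo_tree = {}
demo_pages = [bytes(4096)] * 16
print(store_folio(demo_tree, 0, demo_pages, zlib.compress), len(demo_tree))
```

The point of the rollback loop is that a reader of the tree never observes
a folio that is only partly present.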

To answer your question, we would need to define the semantics of zswap
zpool storage granularity, swapin granularity, readahead granularity and
writeback wrt mTHP, and how the overall swap sub-system needs to "preserve"
mTHP vs. splitting mTHP into 4K/lower-order folios during swapout. Once we
have a good understanding of these policies, we could implement them in
zswap. Alternatively, we could develop an abstraction that is one level
above zswap/zram and makes things easier and shareable between the two. By
this, I mean fundamental assumptions such as consecutive swap offsets. To
some extent, this implies that an mTHP as a swap entity is defined by
consecutiveness of swap offsets. Maybe the policy to keep mTHPs in the
system over an extended duration could be to assemble them dynamically based
on swapin_readahead() decisions (which are based on workload access
patterns). In other words, mTHPs could be a useful abstraction that can be
static or even dynamic, based on working set characteristics and cold memory
preservation. This is quite a complex topic imho.
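
To make the "consecutive swap offsets" assumption concrete, here is a small
sketch of one possible check. It is purely hypothetical: is_mthp_candidate()
and the rule that all subpages must live in the same backend are assumptions
for illustration, not the behavior of this series or of any existing kernel
code.

```python
def is_mthp_candidate(entries):
    """Decide whether a run of 4K swap entries could be treated as one mTHP.

    `entries` is a list of (swap_offset, in_zswap) pairs, one per subpage,
    in virtual-address order. Under the assumed policy, the run qualifies
    only if the offsets are consecutive and every subpage is in the same
    place (all in zswap, or all written back to the swap device).
    """
    if not entries:
        return False
    first_offset, first_in_zswap = entries[0]
    for i, (offset, in_zswap) in enumerate(entries):
        if offset != first_offset + i:
            return False  # offsets are not consecutive
        if in_zswap != first_in_zswap:
            return False  # partially written back: mixed backends
    return True


print(is_mthp_candidate([(64, True), (65, True), (66, True), (67, True)]))
print(is_mthp_candidate([(64, True), (65, False), (66, True), (67, True)]))
```

The second call models the partially written back case raised in the
question: the offsets are still consecutive, but the subpages no longer live
in one place.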
As we know, Barry Song and Chuanhua Han have started the discussion on
this in their zram mTHP swapin series [1].
[1] https://lore.kernel.org/all/20240821074541.516249-3-hanchuanhua@oppo.com/T/#u
Thanks,
Kanchana
* RE: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
2024-08-28 22:37 ` Yosry Ahmed
@ 2024-08-29 0:20 ` Sridhar, Kanchana P
2024-08-29 1:01 ` Yosry Ahmed
2024-08-29 23:33 ` Nhat Pham
1 sibling, 1 reply; 23+ messages in thread
From: Sridhar, Kanchana P @ 2024-08-29 0:20 UTC (permalink / raw)
To: Yosry Ahmed
Cc: linux-kernel, linux-mm, hannes, nphamcs, ryan.roberts, Huang,
Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal,
Vinodh, Sridhar, Kanchana P
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, August 28, 2024 3:37 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
>
> On Wed, Aug 28, 2024 at 2:35 AM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi All,
> >
> > This patch-series enables zswap_store() to accept and store mTHP
> > folios. The most significant contribution in this series is from the
> > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> > migrated to v6.11-rc3 in patch 2/4 of this series.
> >
> > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> > > https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> >
> > Additionally, there is an attempt to modularize some of the functionality
> > in zswap_store(), to make it more amenable to supporting any-order
> > mTHPs. For instance, the function zswap_store_entry() stores a
> zswap_entry
> > in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
> > delete all offsets corresponding to a higher order folio stored in zswap.
> >
> > For accounting purposes, the patch-series adds per-order mTHP sysfs
> > "zswpout" counters that get incremented upon successful zswap_store of
> > an mTHP folio:
> >
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> >
> > This patch-series is a precursor to ZSWAP compress batching of mTHP
> > swap-out and decompress batching of swap-ins based on
> swapin_readahead(),
> > using Intel IAA hardware acceleration, which we would like to submit in
> > subsequent RFC patch-series, with performance improvement data.
> >
> > Thanks to Ying Huang for pre-posting review feedback and suggestions!
> >
> > Changes since v4:
> > =================
> > 1) Published before/after data with zstd, as suggested by Nhat (Thanks
> > Nhat for the data reviews!).
> > 2) Rebased to mm-unstable from 8/27/2024,
> > commit b659edec079c90012cf8d05624e312d1062b8b87.
> > 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
> > CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
> > robot; as per Nhat's and Michal's suggestion to not require a separate
> > patch to fix the build errors (thanks both!).
> > 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
> > suggested by Yosry (Thanks Yosry!).
> > 5) Squashed the commits that define new mthp zswpout stat counters, and
> > invoke count_mthp_stat() after successful zswap_store()s; into a single
> > commit. Thanks Yosry for this suggestion!
> >
> > Changes since v3:
> > =================
> > 1) Rebased to mm-unstable commit
> 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
> > Thanks to Barry for suggesting aligning with Ryan Roberts' latest
> > changes to count_mthp_stat() so that it's always defined, even when THP
> > is disabled. Barry, I have also made one other change in page_io.c
> > where count_mthp_stat() is called by count_swpout_vm_event(). I would
> > appreciate it if you can review this. Thanks!
> > Hopefully this should resolve the kernel robot build errors.
> >
> > Changes since v2:
> > =================
> > 1) Gathered usemem data using SSD as the backing swap device for zswap,
> > as suggested by Ying Huang. Ying, I would appreciate it if you can
> > review the latest data. Thanks!
> > 2) Generated the base commit info in the patches to attempt to address
> > the kernel test robot build errors.
> > 3) No code changes to the individual patches themselves.
> >
> > Changes since RFC v1:
> > =====================
> >
> > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> > Thanks Barry!
> > 2) Addressed some of the code review comments that Nhat Pham provided
> in
> > Ryan's initial RFC [1]:
> > > - Added a comment about the cgroup zswap limit checks occurring once
> per
> > folio at the beginning of zswap_store().
> > Nhat, Ryan, please do let me know if the comments convey the summary
> > from the RFC discussion. Thanks!
> > - Posted data on running the cgroup suite's zswap kselftest.
> > 3) Rebased to v6.11-rc3.
> > 4) Gathered performance data with usemem and the rebased patch-series.
> >
> > Performance Testing:
> > ====================
> > Testing of this patch-series was done with the v6.11-rc3 mainline, without
> > and with this patch-series, on an Intel Sapphire Rapids server,
> > dual-socket 56 cores per socket, 4 IAA devices per socket.
> >
> > The system has 503 GiB RAM, with 176GiB ZRAM (35% of available RAM) as
> the
> > backing swap device for ZSWAP. zstd is configured as the ZRAM compressor.
> > Core frequency was fixed at 2500MHz.
> >
> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > > was fixed at 40G. There is no swap limit set for the cgroup. Following a
> > similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
> > series [2], 70 usemem processes were run, each allocating and writing 1G of
> > memory:
> >
> > usemem --init-time -w -O -n 70 1g
> >
> > The vm/sysfs mTHP stats included with the performance data provide
> details
> > on the swapout activity to ZSWAP/swap.
> >
> > Other kernel configuration parameters:
> >
> > ZSWAP Compressors : zstd, deflate-iaa
> > ZSWAP Allocator : zsmalloc
> > SWAP page-cluster : 2
> >
> > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> > IAA "compression verification" is enabled. Hence each IAA compression
> > will be decompressed internally by the "iaa_crypto" driver, the crc-s
> > returned by the hardware will be compared and errors reported in case of
> > mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> > compared to the software compressors.
> >
> > Throughput is derived by averaging the individual 70 processes' throughputs
> > reported by usemem. sys time is measured with perf. All data points are
> > averaged across 3 runs.
> >
> > 64KB mTHP (cgroup memory.high set to 40G):
> > ==========================================
> >
> > ------------------------------------------------------------------------------
> > v6.11-rc3 mainline zswap-mTHP Change wrt
> > Baseline Baseline
> > ------------------------------------------------------------------------------
> > ZSWAP compressor zstd deflate- zstd deflate- zstd deflate-
> > iaa iaa iaa
> > ------------------------------------------------------------------------------
> > Throughput (KB/s) 161,496 156,343 140,363 151,938 -13% -3%
> > sys time (sec) 771.68 802.08 954.85 735.47 -24% 8%
> > memcg_high 111,223 110,889 138,651 133,884
> > memcg_swap_high 0 0 0 0
> > memcg_swap_fail 0 0 0 0
> > pswpin 16 16 0 0
> > pswpout 7,471,472 7,527,963 0 0
> > zswpin 635 605 624 639
> > zswpout 1,509 1,478 9,453,761 9,385,910
> > thp_swpout 0 0 0 0
> > thp_swpout_ 0 0 0 0
> > fallback
> > pgmajfault 3,616 3,430 4,633 3,611
> > ZSWPOUT-64kB n/a n/a 590,768 586,521
> > SWPOUT-64kB 466,967 470,498 0 0
> > ------------------------------------------------------------------------------
> >
> > 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> > =======================================================
> >
> > ------------------------------------------------------------------------------
> > v6.11-rc3 mainline zswap-mTHP Change wrt
> > Baseline Baseline
> > ------------------------------------------------------------------------------
> > ZSWAP compressor zstd deflate- zstd deflate- zstd deflate-
> > iaa iaa iaa
> > ------------------------------------------------------------------------------
> > Throughput (KB/s) 192,164 194,643 165,005 174,536 -14% -10%
> > sys time (sec) 823.55 830.42 801.72 676.65 3% 19%
> > memcg_high 16,054 15,936 14,951 16,096
> > memcg_swap_high 0 0 0 0
> > memcg_swap_fail 0 0 0 0
> > pswpin 0 0 0 0
> > pswpout 8,629,248 8,628,907 0 0
> > zswpin 560 645 5,333 781
> > zswpout 1,416 1,503 8,546,895 9,355,760
> > thp_swpout 16,854 16,853 0 0
> > thp_swpout_ 0 0 0 0
> > fallback
> > pgmajfault 3,341 3,574 8,139 3,582
> > ZSWPOUT-2048kB n/a n/a 16,684 18,270
> > SWPOUT-2048kB 16,854 16,853 0 0
> > ------------------------------------------------------------------------------
> >
> > In the "Before" scenario, when zswap does not store mTHP, only allocations
> > count towards the cgroup memory limit. However, in the "After" scenario,
> > with the introduction of zswap_store() mTHP, both, allocations as well as
> > the zswap compressed pool usage from all 70 processes are counted
> towards
> > the memory limit. As a result, we see higher swapout activity in the
> > "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
> > charge leads to more frequent memory.high breaches.
> >
> > This causes degradation in throughput and sys time with zswap mTHP, more
> so
> > in case of zstd than deflate-iaa. Compress latency could play a part in
> > this - when there is more swapout activity happening, a slower compressor
> > would cause allocations to stall for any/all of the 70 processes.
> >
> > In my opinion, even though the test set up does not provide an accurate
> > way for a direct before/after comparison (because of zswap usage being
> > counted in cgroup, hence towards the memory.high), it still seems
> > reasonable for zswap_store to support (m)THP, so that further performance
> > improvements can be implemented.
>
> Are you saying that in the "Before" data we end up skipping zswap
> completely because of using mTHPs?
That's right, Yosry.
>
> Does it make more sense to turn off CONFIG_THP_SWAP in the "Before" data
We could do this. However, I am not sure whether turning off CONFIG_THP_SWAP
would have other side effects, such as disabling mTHP optimizations in mm
code paths outside of zswap, which could again skew the before/after
comparison.
Will wait for Nhat's comments as well.
Thanks,
Kanchana
> to force the mTHPs to be split and for the data to be stored in zswap?
> This would be a more fair Before/After comparison where the memory
> goes to zswap in both cases, but "Before" has to be split because of
> zswap's lack of support for mTHP. I assume most setups relying on
> zswap will be turning CONFIG_THP_SWAP off today anyway, but maybe not.
> Nhat, is this something you can share?
* Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
2024-08-29 0:20 ` Sridhar, Kanchana P
@ 2024-08-29 1:01 ` Yosry Ahmed
2024-08-29 3:10 ` Sridhar, Kanchana P
0 siblings, 1 reply; 23+ messages in thread
From: Yosry Ahmed @ 2024-08-29 1:01 UTC (permalink / raw)
To: Sridhar, Kanchana P
Cc: linux-kernel, linux-mm, hannes, nphamcs, ryan.roberts, Huang,
Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal,
Vinodh, Chris Li
[..]
> > > In the "Before" scenario, when zswap does not store mTHP, only allocations
> > > count towards the cgroup memory limit. However, in the "After" scenario,
> > > with the introduction of zswap_store() mTHP, both, allocations as well as
> > > the zswap compressed pool usage from all 70 processes are counted
> > towards
> > > the memory limit. As a result, we see higher swapout activity in the
> > > "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
> > > charge leads to more frequent memory.high breaches.
> > >
> > > This causes degradation in throughput and sys time with zswap mTHP, more
> > so
> > > in case of zstd than deflate-iaa. Compress latency could play a part in
> > > this - when there is more swapout activity happening, a slower compressor
> > > would cause allocations to stall for any/all of the 70 processes.
> > >
> > > In my opinion, even though the test set up does not provide an accurate
> > > way for a direct before/after comparison (because of zswap usage being
> > > counted in cgroup, hence towards the memory.high), it still seems
> > > reasonable for zswap_store to support (m)THP, so that further performance
> > > improvements can be implemented.
> >
> > Are you saying that in the "Before" data we end up skipping zswap
> > completely because of using mTHPs?
>
> That's right, Yosry.
>
> >
> > Does it make more sense to turn off CONFIG_THP_SWAP in the "Before" data
>
> We could do this, however I am not sure if turning off CONFIG_THP_SWAP
> will have other side-effects in terms of disabling mm code paths outside of
> zswap that are intended to be mTHP optimizations that could again skew
> the before/after comparisons.
Yeah that's possible, but right now we are testing mTHP swapout that
does not go through zswap at all vs. mTHP swapout going through zswap.

I think what we really want to measure is 4K swapout going through
zswap vs. mTHP swapout going through zswap. This assumes that current
zswap setups disable CONFIG_THP_SWAP, so we would be measuring the
benefit of allowing them to enable CONFIG_THP_SWAP by supporting it
properly in zswap.

If some setups with zswap have CONFIG_THP_SWAP enabled then that's a
different story, but we already have the data for this case as well
right now in case this is a legitimate setup.

Adding Chris Li here from Google. We have CONFIG_THP_SWAP disabled
with zswap, so for us we would want to know the benefit of supporting
CONFIG_THP_SWAP properly in zswap. At least I think so :)
>
> Will wait for Nhat's comments as well.
>
> Thanks,
> Kanchana
>
> > to force the mTHPs to be split and for the data to be stored in zswap?
> > This would be a more fair Before/After comparison where the memory
> > goes to zswap in both cases, but "Before" has to be split because of
> > zswap's lack of support for mTHP. I assume most setups relying on
> > zswap will be turning CONFIG_THP_SWAP off today anyway, but maybe not.
> > Nhat, is this something you can share?
* RE: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
2024-08-29 1:01 ` Yosry Ahmed
@ 2024-08-29 3:10 ` Sridhar, Kanchana P
0 siblings, 0 replies; 23+ messages in thread
From: Sridhar, Kanchana P @ 2024-08-29 3:10 UTC (permalink / raw)
To: Yosry Ahmed
Cc: linux-kernel, linux-mm, hannes, nphamcs, ryan.roberts, Huang,
Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal,
Vinodh, Chris Li, Sridhar, Kanchana P
Hi Yosry,
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, August 28, 2024 6:02 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>; Chris
> Li <chrisl@kernel.org>
> Subject: Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
>
> [..]
> > > > In the "Before" scenario, when zswap does not store mTHP, only
> allocations
> > > > count towards the cgroup memory limit. However, in the "After"
> scenario,
> > > > with the introduction of zswap_store() mTHP, both, allocations as well as
> > > > the zswap compressed pool usage from all 70 processes are counted
> > > towards
> > > > the memory limit. As a result, we see higher swapout activity in the
> > > > "After" data. Hence, more time is spent doing reclaim as the zswap
> cgroup
> > > > charge leads to more frequent memory.high breaches.
> > > >
> > > > This causes degradation in throughput and sys time with zswap mTHP,
> more
> > > so
> > > > in case of zstd than deflate-iaa. Compress latency could play a part in
> > > > this - when there is more swapout activity happening, a slower
> compressor
> > > > would cause allocations to stall for any/all of the 70 processes.
> > > >
> > > > In my opinion, even though the test set up does not provide an accurate
> > > > way for a direct before/after comparison (because of zswap usage being
> > > > counted in cgroup, hence towards the memory.high), it still seems
> > > > reasonable for zswap_store to support (m)THP, so that further
> performance
> > > > improvements can be implemented.
> > >
> > > Are you saying that in the "Before" data we end up skipping zswap
> > > completely because of using mTHPs?
> >
> > That's right, Yosry.
> >
> > >
> > > Does it make more sense to turn off CONFIG_THP_SWAP in the "Before" data
> >
> > We could do this, however I am not sure if turning off CONFIG_THP_SWAP
> > will have other side-effects in terms of disabling mm code paths outside of
> > zswap that are intended to be mTHP optimizations that could again skew
> > the before/after comparisons.
>
> Yeah that's possible, but right now we are testing mTHP swapout that
> does not go through zswap at all vs. mTHP swapout going through zswap.
>
> I think what we really want to measure is 4K swapout going through
> zswap vs. mTHP swapout going through zswap. This assumes that current
> zswap setups disable CONFIG_THP_SWAP, so we would be measuring the
> benefit of allowing them to enable CONFIG_THP_SWAP by supporting it
> properly in zswap.
>
> If some setups with zswap have CONFIG_THP_SWAP enabled then that's a
> different story, but we already have the data for this case as well
> right now in case this is a legitimate setup.
>
> Adding Chris Li here from Google. We have CONFIG_THP_SWAP disabled
> with zswap, so for us we would want to know the benefit of supporting
> CONFIG_THP_SWAP properly in zswap. At least I think so :)
Sure, this makes sense. Here's the data that I gathered with CONFIG_THP_SWAP
disabled. We see improvements overall in throughput and sys time for zstd and
deflate-iaa, when comparing before (THP_SWAP=N) vs. after (THP_SWAP=Y):

64K mTHP:
=========
-------------------------------------------------------------------------------
                      v6.11-rc3 mainline         zswap-mTHP         Change wrt
                           Baseline                                  Baseline
                      CONFIG_THP_SWAP=N      CONFIG_THP_SWAP=Y
-------------------------------------------------------------------------------
ZSWAP compressor         zstd    deflate-       zstd    deflate-  zstd deflate-
                                      iaa                    iaa            iaa
-------------------------------------------------------------------------------
Throughput (KB/s)     136,113     140,044    140,363     151,938    3%       8%
sys time (sec)         986.78      951.95     954.85      735.47    3%      23%
memcg_high            124,183     127,513    138,651     133,884
memcg_swap_high             0           0          0           0
memcg_swap_fail       619,020     751,099          0           0
pswpin                      0           0          0           0
pswpout                     0           0          0           0
zswpin                    656         569        624         639
zswpout             9,413,603  11,284,812  9,453,761   9,385,910
thp_swpout                  0           0          0           0
thp_swpout_                 0           0          0           0
fallback
pgmajfault              3,470       3,382      4,633       3,611
ZSWPOUT-64kB              n/a         n/a    590,768     586,521
SWPOUT-64kB                 0           0          0           0
-------------------------------------------------------------------------------
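
As a quick consistency check on the zswap-mTHP columns of the 64K table:
assuming 4K base pages, and assuming that zswpout counts 4K pages while
ZSWPOUT-64kB counts folios, the two counters should roughly agree once the
folio count is scaled by 16. The helper name pages_in_folios() is made up
for this sketch, and since all data points are averages over 3 runs, a small
remainder is expected.

```python
BASE_PAGE_KB = 4  # assumption: 4 KiB base pages


def pages_in_folios(folio_count, folio_kb, base_page_kb=BASE_PAGE_KB):
    """Number of base pages covered by `folio_count` folios of `folio_kb` KiB."""
    return folio_count * (folio_kb // base_page_kb)


# (compressor, ZSWPOUT-64kB, zswpout) from the zswap-mTHP columns.
rows = [
    ("zstd", 590_768, 9_453_761),
    ("deflate-iaa", 586_521, 9_385_910),
]

for name, folios, zswpout in rows:
    from_mthp = pages_in_folios(folios, 64)
    # The remainder would be pages that reached zswap as order-0 folios.
    print(f"{name}: {from_mthp} pages from 64K folios, {zswpout - from_mthp} other")
```

For both compressors the remainder is on the order of 1,500 pages, i.e.
nearly all zswpout activity in these runs came from 64K folios.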

2M THP:
=======
-------------------------------------------------------------------------------
                      v6.11-rc3 mainline         zswap-mTHP         Change wrt
                           Baseline                                  Baseline
                      CONFIG_THP_SWAP=N      CONFIG_THP_SWAP=Y
-------------------------------------------------------------------------------
ZSWAP compressor         zstd    deflate-       zstd    deflate-  zstd deflate-
                                      iaa                    iaa            iaa
-------------------------------------------------------------------------------
Throughput (KB/s)     164,220     172,523    165,005     174,536  0.5%       1%
sys time (sec)         855.76      686.94     801.72      676.65    6%       1%
memcg_high             14,628      16,247     14,951      16,096
memcg_swap_high             0           0          0           0
memcg_swap_fail        18,698      21,114          0           0
pswpin                      0           0          0           0
pswpout                     0           0          0           0
zswpin                    663         665      5,333         781
zswpout             8,419,458   8,992,065  8,546,895   9,355,760
thp_swpout                  0           0          0           0
thp_swpout_            18,697      21,113          0           0
fallback
pgmajfault              3,439       3,496      8,139       3,582
ZSWPOUT-2048kB            n/a         n/a     16,684      18,270
SWPOUT-2048kB               0           0          0           0
-------------------------------------------------------------------------------
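
The "Change wrt Baseline" columns appear to follow the convention that a
positive number is an improvement: higher throughput, or lower sys time. A
small sketch of that arithmetic, under that assumption (gain_pct() is a
hypothetical helper, not part of any tool used for these measurements):

```python
def gain_pct(before, after, higher_is_better):
    """Percent improvement of `after` over `before`.

    Positive means better: an increase for metrics such as throughput,
    a decrease for metrics such as sys time.
    """
    delta = (after - before) if higher_is_better else (before - after)
    return 100.0 * delta / before


# 64K mTHP rows: (metric, compressor, THP_SWAP=N value, zswap-mTHP value, higher_is_better)
rows = [
    ("Throughput (KB/s)", "zstd", 136_113, 140_363, True),
    ("Throughput (KB/s)", "deflate-iaa", 140_044, 151_938, True),
    ("sys time (sec)", "zstd", 986.78, 954.85, False),
    ("sys time (sec)", "deflate-iaa", 951.95, 735.47, False),
]

for metric, compressor, before, after, higher_is_better in rows:
    print(f"{metric:18} {compressor:12} {gain_pct(before, after, higher_is_better):5.1f}%")
```

Rounded to whole percents, these four rows give 3%, 8%, 3% and 23%, which
matches the 64K table; the same arithmetic reproduces the 2M rows.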
Thanks,
Kanchana
>
> >
> > Will wait for Nhat's comments as well.
> >
> > Thanks,
> > Kanchana
> >
> > > to force the mTHPs to be split and for the data to be stored in zswap?
> > > This would be a more fair Before/After comparison where the memory
> > > goes to zswap in both cases, but "Before" has to be split because of
> > > zswap's lack of support for mTHP. I assume most setups relying on
> > > zswap will be turning CONFIG_THP_SWAP off today anyway, but maybe
> not.
> > > Nhat, is this something you can share?
* RE: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
2024-08-28 21:35 ` Nhat Pham
2024-08-29 0:06 ` Sridhar, Kanchana P
@ 2024-08-29 3:59 ` Sridhar, Kanchana P
1 sibling, 0 replies; 23+ messages in thread
From: Sridhar, Kanchana P @ 2024-08-29 3:59 UTC (permalink / raw)
To: Nhat Pham
Cc: linux-kernel, linux-mm, hannes, yosryahmed, ryan.roberts, Huang,
Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal,
Vinodh, Sridhar, Kanchana P
> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Wednesday, August 28, 2024 2:35 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
>
> On Wed, Aug 28, 2024 at 2:35 AM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
[snip]
> My understanding is from zswap's perspective, the large folio is
> broken apart into independent subpages, correct? What happens when we
Yes, this is correct.
> have partially written back mTHP (i.e some subpages are in zswap
> still, whereas others are written back to swap). Would this
> automatically prevent mTHP swapin?
* Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
2024-08-29 0:06 ` Sridhar, Kanchana P
@ 2024-08-29 17:10 ` Nhat Pham
2024-08-29 19:38 ` Sridhar, Kanchana P
0 siblings, 1 reply; 23+ messages in thread
From: Nhat Pham @ 2024-08-29 17:10 UTC (permalink / raw)
To: Sridhar, Kanchana P
Cc: linux-kernel, linux-mm, hannes, yosryahmed, ryan.roberts, Huang,
Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal,
Vinodh, Usama Arif, Chengming Zhou
On Wed, Aug 28, 2024 at 5:06 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
>
> On Wed, Aug 28, 2024 at 2:35 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > On Wed, Aug 28, 2024 at 2:35 AM Kanchana P Sridhar
> > <kanchana.p.sridhar@intel.com> wrote:
> > >
> > > Hi All,
> > >
> > > This patch-series enables zswap_store() to accept and store mTHP
> > > folios. The most significant contribution in this series is from the
> > > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> > > migrated to v6.11-rc3 in patch 2/4 of this series.
> > >
> > > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> > > https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> > >
> > > Additionally, there is an attempt to modularize some of the functionality
> > > in zswap_store(), to make it more amenable to supporting any-order
> > > mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
> > > in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
> > > delete all offsets corresponding to a higher order folio stored in zswap.
> > >
> >
> > Will this have any conflict with mTHP swap work? Especially with mTHP
> > swap-in and zswap writeback.
> >
> > My understanding is from zswap's perspective, the large folio is
> > broken apart into independent subpages, correct? What happens when we
> > have partially written back mTHP (i.e some subpages are in zswap
> > still, whereas others are written back to swap). Would this
> > automatically prevent mTHP swapin?
>
> That is a good point. To begin with, this patch-series would make the default
> behavior for mTHP swapout/storage and swapin for ZSWAP to be on par with
> ZRAM. From zswap's perspective, imo this is a significant step forward towards
> realizing cold memory storage with mTHP folios. However, it is only a starting
> point that makes the behavior uniform across zswap/zram. Initially, workloads
> would see a one-time benefit with reclaim being able to swapout mTHP
> folios without splitting, to zswap. If the mTHPs were cold memory, then we
> would have derived latency gains towards memory savings (with zswap).
>
> However, if the mTHP were part of "not so cold" memory, this would result
> in a one-way mTHP conversion to 4K folios. Depending on workloads and their
> access patterns, we could either see individual 4K folios being swapped in,
> or entire chunks if not the entire (original) mTHP needing to be swapped in.
>
> It should be noted that this is more of a performance vs. cold memory
> preservation trade-off that needs to drive mTHP reclaim, storage, swapin and
> writeback policy. Different workloads could require different policies. However,
> even though this patch is only a starting point, it is still functionally correct
> by being equivalent to zram-mTHP, and compatible with the rest of mm and
> swap as far as mTHP. Another important functionality/data consistency decision
> I made in this patch series is error handling during zswap_store() of mTHP:
> in case of any errors, all swap offsets for the mTHP are deleted from the
> zswap xarray/zpool, since we know that the mTHP will now have to be stored
> in the backing swap device. IOW, an mTHP is either entirely stored in zswap,
> or entirely not stored in zswap.
>
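The all-or-nothing rule quoted above can be modeled in user space. This is only a minimal sketch under stated assumptions, not the code from the patch: the stored[] array stands in for the zswap xarray, and store_subpage(), store_folio() and count_stored() are hypothetical names invented for the illustration. delete_stored_offsets() is named after the role the cover letter gives zswap_delete_stored_offsets(), but its body here is a guess.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_OFFSETS 64   /* hypothetical size of the toy swap-offset space */

/* Toy stand-in for the zswap xarray: stored[i] is true if offset i holds an entry. */
static bool stored[NR_OFFSETS];

/* Hypothetical per-subpage store; fails if the page is flagged as unstorable. */
static bool store_subpage(size_t offset, bool storable)
{
	if (!storable)
		return false;
	stored[offset] = true;
	return true;
}

/* Drop every entry in [start, start + nr), whether or not it was stored. */
static void delete_stored_offsets(size_t start, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		stored[start + i] = false;
}

/*
 * Store nr consecutive subpages of one folio. On the first failure, undo
 * the whole range so the folio is never left partially stored.
 */
static bool store_folio(size_t start, size_t nr, const bool *storable)
{
	assert(start + nr <= NR_OFFSETS);

	for (size_t i = 0; i < nr; i++) {
		if (!store_subpage(start + i, storable[i])) {
			delete_stored_offsets(start, nr);
			return false;
		}
	}
	return true;
}

/* Count entries currently stored in [start, start + nr). */
static size_t count_stored(size_t start, size_t nr)
{
	size_t n = 0;

	for (size_t i = 0; i < nr; i++)
		n += stored[start + i] ? 1 : 0;
	return n;
}
```

For a four-page folio whose third subpage fails, store_folio() returns false and count_stored() over that range is 0, so the caller can send the whole folio to the backing swap device.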
> To answer your question, we would need to come up with what the semantics
> would need to be for zswap zpool storage granularity, swapin granularity,
> readahead granularity and writeback wrt mTHP and how the overall swap
> sub-system needs to "preserve" mTHP vs. splitting mTHP into 4K/lower-order
> folios during swapout. Once we have a good understanding of these policies,
> we could implement them in zswap. Alternately, develop an abstraction that is
> one level above zswap/zram and makes things easier and shareable between
> zswap and zram. By this, I mean fundamental assumptions such as consecutive
> swap offsets (for instance). To some extent, this implies that an mTHP as a
> swap entity is defined by consecutiveness of swap offsets. Maybe the policy
> to keep mTHPs in the system over extended duration might be to assemble
> them dynamically based on swapin_readahead() decisions (which is based on
> workload access patterns). In other words, mTHPs could be a useful abstraction
> that can be static or even dynamic based on working set characteristics, and
> cold memory preservation. This is quite a complex topic imho.
>
> As we know, Barry Song and Chuanhua Han have started the discussion on
> this in their zram mTHP swapin series [1].
Yeah I'm a bit more concerned with the correctness aspect. As long as
it's not buggy, then we can implement mTHP zswapout first, and force
individual subpage (z)swapin for now (since we cannot control
writeback from writing individual subpages).
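One way to picture that interim rule, as a hedged user-space sketch rather than kernel code: a large swapin is attempted only when none of the folio's swap offsets is held by zswap, and anything else falls back to one page at a time. The enum, the slots array and swapin_nr_pages_for() are hypothetical names made up for this illustration; the real decision point in mm may look quite different.

```c
#include <stddef.h>

/* Hypothetical: where the data for one swap offset currently lives. */
enum slot_backing { SLOT_IN_ZSWAP, SLOT_ON_SWAP_DEVICE };

/*
 * Return how many pages to swap in at once for nr_pages consecutive
 * offsets: the whole range only if no offset is in zswap, otherwise 1.
 */
static size_t swapin_nr_pages_for(const enum slot_backing *slots, size_t nr_pages)
{
	for (size_t i = 0; i < nr_pages; i++) {
		if (slots[i] == SLOT_IN_ZSWAP)
			return 1;	/* fully or partially in zswap: per-subpage swapin */
	}
	return nr_pages;		/* nothing in zswap: a large swapin is possible */
}
```

A partially written-back mTHP (some offsets in zswap, others on the swap device) takes the per-subpage path here, which is the conservative answer to the "would this prevent mTHP swapin?" question.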
We can discuss strategy to harmonize mTHP, zswap (with writeback) as
we go along.
BTW, I think we're not cc-ing Chengming? Is the get_maintainer.pl script
not working properly... Let me manually add him in - please include
him in future submission and responses, as he is also a zswap reviewer
:)
Also cc-ing Usama who is interested in this work.
>
> [1] https://lore.kernel.org/all/20240821074541.516249-3-hanchuanhua@oppo.com/T/#u
>
> Thanks,
> Kanchana
^ permalink raw reply [flat|nested] 23+ messages in thread
* RE: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
2024-08-29 17:10 ` Nhat Pham
@ 2024-08-29 19:38 ` Sridhar, Kanchana P
2024-08-30 4:52 ` Chengming Zhou
0 siblings, 1 reply; 23+ messages in thread
From: Sridhar, Kanchana P @ 2024-08-29 19:38 UTC (permalink / raw)
To: Nhat Pham
Cc: linux-kernel, linux-mm, hannes, yosryahmed, ryan.roberts, Huang,
Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal,
Vinodh, Usama Arif, Chengming Zhou, Sridhar, Kanchana P
Hi Nhat,
On Thu, Aug 29, 2024 at 10:11 AM Nhat Pham <nphamcs@gmail.com> wrote:
> On Wed, Aug 28, 2024 at 5:06 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> > [snip]
>
> Yeah I'm a bit more concerned with the correctness aspect. As long as
> it's not buggy, then we can implement mTHP zswapout first, and force
> individual subpage (z)swapin for now (since we cannot control
> writeback from writing individual subpages).
Absolutely, this sounds like the way to go!
>
> We can discuss strategy to harmonize mTHP, zswap (with writeback) as
> we go along.
Sounds great :)
>
> BTW, I think we're not cc-ing Chengming? Is the get_maintainer.pl script
> not working properly... Let me manually add him in - please include
> him in future submission and responses, as he is also a zswap reviewer
> :)
I think when I ran get_maintainer.pl, I was in v6.10. For sure, will include
Chengming in future submissions and responses :)
>
> Also cc-ing Usama who is interested in this work.
Sounds great.
Thanks,
Kanchana
>
> >
> > [1] https://lore.kernel.org/all/20240821074541.516249-3-hanchuanhua@oppo.com/T/#u
> >
> > Thanks,
> > Kanchana
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
2024-08-28 22:37 ` Yosry Ahmed
2024-08-29 0:20 ` Sridhar, Kanchana P
@ 2024-08-29 23:33 ` Nhat Pham
2024-08-29 23:38 ` Yosry Ahmed
1 sibling, 1 reply; 23+ messages in thread
From: Nhat Pham @ 2024-08-29 23:33 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, ryan.roberts,
ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali,
vinodh.gopal
On Wed, Aug 28, 2024 at 3:38 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Aug 28, 2024 at 2:35 AM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> Are you saying that in the "Before" data we end up skipping zswap
> completely because of using mTHPs?
>
> Does it make more sense to turn CONFIG_THP_SWAP off in the "Before" data
> to force the mTHPs to be split and for the data to be stored in zswap?
> This would be a more fair Before/After comparison where the memory
> goes to zswap in both cases, but "Before" has to be split because of
> zswap's lack of support for mTHP. I assume most setups relying on
> zswap will be turning CONFIG_THP_SWAP off today anyway, but maybe not.
> Nhat, is this something you can share?
I think we're enabling it, but we're a zswap heavy shop + THP
allocation is not suuuper reliable until recently with Johannes'
latest (and upcoming) work, so I don't have much data to share :)
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
2024-08-29 23:33 ` Nhat Pham
@ 2024-08-29 23:38 ` Yosry Ahmed
2024-08-29 23:47 ` Nhat Pham
0 siblings, 1 reply; 23+ messages in thread
From: Yosry Ahmed @ 2024-08-29 23:38 UTC (permalink / raw)
To: Nhat Pham
Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, ryan.roberts,
ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali,
vinodh.gopal
On Thu, Aug 29, 2024 at 4:33 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Wed, Aug 28, 2024 at 3:38 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Wed, Aug 28, 2024 at 2:35 AM Kanchana P Sridhar
> > <kanchana.p.sridhar@intel.com> wrote:
> > Are you saying that in the "Before" data we end up skipping zswap
> > completely because of using mTHPs?
> >
> > Does it make more sense to turn CONFIG_THP_SWAP off in the "Before" data
> > to force the mTHPs to be split and for the data to be stored in zswap?
> > This would be a more fair Before/After comparison where the memory
> > goes to zswap in both cases, but "Before" has to be split because of
> > zswap's lack of support for mTHP. I assume most setups relying on
> > zswap will be turning CONFIG_THP_SWAP off today anyway, but maybe not.
> > Nhat, is this something you can share?
>
> I think we're enabling it, but we're a zswap heavy shop + THP
> allocation is not suuuper reliable until recently with Johannes'
> latest (and upcoming) work, so I don't have much data to share :)
Interesting. If CONFIG_THP_SWAP is enabled this basically means your
zswap utilization keeps going down as your THP utilization goes up. So
the swapin cost would go higher. How do you deal with that?
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
2024-08-29 23:38 ` Yosry Ahmed
@ 2024-08-29 23:47 ` Nhat Pham
2024-08-29 23:55 ` Yosry Ahmed
0 siblings, 1 reply; 23+ messages in thread
From: Nhat Pham @ 2024-08-29 23:47 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, ryan.roberts,
ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali,
vinodh.gopal
On Thu, Aug 29, 2024 at 4:39 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> Interesting. If CONFIG_THP_SWAP is enabled this basically means your
> zswap utilization keeps going down as your THP utilization goes up. So
> the swapin cost would go higher. How do you deal with that?
Johannes definitely knows more than me about this, so please fact
check me. But my understanding is we don't get enough THP for this to
become a problem just yet :)
But yes, we're working hard to make THP become more readily available.
Which will lead to the problem you're describing, hence my enthusiasm
in this work :)
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
2024-08-29 23:47 ` Nhat Pham
@ 2024-08-29 23:55 ` Yosry Ahmed
0 siblings, 0 replies; 23+ messages in thread
From: Yosry Ahmed @ 2024-08-29 23:55 UTC (permalink / raw)
To: Nhat Pham
Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, ryan.roberts,
ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali,
vinodh.gopal, Shakeel Butt
On Thu, Aug 29, 2024 at 4:48 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Thu, Aug 29, 2024 at 4:39 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > Interesting. If CONFIG_THP_SWAP is enabled this basically means your
> > zswap utilization keeps going down as your THP utilization goes up. So
> > the swapin cost would go higher. How do you deal with that?
>
> Johannes definitely knows more than me about this, so please fact
> check me. But my understanding is we don't get enough THP for this to
> become a problem just yet :)
>
> But yes, we're working hard to make THP become more readily available.
> Which will lead to the problem you're describing, hence my enthusiasm
> in this work :)
Adding Shakeel here as well as I am sure he's familiar with the
problem I was talking about.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
2024-08-29 19:38 ` Sridhar, Kanchana P
@ 2024-08-30 4:52 ` Chengming Zhou
2024-09-20 2:34 ` Sridhar, Kanchana P
0 siblings, 1 reply; 23+ messages in thread
From: Chengming Zhou @ 2024-08-30 4:52 UTC (permalink / raw)
To: Sridhar, Kanchana P, Nhat Pham
Cc: linux-kernel, linux-mm, hannes, yosryahmed, ryan.roberts, Huang,
Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal,
Vinodh, Usama Arif
On 2024/8/30 03:38, Sridhar, Kanchana P wrote:
> Hi Nhat,
>
> On Thu, Aug 29, 2024 at 10:11 AM Nhat Pham <nphamcs@gmail.com> wrote:
>> On Wed, Aug 28, 2024 at 5:06 PM Sridhar, Kanchana P
>> <kanchana.p.sridhar@intel.com> wrote:
>>> [snip]
>>
>> Yeah I'm a bit more concerned with the correctness aspect. As long as
>> it's not buggy, then we can implement mTHP zswapout first, and force
>> individual subpage (z)swapin for now (since we cannot control
>> writeback from writing individual subpages).
>
> Absolutely, this sounds like the way to go!
>
>>
>> We can discuss strategy to harmonize mTHP, zswap (with writeback) as
>> we go along.
>
> Sounds great :)
>
>>
>> BTW, I think we're not cc-ing Chengming? Is the get_maintainer.pl script
>> not working properly... Let me manually add him in - please include
>> him in future submission and responses, as he is also a zswap reviewer
>> :)
>
> I think when I ran get_maintainer.pl, I was in v6.10. For sure, will include
> Chengming in future submissions and responses :)
Maybe I'm a little late to the party; I will take a look ASAP.
This is interesting and great work.
Thanks!
>
>>
>> Also cc-ing Usama who is interested in this work.
>
> Sounds great.
>
> Thanks,
> Kanchana
>
>>
>>>
>>> [1] https://lore.kernel.org/all/20240821074541.516249-3-hanchuanhua@oppo.com/T/#u
>>>
>>> Thanks,
>>> Kanchana
^ permalink raw reply [flat|nested] 23+ messages in thread
* RE: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
2024-08-30 4:52 ` Chengming Zhou
@ 2024-09-20 2:34 ` Sridhar, Kanchana P
0 siblings, 0 replies; 23+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-20 2:34 UTC (permalink / raw)
To: Chengming Zhou, Nhat Pham
Cc: linux-kernel, linux-mm, hannes, yosryahmed, ryan.roberts, Huang,
Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal,
Vinodh, Usama Arif, Sridhar, Kanchana P
Hi Chengming,
On Thu, Aug 29, 2024 at 9:52 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> On 2024/8/30 03:38, Sridhar, Kanchana P wrote:
> > Hi Nhat,
> >
> > On Thu, Aug 29, 2024 at 10:11 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >> On Wed, Aug 28, 2024 at 5:06 PM Sridhar, Kanchana P
> >> <kanchana.p.sridhar@intel.com> wrote:
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Nhat Pham <nphamcs@gmail.com>
> >>>> Sent: Wednesday, August 28, 2024 2:35 PM
> >>>> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> >>>> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> >>>> hannes@cmpxchg.org; yosryahmed@google.com;
> >> ryan.roberts@arm.com;
> >>>> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com;
> akpm@linux-
> >>>> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> >>>> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> >>>> Subject: Re: [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios
> >>>>
> >>>> On Wed, Aug 28, 2024 at 2:35 AM Kanchana P Sridhar
> >>>> <kanchana.p.sridhar@intel.com> wrote:
> >>>>>
> >>>>> Hi All,
> >>>>>
> >>>>> This patch-series enables zswap_store() to accept and store mTHP
> >>>>> folios. The most significant contribution in this series is from the
> >>>>> earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has
> been
> >>>>> migrated to v6.11-rc3 in patch 2/4 of this series.
> >>>>>
> >>>>> [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> >>>>> https://lore.kernel.org/linux-mm/20231019110543.3284654-1-
> >>>> ryan.roberts@arm.com/T/#u
> >>>>>
> >>>>> Additionally, there is an attempt to modularize some of the
> functionality
> >>>>> in zswap_store(), to make it more amenable to supporting any-order
> >>>>> mTHPs. For instance, the function zswap_store_entry() stores a
> >>>> zswap_entry
> >>>>> in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
> >>>>> delete all offsets corresponding to a higher order folio stored in zswap.
> >>>>>
> >>>>
> >>>> Will this have any conflict with mTHP swap work? Especially with mTHP
> >>>> swap-in and zswap writeback.
> >>>>
> >>>> My understanding is from zswap's perspective, the large folio is
> >>>> broken apart into independent subpages, correct? What happens when
> >> we
> >>>> have partially written back mTHP (i.e some subpages are in zswap
> >>>> still, whereas others are written back to swap). Would this
> >>>> automatically prevent mTHP swapin?
> >>>
> >>> That is a good point. To begin with, this patch-series would make the
> default
> >>> behavior for mTHP swapout/storage and swapin for ZSWAP to be on par
> >> with
> >>> ZRAM. From zswap's perspective, imo this is a significant step forward
> >> towards
> >>> realizing cold memory storage with mTHP folios. However, it is only a
> >> starting
> >>> point that makes the behavior uniform across zswap/zram. Initially,
> >> workloads
> >>> would see a one-time benefit with reclaim being able to swapout mTHP
> >>> folios without splitting, to zswap. If the mTHPs were cold memory, then
> we
> >>> would have derived latency gains towards memory savings (with zswap).
> >>>
> >>> However, if the mTHP were part of "not so cold" memory, this would
> result
> >>> in a one-way mTHP conversion to 4K folios. Depending on workloads and
> >> their
> >>> access patterns, we could either see individual 4K folios being swapped in,
> >>> or entire chunks if not the entire (original) mTHP needing to be swapped
> in.
> >>>
> >>> It should be noted that this is more of a performance vs. cold memory
> >>> preservation trade-off that needs to drive mTHP reclaim, storage, swapin
> >>> and writeback policy. Different workloads could require different
> >>> policies. However, even though this patch is only a starting point, it
> >>> is still functionally correct by being equivalent to zram-mTHP, and
> >>> compatible with the rest of mm and swap as far as mTHP. Another
> >>> important functionality/data consistency decision I made in this patch
> >>> series is error handling during zswap_store() of mTHP: in case of any
> >>> errors, all swap offsets for the mTHP are deleted from the zswap
> >>> xarray/zpool, since we know that the mTHP will now have to be stored in
> >>> the backing swap device. IOW, an mTHP is either entirely stored in
> >>> zswap, or entirely not stored in zswap.
> >>>
> >>> To answer your question, we would need to come up with what the
> >>> semantics would need to be for zswap zpool storage granularity, swapin
> >>> granularity, readahead granularity and writeback wrt mTHP and how the
> >>> overall swap sub-system needs to "preserve" mTHP vs. splitting mTHP into
> >>> 4K/lower-order folios during swapout. Once we have a good understanding
> >>> of these policies, we could implement them in zswap. Alternately,
> >>> develop an abstraction that is one level above zswap/zram and makes
> >>> things easier and shareable between zswap and zram. By this, I mean
> >>> fundamental assumptions such as consecutive swap offsets (for instance).
> >>> To some extent, this implies that an mTHP as a swap entity is defined by
> >>> consecutiveness of swap offsets. Maybe the policy to keep mTHPs in the
> >>> system over extended duration might be to assemble them dynamically
> >>> based on swapin_readahead() decisions (which is based on workload access
> >>> patterns). In other words, mTHPs could be a useful abstraction that can
> >>> be static or even dynamic based on working set characteristics, and cold
> >>> memory preservation. This is quite a complex topic imho.
> >>>
> >>> As we know, Barry Song and Chuanhua Han have started the discussion on
> >>> this in their zram mTHP swapin series [1].
> >>
> >> Yeah I'm a bit more concerned with the correctness aspect. As long as
> >> it's not buggy, then we can implement mTHP zswapout first, and force
> >> individual subpage (z)swapin for now (since we cannot control
> >> writeback from writing individual subpages).
> >
> > Absolutely, this sounds like the way to go!
> >
> >>
> >> We can discuss strategy to harmonize mTHP, zswap (with writeback) as
> >> we go along.
> >
> > Sounds great :)
> >
> >>
> >> BTW, I think we're not cc-ing Chengming? Is the get_maintainers script
> >> not working properly... Let me manually add him in - please include
> >> him in future submission and responses, as he is also a zswap reviewer
> >> :)
> >
> > I think when I ran get_maintainers.pl, I was in v6.10. For sure, will include
> > Chengming in future submissions and responses :)
>
> Maybe a little late to the party, will take a look ASAP.
> It's interesting and great work.
Thanks! Appreciate your code review and suggestions to improve
the patchset.
Thanks,
Kanchana
>
> Thanks!
>
> >
> >>
> >> Also cc-ing Usama who is interested in this work.
> >
> > Sounds great.
> >
> > Thanks,
> > Kanchana
> >
> >>
> >>>
> >>> [1] https://lore.kernel.org/all/20240821074541.516249-3-hanchuanhua@oppo.com/T/#u
> >>>
> >>> Thanks,
> >>> Kanchana
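
For readers following the quoted discussion, the "either entirely stored
in zswap, or entirely not stored" error handling can be illustrated with
a small userspace model. This is only a sketch under stated assumptions,
not the code from the patch: every name below (toy_zswap,
toy_store_page, toy_delete_offsets, toy_store_folio, toy_count_stored)
is hypothetical, a plain bool array stands in for the zswap xarray, and
the injected failure offset stands in for any per-page store error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS 64	/* hypothetical size of the toy swap-offset space */

/* Toy stand-in for the zswap xarray: present[i] means offset i is stored. */
struct toy_zswap {
	bool present[TOY_SLOTS];
	size_t fail_at;	/* offset whose store fails; TOY_SLOTS means "never" */
};

/* Store one subpage; fails only at the injected offset. */
static bool toy_store_page(struct toy_zswap *z, size_t offset)
{
	if (offset == z->fail_at)
		return false;
	z->present[offset] = true;
	return true;
}

/* Drop every offset in [start, start + nr), stored or not. */
static void toy_delete_offsets(struct toy_zswap *z, size_t start, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		z->present[start + i] = false;
}

/*
 * All-or-nothing store of a folio covering nr_pages consecutive offsets.
 * On any per-page failure, every offset of the folio is removed, so the
 * caller can fall back to the backing swap device for the whole folio.
 */
bool toy_store_folio(struct toy_zswap *z, size_t start, size_t nr_pages)
{
	assert(start + nr_pages <= TOY_SLOTS);

	for (size_t i = 0; i < nr_pages; i++) {
		if (!toy_store_page(z, start + i)) {
			toy_delete_offsets(z, start, nr_pages);
			return false;
		}
	}
	return true;
}

/* Count how many offsets in [start, start + nr) are currently stored. */
size_t toy_count_stored(const struct toy_zswap *z, size_t start, size_t nr)
{
	size_t count = 0;

	for (size_t i = 0; i < nr; i++)
		count += z->present[start + i] ? 1 : 0;
	return count;
}
```

In this model, a 16-page folio whose fifth subpage fails to store leaves
no offsets behind for that folio, while folios stored earlier are
untouched. The model always deletes the full offset range on failure
rather than only the pages stored so far, which keeps the cleanup
independent of where the failure occurred.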
end of thread (newest message: ~2024-09-20 2:35 UTC)
Thread overview: 23+ messages
2024-08-28 9:35 [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
2024-08-28 9:35 ` [PATCH v5 1/3] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined Kanchana P Sridhar
2024-08-28 9:35 ` [PATCH v5 2/3] mm: zswap: zswap_store() extended to handle mTHP folios Kanchana P Sridhar
2024-08-28 9:35 ` [PATCH v5 3/3] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout stats Kanchana P Sridhar
2024-08-28 15:55 ` [PATCH v5 0/3] mm: ZSWAP swap-out of mTHP folios Nhat Pham
2024-08-28 17:23 ` Nhat Pham
2024-08-28 19:30 ` Sridhar, Kanchana P
2024-08-28 19:24 ` Sridhar, Kanchana P
2024-08-28 21:35 ` Nhat Pham
2024-08-29 0:06 ` Sridhar, Kanchana P
2024-08-29 17:10 ` Nhat Pham
2024-08-29 19:38 ` Sridhar, Kanchana P
2024-08-30 4:52 ` Chengming Zhou
2024-09-20 2:34 ` Sridhar, Kanchana P
2024-08-29 3:59 ` Sridhar, Kanchana P
2024-08-28 22:37 ` Yosry Ahmed
2024-08-29 0:20 ` Sridhar, Kanchana P
2024-08-29 1:01 ` Yosry Ahmed
2024-08-29 3:10 ` Sridhar, Kanchana P
2024-08-29 23:33 ` Nhat Pham
2024-08-29 23:38 ` Yosry Ahmed
2024-08-29 23:47 ` Nhat Pham
2024-08-29 23:55 ` Yosry Ahmed